Performance bottleneck in Disposition Lifecycle job caused by large database transaction

Description

The disposition lifecycle job processes record actions in batches of 1000 in ACS, but the underlying DB actions are all processed in a single transaction. This causes scalability problems when a single execution of the job has to process a large number of records. The single transaction holds an increasing number of exclusive locks and steadily consumes DB resources, which leads to an overall slowdown and reduces concurrency for other DB processes that may be blocked by any of the locks being held.

Analysis of DEBUG logging from a recent customer example:

  • This job run processed 125k records

  • The log extract covers the first 22k records being processed

  • The extract starts at 04:10; the last entry is at 04:57

  • Timestamps show the job processing the first few thousand records at approximately 18 records per second

  • By the time it is past 20k records, processing is down to 4 records per second

Logging from another job run showed processing slowing to more than 1 second per record.
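To put the figures above in perspective, the following is a small back-of-the-envelope sketch. The class and method names are hypothetical, and it treats the approximate numbers quoted above (22k records between 04:10 and 04:57, 125k records in total) as exact. On those assumptions the extract averages about 7.8 records per second, and the full run would take roughly 1.9 hours at a sustained 18 records per second versus roughly 8.7 hours at 4 records per second.

```java
/** Hypothetical helper for rough throughput arithmetic; not part of ACS or AGS. */
class ThroughputEstimate {

    /** Average processing rate in records per second over an elapsed interval. */
    static double averageRate(long records, long elapsedSeconds) {
        return (double) records / elapsedSeconds;
    }

    /** Hours needed to process the given number of records at a constant rate. */
    static double hoursToProcess(long records, double recordsPerSecond) {
        return records / recordsPerSecond / 3600.0;
    }

    public static void main(String[] args) {
        // 04:10 to 04:57 is 47 minutes, i.e. 2820 seconds (figures from the log extract).
        long extractSeconds = 47 * 60;
        System.out.printf("Average rate over extract: %.1f records/s%n",
                averageRate(22_000, extractSeconds));
        // Projected duration of the full 125k-record run at the two observed rates.
        System.out.printf("125k records at 18/s: %.1f h%n", hoursToProcess(125_000, 18));
        System.out.printf("125k records at 4/s:  %.1f h%n", hoursToProcess(125_000, 4));
    }
}
```

Because the rate keeps falling as the transaction grows, the real duration is likely to be worse than the 4 records per second projection suggests.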

REPRODUCTION

  1. Disable the scheduled execution for the scheduledDispositionLifecyceleJobDetail cron

  2. Create a new file plan

  3. Create a disposition/retention schedule to cut off immediately

  4. Create a large number of records in the plan (5k-10k)

  5. Enable debug logging in the admin console for org.alfresco.module.org_alfresco_module_rm.job.DispositionLifecycleJobExecuter

  6. Trigger the scheduled job manually from the admin console or JMX

The DB will show a single open transaction for the duration of the disposition job run. Checking the locks held by this transaction will show their number steadily increasing.

EXPECTED
The disposition job should process nodes in smaller, more regular database transactions.

OBSERVED
With a single large transaction and a lot of records to process, performance of both the database and ACS declines steadily. Memory, heap, and lock usage keep increasing because there is no commit in the database to free up resources and allow GC to clean up the ACS heap.

ANALYSIS
Before RM-1413, the job was processing each record in an individual transaction.
https://github.com/Alfresco/governance-services/commit/966d3ae94f74aaa0917c5b05f4aa223e0aa1b5bc

Following RM-1413, the job changed to using a single transaction for all records.

REQUEST
Implement database transaction batching with a configurable batch size. The configured size must not be allowed to go below 1000, the disposition batch processing size.
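The request could be sketched roughly as follows. This is a minimal illustration, not the AGS implementation: the class, the method names, and the `inNewTransaction` callback are hypothetical stand-ins, and in ACS the per-chunk transaction would presumably go through the repository's own transaction handling (for example `RetryingTransactionHelper`). The sketch clamps the configured size to the 1000-record floor and then commits once per chunk instead of once per job run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch of configurable transaction batching; not the AGS code. */
class TransactionBatchingSketch {

    /** Disposition batch processing size stated in this issue. */
    static final int DISPOSITION_BATCH_SIZE = 1000;

    /** Clamps the configured size so it is never below the disposition batch size. */
    static int effectiveBatchSize(int configuredSize) {
        return Math.max(configuredSize, DISPOSITION_BATCH_SIZE);
    }

    /** Splits the items into consecutive chunks of at most {@code size} elements. */
    static <T> List<List<T>> partition(List<T> items, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += size) {
            int end = Math.min(start + size, items.size());
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }

    /**
     * Hands each chunk to {@code inNewTransaction}, which is assumed to begin a
     * transaction, process the chunk, and commit. Returns the number of chunks,
     * i.e. the number of transactions used.
     */
    static <T> int processInBatches(List<T> records, int configuredSize,
                                    Consumer<List<T>> inNewTransaction) {
        List<List<T>> chunks = partition(records, effectiveBatchSize(configuredSize));
        for (List<T> chunk : chunks) {
            inNewTransaction.accept(chunk);
        }
        return chunks.size();
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 125_000; i++) {
            records.add(i);
        }
        // The lambda stands in for "begin transaction, run disposition actions, commit".
        int transactions = processInBatches(records, 5000, chunk -> { });
        System.out.println("Transactions used: " + transactions);
    }
}
```

With this shape, locks and heap are released at every commit, so resource usage is bounded by the chunk size rather than by the total number of records in the run.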

Environment

None

Testcase ID

None

Activity

Cassandra Panayiotou
4 days ago

Thank you and !!!

Alexandru Epure
April 1, 2021, 2:51 PM

Quick Update:

I've just picked up this issue and am currently setting up the dev environment. Tomorrow I'll proceed with reproducing the issue and start debugging to get a better understanding of the parts involved.

Assignee

Alexandru Epure

Reporter

Mark Tunmer

Labels

None

Security Issue

None

Escalated By

CSM

Hot Fix Version

AGS 3.2.0

ACT Numbers

00335009, 00371136

Build Location

None

Regression Since

None

Premier Customer

None

Work Funnel End

None

Patch Attached

None

Dependent Version/s

None

Prioritization Score

None

Delivery Team

Customer Excellence

Bug Priority

Category 2
