Performance bottleneck in Disposition Lifecycle job caused by large database transaction
DESCRIPTION
The disposition lifecycle job processes record actions in batches of 1000 in ACS, but the underlying DB actions are all executed in a single transaction. This causes scalability problems when a single execution of the job has to process a large number of records. The single transaction holds an ever-increasing number of exclusive locks and steadily consumes DB resources, which leads to an overall slowdown and reduces concurrency for other DB processes that may be blocked by any of the locks being held.
DEBUG analysis from a recent customer example.
This job run processed 125k records
The log extract covers the first 22k records being processed
Started at 04:10, last entry at 04:57
Timestamps show the job processing the first few thousand records at approximately 18 records per second
By the time it is past 20k records, processing is down to 4 records per second
Logging from another job run showed records taking more than 1 second each to process
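As a rough cross-check of the figures above, the average rate over the extract and the time left for the rest of the run can be worked out from the timestamps. This is only a sketch: the class and method names are made up for illustration, the record counts are the approximate ones quoted above, and the estimate assumes the rate stays at 4 records per second, which the slowdown suggests is optimistic.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper, not part of ACS or RM: back-of-the-envelope throughput maths.
class ThroughputEstimate {

    // Average records per second between two log timestamps on the same day.
    static double averageRate(long records, LocalTime start, LocalTime end) {
        return (double) records / Duration.between(start, end).getSeconds();
    }

    // Hours needed to process the remaining records at a constant rate.
    static double hoursRemaining(long remaining, double ratePerSecond) {
        return remaining / ratePerSecond / 3600.0;
    }

    public static void main(String[] args) {
        // ~22k records between 04:10 and 04:57 (47 minutes)
        double avg = averageRate(22_000, LocalTime.of(4, 10), LocalTime.of(4, 57));
        System.out.println("average records/s over the extract: " + avg);

        // ~103k records still to go, assuming the rate holds at 4 records/s
        System.out.println("hours remaining at 4 records/s: "
                + hoursRemaining(125_000 - 22_000, 4.0));
    }
}
```

That works out to an average of roughly 7.8 records per second over the extract, and a little over 7 hours for the remaining ~103k records even if the rate did not degrade any further.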
REPRODUCTION
Disable the scheduled execution of the scheduledDispositionLifecyceleJobDetail cron job
Create a new file plan
Create a disposition/retention schedule that cuts off immediately
Create a large number of records in the plan (5k-10k)
Enable debug logging in the admin console for org.alfresco.module.org_alfresco_module_rm.job.DispositionLifecycleJobExecuter
Trigger the scheduled job manually from the admin console or JMX
The DB will show a single open transaction for the duration of the job run. Checking the locks held by this transaction will show their number steadily increasing
EXPECTED
The disposition job should process nodes in smaller, more frequent database transactions
OBSERVED
The combination of a single large transaction and a large number of records to process causes a steady decline in performance on the database and ACS: memory, heap and lock usage keep growing, with no commit in the database to free up resources and allow GC to clean up the ACS heap
ANALYSIS
Before RM-1413, the job was processing each record in an individual transaction.
https://github.com/Alfresco/governance-services/commit/966d3ae94f74aaa0917c5b05f4aa223e0aa1b5bc
Following RM-1413, it changed to a single transaction for all records
REQUEST
Implement database transaction batching with a configurable transaction batch size. The configured value must not be allowed to go below 1000, the disposition batch processing size
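A minimal sketch of what this could look like, for discussion only. All names here (TxBatcher, MIN_TX_BATCH_SIZE, effectiveTxBatchSize, processInChunks) are hypothetical and not existing RM code, and the transaction itself is stood in for by a plain callback; in the real job each chunk would presumably be run in its own transaction via the repository's transaction helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration of chunked processing, not actual RM code.
class TxBatcher {

    // Lower bound from the request: the disposition batch processing size.
    static final int MIN_TX_BATCH_SIZE = 1000;

    // Clamp the configured value so it can never drop below the minimum.
    static int effectiveTxBatchSize(int configured) {
        return Math.max(configured, MIN_TX_BATCH_SIZE);
    }

    // Split the records into consecutive chunks and pass each chunk to its own
    // unit of work (standing in for one DB transaction). Returns the chunk count.
    static <T> int processInChunks(List<T> records, int configuredSize,
                                   Consumer<List<T>> txWork) {
        int size = effectiveTxBatchSize(configuredSize);
        int chunks = 0;
        for (int from = 0; from < records.size(); from += size) {
            int to = Math.min(from + size, records.size());
            // Copy the slice so the callback works on an independent list.
            txWork.accept(new ArrayList<>(records.subList(from, to)));
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            records.add(i);
        }
        // A configured size of 500 is below the minimum, so 1000 is used instead.
        int chunks = processInChunks(records, 500,
                chunk -> System.out.println("commit after " + chunk.size() + " records"));
        System.out.println(chunks + " transactions");
    }
}
```

With 2500 records and a configured size of 500, the size is clamped to 1000 and the work is split into three units of 1000, 1000 and 500 records, so locks and heap would be released at each commit rather than at the end of the run.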
Activity
Thank you!
Quick Update:
I've just picked up this issue. I'm currently setting up the dev environment; tomorrow I'll reproduce the issue and start debugging to get a better understanding of the parts involved.