CLONE - Performance bottleneck in Disposition Lifecycle job caused by large database transaction

Description

DESCRIPTION

The disposition lifecycle job processes record actions in batches of 1000 in ACS, but the underlying DB actions are processed in a single transaction. This causes scalability problems when a single execution of the job has to process a large number of records. The single transaction holds an increasing number of exclusive locks, which steadily consumes DB resources, slows the database overall, and reduces concurrency for other DB processes that may be blocked by any of the held locks.

DEBUG analysis from a recent customer example.

  • This job run processed 125k records

  • The log extract covers the first 22k records being processed

  • Started at 04:10, last entry at 04:57

  • Timestamps show the job processing the first few thousand at approximately 18 records per second

  • By the time it's over 20k, processing is down to 4 per second

Logging from another job run showed processing slowing to more than 1 second per record.
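
To put the figures above in perspective, the arithmetic can be sketched as follows. The class and method names are hypothetical and exist only for this illustration. The inputs are the numbers quoted in the log analysis (22k records between 04:10 and 04:57, 125k records in total, 18 and 4 records per second), and the projections assume a constant rate, which the logs suggest is optimistic.

```java
// Illustrative arithmetic only; names are hypothetical, inputs come from the log analysis.
final class ThroughputEstimate {

    /** Average records per second over an elapsed window given in minutes. */
    static double averageRate(long records, long elapsedMinutes) {
        return records / (elapsedMinutes * 60.0);
    }

    /** Hours needed to process the given number of records at a constant rate. */
    static double hoursAtRate(long records, double recordsPerSecond) {
        return records / recordsPerSecond / 3600.0;
    }

    public static void main(String[] args) {
        // 04:10 to 04:57 is 47 minutes for the first 22k records.
        System.out.printf("average rate over the extract: %.1f records/s%n",
                averageRate(22_000, 47));
        // Projected duration of the full 125k run at the initial and the degraded rate.
        System.out.printf("125k records at 18/s: %.1f h%n", hoursAtRate(125_000, 18));
        System.out.printf("125k records at 4/s:  %.1f h%n", hoursAtRate(125_000, 4));
    }
}
```

The extract averages roughly 7.8 records per second, and the full run would take about 1.9 hours at the initial rate versus about 8.7 hours at the degraded rate, before allowing for any further slowdown.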

REPRODUCTION

  1. Disable the scheduled execution for the scheduledDispositionLifecyceleJobDetail cron

  2. Create a new file plan

  3. Create a disposition/retention schedule to cut off immediately

  4. Create a large number of records in the plan (5k-10k)

  5. Enable the following debug logging in the admin console: org.alfresco.module.org_alfresco_module_rm.job.DispositionLifecycleJobExecuter

  6. Trigger the scheduled job manually from the admin console or JMX

The DB will show a single open transaction for the duration of the disposition job run. Checking the locks held by this transaction will show their number steadily increasing.

EXPECTED
The disposition job should process nodes in smaller, more frequent database transactions.

OBSERVED
With a single large transaction and many records to process, performance on the database and ACS declines steadily: memory, heap, and lock usage keep growing, with no commit in the database to free up resources and allow GC to clean up the ACS heap.

ANALYSIS
Before RM-1413, the job was processing each record in an individual transaction.
https://github.com/Alfresco/governance-services/commit/966d3ae94f74aaa0917c5b05f4aa223e0aa1b5bc

Following RM-1413, it changed to a single transaction for all records.

REQUEST
Implement database transaction batching with a configurable batch size that cannot be set below 1000, the disposition batch processing size.
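
A minimal sketch of what the request describes, under stated assumptions: every name below (BatchedDisposition, effectiveBatchSize, partition, processInBatches) is hypothetical and not taken from the AGS code base, and the Consumer passed in stands in for "run this chunk in its own transaction and commit". How the merged fix actually does this is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; none of these names come from the AGS code base.
final class BatchedDisposition {

    // Floor taken from the request: the transaction batch size may not be
    // smaller than the disposition batch processing size of 1000.
    static final int MIN_TX_BATCH_SIZE = 1000;

    /** Clamps a configured value so it never drops below the floor. */
    static int effectiveBatchSize(int configured) {
        return Math.max(configured, MIN_TX_BATCH_SIZE);
    }

    /** Splits the items into consecutive chunks of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /**
     * Hands each chunk to txWork, the stand-in for one transaction per chunk.
     * Returns the number of chunks processed, i.e. the number of commits.
     */
    static <T> int processInBatches(List<T> items, int configuredSize,
                                    Consumer<List<T>> txWork) {
        int commits = 0;
        for (List<T> batch : partition(items, effectiveBatchSize(configuredSize))) {
            txWork.accept(batch); // locks held for this chunk are released on commit
            commits++;
        }
        return commits;
    }
}
```

With this shape, a run of 125k records at the minimum batch size would commit 125 times instead of once, so the number of locks held at any moment is bounded by the batch size and not by the total record count.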

Environment

None

Testcase ID

None

Activity

Cassandra Panayiotou
April 14, 2021, 3:11 PM

Thank you, will do. It looks like it’s an issue for PaaS too, but when I challenged further, it’s not as urgent, so it could potentially be slipped into a Service Pack. I’ll address this with the Product team and let you know the outcome. Thank you!

Alexandru Epure
April 13, 2021, 2:43 PM

Fix merged in PR #1409.

AGS 3.3.0.5 has been released containing this fix, the artifact can be found on :

Michael Wallach
April 9, 2021, 2:57 PM

Per the case, the version Cassie needs the HF for is 3.2.0.11. I will verify the AGS version with the customer.

Michael Wallach
April 9, 2021, 2:50 PM

Let me verify which version the HF is needed on. The HF is primarily for one of the two customers' cases.

Fixed

Assignee

Alexandru Epure

Reporter

Mark Tunmer

Escalated By

CSM

Hot Fix Version

AGS 3.3.0.5

ACT Numbers

00335009

Delivery Team

Customer Excellence

Bug Priority

Category 2