Query performance problem with 100% CPU when SQL Server automatically changes execution plan from a good one to a bad one
SQL performance of concurrent upload degrades throughout the day as customer continues to ingest data. CPU stayed at 100% with disk IO and waits spiked when this happened. We scanned queries on the SQL Server and identified the following query running with a bad execution plan:
The query is believed to be used by audit trail and attribute services. Therefore we believe you will need to enable audit to reproduce this. Customer has audit turned on and is capturing audit trails for uploads, so we haven't tried otherwise but will do so and let you know in the comments section below.
We could reproduce this within 10-20 minutes of running 1200 concurrent upload using Alfresco’s Benchmark Framework in the customer’s test environment and was able to see the same SQL query causing slowness appearing on the top of other expensive queries. Solr was turned off throughout the entire exercise so we can rule it out. We have also reduced the amount of RAM assigned to SQL Server (via Management Studio) from 16G to 8G to reproduce this quicker.
This was the query we used to list out the expensive queries in SQL Server’s Management Studio:
As per our investigation we found that if data ingestion continues, after some time, SQL Server will suddenly switch from a good execution plan to a bad execution plan and slowness begins.
We noticed if we update statistics in SQL Server we could bring the CPU back down to normal because SQL Server would then switch back to use the good query execution plan again. However, if the load continues, SQL Server will again switch to use the bad execution plan and the issue will occur again. The customer is wondering whether Alfresco Engineering has noticed this in its benchmark lab and whether there is a permanent fix to it than having to update the statistics every time this happens.
When performance is good, this is the execution plan used by the SQL Server on the query (good plan). The optimizer has chosen to scan for the clustered index on the alf_prop_string_value table FIRST (read the chat from top down).
When performance is slow and degraded, below is the execution plan used on the query (bad plan) as it scans the alf_prop_value table FIRST. We believe this is the reason that had caused SQL Server to do slower scans instead of high-performing index seeks.
We have also noticed that the indexes on alf_prop_string_value table become 99% fragmented a lot quicker than those in the alf_prop_value table as ingestion continues.
The customer would like to ask Alfresco Engineering why SQL Server would switch execution plans from good to bad all of a sudden by itself and investigate whether there is a way (say by adding additional indexes or append an optimization option hint to the back of the SQL statement, etc) to prevent SQL Server from automatically switching from good to bad execution plan. The customer has an environment and is happy to demo it to Alfresco Engineering if needed.
SQL Server 2016
All - DOCS ticket created and linked to this ticket. Can we close this ticket?
This problem can be solved by turning on Automatic Plan Correction in MS SQL Server. But this feature is only available for MS SQL Server Enterprise edition, but not in the standard edition.
We will definitely not be able to solve this in the ACS7 timeframe.
Even after ACS7, it is not guaranteed that we will be able to restructure the query.