[PaaS] MINHASH field in gz solr files is not being updated correctly
We have detected some .gz files in the Solr contentstore that contain a massive number of hashes in the MINHASH field. We suspect the field is not being updated correctly and is growing without control.
Querying one of the affected nodes for the whole list of fields produces about 76 MB of output, of which 99.9% is just hashes in the MINHASH field.
We don't have steps to reproduce, but after speaking with some engineers from the SS team we are most likely hitting SEARCH-2065, which hasn't been fixed yet for the 1.4.x branch.
One more question: once we are running the SS version that includes this fix, what is the procedure to sort out affected nodes? Will the fix detect problematic nodes and repair them automatically? If not, should we detect those nodes ourselves, delete the .gz files and reindex them?
Which releases are going to have this fix?
By default, we fix service pack requests on master and merge them back to all supported branches, ready to be picked up in upcoming service pack releases. As to which releases are being planned, you would need to speak with the PM. Thanks.
I am able to reproduce the issue by updating the content of any archive file (tested with .zip and .odt).
First upload of the document: 512 items in the MINHASH field, and each time I update the content of the document I see a 512-item increment.
When I open the file in LibreOffice via WebDAV, every time I hit Save, even without any changes to the document, it triggers indexing and I get another increment of tokens.
The increment is always 512 in my environment, regardless of how many files are in my archive or how big they are.
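The constant 512-per-update growth observed above is consistent with the indexer appending a fresh fixed-size MinHash signature to the multivalued MINHASH field on every save instead of replacing the previous one. The following is a minimal sketch of that suspected behavior, not Alfresco/Solr internals; `compute_minhash`, `buggy_update` and the `doc` dict are hypothetical stand-ins.

```python
import hashlib

SIGNATURE_SIZE = 512  # matches the per-update increment observed above


def compute_minhash(content: bytes, size: int = SIGNATURE_SIZE) -> list:
    # Stand-in for a real MinHash signature: derive `size` hex tokens
    # deterministically from the content.
    return [hashlib.md5(content + i.to_bytes(4, "big")).hexdigest()
            for i in range(size)]


def buggy_update(doc: dict, content: bytes) -> None:
    # Suspected bug: the new signature is appended to the existing
    # multivalued field, so the field grows by 512 values per save.
    doc.setdefault("MINHASH", []).extend(compute_minhash(content))


def fixed_update(doc: dict, content: bytes) -> None:
    # Expected behavior: the signature is recomputed and replaces the
    # old values, so the field stays at a constant 512 values.
    doc["MINHASH"] = compute_minhash(content)


doc = {}
for _ in range(3):  # three saves of the same document, no content change
    buggy_update(doc, b"same content")
print(len(doc["MINHASH"]))  # 1536: grows by 512 per save instead of staying at 512
```

Under this model, re-saving the unchanged document via WebDAV would still grow the field, which matches the LibreOffice observation above.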