[PaaS] MINHASH field in gz solr files is not being updated correctly

Description

We have detected some .gz files in the Solr contentstore that hold a very large number of hashes in the MINHASH field. We think the field is not being updated correctly and is growing without bound.

Evidence:

When querying one of the affected nodes for its full list of fields, the output is about 76 MB of data, 99.9% of which is hashes in the MINHASH field.

We don't have steps to reproduce, but after speaking with some engineers from the SS team we are most likely hitting SEARCH-2065, which has not yet been fixed on the 1.4.X branch.

Environment

None

Testcase ID

None

Activity

Francisco Olcina Grande
February 10, 2021, 9:45 AM

Thanks!

One more question: once we are running the SS version that includes this fix, what is the procedure for sorting out the nodes that are already affected? Will the fix detect problematic nodes and repair them automatically? If not, should we detect those nodes, delete their .gz files, and reindex them?
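If manual detection turns out to be necessary, one rough way to shortlist candidates is by file size, since an affected node produced about 76 MB of field data. The sketch below is only a heuristic under stated assumptions: `find_large_gz`, the root path, and the threshold are hypothetical names and values, it does not read the MINHASH field itself, and a large .gz file is not proof that the node is affected.

```python
import os


def find_large_gz(root, threshold_bytes):
    """Return (path, size) pairs for .gz files above threshold_bytes, largest first.

    Heuristic only: size on disk is used as a proxy for an oversized MINHASH
    field. `root` and `threshold_bytes` are assumptions to tune per environment.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".gz"):
                path = os.path.join(dirpath, name)
                size = os.path.getsize(path)
                if size > threshold_bytes:
                    hits.append((path, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

For example, `find_large_gz("/path/to/contentstore", 1024 * 1024)` would list files over 1 MiB; both arguments are placeholders, not values taken from this ticket.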

Indy Sandhu
February 5, 2021, 11:43 AM

By default, we fix service pack requests on master and merge them back to all supported branches, ready to be picked up in upcoming service pack releases. As for which releases are being planned, you would need to speak with the PM. Thanks.

Francisco Olcina Grande
February 4, 2021, 4:27 PM

Which releases are going to have this fix?

Eva Vasquez
January 20, 2021, 3:15 PM

I am able to reproduce the issue by updating the content of any archive file (tested with .zip and .odt).

First upload of a document: 512 items in the MINHASH field. Each time I update the content of the document, I see an increment of 512 items.

When I open the file through WebDAV in LibreOffice, every time I hit save, even without any changes to the document, it triggers indexing and the number of tokens increases.

The increment is always 512 in my environment, regardless of how many files are in the archive or how big they are.
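The growth pattern described above can be illustrated with a toy model. This is not Search Services code: `fake_minhash`, `reindex_appending`, `reindex_replacing`, and `simulate` are hypothetical names, and the assumption is simply that each indexing pass produces 512 values which are either appended to the stored field (the observed behaviour) or replace it (the expected behaviour).

```python
# Toy model of the reported behaviour; not Search Services code.
HASHES_PER_INDEX = 512  # increment observed in the reproduction above


def fake_minhash(content, version):
    """Hypothetical stand-in that yields 512 placeholder values per indexing pass."""
    return [hash((content, version, i)) & 0xFFFFFFFF for i in range(HASHES_PER_INDEX)]


def reindex_appending(field, content, version):
    """Observed pattern: new hashes are added to what is already stored."""
    return field + fake_minhash(content, version)


def reindex_replacing(field, content, version):
    """Expected pattern: the stored value is replaced on every update."""
    return fake_minhash(content, version)


def simulate(updates, reindex):
    """Return the field length after the first upload plus `updates` saves."""
    field = []
    for version in range(updates + 1):
        field = reindex(field, "doc", version)
    return len(field)
```

Under this model, `simulate(3, reindex_appending)` gives 2048 items (512 for the upload plus 512 per save), while `simulate(3, reindex_replacing)` stays at 512. In the appending case the field size is 512 × (n + 1) after n saves, which is consistent with the unbounded growth reported in the description.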

Fixed

Assignee

Tiago Salvado

Reporter

Francisco Olcina Grande

Labels

Escalated By

None

Security Issue

None

ACT Numbers

PaaS

Premier Customer

None

Code Branch

None

Build Location

None

Regression Since

None

Work Funnel End

None

Patch Attached

None

Dependent Version/s

None

Cloud or Enterprise

None

Prioritization Score

None

Delivery Team

Customer Excellence

Bug Priority

Category 2

Story Points

5

Components

Affects versions