Metadata extraction fails for certain documents when using legacy transformations

Description

When only legacy transformations are enabled in ACS 6.2.2, metadata extraction fails for certain documents because the bundled Tika library version is incompatible with the bundled POI library version.

ACS 6.2 uses version 4.1.1 of the POI library and version 1.21 of Tika. However, Tika appears to have moved to POI 4.1.1 only in release 1.23, so the bundled Tika version appears to be incompatible with the bundled POI version (see https://issues.apache.org/jira/browse/TIKA-2851).
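The version pairing described above can be checked mechanically against the jar names in WEB-INF\lib. The following is a minimal sketch, not part of ACS: the helper names are hypothetical, the 1.23 threshold is simply the claim made in this report, and the jar naming scheme (poi-<version>.jar, tika-parsers-<version>.jar) is an assumption based on the usual Maven artifact names.

```python
import re

# Values taken from the report above (assumptions, not verified against Tika release notes):
# ACS 6.2 bundles POI 4.1.1, and Tika is said to target POI 4.1.1 only from 1.23.
POI_VERSION = (4, 1, 1)
MIN_TIKA = (1, 23)

# Matches names such as "tika-parsers-1.21.jar" or "poi-4.1.1.jar".
_JAR_RE = re.compile(r"^(?P<name>[A-Za-z][\w.-]*?)-(?P<version>\d+(?:\.\d+)*)\.jar$")


def parse_jar(jar_name):
    """Split a jar file name into (artifact, version tuple); return None if it does not match."""
    match = _JAR_RE.match(jar_name)
    if match is None:
        return None
    version = tuple(int(part) for part in match.group("version").split("."))
    return match.group("name"), version


def is_suspect_combination(jar_names):
    """Return True when POI 4.1.1 sits next to a tika-parsers jar older than 1.23."""
    versions = {}
    for jar_name in jar_names:
        parsed = parse_jar(jar_name)
        if parsed is not None:
            name, version = parsed
            versions[name] = version
    tika = versions.get("tika-parsers")
    poi = versions.get("poi")
    if tika is None or poi is None:
        return False  # not enough information to decide
    return poi == POI_VERSION and tika < MIN_TIKA
```

To run it against a real installation, pass the file names of the lib directory, for example `is_suspect_combination(p.name for p in pathlib.Path(lib_dir).glob("*.jar"))`, where lib_dir is the WEB-INF\lib path of the deployment.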

When the bundled tika-parsers-1.21.jar in web-server\webapps\alfresco\WEB-INF\lib is replaced with version 1.24 before ACS is started, metadata extraction succeeds without errors.
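The jar swap used as a workaround can be scripted so that the old jar is kept as a backup outside the lib directory, which avoids having two Tika versions on the classpath. This is only a sketch under assumptions: the function name and the backup location are hypothetical, and it should be run while ACS is stopped.

```python
import shutil
from pathlib import Path


def replace_jar(lib_dir, old_name, new_jar, backup_dir):
    """Move lib_dir/old_name into backup_dir, then copy new_jar into lib_dir.

    Returns the path of the newly installed jar.
    """
    lib_dir, new_jar, backup_dir = Path(lib_dir), Path(new_jar), Path(backup_dir)
    old_jar = lib_dir / old_name

    # Fail before touching anything if either jar is missing.
    if not old_jar.is_file():
        raise FileNotFoundError(old_jar)
    if not new_jar.is_file():
        raise FileNotFoundError(new_jar)

    # Keep the original jar so the change can be reverted.
    backup_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(old_jar), str(backup_dir / old_name))

    # Install the replacement under its own file name.
    return Path(shutil.copy2(new_jar, lib_dir / new_jar.name))
```

For the case in this report, old_name would be tika-parsers-1.21.jar and new_jar the path to a downloaded tika-parsers-1.24.jar.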

With ATS deployed and enabled, metadata extraction succeeds because ATS uses Tika 1.24.1 from version 1.3.1 onward.

Steps to reproduce

  1. Configure ACS 6.2.2 so that only legacy transformations are enabled, i.e. set the following in alfresco-global.properties:

  2. Enable debug logging for the following class:
    log4j.logger.org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter=debug

  3. Start ACS

  4. Open Share and upload the attached file "testDocument.docx"

Expected Behaviour
The document is uploaded and metadata is extracted with no errors appearing in the logs. Metadata properties like the document author are populated on the node.
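To confirm independently of ACS that a test document actually carries an author, the core properties of the .docx file can be read directly. The sketch below assumes the conventional package layout in which the core properties part is stored at docProps/core.xml and the author is held in the Dublin Core creator element; the function name is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace used for the creator element in OOXML core properties.
DC_NS = "http://purl.org/dc/elements/1.1/"


def read_docx_author(path_or_file):
    """Return the dc:creator value of a .docx file, or None if it is absent."""
    with zipfile.ZipFile(path_or_file) as package:
        try:
            core_xml = package.read("docProps/core.xml")
        except KeyError:
            return None  # the package has no core properties part at this location
    root = ET.fromstring(core_xml)
    creator = root.find(f"{{{DC_NS}}}creator")
    return creator.text if creator is not None else None
```

If this returns a name for the uploaded file while the author property on the node stays empty, the problem lies in the extraction step rather than in the document.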

Observed Behaviour
The document is uploaded and a preview is generated, but metadata extraction fails with the below error in the Alfresco log. Metadata properties like the document author are not populated.

Environment

None

Testcase ID

None

Activity

Alexandru Epure
November 19, 2020, 3:50 PM

Sorry for the late response; it took longer than expected to create a DEV environment with legacy transformers.

The issue is not reproducible on master (7.0) or 6.2.N; I later discovered that Tika had already been upgraded for those branches.

I have created a PR for the fix →

I could not cherry-pick the original changes because the projects have been refactored, but the same changes are included in the above PR.

I'll proceed with merging, re-check the fix, and then release.

Elisabeth Wetchy
November 16, 2020, 2:13 PM

Moved a second attachment to ftp://ftp.alfresco.com/support/Jira_Related/MNT-22055/alfresco-tika-1.24.log as the log contained some customer-identifying information.

Scott Ashcraft
November 16, 2020, 1:49 PM

Attachment with customer-identifying information moved to ftp://ftp.alfresco.com/support/Jira_Related/MNT-22055/testDocument.docx

Fixed

Assignee: Unassigned
Reporter: Elisabeth Wetchy
Labels: None
Security Issue: None
Escalated By: None
Hot Fix Version:
ACT Numbers: 01018420
Regression Since: None
Premier Customer: Yes
Work Funnel End: None
Patch Attached: None
Dependent Version/s: None
Prioritization Score: None
Delivery Team: Customer Excellence
Bug Priority: Category 2
Fix versions:
Affects versions: