Metadata extraction fails for certain documents when using legacy transformations
When only legacy transformations are enabled in ACS 6.2.2, metadata extraction fails for certain documents because the bundled Tika library version is incompatible with the bundled POI library version.
ACS 6.2 uses POI 4.1.1 and Tika 1.21. However, Tika appears to have moved to POI 4.1.1 only in release 1.23, so the Tika version currently in use appears to be incompatible with that POI version (see https://issues.apache.org/jira/browse/TIKA-2851).
When the included tika-parsers-1.21.jar in web-server\webapps\alfresco\WEB-INF\lib is replaced with version 1.24 prior to starting ACS, metadata extraction succeeds without errors.
With ATS deployed and enabled, metadata extraction succeeds because ATS uses Tika 1.24.1 from version 1.3.1 onward.
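To see quickly which Tika and POI jars a given installation actually ships, the library folder can be scanned by file name. The following is only a rough sketch: the helper names (`find_jar_versions`, `looks_incompatible`) are made up for illustration, and the rule "POI 4.1.1 or later needs Tika 1.23 or later" is simply the assumption stated in this report, not a verified compatibility matrix.

```python
import re
import sys
from pathlib import Path

# Matches e.g. "tika-parsers-1.21.jar" or "poi-4.1.1.jar" and captures the version.
# Jars such as "poi-ooxml-4.1.1.jar" are deliberately not matched.
JAR_PATTERN = re.compile(r"^(tika-parsers|poi)-(\d+(?:\.\d+)*)\.jar$")


def parse_version(text):
    """Turn '1.24.1' into (1, 24, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def find_jar_versions(lib_dir):
    """Return a dict mapping 'tika-parsers' / 'poi' to the version tuple found in lib_dir."""
    versions = {}
    for jar in Path(lib_dir).glob("*.jar"):
        match = JAR_PATTERN.match(jar.name)
        if match:
            versions[match.group(1)] = parse_version(match.group(2))
    return versions


def looks_incompatible(versions):
    """Assumption taken from this report: POI 4.1.1+ needs Tika 1.23 or later."""
    tika = versions.get("tika-parsers")
    poi = versions.get("poi")
    if tika is None or poi is None:
        return False  # cannot judge without both jars present
    return poi >= (4, 1, 1) and tika < (1, 23)


if __name__ == "__main__":
    # Pass the WEB-INF/lib folder of the deployed alfresco webapp as the first argument.
    found = find_jar_versions(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(found, "suspect" if looks_incompatible(found) else "ok")
```

Run it against the WEB-INF\lib folder mentioned above, once before and once after swapping the Tika jar, to confirm that the replacement was picked up.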
Steps to reproduce
Configure ACS 6.2.2 so that only legacy transformations are enabled, i.e. set the following in alfresco-global.properties:
Set the following logging class:
Open Share and upload the attached file "testDocument.docx"
Expected behavior
The document is uploaded and metadata is extracted with no errors appearing in the logs. Metadata properties like the document author are populated on the node.
Actual behavior
The document is uploaded and a preview is generated, but metadata extraction fails with the below error in the Alfresco log. Metadata properties like the document author are not filled in.
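When reproducing this, it can help to confirm independently of ACS and Tika that the test file really carries an author value. A .docx file is a ZIP package whose core properties live in docProps/core.xml, so the author can be read with the Python standard library alone. This is a minimal sketch; the function name `read_docx_author` is hypothetical and the script is not part of ACS.

```python
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace used for the author (dc:creator) in OOXML core properties.
DC_NS = "http://purl.org/dc/elements/1.1/"
CORE_PROPS = "docProps/core.xml"


def read_docx_author(path_or_file):
    """Return the dc:creator text of a .docx package, or None if it is absent."""
    with zipfile.ZipFile(path_or_file) as package:
        if CORE_PROPS not in package.namelist():
            return None  # package has no core properties part
        with package.open(CORE_PROPS) as core:
            root = ET.parse(core).getroot()
    creator = root.find(f"{{{DC_NS}}}creator")
    return creator.text if creator is not None else None


if __name__ == "__main__":
    import sys
    # Example: python read_author.py testDocument.docx
    print(read_docx_author(sys.argv[1]))
```

If this prints an author but the node in Share has none, the value was lost during extraction rather than missing from the file.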
Sorry for the late response; it took longer than expected to create a DEV env with legacy transformers.
The issue is not reproducible on master (7.0) or 6.2.N; I later discovered that Tika was upgraded in for those branches.
I have created a PR for the fix →
I could not cherry-pick the changes done in as the projects were refactored, but the same changes are included in the above PR.
I'll proceed with merging, re-check the fix, then release.