The customer is having issues with PDF text extraction. For certain documents they keep getting the following errors:
"WARNING: Invalid ToUnicode CMap in font FDFBJU+NewsGothic
Feb 10, 2021 11:40:46 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
SEVERE: Can't read the embedded Type1 font FDFBJU+NewsGothic
java.io.IOException: Expected INTEGER or REAL but got NAME"
Steps to reproduce:
Upload the file attached in the following location:
Review logs in catalina.out
Expected behavior:
The text should be extracted without issue.
Actual behavior:
When the PDF is uploaded, the errors quoted above are logged.
Environment:
Alfresco 6.2.2, MySQL, Tomcat
Analysis to date:
This was also tested with the latest release of PDFBox, "pdfbox-app-2.0.22.jar".
Also reproduced with ATS AIO 2.3.7. Only parts of the document get indexed.
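When reviewing catalina.out for further affected documents, it may help to list which fonts trigger the two PDFBox messages quoted in this report. The following is a minimal sketch using only the Python standard library. The function names `find_failing_fonts` and `scan_log` are hypothetical helpers, and the regular expressions assume the log lines have the same wording as the excerpt above.

```python
import re

# Patterns taken from the two PDFBox messages quoted in this report.
# Assumption: other affected documents log the same wording.
FONT_ERROR_PATTERNS = [
    re.compile(r"Invalid ToUnicode CMap in font (\S+)"),
    re.compile(r"Can't read the embedded Type1 font (\S+)"),
]


def find_failing_fonts(log_lines):
    """Return a dict mapping each font name to the number of matching log lines."""
    counts = {}
    for line in log_lines:
        for pattern in FONT_ERROR_PATTERNS:
            match = pattern.search(line)
            if match:
                font = match.group(1)
                counts[font] = counts.get(font, 0) + 1
    return counts


def scan_log(path):
    """Scan a log file (hypothetical helper; pass the path to catalina.out)."""
    # errors="replace" so stray bytes in the log do not abort the scan.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_failing_fonts(handle)
```

Calling `scan_log` with the path to catalina.out (the location depends on the Tomcat installation) returns the font names seen in those messages together with how many log lines mention each one.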