Issues with pdfbox - Invalid ToUnicode CMap in font

Description

The customer is having issues with PDF text extraction. They keep getting the following errors for certain documents:
"WARNING: Invalid ToUnicode CMap in font FDFBJU+NewsGothic
Feb 10, 2021 11:40:46 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
SEVERE: Can't read the embedded Type1 font FDFBJU+NewsGothic
java.io.IOException: Expected INTEGER or REAL but got NAME"

Steps to reproduce:
Upload the file available at the following location:

ftp://ftp.alfresco.com/support/Jira_Related/MNT-22194

Review logs in catalina.out
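
The log review in the last step can be partly automated. The following is a minimal sketch, assuming the two PDFBox messages appear in catalina.out with the same wording as quoted in the description. The class name PdfBoxFontLogScan, the problem labels, and the default log path are hypothetical and not part of Alfresco or PDFBox.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for triage; not part of Alfresco or PDFBox.
class PdfBoxFontLogScan {

    // Patterns are taken from the two messages quoted in this ticket.
    private static final Pattern TO_UNICODE =
            Pattern.compile("Invalid ToUnicode CMap in font (\\S+)");
    private static final Pattern TYPE1 =
            Pattern.compile("Can't read the embedded Type1 font (\\S+)");

    // Maps each font name found in the log lines to the problems reported for it.
    static Map<String, Set<String>> affectedFonts(List<String> lines) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (String line : lines) {
            record(result, TO_UNICODE.matcher(line), "invalid ToUnicode CMap");
            record(result, TYPE1.matcher(line), "unreadable embedded Type1 font");
        }
        return result;
    }

    // Adds the problem label under the captured font name if the line matches.
    private static void record(Map<String, Set<String>> result, Matcher m, String problem) {
        if (m.find()) {
            result.computeIfAbsent(m.group(1), k -> new LinkedHashSet<>()).add(problem);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the log path is passed as the first argument;
        // "catalina.out" in the working directory is only a fallback.
        String path = args.length > 0 ? args[0] : "catalina.out";
        // ISO-8859-1 decodes any byte sequence, so stray bytes in the log do not abort the scan.
        List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.ISO_8859_1);
        affectedFonts(lines)
                .forEach((font, problems) -> System.out.println(font + ": " + problems));
    }
}
```

Run against the log after uploading the file, this lists each font named in the warnings together with the kinds of failure logged for it, which may help to check whether other documents hit the same fonts.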

Expected Behaviour:
Text should be extracted without errors.

Observed Behaviour:
When the PDF is uploaded, the errors quoted in the description are written to catalina.out.

Environment Reproduction:
Alfresco 6.2.2, MySQL, Tomcat

Analysis to date:
The same file was also tested with the latest PDFBox release, "pdfbox-app-2.0.22.jar".

Environment

ACS 6.2.2

Testcase ID

None

Activity

Scott Ashcraft
February 13, 2021, 12:51 AM

Also reproduced with ATS AIO 2.3.7. Only parts of the document get indexed.

Assignee

Unassigned

Reporter

David Almazan

Labels

None

ACT Numbers

00335599

Security Issue

None

Patch Attached

None

Premier Customer

None

Prioritization Score

None

Delivery Team

None

Build Location

None

Cloud or Enterprise

None

Bug Priority

Category 2

Work Funnel End

None

Escalated By

None

Dependent Version/s

None

Regression Since

None

Code Branch

None

Fix versions

Affects versions