Search Services - Error PDF Type1 Font

Description

The content of certain PDF files fails to index, with the following error:

2021-02-10 22:16:57,425 ERROR [org.apache.pdfbox.pdmodel.font.PDType1Font] [pool-4-thread-2] Can't read the embedded Type1 font FDFBJU+NewsGothic
java.io.IOException: Expected INTEGER or REAL but got NAME
at org.apache.fontbox.type1.Type1Parser.arrayToNumbers(Type1Parser.java:256)
at org.apache.fontbox.type1.Type1Parser.readSimpleValue(Type1Parser.java:168)
at org.apache.fontbox.type1.Type1Parser.parseASCII(Type1Parser.java:139)
at org.apache.fontbox.type1.Type1Parser.parse(Type1Parser.java:61)

Steps to reproduce:

1. Install and configure ACS 6.2.2 and Search Services 2.0.1
2. Upload the sample PDF document to Share

Observed Behaviour:

The file is indexed, but only the filename is searchable; the content inside the file is not searchable.
The following error appears in the log:

2021-02-10 22:16:57,425 ERROR [org.apache.pdfbox.pdmodel.font.PDType1Font] [pool-4-thread-2] Can't read the embedded Type1 font FDFBJU+NewsGothic
java.io.IOException: Expected INTEGER or REAL but got NAME
at org.apache.fontbox.type1.Type1Parser.arrayToNumbers(Type1Parser.java:256)
at org.apache.fontbox.type1.Type1Parser.readSimpleValue(Type1Parser.java:168)
at org.apache.fontbox.type1.Type1Parser.parseASCII(Type1Parser.java:139)
at org.apache.fontbox.type1.Type1Parser.parse(Type1Parser.java:61)
at org.apache.fontbox.type1.Type1Font.createWithSegments(Type1Font.java:85)
at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:262)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76)
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:875)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:509)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:483)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
at org.alfresco.repo.content.metadata.TikaPoweredMetadataExtracter.extractRaw(TikaPoweredMetadataExtracter.java:399)
at org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter$ExtractRawCallable.call(AbstractMappingMetadataExtracter.java:2005)
at org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter$ExtractRawCallable.call(AbstractMappingMetadataExtracter.java:1)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
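
To find which embedded fonts trigger this error across a larger log, the message above can be matched with a regular expression. The following is a minimal sketch using only the Java standard library. The class name `Type1FontErrorScanner` and its methods are hypothetical helpers for triage and are not part of Alfresco, Tika, or PDFBox. The pattern assumes the log layout shown above. Stripping the `FDFBJU+` prefix relies on the PDF convention that subset fonts are tagged with six uppercase letters followed by `+`.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical triage helper: not part of Alfresco, Tika, or PDFBox.
final class Type1FontErrorScanner {

    // Matches the PDType1Font error line shown above and captures the font name.
    // Assumes the logger name and message text appear exactly as in this ticket.
    private static final Pattern FONT_ERROR = Pattern.compile(
            "ERROR \\[org\\.apache\\.pdfbox\\.pdmodel\\.font\\.PDType1Font\\]"
            + ".*Can't read the embedded Type1 font (\\S+)");

    // Returns the font name reported by the log line, or empty if the line does not match.
    static Optional<String> fontName(String logLine) {
        Matcher m = FONT_ERROR.matcher(logLine);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Removes a subset tag (six uppercase letters and '+') from the front of a font name.
    static String withoutSubsetTag(String name) {
        return name.replaceFirst("^[A-Z]{6}\\+", "");
    }

    public static void main(String[] args) {
        String line = "2021-02-10 22:16:57,425 ERROR "
                + "[org.apache.pdfbox.pdmodel.font.PDType1Font] [pool-4-thread-2] "
                + "Can't read the embedded Type1 font FDFBJU+NewsGothic";
        // Print the base font name when the line matches the error pattern.
        fontName(line)
                .map(Type1FontErrorScanner::withoutSubsetTag)
                .ifPresent(System.out::println);
    }
}
```

Running `main` against the log line from this ticket prints the base font name, which can help group affected documents by the font that fails to parse.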

Expected Behaviour:

The content is indexed and searchable without any error.

Environment

None

Testcase ID

None

Activity

Alex Mukha
February 15, 2021, 8:38 PM

The issue was raised against the search component, but it has nothing to do with search. This is a transformation issue in PDFBox.

Assignee

Unassigned

Reporter

Shilpa Tupe

Labels

None

ACT Numbers

00335598

Security Issue

None

Patch Attached

None

Premier Customer

None

Prioritization Score

None

Delivery Team

None

Build Location

None

Cloud or Enterprise

None

Bug Priority

Category 2

Work Funnel End

None

Escalated By

None

Dependent Version/s

None

Regression Since

None

Code Branch

None

Fix versions

Affects versions