Transformation of PDF created with ilovepdf continues indefinitely

Description

Transformation of the attached PDF never completes; it keeps running until the hard disk runs out of space.

Steps to reproduce:

1. Set up T-Engine 1.3 with the AIO Transformer at http://localhost:8090/
2. Go to http://localhost:8090/
3. In the Tika test section, set the source to application/pdf
4. Upload the attached PDF
5. Click Transform

Expected behaviour:

The transformation completes successfully.

Observed behaviour:

The transformation does not complete and the hard disk fills up.

Notes:

I tested on Windows with the new T-Engine. Customer used Linux and Legacy.

Note (astrachan) - attachments (thread dumps and test PDF) are located in ftp.alfresco.com/support/Jira_Related/MNT-22082 and have been removed from this ticket.

Environment

Windows, Linux

Testcase ID

None

Activity

Kristian Dimitrov
April 16, 2021, 1:34 PM
Edited

Confirmed - Fixed.

Tested with docker image built from most recent transform-core master.

Command used: docker run -p 8090:8090 -e PDFBOX_NOTEXTRACTBOOKMARKS_DEFAULT='true' <AIO docker container id>

Note: The bug still reproduces if the above flag is not set when the app or Docker container is deployed.

David Edwards
April 7, 2021, 1:20 PM
Edited


The fix exposes a new variable to the Tika and AIO T-engines to control the default behaviour of the notExtractBookmarksText request parameter, similar to the previous repo workaround.
This variable can be set in one of two ways:

  1. Through the application-default.yaml file of the T-engine. Update/add the following variable:

  2. Through an environment variable (this can be passed through to helm/docker-compose):
    PDFBOX_NOTEXTRACTBOOKMARKS_DEFAULT
    docker-compose example snippet ("" quote marks are required here):

The default value for this variable is false, so previous functionality is maintained: if notExtractBookmarksText is not passed, the transformation will, as it always has, attempt to extract the bookmarks text.
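
For readers who want the fallback rule spelled out, here is a minimal Python sketch of how the default could be resolved. It is not the T-engine's code: the helper names are hypothetical, and the rule that an explicitly passed notExtractBookmarksText overrides the environment default is an assumption inferred from the description above. Only the parameter name and the PDFBOX_NOTEXTRACTBOOKMARKS_DEFAULT variable come from this ticket.

```python
import os

# Environment variable name as given in this ticket.
ENV_VAR = "PDFBOX_NOTEXTRACTBOOKMARKS_DEFAULT"
# Request parameter name as given in this ticket.
REQUEST_PARAM = "notExtractBookmarksText"


def parse_bool(text):
    """Treat only the string 'true' (any case) as True."""
    return text.strip().lower() == "true"


def resolve_not_extract_bookmarks(request_params, env=None):
    """Hypothetical helper: decide whether bookmark text extraction is skipped.

    Assumption: a value passed on the request wins; otherwise the
    environment default applies, and that default is 'false'.
    """
    env = os.environ if env is None else env
    if REQUEST_PARAM in request_params:
        return parse_bool(request_params[REQUEST_PARAM])
    return parse_bool(env.get(ENV_VAR, "false"))
```

Under these assumptions, a request without the parameter and without the environment variable resolves to False, so bookmark text is still extracted as before; setting the variable to 'true' skips bookmark extraction only for requests that do not pass the parameter themselves.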

Marina Oliveira
March 30, 2021, 8:11 AM

Can you suggest potential options and, if my help is needed, specify how I can help?

Scott Ashcraft
March 29, 2021, 7:28 PM

ftp.alfresco.com seems to have been broken for the last couple of months. Issue is now with but I don't know the current status.

David Edwards
March 26, 2021, 3:15 PM

It looks like the content.transformer.PdfBox.extractBookmarksText=false property was removed in ACS 7.0.0, and I can confirm that there is currently no way to set extractBookmarksText to false by default. I'm currently looking into updating the Tika T-engine to accept such a parameter; that change will also update the AIO engine.

Flagged
Fixed

Assignee

Unassigned

Reporter

Marco Tonelli

Labels

ACT Numbers

01019210

Delivery Team

Team 6

Bug Priority

Category 2