Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2018/11/29 05:09:00 UTC

[jira] [Comment Edited] (PDFBOX-4389) Excessive load times for large pdfs

    [ https://issues.apache.org/jira/browse/PDFBOX-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702723#comment-16702723 ] 

Tilman Hausherr edited comment on PDFBOX-4389 at 11/29/18 5:08 AM:
-------------------------------------------------------------------

What you could try is to change the for loop (which includes a call to {{doc.getPage(i)}}) to a foreach loop that uses the page objects from {{doc.getPages()}}.
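
A rough illustration of why this might help, under a stated assumption: if an indexed lookup such as {{doc.getPage(i)}} walks the page tree from the root on every call, then an indexed loop over a flat tree touches roughly n*(n+1)/2 nodes, while a single foreach traversal touches n. The sketch below uses a hypothetical stand-in ({{FlatPageTree}}, {{indexedLoopVisits}}, {{foreachLoopVisits}} are made-up names, not PDFBox classes) and only counts node visits; it does not measure PDFBox itself.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a flat page tree; these are NOT PDFBox classes.
class PageLoopSketch {

    static final class FlatPageTree implements Iterable<String> {
        private final List<String> kids = new ArrayList<>();
        long visits = 0; // number of tree nodes touched so far

        FlatPageTree(int pageCount) {
            for (int i = 0; i < pageCount; i++) {
                kids.add("page-" + i);
            }
        }

        int getCount() {
            return kids.size();
        }

        // Assumed behaviour of an indexed lookup: scan from the start on every call.
        String get(int index) {
            int position = 0;
            for (String kid : kids) {
                visits++;
                if (position == index) {
                    return kid;
                }
                position++;
            }
            throw new IndexOutOfBoundsException("No page at index " + index);
        }

        // Single traversal: each page is visited exactly once.
        @Override
        public Iterator<String> iterator() {
            Iterator<String> inner = kids.iterator();
            return new Iterator<String>() {
                @Override
                public boolean hasNext() {
                    return inner.hasNext();
                }

                @Override
                public String next() {
                    String page = inner.next();
                    visits++;
                    return page;
                }
            };
        }
    }

    // Indexed for loop: one lookup from the start per page.
    static long indexedLoopVisits(int pageCount) {
        FlatPageTree tree = new FlatPageTree(pageCount);
        for (int i = 0; i < tree.getCount(); i++) {
            tree.get(i);
        }
        return tree.visits;
    }

    // Foreach loop: the page objects come from one pass over the tree.
    static long foreachLoopVisits(int pageCount) {
        FlatPageTree tree = new FlatPageTree(pageCount);
        for (String page : tree) {
            // a real loop body would work with the page object here
        }
        return tree.visits;
    }

    public static void main(String[] args) {
        int pages = 6000; // page count mentioned in the issue description
        System.out.println("indexed loop visits: " + indexedLoopVisits(pages));
        System.out.println("foreach loop visits: " + foreachLoopVisits(pages));
    }
}
```

Whether the real page tree behaves like this model depends on how the PDF's page tree is structured, so treat the numbers as an upper-bound intuition rather than a measurement.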


was (Author: tilman):
What you could try is to change the for loop (which include a call to {{doc.getPage(i)}}) to a foreach loop which uses the page objects from {{doc.getPages()}}.

> Excessive load times for large pdfs
> -----------------------------------
>
>                 Key: PDFBOX-4389
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4389
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.12
>         Environment: OpenJDK 10, Ubuntu
> pdfbox v2.0.12
> jbig2-imageio v3.0.2
>            Reporter: Ben Manes
>            Priority: Major
>         Attachments: PdfComponent.java
>
>
> We render preview images for pdfs being uploaded. This is usually quite fast, as often these are short PDFs (e.g. shipments). One customer has a habit of uploading 6,000+ pages, which I believe is their historicals. This can take a while, though I am currently seeing over a minute per page:
> {{Processed page 940 / 1930 for pdf 1d2c0351-6c1f-4198-bd0b-6728927d7d00 within f1816bb9-3da2-4b61-a3d2-3ca9c419598e in 1.443 min}}
> The operation is safely parallelized by reading the number of pages, enqueuing a task per page index, opening the pdf in the task, and rendering the page index. Each task creates a new {{MemoryUsageSetting}} with 2 MB of memory and unlimited disk. When monitoring this upload, which will take 32 hours at this rate, the active scratch files are over 500 MB.
> {{$ du -h /tmp/cache_12639792278559363345/session_2059639776597126303/f1816bb9-3da2-4b61-a3d2-3ca9c419598e/component/pdf/pdfbox/1d2c0351-6c1f-4198-bd0b-6728927d7d00 | cut -f1 | sort -u}}
> {{2.3G}}
> {{4.0K}}
> {{524M}}
> {{531M}}
> {{552M}}
> {{653M}}
> When polling the stack traces, the threads appear to be spending most of their time on expanding the temp file for the per-page task's loading of the pdf(s).
> Can you explain why this is so slow? My hope is that it could traverse to the page quickly, render it, and close. In this case I might try refactoring to pool the opened documents instead of loading anew, as previously the image rendering was the performance problem (since {{KcmsServiceProvider}} is no longer available).
>  
> ----
> java.lang.Thread.State: RUNNABLE
>  at java.io.RandomAccessFile.setLength(java.base@10.0.1/Native Method)
>  at org.apache.pdfbox.io.ScratchFile.enlarge(ScratchFile.java:245)
>  locked <0x00000006f6268cc0> (a java.lang.Object)
>  at org.apache.pdfbox.io.ScratchFile.getNewPage(ScratchFile.java:167)
>  locked <0x00000006f6268f10> (a java.util.BitSet)
>  at org.apache.pdfbox.io.ScratchFileBuffer.addPage(ScratchFileBuffer.java:126)
>  at org.apache.pdfbox.io.ScratchFileBuffer.ensureAvailableBytesInPage(ScratchFileBuffer.java:184)
>  at org.apache.pdfbox.io.ScratchFileBuffer.write(ScratchFileBuffer.java:236)
>  at org.apache.pdfbox.io.RandomAccessOutputStream.write(RandomAccessOutputStream.java:46)
>  at org.apache.pdfbox.cos.COSStream$2.write(COSStream.java:279)
>  at org.apache.pdfbox.pdfparser.COSParser.readValidStream(COSParser.java:1299)
>  at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1127)
>  at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:913)
>  at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>  at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>  at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>  at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>  at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>  at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>  at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:949)


