Posted to dev@pdfbox.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2020/01/12 19:17:00 UTC

[jira] [Comment Edited] (PDFBOX-4739) Memory issues when rendering pdf to image

    [ https://issues.apache.org/jira/browse/PDFBOX-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013852#comment-17013852 ] 

Tilman Hausherr edited comment on PDFBOX-4739 at 1/12/20 7:16 PM:
------------------------------------------------------------------

This is surprising. I was able to render the file with {{-Xmx80m}} on JDK 13 and {{-Xmx200m}} on JDK 8 when using PDFToImage at 300 dpi with TIFF output.
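
For a rough sense of why a small heap can be enough for a single page, the sketch below estimates the size of one rendered raster. It assumes a US Letter page (612 x 792 pt, which may not match the attached file), 4 bytes per pixel for an RGB image, and pixel dimensions of floor(points * dpi / 72). The class and method names are hypothetical and not part of PDFBox.

```java
final class RasterMemoryEstimate {
    /** PDF user space uses 72 points per inch. */
    static final double POINTS_PER_INCH = 72.0;

    /** Pixel length of one page side at the given resolution (assumed rounding: floor, minimum 1). */
    static int pixels(double points, double dpi) {
        return (int) Math.max(Math.floor(points * dpi / POINTS_PER_INCH), 1);
    }

    /** Approximate bytes needed for the pixel data of one rendered page. */
    static long rasterBytes(double widthPt, double heightPt, double dpi, int bytesPerPixel) {
        return (long) pixels(widthPt, dpi) * pixels(heightPt, dpi) * bytesPerPixel;
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter, 612 x 792 pt; 4 bytes per pixel for an int-packed RGB image.
        long onePage = rasterBytes(612, 792, 300, 4);
        System.out.println("one page at 300 dpi: " + onePage + " bytes");
        // Six threads each holding one rendered page at the same time.
        System.out.println("six pages at 300 dpi: " + (6 * onePage) + " bytes");
    }
}
```

Under these assumptions one page needs about 34 MB of pixel data and six concurrent pages about 202 MB, not counting fonts, parsed PDF objects, or the encoded TIFF output.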

When running PDFDebugger under a profiler, 70 MB are already in use before the file is opened. This could be the fonts and the colors.
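
A figure like this can also be approximated without a profiler by reading used heap from {{Runtime}}. This is only a sketch: the class and helper names are hypothetical, {{System.gc()}} is a hint that the JVM may ignore, and the reported value includes garbage that has not been collected yet, so usage can appear to drop only after a GC has run.

```java
final class HeapUsage {
    /** Heap currently in use, including objects that are unreachable but not yet collected. */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long before = usedBytes();

        // Stand-in for one rendered page: 2550 x 3300 int pixels
        // (an assumed US Letter page at 300 dpi, not the attached file).
        int[] raster = new int[2550 * 3300];
        raster[0] = 1; // touch the array so the allocation is actually used
        long during = usedBytes();

        raster = null; // drop the only reference
        System.gc();   // request a collection; the JVM is free to ignore this
        long after = usedBytes();

        System.out.printf("before: %.1f MiB%n", toMiB(before));
        System.out.printf("during: %.1f MiB%n", toMiB(during));
        System.out.printf("after:  %.1f MiB%n", toMiB(after));
    }
}
```

The exact numbers depend on the JVM version, the collector, and the heap settings, so they are best compared across runs with the same {{-Xmx}} value.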


> Memory issues when rendering pdf to image
> -----------------------------------------
>
>                 Key: PDFBOX-4739
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4739
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering
>    Affects Versions: 2.0.18
>            Reporter: Lior Yaffe
>            Priority: Blocker
>         Attachments: linkedinceoresume.pdf
>
>
> I'm trying to write a web service that performs OCR on input PDF files.
> The approach is simple: convert the PDF to TIFF images using PDFBox, then run Tesseract on the TIFF images to extract the text.
> The code is straightforward:
>  
> {code:java}
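> import java.awt.image.BufferedImage;
> import java.io.ByteArrayOutputStream;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import javax.imageio.ImageIO;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.rendering.ImageType;
> import org.apache.pdfbox.rendering.PDFRenderer;
>
> // Renders every page at 300 DPI and returns one in-memory TIFF per page.
> private List<ByteArrayOutputStream> convertPdfToTiff2() throws IOException {
>     List<ByteArrayOutputStream> fileList = new ArrayList<>();
>     // try-with-resources closes the document even if loading or rendering fails
>     try (PDDocument doc = PDDocument.load(this.bytes)) {
>         doc.setResourceCache(new EmptyCache());
>         PDFRenderer pdfRenderer = new PDFRenderer(doc);
>         for (int page = 0; page < doc.getNumberOfPages(); ++page) {
>             BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
>             calcImageSize(bufferedImage);
>             // flush() and close() are no-ops on ByteArrayOutputStream, so they are omitted
>             ByteArrayOutputStream os = new ByteArrayOutputStream();
>             ImageIO.write(bufferedImage, "tiff", os);
>             bufferedImage.flush();
>             fileList.add(os);
>         }
>     }
>     return fileList;
> }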
> {code}
>  
> I'm running a sample test that calls this method concurrently from 5-6 threads, but the app crashes very quickly.
>  
> I did some memory tests, and it seems that while the input file is around 70 KB, the {{pdfRenderer}} object is around 300 MB. No matter how I change the DPI level, the object is still very large.
> In addition, I only see the memory drop when I call the GC explicitly, even though I'm closing the doc object.
>  
> When I run my server with -Xmx6GB and 6 concurrent threads, the service crashes after 3 runs. What am I missing here?
>  
>  * I attached the input pdf file
>  
