Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/03/23 12:25:25 UTC

[jira] [Commented] (TIKA-1907) Big Pdf parsing to text - Out of memory

    [ https://issues.apache.org/jira/browse/TIKA-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208274#comment-15208274 ] 

Tim Allison commented on TIKA-1907:
-----------------------------------

Thank you for raising this issue.  As [~tilman] pointed out, there may be room for memory optimization within PDFBox.  To be fair, though, Acrobat Reader consumed 500MB of memory when saving the file to text, and when you decode the doc with the PDFBox app's WriteDecodedDoc, the file expands to 190MB.

pdftotext appears to use less memory for this file.

If there's anything you can recommend we do on the Tika side to decrease the memory footprint, please let us know.

I plan to parameterize the scratch file usage, but as you found, that doesn't offer enormous savings.
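
For anyone unfamiliar with the term, a scratch file trades heap for disk: bytes are kept in memory up to some threshold and written to a temporary file beyond it. The sketch below only illustrates that idea with the JDK; it is not PDFBox's or Tika's implementation, and `SpillBuffer` and `maxInMemory` are hypothetical names made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative only: buffers in memory up to a threshold, then spills to a temp file. */
class SpillBuffer extends OutputStream {
    private final int maxInMemory;      // hypothetical tuning knob, in bytes
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private Path scratchFile;           // stays null until the buffer spills
    private OutputStream scratchOut;
    private long size;

    SpillBuffer(int maxInMemory) {
        this.maxInMemory = maxInMemory;
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] data, int off, int len) throws IOException {
        // Spill once, the first time a write would exceed the in-memory budget.
        if (scratchOut == null && memory.size() + len > maxInMemory) {
            spill();
        }
        if (scratchOut != null) {
            scratchOut.write(data, off, len);
        } else {
            memory.write(data, off, len);
        }
        size += len;
    }

    private void spill() throws IOException {
        scratchFile = Files.createTempFile("scratch", ".bin");
        scratchOut = Files.newOutputStream(scratchFile);
        memory.writeTo(scratchOut);     // move what was buffered so far to disk
        memory = null;                  // let the heap copy be collected
    }

    boolean isSpilled() {
        return scratchFile != null;
    }

    long size() {
        return size;
    }

    @Override
    public void close() throws IOException {
        if (scratchOut != null) {
            scratchOut.close();
            Files.deleteIfExists(scratchFile);
        }
    }
}
```

Only the bytes routed through such a buffer move to disk. Anything else a parser holds on the heap is unaffected, which may be one reason the savings from scratch files are limited for this document.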

> Big Pdf parsing to text - Out of memory
> ---------------------------------------
>
>                 Key: TIKA-1907
>                 URL: https://issues.apache.org/jira/browse/TIKA-1907
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.12
>            Reporter: Nicolas Daniels
>
> Linked to PDFBox issue: [https://issues.apache.org/jira/browse/PDFBOX-3284]
> I'm duplicating it here to make sure it will be fixed in Tika as well. Maybe PDFBox is not the appropriate library to use in such a case.
> Trying to read the same PDF using Tika leads to the same problem:
> {code:title=Test.java|borderStyle=solid}
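> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileWriter;
> import java.io.InputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.pdf.PDFParser;
> import org.apache.tika.sax.BodyContentHandler;
> import org.junit.Test;
>
> @Test
> public void testParsePdf_Content_Memory() throws Exception {
>     // Write the extracted text straight to a file; both resources are closed automatically.
>     try (InputStream inputStream = new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
>          FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"))) {
>         BodyContentHandler handler = new BodyContentHandler(fileWriter);
>         Metadata metadata = new Metadata();
>         new PDFParser().parse(inputStream, handler, metadata, new ParseContext());
>     }
> }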
> {code}


