Posted to dev@tika.apache.org by "Nicolas Daniels (JIRA)" <ji...@apache.org> on 2016/03/23 09:20:25 UTC

[jira] [Created] (TIKA-1907) Big Pdf parsing to text - Out of memory

Nicolas Daniels created TIKA-1907:
-------------------------------------

             Summary: Big Pdf parsing to text - Out of memory
                 Key: TIKA-1907
                 URL: https://issues.apache.org/jira/browse/TIKA-1907
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.12
            Reporter: Nicolas Daniels


Linked to PDFBox issue: [https://issues.apache.org/jira/browse/PDFBOX-3284]

I'm duplicating it here to make sure it gets fixed in Tika as well. PDFBox may not be the appropriate library for a case like this.

Trying to read the same PDF using Tika leads to the same problem:

{code:title=Test.java|borderStyle=solid}
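import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.Writer;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.junit.Test;

@Test
public void testParsePdf_Content_Memory() throws Exception {
    // Write the extracted text straight to a file so the output itself is not held in memory.
    // try-with-resources closes both streams even if parsing fails.
    try (InputStream inputStream = new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
         Writer fileWriter = new FileWriter("c:/tmp/test.txt")) {
        BodyContentHandler handler = new BodyContentHandler(fileWriter);
        Metadata metadata = new Metadata();
        new PDFParser().parse(inputStream, handler, metadata, new ParseContext());
    }
}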
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)