Posted to dev@tika.apache.org by "Nicolas Daniels (JIRA)" <ji...@apache.org> on 2016/03/23 09:20:25 UTC
[jira] [Created] (TIKA-1907) Big Pdf parsing to text - Out of memory
Nicolas Daniels created TIKA-1907:
-------------------------------------
Summary: Big Pdf parsing to text - Out of memory
Key: TIKA-1907
URL: https://issues.apache.org/jira/browse/TIKA-1907
Project: Tika
Issue Type: Bug
Affects Versions: 1.12
Reporter: Nicolas Daniels
Linked to PDFBox issue: [https://issues.apache.org/jira/browse/PDFBOX-3284]
I'm duplicating it here to make sure it gets fixed in Tika as well. PDFBox may not be the appropriate library to use in such a case.
Trying to read the same PDF using Tika leads to the same problem:
{code:title=Test.java|borderStyle=solid}
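import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.Writer;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.junit.Test;

@Test
public void testParsePdf_Content_Memory() throws Exception {
    // Write the extracted text straight to a file instead of buffering it in memory.
    // try-with-resources closes both streams even if parsing fails.
    try (InputStream inputStream = new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
         Writer fileWriter = new FileWriter("c:/tmp/test.txt")) {
        BodyContentHandler handler = new BodyContentHandler(fileWriter);
        Metadata metadata = new Metadata();
        // This call runs out of memory on the PDF above (see PDFBOX-3284).
        new PDFParser().parse(inputStream, handler, metadata, new ParseContext());
    }
}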
{code}
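To put numbers on the failure, heap usage could be logged just before and after the parse call in the test above. The following is only a minimal sketch using the JDK's `Runtime` API; the `HeapReport` class and its method names are hypothetical and are not part of Tika or PDFBox.

```java
final class HeapReport {
    private static final long MIB = 1024L * 1024L;

    /** Bytes currently in use on the JVM heap (allocated minus free). */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Upper bound the heap may grow to (typically the -Xmx limit). */
    static long maxBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** One-line summary, e.g. for printing before and after a parse call. */
    static String format(String label) {
        return label + ": used " + (usedBytes() / MIB)
                + " MiB of max " + (maxBytes() / MIB) + " MiB";
    }

    public static void main(String[] args) {
        System.out.println(format("before parse"));
        // Assumed placement: the PDFParser.parse(...) call from the test would run here.
        System.out.println(format("after parse"));
    }
}
```

Calling `HeapReport.format(...)` on either side of `parse(...)` would show how close the run gets to the configured maximum heap, which may help when comparing against the PDFBox report.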