Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/12/20 13:37:00 UTC

[jira] [Comment Edited] (TIKA-2496) TIKA crashes / runs out of memory on simple PDF

    [ https://issues.apache.org/jira/browse/TIKA-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298513#comment-16298513 ] 

Tim Allison edited comment on TIKA-2496 at 12/20/17 1:36 PM:
-------------------------------------------------------------

bq. Able to replicate the issue with any zip file of size more than 2gb.

Funny you should mention this: just yesterday I wrote an "unraveler" for a [PST file|https://github.com/tballison/tika-addons/tree/1.17/unravel/src/main/java/org/tallison/tika/unravelers].  The idea is that when you have large archive files (pst, mbox, zip, tar), you either want to do some preprocessing to extract all of the attachments (and then run Tika on the extracted binaries), or you want to process the large archives specially so that each embedded file is extracted as its own standalone "extract", rather than, as we do now, as one monstrous XHTML document or {{List<Metadata>}}.  If you are extracting text for search, for example, a user would typically not be thrilled to have a 2 GB zip file treated as a single document.

So, would it make sense to do some preprocessing on your large zips to extract the contents as binary files and then run Tika against those?
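Roughly, that kind of preprocessing might look like the sketch below. To be clear, this is not the actual unraveler code from the repo linked above, just an illustration: the class name and the temp-directory handling are made up, and it assumes commons-compress (already on Tika's classpath) for zip64-aware streaming of the large archive, with each extracted binary then parsed on its own.

{code:java}
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class UnravelZip {

    public static void main(String[] args) throws Exception {
        Path bigZip = Paths.get(args[0]);
        Path workDir = Files.createTempDirectory("unravel-");

        // Phase 1: stream each entry out to disk so that it becomes a standalone binary.
        // commons-compress is used because it copes with zip64 (> 2 GB) archives.
        try (InputStream is = Files.newInputStream(bigZip);
             ZipArchiveInputStream zis = new ZipArchiveInputStream(is)) {
            ZipArchiveEntry entry;
            while ((entry = zis.getNextZipEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // flatten the entry path; a real unraveler would guard against name collisions
                Path target = workDir.resolve(Paths.get(entry.getName()).getFileName());
                Files.copy(zis, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }

        // Phase 2: run Tika on each extracted binary independently, so one bad or huge
        // attachment only costs you that one "extract", not the whole archive.
        AutoDetectParser parser = new AutoDetectParser();
        try (DirectoryStream<Path> extracted = Files.newDirectoryStream(workDir)) {
            for (Path p : extracted) {
                Metadata metadata = new Metadata();
                metadata.set(Metadata.RESOURCE_NAME_KEY, p.getFileName().toString());
                try (InputStream stream = Files.newInputStream(p)) {
                    BodyContentHandler handler = new BodyContentHandler(-1);
                    parser.parse(stream, handler, metadata, new ParseContext());
                    // hand handler.toString() and metadata to your indexer here
                } catch (Exception e) {
                    // a single problematic attachment should not sink the whole batch
                    System.err.println("failed on " + p + ": " + e.getMessage());
                }
            }
        }
    }
}
{code}

The point of splitting it into two phases is that memory is then bounded by the largest single entry you parse, not by the whole archive, and a parse failure on one attachment doesn't take down the rest.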

Eventually, I'd like to add the unraveler functionality into Tika, but that's a good way off.


was (Author: tallison@mitre.org):
bq. Able to replicate the issue with any zip file of size more than 2gb.

Funny you should mention this: just yesterday I wrote an "unraveler" for a [PST file|https://github.com/tballison/tika-addons/tree/1.17/unravel/src/main/java/org/tallison/tika/unravelers].  The idea is that when you have large archive files (pst, mbox, zip, tar), you either want to do some preprocessing to extract all of the attachments, or you want to process them specially so that each embedded file is extracted as its own standalone "extract".  If you are extracting text for search, for example, a user would typically not be thrilled to have a 2 GB zip file treated as a single document.

So, would it make sense to do some preprocessing on your large zips to extract the contents?

Eventually, I'd like to add the unraveler functionality into Tika, but that's a good way off.

> TIKA crashes / runs out of memory on simple PDF
> -----------------------------------------------
>
>                 Key: TIKA-2496
>                 URL: https://issues.apache.org/jira/browse/TIKA-2496
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.15
>         Environment: Linux, Java 8
>            Reporter: chelambarasan
>
> We're using Tika embedded in a web crawler, and today I encountered a PDF that results in OutOfMemoryError while being processed by Tika.
> We tried with -Xmx5g; the PDF files are approximately 50 MB in size.
> Tika version: 1.15
> Error as below:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> 	at org.apache.pdfbox.io.ScratchFileBuffer.addPage(ScratchFileBuffer.java:132)
> 	at org.apache.pdfbox.io.ScratchFileBuffer.ensureAvailableBytesInPage(ScratchFileBuffer.java:184)
> 	at org.apache.pdfbox.io.ScratchFileBuffer.write(ScratchFileBuffer.java:236)
> 	at org.apache.pdfbox.io.RandomAccessOutputStream.write(RandomAccessOutputStream.java:46)
> 	at org.apache.pdfbox.cos.COSStream$2.write(COSStream.java:266)
> 	at org.apache.pdfbox.pdfparser.COSParser.readValidStream(COSParser.java:1142)
> 	at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:970)
> 	at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:781)
> 	at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:742)
> 	at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:673)
> 	at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:633)
> 	at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:241)
> 	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:276)
> 	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1132)
> 	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1066)
> 	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:141)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> Please let us know how to fix this issue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)