Posted to dev@tika.apache.org by "Nick Burch (Jira)" <ji...@apache.org> on 2022/08/05 12:28:00 UTC

[jira] [Commented] (TIKA-3832) Required array length is too large (OOM) error when reading a PDF file

    [ https://issues.apache.org/jira/browse/TIKA-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575814#comment-17575814 ] 

Nick Burch commented on TIKA-3832:
----------------------------------

Any chance you could try with Apache PDFBox directly? They've got a handy command line tool you can use:

[https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems]

That will help us narrow down whether it's a Tika bug or one in the underlying PDFBox library.

> Required array length is too large (OOM) error when reading a PDF file
> ----------------------------------------------------------------------
>
>                 Key: TIKA-3832
>                 URL: https://issues.apache.org/jira/browse/TIKA-3832
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.4.1
>            Reporter: Lakatos Gyula
>            Priority: Major
>         Attachments: 7581cfbf-8c1e-4154-bfbb-4e633d858d5f.pdf
>
>
> I'm working on a web crawler and it got obliterated with an OutOfMemory error by a random PDF from the internet.
> {code:java}
> Exception in thread "main" java.lang.OutOfMemoryError: Required array length 2147483638 + 14 is too large
> 	at java.base/jdk.internal.util.ArraysSupport.hugeLength(ArraysSupport.java:649)
> 	at java.base/jdk.internal.util.ArraysSupport.newLength(ArraysSupport.java:642)
> 	at java.base/java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:257)
> 	at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:229)
> 	at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
> 	at java.base/java.lang.StringBuffer.append(StringBuffer.java:410)
> 	at java.base/java.io.StringWriter.write(StringWriter.java:99)
> 	at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:108)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
> 	at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:160)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
> 	at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:81)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
> 	at org.apache.tika.sax.SafeContentHandler.access$201(SafeContentHandler.java:47)
> 	at org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
> 	at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
> 	at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
> 	at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
> 	at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
> 	at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:977)
> 	at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:981)
> 	at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:959)
> 	at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:907)
> 	at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:239)
> 	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
> 	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196)
> 	at com.example.TikaOOMExample.main(TikaOOMExample.java:31)
> {code}
> I reproduced the error in this repository:
> [https://github.com/laxika/apache-tika-oom-reproduction|http://example.com/]
> Uploaded the PDF into the attachments as well. It can be opened and read by the PDF readers I tried (Edge, Adobe, Chrome).
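
For what it's worth, the trace above shows the extracted text piling up in a StringWriter until the backing array can't grow any further. Until the root cause is found, a crawler could cap how much text it keeps per document. Below is a minimal sketch of that idea using only the JDK. BoundedWriter and BoundedWriterDemo are hypothetical names, not part of Tika or PDFBox, and the limit of 16 characters is only for illustration. It assumes the writer is handed to whatever collects the text in place of a plain StringWriter (BodyContentHandler has a constructor that takes a Writer, so that may be one place to plug it in).

```java
import java.io.IOException;
import java.io.Writer;

/**
 * Hypothetical helper (not part of Tika or PDFBox): a Writer that keeps at
 * most maxChars characters and fails fast once that limit is exceeded,
 * instead of growing until the JVM runs out of memory.
 */
final class BoundedWriter extends Writer {
    private final StringBuilder buffer = new StringBuilder();
    private final int maxChars;

    BoundedWriter(int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be >= 0");
        }
        this.maxChars = maxChars;
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        int remaining = maxChars - buffer.length();
        if (len > remaining) {
            // Keep the part that still fits, then signal that the limit was hit.
            buffer.append(cbuf, off, remaining);
            throw new IOException("Text limit of " + maxChars + " characters exceeded");
        }
        buffer.append(cbuf, off, len);
    }

    @Override
    public void flush() {
        // Nothing to flush: everything is held in memory.
    }

    @Override
    public void close() {
        // Nothing to release.
    }

    @Override
    public String toString() {
        return buffer.toString();
    }
}

class BoundedWriterDemo {
    public static void main(String[] args) {
        BoundedWriter out = new BoundedWriter(16); // illustrative limit only
        try {
            // Mimic runaway output, such as the repeated bookmark text in the trace.
            for (int i = 0; i < 1_000; i++) {
                out.write("bookmark ");
            }
        } catch (IOException e) {
            System.err.println("Stopped early: " + e.getMessage());
        }
        System.out.println("Kept " + out.toString().length() + " characters");
    }
}
```

Tika's BodyContentHandler also accepts a write limit in its constructor, which may well be the simpler option if it fits the crawler. The sketch above is only meant to show the general technique of bounding the output.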



--
This message was sent by Atlassian Jira
(v8.20.10#820010)