Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2010/11/25 11:25:13 UTC

[jira] Resolved: (TIKA-557) Extract text file PDF error

     [ https://issues.apache.org/jira/browse/TIKA-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-557.
-----------------------------

    Resolution: Invalid

You've set a write limit on your ContentHandler, and the text extracted from your PDF exceeds it. That is what the WriteLimitReachedException in your stack trace is reporting.

If you don't want to restrict the size of the documents you process, use an unbounded handler, e.g. when creating a BodyContentHandler, don't specify a limit in the constructor.
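
To illustrate the mechanism, here is a minimal sketch that uses only the JDK's SAX classes. LimitedTextHandler and hitsLimit are hypothetical stand-ins written for this example, not Tika's WriteOutContentHandler; the convention that -1 means "no limit" is borrowed from the BodyContentHandler(int writeLimit) constructor as documented in Tika's Javadoc. That Javadoc also describes the no-argument constructor as bounded at 100k characters, so passing -1 explicitly is probably the safer way to get an unbounded handler; check this against the Tika version you run.

```java
import java.util.Arrays;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriteLimitSketch {

    /** Hypothetical stand-in for a write-limited handler; this is NOT Tika's class. */
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int writeLimit; // assumption for this sketch: -1 means "no limit"

        LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (writeLimit != -1 && text.length() + length > writeLimit) {
                // Keep the part that still fits, then signal that the limit was hit.
                text.append(ch, start, writeLimit - text.length());
                throw new SAXException("write limit of " + writeLimit + " characters reached");
            }
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Feeds the content to the handler in one chunk; returns true if the limit was reached. */
    static boolean hitsLimit(LimitedTextHandler handler, String content) {
        char[] chars = content.toCharArray();
        try {
            handler.characters(chars, 0, chars.length);
            return false;
        } catch (SAXException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        char[] filler = new char[1000];
        Arrays.fill(filler, 'x');
        String bigText = new String(filler); // stands in for the text of a large PDF

        LimitedTextHandler bounded = new LimitedTextHandler(100);
        LimitedTextHandler unbounded = new LimitedTextHandler(-1);

        System.out.println("bounded hit limit:   " + hitsLimit(bounded, bigText));
        System.out.println("unbounded hit limit: " + hitsLimit(unbounded, bigText));
        System.out.println("bounded kept " + bounded.toString().length() + " chars, "
                + "unbounded kept " + unbounded.toString().length() + " chars");
    }
}
```

With Tika itself, the corresponding change would presumably be to construct the handler as new BodyContentHandler(-1) before passing it to AutoDetectParser.parse. Keep in mind that an unbounded handler holds the entire extracted text in memory.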

> Extract text file PDF error
> ---------------------------
>
>                 Key: TIKA-557
>                 URL: https://issues.apache.org/jira/browse/TIKA-557
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.8
>            Reporter: Them Ta
>         Attachments: QA.pdf
>
>
> File to extract text: QA.pdf
> I tested PDFBox 1.3.1 from the console and text extraction worked fine, but with Tika (only this file fails) the log shows this error:
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException
> 	at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:120)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:81)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:153)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
> 	at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
> 	at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:113)
> 	at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
> 	at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:261)
> 	at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:287)
> 	at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:113)
> 	at org.apache.pdfbox.util.PDFTextStripper.writeLine(PDFTextStripper.java:1819)
> 	at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:727)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:365)
> 	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:321)
> 	at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:241)
> 	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
> 	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:90)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:150)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.