Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/09/21 13:00:07 UTC

[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

    [ https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900508#comment-14900508 ] 

Tim Allison commented on TIKA-1737:
-----------------------------------

Thank you for raising this issue.  I don't think we've seen this increase in our Common Crawl slice or in govdocs1, though that is not to say that I doubt your findings!

If you have a chance, would you be able to confirm that you were getting good text out of the files before (a handful, randomly selected would do)?  Sometimes a new exception is actually a good thing.
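
For the spot check, something along these lines might do.  This is only a rough sketch: SpotCheck, sample and readableRatio are made-up names, and the extraction call itself is left out (the assumption is that you would run each sampled file through both the old and the new Tika version and compare the resulting text).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SpotCheck {

    /** Returns up to n items chosen at random; a fixed seed keeps the pick repeatable across runs. */
    public static <T> List<T> sample(List<T> items, int n, long seed) {
        List<T> copy = new ArrayList<>(items);
        Collections.shuffle(copy, new Random(seed));
        return new ArrayList<>(copy.subList(0, Math.min(n, copy.size())));
    }

    /**
     * Crude sanity signal (an assumption, not a Tika feature): the share of
     * letters, digits and whitespace in the extracted text. A low value
     * hints that the "successful" extraction may have produced junk.
     */
    public static double readableRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        long ok = text.chars()
                .filter(c -> Character.isLetterOrDigit(c) || Character.isWhitespace(c))
                .count();
        return (double) ok / text.length();
    }
}
```

Using the same seed for both versions means both runs look at the same files, so any difference in the ratio comes from the parser rather than from the sample.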

Also, if there is any way to share the triggering docs, that would help the PDFBox team, and we can test with PDFBox 2.0-trunk to see how that compares.  If I were to update my dev Tika wrapper around PDFBox 2.0 on github, would you be willing/able to test it on these docs?

[~tilman], do any of these stacktraces look familiar?  


> PDFBox 1.8.10 is still a basket case
> ------------------------------------
>
>                 Key: TIKA-1737
>                 URL: https://issues.apache.org/jira/browse/TIKA-1737
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.10
>         Environment: Linux, Solaris
>            Reporter: Alan Burlison
>         Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather than PDFBox being better it's actually far, far worse. With the same corpus, Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each time there's an error indexing a PDF file. It's so bad I'm going to switch to running pdftotext (part of Xpdf) as an external process. Note that many of the errors in PDFBox are clearly caused by programming errors, e.g. ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and EOFException.
> I strongly recommend that Tika either reverts back to PDFBox 1.8.6 or finds a replacement for PDFBox as 1.8.10 just isn't fit for purpose.


