Posted to dev@pdfbox.apache.org by "Roger Håkansson (JIRA)" <ji...@apache.org> on 2012/05/09 17:01:50 UTC

[jira] [Updated] (PDFBOX-1305) Text extraction takes huge amount of time on some files

     [ https://issues.apache.org/jira/browse/PDFBOX-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roger Håkansson updated PDFBOX-1305:
------------------------------------

    Attachment: 20020101ab3x012a.pdf
    
> Text extraction takes huge amount of time on some files
> -------------------------------------------------------
>
>                 Key: PDFBOX-1305
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1305
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>         Environment: Same phenomenon on Windows 7, Solaris 10 and CentOS 5.7. Same result with JDK 7u4 and JDK 6u32
>            Reporter: Roger Håkansson
>         Attachments: 20020101ab3x012a.pdf
>
>
> I have 1.2M single-page PDF files that I'm indexing with Solr (which uses Tika, which in turn uses PDFBox), and some of them take between 20 minutes and an hour to index.
> This is a huge problem for me: in 48 hours I've indexed about 45k files, and 19 hours of that time was spent on just 279 files.
> I've traced it to PDFBox taking a lot of time extracting the text from the documents.
> I've tested extracting the text with pdfbox-app's ExtractText and got the same result: the text is extracted, but it takes forever.
> The attached file took about 23 minutes (using ExtractText), and the output contains a lot of "rubbish text" that I don't see in text extracted from files that take a normal amount of time (up to a few seconds per file) to parse.
> When running truss on the java process (on Solaris; strace on Linux), I can see a lot of SEGV signals due to FLTBOUNDS. I don't know whether it's related to this problem, but I wanted to mention it.
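
Editorial note: the figures quoted in the description can be turned into rough per-file averages to show how skewed the workload is. The sketch below is only illustrative arithmetic, not part of the original report. The class and method names are hypothetical, and "about 45k" is treated as exactly 45000, which is an assumption.

```java
public class IndexingStats {
    /** Average seconds spent per file, given a total duration and a file count. */
    public static double avgSecondsPerFile(double totalSeconds, long files) {
        return totalSeconds / files;
    }

    public static void main(String[] args) {
        // Figures taken from the report; the total file count is approximate.
        long totalFiles = 45000;   // "about 45k files" (assumed exact here)
        long slowFiles = 279;      // files that consumed the 19 hours
        double totalHours = 48;
        double slowHours = 19;

        double slowAvg = avgSecondsPerFile(slowHours * 3600, slowFiles);
        double otherAvg = avgSecondsPerFile(
                (totalHours - slowHours) * 3600, totalFiles - slowFiles);

        System.out.printf("slow files:  %.1f s/file%n", slowAvg);
        System.out.printf("other files: %.2f s/file%n", otherAvg);
        System.out.printf("slow files are %.2f%% of the files but %.1f%% of the time%n",
                100.0 * slowFiles / totalFiles, 100.0 * slowHours / totalHours);
    }
}
```

Under these assumptions the 279 slow files average roughly four minutes each, while the remaining files average a little over two seconds each, which agrees with the "up to a few seconds per file" quoted for normal files. These are averages only; the report states that individual files took far longer.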
