Posted to dev@pdfbox.apache.org by "Roger Håkansson (JIRA)" <ji...@apache.org> on 2012/05/09 16:59:50 UTC

[jira] [Created] (PDFBOX-1305) Text extraction takes huge amount of time on some files

Roger Håkansson created PDFBOX-1305:
---------------------------------------

             Summary: Text extraction takes huge amount of time on some files
                 Key: PDFBOX-1305
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1305
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.6.0
         Environment: Same phenomenon on Windows 7, Solaris 10 and CentOS 5.7. Same result with JDK 7u4 and JDK 6u32
            Reporter: Roger Håkansson


I have 1.2M single-page PDF files that I'm indexing with Solr (which uses Tika, which uses PDFBox), and some of them take between 20 minutes and an hour to index.
This is a huge problem for me: in 48 hours I've indexed about 45k files, and 19 hours of that time was spent on just 279 files.
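
To put the figures above in perspective, here is a small back-of-the-envelope sketch. It uses only the numbers quoted in the report (which are approximate), and the projection for the full set of files assumes the observed overall rate stays constant, which is an assumption and not something the report states.

```python
# Figures quoted in the report; "about 45k" is treated as exactly 45,000.
total_hours = 48
total_files = 45_000
slow_hours = 19
slow_files = 279
corpus_size = 1_200_000

# Average time per file in the slow group, in minutes.
slow_minutes_per_file = slow_hours * 60 / slow_files

# Average time per file for everything else, in seconds.
fast_seconds_per_file = (total_hours - slow_hours) * 3600 / (total_files - slow_files)

# Share of files versus share of wall-clock time taken by the slow group.
slow_file_share_pct = slow_files / total_files * 100
slow_time_share_pct = slow_hours / total_hours * 100

# Assumption: the overall rate observed so far holds for the whole corpus.
projected_days = corpus_size / total_files * total_hours / 24

print(f"slow files: {slow_minutes_per_file:.1f} min/file on average")
print(f"other files: {fast_seconds_per_file:.1f} s/file on average")
print(f"{slow_file_share_pct:.2f}% of files took {slow_time_share_pct:.1f}% of the time")
print(f"projected total at the same rate: {projected_days:.1f} days")
```

In other words, well under 1% of the files account for roughly 40% of the indexing time.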

I've traced it to PDFBox taking a lot of time extracting the text from the documents.

I've tested extracting the text using pdfbox-app's ExtractText with the same result: the text is extracted, but it takes forever...
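
For anyone who wants to reproduce this kind of measurement outside Solr, the following is a minimal sketch of a timing harness with a timeout. The helper name time_command is hypothetical, and the jar file name and output path in the commented invocation are assumptions; adjust them to the local PDFBox installation.

```python
import subprocess
import time


def time_command(cmd, timeout_s):
    """Run cmd and return (elapsed_seconds, timed_out).

    The child process is killed if it runs longer than timeout_s seconds.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        timed_out = True
    return time.monotonic() - start, timed_out


# Hypothetical invocation; the jar name and output file are assumptions:
# elapsed, timed_out = time_command(
#     ["java", "-jar", "pdfbox-app.jar", "ExtractText",
#      "20020101ab3x012a.pdf", "out.txt"],
#     timeout_s=60,
# )
```

Running this over a sample of files and logging the ones that hit the timeout is one way to collect the slow documents for a bug report.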

The attached file took about 23 minutes (using ExtractText), and in the result I can see a lot of "rubbish text" that I don't see in the text extracted from files that take a normal amount of time (up to a few seconds per file) to parse.

When running truss on Solaris (strace on Linux) against the Java process, I can see a lot of SEGV due to FLTBOUNDS. I don't know if it's related to this problem, but I want to mention it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (PDFBOX-1305) Text extraction takes huge amount of time on some files

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273931#comment-13273931 ] 

Michael McCandless commented on PDFBOX-1305:
--------------------------------------------

I just tested this on PDFBox's current trunk (to be 1.7.0) and ExtractText ran in ~9 seconds (on a recent Ivy Bridge machine)...

Could it be that you are seeing the slowness that was fixed in PDFBOX-956?
                


       

[jira] [Updated] (PDFBOX-1305) Text extraction takes huge amount of time on some files

Posted by "Roger Håkansson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roger Håkansson updated PDFBOX-1305:
------------------------------------

    Attachment: 20020101ab3x012a.pdf
    
