Posted to dev@pdfbox.apache.org by "Kevin Jackson (JIRA)" <ji...@apache.org> on 2011/02/05 03:37:31 UTC

[jira] Created: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Poor text extraction performance in PDFTextStripper.java
--------------------------------------------------------

                 Key: PDFBOX-956
                 URL: https://issues.apache.org/jira/browse/PDFBOX-956
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.4.0
            Reporter: Kevin Jackson
             Fix For: 1.5.0


The worst-case performance of the suppressDuplicateOverlappingText logic in processTextPosition is O(n^2).
The patch uses a TreeMap to achieve O(n log n) performance.
The example PDF took over 2 hours to extract the text before this patch and less than 10 minutes after.

BTW:  The extracted text is also quite different compared to Adobe Reader.  Not sure which is correct but for this document it doesn't matter.
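
For readers who have not opened the attached patch, the TreeMap idea described above can be sketched roughly as follows. This is only an illustration under assumptions: the class name DuplicateGlyphIndex, the method isDuplicate, and the single tolerance parameter are hypothetical and are not taken from the patch or from PDFBox. The sketch assumes a duplicate is the same glyph string seen again within a small distance in both x and y.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper (not a PDFBox class): remembers where each glyph string was seen. */
final class DuplicateGlyphIndex {
    // glyph text -> (x position -> sorted set of y positions seen at that x)
    private final Map<String, TreeMap<Float, TreeSet<Float>>> seen =
            new HashMap<String, TreeMap<Float, TreeSet<Float>>>();

    /** Returns true if the same glyph was already recorded within the tolerance; otherwise records it. */
    boolean isDuplicate(String glyph, float x, float y, float tolerance) {
        TreeMap<Float, TreeSet<Float>> byX = seen.get(glyph);
        if (byX == null) {
            byX = new TreeMap<Float, TreeSet<Float>>();
            seen.put(glyph, byX);
        }
        // Range view: only entries with x in [x - tolerance, x + tolerance) are visited,
        // instead of scanning every earlier occurrence of the glyph.
        SortedMap<Float, TreeSet<Float>> xMatches = byX.subMap(x - tolerance, x + tolerance);
        for (TreeSet<Float> ys : xMatches.values()) {
            SortedSet<Float> yMatches = ys.subSet(y - tolerance, y + tolerance);
            if (!yMatches.isEmpty()) {
                return true;
            }
        }
        // Not a duplicate: remember this position for later lookups.
        TreeSet<Float> ysAtX = byX.get(x);
        if (ysAtX == null) {
            ysAtX = new TreeSet<Float>();
            byX.put(x, ysAtX);
        }
        ysAtX.add(y);
        return false;
    }
}
```

Each lookup costs a logarithmic search plus a visit to the x entries that fall inside the tolerance window, so the total is close to O(n log n) as long as few glyphs share nearly the same x position.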


-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: [jira] Commented: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by Ad...@swmc.com.
While I agree, and it's true that Java 1.5 has already passed its end of 
life, there are still a lot of people who are using it.  For example, the 
company I work for hasn't upgraded to 1.6 yet.

---- 
Thanks,
Adam






Re: [jira] Commented: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by Ted Dunning <te...@gmail.com>.
Of course, Java 1.6 is the current version of Java.

How many people actually need to use 1.5?

On Wed, Feb 16, 2011 at 12:04 PM, Lars Torunski (JIRA) <ji...@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995491#comment-12995491]
>
> Lars Torunski commented on PDFBOX-956:
> --------------------------------------
>
> Using NavigableMap in the patch will introduce a dependency on Java 6.

[jira] Updated: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-956:
--------------------------------------

    Issue Type: Improvement  (was: Bug)


[jira] Updated: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Kevin Jackson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Jackson updated PDFBOX-956:
---------------------------------

    Attachment:     (was: 52b22bd6_69.pdf)


[jira] Updated: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Kevin Jackson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Jackson updated PDFBOX-956:
---------------------------------

    Attachment: c4ce2fcd_69.pdf


[jira] Issue Comment Edited: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995708#comment-12995708 ] 

Andreas Lehmkühler edited comment on PDFBOX-956 at 2/17/11 7:54 AM:
--------------------------------------------------------------------

The minimum requirement for PDFBox is Java 1.5, and I guess that won't change in the near future.

As I introduced the issue, I'll take care of this.

      was (Author: lehmi):
    The minimum requirement for PDFBox is Java 1.5, and I guess that won't change in the near future.

I'll take care of this.
  

[jira] [Commented] (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Andreas Lehmkühler (Commented JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146818#comment-13146818 ] 

Andreas Lehmkühler commented on PDFBOX-956:
-------------------------------------------

I improved the text extraction performance in revision 1199634. My changes are based on Kevin's patch; I simply replaced the NavigableMap/Set with a TreeMap/Set. The results seem to be similar, and the performance is much better than before.

Thanks for the contribution!
                

[jira] Resolved: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-956.
---------------------------------------

    Resolution: Fixed
      Assignee: Andreas Lehmkühler

I added the patch in revision 1070125 as proposed.

Thanks for the contribution!!


[jira] Commented: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Lars Torunski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995491#comment-12995491 ] 

Lars Torunski commented on PDFBOX-956:
--------------------------------------

Using NavigableMap in the patch will introduce a dependency on Java 6.
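
As background for this comment: TreeMap and the SortedMap interface are available in Java 5, while NavigableMap and methods such as floorKey and ceilingKey were added in Java 6. The sketch below shows one way such lookups could be emulated with SortedMap views only. Which NavigableMap methods the patch actually calls is not stated in this thread, so the class name Java5MapQueries and the choice of floorKey/ceilingKey are assumptions made purely for illustration.

```java
import java.util.SortedMap;

/** Hypothetical helper: Java 5-compatible stand-ins for two NavigableMap lookups. */
final class Java5MapQueries {
    private Java5MapQueries() {
    }

    /** Greatest key less than or equal to target, or null if there is none. */
    static <K, V> K floorKey(SortedMap<K, V> map, K target) {
        if (map.containsKey(target)) {
            return target;
        }
        // headMap returns the keys strictly less than target.
        SortedMap<K, V> head = map.headMap(target);
        return head.isEmpty() ? null : head.lastKey();
    }

    /** Least key greater than or equal to target, or null if there is none. */
    static <K, V> K ceilingKey(SortedMap<K, V> map, K target) {
        // tailMap returns the keys greater than or equal to target.
        SortedMap<K, V> tail = map.tailMap(target);
        return tail.isEmpty() ? null : tail.firstKey();
    }
}
```

Range queries can often be written the same way with subMap(from, to), which is also part of the Java 5 SortedMap interface.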


[jira] Commented: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Kevin Jackson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993338#comment-12993338 ] 

Kevin Jackson commented on PDFBOX-956:
--------------------------------------

I replaced the original sample PDF file with one that did not contain JavaScript.
Yes, Adobe Reader ALSO has performance problems with this evil file.
This fix addresses the performance problem in PDFBox.



[jira] Updated: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Kevin Jackson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Jackson updated PDFBOX-956:
---------------------------------

    Attachment: 52b22bd6_69.pdf
                PDFTextStripper.java.patch


[jira] Updated: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-956:
--------------------------------------

    Attachment: PDFBOX956-c4ce2fcd_69.txt


[jira] Commented: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12997177#comment-12997177 ] 

Andreas Lehmkühler commented on PDFBOX-956:
-------------------------------------------

I reverted the changes in revision 1072665. 

I'm working on a solution to solve this without using a NavigableMap, but I still have to fight some missing/additional spaces ....


[jira] [Commented] (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141678#comment-13141678 ] 

Michael McCandless commented on PDFBOX-956:
-------------------------------------------

I'm also hitting this performance problem... it's quite severe: on my
test case (~550 various PDFs), with
setSuppressDuplicateOverlappingText on it takes 73.6 sec and with it
off it's 24.031 sec: 3X slower.

Looking at the code... I think we need some sort of spatial data
structure here (rtree, k-d tree, quadtree, or something?), to
efficiently query for overlapping rectangles for the new incoming
character.
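
To make the spatial-index idea concrete, here is a minimal sketch that uses a uniform grid (a spatial hash) rather than an r-tree, k-d tree, or quadtree, simply because it is the shortest structure to show. Everything in it is hypothetical: the class GlyphGrid, the method overlapsOrAdd, and the fixed cell size are not part of PDFBox, and the sketch ignores which character a box belongs to.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical uniform-grid index over glyph bounding boxes; not a PDFBox class. */
final class GlyphGrid {
    private final float cellSize;
    // packed (column, row) -> boxes touching that cell, each stored as {x, y, width, height}
    private final Map<Long, List<float[]>> cells = new HashMap<Long, List<float[]>>();

    GlyphGrid(float cellSize) {
        this.cellSize = cellSize;
    }

    private static long key(int col, int row) {
        return (((long) col) << 32) | (row & 0xffffffffL);
    }

    private int cellOf(float v) {
        return (int) Math.floor(v / cellSize);
    }

    /** Returns true if the box overlaps a stored box; otherwise stores it and returns false. */
    boolean overlapsOrAdd(float x, float y, float w, float h) {
        int c0 = cellOf(x), c1 = cellOf(x + w);
        int r0 = cellOf(y), r1 = cellOf(y + h);
        // Only boxes registered in the cells this box touches can overlap it.
        for (int c = c0; c <= c1; c++) {
            for (int r = r0; r <= r1; r++) {
                List<float[]> bucket = cells.get(key(c, r));
                if (bucket == null) {
                    continue;
                }
                for (float[] b : bucket) {
                    if (x < b[0] + b[2] && b[0] < x + w && y < b[1] + b[3] && b[1] < y + h) {
                        return true;
                    }
                }
            }
        }
        // No overlap found: register the box in every cell it touches.
        float[] box = {x, y, w, h};
        for (int c = c0; c <= c1; c++) {
            for (int r = r0; r <= r1; r++) {
                Long k = key(c, r);
                List<float[]> bucket = cells.get(k);
                if (bucket == null) {
                    bucket = new ArrayList<float[]>();
                    cells.put(k, bucket);
                }
                bucket.add(box);
            }
        }
        return false;
    }
}
```

The cost of a query depends on how many boxes share a cell, so the cell size would need tuning against real glyph sizes; a tree-based index avoids that tuning at the price of more code.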

But, even once we switch to a more efficient data structure... maybe
we could add some simple heuristics to restrict when we search for
dups.  For example, if the text is only ever "moving forward" (ie,
right to left or left to right, and "downwards", so that each glyph is
placed into a previously unused space) then we can know nothing can
overlap.  On seeing a glyph "move backwards" (or up) we could
turn on dup removal until it catches up to the unused space again...
I think this would mean most characters don't need to be further
checked.
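
A rough sketch of this heuristic, under simplifying assumptions: text runs left to right, y grows downwards, glyph height is ignored, and baselines on the same line compare exactly equal (real code would need a tolerance). The class ForwardOnlyTracker and its method are hypothetical names, not PDFBox API.

```java
/** Hypothetical tracker for the "only moving forward" heuristic; not a PDFBox class. */
final class ForwardOnlyTracker {
    // Lowest line reached so far and the right edge of the text placed on that line.
    private float frontierY = Float.NEGATIVE_INFINITY;
    private float frontierX = Float.NEGATIVE_INFINITY;

    /**
     * Returns true when the glyph lands in space that was already passed,
     * meaning the full duplicate check is still needed for it.
     */
    boolean needsDuplicateCheck(float x, float y, float width) {
        if (y > frontierY) {
            // A new, lower line: previously unused space.
            frontierY = y;
            frontierX = x + width;
            return false;
        }
        if (y == frontierY && x >= frontierX) {
            // Same line, to the right of everything placed so far.
            frontierX = x + width;
            return false;
        }
        // The glyph moved backwards or upwards; keep checking until it passes the frontier again.
        return true;
    }
}
```

With this in place the expensive lookup would only run for glyphs where needsDuplicateCheck returns true, which for ordinary documents should be a small minority.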
                

[jira] [Resolved] (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Andreas Lehmkühler (Resolved JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-956.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.7.0
    

[jira] Updated: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Stefan Magnus Landrø (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefan Magnus Landrø updated PDFBOX-956:
----------------------------------------

    Attachment: PDFTextStripper.pdf

PDF which is extremely slow to extract text from


[jira] Commented: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993966#comment-12993966 ] 

Andreas Lehmkühler commented on PDFBOX-956:
-------------------------------------------

The provided PDF contains a lot of crap. There is a section with roughly 21000 lines like the following:

(!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!)Tj

That leads to roughly 632000 "!" characters in the text (see the attached extraction result). That text is invisible because of its size, but it triggers the suppress-duplicates algorithm of PDFBox, which doesn't perform well.


[jira] Reopened: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler reopened PDFBOX-956:
---------------------------------------


The minimum requirement for PDFBox is Java 1.5, and I guess that won't change in the near future.

I'll take care of this.

> Poor text extraction performance in PDFTextStripper.java
> --------------------------------------------------------
>
>                 Key: PDFBOX-956
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-956
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Kevin Jackson
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.5.0
>
>         Attachments: PDFBOX956-c4ce2fcd_69.txt, PDFTextStripper.java.patch, c4ce2fcd_69.pdf


[jira] Commented: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Lars Torunski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999263#comment-12999263 ] 

Lars Torunski commented on PDFBOX-956:
--------------------------------------

Kevin, is it possible to realise O(N log N) performance with Java 5 without new library dependencies like backport-util-concurrent?
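
For illustration only, and not taken from the attached patch: a minimal sketch of how a sorted-map lookup could be written with nothing beyond java.util on Java 5. SortedMap.subMap and SortedSet.subSet have been in the Collections Framework since Java 1.2, so a range query does not need Java 6 additions such as NavigableMap. The class name OverlapIndex, its method, and the tolerance values are hypothetical and do not come from PDFBox.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of PDFBox: remembers where each character was
// drawn so that a later glyph at (almost) the same spot can be treated as a duplicate.
class OverlapIndex
{
    // character -> (x coordinate -> sorted y coordinates seen at that x)
    private final Map<String, TreeMap<Float, TreeSet<Float>>> seen =
            new HashMap<String, TreeMap<Float, TreeSet<Float>>>();

    // Returns true if the same character was already recorded within `tolerance`
    // of (x, y). Otherwise records the position and returns false.
    // Assumes tolerance > 0.
    boolean isDuplicate(String ch, float x, float y, float tolerance)
    {
        TreeMap<Float, TreeSet<Float>> byX = seen.get(ch);
        if (byX == null)
        {
            byX = new TreeMap<Float, TreeSet<Float>>();
            seen.put(ch, byX);
        }

        // subMap(from, to) is a half-open view [from, to); locating it is O(log n).
        SortedMap<Float, TreeSet<Float>> nearX = byX.subMap(x - tolerance, x + tolerance);
        for (TreeSet<Float> candidates : nearX.values())
        {
            // subSet(from, to) is likewise a half-open view on the sorted y values.
            if (!candidates.subSet(y - tolerance, y + tolerance).isEmpty())
            {
                return true;
            }
        }

        // No nearby occurrence of this character: remember the position.
        TreeSet<Float> column = byX.get(x);
        if (column == null)
        {
            column = new TreeSet<Float>();
            byX.put(x, column);
        }
        column.add(y);
        return false;
    }

    public static void main(String[] args)
    {
        OverlapIndex index = new OverlapIndex();
        System.out.println(index.isDuplicate("!", 10.0f, 20.0f, 0.5f));
        System.out.println(index.isDuplicate("!", 10.2f, 20.1f, 0.5f));
        System.out.println(index.isDuplicate("!", 15.0f, 20.0f, 0.5f));
    }
}
```

Each lookup costs O(log n) to find the window plus the number of distinct x values inside it, so N characters stay close to O(N log N) only while the tolerance window holds few entries. That is an assumption about typical input, not a worst-case guarantee.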

> Poor text extraction performance in PDFTextStripper.java
> --------------------------------------------------------
>
>                 Key: PDFBOX-956
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-956
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Kevin Jackson
>            Assignee: Andreas Lehmkühler
>         Attachments: PDFBOX956-c4ce2fcd_69.txt, PDFTextStripper.java.patch, c4ce2fcd_69.pdf


[jira] Commented: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Stefan Magnus Landrø (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006424#comment-13006424 ] 

Stefan Magnus Landrø commented on PDFBOX-956:
---------------------------------------------

Added a PDF that takes a long time to extract text from. I also tested with 1.5; it seems something changed after 1.2.1.

> Poor text extraction performance in PDFTextStripper.java
> --------------------------------------------------------
>
>                 Key: PDFBOX-956
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-956
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Kevin Jackson
>            Assignee: Andreas Lehmkühler
>         Attachments: PDFBOX956-c4ce2fcd_69.txt, PDFTextStripper.java.patch, PDFTextStripper.pdf, c4ce2fcd_69.pdf


[jira] Commented: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991408#comment-12991408 ] 

Mel Martinez commented on PDFBOX-956:
-------------------------------------

Kevin,

I like the look of this optimization, though I haven't run it yet.

One comment:  The 52b22bd6.pdf file that you attached as a test case seems a bit problematic.  It apparently makes use of JavaScript.   For security reasons I normally run with JavaScript disabled for PDFs.   When I tried to open this file with Adobe Reader from Firefox, it properly asked whether to allow the JavaScript to run, but before I could answer, Adobe Reader hung for about 90 seconds and then took down both the Reader (v9.4.0) AND Firefox (v3.6.13).  This is on a 64-bit Win XP Pro machine (latest patches as of a couple of days ago).

After restarting everything I tried once more (thinking maybe Firefox had gotten into a cluttered state), but it did it again right off the bat.

I'm not sure what happened at a micro level to cause that but I wonder if the file has some sort of corruption?


> Poor text extraction performance in PDFTextStripper.java
> --------------------------------------------------------
>
>                 Key: PDFBOX-956
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-956
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Kevin Jackson
>             Fix For: 1.5.0
>
>         Attachments: 52b22bd6_69.pdf, PDFTextStripper.java.patch


[jira] Updated: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-956:
--------------------------------------

    Fix Version/s:     (was: 1.5.0)

> Poor text extraction performance in PDFTextStripper.java
> --------------------------------------------------------
>
>                 Key: PDFBOX-956
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-956
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Kevin Jackson
>            Assignee: Andreas Lehmkühler
>         Attachments: PDFBOX956-c4ce2fcd_69.txt, PDFTextStripper.java.patch, c4ce2fcd_69.pdf
