Posted to dev@pdfbox.apache.org by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org> on 2011/11/01 23:15:32 UTC

[jira] [Commented] (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

    [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141678#comment-13141678 ] 

Michael McCandless commented on PDFBOX-956:
-------------------------------------------

I'm also hitting this performance problem, and it's quite severe: on
my test case (~550 assorted PDFs), text extraction takes 73.6 sec with
setSuppressDuplicateOverlappingText on and 24.031 sec with it off,
i.e. roughly 3X slower.

Looking at the code... I think we need some sort of spatial data
structure here (rtree, k-d tree, quadtree, or something?), to
efficiently query for overlapping rectangles for the new incoming
character.
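
To make the idea concrete, here is a rough sketch of the simplest
spatial structure I can think of, a uniform grid, standing in for the
rtree / k-d tree / quadtree options above.  It is not PDFBox code:
GlyphGrid, Box and the cell size are made-up names and parameters, and
it only tests geometric overlap (the real check in PDFTextStripper may
also compare the characters themselves).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a uniform-grid index over glyph bounding boxes.
// Not PDFBox code; GlyphGrid and Box are illustrative names only.
class GlyphGrid {
    // Axis-aligned bounding box; (x, y) is the corner with the smallest coordinates.
    static final class Box {
        final float x, y, w, h;

        Box(float x, float y, float w, float h) {
            this.x = x; this.y = y; this.w = w; this.h = h;
        }

        // Strict overlap: boxes that merely share an edge do not count.
        boolean overlaps(Box o) {
            return x < o.x + o.w && o.x < x + w && y < o.y + o.h && o.y < y + h;
        }
    }

    private final float cellSize;
    private final Map<Long, List<Box>> cells = new HashMap<>();

    GlyphGrid(float cellSize) {
        this.cellSize = cellSize;
    }

    // Pack the two cell coordinates into a single map key.
    private static long key(int cx, int cy) {
        return ((long) cx << 32) | (cy & 0xffffffffL);
    }

    private int cell(float v) {
        return (int) Math.floor(v / cellSize);
    }

    // Register a box in every grid cell it touches.
    void add(Box b) {
        for (int cx = cell(b.x); cx <= cell(b.x + b.w); cx++) {
            for (int cy = cell(b.y); cy <= cell(b.y + b.h); cy++) {
                cells.computeIfAbsent(key(cx, cy), k -> new ArrayList<>()).add(b);
            }
        }
    }

    // True if any previously added box overlaps b; only the cells b touches are scanned.
    boolean overlapsAny(Box b) {
        for (int cx = cell(b.x); cx <= cell(b.x + b.w); cx++) {
            for (int cy = cell(b.y); cy <= cell(b.y + b.h); cy++) {
                List<Box> bucket = cells.get(key(cx, cy));
                if (bucket == null) {
                    continue;
                }
                for (Box o : bucket) {
                    if (b.overlaps(o)) {
                        return true;
                    }
                }
            }
        }
        return false;
    }
}
```

With a cell size near the typical glyph size, each query only looks at
a handful of neighbours instead of every glyph seen so far on the page.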

But, even once we switch to a more efficient data structure... maybe
we could add some simple heuristics to restrict when we search for
dups.  For example, if the text is only ever "moving forward" (i.e.,
right to left or left to right, and "downwards", so that each glyph is
placed into a previously unused space) then we know nothing can
overlap.  On seeing a glyph "move backwards" (or up), we could turn
on dup removal until the text catches up to the unused space again...
I think this would mean most characters don't need to be checked any
further.
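
A rough sketch of that heuristic, just to show the bookkeeping.  It
assumes left-to-right text and a y axis that grows downward, and
ForwardProgressTracker / needsDuplicateCheck are hypothetical names,
not anything in PDFBox:

```java
// Hypothetical sketch of the "only check when text moves backwards" idea.
// Assumes left-to-right text and a y axis that grows downward; not PDFBox code.
class ForwardProgressTracker {
    private float lineTop = Float.NEGATIVE_INFINITY;    // top of the current line
    private float rightEdge = Float.NEGATIVE_INFINITY;  // rightmost x used on or after this line started
    private float bottomEdge = Float.NEGATIVE_INFINITY; // lowest y used by any glyph so far

    // Returns true if the glyph may overlap an earlier glyph and so needs the
    // full duplicate check; false if it lands in space nothing has used yet.
    boolean needsDuplicateCheck(float left, float top, float right, float bottom) {
        if (top >= bottomEdge) {
            // Starts below everything seen so far: a fresh line.
            lineTop = top;
            rightEdge = right;
            bottomEdge = bottom;
            return false;
        }
        if (top >= lineTop && left >= rightEdge) {
            // Continues the current line to the right of everything on it.
            rightEdge = right;
            bottomEdge = Math.max(bottomEdge, bottom);
            return false;
        }
        // Moved backwards or up: keep the frontier conservative and ask for a check.
        rightEdge = Math.max(rightEdge, right);
        bottomEdge = Math.max(bottomEdge, bottom);
        return true;
    }
}
```

The caller would run the expensive overlap search only when this
returns true.  The tracker errs on the side of checking: once a glyph
moves backwards, it keeps answering true until a glyph lands to the
right of, or below, everything recorded so far.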
                
> Poor text extraction performance in PDFTextStripper.java
> --------------------------------------------------------
>
>                 Key: PDFBOX-956
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-956
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Kevin Jackson
>            Assignee: Andreas Lehmkühler
>         Attachments: PDFBOX956-c4ce2fcd_69.txt, PDFTextStripper.java.patch, PDFTextStripper.pdf, c4ce2fcd_69.pdf
>
>
> The worst case performance of the suppressDuplicateOverlappingText logic in processTextPosition is O(n^2).
> The patch is to use a TreeMap to achieve O(n log n) performance.
> The example PDF took over 2 hours to extract the text before this patch and less than 10 minutes after.
> BTW:  The extracted text is also quite different compared to Adobe Reader.  Not sure which is correct but for this document it doesn't matter.
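
For readers who have not opened the attachment, a sorted-map lookup of
the kind described above could look roughly like the sketch below.
This is not the attached patch: SortedDuplicateFilter, Entry and the
single tolerance parameter are hypothetical, and the real code may key
and compare glyphs differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch, not the attached patch: index glyphs by x in a TreeMap
// so a duplicate lookup only visits entries whose x is within `tolerance`.
class SortedDuplicateFilter {
    private static final class Entry {
        final String text;
        final float y;

        Entry(String text, float y) {
            this.text = text;
            this.y = y;
        }
    }

    private final float tolerance;
    // x coordinate -> glyphs recorded at exactly that x
    private final TreeMap<Float, List<Entry>> byX = new TreeMap<>();

    SortedDuplicateFilter(float tolerance) {
        this.tolerance = tolerance;
    }

    // Returns true if the same text was already recorded within `tolerance`
    // in both x and y. A glyph that is not a duplicate is recorded for later lookups.
    boolean isDuplicate(String text, float x, float y) {
        // subMap is a range view, so only keys in [x - tolerance, x + tolerance] are visited.
        for (List<Entry> bucket : byX.subMap(x - tolerance, true, x + tolerance, true).values()) {
            for (Entry e : bucket) {
                if (e.text.equals(text) && Math.abs(e.y - y) <= tolerance) {
                    return true;
                }
            }
        }
        byX.computeIfAbsent(x, k -> new ArrayList<>()).add(new Entry(text, y));
        return false;
    }
}
```

Locating the range costs O(log n), but every glyph inside the x window
is still scanned, so a page with many glyphs stacked at nearly the same
x can still degrade toward quadratic time.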
