Posted to dev@pdfbox.apache.org by "Mel Martinez (JIRA)" <ji...@apache.org> on 2010/01/14 21:39:54 UTC

[jira] Created: (PDFBOX-600) PDFBox performance issue: PDFTextStripper performance tweak

PDFBox performance issue:  PDFTextStripper performance tweak
------------------------------------------------------------

                 Key: PDFBOX-600
                 URL: https://issues.apache.org/jira/browse/PDFBOX-600
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction
    Affects Versions: 0.8.0-incubator
         Environment: All
            Reporter: Mel Martinez


During text extraction, the PDFTextStripper needs to compare the proximity of text positions in order to determine whether text elements overlap vertically or horizontally.

As part of this, the PDFTextStripper.within(float first, float second, float variance) method is used.

The current (0.8.0) version of this method uses the following test:

    second > first - variance && second < first + variance

This is accurate, but in my test documents it is slower than the same test with the operand order flipped:

    second < first + variance && second > first - variance

This is because && short-circuits, and on left-to-right text the flipped form fails on its first comparison more often.  I believe left-to-right text should be treated as the default case.

Please change the PDFTextStripper.within() method to use the second form of the test, i.e. to:


    private boolean within( float first, float second, float variance )
    {
        return second < first + variance && second > first - variance;
    }
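
To make the short-circuit argument concrete, here is a minimal sketch that counts how many comparisons each operand order evaluates. The class name WithinOrderDemo, its helper methods, and the sample coordinates are all hypothetical and are not part of PDFBox; the sample data simply assumes that, on left-to-right text, `second` usually lies well to the right of `first`.

```java
public class WithinOrderDemo {

    // Form used in 0.8.0: lower bound is checked first.
    static boolean withinLowerFirst(float first, float second, float variance) {
        return second > first - variance && second < first + variance;
    }

    // Proposed form: upper bound is checked first.
    static boolean withinUpperFirst(float first, float second, float variance) {
        return second < first + variance && second > first - variance;
    }

    // Number of comparisons && evaluates when the lower bound is checked first:
    // 1 if the first operand is false (short-circuit), otherwise 2.
    static int comparisonsLowerFirst(float first, float second, float variance) {
        return (second > first - variance) ? 2 : 1;
    }

    // Same count for the proposed order, where the upper bound is checked first.
    static int comparisonsUpperFirst(float first, float second, float variance) {
        return (second < first + variance) ? 2 : 1;
    }

    public static void main(String[] args) {
        // Hypothetical x coordinates: most candidates lie far to the right of
        // `first`, and one lies within the variance.
        float first = 10f;
        float variance = 0.5f;
        float[] candidates = {50f, 100f, 150f, 10.2f};

        int lowerFirstTotal = 0;
        int upperFirstTotal = 0;
        for (float second : candidates) {
            // Both orders must always agree on the result.
            if (withinLowerFirst(first, second, variance)
                    != withinUpperFirst(first, second, variance)) {
                throw new IllegalStateException("orders disagree for " + second);
            }
            lowerFirstTotal += comparisonsLowerFirst(first, second, variance);
            upperFirstTotal += comparisonsUpperFirst(first, second, variance);
        }
        System.out.println("lower bound first: " + lowerFirstTotal + " comparisons");
        System.out.println("upper bound first: " + upperFirstTotal + " comparisons");
    }
}
```

Both orders return the same boolean for every input; only the number of comparisons differs. The saving depends entirely on how often `second` exceeds `first + variance` in real documents, which is the behaviour I observed in my test documents but which this sketch does not measure.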

Thanks!



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (PDFBOX-600) PDFBox performance issue: PDFTextStripper performance tweak

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved PDFBOX-600.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0
         Assignee: Jukka Zitting

Simple yet effective, nice! Committed in revision 899474.


[jira] Updated: (PDFBOX-600) PDFBox performance issue: PDFTextStripper performance tweak

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mel Martinez updated PDFBOX-600:
--------------------------------

    Attachment: PDFTextStripper.java

The attached file flips the order of the operands in the within() method's conditional expression to speed up the test on left-to-right text.

