Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2009/04/29 09:57:30 UTC

[jira] Resolved: (PDFBOX-456) PDFTextStripperByArea never finds any text (pageNo check in PDFTextStripper always returns false)

     [ https://issues.apache.org/jira/browse/PDFBOX-456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-456.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.8.0-incubator

I've applied the patch. It works fine as of SVN revision 769696.

Thanks to Hannes for the provided patch.

> PDFTextStripperByArea never finds any text (pageNo check in PDFTextStripper always returns false)
> -------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-456
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-456
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>         Environment: 0.8.0-incubator as well as checkout from SVN (rev#767932).
> Not affected: latest sf.net release (0.7.3)
>            Reporter: Hannes Erven
>             Fix For: 0.8.0-incubator
>
>         Attachments: PDFBOX-456-patch-he.diff
>
>
> PDFTextStripperByArea does not return any text from pages.
> This is due to a check on the first line of PDFTextStripper#processPage() that compares currentPageNo (initially 0) against startPage (initially 1). Since PDFTextStripperByArea sets neither startPage nor the current page number, this comparison is always false and no text is extracted.
> A possible fix is to include the following code in PDFTextStripperByArea#extractRegions right before the call to processPage():
> setStartPage(0);
> setEndPage(0);
> Since I'm not very familiar with the PDFBox internals, this might be more of a hack than a solid fix.
> The issue was introduced in PDFTextStripper 1.70 (old SF.net CVS), where the currentPage++ was removed from just before the check in processPage().
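
To illustrate the page-range check described in the report, here is a minimal, self-contained sketch. It does not use PDFBox itself: the class PageRangeModel and the method shouldProcessCurrentPage() are hypothetical, and the exact condition (current page within [startPage, endPage]) and the default end page are assumptions, since the report only says that currentPageNo (0) is compared against startPage (1). Only the names setStartPage, setEndPage, currentPageNo and startPage come from the report.

```java
// Hypothetical model of the page check described in PDFBOX-456.
// This is NOT PDFBox code; it only mirrors the values named in the report.
class PageRangeModel {
    private int currentPageNo = 0;            // initial value per the report
    private int startPage = 1;                // initial value per the report
    private int endPage = Integer.MAX_VALUE;  // assumed default, not stated in the report

    void setStartPage(int startPage) { this.startPage = startPage; }
    void setEndPage(int endPage) { this.endPage = endPage; }

    // Assumed form of the check at the top of processPage():
    // the page is processed only if it lies inside [startPage, endPage].
    boolean shouldProcessCurrentPage() {
        return currentPageNo >= startPage && currentPageNo <= endPage;
    }

    public static void main(String[] args) {
        PageRangeModel stripper = new PageRangeModel();

        // Nothing increments currentPageNo, so 0 >= 1 fails and the page is skipped.
        System.out.println(stripper.shouldProcessCurrentPage()); // prints "false"

        // The workaround proposed in the report: make the range match page 0.
        stripper.setStartPage(0);
        stripper.setEndPage(0);
        System.out.println(stripper.shouldProcessCurrentPage()); // prints "true"
    }
}
```

Under these assumptions the workaround succeeds only because it moves the accepted range onto the never-incremented counter, which fits the reporter's remark that it may be more of a hack than a solid fix.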
