Posted to dev@pdfbox.apache.org by "Mahesh (JIRA)" <ji...@apache.org> on 2010/07/20 18:53:50 UTC

[jira] Created: (PDFBOX-781) PDFBOX 1.2.1 Text parsing issue

PDFBOX 1.2.1 Text parsing issue
-------------------------------

                 Key: PDFBOX-781
                 URL: https://issues.apache.org/jira/browse/PDFBOX-781
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.2.1
         Environment: Windows XP, Java
            Reporter: Mahesh
            Priority: Trivial


Hi,

Thanks a lot for PDFBOX.
I have been using PDFBox 1.2.1 for text parsing. I have customized my text parsing class by extending the PDFTextStripper class.
The issue is: though I am able to get all the required string data (such as x/y position, width, height, font name, font size), the text extracted using the TextPosition object's getCharacter() contains the full text line except for the last character. This last character appears as the text of the next line.

        Ex: (Line in PDF): "My name is Mahesh"
            (Parsed data): "My name is Mahes"
                           "h"
Please help me in this regard.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PDFBOX-781) PDFBOX 1.2.1 Text parsing issue

Posted by "Larry West (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906205#action_12906205 ] 

Larry West commented on PDFBOX-781:
-----------------------------------

I'm by no means an expert, but if you use the sample app included with PDFBox, "PrintTextLocations" (see the cookbook, or just run it), you'll see that many, perhaps most, text fields are split up this way.  Or just look at the raw PDF content stream.

I'd guess this is an artifact of the way the creating program generated the file.   I.e., I don't think this is a bug.

In any case, using PDFTextStripperByArea seems to be a workable solution (at least for me).   You may also want to call setSortByPosition(true) and setSuppressDuplicateOverlappingText(true) on the stripper.
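
To illustrate why sorting by position can help when the last character arrives as its own chunk, here is a minimal, self-contained sketch of the general idea: group chunks whose vertical positions are close, then order each group left to right. This is not PDFBox code and not how PDFTextStripper is implemented. Fragment, LineGrouper, groupIntoLines and the yTolerance parameter are hypothetical names made up for this example, and the coordinates are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a positioned text chunk.
// This is NOT PDFBox's TextPosition class.
final class Fragment {
    final String text;
    final double x;
    final double y;

    Fragment(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class LineGrouper {
    // Groups fragments into lines: fragments whose y lies within yTolerance
    // of the first fragment of a line are treated as part of that line.
    // Assumes y grows downward (top of the page first).
    static List<String> groupIntoLines(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> lines = new ArrayList<>();
        double lineY = 0.0;
        for (Fragment f : sorted) {
            if (lines.isEmpty() || Math.abs(f.y - lineY) > yTolerance) {
                lines.add(new ArrayList<>()); // start a new line
                lineY = f.y;
            }
            lines.get(lines.size() - 1).add(f);
        }

        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            // Within one line, order the chunks left to right.
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented coordinates: the trailing "h" is a separate chunk
        // sitting at almost the same height as the rest of the line.
        List<Fragment> chunks = Arrays.asList(
                new Fragment("My name is Mahes", 10.0, 100.0),
                new Fragment("h", 95.0, 100.2),
                new Fragment("Second line", 10.0, 115.0));
        for (String line : groupIntoLines(chunks, 1.0)) {
            System.out.println(line);
        }
    }
}
```

With a tolerance of 1.0, the separate "h" chunk is merged back into the line it sits on. A real PDF would need more care (rotated text, varying font sizes, inserting spaces between chunks), so treat this only as a picture of the idea.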

> PDFBOX 1.2.1 Text parsing issue
> -------------------------------
>
>                 Key: PDFBOX-781
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-781
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.1
>         Environment: Windows XP, Java
>            Reporter: Mahesh
>            Priority: Trivial
>   Original Estimate: 24h
>  Remaining Estimate: 24h
