Posted to dev@pdfbox.apache.org by "Mahesh (JIRA)" <ji...@apache.org> on 2010/07/20 18:53:50 UTC
[jira] Created: (PDFBOX-781) PDFBOX 1.2.1 Text parsing issue
PDFBOX 1.2.1 Text parsing issue
-------------------------------
Key: PDFBOX-781
URL: https://issues.apache.org/jira/browse/PDFBOX-781
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.2.1
Environment: Windows XP, Java
Reporter: Mahesh
Priority: Trivial
Hi,
Thanks a lot for PDFBox.
I have been using PDFBox 1.2.1 for text parsing. I have customized my text parsing class by extending the PDFTextStripper class.
The issue is: although I am able to get all the required data (such as x/y position, width, height, font name, font size), the text extracted using the TextPosition object's getCharacter() contains the full line except for the last character. This last character appears as the text of the next line.
Example: (Line in PDF): "My name is Mahesh"
(Parsed data): "My name is Mahes"
"h"
Please help me in this regard.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PDFBOX-781) PDFBOX 1.2.1 Text parsing issue
Posted by "Larry West (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906205#action_12906205 ]
Larry West commented on PDFBOX-781:
-----------------------------------
I'm by no means an expert, but if you use the sample app included with PDFBox, "PrintTextLocations" (see the cookbook, or just run it), you'll see that many, perhaps most, text fields are split up this way. Or just look at the raw PDF content stream.
I'd guess this is an artifact of the way the creating program generated the file. I.e., I don't think this is a bug.
In any case, using PDFTextStripperByArea seems to be a workable solution (at least for me). You may want to setSortByPosition(true) and setSuppressDuplicateOverlappingText(true).
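To illustrate why sorting by position helps when a line arrives as several fragments, here is a minimal sketch in plain Java. It does not use PDFBox and is not PDFBox's actual algorithm: the FragmentMerger class, the Fragment type, and the yTolerance parameter are all hypothetical names made up for this example. It assumes each fragment carries the x/y position of its left edge, and that y grows downward.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper for illustration only; not part of PDFBox. */
class FragmentMerger {

    /** One run of text plus the position of its left edge (hypothetical type). */
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups fragments whose y differs from the first fragment of the current
     * line by at most yTolerance, then joins each group from left to right.
     * No spaces are inserted between fragments.
     */
    static List<String> mergeLines(List<Fragment> fragments, float yTolerance) {
        // Work on a copy so the caller's list is left untouched.
        List<Fragment> sorted = new ArrayList<>(fragments);
        // Assumption: smaller y means higher on the page.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> lines = new ArrayList<>();
        float lineY = 0f;
        for (Fragment f : sorted) {
            // Start a new line when the fragment is too far from the current one.
            if (lines.isEmpty() || Math.abs(f.y - lineY) > yTolerance) {
                lines.add(new ArrayList<>());
                lineY = f.y;
            }
            lines.get(lines.size() - 1).add(f);
        }

        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            // Within one line, order the fragments from left to right.
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

Given the fragments "h" at (98, 700), "My name is Mahes" at (10, 700), and "Second line" at (10, 715), calling mergeLines with a tolerance of 1 returns "My name is Mahesh" followed by "Second line". A real extractor would also need to decide when to insert a space between fragments, which this sketch skips.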
> PDFBOX 1.2.1 Text parsing issue
> -------------------------------
>
> Key: PDFBOX-781
> URL: https://issues.apache.org/jira/browse/PDFBOX-781
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.2.1
> Environment: Windows XP, Java
> Reporter: Mahesh
> Priority: Trivial
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Hi,
> Thanks a lot for PDFBox.
> I have been using PDFBox 1.2.1 for text parsing. I have customized my text parsing class by extending the PDFTextStripper class.
> The issue is: although I am able to get all the required data (such as x/y position, width, height, font name, font size), the text extracted using the TextPosition object's getCharacter() contains the full line except for the last character. This last character appears as the text of the next line.
> Example: (Line in PDF): "My name is Mahesh"
> (Parsed data): "My name is Mahes"
> "h"
> Please help me in this regard.