Posted to dev@pdfbox.apache.org by "Hesham (JIRA)" <ji...@apache.org> on 2010/01/04 08:22:54 UTC
[jira] Updated: (PDFBOX-588) Problem extracting text in newline characters
[ https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hesham updated PDFBOX-588:
--------------------------
Attachment: Enters-sample.pdf
This is a sample file that exhibits the issue.
> Problem extracting text in newline characters
> ---------------------------------------------
>
> Key: PDFBOX-588
> URL: https://issues.apache.org/jira/browse/PDFBOX-588
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 0.8.0-incubator
> Environment: Win XP
> Reporter: Hesham
> Attachments: Enters-sample.pdf
>
>
> Hello,
>
> I have a PDF file with only one page. When I try to extract its text using:
> String pageData = stripper.getText( pdfFile );
> it ignores some Enter (newline) characters between lines, so the last word of a line and the first word of the next line appear as one word with no space between them.
> However, if I copy the text manually from the PDF and paste it into a text editor, newlines appear after the same lines that cause the problem in PDFBox.
> You can download the PDF file here to try it:
> http://www.4shared.com/file/185259485/5d937eb/Enters-sample.html
>
> Is there a way to fix this?
>
> Best regards,
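To make the reported symptom concrete, here is a minimal, self-contained sketch of how a position-based text extractor might decide where to put line breaks and spaces. It is not PDFBox's code: the class LineBreakSketch, the Fragment type, the join method, and the two thresholds are all hypothetical names chosen for illustration, and the coordinates are made up. The assumption is that an extractor emits a line separator only when the baseline moves by more than some tolerance, and a space only when the horizontal gap to the previous fragment is large enough. Under that assumption, a tolerance that is too generous would glue the last word of one line to the first word of the next, which matches the behaviour described above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not PDFBox code.
class LineBreakSketch {

    // A piece of text with the position and width it was drawn at (made-up units).
    static final class Fragment {
        final String text;
        final double x;      // left edge
        final double y;      // baseline
        final double width;  // horizontal extent

        Fragment(String text, double x, double y, double width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    // Joins fragments in the given order. A newline is emitted when the baseline
    // moves by more than yTolerance. Otherwise a space is emitted only when the
    // gap between the previous fragment's right edge and this fragment's left
    // edge exceeds minWordGap.
    static String join(List<Fragment> fragments, double yTolerance, double minWordGap) {
        StringBuilder out = new StringBuilder();
        Fragment prev = null;
        for (Fragment f : fragments) {
            if (prev != null) {
                if (Math.abs(f.y - prev.y) > yTolerance) {
                    out.append('\n');
                } else if (f.x - (prev.x + prev.width) > minWordGap) {
                    out.append(' ');
                }
            }
            out.append(f.text);
            prev = f;
        }
        return out.toString();
    }

    // Two short lines of text, 12 units apart vertically.
    static List<Fragment> samplePage() {
        return Arrays.asList(
            new Fragment("first", 10, 700, 25),
            new Fragment("line", 40, 700, 20),
            new Fragment("second", 10, 688, 30),
            new Fragment("line", 45, 688, 20));
    }

    public static void main(String[] args) {
        // Tight tolerance: the baseline change is detected as a line break.
        System.out.println(join(samplePage(), 2.0, 1.0));
        // Loose tolerance: the line break is missed and, because the next word
        // starts to the left of the previous one, no space is inserted either.
        System.out.println(join(samplePage(), 20.0, 1.0));
    }
}
```

With the tight tolerance the two lines stay separate; with the loose one the output contains "linesecond" as a single word. Whether anything like this is the actual cause in Enters-sample.pdf would need to be confirmed against the attached file.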