
Posted to dev@pdfbox.apache.org by "Hesham (JIRA)" <ji...@apache.org> on 2011/01/03 07:21:45 UTC

[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters

    [ https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976625#action_12976625 ] 

Hesham commented on PDFBOX-588:
-------------------------------

I have tested this with PDFBox v1.4. The results are worse than before: the spaces between words are now missing as well. Examples:

- The line "to alert John Hancock and Samuel Adams that" is read as "toalertJohnHancockandSamuelAdamsthat".

- The line "the Regulars are coming out" is read as "theRegularsarecomingout".

- The line "the two were to be arrested" is read as "thetwoweretobearrested".
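
To make the symptom concrete, here is a minimal sketch of how position-based text reconstruction can lose spaces and line breaks. It is not PDFBox's actual algorithm. The Fragment type, the rebuild method and the two thresholds are hypothetical names introduced only for illustration, and the coordinates are made up. The idea is that a space is inserted only when the horizontal gap between two fragments exceeds a threshold, and a line break only when the baseline changes by more than a tolerance, so thresholds that are too large produce exactly the merged output shown above.

```java
import java.util.ArrayList;
import java.util.List;

public class LineBreakSketch {

    /** A run of text with its page position (hypothetical type, not a PDFBox class). */
    static final class Fragment {
        final String text;
        final double x;      // left edge of the fragment
        final double y;      // baseline of the fragment
        final double width;  // rendered width of the fragment

        Fragment(String text, double x, double y, double width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Rebuilds text from fragments given in reading order.
     * A line break is emitted when the baseline moves by more than lineTolerance,
     * and a space when the horizontal gap on the same line exceeds spaceGap.
     */
    static String rebuild(List<Fragment> fragments, double spaceGap, double lineTolerance) {
        StringBuilder out = new StringBuilder();
        Fragment prev = null;
        for (Fragment f : fragments) {
            if (prev != null) {
                if (Math.abs(f.y - prev.y) > lineTolerance) {
                    out.append('\n');
                } else if (f.x - (prev.x + prev.width) > spaceGap) {
                    out.append(' ');
                }
            }
            out.append(f.text);
            prev = f;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates: two lines, with a gap of 3 units between words.
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("the", 10, 700, 15));
        page.add(new Fragment("Regulars", 28, 700, 40));
        page.add(new Fragment("are", 71, 700, 15));
        page.add(new Fragment("coming", 10, 688, 32));
        page.add(new Fragment("out", 45, 688, 15));

        // Reasonable thresholds keep the spaces and the line break.
        System.out.println(rebuild(page, 1.5, 2.0));

        // Thresholds that are far too large merge everything into one word.
        System.out.println(rebuild(page, 1000, 1000));
    }
}
```

With the sample data, the first call keeps the words separate and breaks the line after "are", while the second call returns "theRegularsarecomingout", the same kind of result reported above.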

> Problem extracting text in newline characters
> ---------------------------------------------
>
>                 Key: PDFBOX-588
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-588
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>         Environment: Win XP
>            Reporter: Hesham
>         Attachments: Enters-sample.pdf, PDFTextStripper.patch
>
>
> Hello,
>  
> I have a PDF file with only one page. When I try to extract its text using:
> String pageData = stripper.getText( pdfFile );
> some line breaks between lines are ignored, so the last word of one line and the first word of the next line appear as a single word with no space between them.
> If I instead copy the text manually from the PDF and paste it into a text editor, line breaks appear after the same lines that cause the problem in PDFBox.
> Please check the attached file as a sample.
>  
> Is there a way to fix this?
>  
> Best regards,
