Posted to dev@pdfbox.apache.org by "Hesham (JIRA)" <ji...@apache.org> on 2011/01/04 15:39:46 UTC

[jira] Issue Comment Edited: (PDFBOX-588) Problem extracting text in newline characters

    [ https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977281#action_12977281 ] 

Hesham edited comment on PDFBOX-588 at 1/4/11 9:39 AM:
-------------------------------------------------------

Thanks a lot, Mel and Andreas, for the investigation. The 'PDFTextStripper.setSpacingTolerance(float)' method is very interesting. I tested it on 5 PDFs, and the best value for me was 0.3f. With it, most words are extracted correctly.

As for the PDF attached to this issue, the spacing problem is now limited to the last words of the paragraph at the lower left, for example:
"be able to read about Paul Revere's midnight" -> "beabletoreadaboutPaulRevere'smidnight"
"journey only a" -> "journeyonlya"

If I use a spacing tolerance of 0.1f, those words are extracted correctly, but in return other words come out wrong, for example:
"UNCENSORED REVOLUTIONARY WAR HISTORY" -> "U N C E N S O R E D R E V O L U T I O N A R Y W A R H I S T O R Y"

So I guess I will leave it at 0.3f, which is much better overall. I will now check the 'Enters' problem in PDFBOX-521.
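
To make the trade-off above concrete, here is a minimal Java sketch of a gap-threshold rule. It is a simplified model and not PDFBox's actual implementation: the class `SpacingToleranceSketch`, its `Glyph` type, the `layout` and `extract` helpers, and every glyph width and gap value are hypothetical and chosen only for illustration. The assumption is that a space is emitted whenever the horizontal gap between two glyphs exceeds the tolerance multiplied by the width of a space.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model only. This is NOT how PDFBox is implemented internally.
final class SpacingToleranceSketch {

    /** A positioned glyph: character, left edge, and advance width (same units). */
    static final class Glyph {
        final char ch;
        final float x;
        final float width;

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /** Lays out text with hypothetical gaps; every glyph is 1 unit wide. */
    static List<Glyph> layout(String text, float letterGap, float wordGap) {
        List<Glyph> glyphs = new ArrayList<>();
        float x = 0f;
        for (char ch : text.toCharArray()) {
            if (ch == ' ') {
                // Widen the gap before the next glyph from letterGap to wordGap.
                x += wordGap - letterGap;
                continue;
            }
            glyphs.add(new Glyph(ch, x, 1f));
            x += 1f + letterGap;
        }
        return glyphs;
    }

    /** Rebuilds text, inserting a space when a gap exceeds tolerance * spaceWidth. */
    static String extract(List<Glyph> glyphs, float spaceWidth, float tolerance) {
        StringBuilder out = new StringBuilder();
        float previousEnd = Float.NaN;
        for (Glyph g : glyphs) {
            if (!Float.isNaN(previousEnd) && g.x - previousEnd > tolerance * spaceWidth) {
                out.append(' ');
            }
            out.append(g.ch);
            previousEnd = g.x + g.width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Tightly set paragraph: word gaps are only 0.2 of a space width.
        List<Glyph> paragraph = layout("journey only a", 0f, 0.2f);
        // Letter-spaced heading: letter gaps of 0.2, word gaps of a full space.
        List<Glyph> heading = layout("WAR HISTORY", 0.2f, 1f);

        for (float tolerance : new float[] {0.3f, 0.1f}) {
            System.out.println("tolerance " + tolerance);
            System.out.println("  " + extract(paragraph, 1f, tolerance));
            System.out.println("  " + extract(heading, 1f, tolerance));
        }
    }
}
```

In this model, no single threshold handles both inputs: a tolerance above the paragraph's narrow word gaps merges its words, while a tolerance below the heading's letter gaps splits it into single letters. In PDFBox itself, the value is set with 'PDFTextStripper.setSpacingTolerance(float)' before calling getText.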

      was (Author: hesham):
    Thanks a lot, Mel and Andreas, for the investigation. The 'PDFTextStripper.setSpacingTolerance(float)' method is very interesting. I tested it on 5 PDFs, and the best value for me was 0.3f. With it, most words are extracted correctly.

As for the PDF attached to this issue, the spacing problem is now limited to the last words of the paragraph at the lower left, for example:
"able to" -> "ableto"
"in order" -> "inorder"
"But not" -> "Butnot"
"who set" -> "whoset"

I think this is caused by the 'Enters' problem. I will check it now in PDFBOX-521.
  
> Problem extracting text in newline characters
> ---------------------------------------------
>
>                 Key: PDFBOX-588
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-588
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>         Environment: Win XP
>            Reporter: Hesham
>         Attachments: Enters-sample.pdf, PDFBOX588-Enters-sample1.png, PDFTextStripper.patch
>
>
> Hello ,
>  
> I have a one-page PDF file. When I try to extract its text using:
> String pageData = stripper.getText( pdfFile );
> it ignores some Enter (newline) characters between lines, so the last word of a line and the first word of the next line appear as one word with no space between them.
> However, if I copy the text manually from the PDF and paste it into a text editor, Enter characters appear after the same lines that cause the problem in PDFBox.
> Please check the attached file as a sample.
>  
> Is there a way to fix this ?
>  
> Best regards ,
