Posted to dev@pdfbox.apache.org by "Villu Ruusmann (JIRA)" <ji...@apache.org> on 2010/01/06 23:14:54 UTC

[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters

    [ https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797349#action_12797349 ] 

Villu Ruusmann commented on PDFBOX-588:
---------------------------------------

As discussed on the pdfbox-users mailing list [1], this issue stems from the naivety of PDFTextStripper's line detection algorithm.

Correcting for obvious line wraps is straightforward. I've attached a sample patch file that does so by detecting TextPosition instances located significantly below and to the left of the previous TextPosition instance. The current threshold values are arbitrary (e.g. 5 times the width of a space in the X-direction) and should be replaced with something more meaningful.
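
To illustrate the heuristic, here is a minimal, self-contained sketch. It is not the attached patch and does not use the PDFBox API: the Fragment class, the join method, and the Y_FACTOR threshold are all hypothetical names and values made up for illustration. Only the factor of 5 times the width of a space in the X-direction comes from the description above. The sketch also assumes that Y coordinates grow downward on the page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the line-wrap heuristic; not the attached PDFTextStripper.patch.
class LineWrapSketch {

    /** Minimal stand-in for a positioned piece of text (not PDFBox's TextPosition). */
    static final class Fragment {
        final String text;
        final float x;            // left edge of the fragment
        final float y;            // vertical position; assumed to grow downward
        final float height;       // height of the text
        final float widthOfSpace; // width of a space character in the current font

        Fragment(String text, float x, float y, float height, float widthOfSpace) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.height = height;
            this.widthOfSpace = widthOfSpace;
        }
    }

    // 5x the space width is the arbitrary value mentioned in the comment above.
    static final float X_FACTOR = 5.0f;
    // Assumption for this sketch only: "significantly below" means more than half a text height.
    static final float Y_FACTOR = 0.5f;

    /** Returns true when current sits significantly below and to the left of previous. */
    static boolean isLineWrap(Fragment previous, Fragment current) {
        boolean below = current.y - previous.y > Y_FACTOR * previous.height;
        boolean left = previous.x - current.x > X_FACTOR * previous.widthOfSpace;
        return below && left;
    }

    /** Concatenates fragments in order, inserting a newline at each detected line wrap. */
    static String join(List<Fragment> fragments) {
        StringBuilder out = new StringBuilder();
        Fragment previous = null;
        for (Fragment current : fragments) {
            if (previous != null && isLineWrap(previous, current)) {
                out.append('\n');
            }
            out.append(current.text);
            previous = current;
        }
        return out.toString();
    }
}
```

Both thresholds scale with the font (text height and space width) rather than using absolute page units, which is one possible way to make them less arbitrary; better values would have to be tested against real documents.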

[1] http://markmail.org/message/4b3bqpx7zznyqljh

> Problem extracting text in newline characters
> ---------------------------------------------
>
>                 Key: PDFBOX-588
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-588
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>         Environment: Win XP
>            Reporter: Hesham
>         Attachments: Enters-sample.pdf, PDFTextStripper.patch
>
>
> Hello ,
>  
> I have a PDF file with only one page. When I try to extract its text using:
> String pageData = stripper.getText( pdfFile );
> it ignores some line breaks (Enter characters) between lines, so the last word of one line and the first word of the next line appear as a single word with no space between them.
> However, if I copy the text manually from the PDF and paste it into a text editor, line breaks appear after the same lines that caused the problem in PDFBox.
> Please check the attached file as a sample.
>  
> Is there a way to fix this ?
>  
> Best regards ,
