Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/10/11 03:04:37 UTC

[jira] [Updated] (PDFBOX-1542) Whitespaces between words are not created

     [ https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-1542:
--------------------------------
    Attachment: Invoice1.pdf

> Whitespaces between words are not created
> -----------------------------------------
>
>                 Key: PDFBOX-1542
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1542
>             Project: PDFBox
>          Issue Type: Wish
>          Components: Text extraction
>    Affects Versions: 1.7.1
>            Reporter: Vitalie Bureanu
>            Priority: Minor
>         Attachments: Invoice1.pdf, Parser.java
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Hello, I extract text from PDF files with PDFBox. I noticed that text extraction from some PDF files is not as good as expected. I have a series of PDF invoices from which I try to extract the text with coordinates, and the result is pretty good, but I noticed a very strange thing: the words are extracted without whitespace between them. Example: if I try to extract "Unit Price", the result is "UnitPrice".
> But if I open the invoice in Adobe Reader and copy/paste into Notepad, I get "Unit Price" with the whitespace!
> I think the whitespace is not present in the original PDF document, but Adobe Reader somehow inserts whitespace between words when it shows the content of the PDF.
>  
> Guys, can you please suggest how I can get the strings with spaces after parsing?
> See example of invoice here: http://www.cloudforpeople.com/Invoice1.pdf
> PS: I want to try version 1.8.0 of PDFBox. How can I download it?
> Many thanks,
> Vitalie
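
For illustration only: the report above suspects that the viewer infers spaces from the horizontal gaps between glyphs rather than from space characters stored in the file. The sketch below shows one way such a heuristic could work. It is not PDFBox's or Adobe Reader's actual implementation; the GapSpacer class, the Glyph record, and the tolerance value are all hypothetical names and numbers chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: insert a space when the gap between two glyphs
// exceeds a fraction of the previous glyph's width. Not a PDFBox API.
public class GapSpacer {

    /** Hypothetical glyph record: the character, its left x position, and its width. */
    public static final class Glyph {
        final char ch;
        final float x;
        final float width;

        public Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs (assumed to be on one line, sorted left to right) into a string,
     * adding a space wherever the gap to the previous glyph is larger than
     * tolerance * (previous glyph width).
     */
    public static String joinWithSpaces(List<Glyph> glyphs, float tolerance) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > tolerance * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Build "Unit" and "Price" with 5-unit-wide glyphs and a 3-unit gap
        // between the two words, but no space glyph.
        List<Glyph> line = new ArrayList<>();
        float x = 0f;
        for (char c : "Unit".toCharArray()) {
            line.add(new Glyph(c, x, 5f));
            x += 5f;
        }
        x += 3f; // visual gap only
        for (char c : "Price".toCharArray()) {
            line.add(new Glyph(c, x, 5f));
            x += 5f;
        }
        System.out.println(joinWithSpaces(line, 0.5f));
    }
}
```

The tolerance controls the trade-off: too high and words run together as in the report, too low and spaces appear inside words with loose kerning.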



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)