Posted to dev@pdfbox.apache.org by "Vitalie Bureanu (JIRA)" <ji...@apache.org> on 2013/03/15 12:56:12 UTC

[jira] [Updated] (PDFBOX-1542) Whitespaces between words are not created

     [ https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vitalie Bureanu updated PDFBOX-1542:
------------------------------------

    Description: 
Hello, I use PDFBox to extract text from PDF files. I noticed that text extraction from some PDF files is not as good as expected. I have a series of PDF invoices from which I extract the text with coordinates, and the result is mostly good, but I noticed something very strange: the words are extracted without whitespace between them. For example, if I try to extract "Unit Price", the result is "UnitPrice".
But if I open the invoice in Adobe Reader and copy and paste it into Notepad, I get "Unit Price" with the whitespace!
I think the whitespace is not present in the original PDF document, and Adobe Reader somehow inserts whitespace between words when it shows the content of the PDF.

Guys, can you please suggest how I can get the strings with spaces after parsing?
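
To make the guess above concrete, here is a minimal sketch of how a viewer could insert spaces that are not stored in the file: it compares the horizontal gap between neighbouring glyphs with a fraction of the glyph width. This is only an illustration under stated assumptions, not PDFBox's or Adobe Reader's actual algorithm. The names GapSpacer, Glyph, place, join and the tolerance values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class GapSpacer {

    /** Hypothetical glyph: one character with its left edge and width in page units. */
    static final class Glyph {
        final char ch;
        final float x;
        final float width;

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /** Lays out a word as touching, fixed-width glyphs starting at startX (illustration only). */
    static List<Glyph> place(String word, float startX, float width) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < word.length(); i++) {
            glyphs.add(new Glyph(word.charAt(i), startX + i * width, width));
        }
        return glyphs;
    }

    /**
     * Joins glyphs that are already sorted left to right. A space is inserted
     * when the gap to the previous glyph exceeds tolerance * previous glyph width.
     */
    static String join(List<Glyph> line, float tolerance) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > tolerance * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Unit" ends at x = 20; "Price" starts at x = 23, so the gap is 3 units.
        List<Glyph> line = new ArrayList<>();
        line.addAll(place("Unit", 0f, 5f));
        line.addAll(place("Price", 23f, 5f));

        System.out.println(join(line, 0.5f)); // threshold 2.5 < 3, prints "Unit Price"
        System.out.println(join(line, 1.0f)); // threshold 5.0 > 3, prints "UnitPrice"
    }
}
```

If the missing spaces come from a threshold like this being too strict for these invoices, it may be worth experimenting with the setSpacingTolerance and setAverageCharTolerance setters on PDFTextStripper. Whether they help for this particular file is untested.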

See example of invoice here: http://www.cloudforpeople.com/Invoice1.pdf

PS: I want to try version 1.8.0 of PDFBox. How can I download it?

Many thanks,
Vitalie

  was:
Hello, I use PDFBox to extract text from PDF files. I noticed that text extraction from some PDF files is not as good as expected. I have a series of PDF invoices from which I extract the text with coordinates, and the result is mostly good, but I noticed something very strange: the words are extracted without whitespace between them. For example, if I try to extract "Total Amount", the result is "TotalAmount".
But if I open the invoice in Adobe Reader and copy and paste it into Notepad, I get "Total Amount" with the whitespace!
I think the whitespace is not present in the original PDF document, and Adobe Reader somehow inserts whitespace between words when it shows the content of the PDF.

Guys, can you please suggest how I can get the strings with spaces after parsing?

See example of invoice here: http://www.cloudforpeople.com/Invoice1.pdf

PS: I want to try version 1.8.0 of PDFBox. How can I download it?

Many thanks,
Vitalie

    
> Whitespaces between words are not created
> -----------------------------------------
>
>                 Key: PDFBOX-1542
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1542
>             Project: PDFBox
>          Issue Type: Wish
>          Components: Text extraction
>    Affects Versions: 1.7.1
>            Reporter: Vitalie Bureanu
>            Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Hello, I use PDFBox to extract text from PDF files. I noticed that text extraction from some PDF files is not as good as expected. I have a series of PDF invoices from which I extract the text with coordinates, and the result is mostly good, but I noticed something very strange: the words are extracted without whitespace between them. For example, if I try to extract "Unit Price", the result is "UnitPrice".
> But if I open the invoice in Adobe Reader and copy and paste it into Notepad, I get "Unit Price" with the whitespace!
> I think the whitespace is not present in the original PDF document, and Adobe Reader somehow inserts whitespace between words when it shows the content of the PDF.
>
> Guys, can you please suggest how I can get the strings with spaces after parsing?
> See example of invoice here: http://www.cloudforpeople.com/Invoice1.pdf
> PS: I want to try version 1.8.0 of PDFBox. How can I download it?
> Many thanks,
> Vitalie

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira