Posted to users@pdfbox.apache.org by Fulvio D'Antonio <fu...@gmail.com> on 2009/07/31 11:18:18 UTC
PDF to text problems
Hello everybody,
I'm using PDFTextStripper to extract plain text from a PDF.
The problem I encounter is that every occurrence of "fi", "ffi", etc. is
replaced by a "?".
I think it is an encoding problem, but I can't figure out how to solve it.
Thank you in advance for your help.
Fulvio
Re: PDF to text problems
Posted by Iain Clapham <ia...@googlemail.com>.
Fulvio,
This is a mapping problem: some character sequences have been combined into single ligature glyphs (fi, ff, fl, ft, Th).
Have a look at the codes above 126 in the file (Resources.afm.Times-Roman.afm).
You might need to change the PDFont encoding to produce the result you require.
Cheers --- Iain
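
One possible workaround, sketched here under assumptions the thread does not confirm: if the extracted string actually contains the Unicode ligature characters (for example U+FB01 for "fi") and the "?" only appears when the text is written out in an encoding such as ISO-8859-1 that cannot represent them, the ligatures can be expanded to plain letters before writing. The class and method names below (LigatureFix, expandLigatures, roundTripLatin1) are hypothetical and are not part of PDFBox; only standard JDK classes are used. Note that NFKC normalization also folds other compatibility characters (superscripts, full-width forms, and so on), and that "ft" and "Th" have no standard Unicode ligature code points, so this would not help with those.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox: post-processes already-extracted text.
final class LigatureFix {

    // Expands compatibility characters, including the Latin ligatures
    // U+FB00..U+FB06, into their plain-letter equivalents via NFKC.
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Simulates writing the text in ISO-8859-1 and reading it back.
    // Characters that the charset cannot encode come back as '?'.
    static String roundTripLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Stand-in for stripper output: "financial office" with fi and ffi ligatures.
        String extracted = "\uFB01nancial o\uFB03ce";

        // Without the fix, the ligatures are lost in the Latin-1 output.
        System.out.println(roundTripLatin1(extracted));                  // ?nancial o?ce

        // With the fix, only plain ASCII letters remain.
        System.out.println(roundTripLatin1(expandLigatures(extracted))); // financial office
    }
}
```

If the "?" is already present in the string returned by the stripper, the ligature glyph was never mapped to Unicode in the first place, and this post-processing will not recover it; in that case the font encoding route described above is the place to look. Writing the output as UTF-8 is another option when the ligature characters themselves are acceptable in the result.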