Posted to users@pdfbox.apache.org by Fulvio D'Antonio <fu...@gmail.com> on 2009/07/31 11:18:18 UTC

PDF to text problems

Hello everybody,
I'm using PDFTextStripper to extract plain text from a PDF.
The problem I encounter is that every occurrence of "fi", "ffi", etc. is
replaced by a "?".
I think it is an encoding problem, but I can't figure out how to solve it.

Thank you in advance for your help.

Fulvio

Re: PDF to text problems

Posted by Iain Clapham <ia...@googlemail.com>.
Fulvio,

This is a mapping problem: some character sequences have been combined into single ligature glyphs (fi, ff, fl, ft, Th).

Have a look in the file (Resources.afm.Times-Roman.afm) at the codes above 126.

You might need to change the PDFont encode to produce the result you require.

Cheers --- Iain
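
One possible workaround, sketched here as a post-processing step rather than a change inside PDFBox: if the extracted string actually contains the Unicode ligature code points (U+FB00 to U+FB04) and the "?" only appears when the text is written out in a charset that lacks them, the ligatures can be expanded to plain letters before writing. That cause is an assumption, not something confirmed in this thread, and the class name LigatureExpander and its expand method are hypothetical names used only for illustration. If the extracted string already contains a literal "?", this will not help and the font mapping itself has to be fixed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of PDFBox): expands Unicode Latin
// ligature code points into their plain-letter equivalents.
public class LigatureExpander {

    private static final Map<Character, String> LIGATURES =
            new HashMap<Character, String>();

    static {
        LIGATURES.put('\uFB00', "ff");
        LIGATURES.put('\uFB01', "fi");
        LIGATURES.put('\uFB02', "fl");
        LIGATURES.put('\uFB03', "ffi");
        LIGATURES.put('\uFB04', "ffl");
    }

    // Returns a copy of text with each known ligature replaced by its letters;
    // all other characters are copied unchanged.
    public static String expand(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String plain = LIGATURES.get(c);
            if (plain != null) {
                out.append(plain);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed input: text as it might come back from the text stripper.
        System.out.println(expand("e\uFB03cient \uFB01le")); // prints "efficient file"
    }
}
```

The explicit table keeps the change limited to the five ligatures listed. A broader alternative is java.text.Normalizer with Normalizer.Form.NFKC, which also expands these code points but rewrites many other compatibility characters as well.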
