Posted to users@pdfbox.apache.org by Jesse James Joson <je...@gmail.com> on 2017/11/06 06:04:49 UTC

Question on text extraction

Hi,

I have run into an issue extracting text with PDFBox 2.0.7. When I open
the PDF file in Acrobat, I can see the content, and it can be selected and
searched. However, the "-" character is not read correctly: when the file
is processed with PDFBox, a "?" is returned in place of the hyphen.

Thank you

Re: Question on text extraction

Posted by Tilman Hausherr <TH...@t-online.de>.

Somewhat answered here:

https://pdfbox.apache.org/2.0/faq.html#notext

Other useful reads that show how tricky this is:

https://stackoverflow.com/questions/45895768/pdfbox-2-0-7-extracttext-not-working-but-1-8-13-does-and-pdfreader-as-well
https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0

For a specific answer, please link to the PDF. But if Adobe can't 
extract it, then it's unlikely PDFBox can.

Tilman
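
Until the PDF itself is available, one thing that can be checked locally is
what PDFBox actually returned. The sketch below is only a diagnostic aid and
rests on an assumption not confirmed in this thread: that the "?" may be a
non-ASCII hyphen (for example U+2010) that the console or output encoding
could not display, rather than a literal question mark. The class name
CodePointDump and the method describeNonAscii are hypothetical names chosen
for illustration, and the sample string stands in for the text returned by
PDFTextStripper.getText(document).

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDump {

    /**
     * Lists every non-ASCII code point in the text as "U+XXXX (Unicode name)".
     * A literal '?' (U+003F) is ASCII and is therefore not listed.
     */
    public static List<String> describeNonAscii(String text) {
        List<String> found = new ArrayList<>();
        text.codePoints()
            .filter(cp -> cp > 0x7F)
            .forEach(cp -> found.add(
                String.format("U+%04X (%s)", cp, Character.getName(cp))));
        return found;
    }

    public static void main(String[] args) {
        // Assumption: stand-in for the string returned by
        // PDFTextStripper.getText(document); replace with the real result.
        String extracted = "2017\u201011\u201006";
        describeNonAscii(extracted).forEach(System.out::println);
    }
}
```

If the list comes back empty while the output still shows "?", the question
mark is really in the extracted text, and the links above about missing or
wrong Unicode mappings in the font are the more likely explanation.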


