Posted to users@pdfbox.apache.org by Zak Bennett <za...@gmail.com> on 2013/08/28 01:20:37 UTC
Japanese characters
Hi guys,
Firstly I apologise if this question has been repeated often. Having looked
around I have found a number of individuals with the same issue as myself.
Have you discovered any workarounds to the issue of returning Japanese text
information from a PDF using pdfbox? If not, would this be an issue which
the dev team is currently working to solve?
Best regards,
Zak
Re: Japanese characters
Posted by Andreas Lehmkühler <an...@lehmi.de>.
Hi,
> Zak Bennett <za...@gmail.com> wrote on 28 August 2013 at 01:20:
>
>
> Hi guys,
>
> Firstly I apologise if this question has been repeated often. Having looked
> around I have found a number of individuals with the same issue as myself.
>
> Have you discovered any workarounds to the issue of returning Japanese text
> information from a PDF using pdfbox? If not, would this be an issue which
> the dev team is currently working to solve?
Please be more specific. There are three known cases:
- PDFBox extracts the text correctly. This also works for PDFs in non-Latin
scripts, depending on the font used.
- Text extraction fails because of the font used and a wrong or incomplete
implementation in PDFBox.
- The text can't be extracted at all; even the Adobe test fails, see [1].
So the question is: have you actually tried to extract the text? If not, give
it a try [2].
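Once you have some extracted output, a quick way to see which of the cases above you are in is to check whether it contains any Japanese script at all. The following is only a minimal sketch using the Java standard library, not PDFBox code; the class and method names (JapaneseTextCheck, countJapaneseCodePoints) are made up for illustration, and it assumes the extracted text was saved as a UTF-8 file.

```java
import java.io.IOException;
import java.lang.Character.UnicodeScript;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox: inspects text that was
// already extracted from a PDF by some other means.
public class JapaneseTextCheck {

    // Counts code points whose Unicode script is Hiragana, Katakana or Han.
    static int countJapaneseCodePoints(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HIRAGANA
                    || script == UnicodeScript.KATAKANA
                    || script == UnicodeScript.HAN) {
                count++;
            }
            // Advance by one code point, which may be two chars.
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is a UTF-8 text file holding the extracted text.
        String text = args.length > 0
                ? new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8)
                : "PDFBox でテキストを抽出";
        System.out.println("Japanese code points: " + countJapaneseCodePoints(text));
    }
}
```

A count greater than zero suggests the first case. A count of zero for a PDF that visibly contains Japanese text points to the second or third case, and the test in [1] tells those two apart.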
> Best regards,
>
> Zak
BR
Andreas Lehmkühler
[1] http://pdfbox.apache.org/userguide/faq.html#notext
[2] http://pdfbox.apache.org/commandline/