Posted to users@pdfbox.apache.org by Zak Bennett <za...@gmail.com> on 2013/08/28 01:20:37 UTC

Japanese characters

Hi guys,

Firstly I apologise if this question has been repeated often. Having looked
around I have found a number of individuals with the same issue as myself.

Have you discovered any workarounds to the issue of returning Japanese text
information from a PDF using pdfbox? If not, would this be an issue which
the dev team is currently working to solve?

Best regards,

Zak

Re: Japanese characters

Posted by Andreas Lehmkühler <an...@lehmi.de>.

Hi,

> Zak Bennett <za...@gmail.com> wrote on 28 August 2013 at 01:20:
>
>
> Hi guys,
>
> Firstly I apologise if this question has been repeated often. Having looked
> around I have found a number of individuals with the same issue as myself.
>
> Have you discovered any workarounds to the issue of returning Japanese text
> information from a PDF using pdfbox? If not, would this be an issue which
> the dev team is currently working to solve?

Please be more specific. There are 3 known cases:

- PDFBox can extract the text of PDFs containing foreign (non-Latin)
  languages, depending on the font used
- the text extraction doesn't work because of the font used and a
  wrong/incomplete implementation in PDFBox
- the text can't be extracted at all, and even the Adobe test fails, see [1]

So the question is: have you actually tried to extract the text? If not, give it a try [2].
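
Once you have the extracted text in a file, a quick way to tell the first case from the other two is to check whether the output contains any Japanese characters at all. The following is only a minimal sketch using the plain Java standard library. It is not part of PDFBox: the class name JapaneseTextCheck and its methods are made up for illustration, and it assumes the extracted text was written as UTF-8.

```java
import java.io.IOException;
import java.lang.Character.UnicodeScript;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not a PDFBox class: inspects already extracted text.
class JapaneseTextCheck {

    // Counts code points whose Unicode script is Hiragana, Katakana or Han (kanji).
    static long countJapanese(String text) {
        return text.codePoints()
                .filter(cp -> {
                    UnicodeScript script = UnicodeScript.of(cp);
                    return script == UnicodeScript.HIRAGANA
                            || script == UnicodeScript.KATAKANA
                            || script == UnicodeScript.HAN;
                })
                .count();
    }

    // True if at least one Japanese character survived the extraction.
    static boolean containsJapanese(String text) {
        return countJapanese(text) > 0;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java JapaneseTextCheck <extracted-text-file>");
            return;
        }
        // Assumption: the extracted text file is encoded as UTF-8.
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        long total = text.codePointCount(0, text.length());
        System.out.println(countJapanese(text) + " Japanese characters out of "
                + total + " code points");
    }
}
```

If the count is zero, or the output is mostly replacement characters, although the PDF clearly shows Japanese text, you are most likely in the second or third case, and the Adobe test from [1] tells you which one.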

> Best regards,
>
> Zak

BR
Andreas Lehmkühler

[1] http://pdfbox.apache.org/userguide/faq.html#notext
[2] http://pdfbox.apache.org/commandline/