Posted to users@pdfbox.apache.org by Renaud Billen <re...@nic.be> on 2015/01/06 11:59:47 UTC

Extraction of Chinese characters

Hello,

I’m a new PDFBox user, and I’m having trouble extracting text from PDFs that contain Chinese characters.

I run PDFBox from the command line with the following command: java -jar C:/pdfbox-app.jar ExtractText C:/Test_Pdfbox.pdf C:/Test_Pdfbox.txt

The resulting text contains only question marks.

Here is the document:

Thanks for your help,
Renaud

Re: Extraction of Chinese characters

Posted by Renaud Billen <re...@nic.be>.

Thanks a lot, works like a charm now :)


> On 6 Jan 2015, at 12:14, Gilad Denneboom <gi...@gmail.com> wrote:
> 
> Try specifying the encoding parameter... See:
> https://pdfbox.apache.org/1.8/commandline.html#extractText

Re: Extraction of Chinese characters

Posted by Gilad Denneboom <gi...@gmail.com>.

Try specifying the encoding parameter... See:
https://pdfbox.apache.org/1.8/commandline.html#extractText
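
For context, the PDFBox 1.8 ExtractText tool accepts an -encoding option, so the original command would presumably become something like java -jar C:/pdfbox-app.jar ExtractText -encoding UTF-8 C:/Test_Pdfbox.pdf C:/Test_Pdfbox.txt. The short Java sketch below does not call PDFBox at all; it only illustrates the likely cause of the question marks, assuming the output was written in a Western single-byte encoding. ISO-8859-1 stands in for that encoding here (the poster's actual default is not stated in the thread), and the class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Encode the text to bytes in the given charset, then decode it back.
    // This mimics writing extracted text to a file and reading it again
    // with the same charset.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file's
        // own encoding does not matter.
        String sample = "\u4e2d\u6587";

        // ISO-8859-1 (assumed stand-in for the platform default) cannot
        // represent these characters, so each one is replaced by '?'.
        System.out.println(roundTrip(sample, StandardCharsets.ISO_8859_1).equals("??"));

        // UTF-8 can represent them, so the text survives unchanged.
        System.out.println(roundTrip(sample, StandardCharsets.UTF_8).equals(sample));
    }
}
```

Both lines print true: the characters are lost at the moment they are encoded into a charset that cannot represent them, which is why choosing an output encoding such as UTF-8 fixes the extraction.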
