Posted to user@tika.apache.org by Ilya Zavorin <iz...@caci.com> on 2011/03/19 22:55:45 UTC

how do I specify different encodings with "--text --encoding="?

I need to convert a batch of MS Office documents into Unicode text, preferably as UTF-16LE files with a BOM. So far I have tried:

1. --encoding=UTF-16LE : produced UTF-16LE without a BOM
2. --encoding=UTF-16 : produced UTF-16BE with a BOM (I think)
3. --encoding=UCS2 : UnsupportedEncodingException
4. --encoding=UCS-2 : UnsupportedEncodingException
5. --encoding=UTF16LE : UnsupportedEncodingException

So how do I get #1 but with BOM?

Thanks,

Mr. Ilya Zavorin, Ph.D.
Principal Research Analyst
Knowledge and Information Management Division
CACI International
4831 Walden Lane
Lanham, MD 20706
ph: 1-301-306-2859
fx: 1-301-306-8201
izavorin@caci.com
www.caci.com


Lot of WARNINGS when parsing PDF with Asian text!

Posted by Zabrane Mickael <za...@gmail.com>.
Hi guys,

While trying to extract text from this online PDF using Tika CLI 0.9, I got a lot of warnings:

$ java -jar tika-app.jar -v --encoding=UTF8 "http://www.hsbc.com/1/PA_1_1_S5/content/assets/investor_relations/hbap2010arn_hk_cn.pdf"

Could someone please explain what's going on?
Is it related to missing fonts?

N.B.: I was able to reproduce the same result on both OS X and Linux, using Apache Tika CLI 0.9 in each case.

Thanks in advance!

Regards,
Zabrane


RE: how do I specify different encodings with "--text --encoding="?

Posted by Jukka Zitting <jz...@adobe.com>.
Hi,


From: Ilya Zavorin [mailto:izavorin@caci.com] 
> So how do I get #1 but with BOM?

Try using --encoding=UnicodeLittle. See [1] for the available encoding names in Java 5.

[1] http://download.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html
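
If the charset alias is not available on a given JVM, a fallback is to write the BOM by hand and then encode as plain UTF-16LE. The following is only a minimal sketch of that idea in standalone Java, not part of Tika: the class name Utf16LeBomDemo and the helpers encodeWithBom and hex are made up for illustration, and it assumes Java 7 or later for StandardCharsets. It also prints the bytes for the first two cases in the original question, for comparison.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika.
class Utf16LeBomDemo {

    // Encodes text as UTF-16LE and prepends the little-endian BOM (FF FE) by hand.
    static byte[] encodeWithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFF;
        out[1] = (byte) 0xFE;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // UTF-16LE: little-endian, no BOM
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
        // UTF-16: Java's encoder writes a big-endian BOM, then big-endian data
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        // Manual BOM followed by UTF-16LE data
        System.out.println(hex(encodeWithBom("A")));                      // FF FE 41 00
    }
}
```

This only shows what the bytes look like; with the Tika CLI itself you would still pass the encoding name via --encoding as described above.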

BR,

Jukka Zitting