Posted to user@tika.apache.org by Ilya Zavorin <iz...@caci.com> on 2011/03/19 22:55:45 UTC
how do I specify different encodings with "--text --encoding="?
I need to convert a bunch of MS Office-type docs into Unicode text. My preference is to generate UTF16 LE files with BOM. So far I tried:
1. --encoding=UTF-16LE : produced UTF16LE w/o BOM
2. --encoding=UTF-16 : UTF16BE with BOM (I think)
3. --encoding=UCS2 : UnsupportedEncodingException
4. --encoding=UCS-2 : UnsupportedEncodingException
5. --encoding=UTF16LE : UnsupportedEncodingException
So how do I get #1 but with BOM?
Thanks,
Mr. Ilya Zavorin, Ph.D.
Principal Research Analyst
Knowledge and Information Management Division
CACI International
4831 Walden Lane
Lanham, MD 20706
ph: 1-301-306-2859
fx: 1-301-306-8201
izavorin@caci.com
www.caci.com
Lot of WARNINGS when parsing PDF with Asian text!
Posted by Zabrane Mickael <za...@gmail.com>.
Hi guys,
While trying to extract text from this online PDF using Tika CLI 0.9, a lot of warnings were reported:
$ java -jar tika-app.jar -v --encoding=UTF8 "http://www.hsbc.com/1/PA_1_1_S5/content/assets/investor_relations/hbap2010arn_hk_cn.pdf"
Could someone please explain to me what's going on?
Is it related to missing fonts?
N.B.: I was able to reproduce the same result on both OS X and Linux, using Apache Tika CLI 0.9 in each case.
Thanks in advance!
Regards,
Zabrane
RE: how do I specify different encodings with "--text --encoding="?
Posted by Jukka Zitting <jz...@adobe.com>.
Hi,
From: Ilya Zavorin [mailto:izavorin@caci.com]
> So how do I get #1 but with BOM?
Try using --encoding=UnicodeLittle. See [1] for the available encoding names in Java 5.
[1] http://download.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html
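To check what a given encoding name produces before running a whole batch through Tika, the bytes can be inspected directly in Java. The following is only a minimal sketch: the class and method names (EncodingBomDemo, encode, hex) are made up for illustration, and it assumes a JVM such as Oracle's or OpenJDK where the "UnicodeLittle" name is available. A byte order mark for little-endian UTF-16 shows up as the leading bytes FF FE.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of Tika.
class EncodingBomDemo {
    // Encode text with the named charset and return the raw bytes.
    // Charset.forName throws an unchecked exception for unknown names.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // Render bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Names taken from the thread; whether each one resolves depends on the JVM.
        String[] names = {"UTF-16LE", "UTF-16", "UnicodeLittle"};
        for (String name : names) {
            try {
                System.out.println(name + " -> " + hex(encode("A", name)));
            } catch (IllegalArgumentException e) {
                // Covers both unsupported and syntactically illegal charset names.
                System.out.println(name + " -> not supported on this JVM");
            }
        }
    }
}
```

If "UnicodeLittle" resolves, its output should start with FF FE while the "UTF-16LE" output should not, which matches the difference between case #1 in the original question and the suggestion above.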
BR,
Jukka Zitting