Posted to user@tika.apache.org by Zabrane Mickael <za...@gmail.com> on 2011/03/24 17:28:53 UTC
Lots of WARNINGS when parsing PDF with Asian text!
Hi guys,
While trying to extract text from this online PDF using Tika CLI 0.9, a lot of warnings were reported:
$ java -jar tika-app.jar -v --encoding=UTF8 "http://www.hsbc.com/1/PA_1_1_S5/content/assets/investor_relations/hbap2010arn_hk_cn.pdf"
Could someone please explain to me what's going on?
Is it related to missing fonts?
N.B.: I was able to reproduce the same result on both OS X and Linux, using Apache Tika CLI 0.9 in each case.
Thanks in advance!
Regards,
Zabrane