Posted to user@tika.apache.org by Zabrane Mickael <za...@gmail.com> on 2011/03/24 17:28:53 UTC

Lots of WARNINGS when parsing PDF with Asian text!

Hi guys,

While trying to extract text from this online PDF using Tika CLI 0.9, I got a lot of warnings:

$ java -jar tika-app.jar -v --encoding=UTF8 "http://www.hsbc.com/1/PA_1_1_S5/content/assets/investor_relations/hbap2010arn_hk_cn.pdf"

Could someone please explain what's going on?
Is it related to missing fonts?

N.B.: I was able to reproduce the same result on both OS X and Linux, using Apache Tika CLI 0.9 in each case.

Thanks in advance!

Regards,
Zabrane