Posted to user@tika.apache.org by yu shen <sh...@gmail.com> on 2010/11/13 11:16:55 UTC

Is Chinese content extraction supported by Tika?

Hi all,

I just started using Tika. I tried extracting English words from HTML files,
and it works fine.
Then I tried to integrate a Chinese word tokenizer into Solr and searched
again, but many English words that previously matched no longer match.

Does Tika already offer a way to extract Chinese content from an
HTML file?

Thanks in advance.