Posted to java-user@lucene.apache.org by Chandan Tamrakar <ch...@ccnep.com.np> on 2004/03/19 11:26:10 UTC

Re: Indexing HTML

How do I index an HTML document that may be in any encoding, such as
EUC, SJIS, Western European, or UTF-8? Can I parse the HTML, extract the
text into a string, and then convert it into a Unicode text file?
Is this an appropriate way to index HTML files? Can anyone suggest a
simple parser other than the one found in the Lucene demo?
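
To make the question concrete, here is a minimal sketch of the conversion I
have in mind. It assumes the charset name is already known, and the class and
method names (HtmlDecodeSketch, decode, stripTags) are made up for
illustration. The regex tag stripping is only a placeholder for a real parser.

```java
import java.io.UnsupportedEncodingException;

public class HtmlDecodeSketch {

    // Decode raw bytes into a Java String (which is Unicode internally).
    // Assumption: the caller already knows the charset name.
    static String decode(byte[] raw, String charsetName)
            throws UnsupportedEncodingException {
        return new String(raw, charsetName);
    }

    // Crude placeholder for a real HTML parser: replaces tags with spaces
    // and collapses whitespace. It ignores scripts, entities and comments.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) throws Exception {
        // Sample page encoded as Western European (ISO-8859-1).
        byte[] raw = "<html><body><p>caf\u00e9</p></body></html>"
                .getBytes("ISO-8859-1");
        String text = stripTags(decode(raw, "ISO-8859-1"));
        System.out.println(text);
    }
}
```

If this is right, the resulting String could be handed to the indexer directly,
without writing an intermediate Unicode text file, but I am not sure whether
that is the recommended way.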

Also, how do I find the encoding of a file? Whenever there are ANSI text
files containing Japanese characters, I am not able to convert them into
UTF-16, but Lucene indexes them properly when I convert them into SJIS.
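
One approach I could imagine for HTML, though I do not know how reliable it
is, is to check for a byte-order mark and then look for a charset declaration
in a meta tag. The sketch below does that. The names (CharsetSniffSketch,
guessCharset) and the 1024-byte limit are my own assumptions, and it cannot
help with plain text files that carry no such hints.

```java
import java.io.UnsupportedEncodingException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffSketch {

    // Matches e.g. charset=Shift_JIS inside a meta tag (assumed to be ASCII).
    private static final Pattern META_CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns a guessed charset name, or the fallback if nothing is found.
    static String guessCharset(byte[] raw, String fallback)
            throws UnsupportedEncodingException {
        // 1. Byte-order marks for UTF-8 and UTF-16.
        if (raw.length >= 3 && (raw[0] & 0xFF) == 0xEF
                && (raw[1] & 0xFF) == 0xBB && (raw[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (raw.length >= 2 && (raw[0] & 0xFF) == 0xFE
                && (raw[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (raw.length >= 2 && (raw[0] & 0xFF) == 0xFF
                && (raw[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        // 2. Read the first bytes as ISO-8859-1 (maps every byte to a char)
        //    and search for a declared charset.
        int n = Math.min(raw.length, 1024);
        String head = new String(raw, 0, n, "ISO-8859-1");
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : fallback;
    }
}
```

The declared name is only what the page claims about itself, so it may still be
wrong or missing, in which case the fallback is used.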

Thanks,
Chandan



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org