Posted to java-user@lucene.apache.org by Jennifer May <je...@sino.uni-heidelberg.de> on 2007/09/14 12:24:11 UTC
HTMLParser and Chinese
Hello!
I want to index HTML documents with the Lucene demo, but I have problems
parsing some Chinese files.
I changed the code in the HTMLDocument class so that I can define the
encoding of the document to be parsed:
InputStreamReader fis = new InputStreamReader(new FileInputStream(f),
IndexHTML.encoding);
HTMLParser parser = new HTMLParser(fis);
It works fine for most of my files in GB, Big5 or UTF-8. However, I get
the following exception for some of my files:
Parse Aborted: Lexical error at line 6, column 24. Encountered: "\u4f53"
(20307), after : ""
The HTML document looks like this:
<HTML><HEAD><meta http-equiv="Content-Type" content="text/html; charset=GB2312"><TITLE>刘先生(阿成)</TITLE>
<META NAME="keywords" CONTENT="阿成 魂游天国 刘先生">...
Obviously, the Chinese text in the meta tag is the problem. But why? And how
can I solve it?
JTidy parses the same file without errors, but then I have problems with
the indexing, because the JTidy parser accepts only InputStreams without a
specified encoding, not InputStreamReaders (at least as far as I could find
out). Even if I convert my file from the original GB encoding to UTF-8, I
get only gibberish in the Lucene index when using JTidy for parsing.
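To narrow down where the gibberish comes from, it may help to first confirm that the bytes decode correctly outside of any HTML parser. The following is only a minimal sketch using the standard Java library; it does not call JTidy or Lucene, and the class name EncodingCheck and the method decode are hypothetical names chosen for illustration. It assumes the JVM provides the GB2312 charset, as standard JDK builds usually do. It encodes the character from the error message ("\u4f53") together with the first three title characters as GB2312, then decodes the bytes once with the right charset and once with the wrong one.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {

    // Hypothetical helper: decode a byte stream into a String
    // using the charset with the given name.
    static String decode(InputStream in, String charsetName) throws IOException {
        Reader reader = new InputStreamReader(in, Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes keep the source file independent of its own encoding.
        // "\u5218\u5148\u751f" are the first three title characters,
        // "\u4f53" is the character reported in the lexical error.
        String text = "\u5218\u5148\u751f\u4f53";

        // Simulate a GB2312-encoded file (assumes the JVM ships this charset).
        byte[] gbBytes = text.getBytes(Charset.forName("GB2312"));

        // Decoding with the matching charset recovers the text.
        String right = decode(new ByteArrayInputStream(gbBytes), "GB2312");

        // Decoding the same bytes as UTF-8 does not.
        String wrong = decode(new ByteArrayInputStream(gbBytes), "UTF-8");

        // Transcoding: decode as GB2312, then re-encode as UTF-8.
        byte[] utf8Bytes = right.getBytes(StandardCharsets.UTF_8);
        String roundTrip = decode(new ByteArrayInputStream(utf8Bytes), "UTF-8");

        System.out.println("GB2312 decode matches:  " + right.equals(text));
        System.out.println("UTF-8 decode matches:   " + wrong.equals(text));
        System.out.println("Round trip via UTF-8:   " + roundTrip.equals(text));
    }
}
```

If the first and third checks succeed for the real file contents, the bytes themselves are fine, and the gibberish is more likely introduced by whichever component reads the stream with a default or mismatched encoding.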
Thanks in advance for any suggestions, either to get around the
HTMLParser problem or to get JTidy to handle different encodings,
Jenny