Posted to java-user@lucene.apache.org by Jennifer May <je...@sino.uni-heidelberg.de> on 2007/09/14 12:24:11 UTC

HTMLParser and Chinese

Hello!

I want to index HTML documents with the Lucene demo, but I have problems 
parsing some Chinese files.

I changed the code in the HTMLDocument class so that I can specify the 
encoding of the document to be parsed:
InputStreamReader fis = new InputStreamReader(new FileInputStream(f), IndexHTML.encoding);
HTMLParser parser = new HTMLParser(fis);
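
For illustration, the same idea as a standalone sketch that uses only the JDK: open a file with an explicitly named charset and read it into a String. The class and method names (CharsetReaderSketch, open, readAll) are made up for this example and are not part of the Lucene demo.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper, not part of the Lucene demo.
public class CharsetReaderSketch {

    // Opens a file and decodes its bytes with the named charset,
    // for example "GB2312", "Big5" or "UTF-8".
    static Reader open(String path, String charsetName) throws IOException {
        // Charset.forName throws an unchecked exception if the name is
        // illegal or not supported by this JVM.
        Charset cs = Charset.forName(charsetName);
        return new BufferedReader(new InputStreamReader(new FileInputStream(path), cs));
    }

    // Reads everything from the reader into a String and closes it.
    static String readAll(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try (Reader r = in) {
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

Reading the same file once with the correct charset and once with a wrong one (for example ISO-8859-1) is a quick way to check whether the configured encoding really matches the file.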

It works fine for most of my files in GB, Big5 or UTF-8. However, I get 
the following exception for some of my files:
Parse Aborted: Lexical error at line 6, column 24. Encountered: "\u4f53" 
(20307), after : ""
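
As a side check, the number in parentheses is the decimal value of the escaped code point (0x4F53 = 20307). The small sketch below, with hypothetical names, prints a code point in both notations and tests whether it belongs to the Han script. If it does, that would suggest the reader decoded the bytes into a real Chinese character and the lexer then rejected it, rather than the bytes being decoded wrongly. This is only an inference from the message.

```java
// Hypothetical helper for inspecting a code point from an error message.
public class CodePointSketch {

    // Formats a code point as "U+XXXX (decimal) character".
    static String describe(int codePoint) {
        String ch = new String(Character.toChars(codePoint));
        return String.format("U+%04X (%d) %s", codePoint, codePoint, ch);
    }

    // True if the code point is assigned to the Han (CJK ideograph) script.
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    public static void main(String[] args) {
        System.out.println(describe(0x4F53));
        System.out.println("Han script: " + isHan(0x4F53));
    }
}
```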

The HTML document looks like this:

<HTML><HEAD><meta http-equiv="Content-Type" content="text/html; charset=GB2312"><TITLE>刘先生(阿成)</TITLE>
<META NAME="keywords" CONTENT="阿成 魂游天国 刘先生">...

Obviously, the Chinese text in the meta tag is the problem. But why? And how 
can I solve it?

JTidy parses the same file without errors, but then I have problems with 
the indexing, as the JTidy parser takes only InputStreams without a 
specified encoding, not InputStreamReaders (at least as far as I could find 
out). Even if I convert my file from the original GB to UTF-8, I get only 
gibberish in the Lucene index when using JTidy for parsing.
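
For reference, the conversion step can be sketched with the JDK alone: decode the bytes with the source charset and re-encode them as UTF-8 in memory, so that a parser which only accepts an InputStream receives UTF-8 bytes. The names (Utf8StreamSketch, asUtf8Stream) are hypothetical. The sketch does not touch JTidy itself, and the parser would still have to be configured to read its input as UTF-8, which is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: transcodes a stream to UTF-8 in memory.
public class Utf8StreamSketch {

    // Decodes 'raw' with the given source charset (e.g. "GB2312") and
    // returns a new stream containing the same text encoded as UTF-8.
    // The whole document is buffered in memory, which is fine for
    // typical HTML files but not for very large inputs.
    static InputStream asUtf8Stream(InputStream raw, String sourceCharset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(raw, Charset.forName(sourceCharset));
             Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                writer.write(buf, 0, n);
            }
        } // closing the writer flushes the remaining bytes into 'out'
        return new ByteArrayInputStream(out.toByteArray());
    }
}
```

Note that the charset declared in the document's own meta tag (charset=GB2312 in the example above) is left unchanged by this conversion, so it no longer matches the actual bytes.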

Thanks in advance for any suggestions, either to get around the 
HTMLParser problem or to get JTidy to handle different encodings,
Jenny