Posted to dev@lucene.apache.org by Daniel Naber <da...@t-online.de> on 2004/09/07 22:19:13 UTC

Re: CJK Support for HTMLParser.jj

On Monday 23 August 2004 13:46, Joey Lawrance wrote:

> I've attached the HTMLParser.jj file that successfully parses Japanese
> HTML for indexing.

Joey,

Thanks for the patch. When I compile it with "ant javacc-HTMLParser", I get
this warning:

"Warning: Line 364, Column 3: Non-ASCII characters used in regular 
expression.
Please make sure you use the correct Reader when you create the parser that 
can handle your character set."

Is it okay to get this warning? The line the warning refers to is this one:

| < CJK:                                          // non-alphabets
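
The warning's advice about using "the correct Reader" can be illustrated with
a minimal sketch. It assumes the parser is eventually constructed from a
java.io.Reader (JavaCC-generated parsers normally offer such a constructor,
but how HTMLParser is instantiated is not shown in this thread). The class
and method names below (CharsetReaderSketch, openReader, readAll) are
hypothetical and exist only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene or of the patch discussed here.
class CharsetReaderSketch {

    // Wrap a byte stream in a Reader that decodes with an explicitly chosen
    // charset instead of relying on the platform default.
    static Reader openReader(InputStream in, Charset charset) {
        return new InputStreamReader(in, charset);
    }

    // Drain a Reader into a String so the decoded characters can be inspected.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Three Japanese characters, written as escapes to keep this source ASCII.
        String html = "<html><body>\u65e5\u672c\u8a9e</body></html>";
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in restores the text.
        String right = readAll(openReader(new ByteArrayInputStream(bytes),
                                          StandardCharsets.UTF_8));

        // Decoding with a mismatched charset garbles the CJK characters, so a
        // CJK token rule in the grammar would never see them.
        String wrong = readAll(openReader(new ByteArrayInputStream(bytes),
                                          StandardCharsets.ISO_8859_1));

        System.out.println("matching charset round-trips:   " + right.equals(html));
        System.out.println("mismatched charset round-trips: " + wrong.equals(html));
        // In real use, the Reader from openReader(...) would be handed to the
        // parser (assumed step, not shown here).
    }
}
```

The point of the sketch is only that the charset is chosen at the Reader, not
in the grammar: the same bytes yield different characters depending on how
they are decoded.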

Apart from that, the patch seems to work: the parser no longer stops on
Japanese HTML files. That's all I can say, though, as I don't speak
Japanese.

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re: CJK Support for HTMLParser.jj

Posted by Joey Lawrance <la...@cs.orst.edu>.

I got the same warning when I compiled the patch. I haven't yet tried my
patch together with the patch for Bug 30844 (or the latest CVS) to see
whether that removes the warning. I assume it would, but that is
untested. I'll get around to it once I finish my current work (which
uses Lucene to index Japanese documents), which has a looming
deadline. :-)

Joey
