Posted to java-user@lucene.apache.org by Br...@gxs.com on 2008/04/02 22:58:38 UTC

Unicode Tokenizer problem with Registered Trademark Search

I am having a problem when searching for certain Unicode characters, such as the Registered Trademark. That's the Unicode character 00AE. It's also a problem searching for a Japanese Yen symbol (Unicode character 00A5).

I'm using the Lucene 2.0.0 jar file, and we used to use Lucene 1.4.2 jar file, where this used to work OK. But Lucene 2.0.0 doesn't work the same way.

I see that the registered trademark is in the Lucene index file, so that's good. The problem comes when I try to search for these characters.

I see that my query starts off OK, as this:

( (Locale:en) AND ( productName:(Digital¥^95) ) )    (if you cannot see the Japanese Yen symbol, it comes directly after "Digital")

Note: the "^95" is just a boost factor, and is OK.

I'm using StandardAnalyzer and StandardTokenizer to create a new QueryParser, and after I call the "parse" method of the QueryParser, my query becomes this:

 +Locale:en +productName:digital^95.0

Notice that the Japanese Yen symbol is gone! I think it's because the StandardTokenizer.jj file doesn't handle this character, and so it throws it away.

Is there any way to use a different Analyzer and/or Tokenizer, rather than building my own?

And if I had created my Lucene indexes with the StandardAnalyzer, must I use the StandardAnalyzer and StandardTokenizer to search the index?

Thanks.

RE: Unicode Tokenizer problem with Registered Trademark Search

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Bruce,

On 04/02/2008 at 4:58 PM, Bruce.Nawrocki@gxs.com wrote:
> I am having a problem when searching for certain Unicode
> characters, such as the Registered Trademark. That's the
> Unicode character 00AE. It's also a problem searching for a
> Japanese Yen symbol (Unicode character 00A5).
> 
> I'm using the Lucene 2.0.0 jar file, and we used to use
> Lucene 1.4.2 jar file, where this used to work OK. But Lucene
> 2.0.0 doesn't work the same way.

I don't see anything that would have caused such a change - below is a colored side-by-side diff of StandardTokenizer.jj at revisions 150560 and 409716, corresponding to the lucene_1_4_2 and lucene_2_0_0 tags, respectively:

<http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_0_0/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj?r1=150560&r2=409716&diff_format=h>

(Note that the JavaCC-targeted StandardTokenizer.jj was replaced at release 2.3.0 by the JFlex-targeted StandardTokenizerImpl.jflex for performance reasons - see <http://issues.apache.org/jira/browse/LUCENE-966>.)

> I see that the registered trademark is in the Lucene index
> file, so that's good. The problem comes when I try to search
> for these characters.
>
> I see that my query starts off OK, as this:
> 
> ( (Locale:en) AND ( productName:(Digital¥^95) ) )    (if you
> cannot see the Japanese Yen symbol, it comes directly after "Digital")
> 
> Note: the "^95" is just a boost factor, and is OK.
> 
> I'm using StandardAnalyzer and StandardTokenizer to create a
> new QueryParser, and after I call the "parse" method of the
> QueryParser, my query becomes this:
> 
>  +Locale:en +productName:digital^95.0
> 
> Notice that the Japanese Yen symbol is gone! I think it's
> because the StandardTokenizer.jj file doesn't handle this
> character, and so it throws it away.
> 
> Is there any way to use a different Analyzer and/or
> Tokenizer, rather than building my own?
> 
> And if I had created my Lucene indexes with the
> StandardAnalyzer, must I use the StandardAnalyzer and
> StandardTokenizer to search the index?

In order for the Yen and Registered Trademark symbols to appear in the index, you must have used a different analyzer for indexing than the one you're using for querying.  This can lead to problems, as you have discovered.
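As a rough illustration of why a tokenizer built around letters and digits would drop these two characters, here is a small plain-Java sketch. It is not Lucene code and does not reproduce the StandardTokenizer grammar; `lettersAndDigitsOnly` is a hypothetical stand-in that keeps only runs of letters and digits. It relies on the fact that U+00A5 and U+00AE are classified by Unicode as symbols, not letters.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only; this is NOT Lucene's StandardTokenizer. */
final class SymbolCategoryDemo {

    /** Hypothetical tokenizer: keeps lowercased runs of letters/digits, discards everything else. */
    static List<String> lettersAndDigitsOnly(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter, non-digit character ends the current token and is dropped.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        char yen = '\u00A5';        // YEN SIGN
        char registered = '\u00AE'; // REGISTERED SIGN

        // Both characters are Unicode symbols, so a letter/digit test rejects them.
        System.out.println(Character.getType(yen) == Character.CURRENCY_SYMBOL);     // true
        System.out.println(Character.getType(registered) == Character.OTHER_SYMBOL); // true
        System.out.println(Character.isLetterOrDigit(yen));                          // false

        // The symbol disappears, matching the shape of the parsed query above.
        System.out.println(lettersAndDigitsOnly("Digital" + yen)); // [digital]
    }
}
```

If the index really does contain terms with these symbols, it was presumably produced by something that splits differently, for example on whitespace only.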

The short answer is: you should use the same analyzer.

The longer answer is that you should use "compatible" analyzers.  "Compatibility" means that the terms produced by the query-time analyzer have corresponding index terms.  Of course, this condition is satisfied by using the same analyzer at both index- and query-time.  An example of compatible, but different, analyzers is a pair in which only one side injects synonyms, at either index time or query time.

I don't know why you weren't seeing this problem with Lucene 1.4.2, but is it possible that the 1.4.2-created index did *not* have these two symbols?  If that were true, then you would get the hits you're looking for, though you might get some others that you don't want.
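The compatibility condition described above can be sketched in plain Java. This is a toy model under stated assumptions, not Lucene's API: `whitespaceTerms` and `letterTerms` are hypothetical stand-ins for two analyzers, the "index" is just a set of terms, and `allTermsIndexed` checks whether every query term has a corresponding index term.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model of analyzer compatibility; the names here are illustrative, not Lucene classes. */
final class AnalyzerCompatibilityDemo {

    /** Hypothetical analyzer 1: lowercase, split on whitespace, keep symbols. */
    static List<String> whitespaceTerms(String text) {
        return nonEmpty(text.toLowerCase(Locale.ROOT).split("\\s+"));
    }

    /** Hypothetical analyzer 2: lowercase, split on anything that is not a letter or digit. */
    static List<String> letterTerms(String text) {
        return nonEmpty(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+"));
    }

    /** Compatibility check: every query term must exist as an index term. */
    static boolean allTermsIndexed(Set<String> indexTerms, List<String> queryTerms) {
        return indexTerms.containsAll(queryTerms);
    }

    private static List<String> nonEmpty(String[] parts) {
        List<String> terms = new ArrayList<>();
        for (String part : parts) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String productName = "Digital\u00A5 Camera";
        String query = "Digital\u00A5";

        // Index keeps the symbol, query analysis drops it: no matching term.
        Set<String> symbolIndex = new HashSet<>(whitespaceTerms(productName));
        System.out.println(allTermsIndexed(symbolIndex, letterTerms(query)));     // false

        // Same analyzer on both sides: the query term is found.
        System.out.println(allTermsIndexed(symbolIndex, whitespaceTerms(query))); // true

        // Symbol dropped on both sides: still a hit, but "digital" would
        // also match any other product whose name contains that word.
        Set<String> plainIndex = new HashSet<>(letterTerms(productName));
        System.out.println(allTermsIndexed(plainIndex, letterTerms(query)));      // true
    }
}
```

The third case corresponds to the guess about the 1.4.2 index: if the symbols were never indexed, the stripped query still finds the product, at the cost of possibly matching others.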

Steve
