Posted to java-user@lucene.apache.org by Indu Abeyaratna <ia...@aconex.com> on 2005/07/28 01:26:22 UTC

Query text Tokenize issue

I have a field indexed as a keyword, with two records: "J400-C-V1-S10-T1" and
"J400-C-V-S10-T1".

When I search for "J400-C-V1-S10-T1", it returns the matching record, but
when I search for "J400-C-V-S10-T1" it doesn't return the matching one.

Further, I found that "J400-C-V-S10-T1" is incorrectly tokenised into "J400-C"
and "V-S10-T1", but nothing like that happens to "J400-C-V1-S10-T1".

This happens when there is a combination like "?-?-", which gets tokenised
into "?" and "?-".

I attached a test case for further clarification.

I am using StandardAnalyzer and the query parser.

Is this a bug in Lucene or JavaCC? Or am I missing something here? Any
suggestions for working around this?

 
 


Re: Query text Tokenize issue

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
It's not a "bug" per se, but rather just how StandardAnalyzer works.
StandardAnalyzer is a general-purpose text analyzer, and cannot
reasonably handle this case while also handling the much more common
scenario of "hyphenated-text" that should be split into separate tokens.
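
To illustrate why the two IDs come out differently, here is a rough model in plain Java. It is not Lucene code. It assumes the tokenizer keeps a hyphen-joined run together only while every other segment contains a digit, which is a simplified reading of the grammar and should be checked against the actual StandardTokenizer source. The class and method names are hypothetical, and the lowercasing that StandardAnalyzer also applies is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model only (an assumption, not the real JavaCC grammar):
// a run of '-'-separated segments stays one token as long as every
// other segment in the run contains a digit.
final class AlternatingDigitModel {

    // True if the segment contains at least one digit.
    static boolean hasDigit(String s) {
        return s.chars().anyMatch(Character::isDigit);
    }

    // Number of segments, starting at 'start', that can be joined into
    // one token. Tries both alignments (digits required at even or at
    // odd positions) and keeps the longer run; a run needs 2+ segments,
    // otherwise the segment stands alone.
    static int longestRun(String[] segs, int start) {
        int best = 1;
        for (int parity = 0; parity < 2; parity++) {
            int n = 0;
            while (start + n < segs.length
                    && (n % 2 != parity || hasDigit(segs[start + n]))) {
                n++;
            }
            if (n >= 2) {
                best = Math.max(best, n);
            }
        }
        return best;
    }

    // Greedy left-to-right tokenization using the longest-run rule.
    static List<String> tokenize(String text) {
        String[] segs = text.split("-");
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < segs.length) {
            int n = longestRun(segs, i);
            tokens.add(String.join("-", Arrays.copyOfRange(segs, i, i + n)));
            i += n;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("J400-C-V1-S10-T1")); // [J400-C-V1-S10-T1]
        System.out.println(tokenize("J400-C-V-S10-T1"));  // [J400-C, V-S10-T1]
    }
}
```

Under this model "C-V" is where the run breaks: neither "C" nor "V" contains a digit, so the alternation cannot continue past "J400-C". That matches the split reported in the original message.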

As Otis said, this is really the job for a custom analyzer.

     Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Query text Tokenize issue

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

I believe your problem is described on page 121 in the Lucene book:
http://www.lucenebook.com/search?query=%22dealing+with+keyword+fields%22

The solution for you may be to write your own Analyzer that knows how
to correctly tokenize or not tokenize certain fields in your index. 
Using PerFieldAnalyzerWrapper may help you here.
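
As a rough sketch of the per-field idea, the plain-Java model below routes each field to its own tokenizing function and falls back to a default for everything else. It does not use the Lucene API. The class PerFieldSketch, its methods, and the field names "docId" and "body" are all hypothetical stand-ins, and the "simple" analyzer is only a crude placeholder for a general-purpose one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the per-field wrapper idea; not Lucene code.
final class PerFieldSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Register a dedicated analyzer for one field.
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Use the field's own analyzer if one is registered, else the default.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Keyword-style analysis: the whole value is one untouched token.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    // Placeholder general-purpose analysis: lowercase, then split on
    // anything that is not a letter or digit, dropping empty pieces.
    static List<String> simple(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        PerFieldSketch analyzer = new PerFieldSketch(PerFieldSketch::simple);
        analyzer.addAnalyzer("docId", PerFieldSketch::keyword);
        System.out.println(analyzer.analyze("docId", "J400-C-V-S10-T1")); // [J400-C-V-S10-T1]
        System.out.println(analyzer.analyze("body", "hyphenated-text"));  // [hyphenated, text]
    }
}
```

The point of the design is that the same analyzer object is handed to both indexing and query parsing, so a keyword field is left untokenized on both sides and the query term lines up with the indexed term.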

Otis


