Posted to java-user@lucene.apache.org by Doug Cutting <cu...@lucene.com> on 2003/06/04 18:58:40 UTC

Re: search item with '-' in it

You should look at the output of your analyzer.  Just write a simple 
test program, something like:

   public static void main(String[] args) throws Exception {
     System.out.println("Tokenizing " + args[0]);
     Analyzer analyzer = new MyAnalyzer(...);
     TokenStream ts = analyzer.tokenStream(new StringReader(args[0]));
     Token token;
     while ((token = ts.next()) != null) {
       System.out.println("Token: " + token.termText());
     }
   }

StandardAnalyzer will accept hyphenations when digits are included on 
one side or the other.  This is a heuristic used to index things like 
part numbers (which contain digits) as a single word but not index 
things like "long-hyphenated-phrase" as a single word.  It may not be 
appropriate for your application.
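
To make the heuristic concrete, here is a rough, self-contained sketch. It does not use Lucene and is not StandardTokenizer's actual grammar; it is a toy approximation that assumes a hyphen is kept inside a token only when the segment on one side of it contains a digit. The class name HyphenHeuristic and its methods are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy approximation of the "digits on one side or the other" rule.
// Hypothetical helper, NOT Lucene code and NOT the real grammar.
public class HyphenHeuristic {

    // True if the segment contains at least one digit.
    static boolean hasDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Split on whitespace, then decide hyphen by hyphen: keep the hyphen
    // when the segment to its left or right has a digit, otherwise split.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            StringBuilder current = new StringBuilder();
            String previous = null;
            for (String part : word.split("-")) {
                if (part.isEmpty()) {
                    continue; // skip empty pieces from doubled or leading hyphens
                }
                if (previous != null && (hasDigit(previous) || hasDigit(part))) {
                    current.append('-').append(part);
                } else {
                    if (current.length() > 0) {
                        tokens.add(current.toString());
                    }
                    current = new StringBuilder(part);
                }
                previous = part;
            }
            if (current.length() > 0) {
                tokens.add(current.toString());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("SG-XRRH-C1M0-A"));         // [SG, XRRH-C1M0-A]
        System.out.println(tokenize("long-hyphenated-phrase")); // [long, hyphenated, phrase]
    }
}
```

Under this assumed rule, "SG" is split off because neither "SG" nor "XRRH" contains a digit, while "XRRH-C1M0-A" stays together because "C1M0" does. The real tokenizer may differ in edge cases, so the test program above remains the authoritative check.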

Also, a part number field might better be indexed as a keyword field...

Doug

Lixin Meng wrote:
> I have a field, 'PartNumber', that has '-' in its value (e.g.
> SG-XRRH-C1M0-A).
> 
> After indexing, I can perform certain queries. However, I am unable to
> explain the behavior.
> 
> - if searching for
> 	PartNumber:"SG"
>   it will return multiple hits. I assume the analyzer might take out '-'.
> 
> - if searching for
> 	PartNumber:"XRRH"
>   it will return no hit. So, the above assumption doesn't hold. :)
> 
> - if searching for
> 	PartNumber:"SG-XRRH-C1M0-A"
>   it will return one hit
> 
> - if searching for
>       PartNumber:"sg-xrrh-c1m0-a*"
>   it will return one hit. So far so good
> 
> - if searching for
>       PartNumber:sg-xrrh-c1m0-a*
>   it will return multiple hits which even include things like
> "SG-XSWBRO...". Why?
> 
> - if searching for
>       PartNumber:"sg-xrrh-c1m0*"
>   no hit. Why?
> 
> Any comments?
> 
> Regards,
> Lixin
> 
> P.S. I used the following filters
> 
>     result = new StandardFilter(result);
>     result = new LowerCaseFilter(result);
>     result = new StopFilter(result, m_StopWordTable);
>     result = new PorterStemFilter(result);
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 

Re: search item with '-' in it

Posted by Eric Jain <Er...@isb-sib.ch>.

> If we change StandardTokenizer in this way then we risk breaking all
> the applications that currently use it and depend on its current
> behaviour.

My personal issue with the StandardTokenizer is that it splits off
single letter prefixes, as in 't-shirt'. A query for 't-shirt' therefore
also returns documents with 't. miller's shirt'. I can't imagine how
this behavior could ever be considered useful or depended upon, but I
may be wrong (perhaps someone has an example where it does make sense).

--
Eric Jain


Re: search item with '-' in it

Posted by Doug Cutting <cu...@lucene.com>.

Lixin Meng wrote:
> Therefore, it would be preferable to treat all hyphens in the same way:
> either as a delimiter or as part of the word (maybe with a flag in the API).

If we change StandardTokenizer in this way then we risk breaking all the 
applications that currently use it and depend on its current behaviour. 
So I'm reluctant to make this change.

From the StandardTokenizer documentation:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html

"Many applications have specific tokenizer needs. If this tokenizer does 
not suit your application, please consider copying this source code 
directory to your project and maintaining your own grammar-based tokenizer."

Also, if you construct a tokenizer that you think is more generally 
useful than StandardTokenizer, please contribute it by mailing it to one 
of the Lucene mailing lists.

Thanks,

Doug

RE: search item with '-' in it

Posted by Lixin Meng <li...@fulldegree.com>.

Thanks for the tip. The analyzer does tokenize "SG-XRRH-C1M0-A" into 'SG'
and 'XRRH-C1M0-A'.

The approach of 'accepting hyphenations when digits are included on one side
or the other' is indeed a 'heuristic' :).

I might consider your suggestion of using a keyword field. However, in the
more general case, where a block of text contains hyphenated words, the
keyword workaround doesn't apply.

Therefore, it would be preferable to treat all hyphens in the same way:
either as a delimiter or as part of the word (maybe with a flag in the API).
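
To illustrate the kind of flag I have in mind, here is a minimal sketch. It is plain Java and not a Lucene API; the class UniformHyphenTokenizer and the hyphenIsDelimiter parameter are hypothetical names, and a real implementation would presumably live in a custom grammar-based tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: one flag decides how every hyphen is handled.
public class UniformHyphenTokenizer {

    // hyphenIsDelimiter == true  -> split on whitespace and on hyphens
    // hyphenIsDelimiter == false -> split on whitespace only, hyphens stay in the word
    static List<String> tokenize(String text, boolean hyphenIsDelimiter) {
        String separator = hyphenIsDelimiter ? "[\\s-]+" : "\\s+";
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split(separator)) {
            if (!token.isEmpty()) { // drop empty pieces from leading separators
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("SG-XRRH-C1M0-A", true));  // [SG, XRRH, C1M0, A]
        System.out.println(tokenize("SG-XRRH-C1M0-A", false)); // [SG-XRRH-C1M0-A]
    }
}
```

Either setting would at least be predictable: the same input always splits at every hyphen or at none, regardless of where the digits fall.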

Again, thanks for all the help.

Regards,
Lixin
