Posted to user@lucenenet.apache.org by Andy Berryman <to...@gmail.com> on 2006/11/06 14:37:34 UTC

Need some help understanding what the "StandardAnalyzer" is doing here ...

I have an index with a Field named "SKU" which is a "Text" type.  I'm using
the "StandardAnalyzer" for indexing and searching.  I'm using "Luke" (
http://www.getopt.org/luke/luke.jnlp) to do some testing for this problem
and to allow me to see how Lucene is parsing the query etc.  If I provide
the search expression as ... *SKU:andyb-test-item-001* ... Lucene is parsing
that to ... *SKU:"andyb test item-001"*.  So my question is ... Why are the
dashes between "andyb", "test", and "item" being removed but not the one
between "item" and "001"?

Thanks
Andy

Re: Need some help understanding what the "StandardAnalyzer" is doing here ...

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Nov 6, 2006, at 8:37 AM, Andy Berryman wrote:

> I have an index with a Field named "SKU" which is a "Text" type.  I'm using
> the "StandardAnalyzer" for indexing and searching.  I'm using "Luke" (
> http://www.getopt.org/luke/luke.jnlp) to do some testing for this problem
> and to allow me to see how Lucene is parsing the query etc.  If I provide
> the search expression as ... *SKU:andyb-test-item-001* ... Lucene is parsing
> that to ... *SKU:"andyb test item-001"*.  So my question is ... Why are the
> dashes between "andyb", "test", and "item" being removed but not the one
> between "item" and "001"?

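The StandardAnalyzer is designed to attempt to be clever with part
numbers, IDs, and such, that intermix alphas and numerics, like R2D2
and C-3PO.  So a hyphenated piece that contains digits, such as
"item-001", is kept together as a single token, while hyphens between
purely alphabetic pieces are treated as separators.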

	Erik
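
To make the behaviour above concrete, here is a minimal sketch in plain
Python.  It is a simplified model, not Lucene's actual tokenizer: it assumes
that a hyphen-separated run stays together only when every other piece in the
run contains a digit, and it handles hyphens only.  The function names
(has_digit, is_joined_run, tokenize) are made up for this illustration.

```python
def has_digit(segment):
    """Return True if the segment contains at least one digit."""
    return any(ch.isdigit() for ch in segment)


def is_joined_run(segments):
    """Assumed rule: a run of two or more hyphen-separated pieces stays one
    token when every other piece (the odd ones or the even ones) has a digit."""
    if len(segments) < 2:
        return False
    even = segments[0::2]
    odd = segments[1::2]
    return all(has_digit(s) for s in odd) or all(has_digit(s) for s in even)


def tokenize_word(word):
    """Split one whitespace-free word into tokens, longest match first."""
    segments = [s for s in word.split("-") if s]
    tokens = []
    i = 0
    while i < len(segments):
        # Try the longest run starting at i, shrinking until one qualifies
        # or only a single piece is left.
        end = len(segments)
        while end > i + 1 and not is_joined_run(segments[i:end]):
            end -= 1
        tokens.append("-".join(segments[i:end]).lower())
        i = end
    return tokens


def tokenize(text):
    """Tokenize whitespace-separated text with the simplified rule above."""
    tokens = []
    for word in text.split():
        tokens.extend(tokenize_word(word))
    return tokens


print(tokenize("andyb-test-item-001"))
```

Under this model the SKU from the question yields the three tokens andyb,
test and item-001, which lines up with the phrase query that Luke displays.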

RE: Need some help understanding what the "StandardAnalyzer" is doing here ...

Posted by George Aroush <ge...@aroush.net>.

Hi Andy,

What's happening here is that the data is getting analyzed and tokenized.
You need to stop that from happening on this field.  Two solutions that
Lucene.Net offers come to mind: use a per-field analyzer and tokenizer, or
(this is easier) store the field as non-tokenized.

Regards,

-- George Aroush
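
The per-field idea can be sketched in plain Python as follows.  This is only
a conceptual model and does not use the Lucene.Net API: PerFieldAnalyzer,
standard_like and keyword are hypothetical names, and standard_like is a
crude stand-in for an analyzer that splits on hyphens.  In Lucene itself the
corresponding pieces are, as far as I know, PerFieldAnalyzerWrapper combined
with KeywordAnalyzer, or a field indexed as untokenized.

```python
import re


def standard_like(text):
    """Stand-in for an analyzing tokenizer: lowercases the text and splits
    it on whitespace and hyphens."""
    return [t.lower() for t in re.split(r"[\s-]+", text) if t]


def keyword(text):
    """Treat the whole value as one token, untouched (non-tokenized)."""
    return [text]


class PerFieldAnalyzer:
    """Hypothetical helper: pick an analyzer function by field name and
    fall back to a default for every other field."""

    def __init__(self, default, per_field=None):
        self.default = default
        self.per_field = dict(per_field or {})

    def analyze(self, field, text):
        analyzer = self.per_field.get(field, self.default)
        return analyzer(text)


# SKU values are kept whole; all other fields use the default analyzer.
analyzer = PerFieldAnalyzer(standard_like, {"SKU": keyword})

print(analyzer.analyze("SKU", "andyb-test-item-001"))
print(analyzer.analyze("Description", "Andy's test item"))
```

Whichever option is used, the same treatment has to apply at query time: if
the SKU is indexed as a single token but the query text still goes through
an analyzer that splits it, the terms will not match.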

-----Original Message-----
From: Andy Berryman [mailto:topdev1@gmail.com] 
Sent: Monday, November 06, 2006 8:38 AM
To: lucene-net-user@incubator.apache.org;
lucene-net-dev@incubator.apache.org
Subject: Need some help understanding what the "StandardAnalyzer" is doing
here ...

I have an index with a Field named "SKU" which is a "Text" type.  I'm using
the "StandardAnalyzer" for indexing and searching.  I'm using "Luke" (
http://www.getopt.org/luke/luke.jnlp) to do some testing for this problem
and to allow me to see how Lucene is parsing the query etc.  If I provide
the search expression as ... *SKU:andyb-test-item-001* ... Lucene is parsing
that to ... *SKU:"andyb test item-001"*.  So my question is ... Why are the
dashes between "andyb", "test", and "item" being removed but not the one
between "item" and "001"?

Thanks
Andy
