Posted to java-user@lucene.apache.org by Jeremy Hanna <je...@mac.com> on 2006/04/26 00:53:24 UTC

Alphanumeric model ids

I am trying to search by a number of fields including an alphanumeric  
model id.

This is just the model id that comes from manufacturers.  I've tried
using a StandardAnalyzer and a SnowballAnalyzer to index the data.
Then I search with the same analyzer using a
MultiFieldQueryParser.  Stepping through the attached Lucene source
in the debugger, I see that all a MultiFieldQueryParser does is make a
bunch of queries and link them together in a BooleanQuery with
SHOULD clauses.  I see that it is getting the right field, "model",
and has the right query in there, e.g. "XPHP", but it returns no
results.

When I index it, I do the following:
modelField = new Field("model",
    (product.getModelNumber() == null) ? "" : product.getModelNumber(),
    Field.Store.NO, Field.Index.UN_TOKENIZED);
...
document.add(modelField);
...
indexWriter.addDocument(document);

So it shouldn't be messing with the model id retrieved from the  
database when it puts it in the index (UN_TOKENIZED).

The weird thing is that it finds those model ids that are only  
numeric (including punctuation, e.g. "40603-38").  But it cannot find  
the "XPHP" model id.  On the command line SQL interface, I can do a  
select * from product where model = 'XPHP'; and it comes back with  
the single result.

Anyone have any idea as to why the numeric ones would come up and the  
alphanumeric ones would not find the right values in the index?

Thanks much,
Jeremy

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Alphanumeric model ids

Posted by Jeremy Hanna <je...@mac.com>.
Thanks Chris, it works like a champ now.  I had thought I looked at  
the queries themselves with toString but in any case, the queries  
actually work now.  I didn't realize that Lucene was customizable on  
so many levels - when you create the analyzer, when you create the  
index, when you perform each query.  Kinda cool.

On Apr 25, 2006, at 5:02 PM, Chris Hostetter wrote:

>
> I bet that if you look at the toString() of the query you get back from
> your query parser, you'll see that the non-numeric part numbers have been
> stemmed.
>
> You took the right steps when you indexed the field as UN_TOKENIZED, but
> at query time your query parser doesn't know about that -- take a look at
> the PerFieldAnalyzerWrapper and the KeywordAnalyzer as a way to make sure
> your query parser doesn't do any processing on the terms you search for in
> non-tokenized fields.
>
> Or ... prepare for shameless plug ... check out the Solr project.  Solr
> adds a very flexible schema layer on top of Lucene that lets you specify
> field types, and map fields (explicitly named or dynamically created based
> on field name patterns) to field types -- every field type can have a
> different analyzer, two actually: one used when indexing and one used
> when querying...
> 	http://incubator.apache.org/solr/
>
> -Hoss


Re: Alphanumeric model ids

Posted by Chris Hostetter <ho...@fucit.org>.
I bet that if you look at the toString() of the query you get back from
your query parser, you'll see that the non-numeric part numbers have been
stemmed.

You took the right steps when you indexed the field as UN_TOKENIZED, but
at query time your query parser doesn't know about that -- take a look at
the PerFieldAnalyzerWrapper and the KeywordAnalyzer as a way to make sure
your query parser doesn't do any processing on the terms you search for in
non-tokenized fields.
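
To illustrate why the numeric ids match and "XPHP" does not, here is a
minimal, Lucene-free sketch in plain Java.  It assumes the query-time
analyzer lowercases terms (as StandardAnalyzer does) and leaves a token
like "40603-38" intact, while the UN_TOKENIZED field keeps "XPHP"
verbatim.  The class ModelIdLookup, its method names, and the LOWERCASE
and KEYWORD functions are hypothetical stand-ins, not Lucene APIs; the
per-field override map only mimics the idea behind a
PerFieldAnalyzerWrapper with a KeywordAnalyzer registered for "model".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

class ModelIdLookup {
    // Hypothetical stand-in for a query-time analyzer that lowercases terms.
    static final UnaryOperator<String> LOWERCASE = s -> s.toLowerCase(Locale.ROOT);

    // Hypothetical stand-in for a keyword-style analyzer: the term passes through untouched.
    static final UnaryOperator<String> KEYWORD = s -> s;

    // Terms of the "model" field, stored verbatim as an UN_TOKENIZED field would store them.
    static Set<String> indexedModelTerms() {
        return new HashSet<>(Arrays.asList("XPHP", "40603-38"));
    }

    // Analyze the query text with the per-field override if one exists,
    // otherwise with the default analyzer, then look for an exact term match.
    static boolean found(String field, String queryText,
                         UnaryOperator<String> defaultAnalyzer,
                         Map<String, UnaryOperator<String>> perFieldOverrides) {
        UnaryOperator<String> analyzer = perFieldOverrides.getOrDefault(field, defaultAnalyzer);
        return indexedModelTerms().contains(analyzer.apply(queryText));
    }

    public static void main(String[] args) {
        Map<String, UnaryOperator<String>> none =
                Collections.<String, UnaryOperator<String>>emptyMap();
        Map<String, UnaryOperator<String>> keywordForModel =
                Collections.singletonMap("model", KEYWORD);

        // One analyzer for every field: "XPHP" becomes "xphp" and misses the stored "XPHP".
        System.out.println(found("model", "XPHP", LOWERCASE, none));            // false
        // Lowercasing does not change a purely numeric id, so it still matches.
        System.out.println(found("model", "40603-38", LOWERCASE, none));        // true
        // With a keyword-style override for "model", the query term is left alone.
        System.out.println(found("model", "XPHP", LOWERCASE, keywordForModel)); // true
    }
}
```

With real Lucene the equivalent step would be to hand the wrapper, rather
than the plain analyzer, to the MultiFieldQueryParser; check the javadocs
for your version for the exact constructors.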

Or ... prepare for shameless plug ... check out the Solr project.  Solr
adds a very flexible schema layer on top of Lucene that lets you specify
field types, and map fields (explicitly named or dynamically created based
on field name patterns) to field types -- every field type can have a
different analyzer, two actually: one used when indexing and one used
when querying...
	http://incubator.apache.org/solr/


-Hoss

