You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Terry Steichen <te...@net-frame.com> on 2002/11/25 18:27:29 UTC

Is "id" a special case?

I've encountered some very puzzling Lucene behavior (I'm using 1.3dev1, StandardAnalyzer, QueryParser).  

My indexed documents have, among other fields, two Text fields (indexed, tokenized, stored) called "pub_date" and "id".  These two fields have similar values.  A typical pub_date value is "20021121" and a typical id value is "20021121_4477".

When I search on pub_date, everything is normal.  The search for the query "pub_date:20021121" responds in less than a second with about 200 hits (only 25 of which are displayed).  Changing the query to "pub_date:2002112*" generates about 1000 hits with about the same response time.  Changing it to "pub_date:200211*" increases the hits to about 5000 with little increase in response time.  Finally, using "pub_date:2002*" generates about 35,000 hits and requires about two seconds or so.

However,..... when I try the same kinds of searches the 'id' field, the results are *very* different.  

Searching against the query "id:20021121*" generates the same approximately 200 hits in less than a second.  Using the query "id:2002112*" generates about 1000 hits, but it takes about two seconds.  Changing the query to "id:200211*" produces about 500 hits, but requires about seven seconds.  Generalizing the query to "id:2002*" causes my application to crash with an out of memory error.

As mentioned above, the two fields are very similar in content and are indexed in the same way.  One behaves as expected but the other one doesn't work well at all.

Any ideas on what's going on here?  Is "id" some kind of reserved Lucene word?

Regards,

Terry



Re: Is "id" a special case?

Posted by Terry Steichen <te...@net-frame.com>.
I was in error about the structures.  The "pub_date" field is indexed,
stored but *not* tokenized.  When I changed the "id" field to that form, it
then appeared to worked fine (based on a very small sample - will test more
soon).

Can anyone help me understand why the search takes such a huge amount of
memory and processing time if the field is tokenized?

Regards,

Terry

----- Original Message -----
From: "Terry Steichen" <te...@net-frame.com>
To: "Lucene Users Group" <lu...@jakarta.apache.org>
Sent: Monday, November 25, 2002 12:27 PM
Subject: Is "id" a special case?


I've encountered some very puzzling Lucene behavior (I'm using 1.3dev1,
StandardAnalyzer, QueryParser).

My indexed documents have, among other fields, two Text fields (indexed,
tokenized, stored) called "pub_date" and "id".  These two fields have
similar values.  A typical pub_date value is "20021121" and a typical id
value is "20021121_4477".

When I search on pub_date, everything is normal.  The search for the query
"pub_date:20021121" responds in less than a second with about 200 hits (only
25 of which are displayed).  Changing the query to "pub_date:2002112*"
generates about 1000 hits with about the same response time.  Changing it to
"pub_date:200211*" increases the hits to about 5000 with little increase in
response time.  Finally, using "pub_date:2002*" generates about 35,000 hits
and requires about two seconds or so.

However,..... when I try the same kinds of searches the 'id' field, the
results are *very* different.

Searching against the query "id:20021121*" generates the same approximately
200 hits in less than a second.  Using the query "id:2002112*" generates
about 1000 hits, but it takes about two seconds.  Changing the query to
"id:200211*" produces about 500 hits, but requires about seven seconds.
Generalizing the query to "id:2002*" causes my application to crash with an
out of memory error.

As mentioned above, the two fields are very similar in content and are
indexed in the same way.  One behaves as expected but the other one doesn't
work well at all.

Any ideas on what's going on here?  Is "id" some kind of reserved Lucene
word?

Regards,

Terry





--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>