Posted to solr-user@lucene.apache.org by Ma...@ibsbe.be on 2007/03/16 11:07:53 UTC

Bug ? unique id

Hello,

we have been using Solr for a month now and we are running into a lot of
trouble.

one of the issues is a problem with the unique id field.

can this field have an analyzer, filters and tokenizers on it?

because when we use filters or tokenizers on our unique id field, we get
duplicate ids.

thanks in advance,
maarten

Re: Bug ? unique id

Posted by Ma...@ibsbe.be.

ok, i'm starting to see the light :))

at this moment, we are using this field definition for our unique id:
<field name="id" type="idstring" stored="true"/>
and everything is working well ...

so i don't explicitly say indexed="true" ... i guess indexed defaults to
true ...

i'll be sure to do some testing with stored=false and indexed=false,
but that'll be for next week when i start optimizing.
i'll be sure to mail you the results of the testing.

thanks again,m
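
For reference, a minimal sketch of the declaration above with indexed written out instead of left to a default. The definition of the "idstring" type is an assumption: the thread never shows it, and a plain solr.StrField (no analysis, the whole value indexed as one term) is only one plausible choice. Element-name casing (fieldtype vs fieldType) differs between Solr releases, so check the example schema shipped with your version.

```xml
<!-- Hypothetical: the thread does not show how "idstring" is declared.
     solr.StrField indexes the value verbatim as a single term. -->
<fieldType name="idstring" class="solr.StrField"/>

<!-- Same field as in the message, with indexed spelled out explicitly. -->
<field name="id" type="idstring" indexed="true" stored="true"/>

<uniqueKey>id</uniqueKey>
```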





Chris Hostetter <ho...@fucit.org> 
19/03/2007 20:30
Please respond to
solr-user@lucene.apache.org


To
solr-user@lucene.apache.org
cc

Subject
Re: Bug ? unique id







: it would maybe be a good idea to have Lucene check the *stored* value for
: duplicate keys ... that seems so much more logical to me !
: (imho, it makes no sense to check the *indexed* value for duplicate keys,
: but maybe there is a reason ?)

it's probably a terminology issue ... stored fields are nothing more than
payloads ... Lucene (and Solr) don't do anything with them but hang on to
them for you and return them to you later.

the "indexed" value is the value that matters in the "index" ... it's the
one that searches/lookups and sorting can be performed on.

: or maybe give us the option to choose whether Lucene should check the
: *stored* or *indexed* value for duplicate keys.

if Solr were to try and deal with your uniqueKey using the stored value,
it would have to do the same copyField stuff under the covers in order
for that stored value to be "indexed" in a way it can see it.

: since we now use a copyfield to perform searches on the IDs, there is no
: more reason to index our unique key field ....
: what would happen if I set indexed=false on my unique id field ??

it wouldn't work at all ... as i said, the indexed value is all that Solr
really cares about -- i think you could probably mark your uniqueKey as
stored=false, but if it's indexed=false then at best you'll get a nice
error telling you it must be indexed, and at worst it will crash and burn
in a non-obvious way -- possibly silently.

(if you want to try it out, and the latter happens, please file a bug; we
should definitely have a nice error message in that case)




-Hoss

Re: Bug ? unique id

Posted by Chris Hostetter <ho...@fucit.org>.

: it would maybe be a good idea to have Lucene check the *stored* value for
: duplicate keys ... that seems so much more logical to me !
: (imho, it makes no sense to check the *indexed* value for duplicate keys,
: but maybe there is a reason ?)

it's probably a terminology issue ... stored fields are nothing more than
payloads ... Lucene (and Solr) don't do anything with them but hang on to
them for you and return them to you later.

the "indexed" value is the value that matters in the "index" ... it's the
one that searches/lookups and sorting can be performed on.

: or maybe give us the option to choose whether Lucene should check the
: *stored* or *indexed* value for duplicate keys.

if Solr were to try and deal with your uniqueKey using the stored value,
it would have to do the same copyField stuff under the covers in order
for that stored value to be "indexed" in a way it can see it.

: since we now use a copyfield to perform searches on the IDs, there is no
: more reason to index our unique key field ....
: what would happen if I set indexed=false on my unique id field ??

it wouldn't work at all ... as i said, the indexed value is all that Solr
really cares about -- i think you could probably mark your uniqueKey as
stored=false, but if it's indexed=false then at best you'll get a nice
error telling you it must be indexed, and at worst it will crash and burn
in a non-obvious way -- possibly silently.

(if you want to try it out, and the latter happens, please file a bug; we
should definitely have a nice error message in that case)




-Hoss
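
A minimal sketch of the variant discussed in this message, assuming the "idstring" type from earlier in the thread: the uniqueKey field stays indexed, and only stored is switched off. Whether stored="false" actually works for a uniqueKey is something Hoss only guesses at ("i think you could probably"), so treat this as something to test rather than a recommendation.

```xml
<!-- indexed must stay true: the indexed term is what the uniqueKey
     lookup (delete/replace) relies on, per the explanation above. -->
<!-- stored="false" is the untested variant; with it the id value
     would no longer be returned in search results. -->
<field name="id" type="idstring" indexed="true" stored="false"/>

<uniqueKey>id</uniqueKey>
```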

Re: Bug ? unique id

Posted by Ma...@ibsbe.be.

thanks for your reply... it kind of solved our problem!

we were in fact using tokenizers that produce multiple tokens ...
so i guess there is no other way for us than to use the copyField
workaround.

it would maybe be a good idea to have Lucene check the *stored* value for
duplicate keys ... that seems so much more logical to me!
(imho, it makes no sense to check the *indexed* value for duplicate keys,
but maybe there is a reason?)
or maybe give us the option to choose whether Lucene should check the
*stored* or *indexed* value for duplicate keys.

it is really confusing to get duplicate unique key *stored* values back
from the server ... (and kind of frustrating)

since we now use a copyField to perform searches on the IDs, there is no
more reason to index our unique key field ...
what would happen if I set indexed=false on my unique id field?

Maarten :-)





Chris Hostetter <ho...@fucit.org> 
16/03/2007 19:14
Please respond to
solr-user@lucene.apache.org


To
solr-user@lucene.apache.org
cc

Subject
Re: Bug ? unique id







: but can someone please answer my question :'(
: is it illegal to put filters on the unique id ?
: or is it a bug that we get duplicate id's?
: or is this a known issue (since everybody is using copyfields?)

there's nothing illegal about using an Analyzer on your uniqueKey, but you
have to ensure that your Analyzer:
  1) never produces multiple tokens (eg: KeywordTokenizer is fine)
  2) never produces duplicate output for different (legal) input.

...so if your dataset can legally contain two different documents
whose keys are "foo bar" and "Foo Bar" you certainly wouldn't want
to use a Whitespace or StandardTokenizer -- but you also wouldn't ever want
to use the LowerCaseFilter.

If however you really wanted to ignore all punctuation in keys when
clients upload documents to you, and trust that doc "1234-56-7890" is the
same as doc "1234567890" then something like the pattern stripping filter
would be fine.

the thing to understand is that it's the *indexed* value of the uniqueKey
that must be unique in order for Solr to do things properly ... it has to
be able to search on that uniqueKey term to delete/replace a doc properly.


-Hoss

Re: Bug ? unique id

Posted by Chris Hostetter <ho...@fucit.org>.

: but can someone please answer my question :'(
: is it illegal to put filters on the unique id ?
: or is it a bug that we get duplicate id's?
: or is this a known issue (since everybody is using copyfields?)

there's nothing illegal about using an Analyzer on your uniqueKey, but you
have to ensure that your Analyzer:
  1) never produces multiple tokens (eg: KeywordTokenizer is fine)
  2) never produces duplicate output for different (legal) input.

...so if your dataset can legally contain two different documents
whose keys are "foo bar" and "Foo Bar" you certainly wouldn't want
to use a Whitespace or StandardTokenizer -- but you also wouldn't ever want
to use the LowerCaseFilter.

If however you really wanted to ignore all punctuation in keys when
clients upload documents to you, and trust that doc "1234-56-7890" is the
same as doc "1234567890" then something like the pattern stripping filter
would be fine.

the thing to understand is that it's the *indexed* value of the uniqueKey
that must be unique in order for Solr to do things properly ... it has to
be able to search on that uniqueKey term to delete/replace a doc properly.


-Hoss
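
The two rules above could be sketched in schema.xml roughly as follows. This is an illustration under assumptions, not a tested configuration: the type name "id_normalized" is made up, "the pattern stripping filter" is taken to mean solr.PatternReplaceFilterFactory (whose availability depends on your Solr version), and the regular expression is only one way to drop punctuation.

```xml
<!-- Hypothetical type name. KeywordTokenizerFactory emits the whole
     input as a single token, which satisfies rule 1. -->
<fieldType name="id_normalized" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Removes every character that is not an ASCII letter or digit,
         so "1234-56-7890" and "1234567890" index as the same term.
         Only acceptable if those two keys really are meant to be the
         same document; otherwise this violates rule 2. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[^A-Za-z0-9]" replacement="" replace="all"/>
  </analyzer>
</fieldType>

<field name="id" type="id_normalized" indexed="true" stored="true"/>

<uniqueKey>id</uniqueKey>
```

Note that no LowerCaseFilter is included: with one, "foo bar" and "Foo Bar" would collapse into a single indexed term, which is exactly the case the message warns about.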

Re: Bug ? unique id

Posted by Ma...@ibsbe.be.

yes, that is exactly what we are doing now ... copyField with the filters
... we figured that much :)

but we are talking about a couple of million records, so the less data we
copy the better ...

but can someone please answer my question :'(
is it illegal to put filters on the unique id?
or is it a bug that we get duplicate ids?
or is this a known issue (since everybody is using copyFields?)

thanks for all your replies!

grts,m

"Paul Borgermans" <pa...@gmail.com> 
16/03/2007 16:12
Please respond to
solr-user@lucene.apache.org

To
solr-user@lucene.apache.org
cc

Subject
Re: Bug ? unique id

Hi Maarten

Why not copy your unique id into another field with the required filters and
use that for search?

Regards
Paul

On 3/16/07, Maarten.De.Vilder@ibsbe.be <Ma...@ibsbe.be> wrote:
>
> because we want to be able to search our unique id's :)
> and we would like to use the Latin character filter and the Lowercase
> filter so our searches dont have to be case sensitive and stuff.
>
> thanks for the quick response!
>
> grts,m
>
>
>
>
> Erik Hatcher <er...@ehatchersolutions.com>
> 16/03/2007 12:09
> Please respond to
> solr-user@lucene.apache.org
>
>
> To
> solr-user@lucene.apache.org
> cc
>
> Subject
> Re: Bug ? unique id
>
>
>
>
>
>
> Why in the world would you want to analyze your unique id?
>
>                  Erik
>
>
> On Mar 16, 2007, at 6:07 AM, Maarten.De.Vilder@ibsbe.be wrote:
>
> > Hello,
> >
> > we have been using Solr for a month now and we are running into a
> > lot of
> > trouble .
> >
> > one of the issues is a problem with the unique id field.
> >
> > can this field have analyzer, filters and tokenizers on it ??
> >
> > because when we use filters or tokenizers on our unique id field,
> > we get
> > duplicate id's.
> >
> > thanks in advance,
> > maarten
>
>
>

Re: Bug ? unique id

Posted by Paul Borgermans <pa...@gmail.com>.

Hi Maarten

Why not copy your unique id into another field with the required filters and
use that for search?

Regards
Paul
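
A minimal sketch of the copyField approach suggested here, under a few assumptions: the names "id_search" and "id_text" are invented, "idstring" is the unanalyzed type mentioned elsewhere in the thread, and the "Latin character filter" is taken to be solr.ISOLatin1AccentFilterFactory. The uniqueKey field itself stays unanalyzed; only the copy is normalized for searching.

```xml
<!-- Hypothetical search-only type: accents folded, case ignored.
     Any tokenizer could be used here, since uniqueness is not
     enforced on this field. -->
<fieldType name="id_text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The uniqueKey keeps its exact, unanalyzed value. -->
<field name="id" type="idstring" indexed="true" stored="true"/>

<!-- Search copy: indexed but not stored, to limit the extra data
     kept for each record. -->
<field name="id_search" type="id_text" indexed="true" stored="false"/>

<copyField source="id" dest="id_search"/>

<uniqueKey>id</uniqueKey>
```

Case-insensitive id searches would then go against id_search rather than id.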


RE: Bug ? unique id

Posted by "Gunther, Andrew" <Gu...@si.edu>.

Why not use CopyField and put an analyzer on that field?

-----Original Message-----
From: Maarten.De.Vilder@ibsbe.be [mailto:Maarten.De.Vilder@ibsbe.be] 
Sent: Friday, March 16, 2007 10:54 AM
To: solr-user@lucene.apache.org
Subject: Re: Bug ? unique id

because we want to be able to search our unique id's :)
and we would like to use the Latin character filter and the Lowercase 
filter so our searches dont have to be case sensitive and stuff.

thanks for the quick response!

grts,m


Re: Bug ? unique id

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Mar 16, 2007, at 10:54 AM, Maarten.De.Vilder@ibsbe.be wrote:
> because we want to be able to search our unique id's :)
> and we would like to use the Latin character filter and the Lowercase
> filter so our searches dont have to be case sensitive and stuff.

that seems reasonable, however you'd have to be sure you wouldn't
normalize an id into a conflict with another one. it would seem to
me that the client should control the unique id completely, but i can
certainly see a case for what you're asking for also. but, no, i
don't have an answer to your original question, sorry.

	Erik
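
To make the conflict described above concrete, here is a hypothetical update message with two invented ids. If the uniqueKey field were lowercased at index time, both ids would index as the single term "foo bar", so as far as Solr can tell the second document would be the same document as the first.

```xml
<!-- Hypothetical documents; the ids and the "title" field are
     invented for illustration. -->
<add>
  <doc>
    <field name="id">Foo Bar</field>
    <field name="title">first document</field>
  </doc>
  <doc>
    <!-- Differs from the id above only by case. -->
    <field name="id">foo bar</field>
    <field name="title">second document</field>
  </doc>
</add>
```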

Re: Bug ? unique id

Posted by Ma...@ibsbe.be.

because we want to be able to search our unique id's :)
and we would like to use the Latin character filter and the Lowercase 
filter so our searches dont have to be case sensitive and stuff.

thanks for the quick response!

grts,m


Re: Bug ? unique id

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

Why in the world would you want to analyze your unique id?

	Erik


On Mar 16, 2007, at 6:07 AM, Maarten.De.Vilder@ibsbe.be wrote:

> Hello,
>
> we have been using Solr for a month now and we are running into a  
> lot of
> trouble .
>
> one of the issues is a problem with the unique id field.
>
> can this field have analyzer, filters and tokenizers on it ??
>
> because when we use filters or tokenizers on our unique id field,  
> we get
> duplicate id's.
>
> thanks in advance,
> maarten