Posted to solr-user@lucene.apache.org by Thomas Dowling <td...@ohiolink.edu> on 2012/03/02 20:23:33 UTC

Help with duplicate unique IDs

In a Solr index of journal articles, I thought I was safe reindexing 
articles because their unique ID would cause the new record in the index 
to overwrite the old one. (As stated at 
http://wiki.apache.org/solr/SchemaXml#The_Unique_Key_Field - right?)

My schema.xml includes:

<fields>...
<field name="id" type="string" indexed="true" stored="true"
   required="true"/>
...</fields>

And:

<uniqueKey>id</uniqueKey>

And yet I can compose a query with two hits in the index, showing:

#1: <str name="id">03405443/v66i0003/347_mrirtaitmbpa</str>
#2: <str name="id">03405443/v66i0003/347_mrirtaitmbpa</str>


Can anyone give pointers on where I'm screwing something up?


Thomas Dowling
thomas.dowling@gmail.com

Re: Help with duplicate unique IDs

Posted by Erick Erickson <er...@gmail.com>.
Thomas:

It's *vaguely* possible that you have, say, a space in one of those
keys. The string type is pretty stupid about things like that: it takes the
raw input and does *nothing* to it.
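
A stray space is invisible in the XML response, so one way to check is to
print each id with its whitespace made explicit. A minimal Python sketch,
not part of the original thread; the helper name describe_id and the second
sample value (with a trailing space) are hypothetical:

```python
import unicodedata


def describe_id(raw_id):
    """Return (position, character name) for every whitespace or
    non-printable character in raw_id, so hidden differences show up."""
    suspicious = []
    for position, char in enumerate(raw_id):
        if char.isspace() or not char.isprintable():
            # Fall back to the code point when the character has no name.
            name = unicodedata.name(char, "UNNAMED U+%04X" % ord(char))
            suspicious.append((position, name))
    return suspicious


# Hypothetical sample values: the second id carries a trailing space.
first = "03405443/v66i0003/347_mrirtaitmbpa"
second = "03405443/v66i0003/347_mrirtaitmbpa "

print(repr(first), describe_id(first))
print(repr(second), describe_id(second))
print("identical:", first == second)
```

If the two stored ids differ in this way, the string type would treat them
as two distinct keys.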

I'm assuming that you don't have shards. If you have shards it's
possible you indexed the same uniqueKey to different shards.

FWIW, the string type is the source of considerable confusion.
Personally, I prefer my own analysis chain of something
like KeywordTokenizerFactory, LowerCaseFilterFactory, and
TrimFilterFactory rather than string. I've even, on occasion,
thrown in things that collapse multiple spaces into one underscore.
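
The effect of such a chain can be approximated on the client before a
document is sent to Solr. The Python sketch below is an illustration only,
not Solr's analysis code: the function name normalize_key is hypothetical,
and it assumes that trimming, lowercasing, and collapsing whitespace runs
into one underscore is the normalization wanted.

```python
import re


def normalize_key(raw_key):
    """Trim, lowercase, and collapse internal whitespace runs into one
    underscore, roughly mirroring the analysis chain described above."""
    trimmed = raw_key.strip()
    lowered = trimmed.lower()
    return re.sub(r"\s+", "_", lowered)


# Hypothetical inputs that differ only in case and surrounding whitespace.
print(normalize_key("  03405443/V66i0003/347_Mrirtaitmbpa "))
print(normalize_key("03405443/v66i0003/347_mrirtaitmbpa"))
```

Both calls produce the same key, so variants of this kind would collapse to
a single uniqueKey value.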

But this behavior isn't what I'd expect: the uniqueKey should cause the
entire document to be overwritten. You might use the TermsComponent
or Luke or perhaps the admin page to examine what's actually in
your index to see if it's not quite what you expect for this field.
You should have only one of these returned.
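
As a rough sketch of the TermsComponent route, the Python below only builds
a request URL that lists the indexed terms of the id field starting with a
given prefix. The base URL is an assumption (a local Solr with a /terms
handler registered in solrconfig.xml), and the function name terms_url is
hypothetical; terms.fl, terms.prefix, and terms.limit are TermsComponent
parameters.

```python
from urllib.parse import urlencode

# Assumed location of Solr; adjust host, port, and core path as needed.
SOLR_BASE_URL = "http://localhost:8983/solr"


def terms_url(field, prefix, limit=10, base_url=SOLR_BASE_URL):
    """Build a TermsComponent URL listing terms in `field` that start
    with `prefix`, along with their counts."""
    params = {
        "terms.fl": field,
        "terms.prefix": prefix,
        "terms.limit": limit,
        "wt": "json",
    }
    return base_url + "/terms?" + urlencode(params)


url = terms_url("id", "03405443/v66i0003/347")
print(url)
```

Two nearly identical terms in the response, or one term with a count of 2,
would each point to a different cause for the duplicate hits.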

Do note as well that if you have been changing your schema (which I often
do while experimenting) such that uniqueKey was undefined or set to something
else, and then changed it back, nothing in Solr would go back and clean up
the existing documents. Is that possible?

Best
Erick

On Fri, Mar 2, 2012 at 4:30 PM,  <al...@aim.com> wrote:
>
> Take a look at <updateRequestProcessorChain name="dedupe">.
> I think you must use dedupe to solve this issue.
>
> -----Original Message-----
> From: Thomas Dowling <td...@ohiolink.edu>
> To: solr-user <so...@lucene.apache.org>
> Cc: Mikhail Khludnev <mk...@griddynamics.com>
> Sent: Fri, Mar 2, 2012 1:10 pm
> Subject: Re: Help with duplicate unique IDs
>
>
> Thanks.  In fact, the behavior I want is overwrite=true.  I want to be
> able to reindex documents, with the same id string, and automatically
> overwrite the previous version.
>
>
> Thomas
>
>
> On 03/02/2012 04:01 PM, Mikhail Khludnev wrote:
>> Hello Thomas,
>>
>> I guess you could just specify overwrite=false
>> http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22add.22
>>
>>
>> On Fri, Mar 2, 2012 at 11:23 PM, Thomas Dowling <td...@ohiolink.edu> wrote:
>>
>>> In a Solr index of journal articles, I thought I was safe reindexing
>>> articles because their unique ID would cause the new record in the index to
>>> overwrite the old one. (As stated at
>>> http://wiki.apache.org/solr/SchemaXml#The_Unique_Key_Field - right?)
>>>
>
>

Re: Help with duplicate unique IDs

Posted by al...@aim.com.
Take a look at <updateRequestProcessorChain name="dedupe">.
I think you must use dedupe to solve this issue.

-----Original Message-----
From: Thomas Dowling <td...@ohiolink.edu>
To: solr-user <so...@lucene.apache.org>
Cc: Mikhail Khludnev <mk...@griddynamics.com>
Sent: Fri, Mar 2, 2012 1:10 pm
Subject: Re: Help with duplicate unique IDs


Thanks.  In fact, the behavior I want is overwrite=true.  I want to be 
able to reindex documents, with the same id string, and automatically 
overwrite the previous version.


Thomas


On 03/02/2012 04:01 PM, Mikhail Khludnev wrote:
> Hello Thomas,
>
> I guess you could just specify overwrite=false
> http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22add.22
>
>
> On Fri, Mar 2, 2012 at 11:23 PM, Thomas Dowling <td...@ohiolink.edu> wrote:
>
>> In a Solr index of journal articles, I thought I was safe reindexing
>> articles because their unique ID would cause the new record in the index to
>> overwrite the old one. (As stated at
>> http://wiki.apache.org/solr/SchemaXml#The_Unique_Key_Field - right?)
>>

 

Re: Help with duplicate unique IDs

Posted by Thomas Dowling <td...@ohiolink.edu>.
Thanks.  In fact, the behavior I want is overwrite=true.  I want to be 
able to reindex documents, with the same id string, and automatically 
overwrite the previous version.
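
For illustration, an XML update message with the overwrite attribute spelled
out could be built as in the Python sketch below. It assumes the attribute
described on the UpdateXmlMessages wiki page cited in this thread, where
true is the documented default, so writing it explicitly is only for
clarity. The function name build_add_message and the title field are
hypothetical.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields, overwrite=True):
    """Build an <add> update message holding one <doc> with the given
    field name/value pairs."""
    add = ET.Element("add", {"overwrite": "true" if overwrite else "false"})
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    return ET.tostring(add, encoding="unicode")


message = build_add_message({
    "id": "03405443/v66i0003/347_mrirtaitmbpa",
    "title": "Hypothetical article title",  # field invented for this example
})
print(message)
```

Posting the resulting message to the update handler, followed by a commit,
is left out of the sketch.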


Thomas


On 03/02/2012 04:01 PM, Mikhail Khludnev wrote:
> Hello Thomas,
>
> I guess you could just specify overwrite=false
> http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22add.22
>
>
> On Fri, Mar 2, 2012 at 11:23 PM, Thomas Dowling <td...@ohiolink.edu> wrote:
>
>> In a Solr index of journal articles, I thought I was safe reindexing
>> articles because their unique ID would cause the new record in the index to
>> overwrite the old one. (As stated at
>> http://wiki.apache.org/solr/SchemaXml#The_Unique_Key_Field - right?)
>>

Re: Help with duplicate unique IDs

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello Thomas,

I guess you could just specify overwrite=false
http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22add.22


On Fri, Mar 2, 2012 at 11:23 PM, Thomas Dowling <td...@ohiolink.edu> wrote:

> In a Solr index of journal articles, I thought I was safe reindexing
> articles because their unique ID would cause the new record in the index to
> overwrite the old one. (As stated at
> http://wiki.apache.org/solr/SchemaXml#The_Unique_Key_Field - right?)
>
> My schema.xml includes:
>
> <fields>...
> <field name="id" type="string" indexed="true" stored="true"
>  required="true"/>
> ...</fields>
>
> And:
>
> <uniqueKey>id</uniqueKey>
>
> And yet I can compose a query with two hits in the index, showing:
>
> #1: <str name="id">03405443/v66i0003/347_mrirtaitmbpa</str>
> #2: <str name="id">03405443/v66i0003/347_mrirtaitmbpa</str>
>
>
> Can anyone give pointers on where I'm screwing something up?
>
>
> Thomas Dowling
> thomas.dowling@gmail.com
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Help with duplicate unique IDs

Posted by Pawel Rog <pa...@gmail.com>.
I once had the same problem and didn't know what was going on. After a few
moments of analysis, I created a completely new index and removed the old one
(I didn't have enough time to analyze the problem). The problem never came
back.

--
Regards,
Pawel

On Fri, Mar 2, 2012 at 8:23 PM, Thomas Dowling <td...@ohiolink.edu> wrote:
> In a Solr index of journal articles, I thought I was safe reindexing
> articles because their unique ID would cause the new record in the index to
> overwrite the old one. (As stated at
> http://wiki.apache.org/solr/SchemaXml#The_Unique_Key_Field - right?)
>
> My schema.xml includes:
>
> <fields>...
> <field name="id" type="string" indexed="true" stored="true"
>  required="true"/>
> ...</fields>
>
> And:
>
> <uniqueKey>id</uniqueKey>
>
> And yet I can compose a query with two hits in the index, showing:
>
> #1: <str name="id">03405443/v66i0003/347_mrirtaitmbpa</str>
> #2: <str name="id">03405443/v66i0003/347_mrirtaitmbpa</str>
>
>
> Can anyone give pointers on where I'm screwing something up?
>
>
> Thomas Dowling
> thomas.dowling@gmail.com