Posted to solr-user@lucene.apache.org by Mr Havercamp <mr...@gmail.com> on 2015/09/11 16:25:55 UTC

Re: Duplicate Documents

Running 4.8.1. I am experiencing the same problem where I get duplicates on
index update despite using overwrite=true when adding existing documents.
My duplicate ratio is a lot higher, with maybe 25-50% of records having
duplicates (and as the index continues to run, the number of copies per
record increases from 2 to 3, 4, 5, etc.).

<field name="key"    type="string"    indexed="true"    stored="true"
required="true"/>

and

<uniqueKey>key</uniqueKey>

are set in the schema.xml but along with overwrite="true" this still
doesn't guarantee uniqueness.
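
One way to measure how many keys are affected is to facet on the uniqueKey
field and keep only values that occur in two or more documents. The sketch
below is only an illustration under assumptions: the base URL and core name
are placeholders, the function names are hypothetical, and it assumes the
JSON response writer with its default flat facet list
(value, count, value, count, ...).

```python
from urllib.parse import urlencode


def duplicate_check_url(base_url, key_field="key"):
    """Build a /select URL that facets on the uniqueKey field and returns
    only values found in two or more documents (facet.mincount=2)."""
    params = {
        "q": "*:*",
        "rows": 0,            # only the facet counts are needed
        "wt": "json",
        "facet": "true",
        "facet.field": key_field,
        "facet.mincount": 2,  # a unique key should never reach 2
        "facet.limit": -1,    # do not cap the number of facet values
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def duplicated_keys(response, key_field="key"):
    """Convert the flat [value, count, value, count, ...] facet list
    from a parsed JSON response into a {value: count} dict."""
    flat = response["facet_counts"]["facet_fields"][key_field]
    return dict(zip(flat[0::2], flat[1::2]))
```

Here `response` is the parsed JSON body returned for that URL. An empty
dict would mean no key value is shared by more than one document.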

On 5 August 2015 at 14:29, Tarala, Magesh <MT...@bh.com> wrote:

> I deleted the index and re-indexed. Duplicates went away. Have not
> identified root cause, but looks like updating documents is causing it
> sporadically. Going to try deleting the document and then update.
>
>
> -----Original Message-----
> From: Tarala, Magesh
> Sent: Monday, August 03, 2015 8:27 AM
> To: solr-user@lucene.apache.org
> Subject: Duplicate Documents
>
> I'm using solr 4.10.2. I'm using "id" field as the unique key - it is
> passed in with the document when ingesting the documents into solr. When
> querying I get duplicate documents with different "_version_". Out of
> approx. 25K unique documents ingested into solr, I see approx. 300
> duplicates.
>
> It is a 3 node solr cloud with one shard and 2 replicas.
> I'm also using nested documents.
>
> Thanks in advance for any insights.
>
> --Magesh
>
>

Re: Duplicate Documents

Posted by Mr Havercamp <mr...@gmail.com>.
Thanks. Okay, I have done what you suggest, i.e. removed overwrite=true
so that Solr falls back to its default value. I've also tried a re-index
and left it to run for a few days; so far so good, nothing indicating
duplicates, so as you say, it could just be a bug in my code.

Will continue to monitor to see if the problem reoccurs.

Thanks again


Hayden

On 12 September 2015 at 19:48, Shawn Heisey <ap...@elyograg.org> wrote:

> On 9/12/2015 10:51 AM, Mr Havercamp wrote:
> > Unfortunately, <uniqueKey/> has never changed. The issue can take some
> time
> > to show itself although I think there were logic issues with the way I
> > update documents in my index.
> >
> > I first do a full purge and reindex of all items without issue.
> >
> > Over time, I only index items that have changed/are new since initial
> > reindex. However, I start to see duplicates appear which is strange
> because
> > I use a combination of <uniqueKey/> plus overwrite="true" which should
> > guarantee uniqueness.
> >
> > However, I have been using the /admin/luke lastModified date to check for
> > items which have been added/updated after this date but have just
> realized
> > that lastModified will only change if I a) reindex everything or b) call
> > optimize, so I have been retrieving items which have already been added
> to
> > the index. I think explicitly storing the last run time (in a file/db
> > field) will ensure I only retrieve those items which have changed since
> the
> > last index. This will also go a long way to solving the duplication
> issue.
>
> Solr will already overwrite when the uniqueKey matches (case sensitive),
> you do not need to tell it explicitly to do it.  Virtually all
> situations when people use the overwrite parameter, they are specifying
> "false" ... so I wonder if perhaps there's a bug when it is explicitly
> set to "true".  Can you do a full purge and reindex with the overwrite
> parameter removed from all requests?
>
> The XMLLoader code looks pretty straightforward, so I don't really
> expect that removing the overwrite parameter will help, but like Erick,
> I cannot see any obvious problem in the info you've shared so far.  I'm
> trying shots in the dark.
>
> Thanks,
> Shawn
>
>

Re: Duplicate Documents

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/12/2015 10:51 AM, Mr Havercamp wrote:
> Unfortunately, <uniqueKey/> has never changed. The issue can take some time
> to show itself although I think there were logic issues with the way I
> update documents in my index.
> 
> I first do a full purge and reindex of all items without issue.
> 
> Over time, I only index items that have changed/are new since initial
> reindex. However, I start to see duplicates appear which is strange because
> I use a combination of <uniqueKey/> plus overwrite="true" which should
> guarantee uniqueness.
> 
> However, I have been using the /admin/luke lastModified date to check for
> items which have been added/updated after this date but have just realized
> that lastModified will only change if I a) reindex everything or b) call
> optimize, so I have been retrieving items which have already been added to
> the index. I think explicitly storing the last run time (in a file/db
> field) will ensure I only retrieve those items which have changed since the
> last index. This will also go a long way to solving the duplication issue.

Solr will already overwrite when the uniqueKey matches (case sensitive),
you do not need to tell it explicitly to do it.  Virtually all
situations when people use the overwrite parameter, they are specifying
"false" ... so I wonder if perhaps there's a bug when it is explicitly
set to "true".  Can you do a full purge and reindex with the overwrite
parameter removed from all requests?

The XMLLoader code looks pretty straightforward, so I don't really
expect that removing the overwrite parameter will help, but like Erick,
I cannot see any obvious problem in the info you've shared so far.  I'm
trying shots in the dark.
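
For reference, a minimal sketch of building the <add/> message with the
overwrite attribute left out, so that Solr applies its default. This is an
illustration using only the Python standard library; build_add_xml and its
parameters are hypothetical names, and the field names are taken from the
XML posted earlier in the thread.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields, commit_within_ms=None):
    """Build a Solr XML <add> message for one document.

    No overwrite attribute is written, so the server-side default applies.
    `fields` is a list of (name, value) pairs; repeating a name produces a
    multi-valued field.
    """
    add = ET.Element("add")
    if commit_within_ms is not None:
        add.set("commitWithin", str(commit_within_ms))
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


xml_body = build_add_xml(
    [("key", "archive.item.1"), ("title", "Test Submission")],
    commit_within_ms=10000,
)
```

Building the message with an XML library rather than string concatenation
also rules out malformed or unescaped field values as a side cause.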

Thanks,
Shawn


Re: Duplicate Documents

Posted by Mr Havercamp <mr...@gmail.com>.
Unfortunately, <uniqueKey/> has never changed. The issue can take some time
to show itself although I think there were logic issues with the way I
update documents in my index.

I first do a full purge and reindex of all items without issue.

Over time, I only index items that have changed/are new since initial
reindex. However, I start to see duplicates appear which is strange because
I use a combination of <uniqueKey/> plus overwrite="true" which should
guarantee uniqueness.

However, I have been using the /admin/luke lastModified date to check for
items which have been added/updated after this date but have just realized
that lastModified will only change if I a) reindex everything or b) call
optimize, so I have been retrieving items which have already been added to
the index. I think explicitly storing the last run time (in a file/db
field) will ensure I only retrieve those items which have changed since the
last index. This will also go a long way to solving the duplication issue.
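
A minimal sketch of that idea, assuming a plain text state file and source
items that carry their own modification time. The file path, the function
names, and the (key, modified) item shape are all hypothetical and only
meant to illustrate the approach.

```python
from datetime import datetime, timezone
from pathlib import Path

# Second-resolution UTC timestamp, written in the same style Solr uses for dates.
DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def read_last_run(state_file):
    """Return the stored last run time as an aware UTC datetime,
    or None if no run has been recorded yet."""
    path = Path(state_file)
    if not path.exists():
        return None
    stamp = datetime.strptime(path.read_text().strip(), DATE_FORMAT)
    return stamp.replace(tzinfo=timezone.utc)


def write_last_run(state_file, when):
    """Persist the run time (an aware datetime), truncated to whole seconds."""
    stamp = when.astimezone(timezone.utc).strftime(DATE_FORMAT)
    Path(state_file).write_text(stamp)


def items_to_index(items, last_run):
    """Keep only (key, modified) pairs changed after the last run.
    With no recorded run, everything is returned (full reindex)."""
    if last_run is None:
        return list(items)
    return [item for item in items if item[1] > last_run]
```

It is probably safest to capture the time at the start of a run and write
it only after the update has succeeded, so that items modified while the
run is in progress are picked up next time instead of being skipped.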

Thanks again


Hayden

On 11 September 2015 at 19:33, Erick Erickson <er...@gmail.com>
wrote:

> OK, this makes no sense whatsoever, so I'm missing something.
>
> commitWithin shouldn't matter at all, there's code to handle multiple
> updates between commits.
>
> I'm _really_ shooting in the dark here, but...
>
> > did you perhaps change the <uniqueKey> definition from the default "id"
> to "key" without blowing away the entire data directory in between?
>
> > Take a look at your schema file through the Admin/UI browser, is it what
> you expect? And did you reload/restart after the changes?
>
> > I could get _some_ duplication by changing the field that was my
> <uniqueKey>
> then adding more docs. Which makes some sense since some of the Lucene
> segment files were created with one definition and some with another. But
> that
> doesn't explain why you _keep_ getting more and more duplicates.
>
> But this behavior is fundamental Solr, so I doubt it would have snuck
> through
> or not generated very loud howls. Which leaves us with wondering what is
> unexpected in your setup. Everything you've shown us looks good, so I'm
> puzzled.
>
> Best,
> Erick
>
>
> On Fri, Sep 11, 2015 at 9:52 AM, Mr Havercamp <mr...@gmail.com>
> wrote:
> > I'm wondering if the commitWithin is causing issues.
> >
> > On 11 September 2015 at 18:52, Mr Havercamp <mr...@gmail.com>
> wrote:
> >
> >> Thanks for the suggestions. No, not using MERGEINDEXES nor
> >> MapReduceIndexerTool.
> >>
> >> I've pasted the <add/> XML in case there is something broken there (cut
> >> down for brevity, i.e. the "..."):
> >>
> >> <add overwrite="true" commitWithin="10000"><doc><field
> >> name="handle_s">123456789/3</field><field name="title">Test
> >> Submission</field><field name="title_sort">Test Submission</field><field
> >> name="access">1</field><field name="parent_id">1</field><field
> >> name="collection_s">Test Collection</field><field
> name="collection_fc">test
> >> collection|||Test Collection</field><field name="collection_sort">Test
> >> Collection</field><field name="dc.contributor.author_fc">young,
> >> hayden|||Young, Hayden</field><field name="author">Young,
> >> Hayden</field><field name="dc.contributor.author_sm">Young,
> >> Hayden</field>...<field name="key">archive.item.1</field>...</doc></add>
> >>
> >> On 11 September 2015 at 18:06, Erick Erickson <er...@gmail.com>
> >> wrote:
> >>
> >>> Are you by any chance using the MERGEINDEXES
> >>> core admin call? Or using MapReduceIndexerTool?
> >>>
> >>> Neither of those delete duplicates....
> >>>
> >>> This is a fundamental part of Solr though, so it's
> >>> virtually certain that there's some innocent-seeming
> >>> thing you're doing that's causing this...
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Fri, Sep 11, 2015 at 8:55 AM, Shawn Heisey <ap...@elyograg.org>
> >>> wrote:
> >>> > On 9/11/2015 9:10 AM, Mr Havercamp wrote:
> >>> >> fieldType def:
> >>> >>
> >>> >>         <!-- The StrField type is not analyzed, but indexed/stored
> >>> >> verbatim. -->
> >>> >>         <fieldType name="string" class="solr.StrField"
> >>> >> sortMissingLast="true" />
> >>> >>
> >>> >> It is not SolrCloud.
> >>> >
> >>> > As long as it's not a distributed index, I can't think of any problem
> >>> > those field/type definitions might cause.  Even if it were
> distributed
> >>> > and you had the same document in multiple shards, duplicates should
> be
> >>> > removed at query time, if each shard has the same schema as the
> others.
> >>> >
> >>> > I don't have any further ideas.  There may be something wrong that I
> >>> > haven't thought of.
> >>> >
> >>> > Thanks,
> >>> > Shawn
> >>> >
> >>>
> >>
> >>
>

Re: Duplicate Documents

Posted by Erick Erickson <er...@gmail.com>.
OK, this makes no sense whatsoever, so I'm missing something.

commitWithin shouldn't matter at all, there's code to handle multiple
updates between commits.

I'm _really_ shooting in the dark here, but...

> did you perhaps change the <uniqueKey> definition from the default "id"
to "key" without blowing away the entire data directory in between?

> Take a look at your schema file through the Admin/UI browser, is it what
you expect? And did you reload/restart after the changes?

> I could get _some_ duplication by changing the field that was my <uniqueKey>
then adding more docs. Which makes some sense since some of the Lucene
segment files were created with one definition and some with another. But that
doesn't explain why you _keep_ getting more and more duplicates.

But this behavior is fundamental Solr, so I doubt it would have snuck through
or not generated very loud howls. Which leaves us with wondering what is
unexpected in your setup. Everything you've shown us looks good, so I'm puzzled.

Best,
Erick


On Fri, Sep 11, 2015 at 9:52 AM, Mr Havercamp <mr...@gmail.com> wrote:
> I'm wondering if the commitWithin is causing issues.
>
> On 11 September 2015 at 18:52, Mr Havercamp <mr...@gmail.com> wrote:
>
>> Thanks for the suggestions. No, not using MERGEINDEXES nor
>> MapReduceIndexerTool.
>>
>> I've pasted the <add/> XML in case there is something broken there (cut
>> down for brevity, i.e. the "..."):
>>
>> <add overwrite="true" commitWithin="10000"><doc><field
>> name="handle_s">123456789/3</field><field name="title">Test
>> Submission</field><field name="title_sort">Test Submission</field><field
>> name="access">1</field><field name="parent_id">1</field><field
>> name="collection_s">Test Collection</field><field name="collection_fc">test
>> collection|||Test Collection</field><field name="collection_sort">Test
>> Collection</field><field name="dc.contributor.author_fc">young,
>> hayden|||Young, Hayden</field><field name="author">Young,
>> Hayden</field><field name="dc.contributor.author_sm">Young,
>> Hayden</field>...<field name="key">archive.item.1</field>...</doc></add>
>>
>> On 11 September 2015 at 18:06, Erick Erickson <er...@gmail.com>
>> wrote:
>>
>>> Are you by any chance using the MERGEINDEXES
>>> core admin call? Or using MapReduceIndexerTool?
>>>
>>> Neither of those delete duplicates....
>>>
>>> This is a fundamental part of Solr though, so it's
>>> virtually certain that there's some innocent-seeming
>>> thing you're doing that's causing this...
>>>
>>> Best,
>>> Erick
>>>
>>> On Fri, Sep 11, 2015 at 8:55 AM, Shawn Heisey <ap...@elyograg.org>
>>> wrote:
>>> > On 9/11/2015 9:10 AM, Mr Havercamp wrote:
>>> >> fieldType def:
>>> >>
>>> >>         <!-- The StrField type is not analyzed, but indexed/stored
>>> >> verbatim. -->
>>> >>         <fieldType name="string" class="solr.StrField"
>>> >> sortMissingLast="true" />
>>> >>
>>> >> It is not SolrCloud.
>>> >
>>> > As long as it's not a distributed index, I can't think of any problem
>>> > those field/type definitions might cause.  Even if it were distributed
>>> > and you had the same document in multiple shards, duplicates should be
>>> > removed at query time, if each shard has the same schema as the others.
>>> >
>>> > I don't have any further ideas.  There may be something wrong that I
>>> > haven't thought of.
>>> >
>>> > Thanks,
>>> > Shawn
>>> >
>>>
>>
>>

Re: Duplicate Documents

Posted by Mr Havercamp <mr...@gmail.com>.
I'm wondering if the commitWithin is causing issues.

On 11 September 2015 at 18:52, Mr Havercamp <mr...@gmail.com> wrote:

> Thanks for the suggestions. No, not using MERGEINDEXES nor
> MapReduceIndexerTool.
>
> I've pasted the <add/> XML in case there is something broken there (cut
> down for brevity, i.e. the "..."):
>
> <add overwrite="true" commitWithin="10000"><doc><field
> name="handle_s">123456789/3</field><field name="title">Test
> Submission</field><field name="title_sort">Test Submission</field><field
> name="access">1</field><field name="parent_id">1</field><field
> name="collection_s">Test Collection</field><field name="collection_fc">test
> collection|||Test Collection</field><field name="collection_sort">Test
> Collection</field><field name="dc.contributor.author_fc">young,
> hayden|||Young, Hayden</field><field name="author">Young,
> Hayden</field><field name="dc.contributor.author_sm">Young,
> Hayden</field>...<field name="key">archive.item.1</field>...</doc></add>
>
> On 11 September 2015 at 18:06, Erick Erickson <er...@gmail.com>
> wrote:
>
>> Are you by any chance using the MERGEINDEXES
>> core admin call? Or using MapReduceIndexerTool?
>>
>> Neither of those delete duplicates....
>>
>> This is a fundamental part of Solr though, so it's
>> virtually certain that there's some innocent-seeming
>> thing you're doing that's causing this...
>>
>> Best,
>> Erick
>>
>> On Fri, Sep 11, 2015 at 8:55 AM, Shawn Heisey <ap...@elyograg.org>
>> wrote:
>> > On 9/11/2015 9:10 AM, Mr Havercamp wrote:
>> >> fieldType def:
>> >>
>> >>         <!-- The StrField type is not analyzed, but indexed/stored
>> >> verbatim. -->
>> >>         <fieldType name="string" class="solr.StrField"
>> >> sortMissingLast="true" />
>> >>
>> >> It is not SolrCloud.
>> >
>> > As long as it's not a distributed index, I can't think of any problem
>> > those field/type definitions might cause.  Even if it were distributed
>> > and you had the same document in multiple shards, duplicates should be
>> > removed at query time, if each shard has the same schema as the others.
>> >
>> > I don't have any further ideas.  There may be something wrong that I
>> > haven't thought of.
>> >
>> > Thanks,
>> > Shawn
>> >
>>
>
>

Re: Duplicate Documents

Posted by Mr Havercamp <mr...@gmail.com>.
Thanks for the suggestions. No, not using MERGEINDEXES nor
MapReduceIndexerTool.

I've pasted the <add/> XML in case there is something broken there (cut
down for brevity, i.e. the "..."):

<add overwrite="true" commitWithin="10000"><doc><field
name="handle_s">123456789/3</field><field name="title">Test
Submission</field><field name="title_sort">Test Submission</field><field
name="access">1</field><field name="parent_id">1</field><field
name="collection_s">Test Collection</field><field name="collection_fc">test
collection|||Test Collection</field><field name="collection_sort">Test
Collection</field><field name="dc.contributor.author_fc">young,
hayden|||Young, Hayden</field><field name="author">Young,
Hayden</field><field name="dc.contributor.author_sm">Young,
Hayden</field>...<field name="key">archive.item.1</field>...</doc></add>

On 11 September 2015 at 18:06, Erick Erickson <er...@gmail.com>
wrote:

> Are you by any chance using the MERGEINDEXES
> core admin call? Or using MapReduceIndexerTool?
>
> Neither of those delete duplicates....
>
> This is a fundamental part of Solr though, so it's
> virtually certain that there's some innocent-seeming
> thing you're doing that's causing this...
>
> Best,
> Erick
>
> On Fri, Sep 11, 2015 at 8:55 AM, Shawn Heisey <ap...@elyograg.org> wrote:
> > On 9/11/2015 9:10 AM, Mr Havercamp wrote:
> >> fieldType def:
> >>
> >>         <!-- The StrField type is not analyzed, but indexed/stored
> >> verbatim. -->
> >>         <fieldType name="string" class="solr.StrField"
> >> sortMissingLast="true" />
> >>
> >> It is not SolrCloud.
> >
> > As long as it's not a distributed index, I can't think of any problem
> > those field/type definitions might cause.  Even if it were distributed
> > and you had the same document in multiple shards, duplicates should be
> > removed at query time, if each shard has the same schema as the others.
> >
> > I don't have any further ideas.  There may be something wrong that I
> > haven't thought of.
> >
> > Thanks,
> > Shawn
> >
>

Re: Duplicate Documents

Posted by Erick Erickson <er...@gmail.com>.
Are you by any chance using the MERGEINDEXES
core admin call? Or using MapReduceIndexerTool?

Neither of those delete duplicates....

This is a fundamental part of Solr though, so it's
virtually certain that there's some innocent-seeming
thing you're doing that's causing this...

Best,
Erick

On Fri, Sep 11, 2015 at 8:55 AM, Shawn Heisey <ap...@elyograg.org> wrote:
> On 9/11/2015 9:10 AM, Mr Havercamp wrote:
>> fieldType def:
>>
>>         <!-- The StrField type is not analyzed, but indexed/stored
>> verbatim. -->
>>         <fieldType name="string" class="solr.StrField"
>> sortMissingLast="true" />
>>
>> It is not SolrCloud.
>
> As long as it's not a distributed index, I can't think of any problem
> those field/type definitions might cause.  Even if it were distributed
> and you had the same document in multiple shards, duplicates should be
> removed at query time, if each shard has the same schema as the others.
>
> I don't have any further ideas.  There may be something wrong that I
> haven't thought of.
>
> Thanks,
> Shawn
>

Re: Duplicate Documents

Posted by Vivek Pathak <vp...@orgmeta.com>.
At query time, you could roll up the dups externally when they have the
same signature.

If you describe your use case, it might be easier to suggest something.
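
A rough sketch of what rolling up on the client side could look like. It
assumes the "signature" is simply the uniqueKey value, that the query
returns "_version_" for each document, and that the highest "_version_"
is the copy worth keeping; collapse_duplicates is a hypothetical helper,
not part of any Solr client.

```python
def collapse_duplicates(docs, key_field="key"):
    """Keep one document per key value: the one with the highest _version_.

    `docs` is the list of result documents (dicts) from a query response.
    Output order follows the first appearance of each key.
    """
    newest = {}
    for doc in docs:
        key = doc[key_field]
        kept = newest.get(key)
        if kept is None or doc.get("_version_", 0) > kept.get("_version_", 0):
            newest[key] = doc
    return list(newest.values())
```

This only hides the duplicates in results; hit counts, facets and paging
computed by Solr would still include them.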



On 09/11/2015 11:55 AM, Shawn Heisey wrote:
> On 9/11/2015 9:10 AM, Mr Havercamp wrote:
>> fieldType def:
>>
>>          <!-- The StrField type is not analyzed, but indexed/stored
>> verbatim. -->
>>          <fieldType name="string" class="solr.StrField"
>> sortMissingLast="true" />
>>
>> It is not SolrCloud.
> As long as it's not a distributed index, I can't think of any problem
> those field/type definitions might cause.  Even if it were distributed
> and you had the same document in multiple shards, duplicates should be
> removed at query time, if each shard has the same schema as the others.
>
> I don't have any further ideas.  There may be something wrong that I
> haven't thought of.
>
> Thanks,
> Shawn
>


Re: Duplicate Documents

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/11/2015 9:10 AM, Mr Havercamp wrote:
> fieldType def:
>
>         <!-- The StrField type is not analyzed, but indexed/stored
> verbatim. -->
>         <fieldType name="string" class="solr.StrField"
> sortMissingLast="true" />
>
> It is not SolrCloud.

As long as it's not a distributed index, I can't think of any problem
those field/type definitions might cause.  Even if it were distributed
and you had the same document in multiple shards, duplicates should be
removed at query time, if each shard has the same schema as the others.

I don't have any further ideas.  There may be something wrong that I
haven't thought of.

Thanks,
Shawn


Re: Duplicate Documents

Posted by Mr Havercamp <mr...@gmail.com>.
Hi Shawn

Thanks for your response.

fieldType def:

        <!-- The StrField type is not analyzed, but indexed/stored
verbatim. -->
        <fieldType name="string" class="solr.StrField"
sortMissingLast="true" />

It is not SolrCloud.

Cheers


Hayden

On 11 September 2015 at 16:35, Shawn Heisey <ap...@elyograg.org> wrote:

> On 9/11/2015 8:25 AM, Mr Havercamp wrote:
> > Running 4.8.1. I am experiencing the same problem where I get duplicates
> on
> > index update despite using overwrite=true when adding existing documents.
> > My duplicate ratio is a lot higher with maybe 25 - 50% of records having
> > duplicates (and as the index continues to run the duplicates increase
> from
> > 2 to 3,4,5 etc).
> >
> > <field name="key"    type="string"    indexed="true"    stored="true"
> > required="true"/>
> >
> > and
> >
> > <uniqueKey>key</uniqueKey>
> >
> > are set in the schema.xml but along with overwrite="true" this still
> > doesn't guarantee uniqueness.
>
> What is the fieldType definition for "string" ?  I know what it is in
> the example, but it could be something entirely different in your schema.
>
> Also, is it SolrCloud?
>
> Thanks,
> Shawn
>
>

Re: Duplicate Documents

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/11/2015 8:25 AM, Mr Havercamp wrote:
> Running 4.8.1. I am experiencing the same problem where I get duplicates on
> index update despite using overwrite=true when adding existing documents.
> My duplicate ratio is a lot higher with maybe 25 - 50% of records having
> duplicates (and as the index continues to run the duplicates increase from
> 2 to 3,4,5 etc).
>
> <field name="key"    type="string"    indexed="true"    stored="true"
> required="true"/>
>
> and
>
> <uniqueKey>key</uniqueKey>
>
> are set in the schema.xml but along with overwrite="true" this still
> doesn't guarantee uniqueness.

What is the fieldType definition for "string" ?  I know what it is in
the example, but it could be something entirely different in your schema.

Also, is it SolrCloud?

Thanks,
Shawn