Posted to users@solr.apache.org by Eduardo Gomez <eg...@mintel.com.INVALID> on 2022/12/08 09:43:29 UTC

Duplicate docs with same unique id on update

Hi All,

I'm in the process of upgrading our Solr 7.5 installation to 8.11.1. I'm
using our legacy schema.xml with ClassicIndexSchemaFactory in solrconfig.xml.

I have seen there have been some changes introduced to how child docs are
updated (
https://solr.apache.org/guide/8_0/major-changes-in-solr-8.html#nested-documents).
From the docs:

*" ... an attempt to update a child document by providing a new document
with the same ID would add a new document (which will probably be
erroneous)"*

I'm not using nested docs; however, I'm observing exactly that behaviour in
Solr 8.11.1 for all my docs. It seems the only way to avoid it is to add
this to the schema:

<field name="_root_" type="string" indexed="true" stored="false" docValues="false" />

That field is supposed to be needed only for nested docs to refer to their
parent, is that correct? Has anyone seen this? Is it expected behaviour
that non-nested docs need the _root_ field to refer to themselves?

Thanks!
Eduardo

-- 

Mintel Group Ltd | Mintel House, 4 Playhouse Yard | London | EC4V 5EX
Registered in England: Number 1475918. | VAT Number: GB 232 9342 72

Contact details for our other offices can be found at 
http://www.mintel.com/office-locations 
<http://www.mintel.com/office-locations>.

This email and any attachments 
may include content that is confidential, privileged 
or otherwise 
protected under applicable law. Unauthorised disclosure, copying, 
distribution 
or use of the contents is prohibited and may be unlawful. If 
you have received this email in error,
including without appropriate 
authorisation, then please reply to the sender about the error 
and delete 
this email and any attachments.


Re: Duplicate docs with same unique id on update

Posted by Eduardo Gomez <eg...@mintel.com.INVALID>.
Hi, sorry for the delay in replying.

After some more digging, I noticed the following in the schema (which I
did not originally create and which works without apparent issues in Solr
7.5):

<dynamicField name="*" type="string" indexed="false" stored="false"/>


I think that was intended as a catch-all for fields in the input data that
are not defined in the schema.

Removing that dynamic field stops Solr from producing duplicate documents
with the same unique id. Alternatively, keeping the dynamic field and adding
the following:

<field name="_root_" type="string" indexed="true" stored="false" docValues="false"/>

also prevents the creation of duplicates.
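
For anyone who wants to check whether their own schema.xml matches the combination described above, here is a minimal sketch. It assumes the schema is a file on disk (as with ClassicIndexSchemaFactory) and that each tag starts on the same line as its name attribute. The function name matches_reported_config is made up for illustration, and the check is a line-based heuristic, not an XML parser.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the given schema file contains a
# catch-all <dynamicField name="*" .../> but no explicit <field name="_root_" .../>,
# i.e. the combination reported in this thread to produce duplicates on 8.11.
matches_reported_config() {
  schema_file="$1"
  # No catch-all dynamic field: not the reported configuration.
  grep -Eq '<dynamicField[[:space:]][^>]*name="\*"' "$schema_file" || return 1
  # An explicit _root_ field is present: not the reported configuration.
  if grep -Eq '<field[[:space:]][^>]*name="_root_"' "$schema_file"; then
    return 1
  fi
  return 0
}
```

Calling matches_reported_config /path/to/schema.xml (path assumed) and checking the exit status is enough to flag a schema worth testing; it does not prove that duplicates will occur.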

So, to clarify:

In *Solr 7.5*, with the following in the schema:

<dynamicField name="*" type="string" indexed="false" stored="false"/>
(with no <field name="_root_" type="string" indexed="true" stored="false" docValues="false" />)

running:

curl -X POST -H 'Content-type:application/json' \
  'http://localhost:8983/solr/test-dup/update?commit=true' \
  --data "{'add': {'doc':{'id': '28a6a45a-5f81...', 'title': 'hello'}}}"

curl -X POST -H 'Content-type:application/json' \
  'http://localhost:8983/solr/test-dup/update?commit=true' \
  --data "{'add': {'doc':{'id': '28a6a45a-5f81...', 'title': 'bye'}}}"

results in a single doc:

{"id": "28a6a45a-5f81...", "title": "bye"}



In *Solr 8.11*, with the same schema:

<dynamicField name="*" type="string" indexed="false" stored="false"/>
(with no <field name="_root_" type="string" indexed="true" stored="false" docValues="false" />)

running:

curl -X POST -H 'Content-type:application/json' \
  'http://localhost:8983/solr/test-dup/update?commit=true' \
  --data "{'add': {'doc':{'id': '28a6a45a-5f81...', 'title': 'hello'}}}"

curl -X POST -H 'Content-type:application/json' \
  'http://localhost:8983/solr/test-dup/update?commit=true' \
  --data "{'add': {'doc':{'id': '28a6a45a-5f81...', 'title': 'bye'}}}"

results in two docs:

{"id": "28a6a45a-5f81...", "title": "hello"}
{"id": "28a6a45a-5f81...", "title": "bye"}



As I mentioned above, removing the dynamic field or adding the _root_ field
produces the expected behaviour with the document getting updated instead
of duplicated.
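
One way to confirm either outcome after the two updates is to count how many indexed docs carry the id. The following is a rough sketch, assuming Solr is reachable at localhost:8983 and the core is named test-dup as in the commands above. The helper names num_found and count_docs_with_id are illustrative, and the sed expression assumes the standard JSON response layout with a numeric "numFound" entry.

```shell
#!/bin/sh
# Hypothetical helper: read a Solr JSON select response on stdin and print
# the value of its "numFound" entry.
num_found() {
  sed -n 's/.*"numFound": *\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: print how many indexed docs have the given unique id.
# The second argument (core URL) is optional and defaults to the assumed
# local test-dup core.
count_docs_with_id() {
  doc_id="$1"
  core_url="${2:-http://localhost:8983/solr/test-dup}"
  curl -s --get "$core_url/select" \
    --data-urlencode "q=id:\"$doc_id\"" \
    --data-urlencode "rows=0" | num_found
}
```

Running count_docs_with_id with the id used in the updates should print 1 when the second add overwrote the first, and 2 when the duplication described here has occurred.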

I have uploaded a very stripped-down version of the schema.xml
<https://pastebin.com/raw/bxBBF2tP> plus the solrconfig.xml
<https://pastebin.com/raw/aUsh9g2z> that reproduce the duplicating
behaviour.

Thanks!

Eduardo

On Fri, Dec 9, 2022 at 1:53 PM Jan Høydahl <ja...@cominvent.com> wrote:

> No no. The schema still has ONE uniqueId field.
> The _root_ field is used as a parent pointer for child documents, it will
> hold the ID of the parent.
> Thus you should not need _root_ if you don't use parent/child. But this
> thread suggests that _root_ may be needed in some other code paths as well.
>
> I suspect perhaps this JIRA
> https://issues.apache.org/jira/browse/SOLR-12638 may be related in some
> way (have not looked at any of that code though, see
> https://github.com/apache/solr/search?q=SOLR-12638&type=commits)
>
> Jan



Re: Duplicate docs with same unique id on update

Posted by Jan Høydahl <ja...@cominvent.com>.
No no. The schema still has ONE uniqueId field.
The _root_ field is used as a parent pointer for child documents, it will hold the ID of the parent.
Thus you should not need _root_ if you don't use parent/child. But this thread suggests that _root_ may be needed in some other code paths as well.

I suspect perhaps this JIRA https://issues.apache.org/jira/browse/SOLR-12638 may be related in some way (have not looked at any of that code though, see https://github.com/apache/solr/search?q=SOLR-12638&type=commits)

Jan

> 9. des. 2022 kl. 14:32 skrev Dave <ha...@gmail.com>:
> 
> So it was a decision to remove the unique field id and replace it with root? This seems bad. You can’t have two documents with the same id/unique field.


Re: Duplicate docs with same unique id on update

Posted by Dave <ha...@gmail.com>.
So it was a decision to remove the unique field id and replace it with root? This seems bad. You can’t have two documents with the same id/unique field.

> On Dec 9, 2022, at 7:57 AM, Jan Høydahl <ja...@cominvent.com> wrote:
> 
> Hi,
> 
> So to be clear - you have a working fix by adding the _root_ field to your schema?
> 
> I suppose most 8.x users already have a _root_ field, so the thing you are seeing could very well be some bug related to atomic update.
> 
> Can I propose that you create a minimal reproduction of this issue and upload it somewhere?
> It could e.g. be a set of curl commands that, starting from a newly installed Solr 8.11 (or even better 9.1) reproduce the issue.
> Hint: You can create a collection with default schema: `solr create -c test` and then remove the _root_ field by issuing a delete-field command as described here https://solr.apache.org/guide/solr/latest/indexing-guide/schema-api.html#delete-a-field
> 
> Jan

Re: Duplicate docs with same unique id on update

Posted by Jan Høydahl <ja...@cominvent.com>.
Hi,

So to be clear - you have a working fix by adding the _root_ field to your schema?

I suppose most 8.x users already have a _root_ field, so the thing you are seeing could very well be some bug related to atomic update.

Can I propose that you create a minimal reproduction of this issue and upload it somewhere?
It could e.g. be a set of curl commands that, starting from a newly installed Solr 8.11 (or even better 9.1) reproduce the issue.
Hint: You can create a collection with default schema: `solr create -c test` and then remove the _root_ field by issuing a delete-field command as described here https://solr.apache.org/guide/solr/latest/indexing-guide/schema-api.html#delete-a-field

Jan
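
A reproduction along the lines suggested above might look like the following sketch. It assumes a local Solr at localhost:8983 and a collection named test created with `solr create -c test`; the id repro-1, the title field, and the function names are made up for illustration. The delete-field command is the Schema API call referenced in the link. Note that elsewhere in this thread the duplication is reported to also depend on a catch-all dynamicField, which this sketch does not add, so it may not reproduce the problem on its own.

```shell
#!/bin/sh
# Assumed defaults; override through the environment if your setup differs.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
COLLECTION="${COLLECTION:-test}"

# Build the JSON body for the Schema API delete-field command.
delete_field_payload() {
  printf '{"delete-field":{"name":"%s"}}' "$1"
}

# Build the JSON body for a plain add of a doc with an id and a title.
add_doc_payload() {
  printf '{"add":{"doc":{"id":"%s","title":"%s"}}}' "$1" "$2"
}

# Hypothetical driver: nothing is sent to Solr until reproduce is called.
reproduce() {
  # Remove the _root_ field from the default schema.
  curl -s -X POST -H 'Content-type:application/json' \
    --data-binary "$(delete_field_payload _root_)" \
    "$SOLR_URL/$COLLECTION/schema"
  # Index the same id twice with different titles.
  for title in hello bye; do
    curl -s -X POST -H 'Content-type:application/json' \
      --data-binary "$(add_doc_payload repro-1 "$title")" \
      "$SOLR_URL/$COLLECTION/update?commit=true"
  done
  # Show what is in the index for that id.
  curl -s --get "$SOLR_URL/$COLLECTION/select" \
    --data-urlencode 'q=id:"repro-1"'
}
```

After sourcing the file, calling reproduce and inspecting the final response shows whether one or two docs are returned for the id.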

> 8. des. 2022 kl. 15:30 skrev Eduardo Gomez <eg...@mintel.com.INVALID>:
> 
>> At first it wasn't clear to me what the problem you're having actually
>> is.  Then I glanced back at the message subject ... it is the only place
>> you mention it.
> 
> Sorry Shawn, you are right, I didn't explain very clearly. So basically, in
> Solr 8.11.1, I can see that updating an existing document, e.g. {"id":
> "22468d41-3b...", "title": "Old title"}:
>
> curl -X POST -H 'Content-type:application/json' \
>   'http://localhost:8983/solr/clients_main/update?commit=true' \
>   --data "{'add': {'doc':{'id': '22468d41-3b...', 'title': 'New title'}}}"
>
> I get two docs with the same id and different titles in the index. That is
> different to the behaviour I see using Solr 7.5, which is a single document
> with the updated title. To get that with the same schema in Solr 8.11.1, I
> have to add this to the schema:
>
> <field name="_root_" type="string" indexed="true" stored="false"/>
>
> So without the _root_ definition, the behaviour is as expected in Solr 7.5
> but produces duplicate documents in Solr 8.11. I haven't noticed Solr
> complaining if the _root_ field is not defined.
> 
> So my question was if that is expected, as that field seems to be related
> to parent-child documents, which I don't use at all.
> 
> The definition for the id field in my schema.xml is similar to the one you
> posted:
> 
> <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
> <field name="id" type="string" indexed="true" stored="true" required="true"
> docValues="false"/>
> <uniqueKey>id</uniqueKey>
> 
> Eduardo
> 
> 
> 
> 
> 
> 
> On Thu, Dec 8, 2022 at 1:11 PM Mikhail Khludnev <mk...@apache.org> wrote:
> 
>> Right, Shawn. That's how it works
>> 
>> https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/IndexWriter.html#updateDocuments-org.apache.lucene.index.Term-java.lang.Iterable-
>> And it's really fast in query time.
>> 
>> On Thu, Dec 8, 2022 at 4:06 PM Shawn Heisey <ap...@elyograg.org> wrote:
>> 
>>> On 12/8/22 05:58, Shawn Heisey wrote:
>>>> So you can't just update a child document, you have to update all the
>>>> children and all the parents at the same time, so the new documents
>>>> are all in the same segment.
>>> 
>>> That's a little unclear and sounds like a draconian requirement. :)  I
>>> meant that all children must be in the same segment as their parent.  I
>>> think Solr might support the idea of multiple nesting levels ... if so,
>>> then the ultimate parent document and all its descendants need to be in
>>> the same segment.
>>> 
>>> Thanks,
>>> Shawn
>>> 
>>> 
>> 
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> 
> 


Re: Duplicate docs with same unique id on update

Posted by Eduardo Gomez <eg...@mintel.com.INVALID>.
The default managed_schema in Solr 8.11 says:

    <!-- If you don't use child/nested documents, then you should remove the next two fields:  -->
    <!-- for nested documents (minimal; points to root document) -->
    <field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
    <!-- for nested documents (relationship tracking) -->
    <field name="_nest_path_" type="_nest_path_" /><fieldType name="_nest_path_" class="solr.NestPathField" />




On Thu, Dec 8, 2022 at 2:40 PM David Hastings <ha...@gmail.com>
wrote:

> Interesting, this is kind of bizarre behavior.
> is:
> <field name="_root_" type="string" indexed="true" stored="false">
> defaulted in the schema for 8.x?



Re: Duplicate docs with same unique id on update

Posted by David Hastings <ha...@gmail.com>.
Interesting, this is kind of bizarre behavior.
Is
<field name="_root_" type="string" indexed="true" stored="false"/>
included by default in the schema for 8.x?

On Thu, Dec 8, 2022 at 9:31 AM Eduardo Gomez <eg...@mintel.com.invalid>
wrote:

> > At first it wasn't clear to me what the problem you're having actually
> > is.  Then I glanced back at the message subject ... it is the only place
> > you mention it.
>
> Sorry Shawn, you are right, I didn't explain very clearly. So basically, in
> Solr 8.11.1,  I can see that updating an existing document, e.g. {"id":
> "22468d41-3b...", "title": "Old title"}:
>
> curl -X POST -H 'Content-type:application/json' '
> http://localhost:8983/solr/clients_main/update?commit=true' --data
> "{'add':
> {'doc':{'id': '22468d41-3b...', 'title': 'New title'}}}"
>
> I get two docs with the same id and different titles in the index. That is
> different to the behaviour I see using Solr 7.5, which is a single document
> with the updated title.To get that with the same schema in Solr 8.11.1, I
> have to add this to the schema:
>
> <field name="_root_" type="string" indexed="true" stored="false">
>
> So without the _root_ definition, the behaviour is as expected in Solr 7.5
> but produces duplicate documents in Solr 8.11. I haven't noticed Solr
> complainig if the _root_ field is not defined.
>
> So my question was if that is expected, as that field seems to be related
> to parent-child documents, which I don't use at all.
>
> The definition for the id field in my schema.xml is similar to the one you
> posted:
>
> <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
> <field name="id" type="string" indexed="true" stored="true" required="true"
> docValues="false"/>
> <uniqueKey>id</uniqueKey>
>
> Eduardo
>
>
>
>
>
>
> On Thu, Dec 8, 2022 at 1:11 PM Mikhail Khludnev <mk...@apache.org> wrote:
>
> > Right, Shawn. That's how it works
> >
> >
> https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/IndexWriter.html#updateDocuments-org.apache.lucene.index.Term-java.lang.Iterable-
> > And it's really fast in query time.
> >
> > On Thu, Dec 8, 2022 at 4:06 PM Shawn Heisey <ap...@elyograg.org> wrote:
> >
> > > On 12/8/22 05:58, Shawn Heisey wrote:
> > > > So you can't just update a child document, you have to update all the
> > > > children and all the parents at the same time, so the new documents
> > > > are all in the same segment.
> > >
> > > That's a little unclear and sounds like a draconian requirement. :)  I
> > > meant that all children must be in the same segment as their parent.  I
> > > think Solr might support the idea of multiple nesting levels ... if so,
> > > then the ultimate parent document and all its descendants need to be in
> > > the same segment.
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> >
>

Re: Duplicate docs with same unique id on update

Posted by Eduardo Gomez <eg...@mintel.com.INVALID>.
> At first it wasn't clear to me what the problem you're having actually
> is.  Then I glanced back at the message subject ... it is the only place
> you mention it.

Sorry Shawn, you are right, I didn't explain it very clearly. Basically, in
Solr 8.11.1, when I update an existing document, e.g. {"id":
"22468d41-3b...", "title": "Old title"}, with:

curl -X POST -H 'Content-type:application/json' \
  'http://localhost:8983/solr/clients_main/update?commit=true' \
  --data "{'add': {'doc': {'id': '22468d41-3b...', 'title': 'New title'}}}"

I get two docs with the same id and different titles in the index. That is
different to the behaviour I see using Solr 7.5, which is a single document
with the updated title. To get that with the same schema in Solr 8.11.1, I
have to add this to the schema:

<field name="_root_" type="string" indexed="true" stored="false" />

So without the _root_ definition, Solr 7.5 behaves as expected, but Solr
8.11 produces duplicate documents. I haven't noticed Solr complaining when
the _root_ field is not defined.

So my question was if that is expected, as that field seems to be related
to parent-child documents, which I don't use at all.
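
For anyone who wants to check the same symptom on their own core, here is a
minimal sketch in Python (standard library only). It assumes the core URL
from the curl command above and a uniqueKey field named "id"; the function
names are made up for illustration. It sends the same kind of "add" command
and then asks /select how many documents carry that id.

```python
import json
import urllib.parse
import urllib.request

# Core URL taken from the curl command in this thread; adjust for your setup.
SOLR = "http://localhost:8983/solr/clients_main"


def build_add_payload(doc):
    """Return the JSON body of an 'add' update command for one document."""
    return json.dumps({"add": {"doc": doc}}).encode("utf-8")


def build_count_url(doc_id, base=SOLR):
    """Return a /select URL that counts documents with this id (rows=0, no docs fetched)."""
    escaped = doc_id.replace("\\", "\\\\").replace('"', '\\"')
    params = urllib.parse.urlencode({"q": 'id:"{}"'.format(escaped), "rows": 0, "wt": "json"})
    return "{}/select?{}".format(base, params)


def add_and_count(doc, base=SOLR):
    """Index doc with commit=true, then return numFound for its id (needs a running Solr)."""
    request = urllib.request.Request(
        base + "/update?commit=true",
        data=build_add_payload(doc),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request).close()
    with urllib.request.urlopen(build_count_url(doc["id"], base)) as response:
        return json.load(response)["response"]["numFound"]
```

Calling add_and_count() twice with the same id and a different title should
return 1 both times if the second add replaced the first document; a result
of 2 reproduces the duplicate described here.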

The definition for the id field in my schema.xml is similar to the one you
posted:

<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<field name="id" type="string" indexed="true" stored="true" required="true"
docValues="false"/>
<uniqueKey>id</uniqueKey>

Eduardo






On Thu, Dec 8, 2022 at 1:11 PM Mikhail Khludnev <mk...@apache.org> wrote:

> Right, Shawn. That's how it works
>
> https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/IndexWriter.html#updateDocuments-org.apache.lucene.index.Term-java.lang.Iterable-
> And it's really fast in query time.
>
> On Thu, Dec 8, 2022 at 4:06 PM Shawn Heisey <ap...@elyograg.org> wrote:
>
> > On 12/8/22 05:58, Shawn Heisey wrote:
> > > So you can't just update a child document, you have to update all the
> > > children and all the parents at the same time, so the new documents
> > > are all in the same segment.
> >
> > That's a little unclear and sounds like a draconian requirement. :)  I
> > meant that all children must be in the same segment as their parent.  I
> > think Solr might support the idea of multiple nesting levels ... if so,
> > then the ultimate parent document and all its descendants need to be in
> > the same segment.
> >
> > Thanks,
> > Shawn
> >
> >
>
> --
> Sincerely yours
> Mikhail Khludnev
>



Re: Duplicate docs with same unique id on update

Posted by Mikhail Khludnev <mk...@apache.org>.
Right, Shawn. That's how it works
https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/IndexWriter.html#updateDocuments-org.apache.lucene.index.Term-java.lang.Iterable-
And it's really fast in query time.
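
To make the contract in that Javadoc a bit more concrete: updateDocuments
deletes whatever matches the given term and adds the new documents as one
contiguous block. Below is a toy in-memory model of that contract in plain
Python. It is not Lucene code, ToyIndex and its method are invented for
illustration, and it says nothing about which term Solr 8.11 actually
passes for a given schema.

```python
class ToyIndex:
    """In-memory stand-in for an index; each doc is a dict of field -> value."""

    def __init__(self):
        self.docs = []  # list position plays the role of the internal doc number

    def update_documents(self, del_field, del_value, block):
        """Drop every doc whose del_field equals del_value, then append block contiguously."""
        self.docs = [d for d in self.docs if d.get(del_field) != del_value]
        start = len(self.docs)
        self.docs.extend(dict(d) for d in block)
        return list(range(start, len(self.docs)))


index = ToyIndex()
# Children first, parent last, all sharing the parent's id in _root_.
family = [
    {"id": "c1", "_root_": "p1"},
    {"id": "c2", "_root_": "p1"},
    {"id": "p1", "_root_": "p1"},
]
index.update_documents("_root_", "p1", family)

# Re-sending the whole family replaces the old block and keeps it contiguous.
positions = index.update_documents("_root_", "p1", family)

# A delete term that matches nothing removes nothing, so the new doc is simply added.
index.update_documents("missing_field", "p1", [{"id": "p1"}])
after_miss = len(index.docs)  # the toy index now holds two docs with id "p1"
```

In this model the replacement only works when the delete term matches a
field that the existing documents really carry.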

On Thu, Dec 8, 2022 at 4:06 PM Shawn Heisey <ap...@elyograg.org> wrote:

> On 12/8/22 05:58, Shawn Heisey wrote:
> > So you can't just update a child document, you have to update all the
> > children and all the parents at the same time, so the new documents
> > are all in the same segment.
>
> That's a little unclear and sounds like a draconian requirement. :)  I
> meant that all children must be in the same segment as their parent.  I
> think Solr might support the idea of multiple nesting levels ... if so,
> then the ultimate parent document and all its descendants need to be in
> the same segment.
>
> Thanks,
> Shawn
>
>

-- 
Sincerely yours
Mikhail Khludnev

Re: Duplicate docs with same unique id on update

Posted by Shawn Heisey <ap...@elyograg.org>.
On 12/8/22 05:58, Shawn Heisey wrote:
> So you can't just update a child document, you have to update all the 
> children and all the parents at the same time, so the new documents 
> are all in the same segment.

That's a little unclear and sounds like a draconian requirement. :)  I 
meant that all children must be in the same segment as their parent.  I 
think Solr might support the idea of multiple nesting levels ... if so, 
then the ultimate parent document and all its descendants need to be in 
the same segment.

Thanks,
Shawn


Re: Duplicate docs with same unique id on update

Posted by Shawn Heisey <ap...@elyograg.org>.
On 12/8/22 02:43, Eduardo Gomez wrote:
> I have seen there have been some changes introduced to how child docs are
> updated (
> https://solr.apache.org/guide/8_0/major-changes-in-solr-8.html#nested-documents).
>  From the docs:
>
> *" ... an attempt to update a child document by providing a new document
> with the same ID would add a new document (which will probably be
> erroneous)"*
>
> I'm not using nested docs, however I'm observing exactly that happening in
> Solr 8.11.1 for all my docs. It seems like the only way of avoiding that is
> adding this to the schema:

At first it wasn't clear to me what the problem you're having actually 
is.  Then I glanced back at the message subject ... it is the only place 
you mention it.

I have never used the parent/child document feature myself.  But from 
things other people have said, the main issue with updating a child 
document is that in order for a parent/child document relationship to 
work, all documents must be in the same Lucene segment.  So you can't 
just update a child document, you have to update all the children and 
all the parents at the same time, so the new documents are all in the 
same segment.  But if you are not using that functionality it's not 
really something you need to worry about.

I do have the _root_ field definition in my schema.  Not because I am 
using nested documents, but because Solr complained that the field was 
missing.  I never put anything in that field.  I just took a look at it, 
and it has indexed, stored, and docValues as false.  Which basically 
means that it's not possible to actually use the field.  That's fine for 
my use case, where there are no nested documents.  It has been a really 
long time since I created this schema ... I was probably trying to 
eliminate the log message without actually making my index larger.

What is the full definition of your uniqueKey field?  Looking for both 
the field definition as well as the referenced fieldType definition.  
There are gotchas if you try to use a TextField type field as a 
uniqueKey field.  You would want to use a StrField or one of the numeric 
types for uniqueKey.

This is my uniqueKey field definition:

<field name="id" type="string" indexed="true" stored="true" 
required="true" multiValued="false" />
<uniqueKey>id</uniqueKey>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" 
docValues="true" />

Do you have uniqueKey defined?  If you don't, Solr has no way of knowing 
that a new document should replace an existing document. Certain other 
functionalities will not work at all without a uniqueKey.
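
A quick way to answer those questions for a classic schema.xml is to parse
it and print what the uniqueKey resolves to. This is only a sketch using
Python's standard library; describe_unique_key and the SAMPLE schema are
made up for illustration, and it assumes the element is spelled
"fieldType" (some legacy schemas use "fieldtype").

```python
import xml.etree.ElementTree as ET


def describe_unique_key(schema_xml):
    """Return (field name, fieldType name, fieldType class) for the uniqueKey, or None."""
    root = ET.fromstring(schema_xml.strip())
    key = root.findtext(".//uniqueKey")
    if key is None:
        return None  # no uniqueKey declared at all
    key = key.strip()
    # iter() also finds fields nested under older <fields>/<types> wrappers.
    field = next((f for f in root.iter("field") if f.get("name") == key), None)
    if field is None:
        return (key, None, None)
    type_name = field.get("type")
    ftype = next((t for t in root.iter("fieldType") if t.get("name") == type_name), None)
    return (key, type_name, ftype.get("class") if ftype is not None else None)


# Hypothetical minimal schema for the demonstration.
SAMPLE = """
<schema name="example" version="1.6">
  <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
  <field name="id" type="string" indexed="true" stored="true" required="true" docValues="false"/>
  <uniqueKey>id</uniqueKey>
</schema>
"""

name, type_name, type_class = describe_unique_key(SAMPLE)
```

If the class that comes back is solr.TextField rather than solr.StrField or
a numeric type, that is the gotcha mentioned above.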

There is also a setting that basically stops Solr from deleting an 
existing document when you index a new one with the same value in the 
uniqueKey field.  I forget what that is and where you might find it.  
Can you share the full core config and exactly what you are sending to 
Solr for indexing, including any URL parameters that you are using?

Lucene, the technology underlying most of Solr's functionality, does not 
have the concept of a uniqueKey.  That is something Solr implemented on 
top of Lucene.

Thanks,
Shawn