Posted to dev@lucene.apache.org by "mosh (JIRA)" <ji...@apache.org> on 2018/10/21 06:57:00 UTC

[jira] [Comment Edited] (SOLR-12638) Support atomic updates of nested/child documents for nested-enabled schema

    [ https://issues.apache.org/jira/browse/SOLR-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658100#comment-16658100 ] 

mosh edited comment on SOLR-12638 at 10/21/18 6:56 AM:
-------------------------------------------------------

While testing this feature in-house, we ran into a sharding problem that occurs when the document being updated is indexed inside a block
 and the collection has more than one shard.
 Currently, an update must provide the document's id in addition to the field being updated.
 When that document is inside a block, the update can be routed to the wrong shard, because the shard on which the block was indexed was calculated from the root document's id.
 For example, when this document:
{code:javascript}
 {"id": "1", "children": [{"id": "20", "string_s": "ex"}]} {code}
is updated with:
{code:javascript}
{"id": "20", "grand_children": {"add": [{"id": "21", "string_s": "ex"}]}}{code}
the update can be routed to another shard, where the block does not exist. The update is then indexed on that shard,
 splitting our block into two pieces that live on two separate shards.
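
To make the failure mode concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `zlib.crc32` stands in for Solr's real routing hash, the shard count of 2 is arbitrary, and `shard_for` and `find_misrouted_child` are hypothetical helper names, not Solr APIs.

```python
import zlib

NUM_SHARDS = 2  # assumed shard count, chosen only for the illustration


def shard_for(route_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Toy stand-in for hash-based routing (crc32 here, not Solr's real hash)."""
    return zlib.crc32(route_key.encode("utf-8")) % num_shards


def find_misrouted_child(root_id: str, candidate_ids):
    """Return the first candidate id that hashes to a different shard than root_id."""
    root_shard = shard_for(root_id)
    for candidate in candidate_ids:
        if shard_for(candidate) != root_shard:
            return candidate
    return None


root_id = "1"
child_id = find_misrouted_child(root_id, (str(i) for i in range(2, 100)))

# The whole block was indexed on the shard chosen from the root's id ...
print("block lives on shard", shard_for(root_id))
# ... but an update addressed only by the child's id may hash elsewhere.
print("update for child", child_id, "routes to shard", shard_for(child_id))
```

Any child whose own id hashes differently from the root's id is affected, which is why the problem only shows up with more than one shard.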

After skimming through DistributedUpdateProcessor, I can suggest three possible solutions.
 # If the schema is nested, the routing method (in DistributedUpdateProcessor) can check whether the document exists in any shard (lookup by id),
 find out whether it is inside a block (_root_), and route the update using the hash of _root_.
 # Very similar to the previous method, except that the _root_ lookup is done only if the document being updated is not found in the shard it was routed to; the other shards are then asked whether the document exists inside a block, and the update command is re-routed.
 # The user provides the _root_, which is not ideal in terms of user friendliness.
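
The first two options can be sketched roughly as follows. This is a toy model, not Solr code: plain dicts stand in for shards, `zlib.crc32` stands in for the real routing hash, and every function name (`lookup_root`, `route_eager`, `route_lazy`, `index_block`) is hypothetical.

```python
import zlib
from typing import Dict, List, Optional

# Each toy "shard" maps a document id to its stored fields, including _root_.
Shard = Dict[str, dict]


def shard_for(route_key: str, num_shards: int) -> int:
    """Toy stand-in for hash-based routing (crc32, not Solr's real hash)."""
    return zlib.crc32(route_key.encode("utf-8")) % num_shards


def lookup_root(shards: List[Shard], doc_id: str) -> Optional[str]:
    """Ask every shard for doc_id; return its _root_ value if it exists anywhere."""
    for shard in shards:
        doc = shard.get(doc_id)
        if doc is not None:
            return doc.get("_root_", doc_id)
    return None


def route_eager(shards: List[Shard], doc_id: str) -> int:
    """Option 1: always resolve _root_ first, then route on its hash."""
    root = lookup_root(shards, doc_id)
    return shard_for(root if root is not None else doc_id, len(shards))


def route_lazy(shards: List[Shard], doc_id: str) -> int:
    """Option 2: route by id; look up _root_ only if the document is not there."""
    target = shard_for(doc_id, len(shards))
    if doc_id in shards[target]:
        return target
    root = lookup_root(shards, doc_id)
    return target if root is None else shard_for(root, len(shards))


def index_block(shards: List[Shard], root_id: str, child_ids: List[str]) -> None:
    """Index a root and its children together on the shard chosen by the root id."""
    home = shards[shard_for(root_id, len(shards))]
    for doc_id in [root_id, *child_ids]:
        home[doc_id] = {"id": doc_id, "_root_": root_id}


shards: List[Shard] = [{}, {}]
index_block(shards, "1", ["20"])

# Both strategies send an update for child "20" to the shard holding its block.
print(route_eager(shards, "20"), route_lazy(shards, "20"))
```

In this model the eager variant pays for a cross-shard lookup on every update, while the lazy variant pays only when the first routing attempt misses, which mirrors the performance trade-off between the two options.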

IMO the third option should be the last resort, since it is the least user-friendly of the three.
 My only concern regarding the first two options is the performance hit they might cause.

Another concern, which David has discussed, is the implications for the update log.
Would ensuring DistributedUpdateProcessor is run before RunUpdateProcessor be of any help?
I must admit I am not very familiar with these features of Solr.

WDYT [~dsmiley], [~caomanhdat]?


was (Author: moshebla):
While testing this feature in-house, we ran into a sharding problem that occurs when the document being updated is indexed inside a block
and the collection has more than one shard.
Currently, an update must provide the document's id in addition to the field being updated.
When that document is inside a block, the update can be routed to the wrong shard, because the shard on which the block was indexed was calculated from the root document's id.
For example, when this document:
{code:javascript} {"id": "1", "children": [{"id": "20", "string_s": "ex"}]} {code}
is updated with:
{code:javascript}{"id": "20", "grand_children": {"add": [{"id": "21", "string_s": "ex"}]}}{code}
the update can be routed to another shard, where the block does not exist. The update is then indexed on that shard,
splitting our block into two pieces that live on two separate shards.

After skimming through DistributedUpdateProcessor, I can suggest three possible solutions.

# If the schema is nested, the routing method (in DistributedUpdateProcessor) can check whether the document exists in any shard (lookup by id),
find out whether it is inside a block (_root_), and route the update using the hash of _root_.
# Very similar to the previous method, except that the _root_ lookup is done only if the document being updated is not found in the shard it was routed to; the other shards are then asked whether the document exists inside a block, and the update command is re-routed.
# The user provides the _root_, which is not ideal in terms of user friendliness.

IMO the third option should be the last resort, since it is the least user-friendly of the three.
My only concern regarding the first two options is the performance hit they might cause.

WDYT [~dsmiley], [~caomanhdat]?

> Support atomic updates of nested/child documents for nested-enabled schema
> --------------------------------------------------------------------------
>
>                 Key: SOLR-12638
>                 URL: https://issues.apache.org/jira/browse/SOLR-12638
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: mosh
>            Priority: Major
>         Attachments: SOLR-12638-delete-old-block-no-commit.patch, SOLR-12638-nocommit.patch
>
>          Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> I have been toying with the thought of using this transformer in conjunction with NestedUpdateProcessor and AtomicUpdate to allow SOLR to completely re-index the entire nested structure. This is just a thought, I am still thinking about implementation details. Hopefully I will be able to post a more concrete proposal soon.


