Posted to issues@solr.apache.org by "Chris M. Hostetter (Jira)" <ji...@apache.org> on 2021/03/25 18:30:00 UTC
[jira] [Updated] (SOLR-3473) Distributed deduplication broken when
using non-uniqueKey for signatureField
[ https://issues.apache.org/jira/browse/SOLR-3473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris M. Hostetter updated SOLR-3473:
-------------------------------------
Description:
The current state of things (as of 8.8) is that SignatureUpdateProcessorFactory *CAN* be safely used in SolrCloud for two possible use cases:
* For de-duplication:
** the signatureField _MUST_ be the uniqueKey field *AND* the processor _MUST_ be configured to run prior to DistributedUpdateProcessor
* Solely for generating signatures, w/o de-duplication
** overwriteDupes _MUST_ be set to false ... any signatureField may be used, and it may run at any point in the processor chain
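As a rough illustration of the two supported setups above, a solrconfig.xml fragment might look something like the following. This is only a sketch: the chain names, the assumption that the uniqueKey field is called "id", the "signature" field, the field list, and the choice of Lookup3Signature are all placeholders and are not taken from this issue. overwriteDupes is deliberately left unset in the first chain, since no requirement for it is stated for that case.

```xml
<!-- Sketch only: chain names, the "id" uniqueKey, the "signature" field and
     the field list are hypothetical placeholders, not taken from the issue. -->

<!-- Use case 1: de-duplication. signatureField is the uniqueKey field and the
     processor is listed before DistributedUpdateProcessorFactory. -->
<updateRequestProcessorChain name="dedupe-on-uniquekey">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- Use case 2: signature generation only. overwriteDupes is false, so any
     signatureField and any position in the chain is acceptable. -->
<updateRequestProcessorChain name="signature-only">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

In the first chain DistributedUpdateProcessorFactory is listed explicitly so that the ordering requirement is visible in the config rather than left implicit.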
If you attempt to use SignatureUpdateProcessorFactory for de-duplication w/ a non-uniqueKey signature field, one of two failure situations is likely to arise:
* in a multi-shard collection, documents with identical signatureField values will not be removed from any shard (leader) other than the one the document is routed to (by its id)
* even in a single-shard collection with multiple replicas, documents with identical signatureField values will *only* be deleted on the 'leader' and not on any other replicas, because the leader does not propagate the {{AddUpdateCommand.updateTerm}} computed by the SignatureUpdateProcessorFactory to each of its replicas
{panel:title=original bug report}
Solr's deduplication via the SignatureUpdateProcessor is broken for distributed updates on SolrCloud.
Mark Miller:
{quote}
Looking again at the SignatureUpdateProcessor code, I think that indeed this won't currently work with distrib updates. Could you file a JIRA issue for that? The problem is that we convert update commands into solr documents - and that can cause a loss of info if an update proc modifies the update command.
I think the reason that you see a multiple values error when you try the other order is because of the lack of a document clone (the other issue I mentioned a few emails back). Addressing that won't solve your issue though - we have to come up with a way to propagate the currently lost info on the update command.
{quote}
Please see the ML thread for the full discussion: http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html
{panel}
was:
Solr's deduplication via the SignatureUpdateProcessor is broken for distributed updates on SolrCloud.
Mark Miller:
{quote}
Looking again at the SignatureUpdateProcessor code, I think that indeed this won't currently work with distrib updates. Could you file a JIRA issue for that? The problem is that we convert update commands into solr documents - and that can cause a loss of info if an update proc modifies the update command.
I think the reason that you see a multiple values error when you try the other order is because of the lack of a document clone (the other issue I mentioned a few emails back). Addressing that won't solve your issue though - we have to come up with a way to propagate the currently lost info on the update command.
{quote}
Please see the ML thread for the full discussion: http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html
Summary: Distributed deduplication broken when using non-uniqueKey for signatureField (was: Distributed deduplication broken)
> Distributed deduplication broken when using non-uniqueKey for signatureField
> ----------------------------------------------------------------------------
>
> Key: SOLR-3473
> URL: https://issues.apache.org/jira/browse/SOLR-3473
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud, update
> Affects Versions: 4.0-ALPHA
> Reporter: Markus Jelsma
> Priority: Major
> Fix For: 4.9, 6.0
>
> Attachments: SOLR-3473-trunk-2.patch, SOLR-3473.patch, SOLR-3473.patch
>
>
> The current state of things (as of 8.8) is that SignatureUpdateProcessorFactory *CAN* be safely used in SolrCloud for two possible use cases:
> * For de-duplication:
> ** the signatureField _MUST_ be the uniqueKey field *AND* the processor _MUST_ be configured to run prior to DistributedUpdateProcessor
> * Solely for generating signatures, w/o de-duplication
> ** overwriteDupes _MUST_ be set to false ... any signatureField may be used, and it may run at any point in the processor chain
> If you attempt to use SignatureUpdateProcessorFactory for de-duplication w/ a non-uniqueKey signature field, one of two failure situations is likely to arise:
> * in a multi-shard collection, documents with identical signatureField values will not be removed from any shard (leader) other than the one the document is routed to (by its id)
> * even in a single-shard collection with multiple replicas, documents with identical signatureField values will *only* be deleted on the 'leader' and not on any other replicas, because the leader does not propagate the {{AddUpdateCommand.updateTerm}} computed by the SignatureUpdateProcessorFactory to each of its replicas
> {panel:title=original bug report}
> Solr's deduplication via the SignatureUpdateProcessor is broken for distributed updates on SolrCloud.
> Mark Miller:
> {quote}
> Looking again at the SignatureUpdateProcessor code, I think that indeed this won't currently work with distrib updates. Could you file a JIRA issue for that? The problem is that we convert update commands into solr documents - and that can cause a loss of info if an update proc modifies the update command.
> I think the reason that you see a multiple values error when you try the other order is because of the lack of a document clone (the other issue I mentioned a few emails back). Addressing that won't solve your issue though - we have to come up with a way to propagate the currently lost info on the update command.
> {quote}
> Please see the ML thread for the full discussion: http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html
> {panel}