Posted to issues@solr.apache.org by "Chris M. Hostetter (Jira)" <ji...@apache.org> on 2021/03/25 18:26:00 UTC

[jira] [Created] (SOLR-15294) Support "post-indexing" cleanup of documents with duplicate signatures

Chris M. Hostetter created SOLR-15294:
-----------------------------------------

             Summary: Support "post-indexing" cleanup of documents with duplicate signatures
                 Key: SOLR-15294
                 URL: https://issues.apache.org/jira/browse/SOLR-15294
             Project: Solr
          Issue Type: Sub-task
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Chris M. Hostetter



Since there is no way to (efficiently) have a document "overwrite" some existing document with a different {{'id'}} but the same value in a {{'signature'}} field, we should see if we can implement a solution to "clean up" these kinds of pseudo-duplicates after a "batch" of indexing.

In the trivial case of adding one document, a Delete-By-Query (DBQ) for {{(signatureField:sig -id:currentDoc)}} could be run right after adding {{currentDoc}} ... but this doesn't scale well when adding many docs and broadcasting these DBQs across many shards (an operation which requires a distributed, collection-wide lock to ensure atomicity).
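
As a rough illustration of the single-document case, the sketch below builds the request body for such a DBQ. It assumes the standard query parser and Solr's JSON update format ({{{"delete": {"query": ...}}}}). The helper names {{quote_term}} and {{dedupe_dbq}} are hypothetical, and {{signatureField}} is just the placeholder field name used above.

```python
import json


def quote_term(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def dedupe_dbq(signature: str, current_id: str,
               sig_field: str = "signatureField",
               id_field: str = "id") -> str:
    """Build a JSON update body that deletes every doc sharing `signature`
    except the doc whose id is `current_id`.

    Hypothetical helper: the field names are placeholders, not Solr defaults.
    """
    query = (f"{sig_field}:{quote_term(signature)} "
             f"-{id_field}:{quote_term(current_id)}")
    return json.dumps({"delete": {"query": query}})
```

Sending one such body per added document is exactly the pattern that does not scale: every call is a separate DBQ broadcast to all shards.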

It would be nice if Solr offered some kind of _efficient_ functionality for accomplishing the same eventual goal, in a way that could be run after a bulk indexing job, or periodically under continuous indexing, such that "duplicate" documents would _eventually_ be cleaned up.
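
To make the intended end state concrete, here is a minimal in-memory sketch of what such a cleanup would converge to. It is not a proposed Solr implementation. It assumes that the most recently indexed document for each signature is the one to keep (mirroring "overwrite" semantics) and that each id appears only once in the input. The function name {{ids_to_delete}} is hypothetical.

```python
def ids_to_delete(docs):
    """Return ids of docs superseded by a later doc with the same signature.

    `docs` is an iterable of (doc_id, signature) pairs in indexing order.
    Assumption: the newest doc per signature survives, and ids are unique.
    """
    latest = {}  # signature -> id of the newest doc seen so far
    stale = []   # ids that a later doc has superseded
    for doc_id, signature in docs:
        previous = latest.get(signature)
        if previous is not None:
            stale.append(previous)
        latest[signature] = doc_id
    return stale
```

For a batch such as {{[("a", "s1"), ("b", "s2"), ("c", "s1")]}}, only {{a}} would be removed, since {{c}} carries the same signature and was indexed later.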




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org