Posted to issues@solr.apache.org by "Chris M. Hostetter (Jira)" <ji...@apache.org> on 2021/03/25 18:27:00 UTC

[jira] [Commented] (SOLR-15294) Support "post-indexing" cleanup of documents with duplicate signatures

    [ https://issues.apache.org/jira/browse/SOLR-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308889#comment-17308889 ] 

Chris M. Hostetter commented on SOLR-15294:
-------------------------------------------

One idea for an implementation of this would be a new "Stream Decorator" that could consume a (sorted) stream of all documents in the collection and emit only the documents that have the same value in a configured (signature) field as the document that preceded them – essentially the inverse of how the {{unique()}} stream decorator works – so that the resulting stream could be fed into the existing {{delete()}} decorator.

So given a collection of documents that might look like...
{noformat}
id,signature,importance
1, X,        100
2, Y,        5
3, Y,        50
4, X,        13
5, Z,        4
6, X,        50
{noformat}
You could use something like...
{code:java}
 delete(collection1,
        batchSize=500,
        not_unique(
          over="signature",
          search(collection1,
                 q="*:*",
                 qt="/export",
                 fl="id,signature,importance",
                 sort="signature asc, importance desc, id asc")))
{code}
...to delete documents 6, 4, and 2, because those are the documents that would be emitted by the hypothetical {{not_unique}} decorator based on the (sorted) output of the search...
{noformat}
1, X,        100
6, X,        50
4, X,        13
3, Y,        50
2, Y,        5
5, Z,        4
{noformat}
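
To make the intended semantics concrete, here is a minimal sketch of the "emit a document only if it has the same signature as the one before it" logic. It is plain Python standing in for the hypothetical decorator, not Solr code: the function name {{not_unique}}, its {{over}} parameter, and the in-memory list of documents are all assumptions made for illustration, and the {{sorted()}} call merely imitates the sort the {{search()}} stream would do.

```python
from typing import Any, Dict, Iterable, Iterator

Doc = Dict[str, Any]


def not_unique(docs: Iterable[Doc], over: str) -> Iterator[Doc]:
    """Yield each doc whose `over` value equals that of the doc just before it.

    Hypothetical stand-in for the proposed decorator. It assumes `docs` is
    already sorted so that equal `over` values are adjacent, and that every
    doc contains the `over` field.
    """
    previous = object()  # sentinel that never equals a real field value
    for doc in docs:
        current = doc[over]
        if current == previous:
            yield doc  # same signature as the preceding doc: a duplicate
        previous = current


# The sample collection from the comment above.
docs = [
    {"id": 1, "signature": "X", "importance": 100},
    {"id": 2, "signature": "Y", "importance": 5},
    {"id": 3, "signature": "Y", "importance": 50},
    {"id": 4, "signature": "X", "importance": 13},
    {"id": 5, "signature": "Z", "importance": 4},
    {"id": 6, "signature": "X", "importance": 50},
]

# Imitates sort="signature asc, importance desc, id asc".
ordered = sorted(docs, key=lambda d: (d["signature"], -d["importance"], d["id"]))

duplicates = [d["id"] for d in not_unique(ordered, over="signature")]
print(duplicates)  # [6, 4, 2]
```

The first document in each run of equal signatures is never emitted, so the sort order decides which document survives: here, the one with the highest importance.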

> Support "post-indexing" cleanup of documents with duplicate signatures
> ----------------------------------------------------------------------
>
>                 Key: SOLR-15294
>                 URL: https://issues.apache.org/jira/browse/SOLR-15294
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Priority: Major
>
> Since there is no way to (efficiently) have a document "overwrite" some existing document with a different {{'id'}} but the same value in a {{'signature'}} field, we should see if we can implement a solution to "clean up" these kinds of pseudo-duplicates after a "batch" of indexing.
> In the trivial case of adding one document, a Delete-By-Query (DBQ) for {{(signatureField:sig -id:currentDoc)}} could be run right after adding {{currentDoc}} ... but this doesn't scale well when adding many, many docs and broadcasting these DBQs across many shards (an operation which requires a distributed, collection-wide lock to ensure atomicity).
> It would be nice if Solr offered some kind of _efficient_ functionality for accomplishing the same eventual goal, in a way that could be run after a bulk indexing job, or periodically under continuous indexing, such that "duplicate" documents would _eventually_ be cleaned up.


