Posted to issues@solr.apache.org by "Chris M. Hostetter (Jira)" <ji...@apache.org> on 2021/03/25 18:27:00 UTC
[jira] [Commented] (SOLR-15294) Support "post-indexing" cleanup of
documents with duplicate signatures
[ https://issues.apache.org/jira/browse/SOLR-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308889#comment-17308889 ]
Chris M. Hostetter commented on SOLR-15294:
-------------------------------------------
One idea for an implementation of this would be a new "Stream Decorator" that could consume a (sorted) stream of all documents in the collection, and would emit only the documents that had the same value in a configured (signature) field as the document that preceded them (essentially the inverse of how the {{unique()}} stream decorator works), so that the resulting stream could be fed into the existing {{delete()}} decorator.
So given a collection of documents that might look like...
{noformat}
id,signature,importance
1, X, 100
2, Y, 5
3, Y, 50
4, X, 13
5, Z, 4
6, X, 50
{noformat}
You could use something like...
{code:java}
delete(collection1,
       batchSize=500,
       not_unique(
         over="signature",
         search(collection1,
                q="*:*",
                qt="/export",
                fl="id,signature,importance",
                sort="signature asc, importance desc, id asc")))
{code}
...to delete documents 6, 4, and 2, because those are the documents that would be emitted by the hypothetical {{not_unique}} decorator based on the (sorted) output of the search...
{noformat}
1, X, 100
6, X, 50
4, X, 13
3, Y, 50
2, Y, 5
5, Z, 4
{noformat}
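To make the intended semantics concrete, here is a rough sketch of the filtering logic in Python. It is only an illustration: {{not_unique}} does not exist in Solr, and the function name, the dict-per-document representation, and the in-memory sort (standing in for the sorted {{/export}} stream) are all assumptions made for this example.

```python
from typing import Any, Dict, Iterable, Iterator

# Marker meaning "no previous document yet"; never equal to a real field value.
_NO_PREVIOUS = object()


def not_unique(docs: Iterable[Dict[str, Any]], over: str) -> Iterator[Dict[str, Any]]:
    """Hypothetical inverse of unique(): yield each document whose `over`
    field equals that of the document immediately before it.

    Assumes `docs` is already sorted so equal `over` values are adjacent.
    """
    previous = _NO_PREVIOUS
    for doc in docs:
        value = doc[over]
        if value == previous:
            yield doc
        previous = value


# The sample collection from the comment above.
DOCS = [
    {"id": 1, "signature": "X", "importance": 100},
    {"id": 2, "signature": "Y", "importance": 5},
    {"id": 3, "signature": "Y", "importance": 50},
    {"id": 4, "signature": "X", "importance": 13},
    {"id": 5, "signature": "Z", "importance": 4},
    {"id": 6, "signature": "X", "importance": 50},
]

# Stand-in for sort="signature asc, importance desc, id asc".
sorted_docs = sorted(DOCS, key=lambda d: (d["signature"], -d["importance"], d["id"]))

# The ids that would be handed to delete().
duplicate_ids = [d["id"] for d in not_unique(sorted_docs, over="signature")]
print(duplicate_ids)
```

Because the sort puts the most "important" document first within each signature, only the less important duplicates are emitted, and the filter itself needs to remember just one previous value rather than buffering the stream.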
> Support "post-indexing" cleanup of documents with duplicate signatures
> ----------------------------------------------------------------------
>
> Key: SOLR-15294
> URL: https://issues.apache.org/jira/browse/SOLR-15294
> Project: Solr
> Issue Type: Sub-task
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Chris M. Hostetter
> Priority: Major
>
> Since there is no way to (efficiently) have a document "overwrite" some existing document with a different {{'id'}} but the same value in a {{'signature'}} field, we should see if we can implement a solution to "clean up" these kinds of pseudo-duplicates after a "batch" of indexing.
> In the trivial case of adding one document, a Delete-By-Query (DBQ) for {{(signatureField:sig -id:currentDoc)}} could be run right after adding {{currentDoc}} ... but this doesn't scale well when adding many docs and broadcasting these DBQs across many shards (an operation which requires a distributed, collection-wide lock to ensure atomicity).
> It would be nice if Solr offered some kind of _efficient_ functionality for accomplishing the same eventual goal, in a way that could be run after a bulk indexing job, or periodically under continuous indexing, such that "duplicate" documents would _eventually_ be cleaned up.