Posted to dev@lucene.apache.org by "Tim Owen (JIRA)" <ji...@apache.org> on 2017/01/03 17:39:58 UTC
[jira] [Created] (SOLR-9918) An UpdateRequestProcessor to skip
duplicate inserts and ignore updates to missing docs
Tim Owen created SOLR-9918:
------------------------------
Summary: An UpdateRequestProcessor to skip duplicate inserts and ignore updates to missing docs
Key: SOLR-9918
URL: https://issues.apache.org/jira/browse/SOLR-9918
Project: Solr
Issue Type: Improvement
Security Level: Public (Default Security Level. Issues are Public)
Components: update
Reporter: Tim Owen
This is an UpdateRequestProcessor and factory that we have been using in production to handle two common cases that were awkward to achieve with the existing update pipeline and processor classes:
* When inserting document(s), if some already exist then quietly skip the new document inserts - do not churn the index by replacing the existing documents and do not throw a noisy exception that breaks the batch of inserts. By analogy with SQL, {{insert if not exists}}. In our use-case, multiple application instances can (rarely) process the same input so it's easier for us to de-dupe these at Solr insert time than to funnel them into a global ordered queue first.
* When applying AtomicUpdate documents, if a document being updated does not exist, quietly do nothing - do not create a new partially-populated document and do not throw a noisy exception about missing required fields. By analogy with SQL, {{update where id = ..}}. Our use-case relies on this because we apply updates optimistically, with only best-effort knowledge of which documents exist, so it's easiest to skip the updates (in the same way a database would).
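To make the two cases concrete, here is a toy in-memory model in Java. It is only a sketch of the semantics described above, not the attached patch and not Solr code: the class SkipExistingSketch, its add and doc helpers, and the map standing in for the index are all hypothetical. It also assumes that an atomic update can be recognised by a field whose value is a map such as set -> value, and it only handles the "set" operation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the two behaviours. Hypothetical names throughout; no Solr classes.
class SkipExistingSketch {
    // Stand-in for the index: unique key -> stored fields.
    final Map<String, Map<String, Object>> index = new LinkedHashMap<>();
    final boolean skipInsertIfExists;
    final boolean skipUpdateIfMissing;

    SkipExistingSketch(boolean skipInsertIfExists, boolean skipUpdateIfMissing) {
        this.skipInsertIfExists = skipInsertIfExists;
        this.skipUpdateIfMissing = skipUpdateIfMissing;
    }

    // Convenience builder: doc("id", "1", "title", "x") -> {id=1, title=x}.
    static Map<String, Object> doc(Object... keyValues) {
        Map<String, Object> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            m.put((String) keyValues[i], keyValues[i + 1]);
        }
        return m;
    }

    // Assumption: a document is an atomic update when any field value is a Map.
    static boolean isAtomicUpdate(Map<String, Object> doc) {
        for (Object value : doc.values()) {
            if (value instanceof Map) {
                return true;
            }
        }
        return false;
    }

    // Returns true if the document was applied, false if it was quietly skipped.
    boolean add(Map<String, Object> doc) {
        String id = (String) doc.get("id");
        boolean exists = index.containsKey(id);

        if (isAtomicUpdate(doc)) {
            if (!exists && skipUpdateIfMissing) {
                return false; // like "update where id = ..": nothing matched, nothing done
            }
            // Without the skip, this toy creates a partially-populated document.
            Map<String, Object> stored = index.computeIfAbsent(id, k -> new LinkedHashMap<>());
            stored.put("id", id);
            for (Map.Entry<String, Object> field : doc.entrySet()) {
                if (field.getValue() instanceof Map) {
                    Map<?, ?> op = (Map<?, ?>) field.getValue();
                    if (op.containsKey("set")) {
                        stored.put(field.getKey(), op.get("set"));
                    }
                }
            }
            return true;
        }

        if (exists && skipInsertIfExists) {
            return false; // like "insert if not exists": keep the existing document
        }
        index.put(id, new LinkedHashMap<>(doc)); // plain insert, or replace when not skipping
        return true;
    }
}
```

With both flags set to true, adding a document with id 1 twice leaves the first copy untouched, and an atomic update for an unknown id leaves the index unchanged. In both cases add returns false and nothing is thrown, so the rest of a batch can proceed.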
I would have kept this in our own package hierarchy, but it relies on some package-scoped methods, and it seems like it could be useful to others if they choose to configure it. Some bits of the code were borrowed from {{DocBasedVersionConstraintsProcessorFactory}}.
The attached patch includes unit tests that confirm the behaviour.
This class can be used by configuring solrconfig.xml like so:
{noformat}
<updateRequestProcessorChain name="skipexisting">
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.SkipExistingDocumentsProcessorFactory">
    <bool name="skipInsertIfExists">true</bool>
    <bool name="skipUpdateIfMissing">false</bool> <!-- We will override this per-request -->
  </processor>
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
{noformat}
and initParams defaults of:
{noformat}
<str name="update.chain">skipexisting</str>
{noformat}
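The comment in the chain configuration mentions overriding skipUpdateIfMissing per-request. One plausible reading is that a request parameter, when present, takes precedence over the configured default. The sketch below illustrates that fallback rule only. The class PerRequestFlags and its resolve method are hypothetical, and using request parameter names that match the config keys is an assumption rather than something this description states.

```java
import java.util.Map;

// Hypothetical helper showing "request parameter wins, else configured default".
class PerRequestFlags {
    static boolean resolve(Map<String, String> requestParams, String name, boolean configuredDefault) {
        String raw = requestParams.get(name);
        // Absent parameter: fall back to the value from solrconfig.xml.
        return raw == null ? configuredDefault : Boolean.parseBoolean(raw);
    }
}
```

Under this reading, a request carrying skipUpdateIfMissing=true would enable the skip for that request even though the chain configures it as false, while requests without the parameter would keep the configured behaviour.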
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)