You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Koji Sekiguchi (JIRA)" <ji...@apache.org> on 2017/01/04 02:26:58 UTC

[jira] [Commented] (SOLR-9918) An UpdateRequestProcessor to skip duplicate inserts and ignore updates to missing docs

    [ https://issues.apache.org/jira/browse/SOLR-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796928#comment-15796928 ] 

Koji Sekiguchi commented on SOLR-9918:
--------------------------------------

I believe the proposal is very useful for users who need this function, but it is better for users if there is an additional explanation of the difference from the existing one that gives similar function.

How do users decide which UpdateRequestProcessor to use for their use cases as compared to SignatureUpdateProcessor?

> An UpdateRequestProcessor to skip duplicate inserts and ignore updates to missing docs
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-9918
>                 URL: https://issues.apache.org/jira/browse/SOLR-9918
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: update
>            Reporter: Tim Owen
>         Attachments: SOLR-9918.patch
>
>
> This is an UpdateRequestProcessor and Factory that we have been using in production, to handle 2 common cases that were awkward to achieve using the existing update pipeline and current processor classes:
> * When inserting document(s), if some already exist then quietly skip the new document inserts - do not churn the index by replacing the existing documents and do not throw a noisy exception that breaks the batch of inserts. By analogy with SQL, {{insert if not exists}}. In our use-case, multiple application instances can (rarely) process the same input so it's easier for us to de-dupe these at Solr insert time than to funnel them into a global ordered queue first.
> * When applying AtomicUpdate documents, if a document being updated does not exist, quietly do nothing - do not create a new partially-populated document and do not throw a noisy exception about missing required fields. By analogy with SQL, {{update where id = ..}}. Our use-case relies on this because we apply updates optimistically and have best-effort knowledge about what documents will exist, so it's easiest to skip the updates (in the same way a Database would).
> I would have kept this in our own package hierarchy but it relies on some package-scoped methods, and seems like it could be useful to others if they choose to configure it. Some bits of the code were borrowed from {{DocBasedVersionConstraintsProcessorFactory}}.
> Attached patch has unit tests to confirm the behaviour.
> This class can be used by configuring solrconfig.xml like so..
> {noformat}
>   <updateRequestProcessorChain name="skipexisting">
>     <processor class="solr.LogUpdateProcessorFactory" />
>     <processor class="org.apache.solr.update.processor.SkipExistingDocumentsProcessorFactory">
>       <bool name="skipInsertIfExists">true</bool>
>       <bool name="skipUpdateIfMissing">false</bool> <!-- We will override this per-request -->
>     </processor>
>     <processor class="solr.DistributedUpdateProcessorFactory" />
>     <processor class="solr.RunUpdateProcessorFactory" />
>   </updateRequestProcessorChain>
> {noformat}
> and initParams defaults of
> {noformat}
>       <str name="update.chain">skipexisting</str>
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org