Posted to commits@beam.apache.org by "Tim Robertson (JIRA)" <ji...@apache.org> on 2018/05/23 11:10:00 UTC

[jira] [Comment Edited] (BEAM-4389) Enable partial updates for Elasticsearch

    [ https://issues.apache.org/jira/browse/BEAM-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487085#comment-16487085 ] 

Tim Robertson edited comment on BEAM-4389 at 5/23/18 11:09 AM:
---------------------------------------------------------------

Thanks for the quick reply [~echauchot]

The {{withUsePartialUpdate(true)}} option would simply change the {{bulk}} list sent to ES to contain {{update}} instead of {{index}} operations. On the server side, Elasticsearch treats this as a "get document, apply edits, save document" operation.
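
To illustrate the "get document, apply edits, save document" behaviour, here is a rough in-memory sketch. It is not Elasticsearch code; the class and method names ({{PartialUpdateSketch}}, {{merge}}) and the sample field names are made up for illustration, and it assumes the partial document is merged recursively into the stored one, with non-object values replaced.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: mimics a partial update on plain maps.
class PartialUpdateSketch {

    // Returns a copy of `stored` with the fields of `partial` merged in.
    // Nested objects are merged recursively; any other value is replaced.
    @SuppressWarnings("unchecked")
    static Map<String, Object> merge(Map<String, Object> stored, Map<String, Object> partial) {
        Map<String, Object> result = new LinkedHashMap<>(stored);
        for (Map.Entry<String, Object> entry : partial.entrySet()) {
            Object existing = result.get(entry.getKey());
            Object incoming = entry.getValue();
            if (existing instanceof Map && incoming instanceof Map) {
                result.put(entry.getKey(),
                    merge((Map<String, Object>) existing, (Map<String, Object>) incoming));
            } else {
                result.put(entry.getKey(), incoming);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Document as written by a (hypothetical) taxonomic pipeline.
        Map<String, Object> stored = new LinkedHashMap<>();
        stored.put("id", "1");
        stored.put("kingdom", "Animalia");

        // Partial document written later by a (hypothetical) geospatial pipeline.
        Map<String, Object> partial = new LinkedHashMap<>();
        partial.put("country", "DK");

        // Existing fields survive; only the new field is added.
        System.out.println(merge(stored, partial)); // {id=1, kingdom=Animalia, country=DK}
    }
}
```

With an {{index}} operation the second write would replace the whole document, so the taxonomic fields would be lost. That is the case this issue is trying to avoid.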

In our code I think it would be something as simple as exposing the configuration toggle and changing:
{code}
  batch.add(String.format("{ \"index\" : %s }%n%s%n", documentAddress, document));
{code}

to

{code}
  String operation = spec.isPartialUpdate() ? "update" : "index";
  batch.add(String.format("{ \"%s\" : %s }%n%s%n", operation, documentAddress, document));
{code}
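
One detail worth checking here: in the Elasticsearch bulk API, the line following an {{update}} action is expected to carry the partial document under a {{doc}} key rather than the bare document. A minimal sketch of how both cases could be formatted, assuming a hypothetical {{bulkEntry}} helper and a plain boolean standing in for {{spec.isPartialUpdate()}}:

```java
// Hypothetical helper, not the actual ElasticsearchIO code.
class BulkLineSketch {

    // Formats one entry of a bulk request body.
    // documentAddress is the JSON action metadata, e.g. { "_id" : "1" }.
    static String bulkEntry(boolean partialUpdate, String documentAddress, String document) {
        if (partialUpdate) {
            // An update action takes the partial document wrapped in "doc".
            return String.format(
                "{ \"update\" : %s }%n{ \"doc\" : %s }%n", documentAddress, document);
        }
        // An index action takes the full document as is.
        return String.format("{ \"index\" : %s }%n%s%n", documentAddress, document);
    }

    public static void main(String[] args) {
        String address = "{ \"_id\" : \"1\" }";
        String document = "{\"country\":\"DK\"}";
        System.out.print(bulkEntry(false, address, document));
        System.out.print(bulkEntry(true, address, document));
    }
}
```

Whether a missing document should be created on update (for example via {{doc_as_upsert}}) is a separate decision that this sketch leaves open.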
 
New fields being introduced and schema compatibility seem no different to the current model (you can already push nonsense JSON to a live Elasticsearch today). Or am I overlooking something?


> Enable partial updates for Elasticsearch
> ----------------------------------------
>
>                 Key: BEAM-4389
>                 URL: https://issues.apache.org/jira/browse/BEAM-4389
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-java-elasticsearch
>    Affects Versions: 2.4.0
>            Reporter: Tim Robertson
>            Assignee: Tim Robertson
>            Priority: Major
>
> Expose a configuration option on the {{ElasticsearchIO}} to enable partial updates rather than full document inserts. 
> Rationale: We have the case where different pipelines process different categories of information of the target entity (e.g. one for taxonomic processing, another for geospatial processing). A read and merge is not possible inside the batch call, meaning the only way to do it is through a join. The join approach is slow, and also prevents running a single process in isolation (e.g. reprocessing the geospatial component of all docs).
> This configuration parameter only makes sense when used in conjunction with controlling the document ID (possible since BEAM-3201).
> The client API would include a {{withUsePartialUpdate(true)}} option, for example:
> {code}
> source.apply(
>   ElasticsearchIO.write()
>     .withConnectionConfiguration(connectionConfiguration)
>     .withIdFn(new ExtractValueFn("id"))
>     .withUsePartialUpdate(true));
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)