Posted to commits@beam.apache.org by "Etienne Chauchot (JIRA)" <ji...@apache.org> on 2017/11/16 15:55:00 UTC

[jira] [Commented] (BEAM-3201) ElasticsearchIO should deal with documents id

    [ https://issues.apache.org/jira/browse/BEAM-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16255513#comment-16255513 ] 

Etienne Chauchot commented on BEAM-3201:
----------------------------------------

Users have offered to contribute. So, here are my design comments:

1. To keep the API unchanged, I propose sticking with PCollection<json>.

2. We provide a method {{withDocumentIdField(String fieldName)}} in the write transform that lets the user specify which JSON field is used as the document id. The document id field stays part of the payload, so that a write followed by a read returns exactly the same document. Keeping the id field in the payload also makes {{read.withDocumentIdField(String fieldName)}} unnecessary; only {{write.withDocumentIdField(String fieldName)}} is needed.

3. That parameter is optional. If it is set and a record with the same doc id is inserted twice, ES automatically updates the document (no code needed in Beam). If it is not set, the id is auto-generated on the ES side (same as now).
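
To make points 1 to 3 concrete, here is a minimal sketch of how a write transform could turn one JSON document into an entry for the Elasticsearch bulk API, with and without a document id field. The class {{BulkActionSketch}} and its methods are hypothetical and are not part of ElasticsearchIO. The regex lookup is a simplification to keep the sketch dependency-free (a real implementation would use a JSON parser), and failing when the id field is missing is an assumption that the proposal above does not settle.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of ElasticsearchIO. */
final class BulkActionSketch {

  /**
   * Naive lookup of the first string-valued occurrence of a field.
   * Assumption: flat documents with no escaped quotes in the id value.
   * A real implementation would use a JSON parser instead of a regex.
   */
  static Optional<String> extractStringField(String json, String fieldName) {
    Pattern p = Pattern.compile(
        "\"" + Pattern.quote(fieldName) + "\"\\s*:\\s*\"([^\"\\\\]*)\"");
    Matcher m = p.matcher(json);
    return m.find() ? Optional.of(m.group(1)) : Optional.empty();
  }

  /**
   * Builds one bulk entry: an action line followed by the unmodified document.
   * The id field stays in the payload, so a write followed by a read returns
   * the same document. A null idFieldName means "let ES generate the id".
   */
  static String toBulkEntry(String json, String idFieldName) {
    String meta = "{}"; // no _id: Elasticsearch auto-generates one
    if (idFieldName != null) {
      // Assumption: a missing id field is treated as an error.
      String id = extractStringField(json, idFieldName)
          .orElseThrow(() -> new IllegalArgumentException(
              "Document has no string field '" + idFieldName + "'"));
      meta = "{ \"_id\" : \"" + id + "\" }";
    }
    return "{ \"index\" : " + meta + " }\n" + json + "\n";
  }

  public static void main(String[] args) {
    String doc = "{\"user_id\":\"u42\",\"name\":\"Ada\"}";
    System.out.print(toBulkEntry(doc, "user_id")); // explicit id: re-insert updates
    System.out.print(toBulkEntry(doc, null));      // auto-generated id: always a new doc
  }
}
```

With an explicit {{_id}}, sending the same entry twice makes ES overwrite the existing document, which is the update behaviour described in point 3. Without it, each entry creates a new document, as today.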

I propose that {{withDocumentIdField(String fieldName)}} be optional for these reasons:
a. It is backward compatible.
b. Users who do not want to update documents and just want to insert should not have to generate a UUID per document.
c. The ES connector made by the ES dev team also makes the doc id optional.
d. Forcing users to provide a doc id to work around the lack of exactly-once semantics is not a good choice. Exactly-once semantics is a broader discussion in the project than just dealing with it in the datastores.

> ElasticsearchIO should deal with documents id
> ---------------------------------------------
>
>                 Key: BEAM-3201
>                 URL: https://issues.apache.org/jira/browse/BEAM-3201
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-extensions
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>
> Today the ESIO only inserts the payload of the ES documents. Elasticsearch generates a document id for each record inserted, so each new insertion is considered a new document. Users want to be able to update documents using the IO. So, for the write part of the IO, users should be able to provide a document id so that they can update already stored documents. Providing an id for the documents could also help users with idempotency.


