Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/12/09 01:46:41 UTC

[GitHub] [beam] dennisylyung edited a comment on pull request #12583: [BEAM-10706] Fix duplicate key error in DynamoDBIO.Write

dennisylyung edited a comment on pull request #12583:
URL: https://github.com/apache/beam/pull/12583#issuecomment-741419939


   In the current implementation, `private List<KV<String, WriteRequest>> batch`, the key is the table name, not the primary key.
   
   For example, in a table `user` whose primary key is `id`, an element would look like this:
   `KV("user", {id=1, name=Chris, age=30})`
   We have no way to know that `id` is the key to deduplicate on unless the user specifies it.
   
   In principle, writing to DynamoDB should not require specifying keys for de-duplication, since repeated writes to the same key simply update the value. However, the DynamoDB batch put API currently rejects a batch that contains duplicate keys. Hence, users need to explicitly set the overwrite keys.
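   
   To make the idea concrete, here is a minimal sketch of de-duplicating a batch on user-supplied overwrite keys. It is not the code in this PR: the class `BatchDedup`, the method `dedupeByKeys`, and the use of plain `Map<String, String>` items in place of `WriteRequest` are all hypothetical simplifications, and it assumes the last element seen for a key should win, as in an upsert.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of DynamoDBIO.
final class BatchDedup {

  /**
   * Returns the items with duplicates removed, where two items are
   * duplicates if they agree on every attribute named in overwriteKeys.
   * The last item seen for a key replaces earlier ones (upsert-style).
   * Output order follows the first occurrence of each key.
   */
  static List<Map<String, String>> dedupeByKeys(
      List<Map<String, String>> items, List<String> overwriteKeys) {
    // Composite key (list of key attribute values) -> latest item.
    Map<List<String>, Map<String, String>> latest = new LinkedHashMap<>();
    for (Map<String, String> item : items) {
      List<String> keyValues = new ArrayList<>();
      for (String keyName : overwriteKeys) {
        keyValues.add(item.get(keyName));
      }
      // put() on an existing key swaps the value but keeps its position.
      latest.put(keyValues, item);
    }
    return new ArrayList<>(latest.values());
  }
}
```

   With overwrite key `id`, a batch holding `{id=1, age=30}`, `{id=2, age=25}` and `{id=1, age=31}` would shrink to two items, keeping `age=31` for `id=1`. Without the key names there is nothing to group on, which is why the user has to supply them.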
   
   You are right that the overwrite keys are necessary to completely avoid `ValidationError`. As long as the sink operates with upsert semantics (i.e. the data may contain duplicate keys), there is a risk of the same key appearing more than once in a single batch. This is also the problem I ran into when developing pipelines with DynamoDB sinks.
   
   There is one special case, though. If users are certain that their keys never repeat, such as when their pipelines are logically append-only, they will not encounter `ValidationError`. In that case, requiring them to specify the keys could be unnecessary.
   

