Posted to commits@beam.apache.org by "Alexander Hoem Rosbach (JIRA)" <ji...@apache.org> on 2017/10/17 18:29:00 UTC

[jira] [Comment Edited] (BEAM-3039) DatastoreIO.Write fails multiple mutations of same entity

    [ https://issues.apache.org/jira/browse/BEAM-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208098#comment-16208098 ] 

Alexander Hoem Rosbach edited comment on BEAM-3039 at 10/17/17 6:28 PM:
------------------------------------------------------------------------

Would it offer any advantage to use GroupByKey instead of Distinct?

What do you think about adding features to DatastoreIO to avoid the issue? They could be optional parameters on the write transform if you don't agree that this is a bug. In my opinion it is a bug, since streaming data from Pub/Sub into Datastore is presumably a common use case for Dataflow pipelines.

For instance:
{code}
.apply(DatastoreIO.v1().write()
    .withProjectId(options.getProject())
    .removeDuplicatesWithinCommits());
{code}
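
Until something like that exists, a workaround with transforms already in the SDK might look like the sketch below. It assumes entities is the unbounded PCollection<Entity> read from Pub/Sub, with its coder already set upstream; the 30-second window and the string rendering of the key are arbitrary illustrative choices, not recommendations:
{code}
import com.google.datastore.v1.Entity;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

// entities: unbounded PCollection<Entity> read from Pub/Sub.
PCollection<Entity> deduped = entities
    // Distinct groups by key internally, so an unbounded input must be windowed.
    .apply(Window.into(FixedWindows.of(Duration.standardSeconds(30))))
    // Keep one element per Datastore key within each window; which duplicate
    // survives is arbitrary, which is fine when the duplicates are identical.
    .apply(Distinct
        .withRepresentativeValueFn((Entity e) -> e.getKey().toString())
        .withRepresentativeType(TypeDescriptors.strings()));

deduped.apply(DatastoreIO.v1().write().withProjectId(options.getProject()));
{code}
On the GroupByKey question: Distinct already shuffles by a representative key under the hood, so an explicit GroupByKey would mainly add value if we wanted to merge conflicting mutations for the same key rather than keep an arbitrary one.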



> DatastoreIO.Write fails multiple mutations of same entity
> ---------------------------------------------------------
>
>                 Key: BEAM-3039
>                 URL: https://issues.apache.org/jira/browse/BEAM-3039
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-extensions
>    Affects Versions: 2.1.0
>            Reporter: Alexander Hoem Rosbach
>            Assignee: Chamikara Jayalath
>            Priority: Minor
>
> When streaming messages from a source that guarantees at-least-once delivery rather than exactly-once delivery, DatastoreIO.Write throws an exception, which leads to Dataflow retrying the same commit multiple times before giving up. This creates a significant bottleneck in the pipeline, with the end result that the data is dropped. This should be handled better.
> There are a number of ways to fix this. One of them could be to drop any duplicate mutations within one batch (see the sketch after the stack trace below). Non-duplicate mutations of the same entity should also be handled in some way: perhaps use a NON-TRANSACTIONAL commit, or make sure the mutations are committed in different commits.
> {code}
> com.google.datastore.v1.client.DatastoreException: A non-transactional commit may not contain multiple mutations affecting the same entity., code=INVALID_ARGUMENT
>         com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:126)
>         com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:169)
>         com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:89)
>         com.google.datastore.v1.client.Datastore.commit(Datastore.java:84)
>         org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.flushBatch(DatastoreV1.java:1288)
>         org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.processElement(DatastoreV1.java:1253) 
> {code}
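> To make the "drop duplicate mutations within one batch" idea concrete, below is a minimal sketch of what DatastoreWriterFn.flushBatch could do before committing. The keyOf helper is hypothetical, not existing DatastoreV1 code, and the sketch assumes a last-write-wins policy per entity key:
> {code}
> import java.util.LinkedHashMap;
> import java.util.Map;
> import com.google.datastore.v1.CommitRequest;
> import com.google.datastore.v1.Key;
> import com.google.datastore.v1.Mutation;
>
> // Hypothetical helper (not in DatastoreV1): extract the Key a Mutation
> // affects so duplicates within a batch can be detected.
> private static Key keyOf(Mutation m) {
>   switch (m.getOperationCase()) {
>     case INSERT: return m.getInsert().getKey();
>     case UPDATE: return m.getUpdate().getKey();
>     case UPSERT: return m.getUpsert().getKey();
>     case DELETE: return m.getDelete();
>     default: throw new IllegalArgumentException("mutation has no operation");
>   }
> }
>
> // Sketch of a dedupe step before the commit in flushBatch: keep only the
> // last mutation per key (LinkedHashMap preserves first-seen order while
> // put() replaces earlier mutations for the same key).
> Map<Key, Mutation> byKey = new LinkedHashMap<>();
> for (Mutation mutation : mutations) {
>   byKey.put(keyOf(mutation), mutation);
> }
> CommitRequest.Builder commitRequest = CommitRequest.newBuilder();
> commitRequest.addAllMutations(byKey.values());
> commitRequest.setMode(CommitRequest.Mode.NON_TRANSACTIONAL);
> {code}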



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)