Posted to user@beam.apache.org by Luke Cwik <lc...@google.com> on 2020/10/13 16:24:19 UTC

Re: Adding transactional writer to SpannerIO

+user <us...@beam.apache.org> for feedback from users.

As long as users know that they must structure their transactions to be
repeatable, or are OK with a transaction occurring multiple times, that
should be fine.
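
One way to make a read-modify-write transaction repeatable is to record an
ID for each processed element in the same transaction as the update, so that
a retried element becomes a no-op. The sketch below is not Beam or Spanner
code: it uses Python's built-in sqlite3 module as a stand-in database, and
the table names (applied, balances), the apply_once helper and the element
IDs are all hypothetical. It also assumes every element carries a stable,
unique ID.

```python
import sqlite3


def apply_once(conn: sqlite3.Connection, element_id: str, account: str, delta: int) -> bool:
    """Apply delta to account at most once per element_id.

    Returns True if the update was applied, False if this element_id was
    already recorded (a replay).
    """
    try:
        # "with conn" commits if the block succeeds and rolls back if it raises,
        # so the marker row and the balance update land together or not at all.
        with conn:
            conn.execute("INSERT INTO applied (element_id) VALUES (?)", (element_id,))
            conn.execute(
                "UPDATE balances SET amount = amount + ? WHERE account = ?",
                (delta, account),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key on applied.element_id rejected a duplicate: a replay.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE applied (element_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL)")
conn.execute("INSERT INTO balances VALUES ('alice', 100)")
conn.commit()

first = apply_once(conn, "evt-1", "alice", -30)   # original delivery
second = apply_once(conn, "evt-1", "alice", -30)  # retried delivery of the same element
balance = conn.execute("SELECT amount FROM balances WHERE account = 'alice'").fetchone()[0]
print(first, second, balance)
```

The second call changes nothing because the marker row already exists, so
the balance is debited only once.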

Has most of the focus been on a serializable function supplied by customers,
or would something using Spanner DML make more sense?

On Tue, Oct 13, 2020 at 2:37 AM Niel Markwick <ni...@google.com> wrote:

> Hey Beam-dev...
>
> I recently had an interaction with a customer that wanted to run a
> read-update-write transform on a Cloud Spanner DB inside a streaming Beam
> pipeline. I suggested writing their own DoFn and pointed them at some of
> the pitfalls they need to avoid (at least those that have been found and
> fixed in the Beam SpannerIO.Write transform!).
>
> This is not the first time I have had this request, and I was thinking
> about how to introduce a generic transactional RW Spanner writer: the user
> would supply a serializable function that takes the input element and
> performs the read-update-write, while the transform wraps this function in
> the code required to handle the Spanner connection and transaction,
> potentially adding batching -- running multiple transactions at once.
>
> Would this be something that the community could find useful? Should I
> productionize the PoC I have and submit a PR?
>
> In one sense it is against the 'repeatable
> <https://beam.apache.org/documentation/programming-guide/#user-code-idempotence>'
> recommendation of a DoFn (for example, a transaction that increments a DB
> counter would not be idempotent), but in another sense, it makes certain
> actions more reliable (e.g. processing bank account transfers).
>
> All opinions welcome.
>
> --
> Niel Markwick
> Cloud Solutions Architect <https://cloud.google.com/docs/tutorials>
> Google Belgium
> nielm@google.com
> +32 2 894 6771
>
>
> Google Belgium NV/SA, Steenweg op Etterbeek 180, 1040 Brussel, Belgie. RPR: 0878.065.378
>
> If you have received this communication by mistake, please don't forward
> it to anyone else (it may contain confidential or privileged information),
> please erase all copies of it, including all attachments, and please let
> the sender know it went to the wrong person. Thanks
>

Re: Adding transactional writer to SpannerIO

Posted by Reuven Lax <re...@google.com>.

To add to what Luke said, customers need to be warned because these
read-modify-write transactions are often not repeatable. For example,
imagine a transaction whose purpose is to atomically increment a counter:
read the value, increment it, write it back. If that transaction gets
executed twice, the counter will be incremented twice, which is unlikely to
be what the user wanted.
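
The difference between a repeatable and a non-repeatable write can be seen
in a few lines. This is a minimal sketch with a plain dict standing in for
the database; increment_counter and set_counter are hypothetical helpers,
not part of any Beam or Spanner API.

```python
def increment_counter(db: dict, key: str) -> None:
    """Read-modify-write: the result depends on the value already stored."""
    current = db.get(key, 0)   # read value
    db[key] = current + 1      # increment value, write value back


def set_counter(db: dict, key: str, value: int) -> None:
    """Blind write: the result does not depend on the value already stored."""
    db[key] = value


incremented = {"hits": 0}
increment_counter(incremented, "hits")  # first execution
increment_counter(incremented, "hits")  # same element executed again

overwritten = {"hits": 0}
set_counter(overwritten, "hits", 1)     # first execution
set_counter(overwritten, "hits", 1)     # same element executed again

print(incremented["hits"], overwritten["hits"])
```

Executing the increment twice leaves the counter at 2, while executing the
blind write twice leaves it at 1.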

It is possible to design a sink that protects against this. You could do
this by adding a new version field to every Spanner row and incrementing
the version inside the transaction (for every row involved in the
transaction). You would then store the new updated version in local state
using the stateful DoFn API. This allows the DoFn to detect replays: if,
when reading the version from the Spanner row, you find that the version is
greater than the version stored in the DoFn state, then you know this is a
repeat and you can simply skip the transaction.

If you do opt to go the above route, note that it will not work out of the
box, since DoFn inputs are not deterministically ordered. However, there is a
new ordered-list API in the process of being introduced that will make it
possible to implement this.
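
The version check described above can be sketched as a small simulation.
Nothing here is Beam or Spanner code: the Row class stands in for a Spanner
row, state_version stands in for per-key DoFn state, and the
crash_before_state_commit flag is a hypothetical way to simulate a worker
failing after the database commit but before the state update is persisted.
The sketch also assumes that the retried element is the same one that
failed, which is exactly the ordering guarantee noted above as missing
today.

```python
from dataclasses import dataclass


@dataclass
class Row:
    value: int = 0
    version: int = 0


class VersionGuardedIncrement:
    """Simulates one key of a stateful DoFn writing to one database row."""

    def __init__(self) -> None:
        self.row = Row()        # stands in for the Spanner row
        self.state_version = 0  # stands in for per-key DoFn state

    def process(self, delta: int, crash_before_state_commit: bool = False) -> bool:
        """Return True if delta was applied, False if skipped as a replay."""
        if self.row.version > self.state_version:
            # The row is ahead of our state: an earlier attempt committed to
            # the database but its state update was lost. Skip and resync.
            self.state_version = self.row.version
            return False
        # Simulated transaction body: update the value and bump the version.
        self.row.value += delta
        self.row.version += 1
        if crash_before_state_commit:
            raise RuntimeError("simulated worker failure after database commit")
        self.state_version = self.row.version
        return True


guard = VersionGuardedIncrement()
try:
    guard.process(5, crash_before_state_commit=True)  # commits, then "crashes"
except RuntimeError:
    pass
replay_applied = guard.process(5)  # the runner retries the same element
next_applied = guard.process(3)    # a new element is processed normally
print(replay_applied, next_applied, guard.row.value, guard.row.version)
```

The retried element is skipped because the row version (1) is ahead of the
stored state (0); once the state has been resynchronized, the next element
is applied as usual.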

Reuven

On Tue, Oct 13, 2020 at 10:25 AM Niel Markwick <ni...@google.com> wrote:

> In most cases the customers want to execute some conditional update based
> on values existing in the database, which means a
> read-calculate-write inside a RW transaction, which requires executing some
> provided logic. This will also allow customers to execute some DML inside
> the transaction.
>
> I think I would need to reimplement some of the functionality of DoFn in
> the provided function (such as @Setup, @Teardown, reading side inputs,
> writing to multiple outputs, etc.)
>

Re: Adding transactional writer to SpannerIO

Posted by Niel Markwick <ni...@google.com>.

In most cases the customers want to execute some conditional update based
on values existing in the database, which means a
read-calculate-write inside a RW transaction, which requires executing some
provided logic. This will also allow customers to execute some DML inside
the transaction.

I think I would need to reimplement some of the functionality of DoFn in
the provided function (such as @Setup, @Teardown, reading side inputs,
writing to multiple outputs, etc.)
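
The shape of the proposal in this thread (a user-supplied function run
inside a read-write transaction that the wrapper manages) can be sketched as
follows. This is a minimal in-memory illustration, not the PoC mentioned in
the thread and not the Spanner client API: Transaction, AbortedError,
run_in_transaction and transfer are hypothetical names, and a dict stands in
for the database.

```python
from typing import Callable, Dict, Tuple

Element = Tuple[str, str, int]  # (source account, destination account, amount)


class AbortedError(Exception):
    """Hypothetical signal that the transaction must be retried."""


class Transaction:
    """Buffers writes and applies them to the store only on commit."""

    def __init__(self, store: Dict[str, int]) -> None:
        self._store = store
        self._writes: Dict[str, int] = {}

    def read(self, key: str) -> int:
        # Reads see this transaction's own buffered writes first.
        return self._writes.get(key, self._store.get(key, 0))

    def write(self, key: str, value: int) -> None:
        self._writes[key] = value

    def commit(self) -> None:
        self._store.update(self._writes)


def run_in_transaction(
    store: Dict[str, int],
    fn: Callable[[Transaction, Element], None],
    element: Element,
    max_attempts: int = 3,
) -> None:
    """Wrapper side: owns begin, commit and retry; fn holds the user's logic."""
    for _ in range(max_attempts):
        txn = Transaction(store)
        try:
            fn(txn, element)
        except AbortedError:
            continue  # discard buffered writes and run fn again
        txn.commit()
        return
    raise RuntimeError(f"transaction aborted {max_attempts} times")


def transfer(txn: Transaction, element: Element) -> None:
    """User side: a conditional update based on values read in the transaction."""
    src, dst, amount = element
    if txn.read(src) >= amount:
        txn.write(src, txn.read(src) - amount)
        txn.write(dst, txn.read(dst) + amount)


store = {"a": 100, "b": 0}
for element in [("a", "b", 60), ("a", "b", 60)]:
    run_in_transaction(store, transfer, element)
print(store)
```

Because the wrapper may run the supplied function more than once for a
single element, the function should keep all of its effects inside the
transaction object it is given.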
