Posted to issues@beam.apache.org by "Niel Markwick (JIRA)" <ji...@apache.org> on 2019/03/22 14:26:00 UTC

[jira] [Updated] (BEAM-6887) Streaming Spanner Writer transform

     [ https://issues.apache.org/jira/browse/BEAM-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niel Markwick updated BEAM-6887:
--------------------------------
    Description: 
At present, the SpannerIO.Write/WriteGrouped transforms work by collecting an entire bundle of elements, sorting them by table/key, splitting the sorted list into batches (by size and by the number of cells modified), and then writing each batch to Spanner in a single transaction.

It returns a SpannerWriteResult containing:
 # a PCollection<Void> (the main output), which has no elements but is closed to signal that all the input elements have been written (which never happens in streaming mode, because the input is unbounded)
 # a PCollection<MutationGroup> of elements that failed to write.

This transform is useful as a bulk sink because it writes large amounts of data efficiently.
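
For illustration, a minimal sketch of how the existing transform is used today. The instance/database IDs and the upstream PCollection are placeholders, and the accessor names are the SpannerWriteResult methods as I recall them, so treat this as a sketch rather than a definitive example:

{code:java}
import com.google.cloud.spanner.Mutation;
import org.apache.beam.sdk.io.gcp.spanner.MutationGroup;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.io.gcp.spanner.SpannerWriteResult;
import org.apache.beam.sdk.values.PCollection;

/** Applies the existing batch-oriented write to an upstream PCollection of Mutations. */
static SpannerWriteResult writeToSpanner(PCollection<Mutation> mutations) {
  SpannerWriteResult result =
      mutations.apply(
          SpannerIO.write()
              .withInstanceId("my-instance")    // placeholder
              .withDatabaseId("my-database"));  // placeholder

  // Elements that could not be written are exposed for error handling.
  PCollection<MutationGroup> failed = result.getFailedMutations();

  // The main output is a PCollection<Void>: a completion signal in batch mode,
  // but nothing that downstream steps can consume in a streaming pipeline.
  PCollection<Void> done = result.getOutput();

  return result;
}
{code}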

It is not at all useful as an intermediate step in a streaming pipeline, because it has no useful output in streaming mode.

I propose adding a separate Spanner Write transform that simply writes each input Mutation to the database and then pushes the successfully written Mutations onto its output.
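
As a rough sketch of what such a transform could look like (illustrative only: the class name, the per-element write, and the client setup are not a committed design, and batching/retry behaviour is an open question):

{code:java}
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.DatabaseId;
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import java.util.Collections;
import org.apache.beam.sdk.transforms.DoFn;

/** Illustrative DoFn: writes each Mutation individually and re-emits it on success. */
class StreamingWriteFn extends DoFn<Mutation, Mutation> {
  private transient Spanner spanner;
  private transient DatabaseClient client;

  @Setup
  public void setup() {
    // Placeholder project/instance/database IDs; a real transform would take
    // these from a SpannerConfig-style configuration object.
    spanner = SpannerOptions.newBuilder().build().getService();
    client = spanner.getDatabaseClient(
        DatabaseId.of("my-project", "my-instance", "my-database"));
  }

  @ProcessElement
  public void processElement(@Element Mutation mutation, OutputReceiver<Mutation> out) {
    // One write per element; failures surface as exceptions here, and how they
    // should be reported (dead-letter output, retries) needs to be decided.
    client.write(Collections.singletonList(mutation));
    out.output(mutation);
  }

  @Teardown
  public void teardown() {
    if (spanner != null) {
      spanner.close();
    }
  }
}
{code}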

This would allow it to be used in the middle of a streaming pipeline (a sketch follows the list below), where the flow would be:
 * Some data streamed in
 * Converted to Spanner Mutations
 * Written to Spanner Database
 * Further processing where the values written to the Spanner Database are used.
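
A hedged sketch of that flow, assuming a Pub/Sub source and the illustrative StreamingWriteFn above; EventToMutationFn and HandleWrittenValueFn are hypothetical placeholders for the conversion and post-processing steps:

{code:java}
import com.google.cloud.spanner.Mutation;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

Pipeline pipeline = Pipeline.create();

// Some data streamed in, converted to Spanner Mutations (placeholder DoFns).
PCollection<Mutation> mutations =
    pipeline
        .apply("ReadEvents", PubsubIO.readStrings().fromTopic("projects/p/topics/events"))
        .apply("ToMutations", ParDo.of(new EventToMutationFn()));

// Written to Spanner, then the successfully written values feed further processing.
mutations
    .apply("WriteToSpanner", ParDo.of(new StreamingWriteFn()))
    .apply("PostProcess", ParDo.of(new HandleWrittenValueFn()));

pipeline.run();
{code}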

  was:
At present, the [SpannerIO.Write(Grouped)|http://go/gh/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L892] transform works by collecting an entire bundle of elements, sorting them by table/key, splitting the sorted list into batches (by size and by the number of cells modified), and then writing each batch to Spanner in a single transaction.

It returns an [object|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerWriteResult.java] containing:
 # a PCollection<Void> (the main output), which has no elements but is closed to signal that all the input elements have been written (which never happens in streaming mode, because the input is unbounded)
 # a PCollection<MutationGroup> of elements that failed to write.
 

This transform is useful as a bulk sink because it writes large amounts of data efficiently.


It is not at all useful as an intermediate step in a streaming pipeline, because it has no useful output in streaming mode.


I propose adding a separate Spanner Write transform that simply writes each input Mutation to the database and then pushes the successfully written Mutations onto its output.

This would allow it to be used in the middle of a streaming pipeline, where the flow would be:

 * Some data streamed in
 * Converted to Spanner Mutations
 * Written to Spanner Database
 * Further processing where the values written to the Spanner Database are used.


> Streaming Spanner Writer transform
> ----------------------------------
>
>                 Key: BEAM-6887
>                 URL: https://issues.apache.org/jira/browse/BEAM-6887
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-java-gcp
>            Reporter: Niel Markwick
>            Assignee: Niel Markwick
>            Priority: Minor
>


