You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by "tonytanger (via GitHub)" <gi...@apache.org> on 2023/01/24 23:50:55 UTC

[GitHub] [beam] tonytanger opened a new pull request, #25153: Initial commit of boilerplate of change stream pipeline for bigtable

tonytanger opened a new pull request, #25153:
URL: https://github.com/apache/beam/pull/25153

   Adding the skeleton of the change stream pipeline of Google Cloud Bigtable. The logic is similar to that of Cloud Spanner's change stream connector.
   
   Pipeline contains 3 steps
   * InitializeDoFn - Responsible to validating preconditions such as creating metadata table, ensuring options are valid and correct, and determine a valid start time for the pipeline.
   * DetectNewPartitionsDoFn - Manages the lifecycles of partitions being streamed. This includes initially outputting the initial list of partitions to be streamed, and as partitions split and merge, create new partitions to be streamed. DetectNewPartitions is also responsible to ensure all partitions are streamed and non are missing and keep track of the watermark of the entire pipeline.
   * ReadChangeStreamPartitionDoFn - Streams individual partitions and output data changes. It also handles termination, whether that's the end of the pipeline termination, or the partition has split or merged so it no longer exists.
   
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [ ] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
    - [ ] Update `CHANGES.md` with noteworthy changes.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/get-started-contributing/#make-the-reviewers-job-easier).
   
   To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md)
   
   GitHub Actions Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
   [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Go tests](https://github.com/apache/beam/workflows/Go%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule)
   
   See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] pabloem commented on a diff in pull request #25153: Initial commit of boilerplate of change stream pipeline for bigtable

Posted by "pabloem (via GitHub)" <gi...@apache.org>.
pabloem commented on code in PR #25153:
URL: https://github.com/apache/beam/pull/25153#discussion_r1088094820


##########
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java:
##########
@@ -208,6 +222,40 @@ public static Write write() {
     return Write.create();
   }
 
+  /**
+   * Creates an uninitialized {@link BigtableIO.ReadChangeStream}. Before use, the {@code
+   * ReadChangeStream} must be initialized with
+   *
+   * <ul>
+   *   <li>{@link BigtableIO.ReadChangeStream#withProjectId}
+   *   <li>{@link BigtableIO.ReadChangeStream#withInstanceId}
+   *   <li>{@link BigtableIO.ReadChangeStream#withTableId}
+   *   <li>{@link BigtableIO.ReadChangeStream#withAppProfileId}
+   * </ul>
+   *
+   * <p>And optionally with
+   *
+   * <ul>
+   *   <li>{@link BigtableIO.ReadChangeStream#withStartTime} which defaults to now.
+   *   <li>{@link BigtableIO.ReadChangeStream#withEndTime} which defaults to empty.
+   *   <li>{@link BigtableIO.ReadChangeStream#withHeartbeatDuration} with defaults to 1 seconds.
+   *   <li>{@link BigtableIO.ReadChangeStream#withMetadataTableProjectId} which defaults to value
+   *       from {@link BigtableIO.ReadChangeStream#withProjectId}
+   *   <li>{@link BigtableIO.ReadChangeStream#withMetadataTableInstanceId} which defaults to value
+   *       from {@link BigtableIO.ReadChangeStream#withInstanceId}
+   *   <li>{@link BigtableIO.ReadChangeStream#withMetadataTableTableId} which defaults to {@link
+   *       MetadataTableAdminDao#DEFAULT_METADATA_TABLE_NAME}
+   *   <li>{@link BigtableIO.ReadChangeStream#withMetadataTableAppProfileId} which defaults to value
+   *       from {@link BigtableIO.ReadChangeStream#withAppProfileId}
+   *   <li>{@link BigtableIO.ReadChangeStream#withChangeStreamName} which defaults to randomly
+   *       generated string.
+   * </ul>
+   */
+  @Experimental

Review Comment:
   thanks for the good docs here. Do you think it makes sense to add some more documentation strings to the header of this file? e.g. links to how to start a changestream, the sort of data the transform outputs, the expected throughput, how to use with Beam schema (if supported), etc?
   
   happy to leave for next PR, but I think it will be important.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] pabloem merged pull request #25153: Initial commit of boilerplate of change stream pipeline for bigtable

Posted by "pabloem (via GitHub)" <gi...@apache.org>.
pabloem merged PR #25153:
URL: https://github.com/apache/beam/pull/25153


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org