You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/10/29 23:12:05 UTC

[GitHub] [beam] boyuanzz opened a new pull request #13227: [WIP] Add splittable dofn as the recommended way of building connectors.

boyuanzz opened a new pull request #13227:
URL: https://github.com/apache/beam/pull/13227


   **Please** add a meaningful description for your change here
   
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`).
    - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
    - [ ] Update `CHANGES.md` with noteworthy changes.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   
   Post-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   Lang | SDK | Dataflow | Flink | Samza | Spark | Twister2
   --- | --- | --- | --- | --- | --- | ---
   Go | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/) | ---
   Java | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/badge/i
 con)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)<br>[![Build Status](htt
 ps://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Twister2/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Twister2/lastCompletedBuild/)
   Python | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python38/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python38/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/)<br>[![Build Status](https://ci-beam
 .apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Python_PVR_Flink_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Python_PVR_Flink_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/) | ---
   XLang | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Direct/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Direct/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/) | ---
   
   Pre-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   --- |Java | Python | Go | Website | Whitespace | Typescript
   --- | --- | --- | --- | --- | --- | ---
   Non-portable | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocker_Cron/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocker_Cron/lastCompletedBuild/) <br>[![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocs_Cron/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocs_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/be
 am_PreCommit_Go_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Whitespace_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Whitespace_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Typescript_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Typescript_Cron/lastCompletedBuild/)
   Portable | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/) | --- | --- | --- | ---
   
   See [.test-infra/jenkins/README](https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md) for trigger phrase, status and link of all Jenkins jobs.
   
   
   GitHub Actions Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
   [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
   
   See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] boyuanzz commented on a change in pull request #13227: [BEAM-10480] Add splittable dofn as the recommended way of building connectors.

Posted by GitBox <gi...@apache.org>.
boyuanzz commented on a change in pull request #13227:
URL: https://github.com/apache/beam/pull/13227#discussion_r528964281



##########
File path: website/www/site/content/en/documentation/io/developing-io-overview.md
##########
@@ -46,33 +46,32 @@ are the recommended steps to get started:
 For **bounded (batch) sources**, there are currently two options for creating a
 Beam source:
 
+1. Use `Splittable DoFn`.
+
 1. Use `ParDo` and `GroupByKey`.
 
-1. Use the `Source` interface and extend the `BoundedSource` abstract subclass.
 
-`ParDo` is the recommended option, as implementing a `Source` can be tricky. See
-[When to use the Source interface](#when-to-use-source) for a list of some use
-cases where you might want to use a `Source` (such as
-[dynamic work rebalancing](/blog/2016/05/18/splitAtFraction-method.html)).
+`Splittable DoFn` is the recommended option, as it's the new source framework for both bounded and

Review comment:
       `Most recent` sounds good. I'm curious why `new` is not perferred?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] rosetn commented on a change in pull request #13227: [BEAM-10480] Add splittable dofn as the recommended way of building connectors.

Posted by GitBox <gi...@apache.org>.
rosetn commented on a change in pull request #13227:
URL: https://github.com/apache/beam/pull/13227#discussion_r530678693



##########
File path: website/www/site/content/en/documentation/io/developing-io-overview.md
##########
@@ -46,33 +46,32 @@ are the recommended steps to get started:
 For **bounded (batch) sources**, there are currently two options for creating a
 Beam source:
 
+1. Use `Splittable DoFn`.
+
 1. Use `ParDo` and `GroupByKey`.
 
-1. Use the `Source` interface and extend the `BoundedSource` abstract subclass.
 
-`ParDo` is the recommended option, as implementing a `Source` can be tricky. See
-[When to use the Source interface](#when-to-use-source) for a list of some use
-cases where you might want to use a `Source` (such as
-[dynamic work rebalancing](/blog/2016/05/18/splitAtFraction-method.html)).
+`Splittable DoFn` is the recommended option, as it's the most recent source framework for both
+bounded and unbounded sources. This is meant to replace the `Source` APIs(
+[BoundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/BoundedSource.html) and
+[UnboundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/UnboundedSource.html))
+in the new system. Please read
+[Splittable DoFn Programming Guide](/learn/programming-guide/#splittable-dofns) for how to write one
+Splittable DoFn. For more information, see the
+[roadmap for multi-SDK connector efforts](/roadmap/connectors-multi-sdk/).
 
-(Java only) For **unbounded (streaming) sources**, you must use the `Source`
-interface and extend the `UnboundedSource` abstract subclass. `UnboundedSource`
-supports features that are useful for streaming pipelines, such as
-checkpointing.
+For Java and Python **unbounded (streaming) sources**, you must use the `Splittable DoFn`, which
+supports features that are useful for streaming pipelines, including checkpointing, controlling
+watermark, tracking backlog.

Review comment:
       Missing "and"
   
   "watermark, and tracking backlog."

##########
File path: website/www/site/content/en/documentation/io/developing-io-overview.md
##########
@@ -46,33 +46,32 @@ are the recommended steps to get started:
 For **bounded (batch) sources**, there are currently two options for creating a
 Beam source:
 
+1. Use `Splittable DoFn`.
+
 1. Use `ParDo` and `GroupByKey`.
 
-1. Use the `Source` interface and extend the `BoundedSource` abstract subclass.
 
-`ParDo` is the recommended option, as implementing a `Source` can be tricky. See
-[When to use the Source interface](#when-to-use-source) for a list of some use
-cases where you might want to use a `Source` (such as
-[dynamic work rebalancing](/blog/2016/05/18/splitAtFraction-method.html)).
+`Splittable DoFn` is the recommended option, as it's the most recent source framework for both
+bounded and unbounded sources. This is meant to replace the `Source` APIs(
+[BoundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/BoundedSource.html) and
+[UnboundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/UnboundedSource.html))
+in the new system. Please read

Review comment:
       Remove instances of "please" on these pages: https://developers.google.com/style/tone#politeness




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] rosetn commented on a change in pull request #13227: [BEAM-10480] Add splittable dofn as the recommended way of building connectors.

Posted by GitBox <gi...@apache.org>.
rosetn commented on a change in pull request #13227:
URL: https://github.com/apache/beam/pull/13227#discussion_r527219546



##########
File path: website/www/site/content/en/documentation/io/developing-io-overview.md
##########
@@ -46,33 +46,32 @@ are the recommended steps to get started:
 For **bounded (batch) sources**, there are currently two options for creating a
 Beam source:
 
+1. Use `Splittable DoFn`.
+
 1. Use `ParDo` and `GroupByKey`.
 
-1. Use the `Source` interface and extend the `BoundedSource` abstract subclass.
 
-`ParDo` is the recommended option, as implementing a `Source` can be tricky. See
-[When to use the Source interface](#when-to-use-source) for a list of some use
-cases where you might want to use a `Source` (such as
-[dynamic work rebalancing](/blog/2016/05/18/splitAtFraction-method.html)).
+`Splittable DoFn` is the recommended option, as it's the new source framework for both bounded and

Review comment:
       Is there a way to avoid the word "new" that works? 
   
   "Most recent"? "Provides the most support"?

##########
File path: website/www/site/content/en/documentation/io/developing-io-java.md
##########
@@ -17,6 +17,9 @@ limitations under the License.
 -->
 # Developing I/O connectors for Java
 
+**IMPORTANT:** Please use ``Splittable DoFn`` to develop your new I/O. For more details, please read

Review comment:
       Remove instances of "please" on this page
   
   Reference: https://developers.google.com/style/tone#politeness

##########
File path: website/www/site/content/en/documentation/io/developing-io-overview.md
##########
@@ -90,22 +89,40 @@ performance:
   jobs. Depending on your data source, dynamic work rebalancing might not be
   possible.
 
-* **Splitting into parts of particular size recommended by the runner:** `ParDo`
-  does not receive `desired_bundle_size` as a hint from runners when performing
-  initial splitting.
+* **Splitting initially to increase parallelism:** `ParDo`
+  does not have the ability to perform initial splitting.
 
 For example, if you'd like to read from a new file format that contains many
 records per file, or if you'd like to read from a key-value store that supports
 read operations in sorted key order.
 
-### Source lifecycle {#source}
-Here is a sequence diagram that shows the lifecycle of the Source during
- the execution of the Read transform of an IO. The comments give useful
- information to IO developers such as the constraints that
- apply to the objects or particular cases such as streaming mode.
-
- <!-- The source for the sequence diagram can be found in the the SVG resource. -->
-![This is a sequence diagram that shows the lifecycle of the Source](/images/source-sequence-diagram.svg)
+### Real World IO Examples Using Splittable DoFn

Review comment:
       I'd replace this with: 
   
   I/O examples using SDFs

##########
File path: website/www/site/content/en/documentation/io/developing-io-overview.md
##########
@@ -46,33 +46,32 @@ are the recommended steps to get started:
 For **bounded (batch) sources**, there are currently two options for creating a
 Beam source:
 
+1. Use `Splittable DoFn`.
+
 1. Use `ParDo` and `GroupByKey`.
 
-1. Use the `Source` interface and extend the `BoundedSource` abstract subclass.
 
-`ParDo` is the recommended option, as implementing a `Source` can be tricky. See
-[When to use the Source interface](#when-to-use-source) for a list of some use
-cases where you might want to use a `Source` (such as
-[dynamic work rebalancing](/blog/2016/05/18/splitAtFraction-method.html)).
+`Splittable DoFn` is the recommended option, as it's the new source framework for both bounded and
+unbounded sources. This is meant to replace the `Source` APIs(
+[BoundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/BoundedSource.html) and
+[UnboundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/UnboundedSource.html))
+in the new system. Please read
+[Splittable DoFn Programming Guide](/learn/programming-guide/#splittable-dofns) for how to write one
+Splittable DoFn. For more information, see the
+[roadmap for multi-SDK connector efforts](/roadmap/connectors-multi-sdk/).
 
-(Java only) For **unbounded (streaming) sources**, you must use the `Source`
-interface and extend the `UnboundedSource` abstract subclass. `UnboundedSource`
-supports features that are useful for streaming pipelines, such as
-checkpointing.
+For java and python **unbounded (streaming) sources**, you must use the `Splittable DoFn`, which

Review comment:
       Capitalize Java and Python




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] boyuanzz merged pull request #13227: [BEAM-10480] Add splittable dofn as the recommended way of building connectors.

Posted by GitBox <gi...@apache.org>.
boyuanzz merged pull request #13227:
URL: https://github.com/apache/beam/pull/13227


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] boyuanzz commented on pull request #13227: [BEAM-10480] Add splittable dofn as the recommended way of building connectors.

Posted by GitBox <gi...@apache.org>.
boyuanzz commented on pull request #13227:
URL: https://github.com/apache/beam/pull/13227#issuecomment-732399816


   Updated the PR with suggestions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org