Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/08/21 06:53:32 UTC

[GitHub] [beam] saavannanavati opened a new pull request #12657: [BEAM-10777] Add two blog posts detailing changes to the type hints module of the Python SDK

saavannanavati opened a new pull request #12657:
URL: https://github.com/apache/beam/pull/12657




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] udim commented on a change in pull request #12657: [BEAM-10777] Add two blog posts detailing changes to the type hints module of the Python SDK

Posted by GitBox <gi...@apache.org>.
udim commented on a change in pull request #12657:
URL: https://github.com/apache/beam/pull/12657#discussion_r475031405



##########
File path: website/www/site/content/en/blog/python-performance-runtime-type-checking.md
##########
@@ -0,0 +1,154 @@
+---
+layout: post
+title:  "Performance-Driven Runtime Type Checking for the Python SDK"
+date:   2020-08-21 00:00:01 -0800
+categories:
+  - blog 
+  - python 
+  - typing
+authors:
+  - saavan
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+In this blog post, we're announcing the upcoming release of a new, opt-in 
+runtime type checking system for Beam's Python SDK that's optimized for performance 
+in both development and production environments.
+
+But let's take a step back - why do we even care about runtime type checking 
+in the first place? Let's look at an example.
+
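+```
+import apache_beam as beam
+
+class MultiplyNumberByTwo(beam.DoFn):
+    def process(self, element: int):
+        # process() should yield its outputs rather than return a bare value.
+        yield element * 2
+
+p = beam.Pipeline()
+p | beam.Create(['1', '2']) | beam.ParDo(MultiplyNumberByTwo())
+```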
+
+In this code, we passed a list of strings to a DoFn that's clearly intended for use with
+integers. Luckily, this code will throw an error during pipeline construction because
+the inferred output type of `beam.Create(['1', '2'])` is `str` which is incompatible with
+the declared input type hint of `MultiplyNumberByTwo.process` which is `int`.
+
+However, what if we turned the pipeline type check off using the `no_pipeline_type_check` 
+flag? Or more realistically, what if the input PCollection to MultiplyNumberByTwo came 
+from a database, preventing inference of the output data type?

Review comment:
       I would write something like:
   "... database, and the output data type is not known before the pipeline starts running?"
   or
   "is only known at runtime"

##########
File path: website/www/site/content/en/blog/python-performance-runtime-type-checking.md
##########
@@ -0,0 +1,154 @@
+
+In either case, no error would be thrown during pipeline construction. 
+And even at runtime, this code works. Each string would be multiplied by 2, 
+yielding a result of `['11', '22']`, but that's certainly not the outcome we want.
+
+So how do you debug this breed of "hidden" errors? More broadly speaking, how do you
+debug any error message in Beam that's complex or confusing (e.g. serialization errors)?

Review comment:
       That is a very broad claim. :) This feature only helps with debugging typing issues.
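
The "hidden" error described in the hunk above can be reproduced without Beam. The following is a minimal plain-Python sketch; `multiply_by_two` is a hypothetical stand-in for the body of the DoFn, and the list stands in for elements whose type is only known at runtime. Python does not enforce annotations at call time, so string elements pass through silently:

```python
def multiply_by_two(element: int):
    # The annotation documents intent only; Python does not enforce it.
    return element * 2

# Elements whose type is only known at runtime, e.g. rows read from a database.
rows = ['1', '2']
results = [multiply_by_two(row) for row in rows]
print(results)  # string repetition, not arithmetic
```

No exception is raised here, which is why a check on actual values at runtime is needed to surface this class of typing bug.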

##########
File path: website/www/site/content/en/blog/python-improved-annotations.md
##########
@@ -0,0 +1,109 @@
+---
+layout: post
+title:  "Improved Annotation Support for the Python SDK"
+date:   2020-08-21 00:00:01 -0800
+categories:
+  - blog 
+  - python 
+  - typing
+authors:
+  - saavan
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+The importance of static type checking in a dynamically 
+typed language like Python is not up for debate. Type hints 
+allow developers to leverage a strong typing system to:
+ - write better code, 
+ - self-document ambiguous programming logic, and 
+ - inform intelligent code completion in IDEs like PyCharm.
+
+This is why we're excited to announce upcoming improvements to 
+the `typehints` module of Beam's Python SDK, including support 
+for typed PCollections and Python 3 style annotations on PTransforms.
+
+# Improved Annotations
+Today, you have the option to declare type hints on PTransforms using either
+class decorators or inline functions.
+
+For instance, a PTransform with decorated type hints might look like this:
+```
+@beam.typehints.with_input_types(int)
+@beam.typehints.with_output_types(str)
+class IntToStr(beam.PTransform):
+    def expand(self, pcoll):
+        return pcoll | beam.Map(lambda num: str(num))
+
+strings = numbers | IntToStr()
+```
+
+Using inline functions instead, the same transform would look like this:
+```
+class IntToStr(beam.PTransform):
+    def expand(self, pcoll):
+        return pcoll | beam.Map(lambda num: str(num))
+
+strings = numbers | IntToStr().with_input_types(int).with_output_types(str)
+```
+
+Both methods have problems. Class decorators are syntax-heavy, 
+requiring two additional lines of code, whereas inline functions provide type hints 
+that aren't reusable across other instances of the same transform. Additionally, both 
+methods are incompatible with static type checkers like MyPy.
+
+With Python 3 annotations, however, we can sidestep these problems to provide a 
+clean and reusable type hint experience. Our previous transform now looks like this:
+```
+from apache_beam.pvalue import PCollection
+
+class IntToStr(beam.PTransform):
+    def expand(self, pcoll: PCollection[int]) -> PCollection[str]:
+        return pcoll | beam.Map(lambda num: str(num))
+
+strings = numbers | IntToStr()
+```
+
+These type hints will actively hook into the internal Beam typing system to
+play a role in both pipeline type checking and runtime type checking.
+
+So how does this work?
+
+## Typed PCollections
+You guessed it! The PCollection class inherits from `typing.Generic`, allowing it to be 
+parameterized with either zero types (denoted `PCollection`) or one type (denoted `PCollection[T]`). 
+- A PCollection with zero types is implicitly converted to `PCollection[any]`.

Review comment:
       Capitalize `Any`
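
A small sketch of why the capitalization matters, using only the standard `typing` module. `Box` is a hypothetical stand-in for a generic container and is not Beam's `PCollection`. Lowercase `any` is the builtin function, whereas `typing.Any` is the construct meant for annotations:

```python
from typing import Any, Generic, TypeVar, get_args

T = TypeVar('T')

class Box(Generic[T]):
    """Hypothetical generic container standing in for a typed collection."""

# Lowercase `any` is the builtin function that tests an iterable for truthiness.
assert callable(any) and any([False, True])

# `Any` is the typing construct meant for annotations.
unparameterized = Box        # zero type parameters
parameterized = Box[Any]     # one type parameter
element_types = get_args(parameterized)
```

Here `element_types` holds the single type argument of `Box[Any]`, which is the kind of information a typing system can read back from a parameterized generic.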

##########
File path: website/www/site/content/en/blog/python-improved-annotations.md
##########
@@ -0,0 +1,109 @@
+- A PCollection with one type can have any nested type (e.g. `Union[int, str]`).
+
+Internally, Beam's typing system makes these annotations compatible with other 
+type hints by removing the outer PCollection container.
+
+## PBegin, PDone, None
+Finally, besides PCollection, a valid annotation on the `expand(...)` method of a PTransform is
+`PBegin`, `PDone`, and `None`. These are generally used for I/O operations.

Review comment:
       "These are generally used for PTransforms that begin or end with an I/O operation."
   
   Also, I think we can omit PDone since we prefer None if I'm not mistaken.

##########
File path: website/www/site/content/en/blog/python-improved-annotations.md
##########
@@ -0,0 +1,109 @@
+
+For instance, when saving data, your transform's output type should be `None`.
+```
+class SaveResults(beam.PTransform):
+    def expand(self, pcoll: PCollection[str]) -> None:
+        return pcoll | beam.io.WriteToBigQuery(...)
+```
+
+# Next Steps
+What are you waiting for? Start using annotations on your transforms!
+
+For more background on type hints in Python, see:
+[Ensuring Python Type Safety](https://beam.apache.org/documentation/sdks/python-type-safety/). 
+
+Finally, please 

Review comment:
       Awesome post!

##########
File path: website/www/site/content/en/blog/python-performance-runtime-type-checking.md
##########
@@ -0,0 +1,154 @@
+
+The answer is to use runtime type checking.
+
+# Runtime Type Checking (RTC)
+This feature works by checking that actual input and output values satisfy the declared
+type constraints during pipeline execution. If you ran the code from before with 
+`runtime_type_check` on, you would receive the following error message:
+
+```
+Type hint violation for 'ParDo(MultiplyNumberByTwo)': requires <class 'int'> but got <class 'str'> for element
+```
+
+This is an actionable error message - it tells you that either your code has a bug 
+or that your declared type hints are incorrect. Sounds simple enough, so what's the catch?
+
+_It is soooo slowwwwww._ See for yourself.
+
+
+| Number of Elements | Normal Pipeline | Runtime Type Checking Pipeline
+| ------------------ | --------------- | ------------------------------
+| 1                  | 5.3 sec         | 5.6 sec
+| 2,001              | 9.4 sec         | 57.2 sec
+| 10,001             | 24.5 sec        | 259.8 sec
+| 18,001             | 38.7 sec        | 450.5 sec
+
+In this micro-benchmark, the pipeline with runtime type checking was over 10x slower 
+at the larger input sizes, with the gap widening as our input PCollection grew.
+
+So, is there any production-friendly alternative?
+
+# Performance Runtime Type Check
+There is! We developed a new flag called `performance_runtime_type_check` that
+minimizes its impact on the pipeline's running time using a combination of
+- efficient Cython code,
+- smart sampling techniques, and
+- optimized mega type-hints.
+
+So what do the new numbers look like?
+
+| Number of Elements | Normal    | RTC        | Performance RTC
+| ------------------ | --------- | ---------- | ---------------
+| 1                  | 5.3 sec   | 5.6 sec    | 5.4 sec
+| 2,001              | 9.4 sec   | 57.2 sec   | 11.2 sec
+| 10,001             | 24.5 sec  | 259.8 sec  | 25.5 sec
+| 18,001             | 38.7 sec  | 450.5 sec  | 39.4 sec
+
+On average, the new Performance RTC is 4.4% slower than a normal pipeline whereas the old RTC
+is over 900% slower! Additionally, as the size of the input PCollection increases, the fixed cost
+of setting up the Performance RTC system is spread across each element, decreasing the relative
+impact on the overall pipeline. With 18,001 elements, the difference is less than 1 second.
+
+## How does it work?
+There are three key factors responsible for this upgrade in performance.
+
+1. Instead of type checking all values, we only type check a subset of values, known as
+a sample in statistics. Initially, we sample a substantial number of elements, but as our 
+confidence that the element type won't change over time increases, we reduce our 
+sampling rate (down to a fixed minimum).
+
+2. Whereas the old RTC system used heavy decorators to perform the type check, the new RTC system

Review comment:
       I think the term "wrappers" is more apt than "decorators" here.
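
The sampling idea in item 1 of the quoted post can be sketched in plain Python. This is only an illustration under stated assumptions, not Beam's implementation: the class `DecayingSampler`, the helper `checked_process`, and the parameters `initial_rate`, `min_rate`, and `decay` are all hypothetical names chosen for this sketch.

```python
import random


class DecayingSampler:
    """Hypothetical sampler: checks often at first, then less, never below a floor."""

    def __init__(self, initial_rate=1.0, min_rate=0.01, decay=0.9, seed=None):
        self.rate = initial_rate
        self.min_rate = min_rate
        self.decay = decay
        self._rng = random.Random(seed)

    def should_check(self):
        """Decide whether the current element gets type checked."""
        return self._rng.random() < self.rate

    def record_success(self):
        """After a passing check, lower the rate, but never below the floor."""
        self.rate = max(self.min_rate, self.rate * self.decay)


def checked_process(elements, expected_type, sampler):
    """Yield elements unchanged, type checking only the sampled ones."""
    for element in elements:
        if sampler.should_check():
            if not isinstance(element, expected_type):
                raise TypeError(
                    f"expected {expected_type.__name__}, "
                    f"got {type(element).__name__}: {element!r}")
            sampler.record_success()
        yield element
```

Each passing check raises confidence in the element type, so the rate shrinks geometrically until it reaches `min_rate`. The floor keeps some checking active for the lifetime of the pipeline.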

##########
File path: website/www/site/content/en/blog/python-performance-runtime-type-checking.md
##########
@@ -0,0 +1,154 @@
+---
+layout: post
+title:  "Performance-Driven Runtime Type Checking for the Python SDK"
+date:   2020-08-21 00:00:01 -0800
+categories:
+  - blog 
+  - python 
+  - typing
+authors:
+  - saavan
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+In this blog post, we're announcing the upcoming release of a new, opt-in 
+runtime type checking system for Beam's Python SDK that's optimized for performance 
+in both development and production environments.
+
+But let's take a step back - why do we even care about runtime type checking 
+in the first place? Let's look at an example.
+
+```
+class MultiplyNumberByTwo(beam.DoFn):
+    def process(self, element: int):
+        # process() must produce an iterable of outputs, so yield the result.
+        yield element * 2
+
+p = beam.Pipeline()
+p | beam.Create(['1', '2']) | beam.ParDo(MultiplyNumberByTwo())
+```
+
+In this code, we passed a list of strings to a DoFn that's clearly intended for use with
+integers. Luckily, this code will throw an error during pipeline construction because
+the inferred output type of `beam.Create(['1', '2'])` is `str`, which is incompatible with
+the declared input type hint of `MultiplyNumberByTwo.process`, which is `int`.
+
+However, what if we turned the pipeline type check off using the `no_pipeline_type_check` 
+flag? Or more realistically, what if the input PCollection to MultiplyNumberByTwo came 
+from a database, preventing inference of the output data type?
+
+In either case, no error would be thrown during pipeline construction. 
+And even at runtime, this code works. Each string would be multiplied by 2, 
+yielding a result of `['11', '22']`, but that's certainly not the outcome we want.
+
+So how do you debug this breed of "hidden" errors? More broadly speaking, how do you
+debug any error message in Beam that's complex or confusing (e.g. serialization errors)?
+
+The answer is to use runtime type checking.
+
+# Runtime Type Checking (RTC)
+This feature works by checking that actual input and output values satisfy the declared
+type constraints during pipeline execution. If you ran the code from before with 
+`runtime_type_check` on, you would receive the following error message:
+
+```
+Type hint violation for 'ParDo(MultiplyNumberByTwo)': requires <class 'int'> but got <class 'str'> for element
+```
+
+This is an actionable error message - it tells you that either your code has a bug 
+or that your declared type hints are incorrect. Sounds simple enough, so what's the catch?
+
+_It is soooo slowwwwww._ See for yourself.
+
+
+| Number of Elements | Normal Pipeline | Runtime Type Checking Pipeline
+| ------------------ | --------------- | ------------------------------
+| 1                  | 5.3 sec         | 5.6 sec
+| 2,001              | 9.4 sec         | 57.2 sec
+| 10,001             | 24.5 sec        | 259.8 sec
+| 18,001             | 38.7 sec        | 450.5 sec
+
+In this micro-benchmark, the pipeline with runtime type checking was over 10x slower
+at the larger input sizes, with the gap widening as our input PCollection grew.
+
+So, is there any production-friendly alternative?
+
+# Performance Runtime Type Check
+There is! We developed a new flag called `performance_runtime_type_check` that
+minimizes its impact on the pipeline's running time using a combination of
+- efficient Cython code,
+- smart sampling techniques, and
+- optimized mega type-hints.
+
+So what do the new numbers look like?
+
+| Number of Elements | Normal    | RTC        | Performance RTC
+| ------------------ | --------- | ---------- | ---------------
+| 1                  | 5.3 sec   | 5.6 sec    | 5.4 sec
+| 2,001              | 9.4 sec   | 57.2 sec   | 11.2 sec
+| 10,001             | 24.5 sec  | 259.8 sec  | 25.5 sec
+| 18,001             | 38.7 sec  | 450.5 sec  | 39.4 sec
+
+On average, the new Performance RTC is 4.4% slower than a normal pipeline whereas the old RTC
+is over 900% slower! Additionally, as the size of the input PCollection increases, the fixed cost
+of setting up the Performance RTC system is spread across each element, decreasing the relative
+impact on the overall pipeline. With 18,001 elements, the difference is less than 1 second.
+
+## How does it work?
+There are three key factors responsible for this upgrade in performance.
+
+1. Instead of type checking all values, we only type check a subset of values, known as
+a sample in statistics. Initially, we sample a substantial number of elements, but as our 
+confidence that the element type won't change over time increases, we reduce our 
+sampling rate (down to a fixed minimum).
+
+2. Whereas the old RTC system used heavy decorators to perform the type check, the new RTC system
+moves the type check to a Cython-optimized, non-decorated portion of the codebase. For reference, 
+Cython is a programming language that gives C-like performance to Python code.
+
+3. Finally, we use a single mega type hint to type-check only the output values of transforms
+instead of type-checking both the input and output values separately. This mega type hint is composed of
+the original transform's output type constraints along with all consumer transforms' input type 
+constraints. Using this mega type hint allows us to reduce overhead while simultaneously allowing
+us to throw _more actionable errors_. For instance, consider the following error (which was 
+generated from the old RTC system):
+```
+Runtime type violation detected within ParDo(DownstreamDoFn): Type-hint for argument: 'element' violated. Expected an instance of <class 'str'>, instead found 9, an instance of <class 'int'>.
+```
+
+This error tells us that the `DownstreamDoFn` received an `int` when it was expecting a `str`, but doesn't tell us
+who created that `int` in the first place. Who is the offending upstream transform that's responsible for
+this `int`? Presumably, _that_ transform's output type hints were too expansive (e.g. `any`) or otherwise non-existent because

Review comment:
       `Any`
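
To make the "mega type hint" idea from item 3 of the quoted post concrete, here is a rough sketch in plain Python. It assumes simple class-based hints checked with `isinstance`. The `Transform` dataclass, the `check_output` helper, and the name `UpstreamDoFn` are hypothetical and do not reflect Beam's internal API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transform:
    """Hypothetical stand-in for a pipeline step with simple class-based hints."""
    name: str
    input_type: type = object   # object accepts anything, like an absent hint
    output_type: type = object
    consumers: List["Transform"] = field(default_factory=list)


def check_output(producer, value):
    """Check one output value against the producer's and every consumer's hints."""
    if not isinstance(value, producer.output_type):
        raise TypeError(
            f"{producer.name} declared output {producer.output_type.__name__} "
            f"but produced {value!r}")
    for consumer in producer.consumers:
        if not isinstance(value, consumer.input_type):
            # The check runs at the producer, so the error can name both sides.
            raise TypeError(
                f"{producer.name} produced {value!r} "
                f"({type(value).__name__}), but consumer {consumer.name} "
                f"expects {consumer.input_type.__name__}")
```

Because the value is checked once, where it is produced, the error can name the upstream transform as well as the consumer whose constraint was violated.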




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] udim merged pull request #12657: [BEAM-10777] Add two blog posts detailing changes to the type hints module of the Python SDK

Posted by GitBox <gi...@apache.org>.
udim merged pull request #12657:
URL: https://github.com/apache/beam/pull/12657


   





[GitHub] [beam] saavannanavati commented on pull request #12657: [BEAM-10777] Add two blog posts detailing changes to the type hints module of the Python SDK

Posted by GitBox <gi...@apache.org>.
saavannanavati commented on pull request #12657:
URL: https://github.com/apache/beam/pull/12657#issuecomment-678692727


   > Looks good, I had a few comments.
   > There's a new whitespace check that's failing. :(
   
   Thanks, pushed some changes. PTAL when you have the chance





[GitHub] [beam] saavannanavati commented on pull request #12657: [BEAM-10777] Add two blog posts detailing changes to the type hints module of the Python SDK

Posted by GitBox <gi...@apache.org>.
saavannanavati commented on pull request #12657:
URL: https://github.com/apache/beam/pull/12657#issuecomment-678077198


   R: @udim 
   R: @robertwb 
   





[GitHub] [beam] saavannanavati commented on a change in pull request #12657: [BEAM-10777] Add two blog posts detailing changes to the type hints module of the Python SDK

Posted by GitBox <gi...@apache.org>.
saavannanavati commented on a change in pull request #12657:
URL: https://github.com/apache/beam/pull/12657#discussion_r475123198



##########
File path: website/www/site/content/en/blog/python-performance-runtime-type-checking.md
##########
@@ -0,0 +1,154 @@
+---
+layout: post
+title:  "Performance-Driven Runtime Type Checking for the Python SDK"
+date:   2020-08-21 00:00:01 -0800
+categories:
+  - blog 
+  - python 
+  - typing
+authors:
+  - saavan
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+In this blog post, we're announcing the upcoming release of a new, opt-in 
+runtime type checking system for Beam's Python SDK that's optimized for performance 
+in both development and production environments.
+
+But let's take a step back - why do we even care about runtime type checking 
+in the first place? Let's look at an example.
+
+```
+class MultiplyNumberByTwo(beam.DoFn):
+    def process(self, element: int):
+        # process() must produce an iterable of outputs, so yield the result.
+        yield element * 2
+
+p = beam.Pipeline()
+p | beam.Create(['1', '2']) | beam.ParDo(MultiplyNumberByTwo())
+```
+
+In this code, we passed a list of strings to a DoFn that's clearly intended for use with
+integers. Luckily, this code will throw an error during pipeline construction because
+the inferred output type of `beam.Create(['1', '2'])` is `str`, which is incompatible with
+the declared input type hint of `MultiplyNumberByTwo.process`, which is `int`.
+
+However, what if we turned the pipeline type check off using the `no_pipeline_type_check` 
+flag? Or more realistically, what if the input PCollection to MultiplyNumberByTwo came 
+from a database, preventing inference of the output data type?
+
+In either case, no error would be thrown during pipeline construction. 
+And even at runtime, this code works. Each string would be multiplied by 2, 
+yielding a result of `['11', '22']`, but that's certainly not the outcome we want.
+
+So how do you debug this breed of "hidden" errors? More broadly speaking, how do you
+debug any error message in Beam that's complex or confusing (e.g. serialization errors)?

Review comment:
       True true
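
The silent failure described in the quoted passage can be reproduced without Beam, since Python does not enforce annotations at runtime. The function name `multiply_by_two` below is a hypothetical stand-in for the DoFn's `process` method.

```python
def multiply_by_two(element: int):
    # The annotation is not enforced, so a str is accepted and repeated
    # rather than multiplied arithmetically.
    return element * 2


results = [multiply_by_two(x) for x in ['1', '2']]
print(results)  # prints ['11', '22']
```

No exception is raised, which is why this class of bug only surfaces when something checks the values against the declared hints.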







[GitHub] [beam] saavannanavati commented on a change in pull request #12657: [BEAM-10777] Add two blog posts detailing changes to the type hints module of the Python SDK

Posted by GitBox <gi...@apache.org>.
saavannanavati commented on a change in pull request #12657:
URL: https://github.com/apache/beam/pull/12657#discussion_r475122546



##########
File path: website/www/site/content/en/blog/python-improved-annotations.md
##########
@@ -0,0 +1,109 @@
+---
+layout: post
+title:  "Improved Annotation Support for the Python SDK"
+date:   2020-08-21 00:00:01 -0800
+categories:
+  - blog 
+  - python 
+  - typing
+authors:
+  - saavan
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+The importance of static type checking in a dynamically 
+typed language like Python is not up for debate. Type hints 
+allow developers to leverage a strong typing system to:
+ - write better code, 
+ - self-document ambiguous programming logic, and 
+ - inform intelligent code completion in IDEs like PyCharm.
+
+This is why we're excited to announce upcoming improvements to 
+the `typehints` module of Beam's Python SDK, including support 
+for typed PCollections and Python 3 style annotations on PTransforms.
+
+# Improved Annotations
+Today, you have the option to declare type hints on PTransforms using either
+class decorators or inline functions.
+
+For instance, a PTransform with decorated type hints might look like this:
+```
+@beam.typehints.with_input_types(int)
+@beam.typehints.with_output_types(str)
+class IntToStr(beam.PTransform):
+    def expand(self, pcoll):
+        return pcoll | beam.Map(lambda num: str(num))
+
+strings = numbers | IntToStr()
+```
+
+Using inline functions instead, the same transform would look like this:
+```
+class IntToStr(beam.PTransform):
+    def expand(self, pcoll):
+        return pcoll | beam.Map(lambda num: str(num))
+
+strings = numbers | IntToStr().with_input_types(int).with_output_types(str)
+```
+
+Both methods have problems. Class decorators are syntax-heavy, 
+requiring two additional lines of code, whereas inline functions provide type hints 
+that aren't reusable across other instances of the same transform. Additionally, both 
+methods are incompatible with static type checkers like MyPy.
+
+With Python 3 annotations, however, we can sidestep these problems to provide a
+clean and reusable type hint experience. Our previous transform now looks like this:
+```
+class IntToStr(beam.PTransform):
+    def expand(self, pcoll: PCollection[int]) -> PCollection[str]:
+        return pcoll | beam.Map(lambda num: str(num))
+
+strings = numbers | IntToStr()
+```
+
+These type hints will actively hook into the internal Beam typing system to
+play a role in both pipeline type checking and runtime type checking.
+
+So how does this work?
+
+## Typed PCollections
+You guessed it! The PCollection class inherits from `typing.Generic`, allowing it to be 
+parameterized with either zero types (denoted `PCollection`) or one type (denoted `PCollection[T]`). 
+- A PCollection with zero types is implicitly converted to `PCollection[Any]`.
+- A PCollection with one type can have any nested type (e.g. `Union[int, str]`).
+
+Internally, Beam's typing system makes these annotations compatible with other 
+type hints by removing the outer PCollection container.
+
+## PBegin, PDone, None
+Finally, besides PCollection, the other valid annotations on the `expand(...)` method of a PTransform are
+`PBegin`, `PDone`, and `None`. These are generally used for I/O operations.
+
+For instance, when saving data, your transform's output type should be `None`.
+```
+class SaveResults(beam.PTransform):
+    def expand(self, pcoll: PCollection[str]) -> None:
+        return pcoll | beam.io.WriteToBigQuery(...)
+```
+
+# Next Steps
+What are you waiting for? Start using annotations on your transforms!
+
+For more background on type hints in Python, see:
+[Ensuring Python Type Safety](https://beam.apache.org/documentation/sdks/python-type-safety/). 
+
+Finally, please 

Review comment:
       Thanks! :)
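
As an illustration of the "Typed PCollections" section in the quoted post, the following sketch shows how a class inheriting from `typing.Generic` can be parameterized and how the outer container can be removed to recover the element type hint. The `PCollection` class here is a hypothetical stand-in, not `apache_beam.pvalue.PCollection`, and `strip_pcollection` is an assumed helper name. The sketch needs Python 3.8+ for `get_args` and `get_origin`.

```python
from typing import Any, Generic, TypeVar, Union, get_args, get_origin

T = TypeVar("T")


class PCollection(Generic[T]):
    """Hypothetical stand-in for a typed PCollection."""


def strip_pcollection(annotation):
    """Return the element type hint inside a PCollection annotation."""
    if annotation is PCollection:
        # A bare PCollection carries no element type, so treat it as Any.
        return Any
    if get_origin(annotation) is PCollection:
        # PCollection[T] -> T, including nested hints such as Union[int, str].
        return get_args(annotation)[0]
    # Anything else (for example None) is passed through unchanged.
    return annotation


print(strip_pcollection(PCollection[int]))
print(strip_pcollection(PCollection[Union[int, str]]))
```

The bare form maps to `Any` and the parameterized form yields its single type argument, mirroring the two cases listed in the post.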







[GitHub] [beam] udim commented on pull request #12657: [BEAM-10777] Add two blog posts detailing changes to the type hints module of the Python SDK

Posted by GitBox <gi...@apache.org>.
udim commented on pull request #12657:
URL: https://github.com/apache/beam/pull/12657#issuecomment-678573594


   Run Website_Stage_GCS PreCommit

