Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/08/02 07:27:29 UTC

[GitHub] [flink-web] morsapaes commented on a change in pull request #364: Add blog post for Pandas support in PyFlink.

morsapaes commented on a change in pull request #364:
URL: https://github.com/apache/flink-web/pull/364#discussion_r464040992



##########
File path: _posts/2020-07-28-pyflink-pandas-support-flink.md
##########
@@ -0,0 +1,253 @@
+---
+layout: post
+title: "PyFlink: The integration of Pandas into PyFlink"
+date: 2020-07-28T12:00:00.000Z
+authors:
+- Jincheng:
+  name: "Jincheng Sun"
+  twitter: "sunjincheng121"
+- markos:
+  name: "Markos Sfikas"
+  twitter: "MarkSfik"
+excerpt: Flink community put some great effort in integrating Pandas into PyFlink with the latest Flink version 1.11. Some of the added features include support for Pandas UDF and the conversion between Pandas DataFrame and Table. In this article, we will introduce how these functionalities work and how to use them with a step-by-step example. 
+---
+
+Python has evolved into one of the most important programming languages for many fields of data processing. So big has Python’s popularity been that it has pretty much become the default data processing language for data scientists. On top of that, there is a plethora of Python-based data processing tools such as NumPy, Pandas, and Scikit-learn that have gained additional popularity due to their flexibility and powerful functionality.
+
+<center>
+<img src="{{ site.baseurl }}/img/blog/2020-07-28-pyflink-pandas/python-scientific-stack.png" width="600px" alt="Python Scientific Stack"/>

Review comment:
       ```suggestion
   <img src="{{ site.baseurl }}/img/blog/2020-07-28-pyflink-pandas/python-scientific-stack.png" width="450px" alt="Python Scientific Stack"/>
   ```

##########
File path: _posts/2020-07-28-pyflink-pandas-support-flink.md
##########
@@ -0,0 +1,253 @@
+<center>
+<img src="{{ site.baseurl }}/img/blog/2020-07-28-pyflink-pandas/python-scientific-stack.png" width="600px" alt="Python Scientific Stack"/>
+</center>
+<center>
+  <a href="https://speakerdeck.com/jakevdp/the-unexpected-effectiveness-of-python-in-science?slide=52">Pic source: VanderPlas 2017, slide 52.</a>
+</center>
+<br>
+
+In an effort to meet the user needs and demands, the Flink community hopes to leverage and make better use of these tools.  Along this direction, the Flink community put some great effort in integrating Pandas into PyFlink with the latest Flink version 1.11. Some of the added features include support for Pandas UDF and the conversion between Pandas DataFrame and Table. Pandas UDF not only greatly improve the execution performance of Python UDF, but also make it more convenient for users to leverage libraries such as Pandas and NumPy in Python UDF. Additionally, providing support for the conversion between Pandas DataFrame and Table enables users to switch processing engines seamlessly without the need for an intermediate connector. In the remainder of this article, we will introduce how these functionalities work and how to use them with a step-by-step example.

Review comment:
       ```suggestion
   In an effort to meet the user needs and demands, the Flink community hopes to leverage and make better use of these tools.  Along this direction, the Flink community put some great effort in integrating Pandas into PyFlink with the latest Flink version 1.11. Some of the added features include **support for Pandas UDF** and the **conversion between Pandas DataFrame and Table**. Pandas UDF not only greatly improve the execution performance of Python UDF, but also make it more convenient for users to leverage libraries such as Pandas and NumPy in Python UDF. Additionally, providing support for the conversion between Pandas DataFrame and Table enables users to switch processing engines seamlessly without the need for an intermediate connector. In the remainder of this article, we will introduce how these functionalities work and how to use them with a step-by-step example.
   ```

##########
File path: _posts/2020-07-28-pyflink-pandas-support-flink.md
##########
@@ -0,0 +1,253 @@
+
+<div class="alert alert-info" markdown="1">
+<span class="label label-info" style="display: inline-block"><span class="glyphicon glyphicon-info-sign" aria-hidden="true"></span> Note</span>
+Currently, only Scalar Pandas UDFs are supported in PyFlink.
+</div>
+
+<br>
+
+# Pandas UDF in Flink 1.11
+
+Using scalar Python UDF was already possible in Flink 1.10 as described in a [previous article on the Flink blog](https://flink.apache.org/2020/04/09/pyflink-udf-support-flink.html). Scalar Python UDFs work based on three primary steps: 
+
+ - the Java operator serializes one input row to bytes and sends them to the Python worker;
+
+ - the Python worker deserializes the input row and evaluates the Python UDF with it; 
+
+ - the resulting row is serialized and sent back to the Java operator
+
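As a rough sketch of the user-facing side of these three steps (assuming the `udf` decorator with `input_types`/`result_type` described in the blog post linked above; `t_env` and `my_table` are placeholders for an existing TableEnvironment and Table), a scalar Python UDF receives plain Python values, one row at a time:

```python
from pyflink.table import DataTypes
from pyflink.table.udf import udf

# Scalar Python UDF: evaluated row by row, so every input row is serialized,
# shipped to the Python worker, and the result row is serialized back.
@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()],
     result_type=DataTypes.BIGINT())
def add(i, j):
    return i + j

# Placeholder usage: register the function and call it in a query.
t_env.register_function("add", add)
result = my_table.select("add(a, b)")
```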
+
+While providing support for Python UDFs in PyFlink greatly improved the user experience, it had some drawbacks, namely resulting in:
+
+  - High serialization/deserialization overhead
+
+  - Difficulty when leveraging popular Python libraries used by data scientists, such as Pandas or NumPy, that provide high-performance data structures and functions.
+
+
+Pandas UDF was introduced to address these drawbacks. With a Pandas UDF, a batch of rows is transferred between the JVM and the PVM in a columnar format (the Arrow memory format). The batch of rows is converted into a collection of Pandas Series and passed to the Pandas UDF, which can then leverage popular Python libraries (such as Pandas, NumPy, etc.) in its implementation.
+
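As a minimal sketch of what this means in user code (assuming Flink 1.11's `udf_type="pandas"` option on the same `udf` decorator), the only visible change is that the inputs and the return value are now `pandas.Series` objects that each cover a whole batch of rows:

```python
from pyflink.table import DataTypes
from pyflink.table.udf import udf

# Vectorized (Pandas) UDF: i and j are pandas.Series backed by one Arrow batch,
# and the element-wise sum is returned as a pandas.Series of the same length.
@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()],
     result_type=DataTypes.BIGINT(),
     udf_type="pandas")
def pandas_add(i, j):
    return i + j
```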
+<center>
+<img src="{{ site.baseurl }}/img/blog/2020-07-28-pyflink-pandas/vm-communication.png" width="600px" alt="VM Communication"/>

Review comment:
       ```suggestion
   <img src="{{ site.baseurl }}/img/blog/2020-07-28-pyflink-pandas/vm-communication.png" width="550px" alt="VM Communication"/>
   ```

##########
File path: _posts/2020-07-28-pyflink-pandas-support-flink.md
##########
@@ -0,0 +1,253 @@
+<center>
+<img src="{{ site.baseurl }}/img/blog/2020-07-28-pyflink-pandas/vm-communication.png" width="600px" alt="VM Communication"/>
+</center>
+<br>
+
+The performance of vectorized UDFs is usually much higher when compared to the normal Python UDF, as the serialization/deserialization overhead is minimized by falling back to [Apache Arrow](https://arrow.apache.org/), while handling Pandas.Series as input/output allows us to take full advantage of the Pandas and NumPy libraries, making it a popular solution to parallelize Machine Learning and other large-scale, distributed data science workloads (e.g. feature engineering, distributed model application).
+
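To make that point concrete, a vectorized UDF can hand each incoming batch to NumPy in a single call instead of looping over rows in Python. The following is a sketch under the same assumed `udf_type="pandas"` decorator; the function name and column type are made up for illustration:

```python
import numpy as np

from pyflink.table import DataTypes
from pyflink.table.udf import udf

# The whole pandas.Series batch is transformed by one vectorized NumPy call.
@udf(input_types=[DataTypes.DOUBLE()],
     result_type=DataTypes.DOUBLE(),
     udf_type="pandas")
def log1p_feature(x):
    return np.log1p(x)
```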
+
+# Conversion between PyFlink Table and Pandas DataFrame
+
+Pandas DataFrame is the de-facto standard for working with tabular data in the Python community while PyFlink Table is Flink’s representation of the tabular data in Python language. Enabling the conversion between PyFlink Table and Pandas DataFrame allows switching between PyFlink and Pandas seamlessly when processing data in Python. Users can process data using one execution engine and switch to a different one effortlessly. For example, in case users already have a Pandas DataFrame at hand and want to perform some expensive transformation, they can easily convert it to a PyFlink Table and leverage the power of the Flink engine. On the other hand, users can also convert a PyFlink Table to a Pandas DataFrame and perform the same transformation with the rich functionalities provided by the Pandas ecosystem.
+
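A short sketch of that round trip (assuming the blink batch planner setup from the Flink 1.11 documentation; the column names and random data are made up for illustration):

```python
import numpy as np
import pandas as pd

from pyflink.table import BatchTableEnvironment, EnvironmentSettings

env_settings = EnvironmentSettings.new_instance().in_batch_mode().use_blink_planner().build()
t_env = BatchTableEnvironment.create(environment_settings=env_settings)

# Pandas DataFrame -> PyFlink Table: the DataFrame's schema is reused for the Table.
pdf = pd.DataFrame(np.random.rand(100, 2), columns=["a", "b"])
table = t_env.from_pandas(pdf)

# PyFlink Table -> Pandas DataFrame: collect the (bounded) result back into Pandas.
result_pdf = table.select("a + b").to_pandas()
```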
+
+## Examples
+
+Using Python in Apache Flink requires installing PyFlink. PyFlink is available through PyPI and can be easily installed using pip: 

Review comment:
       ```suggestion
   Using Python in Apache Flink requires installing PyFlink, which is available on [PyPI](https://pypi.org/project/apache-flink/) and can be easily installed using `pip`. Before installing PyFlink, check the working version of Python running in your system using:
   ```

##########
File path: _posts/2020-07-28-pyflink-pandas-support-flink.md
##########
@@ -0,0 +1,253 @@
+Using Python in Apache Flink requires installing PyFlink. PyFlink is available through PyPI and can be easily installed using pip: 
+
+```bash
+$ python --version
+Python 3.7.6
+```
+
+<div class="alert alert-info" markdown="1">
+<span class="label label-info" style="display: inline-block"><span class="glyphicon glyphicon-info-sign" aria-hidden="true"></span> Note</span>
+Please note that Python 3.5 or higher is required to install and run PyFlink
+</div>
+
+<br>

Review comment:
       ```suggestion
   Please note that Python 3.5 or higher is required to install and run PyFlink.
   </div>
   ```

##########
File path: _posts/2020-07-28-pyflink-pandas-support-flink.md
##########
@@ -0,0 +1,253 @@
+Pandas DataFrame is the de-facto standard for working with tabular data in the Python community while PyFlink Table is Flink’s representation of the tabular data in Python language. Enabling the conversion between PyFlink Table and Pandas DataFrame allows switching between PyFlink and Pandas seamlessly when processing data in Python. Users can process data using one execution engine and switch to a different one effortlessly. For example, in case users already have a Pandas DataFrame at hand and want to perform some expensive transformation, they can easily convert it to a PyFlink Table and leverage the power of the Flink engine. On the other hand, users can also convert a PyFlink Table to a Pandas DataFrame and perform the same transformation with the rich functionalities provided by the Pandas ecosystem.
+
+
+## Examples

Review comment:
       ```suggestion
   # Examples
   
   ### Installation
   ```

##########
File path: _posts/2020-07-28-pyflink-pandas-support-flink.md
##########
@@ -0,0 +1,253 @@
+
+If you don't have a version above 3.5, you can use virtualenv with follows commands:
+
+```bash
+$ pip install virtualenv
+$ virtualenv --python /usr/local/bin/python3 py37
+$ source py37/bin/activate
+
+```
+<br>

Review comment:
       ```suggestion
   
   And then install PyFlink:
   ```

##########
File path: _posts/2020-07-28-pyflink-pandas-support-flink.md
##########
@@ -0,0 +1,253 @@
+Currently, only Scalar Pandas UDFs are supported in PyFlink.
+</div>
+
+<br>
+

Review comment:
       ```suggestion
   
   ```

##########
File path: _posts/2020-07-28-pyflink-pandas-support-flink.md
##########
@@ -0,0 +1,253 @@
+
+<br>
+
+If you don't have a version above 3.5, you can use virtualenv with follows commands:

Review comment:
       ```suggestion
   If you don't have a version above 3.5, you can create a virtual environment (i.e. [virtualenv](https://docs.python-guide.org/dev/virtualenvs/)) with the following commands:
   ```

##########
File path: _posts/2020-07-28-pyflink-pandas-support-flink.md
##########
@@ -0,0 +1,253 @@
+
+```bash
+
+$ python -m pip install apache-flink
+
+```
+<br>

Review comment:
       ```suggestion
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org