Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/08/19 14:22:07 UTC

[GitHub] [flink] sjwiesman commented on a change in pull request #13199: [FLINK-18953][python][docs] Add documentation for DataTypes in Python DataStream API

sjwiesman commented on a change in pull request #13199:
URL: https://github.com/apache/flink/pull/13199#discussion_r473066461



##########
File path: docs/dev/python/user-guide/datastream/data_types.zh.md
##########
@@ -0,0 +1,116 @@
+---
+title: "Data Types"
+nav-parent_id: python_datastream_api
+nav-pos: 10
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+
+## Why need Data Types
+
+In Python DataStream, a data type describes the type of a value in the DataStream ecosystem. 
+It can be used to declare input and/or output types of operations. 
+Similar to Python, you don't require to specify types for the parameters of a Function in Python DataStream. 
+If the type has not been declared, data would be serialized or deserialized using Pickle. 
+For example, the program below specifies no data types.
+
+{% highlight python %}
+from pyflink.datastream import StreamExecutionEnvironment
+
+
+def processing():
+    env = StreamExecutionEnvironment.get_execution_environment()
+    env.set_parallelism(1)
+    env.from_collection(collection=[(1, 'aaa'), (2, 'bbb')]) \
+        .map(lambda record: (record[0]+1, record[1].upper())) \
+        .print()  # note: print to stdout on the worker machine
+
+    env.execute()
+
+
+if __name__ == '__main__':
+    processing()
+{% endhighlight %}
+
+However, types need to be specified when:
+
+- Passing Python records to Java operations.
+- Improve serialization and deserialization performance.
+
+### Passing Python records to Java operations
+
+Since Java operators or functions can not identify Python data, types need to be provided to help to convert Python data to Java data for processing.
+For example, types need to be provided if you want to output data from the map into the StreamingFileSink. 
+The StreamingFileSink is actually implemented by Java for the runtime part. 

Review comment:
       ```suggestion
   Since Java operators and functions cannot interpret Python data, types need to be provided so that Python types can be converted to Java types for processing.
   For example, types need to be provided if you want to output data using the StreamingFileSink, which is implemented in Java.
   ```

##########
File path: docs/dev/python/user-guide/datastream/data_types.zh.md
##########
@@ -0,0 +1,116 @@
+---
+title: "Data Types"
+nav-parent_id: python_datastream_api
+nav-pos: 10
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+
+## Why need Data Types
+
+In Python DataStream, a data type describes the type of a value in the DataStream ecosystem. 
+It can be used to declare input and/or output types of operations. 
+Similar to Python, you don't require to specify types for the parameters of a Function in Python DataStream. 
+If the type has not been declared, data would be serialized or deserialized using Pickle. 
+For example, the program below specifies no data types.

Review comment:
       ```suggestion
   In Apache Flink's Python DataStream API, a data type describes the type of a value in the DataStream ecosystem. 
   It can be used to declare input and output types of operations and informs the system how to serialize elements. 
   
   * This will be replaced by the TOC
   {:toc}
   
   
   ## Pickle Serialization
   
   If no type has been declared, data is serialized and deserialized using Pickle. 
   For example, the program below specifies no data types.
   ```
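
As a rough illustration of the Pickle fallback described in this suggestion, here is a minimal sketch using only the standard-library `pickle` module. It assumes nothing about PyFlink's internals: `transform` and `round_tripped` are hypothetical names, and the records are the ones from the docs example. It only shows that untyped Python tuples survive a serialize/deserialize round trip without any declared types.

```python
import pickle

# Records from the docs example; no type information is declared anywhere.
records = [(1, 'aaa'), (2, 'bbb')]


def transform(record):
    # Same logic as the lambda passed to map() in the docs example.
    return (record[0] + 1, record[1].upper())


# Serialize each record to bytes and back, roughly what an untyped pipeline
# has to do when handing records from one process to another.
round_tripped = [pickle.loads(pickle.dumps(r)) for r in records]
results = [transform(r) for r in round_tripped]
print(results)
```

Pickle needs no schema because it writes the Python type of every value into the byte stream, which is convenient but means the bytes are only meaningful to a Python reader.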

##########
File path: docs/dev/python/user-guide/datastream/data_types.zh.md
##########
@@ -0,0 +1,116 @@
+---
+title: "Data Types"
+nav-parent_id: python_datastream_api
+nav-pos: 10
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+
+## Why need Data Types
+
+In Python DataStream, a data type describes the type of a value in the DataStream ecosystem. 
+It can be used to declare input and/or output types of operations. 
+Similar to Python, you don't require to specify types for the parameters of a Function in Python DataStream. 
+If the type has not been declared, data would be serialized or deserialized using Pickle. 
+For example, the program below specifies no data types.
+
+{% highlight python %}
+from pyflink.datastream import StreamExecutionEnvironment
+
+
+def processing():
+    env = StreamExecutionEnvironment.get_execution_environment()
+    env.set_parallelism(1)
+    env.from_collection(collection=[(1, 'aaa'), (2, 'bbb')]) \
+        .map(lambda record: (record[0]+1, record[1].upper())) \
+        .print()  # note: print to stdout on the worker machine
+
+    env.execute()
+
+
+if __name__ == '__main__':
+    processing()
+{% endhighlight %}
+
+However, types need to be specified when:
+
+- Passing Python records to Java operations.
+- Improve serialization and deserialization performance.
+
+### Passing Python records to Java operations
+
+Since Java operators or functions can not identify Python data, types need to be provided to help to convert Python data to Java data for processing.
+For example, types need to be provided if you want to output data from the map into the StreamingFileSink. 
+The StreamingFileSink is actually implemented by Java for the runtime part. 
+
+{% highlight python %}
+from pyflink.common.serialization import SimpleStringEncoder
+from pyflink.common.typeinfo import Types
+from pyflink.datastream import StreamExecutionEnvironment
+from pyflink.datastream.connectors import StreamingFileSink
+
+
+def streaming_file_sink():
+    env = StreamExecutionEnvironment.get_execution_environment()
+    env.set_parallelism(1)
+    env.from_collection(collection=[(1, 'aaa'), (2, 'bbb')]) \
+        .map(lambda record: (record[0]+1, record[1].upper()),
+             result_type=Types.ROW([Types.INT(), Types.STRING()])) \
+        .add_sink(StreamingFileSink
+                  .for_row_format('/tmp/output', SimpleStringEncoder())
+                  .build())
+
+    env.execute()
+
+
+if __name__ == '__main__':
+    streaming_file_sink()
+
+{% endhighlight %}
+
+### Improve serialization and deserialization performance
+
+Even though data can be serialized and deserialized through Pickle, the performance should be better if types are provided. 
+This is because PyFlink can use more efficient serializers and deserializers to serialize and deserialize data.

Review comment:
       ```suggestion
   Even though data can be serialized and deserialized through Pickle, performance will be better if types are provided.
   Explicit types allow PyFlink to use efficient serializers when moving records through the pipeline.
   ```
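
To make the performance argument concrete, here is a small sketch comparing a generic Pickle encoding with a fixed-layout encoding for a record known to be `(int, str)`. This is an analogy only, not PyFlink's actual serializer: `encode_typed` and `decode_typed` are hypothetical helpers built on the standard-library `struct` module, and the byte layout (4-byte int, 2-byte length, UTF-8 bytes) is an assumption chosen for illustration.

```python
import pickle
import struct


def encode_typed(record):
    """Encode an (int, str) record as: 4-byte int, 2-byte length, UTF-8 bytes."""
    number, text = record
    payload = text.encode("utf-8")
    return struct.pack(">iH", number, len(payload)) + payload


def decode_typed(data):
    """Inverse of encode_typed; relies on knowing the layout in advance."""
    number, length = struct.unpack_from(">iH", data, 0)
    text = data[6:6 + length].decode("utf-8")
    return (number, text)


record = (2, "AAA")
typed_bytes = encode_typed(record)    # layout is known, so no type tags are written
pickled_bytes = pickle.dumps(record)  # generic: type information travels with the data

print(len(typed_bytes), len(pickled_bytes))
print(decode_typed(typed_bytes) == record)
```

When the types are known up front, the encoder can skip per-value type tags, which is the kind of saving the suggested sentence attributes to explicit types.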

##########
File path: docs/dev/python/user-guide/datastream/index.md
##########
@@ -0,0 +1,32 @@
+---
+title: "DataStream API"
+nav-id: python_datastream_api
+nav-parent_id: python_user_guide
+nav-pos: 30
+nav-show_overview: true

Review comment:
       I don't think we need this page. Let's revisit this after the rest of the content is merged in. 

##########
File path: docs/dev/python/user-guide/datastream/index.md
##########
@@ -0,0 +1,32 @@
+---
+title: "DataStream API"
+nav-id: python_datastream_api
+nav-parent_id: python_user_guide
+nav-pos: 30
+nav-show_overview: true

Review comment:
       ```suggestion
   ```

##########
File path: docs/dev/python/user-guide/datastream/index.md
##########
@@ -0,0 +1,32 @@
+---
+title: "DataStream API"
+nav-id: python_datastream_api
+nav-parent_id: python_user_guide
+nav-pos: 30
+nav-show_overview: true
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+Python DataStream API allows users to develop [DataStream API]({{ site.baseurl }}/dev/datastream_api.html) programs using the Python language.
+Apache Flink has provided Python DataStream API support since 1.12.0.
+
+## Where to go next?
+
+- [Data Types]({{ site.baseurl }}/dev/python/user-guide/datastream/data_types.html): Lists the supported data types in Python DataStream API.

Review comment:
       ```suggestion
   ```



