You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/10/26 11:14:06 UTC

[GitHub] [flink-ml] lindong28 commented on a diff in pull request #164: [FLINK-29115] Improve quickstart documents

lindong28 commented on code in PR #164:
URL: https://github.com/apache/flink-ml/pull/164#discussion_r1005520306


##########
docs/content/docs/development/build-and-install.md:
##########
@@ -61,9 +61,13 @@ from the root directory of Flink ML:
 ./flink-ml-dist/target/flink-ml-*-bin/flink-ml*/
 ```
 
-## Build Flink ML Python SDK
+The `mvn clean install` command would have installed the binary into your local
+Maven repository so other projects can refer to it and grab it from the
+repository. There is no additional step required for installation.
 
-### Prerequisites
+## Python SDK

Review Comment:
   How about using this title:
   
   Build and Install Python SDK



##########
docs/content/docs/try-flink-ml/java/quick-start.md:
##########
@@ -28,18 +28,25 @@ under the License.
 
 This document provides a quick introduction to using Flink ML. Readers of this
 document will be guided to submit a simple Flink job that trains a Machine
-Learning Model and use it to provide prediction service.
+Learning Model and uses it to provide prediction service.
 
-## Prerequisites
+## Help, I’m Stuck!
 
-### Install Flink
+If you get stuck, check out the [community support
+resources](https://flink.apache.org/gettinghelp.html). In particular, Apache
+Flink's [user mailing
+list](https://flink.apache.org/community.html#mailing-lists) is consistently
+ranked as one of the most active of any Apache project and a great way to get
+help quickly.
+
+## Install Flink

Review Comment:
   The link given below does not mention anything related to `Install Flink`. 
   
   Do you actually mean `Download Flink` here? If so, it seems simpler to just copy/paste the related text instead of redirecting readers to another page.
   
   



##########
docs/content/docs/try-flink-ml/java/quick-start.md:
##########
@@ -49,31 +56,43 @@ cd ${path_to_flink}
 export FLINK_HOME=`pwd`
 ```
 
+## Add Flink ML binaries to Flink

Review Comment:
   Add Flink ML library to Flink's library folder



##########
docs/content/docs/try-flink-ml/python/quick-start.md:
##########
@@ -29,54 +29,204 @@ under the License.
 # Quick Start
 
 This document provides a quick introduction to using Flink ML. Readers of this
-document will be guided to submit a simple Flink job that trains a Machine
-Learning Model and use it to provide prediction service.
+document will be guided to create a simple Flink job that trains a Machine
+Learning Model and uses it to provide prediction service.
+
+## What Will You Be Building?
+
+Kmeans is a widely-used clustering algorithm and has been supported by Flink ML.
+This walkthrough guides you to create a Flink job with Flink ML that initializes
+and trains a Kmeans model, and finally uses it to predict the cluster id of
+certain data points.
 
 ## Prerequisites
 
-Python version (3.6, 3.7, or 3.8) is required for Flink ML. Please run the
-following command to make sure that it meets the requirements:
+This walkthrough assumes that you have some familiarity with Python, but you
+should be able to follow along even if you come from a different programming
+language.
 
-```shell
-$ python --version
-# the version printed here must be 3.6, 3.7 or 3.8
-```
+## Help, I’m Stuck!
 
-## Installation of Flink ML Python SDK
+If you get stuck, check out the [community support
+resources](https://flink.apache.org/gettinghelp.html). In particular, Apache
+Flink's [user mailing
+list](https://flink.apache.org/community.html#mailing-lists) is consistently
+ranked as one of the most active of any Apache project and a great way to get
+help quickly.
 
-Flink ML Python SDK is available in
-[PyPi](https://pypi.org/project/apache-flink-ml/) and can be installed as
-follows:
+## How To Follow Along
+
+If you want to follow along, you will require a computer with:
+
+{{< stable >}}
+- Java 8
+- Python 3.6, 3.7 or 3.8 {{< /stable >}} {{< unstable >}}
+- Java 8
+- Maven 3
+- Python 3.6, 3.7 or 3.8 {{< /unstable >}}
 
 {{< stable >}}
 
+This walkthrough requires installing Flink ML Python SDK, which is available on
+[PyPi](https://pypi.org/project/apache-flink-ml/) and can be easily installed
+using pip.
+
 ```bash
 $ python -m pip install apache-flink-ml=={{< version >}}
 ```
 
 {{< /stable >}} {{< unstable >}}
 
-```bash
-$ python -m pip install apache-flink-ml
-```
+Please walk through this [guideline]({{< ref
+"docs/development/build-and-install#build-flink-ml-python-sdk" >}}) to build and install
+Flink ML's Python SDK in your local environment.
 
 {{< /unstable >}}
 
-You can also build Flink ML Python SDK from sources by following the
-[development guide]({{< ref "docs/development/building" >}}).
+## Writing a Flink ML Python Program
+
+Flink ML programs begin by setting up the `StreamExecutionEnvironment` to
+execute the Flink ML job. You would have been familiar with this concept if you
+have experience using Flink. For the example program in this document, a simple
+`StreamExecutionEnvironment` without specific configurations would be enough. 
+
+Given that Flink ML uses Flink's Table API, a `StreamTableEnvironment` would
+also be necessary for the following program.
+
+```python
+# create a new StreamExecutionEnvironment
+env = StreamExecutionEnvironment.get_execution_environment()
+
+# create a StreamTableEnvironment
+t_env = StreamTableEnvironment.create(env)
+```
+
+Then you can create the Table containing data for the training and prediction
+process of the following Kmeans algorithm. Flink ML operators search the names
+of the columns of the input table for input data, and produce prediction results
+to designated column of the output Table.
+
+```python
+# generate input data
+input_data = t_env.from_data_stream(
+    env.from_collection([
+        (Vectors.dense([0.0, 0.0]),),
+        (Vectors.dense([0.0, 0.3]),),
+        (Vectors.dense([0.3, 3.0]),),
+        (Vectors.dense([9.0, 0.0]),),
+        (Vectors.dense([9.0, 0.6]),),
+        (Vectors.dense([9.6, 0.0]),),
+    ],
+        type_info=Types.ROW_NAMED(
+            ['features'],
+            [DenseVectorTypeInfo()])))
+```
+
+Flink ML classes for Kmeans algorithm include `KMeans` and `KMeansModel`.
+`KMeans` implements the training process of Kmeans algorithm based on the
+provided training data, and finally generates a `KMeansModel`.
+`KmeansModel.transform()` method encodes the Transformation logic of this
+algorithm and is used for predictions. 
+
+Both `KMeans` and `KMeansModel` provides getter/setter methods for Kmeans
+algorithm's configuration parameters. This example program explicitly sets the
+following parameters, and other configuration parameters will have their default
+values used.
+
+- `k`, the number of clusters to create
+- `seed`, the random seed to initialize cluster centers
+
+When the program invokes `KMeans.fit()` to generate a `KMeansModel`, the
+`KMeansModel` will inherit the `KMeans` object's configuration parameters. Thus
+it is supported to set `KMeansModel`'s parameters directly in `KMeans` object.
+
+```python
+# create a kmeans object and initialize its parameters
+kmeans = KMeans().set_k(2).set_seed(1)
+
+# train the kmeans model
+model = kmeans.fit(input_data)
+
+# use the kmeans model for predictions
+output = model.transform(input_data)[0]
 
-## Run Flink ML example job
+```
+
+Like all other Flink programs, the codes described in the sections above only
+configures the computation graph of a Flink job, and the program only evaluates
+the computation logic and collects outputs after the `execute()` method is
+invoked. Collected outputs from the output table would be `Row`s in which
+`featuresCol` contains input feature vectors, and `predictionCol` contains
+output prediction results, i.e., cluster IDs.
+
+```python
+# extract and display the results
+field_names = output.get_schema().get_field_names()
+for result in t_env.to_data_stream(output).execute_and_collect():
+    features = result[field_names.index(kmeans.get_features_col())]
+    cluster_id = result[field_names.index(kmeans.get_prediction_col())]
+    print('Features: ' + str(features) + ' \tCluster Id: ' + str(cluster_id))
+```
+
+The complete code so far:
+
+```python
+from pyflink.common import Types
+from pyflink.datastream import StreamExecutionEnvironment
+from pyflink.ml.core.linalg import Vectors, DenseVectorTypeInfo
+from pyflink.ml.lib.clustering.kmeans import KMeans
+from pyflink.table import StreamTableEnvironment
+
+# create a new StreamExecutionEnvironment
+env = StreamExecutionEnvironment.get_execution_environment()
+
+# create a StreamTableEnvironment
+t_env = StreamTableEnvironment.create(env)
+
+# generate input data
+input_data = t_env.from_data_stream(
+    env.from_collection([
+        (Vectors.dense([0.0, 0.0]),),
+        (Vectors.dense([0.0, 0.3]),),
+        (Vectors.dense([0.3, 3.0]),),
+        (Vectors.dense([9.0, 0.0]),),
+        (Vectors.dense([9.0, 0.6]),),
+        (Vectors.dense([9.6, 0.0]),),
+    ],
+        type_info=Types.ROW_NAMED(
+            ['features'],
+            [DenseVectorTypeInfo()])))
+
+# create a kmeans object and initialize its parameters
+kmeans = KMeans().set_k(2).set_seed(1)
+
+# train the kmeans model
+model = kmeans.fit(input_data)
+
+# use the kmeans model for predictions
+output = model.transform(input_data)[0]
+
+# extract and display the results
+field_names = output.get_schema().get_field_names()
+for result in t_env.to_data_stream(output).execute_and_collect():
+    features = result[field_names.index(kmeans.get_features_col())]
+    cluster_id = result[field_names.index(kmeans.get_prediction_col())]
+    print('Features: ' + str(features) + ' \tCluster Id: ' + str(cluster_id))
+```
+
+## Executing a Flink ML Python Program
 
-After setting up Flink ML Python SDK, you can run a Flink ML example job as
-follows.
+After creating a python file (e.g. kmeans_example.py) and saving the code above
+into the file, you can run the example on the command line:
 
 ```shell
-$ python -m pyflink.examples.ml.clustering.kmeans_example
+python kmeans_example.py

Review Comment:
   I didn't realize this approach is using mini-cluster. It seems inconsistent that the Java quickstart uses remote Flink cluster but the Python quickstart uses mini-cluster.
   
   How much extra work is needed to use remote Flink cluster in this PR? We can do it in a followup PR if you prefer.



##########
docs/content/docs/development/build-and-install.md:
##########
@@ -61,9 +61,13 @@ from the root directory of Flink ML:
 ./flink-ml-dist/target/flink-ml-*-bin/flink-ml*/
 ```
 
-## Build Flink ML Python SDK
+The `mvn clean install` command would have installed the binary into your local
+Maven repository so other projects can refer to it and grab it from the
+repository. There is no additional step required for installation.
 
-### Prerequisites
+## Python SDK
+
+### Building

Review Comment:
   It seems that the commands used in this section are all about installing prerequisites.



##########
docs/content/docs/development/build-and-install.md:
##########
@@ -26,11 +26,11 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# Building Flink ML from Source
+# Building And Installing Flink ML From Source
 
-This page covers how to build Flink ML from sources.
+This page covers how to build and install Flink ML from sources.
 
-## Build Flink ML Java SDK
+## Java SDK

Review Comment:
   How about using this title:
   
   Build and Install Java SDK



##########
docs/content/docs/try-flink-ml/java/build-your-own-project.md:
##########
@@ -133,6 +259,12 @@ Vector: [0.0, 0.3]	Cluster ID: 1
 Vector: [9.0, 0.0]	Cluster ID: 0
 ```
 
+<!-- TODO: figure out why the process above does not terminate. -->
+The program might get stucked after printing out the information above, and you

Review Comment:
   stucked -> stuck



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org