Posted to commits@flink.apache.org by rm...@apache.org on 2020/05/13 14:29:01 UTC

[flink] 07/08: [FLINK-11086][docs] Make HADOOP_CLASSPATH approach more prominent in docs

This is an automated email from the ASF dual-hosted git repository.

rmetzger pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/flink.git

commit 2cc63a6d0f6cec23df5b00797a3876907ecb3342
Author: Robert Metzger <rm...@apache.org>
AuthorDate: Mon May 4 12:09:48 2020 +0200

    [FLINK-11086][docs] Make HADOOP_CLASSPATH approach more prominent in docs
---
 docs/ops/deployment/hadoop.md        | 81 +++++++++++++++++-------------------
 docs/ops/deployment/hadoop.zh.md     | 81 +++++++++++++++++-------------------
 docs/ops/deployment/yarn_setup.md    | 23 ++++------
 docs/ops/deployment/yarn_setup.zh.md | 27 +++++-------
 4 files changed, 96 insertions(+), 116 deletions(-)

diff --git a/docs/ops/deployment/hadoop.md b/docs/ops/deployment/hadoop.md
index 24471853..71914a6 100644
--- a/docs/ops/deployment/hadoop.md
+++ b/docs/ops/deployment/hadoop.md
@@ -26,34 +26,13 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-## Referencing a Hadoop configuration
-
-You can reference a Hadoop configuration by setting the environment variable `HADOOP_CONF_DIR`.
-
-```sh
-HADOOP_CONF_DIR=/path/to/etc/hadoop
-```
-
-Referencing the HDFS configuration in the [Flink configuration]({{ site.baseurl }}/ops/config.html#hdfs) is deprecated.
-
-Another way to provide the Hadoop configuration is to have it on the class path of the Flink process, see more details below.
 
 ## Providing Hadoop classes
 
 In order to use Hadoop features (e.g., YARN, HDFS) it is necessary to provide Flink with the required Hadoop classes,
 as these are not bundled by default.
 
-This can be done by 
-1) Adding the Hadoop classpath to Flink
-2) Putting the required jar files into /lib directory of the Flink distribution
-Option 1) requires very little work, integrates nicely with existing Hadoop setups and should be the
-preferred approach.
-However, Hadoop has a large dependency footprint that increases the risk for dependency conflicts to occur.
-If this happens, please refer to option 2).
-
-The following subsections explains these approaches in detail.
-
-### Adding Hadoop Classpaths
+This can be done by adding the Hadoop classpath to Flink through the `HADOOP_CLASSPATH` environment variable.
 
 Flink will use the environment variable `HADOOP_CLASSPATH` to augment the
 classpath that is used when starting Flink components such as the Client,
@@ -75,35 +54,24 @@ in the shell. Note that `hadoop` is the hadoop binary and that `classpath` is an
 
 Putting the Hadoop configuration in the same class path as the Hadoop libraries makes Flink pick up that configuration.
 
-### Adding Hadoop to /lib
-
-The Flink project releases Hadoop distributions for specific versions, that relocate or exclude several dependencies
-to reduce the risk of dependency clashes.
-These can be found in the [Additional Components]({{ site.download_url }}#additional-components) section of the download page.
-For these versions it is sufficient to download the corresponding `Pre-bundled Hadoop` component and putting it into
-the `/lib` directory of the Flink distribution.
-
-If the used Hadoop version is not listed on the download page (possibly due to being a Vendor-specific version),
-then it is necessary to build [flink-shaded](https://github.com/apache/flink-shaded) against this version.
-You can find the source code for this project in the [Additional Components]({{ site.download_url }}#additional-components) section of the download page.
+## Referencing a Hadoop configuration
 
-<span class="label label-info">Note</span> If you want to build `flink-shaded` against a vendor specific Hadoop version, you first have to configure the
-vendor-specific maven repository in your local maven setup as described [here](https://maven.apache.org/guides/mini/guide-multiple-repositories.html).
+You can reference a Hadoop configuration by setting the environment variable `HADOOP_CONF_DIR`.
 
-Run the following command to build and install `flink-shaded` against your desired Hadoop version (e.g., for version `2.6.5-custom`):
+```sh
+HADOOP_CONF_DIR=/path/to/etc/hadoop
+```
 
-{% highlight bash %}
-mvn clean install -Dhadoop.version=2.6.5-custom
-{% endhighlight %}
+Referencing the HDFS configuration in the [Flink configuration]({{ site.baseurl }}/ops/config.html#hdfs) is deprecated.
 
-After this step is complete, put the `flink-shaded-hadoop-2-uber` jar into the `/lib` directory of the Flink distribution.
+Another way to provide the Hadoop configuration is to have it on the class path of the Flink process, see more details above.
 
 ## Running a job locally
 
 To run a job locally as one JVM process using the mini cluster, the required hadoop dependencies have to be explicitly
 added to the classpath of the started JVM process.
 
-To run an application using maven (also from IDE as a maven project), the required hadoop dependencies can be added
+To run an application using Maven (also from IDE as a Maven project), the required Hadoop dependencies can be added
 as provided to the pom.xml, e.g.:
 
 ```xml
@@ -115,9 +83,38 @@ as provided to the pom.xml, e.g.:
 </dependency>
 ```
 
-This way it should work both in local and cluster run where the provided dependencies are added elsewhere as described before.
+This way it should work both in local and cluster mode where the provided dependencies are added elsewhere as described before.
 
 To run or debug an application in IntelliJ Idea the provided dependencies can be included to the class path
 in the "Run|Edit Configurations" window.
 
+
+## Using `flink-shaded-hadoop-2-uber` jar for resolving dependency conflicts (legacy)
+
+<div class="alert alert-info" markdown="span">
+  <strong>Warning:</strong> Starting from Flink 1.11, using `flink-shaded-hadoop-2-uber` releases is not officially supported
+  by the Flink project anymore. Users are advised to provide Hadoop dependencies through `HADOOP_CLASSPATH` (see above).
+</div>
+
+The Flink project used to (until Flink 1.10) release Hadoop distributions for specific versions, that relocate or exclude several dependencies to reduce the risk of dependency clashes.
+These can be found in the [Additional Components]({{ site.download_url }}#additional-components) section of the download page.
+For these versions it is sufficient to download the corresponding `Pre-bundled Hadoop` component and putting it into
+the `/lib` directory of the Flink distribution.
+
+If the used Hadoop version is not listed on the download page (possibly due to being a Vendor-specific version),
+then it is necessary to build [flink-shaded](https://github.com/apache/flink-shaded) against this version.
+You can find the source code for this project in the [Additional Components]({{ site.download_url }}#additional-components) section of the download page.
+
+<span class="label label-info">Note</span> If you want to build `flink-shaded` against a vendor specific Hadoop version, you first have to configure the
+vendor-specific maven repository in your local maven setup as described [here](https://maven.apache.org/guides/mini/guide-multiple-repositories.html).
+
+Run the following command to build and install `flink-shaded` against your desired Hadoop version (e.g., for version `2.6.5-custom`):
+
+{% highlight bash %}
+mvn clean install -Dhadoop.version=2.6.5-custom
+{% endhighlight %}
+
+After this step is complete, put the `flink-shaded-hadoop-2-uber` jar into the `/lib` directory of the Flink distribution.
+
+
 {% top %}
diff --git a/docs/ops/deployment/hadoop.zh.md b/docs/ops/deployment/hadoop.zh.md
index 39c54cc..f9cf3e1 100644
--- a/docs/ops/deployment/hadoop.zh.md
+++ b/docs/ops/deployment/hadoop.zh.md
@@ -26,34 +26,13 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-## Referencing a Hadoop configuration
-
-You can reference a Hadoop configuration by setting the environment variable `HADOOP_CONF_DIR`.
-
-```sh
-HADOOP_CONF_DIR=/path/to/etc/hadoop
-```
-
-Referencing the HDFS configuration in the [Flink configuration]({{ site.baseurl }}/ops/config.html#hdfs) is deprecated.
-
-Another way to provide the Hadoop configuration is to have it on the class path of the Flink process, see more details below.
 
 ## Providing Hadoop classes
 
 In order to use Hadoop features (e.g., YARN, HDFS) it is necessary to provide Flink with the required Hadoop classes,
 as these are not bundled by default.
 
-This can be done by 
-1) Adding the Hadoop classpath to Flink
-2) Putting the required jar files into /lib directory of the Flink distribution
-Option 1) requires very little work, integrates nicely with existing Hadoop setups and should be the
-preferred approach.
-However, Hadoop has a large dependency footprint that increases the risk for dependency conflicts to occur.
-If this happens, please refer to option 2).
-
-The following subsections explains these approaches in detail.
-
-### Adding Hadoop Classpaths
+This can be done by adding the Hadoop classpath to Flink through the `HADOOP_CLASSPATH` environment variable.
 
 Flink will use the environment variable `HADOOP_CLASSPATH` to augment the
 classpath that is used when starting Flink components such as the Client,
@@ -75,35 +54,24 @@ in the shell. Note that `hadoop` is the hadoop binary and that `classpath` is an
 
 Putting the Hadoop configuration in the same class path as the Hadoop libraries makes Flink pick up that configuration.
 
-### Adding Hadoop to /lib
-
-The Flink project releases Hadoop distributions for specific versions, that relocate or exclude several dependencies
-to reduce the risk of dependency clashes.
-These can be found in the [Additional Components]({{ site.download_url }}#additional-components) section of the download page.
-For these versions it is sufficient to download the corresponding `Pre-bundled Hadoop` component and putting it into
-the `/lib` directory of the Flink distribution.
-
-If the used Hadoop version is not listed on the download page (possibly due to being a Vendor-specific version),
-then it is necessary to build [flink-shaded](https://github.com/apache/flink-shaded) against this version.
-You can find the source code for this project in the [Additional Components]({{ site.download_url }}#additional-components) section of the download page.
+## Referencing a Hadoop configuration
 
-<span class="label label-info">Note</span> If you want to build `flink-shaded` against a vendor specific Hadoop version, you first have to configure the
-vendor-specific maven repository in your local maven setup as described [here](https://maven.apache.org/guides/mini/guide-multiple-repositories.html).
+You can reference a Hadoop configuration by setting the environment variable `HADOOP_CONF_DIR`.
 
-Run the following command to build and install `flink-shaded` against your desired Hadoop version (e.g., for version `2.6.5-custom`):
+```sh
+HADOOP_CONF_DIR=/path/to/etc/hadoop
+```
 
-{% highlight bash %}
-mvn clean install -Dhadoop.version=2.6.5-custom
-{% endhighlight %}
+Referencing the HDFS configuration in the [Flink configuration]({{ site.baseurl }}/ops/config.html#hdfs) is deprecated.
 
-After this step is complete, put the `flink-shaded-hadoop-2-uber` jar into the `/lib` directory of the Flink distribution.
+Another way to provide the Hadoop configuration is to have it on the class path of the Flink process, see more details above.
 
 ## Running a job locally
 
 To run a job locally as one JVM process using the mini cluster, the required hadoop dependencies have to be explicitly
 added to the classpath of the started JVM process.
 
-To run an application using maven (also from IDE as a maven project), the required hadoop dependencies can be added
+To run an application using Maven (also from IDE as a Maven project), the required Hadoop dependencies can be added
 as provided to the pom.xml, e.g.:
 
 ```xml
@@ -115,9 +83,38 @@ as provided to the pom.xml, e.g.:
 </dependency>
 ```
 
-This way it should work both in local and cluster run where the provided dependencies are added elsewhere as described before.
+This way it should work both in local and cluster mode where the provided dependencies are added elsewhere as described before.
 
 To run or debug an application in IntelliJ Idea the provided dependencies can be included to the class path
 in the "Run|Edit Configurations" window.
 
+
+## Using `flink-shaded-hadoop-2-uber` jar for resolving dependency conflicts (legacy)
+
+<div class="alert alert-info" markdown="span">
+  <strong>Warning:</strong> Starting from Flink 1.11, using `flink-shaded-hadoop-2-uber` releases is not officially supported
+  by the Flink project anymore. Users are advised to provide Hadoop dependencies through `HADOOP_CLASSPATH` (see above).
+</div>
+
+The Flink project used to (until Flink 1.10) release Hadoop distributions for specific versions, that relocate or exclude several dependencies to reduce the risk of dependency clashes.
+These can be found in the [Additional Components]({{ site.download_url }}#additional-components) section of the download page.
+For these versions it is sufficient to download the corresponding `Pre-bundled Hadoop` component and putting it into
+the `/lib` directory of the Flink distribution.
+
+If the used Hadoop version is not listed on the download page (possibly due to being a Vendor-specific version),
+then it is necessary to build [flink-shaded](https://github.com/apache/flink-shaded) against this version.
+You can find the source code for this project in the [Additional Components]({{ site.download_url }}#additional-components) section of the download page.
+
+<span class="label label-info">Note</span> If you want to build `flink-shaded` against a vendor specific Hadoop version, you first have to configure the
+vendor-specific maven repository in your local maven setup as described [here](https://maven.apache.org/guides/mini/guide-multiple-repositories.html).
+
+Run the following command to build and install `flink-shaded` against your desired Hadoop version (e.g., for version `2.6.5-custom`):
+
+{% highlight bash %}
+mvn clean install -Dhadoop.version=2.6.5-custom
+{% endhighlight %}
+
+After this step is complete, put the `flink-shaded-hadoop-2-uber` jar into the `/lib` directory of the Flink distribution.
+
+
 {% top %}
diff --git a/docs/ops/deployment/yarn_setup.md b/docs/ops/deployment/yarn_setup.md
index 2e04d45..4dcd290 100644
--- a/docs/ops/deployment/yarn_setup.md
+++ b/docs/ops/deployment/yarn_setup.md
@@ -33,11 +33,8 @@ under the License.
 Start a YARN session where the job manager gets 1 GB of heap space and the task managers 4 GB of heap space assigned:
 
 {% highlight bash %}
-# get the hadoop2 package from the Flink download page at
-# {{ site.download_url }}
-curl -O <flink_hadoop2_download_url>
-tar xvzf flink-{{ site.version }}-bin-hadoop2.tgz
-cd flink-{{ site.version }}/
+# If HADOOP_CLASSPATH is not set:
+#   export HADOOP_CLASSPATH=`hadoop classpath`
 ./bin/yarn-session.sh -jm 1024m -tm 4096m
 {% endhighlight %}
 
@@ -48,11 +45,8 @@ Once the session has been started, you can submit jobs to the cluster using the
 ### Run a Flink job on YARN
 
 {% highlight bash %}
-# get the hadoop2 package from the Flink download page at
-# {{ site.download_url }}
-curl -O <flink_hadoop2_download_url>
-tar xvzf flink-{{ site.version }}-bin-hadoop2.tgz
-cd flink-{{ site.version }}/
+# If HADOOP_CLASSPATH is not set:
+#   export HADOOP_CLASSPATH=`hadoop classpath`
 ./bin/flink run -m yarn-cluster -p 4 -yjm 1024m -ytm 4096m ./examples/batch/WordCount.jar
 {% endhighlight %}
 
@@ -62,11 +56,9 @@ Apache [Hadoop YARN](http://hadoop.apache.org/) is a cluster resource management
 
 **Requirements**
 
-- at least Apache Hadoop 2.2
+- at least Apache Hadoop 2.4.1
 - HDFS (Hadoop Distributed File System) (or another distributed file system supported by Hadoop)
 
-If you have troubles using the Flink YARN client, have a look in the [FAQ section](https://flink.apache.org/faq.html#yarn-deployment).
-
 ### Start Flink Session
 
 Follow these instructions to learn how to launch a Flink Session within your YARN cluster.
@@ -75,15 +67,16 @@ A session will start all required Flink services (JobManager and TaskManagers) s
 
 #### Download Flink
 
-Download a Flink package for Hadoop >= 2 from the [download page]({{ site.download_url }}). It contains the required files.
+Download a Flink package from the [download page]({{ site.download_url }}). It contains the required files.
 
 Extract the package using:
 
 {% highlight bash %}
-tar xvzf flink-{{ site.version }}-bin-hadoop2.tgz
+tar xvzf flink-{{ site.version }}-bin-scala*.tgz
 cd flink-{{site.version }}/
 {% endhighlight %}
 
+
 #### Start a Session
 
 Use the following command to start a session
diff --git a/docs/ops/deployment/yarn_setup.zh.md b/docs/ops/deployment/yarn_setup.zh.md
index 887bd71..7ff496b 100644
--- a/docs/ops/deployment/yarn_setup.zh.md
+++ b/docs/ops/deployment/yarn_setup.zh.md
@@ -33,11 +33,8 @@ under the License.
 Start a YARN session where the job manager gets 1 GB of heap space and the task managers 4 GB of heap space assigned:
 
 {% highlight bash %}
-# get the hadoop2 package from the Flink download page at
-# {{ site.download_url }}
-curl -O <flink_hadoop2_download_url>
-tar xvzf flink-{{ site.version }}-bin-hadoop2.tgz
-cd flink-{{ site.version }}/
+# If HADOOP_CLASSPATH is not set:
+#   export HADOOP_CLASSPATH=`hadoop classpath`
 ./bin/yarn-session.sh -jm 1024m -tm 4096m
 {% endhighlight %}
 
@@ -48,11 +45,8 @@ Once the session has been started, you can submit jobs to the cluster using the
 ### Run a Flink job on YARN
 
 {% highlight bash %}
-# get the hadoop2 package from the Flink download page at
-# {{ site.download_url }}
-curl -O <flink_hadoop2_download_url>
-tar xvzf flink-{{ site.version }}-bin-hadoop2.tgz
-cd flink-{{ site.version }}/
+# If HADOOP_CLASSPATH is not set:
+#   export HADOOP_CLASSPATH=`hadoop classpath`
 ./bin/flink run -m yarn-cluster -p 4 -yjm 1024m -ytm 4096m ./examples/batch/WordCount.jar
 {% endhighlight %}
 
@@ -62,11 +56,9 @@ Apache [Hadoop YARN](http://hadoop.apache.org/) is a cluster resource management
 
 **Requirements**
 
-- at least Apache Hadoop 2.2
+- at least Apache Hadoop 2.4.1
 - HDFS (Hadoop Distributed File System) (or another distributed file system supported by Hadoop)
 
-If you have troubles using the Flink YARN client, have a look in the [FAQ section](https://flink.apache.org/faq.html#yarn-deployment).
-
 ### Start Flink Session
 
 Follow these instructions to learn how to launch a Flink Session within your YARN cluster.
@@ -75,15 +67,16 @@ A session will start all required Flink services (JobManager and TaskManagers) s
 
 #### Download Flink
 
-Download a Flink package for Hadoop >= 2 from the [download page]({{ site.download_url }}). It contains the required files.
+Download a Flink package from the [download page]({{ site.download_url }}). It contains the required files.
 
 Extract the package using:
 
 {% highlight bash %}
-tar xvzf flink-{{ site.version }}-bin-hadoop2.tgz
+tar xvzf flink-{{ site.version }}-bin-scala*.tgz
 cd flink-{{site.version }}/
 {% endhighlight %}
 
+
 #### Start a Session
 
 Use the following command to start a session
@@ -125,7 +118,7 @@ If you don't want to change the configuration file to set configuration paramete
 
 The example invocation starts a single container for the ApplicationMaster which runs the Job Manager.
 
-The session cluster will automatically allocate additional containers which run the Task Managers when jobs are submitted to the cluster.
+The session cluster will automatically allocate additional containers which run the Task Managers when jobs are submitted to the cluster. 
 
 Once Flink is deployed in your YARN cluster, it will show you the connection details of the Job Manager.
 
@@ -338,4 +331,4 @@ The *JobManager* and AM are running in the same container. Once they successfull
 
 After that, the AM starts allocating the containers for Flink's TaskManagers, which will download the jar file and the modified configuration from the HDFS. Once these steps are completed, Flink is set up and ready to accept Jobs.
 
-{% top %}
+{% top %}
\ No newline at end of file
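
The `# If HADOOP_CLASSPATH is not set:` comments added to the YARN examples above could be turned into a small guard in a launch script. The following is only a minimal sketch under stated assumptions: `ensure_hadoop_classpath` is a hypothetical helper name that is not part of Flink or of this commit, and it assumes that the `hadoop` binary (or whatever command is passed as the first argument) prints the classpath when called with `classpath`.

```shell
# Hypothetical helper (not part of Flink): export HADOOP_CLASSPATH only if it
# is unset or empty. The optional first argument names the command that prints
# the classpath; it defaults to `hadoop`.
ensure_hadoop_classpath() {
  _hadoop_bin="${1:-hadoop}"

  # Respect a value that the environment already provides.
  if [ -n "${HADOOP_CLASSPATH:-}" ]; then
    return 0
  fi

  # Fail early with a hint if the command cannot be found.
  if ! command -v "$_hadoop_bin" >/dev/null 2>&1; then
    echo "cannot find '$_hadoop_bin'; set HADOOP_CLASSPATH manually" >&2
    return 1
  fi

  # Equivalent to: export HADOOP_CLASSPATH=`hadoop classpath`
  HADOOP_CLASSPATH="$("$_hadoop_bin" classpath)" || return 1
  export HADOOP_CLASSPATH
}
```

With this in place, a session could be started with `ensure_hadoop_classpath && ./bin/yarn-session.sh -jm 1024m -tm 4096m`, so that the session script only runs once the classpath is available.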