Posted to commits@hudi.apache.org by xu...@apache.org on 2022/04/29 09:42:35 UTC

[hudi] branch asf-site updated: [HUDI-3680][HUDI-3926] Update docs for Spark, utilities, and utilities-slim bundles (#5454)

This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 85c72daf0b [HUDI-3680][HUDI-3926] Update docs for Spark, utilities, and utilities-slim bundles (#5454)
85c72daf0b is described below

commit 85c72daf0b3df88d3556f51c921bed0485495e05
Author: Y Ethan Guo <et...@gmail.com>
AuthorDate: Fri Apr 29 02:42:27 2022 -0700

    [HUDI-3680][HUDI-3926] Update docs for Spark, utilities, and utilities-slim bundles (#5454)
---
 website/docs/deployment.md           | 15 +++++--
 website/docs/docker_demo.md          | 12 ++----
 website/docs/hoodie_deltastreamer.md |  8 +++-
 website/docs/quick-start-guide.md    | 76 +++++++++++++++++++++---------------
 website/docs/syncing_metastore.md    |  1 -
 5 files changed, 67 insertions(+), 45 deletions(-)

diff --git a/website/docs/deployment.md b/website/docs/deployment.md
index 739480205d..a4a57fb6b0 100644
--- a/website/docs/deployment.md
+++ b/website/docs/deployment.md
@@ -25,14 +25,23 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d
 
 ### DeltaStreamer
 
-[DeltaStreamer](/docs/hoodie_deltastreamer#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes.
+[DeltaStreamer](/docs/hoodie_deltastreamer#deltastreamer) is the standalone utility to incrementally pull upstream changes 
+from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables.  It runs as a spark application in two modes.
+
+To use DeltaStreamer in Spark, the `hudi-utilities-bundle` is required; add it by passing
+`--packages org.apache.hudi:hudi-utilities-bundle_2.11:0.11.0` to the `spark-submit` command. Starting with the 0.11.0
+release, we provide a new `hudi-utilities-slim-bundle`, which excludes dependencies that can cause conflicts and
+compatibility issues with different versions of Spark.  If using `hudi-utilities-bundle` alone in Spark runs into
+compatibility issues, use `hudi-utilities-slim-bundle` together with the Hudi Spark bundle corresponding to the
+Spark version in use, e.g.,
+`--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.11.0,org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0`.
 
  - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for eve [...]
 
 Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster.
 
 ```java
-[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \
+[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.11.0 \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
@@ -80,7 +89,7 @@ Here is an example invocation for reading from kafka topic in a single-run mode
 Here is an example invocation for reading from kafka topic in a continuous mode and writing to Merge On Read table type in a yarn cluster.
 
 ```java
-[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \
+[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.11.0 \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
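
For readers trying out the slim bundle described above, here is a minimal sketch (not part of the committed docs) of a DeltaStreamer invocation on Spark 3.1 using `hudi-utilities-slim-bundle` together with the matching Spark bundle. The jar path, source class, table names, and properties file are illustrative placeholders, not values from this commit.

```shell
# Sketch only: slim utilities bundle + matching Spark 3.1 bundle; paths and names are placeholders.
[hoodie]$ spark-submit \
  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.11.0,org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0 \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  /path/to/hudi-utilities-slim-bundle_2.12-0.11.0.jar \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path /user/hive/warehouse/stock_ticks_mor \
  --target-table stock_ticks_mor \
  --props /var/demo/config/kafka-source.properties
```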
diff --git a/website/docs/docker_demo.md b/website/docs/docker_demo.md
index eeccf117aa..26d41251bc 100644
--- a/website/docs/docker_demo.md
+++ b/website/docs/docker_demo.md
@@ -391,8 +391,7 @@ $SPARK_INSTALL/bin/spark-shell \
   --deploy-mode client \
   --driver-memory 1G \
   --executor-memory 3G \
-  --num-executors 1 \
-  --packages org.apache.spark:spark-avro_2.11:2.4.4
+  --num-executors 1
 ...
 
 Welcome to
@@ -793,8 +792,7 @@ $SPARK_INSTALL/bin/spark-shell \
   --driver-memory 1G \
   --master local[2] \
   --executor-memory 3G \
-  --num-executors 1 \
-  --packages org.apache.spark:spark-avro_2.11:2.4.4
+  --num-executors 1
 
 # Copy On Write Table:
 
@@ -1050,8 +1048,7 @@ $SPARK_INSTALL/bin/spark-shell \
   --driver-memory 1G \
   --master local[2] \
   --executor-memory 3G \
-  --num-executors 1 \
-  --packages org.apache.spark:spark-avro_2.11:2.4.4
+  --num-executors 1
 
 Welcome to
       ____              __
@@ -1247,8 +1244,7 @@ $SPARK_INSTALL/bin/spark-shell \
   --driver-memory 1G \
   --master local[2] \
   --executor-memory 3G \
-  --num-executors 1 \
-  --packages org.apache.spark:spark-avro_2.11:2.4.4
+  --num-executors 1
 
 # Read Optimized Query
 scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
diff --git a/website/docs/hoodie_deltastreamer.md b/website/docs/hoodie_deltastreamer.md
index ae87c579cd..6f2c80d5cf 100644
--- a/website/docs/hoodie_deltastreamer.md
+++ b/website/docs/hoodie_deltastreamer.md
@@ -5,7 +5,7 @@ keywords: [hudi, deltastreamer, hoodiedeltastreamer]
 
 ## DeltaStreamer
 
-The `HoodieDeltaStreamer` utility (part of hudi-utilities-bundle) provides the way to ingest from different sources such as DFS or Kafka, with the following capabilities.
+The `HoodieDeltaStreamer` utility (part of `hudi-utilities-bundle`) provides the way to ingest from different sources such as DFS or Kafka, with the following capabilities.
 
 - Exactly once ingestion of new events from Kafka, [incremental imports](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide#_incremental_imports) from Sqoop or output of `HiveIncrementalPuller` or files under a DFS folder
 - Support json, avro or a custom record types for the incoming data
@@ -151,6 +151,12 @@ and then ingest it as follows.
 
 In some cases, you may want to migrate your existing table into Hudi beforehand. Please refer to [migration guide](/docs/migration_guide).
 
+Starting with the 0.11.0 release, we provide a new `hudi-utilities-slim-bundle`, which excludes dependencies that can
+cause conflicts and compatibility issues with different versions of Spark.  If using `hudi-utilities-bundle` alone to
+run `HoodieDeltaStreamer` in Spark encounters compatibility issues, use `hudi-utilities-slim-bundle` together with the
+Hudi Spark bundle corresponding to the Spark version in use, e.g.,
+`--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.11.0,org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0`.
+
 ### MultiTableDeltaStreamer
 
 `HoodieMultiTableDeltaStreamer`, a wrapper on top of `HoodieDeltaStreamer`, enables one to ingest multiple tables at a single go into hudi datasets. Currently it only supports sequential processing of tables to be ingested and COPY_ON_WRITE storage type. The command line options for `HoodieMultiTableDeltaStreamer` are pretty much similar to `HoodieDeltaStreamer` with the only exception that you are required to provide table wise configs in separate files in a dedicated config folder. The [...]
diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index 0841351f63..8f77a34dd7 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -20,7 +20,7 @@ Hudi works with Spark-2.4.3+ & Spark 3.x versions. You can follow instructions [
 
 | Hudi            | Supported Spark 3 version     |
 |:----------------|:------------------------------|
-| 0.11.0          | 3.2.x (default build), 3.1.x  |
+| 0.11.0          | 3.2.x (default build, Spark bundle only), 3.1.x  |
 | 0.10.0          | 3.1.x (default build), 3.0.x  |
 | 0.7.0 - 0.9.0   | 3.0.x                         |
 | 0.6.0 and prior | not supported                 |
@@ -29,6 +29,16 @@ Hudi works with Spark-2.4.3+ & Spark 3.x versions. You can follow instructions [
 
 As of 0.9.0 release, Spark SQL DML support has been added and is experimental.
 
+In the 0.11.0 release, we add support for Spark 3.2.x and continue to support Spark 3.1.x and Spark 2.4.x.  Spark 3.0.x
+is no longer officially supported.  To make it easier for users to pick the right Hudi Spark bundle for their
+deployment, we adjust the naming of the bundles as follows:
+
+- For each supported Spark minor version, there is a corresponding Hudi Spark bundle with the major and minor Spark
+version in its name, i.e., `hudi-spark3.2-bundle`, `hudi-spark3.1-bundle`, and `hudi-spark2.4-bundle`.
+- We encourage users to migrate to the new bundles above.  The bundles with the legacy naming are kept in this
+release, i.e., `hudi-spark3-bundle` targeting Spark 3.2.x, the latest Spark 3 version, and `hudi-spark-bundle` for
+Spark 2.4.x.
+
 <Tabs
 defaultValue="scala"
 values={[
@@ -41,24 +51,25 @@ values={[
 From the extracted directory run spark-shell with Hudi as:
 
 ```scala
-// spark-shell for spark 3.1
+// spark-shell for spark 3.2
 spark-shell \
-  --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 \
-  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
+  --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0 \
+  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
 
-// spark-shell for spark 3.2
+// spark-shell for spark 3.1
 spark-shell \
-  --packages org.apache.hudi:hudi-spark3.0.3-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.0.3 \
+  --packages org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
-// spark-shell for spark 2 with scala 2.12
+// spark-shell for spark 2.4 with scala 2.12
 spark-shell \
-  --packages org.apache.hudi:hudi-spark-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:2.4.4 \
+  --packages org.apache.hudi:hudi-spark2.4-bundle_2.12:0.11.0 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
-// spark-shell for spark 2 with scala 2.11
+// spark-shell for spark 2.4 with scala 2.11
 spark-shell \
-  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.10.1,org.apache.spark:spark-avro_2.11:2.4.4 \
+  --packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.11.0 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
 ```
 
@@ -69,24 +80,25 @@ Hudi support using Spark SQL to write and read data with the **HoodieSparkSessio
 From the extracted directory run Spark SQL with Hudi as:
 
 ```shell
-# Spark SQL for spark 3.1
-spark-sql --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 \
+# Spark SQL for spark 3.2
+spark-sql --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0 \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
 --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
 
-# Spark SQL for spark 3.0
-spark-sql --packages org.apache.hudi:hudi-spark3.0.3-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.0.3 \
+# Spark SQL for spark 3.1
+spark-sql --packages org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0 \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
 --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
 
-# Spark SQL for spark 2 with scala 2.11
-spark-sql --packages org.apache.hudi:hudi-spark-bundle_2.11:0.10.1,org.apache.spark:spark-avro_2.11:2.4.4 \
+# Spark SQL for spark 2.4 with scala 2.11
+spark-sql --packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.11.0 \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
 --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
 
-# Spark SQL for spark 2 with scala 2.12
+# Spark SQL for spark 2.4 with scala 2.12
 spark-sql \
-  --packages org.apache.hudi:hudi-spark-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:2.4.4 \
+  --packages org.apache.hudi:hudi-spark2.4-bundle_2.12:0.11.0 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
   --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
 ```
@@ -100,24 +112,25 @@ From the extracted directory run pyspark with Hudi as:
 # pyspark
 export PYSPARK_PYTHON=$(which python3)
 
-# for spark3.1
+# for spark3.2
 pyspark
---packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2
+--packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
+--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
 
-# for spark3.0
+# for spark3.1
 pyspark
---packages org.apache.hudi:hudi-spark3.0.3-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.0.3
+--packages org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
 
-# for spark2 with scala 2.12
+# for spark2.4 with scala 2.12
 pyspark
---packages org.apache.hudi:hudi-spark-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:2.4.4
+--packages org.apache.hudi:hudi-spark2.4-bundle_2.12:0.11.0
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
 
-# for spark2 with scala 2.11
+# for spark2.4 with scala 2.11
 pyspark
---packages org.apache.hudi:hudi-spark-bundle_2.11:0.10.1,org.apache.spark:spark-avro_2.11:2.4.4
+--packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.11.0
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
 ```
 
@@ -126,10 +139,9 @@ pyspark
 
 :::note Please note the following
 <ul>
-  <li>spark-avro module needs to be specified in --packages as it is not included with spark-shell by default</li>
-  <li>spark-avro and spark versions must match (we have used 3.1.2 for both above)</li>
-  <li>we have used hudi-spark-bundle built for scala 2.12 since the spark-avro module used also depends on 2.12. 
-         If spark-avro_2.11 is used, correspondingly hudi-spark-bundle_2.11 needs to be used. </li>
+  <li> For Spark 3.2, the additional spark_catalog config is required: 
+--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' </li>
+  <li> We have used hudi-spark-bundle built for Scala 2.12 since the spark-avro module used also depends on Scala 2.12. </li>
 </ul>
 :::
 
@@ -1175,8 +1187,8 @@ more details please refer to [procedures](procedures).
 ## Where to go from here?
 
 You can also do the quickstart by [building hudi yourself](https://github.com/apache/hudi#building-apache-hudi-from-source), 
-and using `--jars <path to hudi_code>/packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.1?-*.*.*-SNAPSHOT.jar` in the spark-shell command above
-instead of `--packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1`. Hudi also supports scala 2.12. Refer [build with scala 2.12](https://github.com/apache/hudi#build-with-different-spark-versions)
+and using `--jars <path to hudi_code>/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.1?-*.*.*-SNAPSHOT.jar` in the spark-shell command above
+instead of `--packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0`. Hudi also supports scala 2.12. Refer [build with scala 2.12](https://github.com/apache/hudi#build-with-different-spark-versions)
 for more info.
 
 Also, we used Spark here to show case the capabilities of Hudi. However, Hudi can support multiple table types/query types and 
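
As a rough illustration of the `--jars` alternative mentioned in the quick-start change above, a locally built Spark 3.2 bundle could be supplied as follows; the path and snapshot-version wildcard mirror the docs and are placeholders for whatever a local build produces.

```shell
# Sketch only: spark-shell with a locally built Hudi Spark 3.2 bundle instead of --packages.
spark-shell \
  --jars <path to hudi_code>/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-*.*.*-SNAPSHOT.jar \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
```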
diff --git a/website/docs/syncing_metastore.md b/website/docs/syncing_metastore.md
index f1c1fdc582..1b2baa0f24 100644
--- a/website/docs/syncing_metastore.md
+++ b/website/docs/syncing_metastore.md
@@ -181,7 +181,6 @@ Assuming the metastore is configured properly, then start the spark-shell.
 
 ```
 $SPARK_INSTALL/bin/spark-shell   --jars $HUDI_SPARK_BUNDLE \
-  --packages org.apache.spark:spark-avro_2.11:2.4.4 \ 
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
 ```
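
For completeness, here is a hedged sketch of how the variables referenced in the syncing_metastore.md snippet above might be set; both paths are placeholders that depend on the local Spark install and on where the Hudi Spark bundle jar lives.

```shell
# Sketch only: placeholder locations for the Spark install and the Hudi Spark bundle jar.
export SPARK_INSTALL=/opt/spark
export HUDI_SPARK_BUNDLE=/opt/hudi/jars/hudi-spark3.1-bundle_2.12-0.11.0.jar

$SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```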