Posted to commits@hudi.apache.org by vi...@apache.org on 2019/10/10 12:12:50 UTC

[incubator-hudi] branch asf-site updated: [HUDI-291] Simplify quickstart documentation (#937)

This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 9eddc40  [HUDI-291] Simplify quickstart documentation (#937)
9eddc40 is described below

commit 9eddc40d469e560158fd1cea617cad0a61a5e748
Author: Bhavani Sudha Saktheeswaran <bh...@uber.com>
AuthorDate: Thu Oct 10 05:12:45 2019 -0700

    [HUDI-291] Simplify quickstart documentation (#937)
    
    - Uses spark-shell based examples to showcase Hudi core features
    - Info related to hive sync, hive, presto, etc are removed
---
 docs/quickstart.md | 308 ++++++++++++++++++++++++-----------------------------
 1 file changed, 140 insertions(+), 168 deletions(-)

diff --git a/docs/quickstart.md b/docs/quickstart.md
index 5007442..0f331fc 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -7,192 +7,164 @@ toc: false
 permalink: quickstart.html
 ---
 <br/>
-To get a quick peek at Hudi's capabilities, we have put together a [demo video](https://www.youtube.com/watch?v=VhNgUsxdrD0) 
-that showcases this on a docker based setup with all dependent systems running locally. We recommend you replicate the same setup 
-and run the demo yourself, by following steps [here](docker_demo.html). Also, if you are looking for ways to migrate your existing data to Hudi, 
-refer to [migration guide](migration_guide.html).
 
-If you have Hive, Hadoop, Spark installed already & prefer to do it on your own setup, read on.
+This guide provides a quick peek at Hudi's capabilities using spark-shell. Using the Spark datasource, we will walk through 
+code snippets that allow you to insert and update a Hudi dataset of the default storage type: 
+[Copy on Write](https://hudi.apache.org/concepts.html#copy-on-write-storage). 
+After each write operation we will also show how to read the data, both as a snapshot and incrementally.
 
-## Download Hudi
+## Build Hudi spark bundle jar
 
-Check out [code](https://github.com/apache/incubator-hudi) and normally build the maven project, from command line
+Hudi requires Java 8 to be installed on a *nix system. Check out the [code](https://github.com/apache/incubator-hudi) and 
+build the maven project normally from the command line:
 
-```
-$ mvn clean install -DskipTests -DskipITs
-```
-
-Hudi works with Hive 2.3.x or higher versions. As long as Hive 2.x protocol can talk to Hive 1.x, you can use Hudi to 
-talk to older hive versions.
-
-For IDE, you can pull in the code into IntelliJ as a normal maven project. 
-You might want to add your spark jars folder to project dependencies under 'Module Setttings', to be able to run from IDE.
-
-
-### Version Compatibility
-
-Hudi requires Java 8 to be installed on a *nix system. Hudi works with Spark-2.x versions. 
-Further, we have verified that Hudi works with the following combination of Hadoop/Hive/Spark.
-
-| Hadoop | Hive  | Spark | Instructions to Build Hudi |
-| ---- | ----- | ---- | ---- |
-| Apache hadoop-2.[7-8].x | Apache hive-2.3.[1-3] | spark-2.[1-3].x | Use "mvn clean install -DskipTests" |
-
-If your environment has other versions of hadoop/hive/spark, please try out Hudi 
-and let us know if there are any issues. 
-
-## Generate Sample Dataset
-
-### Environment Variables
-
-Please set the following environment variables according to your setup. We have given an example setup with CDH version
-
-```
-cd incubator-hudi 
-export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
-export HIVE_HOME=/var/hadoop/setup/apache-hive-1.1.0-cdh5.7.2-bin
-export HADOOP_HOME=/var/hadoop/setup/hadoop-2.6.0-cdh5.7.2
-export HADOOP_INSTALL=/var/hadoop/setup/hadoop-2.6.0-cdh5.7.2
-export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
-export SPARK_HOME=/var/hadoop/setup/spark-2.3.1-bin-hadoop2.7
-export SPARK_INSTALL=$SPARK_HOME
-export SPARK_CONF_DIR=$SPARK_HOME/conf
-export PATH=$JAVA_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$SPARK_INSTALL/bin:$PATH
-```
-
-### Run HoodieJavaApp
-
-Run __hudi-spark/src/test/java/HoodieJavaApp.java__ class, to place a two commits (commit 1 => 100 inserts, commit 2 => 100 updates to previously inserted 100 records) onto your DFS/local filesystem. Use the wrapper script
-to run from command-line
-
-```
-cd hudi-spark
-./run_hoodie_app.sh --help
-Usage: <main class> [options]
-  Options:
-    --help, -h
-       Default: false
-    --table-name, -n
-       table name for Hudi sample table
-       Default: hoodie_rt
-    --table-path, -p
-       path for Hudi sample table
-       Default: file:///tmp/hoodie/sample-table
-    --table-type, -t
-       One of COPY_ON_WRITE or MERGE_ON_READ
-       Default: COPY_ON_WRITE
-```
-
-The class lets you choose table names, output paths and one of the storage types. In your own applications, be sure to include the `hudi-spark` module as dependency
-and follow a similar pattern to write/read datasets via the datasource. 
-
-## Query a Hudi dataset
-
-Next, we will register the sample dataset into Hive metastore and try to query using [Hive](#hive), [Spark](#spark) & [Presto](#presto)
-
-### Start Hive Server locally
+``` 
+# checkout and build
+git clone https://github.com/apache/incubator-hudi.git && cd incubator-hudi
+mvn clean install -DskipTests -DskipITs
 
+# Export the location of hudi-spark-bundle for later 
+mkdir -p /tmp/hudi && cp packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar  /tmp/hudi/hudi-spark-bundle.jar 
+export HUDI_SPARK_BUNDLE_PATH=/tmp/hudi/hudi-spark-bundle.jar
 ```
-hdfs namenode # start name node
-hdfs datanode # start data node
-
-bin/hive --service metastore  # start metastore
-bin/hiveserver2 \
-  --hiveconf hive.root.logger=INFO,console \
-  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
-  --hiveconf hive.stats.autogather=false \
-  --hiveconf hive.aux.jars.path=/path/to/packaging/hudi-hive-bundle/target/hudi-hive-bundle-0.4.6-SNAPSHOT.jar
 
-```
+## Setup spark-shell
+Hudi works with Spark-2.x versions. You can follow the instructions [here](https://spark.apache.org/downloads.html) for 
+setting up Spark. 
 
-### Run Hive Sync Tool
-Hive Sync Tool will update/create the necessary metadata(schema and partitions) in hive metastore. This allows for schema evolution and incremental addition of new partitions written to.
-It uses an incremental approach by storing the last commit time synced in the TBLPROPERTIES and only syncing the commits from the last sync commit time stored.
-Both [Spark Datasource](writing_data.html#datasource-writer) & [DeltaStreamer](writing_data.html#deltastreamer) have capability to do this, after each write.
+From the extracted directory run spark-shell with Hudi as:
 
 ```
-cd hudi-hive
-./run_sync_tool.sh
-  --user hive
-  --pass hive
-  --database default
-  --jdbc-url "jdbc:hive2://localhost:10010/"
-  --base-path tmp/hoodie/sample-table/
-  --table hoodie_test
-  --partitioned-by field1,field2
-
+bin/spark-shell --jars $HUDI_SPARK_BUNDLE_PATH --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
 ```
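+
+Once the shell comes up, you can optionally sanity-check the session; this is just an illustrative check, not a required step:
+
+```
+// spark-shell exposes a SparkSession as `spark`
+spark.version                          // should print a 2.x version
+spark.conf.get("spark.serializer")     // should print org.apache.spark.serializer.KryoSerializer
+```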
-For some reason, if you want to do this by hand. Please 
-follow [this](https://cwiki.apache.org/confluence/display/HUDI/Registering+sample+dataset+to+Hive+via+beeline).
-
 
-### HiveQL {#hive}
+Set up the table name, base path and a data generator to generate records for this guide.
 
-Let's first perform a query on the latest committed snapshot of the table
-
-```
-hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
-hive> set hive.stats.autogather=false;
-hive> add jar file:///path/to/hudi-hive-bundle-0.4.6-SNAPSHOT.jar;
-hive> select count(*) from hoodie_test;
-...
-OK
-100
-Time taken: 18.05 seconds, Fetched: 1 row(s)
-hive>
 ```
+import org.apache.hudi.QuickstartUtils._
+import scala.collection.JavaConversions._
+import org.apache.spark.sql.SaveMode._
+import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.config.HoodieWriteConfig._
 
-### SparkSQL {#spark}
-
-Spark is super easy, once you get Hive working as above. Just spin up a Spark Shell as below
-
-```
-$ cd $SPARK_INSTALL
-$ spark-shell --jars $HUDI_SRC/packaging/hudi-spark-bundle/target/hudi-spark-bundle-0.4.6-SNAPSHOT.jar --driver-class-path $HADOOP_CONF_DIR  --conf spark.sql.hive.convertMetastoreParquet=false --packages com.databricks:spark-avro_2.11:4.0.0
-
-scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
-scala> sqlContext.sql("show tables").show(10000)
-scala> sqlContext.sql("describe hoodie_test").show(10000)
-scala> sqlContext.sql("describe hoodie_test_rt").show(10000)
-scala> sqlContext.sql("select count(*) from hoodie_test").show(10000)
+val tableName = "hudi_cow_table"
+val basePath = "file:///tmp/hudi_cow_table"
+val dataGen = new DataGenerator
 ```
 
-### Presto {#presto}
-
-Checkout the 'master' branch on OSS Presto, build it, and place your installation somewhere.
-
-* Copy the hudi/packaging/hudi-presto-bundle/target/hudi-presto-bundle-*.jar into $PRESTO_INSTALL/plugin/hive-hadoop2/
-* Startup your server and you should be able to query the same Hive table via Presto
+The [DataGenerator](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java) 
+can generate sample inserts and updates based on the sample trip schema 
+[here](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L57).
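+
+For example, here is a minimal sketch to peek at one generated record as JSON (note that each call to `generateInserts` also advances the generator's internal key state):
+
+```
+// generate a single insert and print its JSON payload
+val sample = convertToStringList(dataGen.generateInserts(1))
+sample.foreach(println)
+```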
 
-```
-show columns from hive.default.hoodie_test;
-select count(*) from hive.default.hoodie_test
-```
-
-### Incremental HiveQL
 
-Let's now perform a query, to obtain the __ONLY__ changed rows since a commit in the past.
+## Insert data {#inserts}
+Generate some new trips, load them into a DataFrame and write the DataFrame into the Hudi dataset as below.
 
 ```
-hive> set hoodie.hoodie_test.consume.mode=INCREMENTAL;
-hive> set hoodie.hoodie_test.consume.start.timestamp=001;
-hive> set hoodie.hoodie_test.consume.max.commits=10;
-hive> select `_hoodie_commit_time`, rider, driver from hoodie_test where `_hoodie_commit_time` > '001' limit 10;
-OK
-All commits :[001, 002]
-002	rider-001	driver-001
-002	rider-001	driver-001
-002	rider-002	driver-002
-002	rider-001	driver-001
-002	rider-001	driver-001
-002	rider-002	driver-002
-002	rider-001	driver-001
-002	rider-002	driver-002
-002	rider-002	driver-002
-002	rider-001	driver-001
-Time taken: 0.056 seconds, Fetched: 10 row(s)
-hive>
-hive>
-```
-
-This is only supported for Read-optimized view for now."
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("org.apache.hudi").
+    options(getQuickstartWriteConfigs).
+    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+    option(TABLE_NAME, tableName).
+    mode(Overwrite).
+    save(basePath);
+``` 
+
+`mode(Overwrite)` overwrites and recreates the dataset if it already exists.
+You can check the data generated under `/tmp/hudi_cow_table/<region>/<country>/<city>/`. We provided a record key 
+(`uuid` in [schema](#sample-schema)), partition field (`region/country/city`) and combine logic (`ts` in 
+[schema](#sample-schema)) to ensure trip records are unique within each partition. For more info, refer to 
+[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#Frequentlyaskedquestions(FAQ)-HowdoImodelthedatastoredinHudi?)
+and for info on ways to ingest data into Hudi, refer to [Writing Hudi Datasets](https://hudi.apache.org/writing_data.html).
+Here we are using the default write operation: `upsert`. If you have a workload without updates, you can also issue 
+`insert` or `bulk_insert` operations, which could be faster. To know more, refer to 
+[Write operations](https://hudi.apache.org/writing_data.html#write-operations).
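+
+As a hedged sketch, explicitly selecting the operation looks like the following (this assumes the `OPERATION_OPT_KEY` and operation value constants from `DataSourceWriteOptions`, imported above):
+
+```
+// same write as before, but explicitly requesting bulk_insert instead of the default upsert
+df.write.format("org.apache.hudi").
+    options(getQuickstartWriteConfigs).
+    option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
+    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+    option(TABLE_NAME, tableName).
+    mode(Overwrite).
+    save(basePath)
+```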
+ 
+## Query data {#query}
+Load the data files into a DataFrame.
+```
+val roViewDF = spark.
+    read.
+    format("org.apache.hudi").
+    load(basePath + "/*/*/*/*")
+roViewDF.registerTempTable("hudi_ro_table")
+spark.sql("select fare, begin_lon, begin_lat, ts from  hudi_ro_table where fare > 20.0").show()
+spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_ro_table").show()
+```
+This query provides a read optimized view of the ingested data. Since our partition path (`region/country/city`) is three levels 
+deep from the base path, we have used `load(basePath + "/*/*/*/*")`. 
+Refer to [Storage Types and Views](https://hudi.apache.org/concepts.html#storage-types--views) for more info on all storage types and views supported.
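+
+If you are curious which metadata fields Hudi adds, a quick way to see them is to inspect the schema of the DataFrame loaded above:
+
+```
+// the _hoodie_* columns (commit time, record key, partition path, etc.) are stored alongside the data fields
+roViewDF.printSchema()
+```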
+
+## Update data {#updates}
+This is similar to inserting new data. Generate updates to existing trips using the data generator, load them into a DataFrame, 
+and write the DataFrame into the Hudi dataset.
+
+```
+val updates = convertToStringList(dataGen.generateUpdates(10))
+val df = spark.read.json(spark.sparkContext.parallelize(updates, 2));
+df.write.format("org.apache.hudi").
+    options(getQuickstartWriteConfigs).
+    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+    option(TABLE_NAME, tableName).
+    mode(Append).
+    save(basePath);
+```
+
+Notice that the save mode is now `Append`. In general, always use append mode unless you are trying to create the dataset for the first time.
+[Querying](#query) the data again will now show updated trips. Each write operation generates a new [commit](http://hudi.incubator.apache.org/concepts.html) 
+denoted by the timestamp. Look for changes in the `_hoodie_commit_time`, `rider` and `driver` fields for the same `_hoodie_record_key`s compared to the previous commit. 
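+
+For example, a sketch that re-loads the dataset after the update, reusing the read pattern from the [query](#query) section:
+
+```
+// re-load and re-register the temp table to pick up the new commit
+val updatedDF = spark.read.format("org.apache.hudi").load(basePath + "/*/*/*/*")
+updatedDF.registerTempTable("hudi_ro_table")
+spark.sql("select `_hoodie_commit_time`, `_hoodie_record_key`, rider, driver, fare from hudi_ro_table").show()
+```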
+
+## Incremental query
+
+Hudi also provides the capability to obtain a stream of records that changed since a given commit timestamp. 
+This can be achieved using Hudi's incremental view and providing a begin time from which changes need to be streamed. 
+We do not need to specify endTime if we want all changes after the given commit (as is the common case). 
+
+```
+val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from  hudi_ro_table order by commitTime").map(k => k.getString(0)).take(50)
+val beginTime = commits(commits.length - 2) // commit time we are interested in
+
+// incrementally query data
+val incViewDF = spark.
+    read.
+    format("org.apache.hudi").
+    option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).
+    option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
+    load(basePath);
+incViewDF.registerTempTable("hudi_incr_table")
+spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_incr_table where fare > 20.0").show()
+``` 
+This will give all changes that happened after the beginTime commit, with the filter of fare > 20.0 applied. The unique thing about this 
+feature is that it lets you author streaming pipelines on batch data.
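+
+As a quick sanity check, every row in the incremental view should carry a commit time greater than beginTime:
+
+```
+// only commits after beginTime should appear
+spark.sql("select distinct(`_hoodie_commit_time`) from hudi_incr_table").show()
+```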
+
+## Point in time query
+Let's look at how to query data as of a specific time. The specific time can be represented by pointing endTime to a 
+specific commit time and beginTime to "000" (denoting the earliest possible commit time). 
+
+```
+val beginTime = "000" // Represents all commits > this time.
+val endTime = commits(commits.length - 2) // commit time we are interested in
+
+// incrementally query data
+val incViewDF = spark.read.format("org.apache.hudi").
+    option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).
+    option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
+    option(END_INSTANTTIME_OPT_KEY, endTime).
+    load(basePath);
+incViewDF.registerTempTable("hudi_incr_table")
+spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_incr_table where fare > 20.0").show()
+``` 
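+
+Similarly, you can verify the time-travel behavior; since commit times are fixed-width timestamp strings, their lexicographic maximum is the latest commit visible in the view, which should not exceed endTime:
+
+```
+// the latest commit visible in this view should be at most endTime
+spark.sql("select max(`_hoodie_commit_time`) as latest_commit from hudi_incr_table").show()
+```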
+
+## Where to go from here?
+Here we used Spark to showcase the capabilities of Hudi. However, Hudi can support multiple storage types/views, and 
+Hudi datasets can be queried from query engines like Hive, Spark, Presto and more. We have put together a 
+[demo video](https://www.youtube.com/watch?v=VhNgUsxdrD0) that showcases all of this on a docker based setup with all 
+dependent systems running locally. We recommend you replicate the same setup and run the demo yourself, by following the 
+steps [here](docker_demo.html), to get a taste for it. Also, if you are looking for ways to migrate your existing data 
+to Hudi, refer to the [migration guide](migration_guide.html).