Posted to commits@hudi.apache.org by vb...@apache.org on 2019/05/06 21:54:15 UTC

[incubator-hudi] branch asf-site updated: Updating Apache Hudi Website with latest changes in docs

This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 160b6a3  Updating Apache Hudi Website with latest changes in docs
160b6a3 is described below

commit 160b6a31515c6d9dcfff11dbbf5e540340c07141
Author: Balaji Varadarajan <va...@uber.com>
AuthorDate: Sun May 5 17:02:50 2019 -0700

    Updating Apache Hudi Website with latest changes in docs
---
 content/admin_guide.html | 41 ++++++++++++++------------------------
 content/docker_demo.html | 51 ++++++++++++++++++++++++++----------------------
 content/feed.xml         |  4 ++--
 content/powered_by.html  |  8 ++++++++
 docs/powered_by.md       |  4 ++++
 5 files changed, 57 insertions(+), 51 deletions(-)

diff --git a/content/admin_guide.html b/content/admin_guide.html
index dde7918..da7d1be 100644
--- a/content/admin_guide.html
+++ b/content/admin_guide.html
@@ -372,8 +372,7 @@ hoodie-&gt;create --path /user/hive/warehouse/table1 --tableName hoodie_table_1
 
 <p>To see the description of hudi table, use the command:</p>
 
-<div class="highlighter-rouge"><pre class="highlight"><code>
-hoodie:hoodie_table_1-&gt;desc
+<div class="highlighter-rouge"><pre class="highlight"><code>hoodie:hoodie_table_1-&gt;desc
 18/09/06 15:57:19 INFO timeline.HoodieActiveTimeline: Loaded instants []
     _________________________________________________________
     | Property                | Value                        |
@@ -384,7 +383,6 @@ hoodie:hoodie_table_1-&gt;desc
     | hoodie.table.name       | hoodie_table_1               |
     | hoodie.table.type       | COPY_ON_WRITE                |
     | hoodie.archivelog.folder|                              |
-
 </code></pre>
 </div>
 
@@ -450,7 +448,6 @@ Each commit has a monotonically increasing string/number called the <strong>comm
     ....
     ....
 hoodie:trips-&gt;
-
 </code></pre>
 </div>
 
@@ -551,7 +548,6 @@ pending compactions.</p>
     |==================================================================|
     | &lt;INSTANT_1&gt;            | REQUESTED| 35                           |
     | &lt;INSTANT_2&gt;            | INFLIGHT | 27                           |
-
 </code></pre>
 </div>
 
@@ -649,8 +645,6 @@ hoodie:stock_ticks_mor-&gt;compaction validate --instant 20181005222601
     | File Id                             | Base Instant Time| Base Data File                                                                                                                   | Num Delta Files| Valid| Error                                                                           |
     |=====================================================================================================================================================================================================================================================================================================|
     | 05320e98-9a57-4c38-b809-a6beaaeb36bd| 20181005222445   | hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/05320e98-9a57-4c38-b809-a6beaaeb36bd_0_20181005222445.parquet| 1              | false| All log files specified in compaction operation is not present. Missing ....    |
-
-
 </code></pre>
 </div>
 
@@ -664,40 +658,35 @@ so that are preserved. Hudi provides the following CLI to support it</p>
 
 <h5 id="unscheduling-compaction">UnScheduling Compaction</h5>
 
-<div class="highlighter-rouge"><pre class="highlight"><code>
-hoodie:trips-&gt;compaction unscheduleFileId --fileId &lt;FileUUID&gt;
+<div class="highlighter-rouge"><pre class="highlight"><code>hoodie:trips-&gt;compaction unscheduleFileId --fileId &lt;FileUUID&gt;
 ....
 No File renames needed to unschedule file from pending compaction. Operation successful.
-
 </code></pre>
 </div>
 
-<p>In other cases, an entire compaction plan needs to be reverted. This is supported by the following CLI
-```</p>
+<p>In other cases, an entire compaction plan needs to be reverted. This is supported by the following CLI</p>
 
-<p>hoodie:trips-&gt;compaction unschedule –compactionInstant <compactionInstant>
+<div class="highlighter-rouge"><pre class="highlight"><code>hoodie:trips-&gt;compaction unschedule --compactionInstant &lt;compactionInstant&gt;
 .....
-No File renames needed to unschedule pending compaction. Operation successful.</compactionInstant></p>
+No File renames needed to unschedule pending compaction. Operation successful.
+</code></pre>
+</div>
 
-<div class="highlighter-rouge"><pre class="highlight"><code>
-##### Repair Compaction
+<h5 id="repair-compaction">Repair Compaction</h5>
 
-The above compaction unscheduling operations could sometimes fail partially (e:g -&gt; DFS temporarily unavailable). With
+<p>The above compaction unscheduling operations could sometimes fail partially (e:g -&gt; DFS temporarily unavailable). With
 partial failures, the compaction operation could become inconsistent with the state of file-slices. When you run
-`compaction validate`, you can notice invalid compaction operations if there is one.  In these cases, the repair
+<code class="highlighter-rouge">compaction validate</code>, you can notice invalid compaction operations if there is one.  In these cases, the repair
 command comes to the rescue, it will rearrange the file-slices so that there is no loss and the file-slices are
-consistent with the compaction plan
+consistent with the compaction plan</p>
 
+<div class="highlighter-rouge"><pre class="highlight"><code>hoodie:stock_ticks_mor-&gt;compaction repair --instant 20181005222611
+......
+Compaction successfully repaired
+.....
 </code></pre>
 </div>
 
-<p>hoodie:stock_ticks_mor-&gt;compaction repair –instant 20181005222611
-……
-Compaction successfully repaired
-…..</p>
-
-<p>```</p>
-
 <h2 id="metrics">Metrics</h2>
 
 <p>Once the Hudi Client is configured with the right datasetname and environment for metrics, it produces the following graphite metrics, that aid in debugging hudi datasets</p>
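
Taken together, the admin_guide.html changes above tidy up one workflow: run compaction validate to check a pending compaction, unschedule it (for a single file id or for the whole plan) when it is invalid, and run compaction repair if an unschedule attempt failed partway and left the file-slices inconsistent. A sketch of that Hudi CLI sequence, with the instant times and file ids left as placeholders, is:

    hoodie:stock_ticks_mor->compaction validate --instant <compactionInstant>
    hoodie:stock_ticks_mor->compaction unscheduleFileId --fileId <FileUUID>
    hoodie:stock_ticks_mor->compaction unschedule --compactionInstant <compactionInstant>
    hoodie:stock_ticks_mor->compaction repair --instant <compactionInstant>

Only compactions that fail validation need the unschedule and repair steps; repair rearranges the file-slices so they are consistent with the compaction plan, as described in the updated page.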
diff --git a/content/docker_demo.html b/content/docker_demo.html
index a59d879..e54125e 100644
--- a/content/docker_demo.html
+++ b/content/docker_demo.html
@@ -489,8 +489,11 @@ spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
 ....
 2018-09-24 22:20:00 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
 2018-09-24 22:20:00 INFO  SparkContext:54 - Successfully stopped SparkContext
+
+
+
 # Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_mor --target-table stock_ticks_mor --props /var/demo/config/kafka-source.properties
+spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_mor --target-table stock_ticks_mor --props /var/demo/config/kafka-source.properties --schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
 ....
 2018-09-24 22:22:01 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
 2018-09-24 22:22:01 INFO  SparkContext:54 - Successfully stopped SparkContext
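
The flag added in this hunk, --schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider, makes DeltaStreamer read the source and target Avro schemas from files, with the file locations taken from the properties file passed via --props. A minimal sketch of the relevant entries in that properties file, assuming the demo keeps its schema under /var/demo/config (the schema file name here is illustrative; the property keys are the ones FilebasedSchemaProvider reads):

    # schema locations read by FilebasedSchemaProvider
    hoodie.deltastreamer.schemaprovider.source.schema.file=/var/demo/config/schema.avsc
    hoodie.deltastreamer.schemaprovider.target.schema.file=/var/demo/config/schema.avsc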
@@ -757,14 +760,16 @@ partitions, there is no need to run hive-sync</p>
 docker exec -it adhoc-2 /bin/bash
 
 # Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_cow dataset in HDFS
-spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties
+spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties --schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
+
 
 # Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_mor --target-table stock_ticks_mor --props /var/demo/config/kafka-source.properties
+spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_mor --target-table stock_ticks_mor --props /var/demo/config/kafka-source.properties --schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
 
 exit
 </code></pre>
 </div>
+
 <p>With Copy-On-Write table, the second ingestion by DeltaStreamer resulted in a new version of Parquet file getting created.
 See <code class="highlighter-rouge">http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow/2018/08/31</code></p>
 
@@ -920,41 +925,41 @@ exit
 
 <p>With 2 batches of data ingested, lets showcase the support for incremental queries in Hudi Copy-On-Write datasets</p>
 
-<p>Lets take the same projection query example
-```
-docker exec -it adhoc-2 /bin/bash
-beeline -u jdbc:hive2://hiveserver:10000 –hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat –hiveconf hive.stats.autogather=false</p>
+<p>Lets take the same projection query example</p>
 
-<p>0: jdbc:hive2://hiveserver:10000&gt; select <code class="highlighter-rouge">_hoodie_commit_time</code>, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = ‘GOOG’;
-+———————-+———+———————-+———+————+———–+–+
+<div class="highlighter-rouge"><pre class="highlight"><code>docker exec -it adhoc-2 /bin/bash
+beeline -u jdbc:hive2://hiveserver:10000 --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf hive.stats.autogather=false
+
+0: jdbc:hive2://hiveserver:10000&gt; select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
 | _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+———————-+———+———————-+———+————+———–+–+
++----------------------+---------+----------------------+---------+------------+-----------+--+
 | 20180924064621       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
 | 20180924065039       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
-+———————-+———+———————-+———+————+———–+–+</p>
++----------------------+---------+----------------------+---------+------------+-----------+--+
+</code></pre>
+</div>
 
-<div class="highlighter-rouge"><pre class="highlight"><code>
-As you notice from the above queries, there are 2 commits - 20180924064621 and 20180924065039 in timeline order.
+<p>As you notice from the above queries, there are 2 commits - 20180924064621 and 20180924065039 in timeline order.
 When you follow the steps, you will be getting different timestamps for commits. Substitute them
-in place of the above timestamps.
+in place of the above timestamps.</p>
 
-To show the effects of incremental-query, let us assume that a reader has already seen the changes as part of
+<p>To show the effects of incremental-query, let us assume that a reader has already seen the changes as part of
 ingesting first batch. Now, for the reader to see effect of the second batch, he/she has to keep the start timestamp to
-the commit time of the first batch (20180924064621) and run incremental query
+the commit time of the first batch (20180924064621) and run incremental query</p>
 
-`Hudi incremental mode` provides efficient scanning for incremental queries by filtering out files that do not have any
-candidate rows using hudi-managed metadata.
+<p>Hudi incremental mode provides efficient scanning for incremental queries by filtering out files that do not have any
+candidate rows using hudi-managed metadata.</p>
 
-</code></pre>
-</div>
-<p>docker exec -it adhoc-2 /bin/bash
-beeline -u jdbc:hive2://hiveserver:10000 –hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat –hiveconf hive.stats.autogather=false
+<div class="highlighter-rouge"><pre class="highlight"><code>docker exec -it adhoc-2 /bin/bash
+beeline -u jdbc:hive2://hiveserver:10000 --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf hive.stats.autogather=false
 0: jdbc:hive2://hiveserver:10000&gt; set hoodie.stock_ticks_cow.consume.mode=INCREMENTAL;
 No rows affected (0.009 seconds)
 0: jdbc:hive2://hiveserver:10000&gt;  set hoodie.stock_ticks_cow.consume.max.commits=3;
 No rows affected (0.009 seconds)
 0: jdbc:hive2://hiveserver:10000&gt; set hoodie.stock_ticks_cow.consume.start.timestamp=20180924064621;
-```</p>
+</code></pre>
+</div>
 
 <p>With the above setting, file-ids that do not have any updates from the commit 20180924065039 is filtered out without scanning.
 Here is the incremental query :</p>
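
The diff is truncated just before the incremental query itself. With consume.mode set to INCREMENTAL, consume.max.commits set to 3, and consume.start.timestamp set to the first commit time, that query is the same projection as before with an additional predicate on _hoodie_commit_time, so only rows written after the start timestamp are returned. A sketch, reusing the commit time from the example above:

    0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621';

Given the two commits shown earlier, this should return only the row written by the second commit, 20180924065039.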
diff --git a/content/feed.xml b/content/feed.xml
index 0c27805..af8b991 100644
--- a/content/feed.xml
+++ b/content/feed.xml
@@ -5,8 +5,8 @@
         <description>Apache Hudi (pronounced “Hoodie”) provides upserts and incremental processing capaibilities on Big Data</description>
         <link>http://0.0.0.0:4000/</link>
         <atom:link href="http://0.0.0.0:4000/feed.xml" rel="self" type="application/rss+xml"/>
-        <pubDate>Fri, 22 Mar 2019 19:49:42 +0000</pubDate>
-        <lastBuildDate>Fri, 22 Mar 2019 19:49:42 +0000</lastBuildDate>
+        <pubDate>Mon, 06 May 2019 21:51:11 +0000</pubDate>
+        <lastBuildDate>Mon, 06 May 2019 21:51:11 +0000</lastBuildDate>
         <generator>Jekyll v3.3.1</generator>
         
         <item>
diff --git a/content/powered_by.html b/content/powered_by.html
index 2294ea1..870ca11 100644
--- a/content/powered_by.html
+++ b/content/powered_by.html
@@ -339,6 +339,14 @@
 It has been in production since Aug 2016, powering ~100 highly business critical tables on Hadoop, worth 100s of TBs(including top 10 including trips,riders,partners).
 It also powers several incremental Hive ETL pipelines and being currently integrated into Uber’s data dispersal system.</p>
 
+<h4 id="emis-health">EMIS Health</h4>
+
+<p>[EMIS Health][https://www.emishealth.com/] is the largest provider of Primary Care IT software in the UK with datasets including more than 500Bn healthcare records. HUDI is used to manage their analytics dataset in production and keeping them up-to-date with their upstream source. Presto is being used to query the data written in HUDI format.</p>
+
+<h4 id="yieldsio">Yields.io</h4>
+
+<p>Yields.io is the first FinTech platform that uses AI for automated model validation and real-time monitoring on an enterprise-wide scale. Their data lake is managed by Hudi. They are also actively building their infrastructure for incremental, cross language/platform machine learning using Hudi.</p>
+
 <h2 id="talks--presentations">Talks &amp; Presentations</h2>
 
 <ol>
diff --git a/docs/powered_by.md b/docs/powered_by.md
index e4058fd..36abcb0 100644
--- a/docs/powered_by.md
+++ b/docs/powered_by.md
@@ -18,6 +18,10 @@ It also powers several incremental Hive ETL pipelines and being currently integr
 
 [EMIS Health][https://www.emishealth.com/] is the largest provider of Primary Care IT software in the UK with datasets including more than 500Bn healthcare records. HUDI is used to manage their analytics dataset in production and keeping them up-to-date with their upstream source. Presto is being used to query the data written in HUDI format.
 
+#### Yields.io
+
+Yields.io is the first FinTech platform that uses AI for automated model validation and real-time monitoring on an enterprise-wide scale. Their data lake is managed by Hudi. They are also actively building their infrastructure for incremental, cross language/platform machine learning using Hudi.
+ 
 
 ## Talks & Presentations