Posted to commits@hudi.apache.org by vi...@apache.org on 2019/03/13 22:41:20 UTC

[incubator-hudi-site] 17/19: Refreshing site content

This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi-site.git

commit 97b3106520c489612dd2187eb9ce4796d5f5c49f
Author: Vinoth Chandar <vi...@uber.com>
AuthorDate: Sat Mar 9 13:18:07 2019 -0800

    Refreshing site content
---
 content/.gitignore                                 |   1 -
 content/404.html                                   |   6 +-
 content/admin_guide.html                           |  50 ++-
 content/community.html                             |  11 +-
 content/comparison.html                            |  15 +-
 content/concepts.html                              |   8 +-
 content/configurations.html                        | 489 ++++++++++++++-------
 content/contributing.html                          |   8 +-
 content/css/customstyles.css                       |   4 +-
 content/css/theme-blue.css                         |   2 +-
 content/feed.xml                                   |   6 +-
 content/gcs_hoodie.html                            |  16 +-
 ...ommit_duration.png => hudi_commit_duration.png} | Bin
 .../{hoodie_intro_1.png => hudi_intro_1.png}       | Bin
 ...ie_log_format_v2.png => hudi_log_format_v2.png} | Bin
 ...uery_perf_hive.png => hudi_query_perf_hive.png} | Bin
 ..._perf_presto.png => hudi_query_perf_presto.png} | Bin
 ...ry_perf_spark.png => hudi_query_perf_spark.png} | Bin
 .../{hoodie_upsert_dag.png => hudi_upsert_dag.png} | Bin
 ...odie_upsert_perf1.png => hudi_upsert_perf1.png} | Bin
 ...odie_upsert_perf2.png => hudi_upsert_perf2.png} | Bin
 content/implementation.html                        |  22 +-
 content/incremental_processing.html                |  36 +-
 content/index.html                                 |  10 +-
 content/js/mydoc_scroll.html                       |   6 +-
 content/migration_guide.html                       |  15 +-
 content/news.html                                  |   8 +-
 content/news_archive.html                          |   6 +-
 content/powered_by.html                            |   7 +-
 content/privacy.html                               |   6 +-
 content/quickstart.html                            |  26 +-
 content/s3_hoodie.html                             |  19 +-
 content/search.json                                |  40 +-
 content/sql_queries.html                           |   8 +-
 content/strata-talk.html                           |   6 +-
 content/use_cases.html                             |  19 +-
 36 files changed, 558 insertions(+), 292 deletions(-)

diff --git a/content/.gitignore b/content/.gitignore
deleted file mode 100644
index e43b0f9..0000000
--- a/content/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-.DS_Store
diff --git a/content/404.html b/content/404.html
index 9491810..fedef9b 100644
--- a/content/404.html
+++ b/content/404.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" ">
+<meta name="keywords" content="">
 <title>Page Not Found | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
diff --git a/content/admin_guide.html b/content/admin_guide.html
index 470a219..3625cee 100644
--- a/content/admin_guide.html
+++ b/content/admin_guide.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="This section offers an overview of tools available to operate an ecosystem of Hudi datasets">
-<meta name="keywords" content=" admin">
+<meta name="keywords" content="hudi, administration, operation, devops">
 <title>Admin Guide | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -355,11 +359,11 @@
 
 <h2 id="admin-cli">Admin CLI</h2>
 
-<p>Once hoodie has been built via <code class="highlighter-rouge">mvn clean install -DskipTests</code>, the shell can be fired by via  <code class="highlighter-rouge">cd hoodie-cli &amp;&amp; ./hoodie-cli.sh</code>.
-A hoodie dataset resides on HDFS, in a location referred to as the <strong>basePath</strong> and we would need this location in order to connect to a Hoodie dataset.
-Hoodie library effectively manages this HDFS dataset internally, using .hoodie subfolder to track all metadata</p>
+<p>Once Hudi has been built, the shell can be fired up via <code class="highlighter-rouge">cd hoodie-cli &amp;&amp; ./hoodie-cli.sh</code>.
+A Hudi dataset resides on DFS, in a location referred to as the <strong>basePath</strong>, and we need this location in order to connect to a Hudi dataset.
+The Hudi library effectively manages this dataset internally, using the .hoodie subfolder to track all metadata.</p>
 
-<p>To initialize a hoodie table, use the following command.</p>
+<p>To initialize a Hudi table, use the following command.</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>18/09/06 15:56:52 INFO annotation.AutowiredAnnotationBeanPostProcessor: JSR-330 'javax.inject.Inject' annotation found and supported for autowiring
 ============================================
@@ -380,7 +384,7 @@ hoodie-&gt;create --path /user/hive/warehouse/table1 --tableName hoodie_table_1
 </code></pre>
 </div>
 
-<p>To see the description of hoodie table, use the command:</p>
+<p>To see the description of a Hudi table, use the command:</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>
 hoodie:hoodie_table_1-&gt;desc
@@ -398,7 +402,7 @@ hoodie:hoodie_table_1-&gt;desc
 </code></pre>
 </div>
 
-<p>Following is a sample command to connect to a Hoodie dataset contains uber trips.</p>
+<p>Following is a sample command to connect to a Hudi dataset containing uber trips.</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>hoodie:trips-&gt;connect --path /app/uber/trips
 
@@ -447,7 +451,7 @@ hoodie:trips-&gt;
 
 <h4 id="inspecting-commits">Inspecting Commits</h4>
 
-<p>The task of upserting or inserting a batch of incoming records is known as a <strong>commit</strong> in Hoodie. A commit provides basic atomicity guarantees such that only commited data is available for querying.
+<p>The task of upserting or inserting a batch of incoming records is known as a <strong>commit</strong> in Hudi. A commit provides basic atomicity guarantees such that only committed data is available for querying.
 Each commit has a monotonically increasing string/number called the <strong>commit number</strong>. Typically, this is the time at which we started the commit.</p>
 
 <p>To view some basic information about the last 10 commits,</p>
@@ -464,7 +468,7 @@ hoodie:trips-&gt;
 </code></pre>
 </div>
 
-<p>At the start of each write, Hoodie also writes a .inflight commit to the .hoodie folder. You can use the timestamp there to estimate how long the commit has been inflight</p>
+<p>At the start of each write, Hudi also writes a .inflight commit to the .hoodie folder. You can use the timestamp there to estimate how long the commit has been inflight</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>$ hdfs dfs -ls /app/uber/trips/.hoodie/*.inflight
 -rw-r--r--   3 vinoth supergroup     321984 2016-10-05 23:18 /app/uber/trips/.hoodie/20161005225920.inflight
@@ -522,7 +526,7 @@ order (See Concepts). The below commands allow users to view the file-slices for
 
 <h4 id="statistics">Statistics</h4>
 
-<p>Since Hoodie directly manages file sizes for HDFS dataset, it might be good to get an overall picture</p>
+<p>Since Hudi directly manages file sizes for the DFS dataset, it might be good to get an overall picture</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>hoodie:trips-&gt;stats filesizes --partitionPath 2016/09/01 --sortBy "95th" --desc true --limit 10
     ________________________________________________________________________________________________
@@ -534,7 +538,7 @@ order (See Concepts). The below commands allow users to view the file-slices for
 </code></pre>
 </div>
 
-<p>In case of Hoodie write taking much longer, it might be good to see the write amplification for any sudden increases</p>
+<p>In case a Hudi write is taking much longer than usual, it might be good to check the write amplification for any sudden increases</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>hoodie:trips-&gt;stats wa
     __________________________________________________________________________
@@ -547,7 +551,7 @@ order (See Concepts). The below commands allow users to view the file-slices for
 
 <h4 id="archived-commits">Archived Commits</h4>
 
-<p>In order to limit the amount of growth of .commit files on HDFS, Hoodie archives older .commit files (with due respect to the cleaner policy) into a commits.archived file.
+<p>In order to limit the amount of growth of .commit files on DFS, Hudi archives older .commit files (with due respect to the cleaner policy) into a commits.archived file.
 This is a sequence file that contains a mapping from commitNumber =&gt; json with raw information about the commit (same that is nicely rolled up above).</p>
 
 <h4 id="compactions">Compactions</h4>
@@ -692,7 +696,7 @@ No File renames needed to unschedule pending compaction. Operation successful.</
 <div class="highlighter-rouge"><pre class="highlight"><code>
 ##### Repair Compaction
 
-The above compaction unscheduling operations could sometimes fail partially (e:g -&gt; HDFS temporarily unavailable). With
+The above compaction unscheduling operations could sometimes fail partially (e.g. DFS temporarily unavailable). With
 partial failures, the compaction operation could become inconsistent with the state of file-slices. When you run
 `compaction validate`, you can notice invalid compaction operations if there is one.  In these cases, the repair
 command comes to the rescue, it will rearrange the file-slices so that there is no loss and the file-slices are
@@ -710,7 +714,7 @@ Compaction successfully repaired
 
 <h2 id="metrics">Metrics</h2>
 
-<p>Once the Hoodie Client is configured with the right datasetname and environment for metrics, it produces the following graphite metrics, that aid in debugging hoodie datasets</p>
+<p>Once the Hudi Client is configured with the right dataset name and environment for metrics, it produces the following Graphite metrics, which aid in debugging Hudi datasets</p>
 
 <ul>
   <li><strong>Commit Duration</strong> - This is amount of time it took to successfully commit a batch of records</li>
@@ -722,29 +726,29 @@ Compaction successfully repaired
 
 <p>These metrics can then be plotted on a standard tool like grafana. Below is a sample commit duration chart.</p>
 
-<figure><img class="docimage" src="images/hoodie_commit_duration.png" alt="hoodie_commit_duration.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_commit_duration.png" alt="hudi_commit_duration.png" style="max-width: 1000px" /></figure>
 
 <h2 id="troubleshooting-failures">Troubleshooting Failures</h2>
 
-<p>Section below generally aids in debugging Hoodie failures. Off the bat, the following metadata is added to every record to help triage  issues easily using standard Hadoop SQL engines (Hive/Presto/Spark)</p>
+<p>Section below generally aids in debugging Hudi failures. Off the bat, the following metadata is added to every record to help triage  issues easily using standard Hadoop SQL engines (Hive/Presto/Spark)</p>
 
 <ul>
-  <li><strong>_hoodie_record_key</strong> - Treated as a primary key within each HDFS partition, basis of all updates/inserts</li>
+  <li><strong>_hoodie_record_key</strong> - Treated as a primary key within each DFS partition, basis of all updates/inserts</li>
   <li><strong>_hoodie_commit_time</strong> - Last commit that touched this record</li>
   <li><strong>_hoodie_file_name</strong> - Actual file name containing the record (super useful to triage duplicates)</li>
   <li><strong>_hoodie_partition_path</strong> - Path from basePath that identifies the partition containing this record</li>
 </ul>
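A minimal sketch of how these metadata columns can be used for triage from Spark SQL, assuming the dataset has been registered as a (hypothetical) table named hudi_trips; the same query also works from Hive or Presto:

    // Hypothetical triage query: look up which file/commit contains a given record,
    // using the standard Hudi metadata columns listed above.
    spark.sql(
      "SELECT _hoodie_record_key, _hoodie_partition_path, _hoodie_file_name, _hoodie_commit_time " +
      "FROM hudi_trips WHERE _hoodie_record_key = 'some-key'").show();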
 
-<div class="bs-callout bs-callout-warning">Note that as of now, Hoodie assumes the application passes in the same deterministic partitionpath for a given recordKey. i.e the uniqueness of record key is only enforced within each partition</div>
+<div class="bs-callout bs-callout-warning">Note that as of now, Hudi assumes the application passes in the same deterministic partitionpath for a given recordKey. i.e the uniqueness of record key is only enforced within each partition</div>
 
 <h4 id="missing-records">Missing records</h4>
 
 <p>Please check if there were any write errors using the admin commands above, during the window at which the record could have been written.
-If you do find errors, then the record was not actually written by Hoodie, but handed back to the application to decide what to do with it.</p>
+If you do find errors, then the record was not actually written by Hudi, but handed back to the application to decide what to do with it.</p>
 
 <h4 id="duplicates">Duplicates</h4>
 
-<p>First of all, please confirm if you do indeed have duplicates <strong>AFTER</strong> ensuring the query is accessing the Hoodie datasets <a href="sql_queries.html">properly</a> .</p>
+<p>First of all, please confirm if you do indeed have duplicates <strong>AFTER</strong> ensuring the query is accessing the Hudi datasets <a href="sql_queries.html">properly</a> .</p>
 
 <ul>
   <li>If confirmed, please use the metadata fields above, to identify the physical files &amp; partition files containing the records .</li>
@@ -754,10 +758,10 @@ If you do find errors, then the record was not actually written by Hoodie, but h
 
 <h4 id="spark-failures">Spark failures</h4>
 
-<p>Typical upsert() DAG looks like below. Note that Hoodie client also caches intermediate RDDs to intelligently profile workload and size files and spark parallelism.
+<p>A typical upsert() DAG looks like below. Note that the Hudi client also caches intermediate RDDs to intelligently profile the workload, size files and determine spark parallelism.
 Also Spark UI shows sortByKey twice due to the probe job also being shown, nonetheless its just a single sort.</p>
 
-<figure><img class="docimage" src="images/hoodie_upsert_dag.png" alt="hoodie_upsert_dag.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_upsert_dag.png" alt="hudi_upsert_dag.png" style="max-width: 1000px" /></figure>
 
 <p>At a high level, there are two steps</p>
 
@@ -777,7 +781,7 @@ Also Spark UI shows sortByKey twice due to the probe job also being shown, nonet
   <li>Job 7 : Actual writing of data (update + insert + insert turned to updates to maintain file size)</li>
 </ul>
 
-<p>Depending on the exception source (Hoodie/Spark), the above knowledge of the DAG can be used to pinpoint the actual issue. The most often encountered failures result from YARN/HDFS temporary failures.
+<p>Depending on the exception source (Hudi/Spark), the above knowledge of the DAG can be used to pinpoint the actual issue. The most often encountered failures result from YARN/DFS temporary failures.
 In the future, a more sophisticated debug/management UI would be added to the project, that can help automate some of this debugging.</p>
 
 
diff --git a/content/community.html b/content/community.html
index 39488eb..34196f3 100644
--- a/content/community.html
+++ b/content/community.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" usecases">
+<meta name="keywords" content="hudi, use cases, big data, apache">
 <title>Community | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -355,7 +359,7 @@
   <tbody>
     <tr>
       <td>For any general questions, user support, development discussions</td>
-      <td>Dev Mailing list (<a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#100;&#101;&#118;&#045;&#115;&#117;&#098;&#115;&#099;&#114;&#105;&#098;&#101;&#064;&#104;&#117;&#100;&#105;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">Subscribe</a>, <a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#100;&#101;&#118;&#045;&#117;&#110;&#115;&#117;&#098;&#115;&#099;&#114;&#105;&#098;&#101;&#064;&#104;&#117;&#100;&#105;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;& [...]
+      <td>Dev Mailing list (<a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#100;&#101;&#118;&#045;&#115;&#117;&#098;&#115;&#099;&#114;&#105;&#098;&#101;&#064;&#104;&#117;&#100;&#105;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">Subscribe</a>, <a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#100;&#101;&#118;&#045;&#117;&#110;&#115;&#117;&#098;&#115;&#099;&#114;&#105;&#098;&#101;&#064;&#104;&#117;&#100;&#105;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;& [...]
     </tr>
     <tr>
       <td>For reporting bugs or issues or discover known issues</td>
@@ -389,9 +393,10 @@ Apache Hudi follows the typical Apache vulnerability handling <a href="https://a
   <li>Ask (and/or) answer questions on our support channels listed above.</li>
   <li>Review code or HIPs</li>
   <li>Help improve documentation</li>
+  <li>Author blogs on our wiki</li>
   <li>Testing; Improving out-of-box experience by reporting bugs</li>
   <li>Share new ideas/directions to pursue or propose a new HIP</li>
-  <li>Contributing code to the project</li>
+  <li>Contributing code to the project (<a href="https://issues.apache.org/jira/issues/?jql=project+%3D+HUDI+AND+component+%3D+newbie">newbie JIRAs</a>)</li>
 </ul>
 
 <h4 id="code-contributions">Code Contributions</h4>
diff --git a/content/comparison.html b/content/comparison.html
index 34082e0..59bcf75 100644
--- a/content/comparison.html
+++ b/content/comparison.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" usecases">
+<meta name="keywords" content="apache, hudi, kafka, kudu, hive, hbase, stream processing">
 <title>Comparison | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -341,7 +345,7 @@
 
     
 
-  <p>Apache Hudi fills a big void for processing data on top of HDFS, and thus mostly co-exists nicely with these technologies. However,
+  <p>Apache Hudi fills a big void for processing data on top of DFS, and thus mostly co-exists nicely with these technologies. However,
 it would be useful to understand how Hudi fits into the current big data ecosystem, contrasting it with a few related systems
 and bring out the different tradeoffs these systems have accepted in their design.</p>
 
@@ -380,16 +384,15 @@ just for analytics. Finally, HBase does not support incremental processing primi
 <p>A popular question, we get is : “How does Hudi relate to stream processing systems?”, which we will try to answer here. Simply put, Hudi can integrate with
 batch (<code class="highlighter-rouge">copy-on-write storage</code>) and streaming (<code class="highlighter-rouge">merge-on-read storage</code>) jobs of today, to store the computed results in Hadoop. For Spark apps, this can happen via direct
 integration of Hudi library with Spark/Spark streaming DAGs. In case of Non-Spark processing systems (eg: Flink, Hive), the processing can be done in the respective systems
-and later sent into a Hudi table via a Kafka topic/HDFS intermediate file. In more conceptual level, data processing
+and later sent into a Hudi table via a Kafka topic/DFS intermediate file. At a more conceptual level, data processing
 pipelines just consist of three components : <code class="highlighter-rouge">source</code>, <code class="highlighter-rouge">processing</code>, <code class="highlighter-rouge">sink</code>, with users ultimately running queries against the sink to use the results of the pipeline.
-Hudi can act as either a source or sink, that stores data on HDFS. Applicability of Hudi to a given stream processing pipeline ultimately boils down to suitability
+Hudi can act as either a source or sink, that stores data on DFS. Applicability of Hudi to a given stream processing pipeline ultimately boils down to suitability
 of Presto/SparkSQL/Hive for your queries.</p>
 
 <p>More advanced use cases revolve around the concepts of <a href="https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop">incremental processing</a>, which effectively
 uses Hudi even inside the <code class="highlighter-rouge">processing</code> engine to speed up typical batch pipelines. For e.g: Hudi can be used as a state store inside a processing DAG (similar
 to how <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend">rocksDB</a> is used by Flink). This is an item on the roadmap
-and will eventually happen as a <a href="https://github.com/uber/hoodie/issues/8">Beam Runner</a></p>
-
+and will eventually happen as a <a href="https://issues.apache.org/jira/browse/HUDI-60">Beam Runner</a></p>
 
 
     <div class="tags">
diff --git a/content/concepts.html b/content/concepts.html
index 7e85d32..22754c4 100644
--- a/content/concepts.html
+++ b/content/concepts.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="Here we introduce some basic concepts & give a broad technical overview of Hudi">
-<meta name="keywords" content=" concepts">
+<meta name="keywords" content="hudi, design, storage, views, timeline">
 <title>Concepts | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -343,7 +347,7 @@
 
     
 
-  <p>Apache Hudi (pronounced “Hudi”) provides the following primitives over datasets on HDFS</p>
+  <p>Apache Hudi (pronounced “Hudi”) provides the following primitives over datasets on DFS</p>
 
 <ul>
   <li>Upsert                     (how do I change the dataset?)</li>
diff --git a/content/configurations.html b/content/configurations.html
index 5f1adb8..73f66c9 100644
--- a/content/configurations.html
+++ b/content/configurations.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="Here we list all possible configurations and what they mean">
-<meta name="keywords" content=" configurations">
+<meta name="keywords" content="garbage collection, hudi, jvm, configs, tuning">
 <title>Configurations | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -343,174 +347,360 @@
 
     
 
-  <h3 id="configuration">Configuration</h3>
+  <p>This page covers the different ways of configuring your job to write/read Hudi datasets. 
+At a high level, you can control behaviour at a few levels.</p>
+
+<ul>
+  <li><strong><a href="#spark-datasource">Spark Datasource Configs</a></strong> : These configs control the Hudi Spark Datasource, providing ability to define keys/partitioning, pick out the write operation, specify how to merge records or choosing view type to read.</li>
+  <li><strong><a href="#writeclient-configs">WriteClient Configs</a></strong> : Internally, the Hudi datasource uses a RDD based <code class="highlighter-rouge">HoodieWriteClient</code> api to actually perform writes to storage. These configs provide deep control over lower level aspects like 
+ file sizing, compression, parallelism, compaction, write schema, cleaning etc. Although Hudi provides sane defaults, from time-time these configs may need to be tweaked to optimize for specific workloads.</li>
+  <li><strong><a href="#PAYLOAD_CLASS_OPT_KEY">RecordPayload Config</a></strong> : This is the lowest level of customization offered by Hudi. Record payloads define how to produce new values to upsert based on incoming new record and 
+ stored old record. Hudi provides default implementations such as <code class="highlighter-rouge">OverwriteWithLatestAvroPayload</code> which simply update storage with the latest/last-written record. 
+ This can be overridden to a custom class extending <code class="highlighter-rouge">HoodieRecordPayload</code> class, on both datasource and WriteClient levels.</li>
+</ul>
+
+<h3 id="talking-to-cloud-storage">Talking to Cloud Storage</h3>
+
+<p>Regardless of whether the RDD/WriteClient APIs or the Datasource is used, the following information helps configure access
+to cloud stores.</p>
+
+<ul>
+  <li><a href="s3_hoodie.html">AWS S3</a> <br />
+Configurations required for S3 and Hudi co-operability.</li>
+  <li><a href="gcs_hoodie.html">Google Cloud Storage</a> <br />
+Configurations required for GCS and Hudi co-operability.</li>
+</ul>
+
+<h3 id="spark-datasource">Spark Datasource Configs</h3>
+
+<p>Spark jobs using the datasource can be configured by passing the below options into the <code class="highlighter-rouge">option(k,v)</code> method as usual.
+The actual datasource level configs are listed below.</p>
+
+<h4 id="write-options">Write Options</h4>
+
+<p>Additionally, you can pass down any of the WriteClient level configs directly using <code class="highlighter-rouge">options()</code> or <code class="highlighter-rouge">option(k,v)</code> methods.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>inputDF.write()
+.format("com.uber.hoodie")
+.options(clientOpts) // any of the Hudi client opts can be passed in as well
+.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
+.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
+.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+.option(HoodieWriteConfig.TABLE_NAME, tableName)
+.mode(SaveMode.Append)
+.save(basePath);
+</code></pre>
+</div>
+
+<p>Options useful for writing datasets via <code class="highlighter-rouge">write.format.option(...)</code></p>
+
+<ul>
+  <li><a href="#TABLE_NAME_OPT_KEY">TABLE_NAME_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.write.table.name</code> [Required]<br />
+<span style="color:grey">Hive table name, to register the dataset into.</span></li>
+  <li><a href="#OPERATION_OPT_KEY">OPERATION_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.write.operation</code>, Default: <code class="highlighter-rouge">upsert</code><br />
+<span style="color:grey">whether to do upsert, insert or bulkinsert for the write operation. Use <code class="highlighter-rouge">bulkinsert</code> to load new data into a table, and there on use <code class="highlighter-rouge">upsert</code>/<code class="highlighter-rouge">insert</code>. 
+bulk insert uses a disk based write path to scale to load large inputs without need to cache it.</span></li>
+  <li><a href="#STORAGE_TYPE_OPT_KEY">STORAGE_TYPE_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.write.storage.type</code>, Default: <code class="highlighter-rouge">COPY_ON_WRITE</code> <br />
+<span style="color:grey">The storage type for the underlying data, for this write. This can’t change between writes.</span></li>
+  <li><a href="#PRECOMBINE_FIELD_OPT_KEY">PRECOMBINE_FIELD_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.write.precombine.field</code>, Default: <code class="highlighter-rouge">ts</code> <br />
+<span style="color:grey">Field used in preCombining before actual write. When two records have the same key value,
+we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)</span></li>
+  <li><a href="#PAYLOAD_CLASS_OPT_KEY">PAYLOAD_CLASS_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.write.payload.class</code>, Default: <code class="highlighter-rouge">com.uber.hoodie.OverwriteWithLatestAvroPayload</code> <br />
+<span style="color:grey">Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. 
+This will render any value set for <code class="highlighter-rouge">PRECOMBINE_FIELD_OPT_VAL</code> in-effective</span></li>
+  <li><a href="#RECORDKEY_FIELD_OPT_KEY">RECORDKEY_FIELD_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.write.recordkey.field</code>, Default: <code class="highlighter-rouge">uuid</code> <br />
+<span style="color:grey">Record key field. Value to be used as the <code class="highlighter-rouge">recordKey</code> component of <code class="highlighter-rouge">HoodieKey</code>. Actual value
+will be obtained by invoking .toString() on the field value. Nested fields can be specified using
+the dot notation eg: <code class="highlighter-rouge">a.b.c</code></span></li>
+  <li><a href="#PARTITIONPATH_FIELD_OPT_KEY">PARTITIONPATH_FIELD_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.write.partitionpath.field</code>, Default: <code class="highlighter-rouge">partitionpath</code> <br />
+<span style="color:grey">Partition path field. Value to be used at the <code class="highlighter-rouge">partitionPath</code> component of <code class="highlighter-rouge">HoodieKey</code>.
+Actual value ontained by invoking .toString()</span></li>
+  <li><a href="#KEYGENERATOR_CLASS_OPT_KEY">KEYGENERATOR_CLASS_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.write.keygenerator.class</code>, Default: <code class="highlighter-rouge">com.uber.hoodie.SimpleKeyGenerator</code> <br />
+<span style="color:grey">Key generator class, that implements will extract the key out of incoming <code class="highlighter-rouge">Row</code> object</span></li>
+  <li><a href="#COMMIT_METADATA_KEYPREFIX_OPT_KEY">COMMIT_METADATA_KEYPREFIX_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.write.commitmeta.key.prefix</code>, Default: <code class="highlighter-rouge">_</code> <br />
+<span style="color:grey">Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata.
+This is useful to store checkpointing information, in a consistent way with the hudi timeline</span></li>
+  <li><a href="#INSERT_DROP_DUPS_OPT_KEY">INSERT_DROP_DUPS_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.write.insert.drop.duplicates</code>, Default: <code class="highlighter-rouge">false</code> <br />
+<span style="color:grey">If set to true, filters out all duplicate records from incoming dataframe, during insert operations. </span></li>
+  <li><a href="#HIVE_SYNC_ENABLED_OPT_KEY">HIVE_SYNC_ENABLED_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.hive_sync.enable</code>, Default: <code class="highlighter-rouge">false</code> <br />
+<span style="color:grey">When set to true, register/sync the dataset to Apache Hive metastore</span></li>
+  <li><a href="#HIVE_DATABASE_OPT_KEY">HIVE_DATABASE_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.hive_sync.database</code>, Default: <code class="highlighter-rouge">default</code> <br />
+<span style="color:grey">database to sync to</span></li>
+  <li><a href="#HIVE_TABLE_OPT_KEY">HIVE_TABLE_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.hive_sync.table</code>, [Required] <br />
+<span style="color:grey">table to sync to</span></li>
+  <li><a href="#HIVE_USER_OPT_KEY">HIVE_USER_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.hive_sync.username</code>, Default: <code class="highlighter-rouge">hive</code> <br />
+<span style="color:grey">hive user name to use</span></li>
+  <li><a href="#HIVE_PASS_OPT_KEY">HIVE_PASS_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.hive_sync.password</code>, Default: <code class="highlighter-rouge">hive</code> <br />
+<span style="color:grey">hive password to use</span></li>
+  <li><a href="#HIVE_URL_OPT_KEY">HIVE_URL_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.hive_sync.jdbcurl</code>, Default: <code class="highlighter-rouge">jdbc:hive2://localhost:10000</code> <br />
+<span style="color:grey">Hive metastore url</span></li>
+  <li><a href="#HIVE_PARTITION_FIELDS_OPT_KEY">HIVE_PARTITION_FIELDS_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.hive_sync.partition_fields</code>, Default: ` ` <br />
+<span style="color:grey">field in the dataset to use for determining hive partition columns.</span></li>
+  <li><a href="#HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY">HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.hive_sync.partition_extractor_class</code>, Default: <code class="highlighter-rouge">com.uber.hoodie.hive.SlashEncodedDayPartitionValueExtractor</code> <br />
+<span style="color:grey">Class used to extract partition field values into hive partition columns.</span></li>
+  <li><a href="#HIVE_ASSUME_DATE_PARTITION_OPT_KEY">HIVE_ASSUME_DATE_PARTITION_OPT_KEY</a><br />
+Property: <code class="highlighter-rouge">hoodie.datasource.hive_sync.assume_date_partitioning</code>, Default: <code class="highlighter-rouge">false</code> <br />
+<span style="color:grey">Assume partitioning is yyyy/mm/dd</span></li>
+</ul>
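As a sketch of how the write options above combine, the earlier datasource example could be extended with Hive sync enabled. The property keys are the ones documented on this page; the values and the jdbc url are placeholders to adapt to your environment, and inputDF/tableName/basePath are as in the earlier snippet.

    // Illustrative only: upsert with Hive sync, using the raw property keys listed above.
    inputDF.write()
      .format("com.uber.hoodie")
      .option("hoodie.datasource.write.operation", "upsert")
      .option("hoodie.datasource.write.recordkey.field", "_row_key")
      .option("hoodie.datasource.write.partitionpath.field", "partition")
      .option("hoodie.datasource.write.precombine.field", "timestamp")
      .option("hoodie.datasource.hive_sync.enable", "true")
      .option("hoodie.datasource.hive_sync.database", "default")
      .option("hoodie.datasource.hive_sync.table", tableName)
      .option("hoodie.datasource.hive_sync.jdbcurl", "jdbc:hive2://localhost:10000")
      .option(HoodieWriteConfig.TABLE_NAME, tableName)
      .mode(SaveMode.Append)
      .save(basePath);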
+
+<h4 id="read-options">Read Options</h4>
+
+<p>Options useful for reading datasets via <code class="highlighter-rouge">read.format.option(...)</code></p>
 
 <ul>
-  <li><a href="#HoodieWriteConfig">HoodieWriteConfig</a> <br />
-<span style="color:grey">Top Level Config which is passed in when HoodieWriteClent is created.</span>
+  <li><a href="#VIEW_TYPE_OPT_KEY">VIEW_TYPE_OPT_KEY</a> <br />
+Property: <code class="highlighter-rouge">hoodie.datasource.view.type</code>, Default: <code class="highlighter-rouge">read_optimized</code> <br />
+<span style="color:grey">Whether data needs to be read, in incremental mode (new data since an instantTime)
+(or) Read Optimized mode (obtain latest view, based on columnar data)
+(or) Real time mode (obtain latest view, based on row &amp; columnar data)</span></li>
+  <li><a href="#BEGIN_INSTANTTIME_OPT_KEY">BEGIN_INSTANTTIME_OPT_KEY</a> <br /> 
+Property: <code class="highlighter-rouge">hoodie.datasource.read.begin.instanttime</code>, [Required in incremental mode] <br />
+<span style="color:grey">Instant time to start incrementally pulling data from. The instanttime here need not
+necessarily correspond to an instant on the timeline. New data written with an
+ <code class="highlighter-rouge">instant_time &gt; BEGIN_INSTANTTIME</code> are fetched out. For e.g: ‘20170901080000’ will get
+ all new data written after Sep 1, 2017 08:00AM.</span></li>
+  <li><a href="#END_INSTANTTIME_OPT_KEY">END_INSTANTTIME_OPT_KEY</a> <br />
+Property: <code class="highlighter-rouge">hoodie.datasource.read.end.instanttime</code>, Default: latest instant (i.e fetches all new data since begin instant time) <br />
+<span style="color:grey"> Instant time to limit incrementally fetched data to. New data written with an
+<code class="highlighter-rouge">instant_time &lt;= END_INSTANTTIME</code> are fetched out.</span></li>
+</ul>
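For example, an incremental pull using the read options above could be sketched as follows, assuming the incremental view is selected with the value "incremental" and reusing the sample begin instant from the description above; spark is an existing SparkSession and basePath is as in the write example.

    // Sketch only: fetch records written after the given instant time.
    Dataset<Row> newData = spark.read()
      .format("com.uber.hoodie")
      .option("hoodie.datasource.view.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20170901080000")
      .load(basePath);
    newData.createOrReplaceTempView("hudi_incremental");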
+
+<h3 id="writeclient-configs">WriteClient Configs</h3>
+
+<p>Jobs programming directly against the RDD level apis can build a <code class="highlighter-rouge">HoodieWriteConfig</code> object and pass it in to the <code class="highlighter-rouge">HoodieWriteClient</code> constructor. 
+HoodieWriteConfig can be built using a builder pattern as below.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
+        .withPath(basePath)
+        .forTable(tableName)
+        .withSchema(schemaStr)
+        .withProps(props) // pass raw k,v pairs from a property file.
+        .withCompactionConfig(HoodieCompactionConfig.newBuilder().withXXX(...).build())
+        .withIndexConfig(HoodieIndexConfig.newBuilder().withXXX(...).build())
+        ...
+        .build();
+</code></pre>
+</div>
+
+<p>The following subsections go over different aspects of write configs, explaining the most important configs with their property names and default values.</p>
+
+<ul>
+  <li><a href="#withPath">withPath</a> (hoodie_base_path) 
+Property: <code class="highlighter-rouge">hoodie.base.path</code> [Required] <br />
+<span style="color:grey">Base DFS path under which all the data partitions are created. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under the base directory. </span></li>
+  <li><a href="#withSchema">withSchema</a> (schema_str) <br /> 
+Property: <code class="highlighter-rouge">hoodie.avro.schema</code> [Required]<br />
+<span style="color:grey">This is the current reader avro schema for the dataset. This is a string of the entire schema. HoodieWriteClient uses this schema to pass on to implementations of HoodieRecordPayload to convert from the source format to avro record. This is also used when re-writing records during an update. </span></li>
+  <li><a href="#forTable">forTable</a> (table_name)<br /> 
+Property: <code class="highlighter-rouge">hoodie.table.name</code> [Required] <br />
+ <span style="color:grey">Table name for the dataset, will be used for registering with Hive. Needs to be same across runs.</span></li>
+  <li><a href="#withBulkInsertParallelism">withBulkInsertParallelism</a> (bulk_insert_parallelism = 1500) <br /> 
+Property: <code class="highlighter-rouge">hoodie.bulkinsert.shuffle.parallelism</code><br />
+<span style="color:grey">Bulk insert is meant to be used for large initial imports and this parallelism determines the initial number of files in your dataset. Tune this to achieve a desired optimal size during initial import.</span></li>
+  <li><a href="#withParallelism">withParallelism</a> (insert_shuffle_parallelism = 1500, upsert_shuffle_parallelism = 1500)<br /> 
+Property: <code class="highlighter-rouge">hoodie.insert.shuffle.parallelism</code>, <code class="highlighter-rouge">hoodie.upsert.shuffle.parallelism</code><br />
+<span style="color:grey">Once data has been initially imported, this parallelism controls initial parallelism for reading input records. Ensure this value is high enough say: 1 partition for 1 GB of input data</span></li>
+  <li><a href="#combineInput">combineInput</a> (on_insert = false, on_update=true)<br /> 
+Property: <code class="highlighter-rouge">hoodie.combine.before.insert</code>, <code class="highlighter-rouge">hoodie.combine.before.upsert</code><br />
+<span style="color:grey">Flag which first combines the input RDD and merges multiple partial records into a single record before inserting or updating in DFS</span></li>
+  <li><a href="#withWriteStatusStorageLevel">withWriteStatusStorageLevel</a> (level = MEMORY_AND_DISK_SER)<br /> 
+Property: <code class="highlighter-rouge">hoodie.write.status.storage.level</code><br />
+<span style="color:grey">HoodieWriteClient.insert and HoodieWriteClient.upsert returns a persisted RDD[WriteStatus], this is because the Client can choose to inspect the WriteStatus and choose and commit or not based on the failures. This is a configuration for the storage level for this RDD </span></li>
+  <li><a href="#withAutoCommit">withAutoCommit</a> (autoCommit = true)<br /> 
+Property: <code class="highlighter-rouge">hoodie.auto.commit</code><br />
+<span style="color:grey">Should HoodieWriteClient autoCommit after insert and upsert. The client can choose to turn off auto-commit and commit on a “defined success condition”</span></li>
+  <li><a href="#withAssumeDatePartitioning">withAssumeDatePartitioning</a> (assumeDatePartitioning = false)<br /> 
+Property: <code class="highlighter-rouge">hoodie.assume.date.partitioning</code><br />
+<span style="color:grey">Should HoodieWriteClient assume the data is partitioned by dates, i.e three levels from base path. This is a stop-gap to support tables created by versions &lt; 0.3.1. Will be removed eventually </span></li>
+  <li><a href="#withConsistencyCheckEnabled">withConsistencyCheckEnabled</a> (enabled = false)<br /> 
+Property: <code class="highlighter-rouge">hoodie.consistency.check.enabled</code><br />
+<span style="color:grey">Should HoodieWriteClient perform additional checks to ensure written files’ are listable on the underlying filesystem/storage. Set this to true, to workaround S3’s eventual consistency model and ensure all data written as a part of a commit is faithfully available for queries. </span></li>
+</ul>
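Putting a few of the knobs above together, a tuned write config might be sketched as below. The builder methods are the ones listed in this section; the values are purely illustrative, and jsc is assumed to be an existing JavaSparkContext.

    // Illustrative only: combine the general write-client knobs described above.
    HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
        .withPath(basePath)
        .forTable(tableName)
        .withSchema(schemaStr)
        .withParallelism(1500, 1500)        // insert/upsert shuffle parallelism
        .withBulkInsertParallelism(1500)    // controls initial file count on bulk import
        .withAutoCommit(false)              // inspect WriteStatus, then commit explicitly
        .withConsistencyCheckEnabled(true)  // helps with S3's eventual consistency
        .build();
    HoodieWriteClient client = new HoodieWriteClient(jsc, cfg);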
+
+<h4 id="index-configs">Index configs</h4>
+<p>Following configs control indexing behavior, which tags incoming records as either inserts or updates to older records.</p>
+
+<ul>
+  <li><a href="#withIndexConfig">withIndexConfig</a> (HoodieIndexConfig) <br />
+  <span style="color:grey">This is pluggable to have a external index (HBase) or use the default bloom filter stored in the Parquet files</span>
     <ul>
-      <li><a href="#withPath">withPath</a> (hoodie_base_path) <br />
-  <span style="color:grey">Base HDFS path under which all the data partitions are created. Hoodie stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under the base directory. </span></li>
-      <li><a href="#withSchema">withSchema</a> (schema_str) <br />
-  <span style="color:grey">This is the current reader avro schema for the Hoodie Dataset. This is a string of the entire schema. HoodieWriteClient uses this schema to pass on to implementations of HoodieRecordPayload to convert from the source format to avro record. This is also used when re-writing records during an update. </span></li>
-      <li><a href="#withParallelism">withParallelism</a> (insert_shuffle_parallelism = 200, upsert_shuffle_parallelism = 200) <br />
-  <span style="color:grey">Insert DAG uses the insert_parallelism in every shuffle. Upsert DAG uses the upsert_parallelism in every shuffle. Typical workload is profiled and once a min parallelism is established, trade off between latency and cluster usage optimizations this is tuned and have a conservatively high number to optimize for latency and  </span></li>
-      <li><a href="#combineInput">combineInput</a> (on_insert = false, on_update=true) <br />
-  <span style="color:grey">Flag which first combines the input RDD and merges multiple partial records into a single record before inserting or updating in HDFS</span></li>
-      <li><a href="#withWriteStatusStorageLevel">withWriteStatusStorageLevel</a> (level = MEMORY_AND_DISK_SER) <br />
-  <span style="color:grey">HoodieWriteClient.insert and HoodieWriteClient.upsert returns a persisted RDD[WriteStatus], this is because the Client can choose to inspect the WriteStatus and choose and commit or not based on the failures. This is a configuration for the storage level for this RDD </span></li>
-      <li><a href="#withAutoCommit">withAutoCommit</a> (autoCommit = true) <br />
-  <span style="color:grey">Should HoodieWriteClient autoCommit after insert and upsert. The client can choose to turn off auto-commit and commit on a “defined success condition”</span></li>
-      <li><a href="#withAssumeDatePartitioning">withAssumeDatePartitioning</a> (assumeDatePartitioning = false) <br />
-  <span style="color:grey">Should HoodieWriteClient assume the data is partitioned by dates, i.e three levels from base path. This is a stop-gap to support tables created by versions &lt; 0.3.1. Will be removed eventually </span></li>
-      <li>
-        <p><a href="#withConsistencyCheckEnabled">withConsistencyCheckEnabled</a> (enabled = false) <br />
-  <span style="color:grey">Should HoodieWriteClient perform additional checks to ensure written files’ are listable on the underlying filesystem/storage. Set this to true, to workaround S3’s eventual consistency model and ensure all data written as a part of a commit is faithfully available for queries. </span></p>
-      </li>
-      <li><a href="#withIndexConfig">withIndexConfig</a> (HoodieIndexConfig) <br />
-  <span style="color:grey">Hoodie uses a index to help find the FileID which contains an incoming record key. This is pluggable to have a external index (HBase) or use the default bloom filter stored in the Parquet files</span>
-        <ul>
-          <li><a href="#withIndexType">withIndexType</a> (indexType = BLOOM) <br />
+      <li><a href="#withIndexType">withIndexType</a> (indexType = BLOOM) <br />
+  Property: <code class="highlighter-rouge">hoodie.index.type</code> <br />
   <span style="color:grey">Type of index to use. Default is Bloom filter. Possible options are [BLOOM | HBASE | INMEMORY]. Bloom filters removes the dependency on a external system and is stored in the footer of the Parquet Data Files</span></li>
-          <li><a href="#bloomFilterNumEntries">bloomFilterNumEntries</a> (60000) <br />
-  <span style="color:grey">Only applies if index type is BLOOM. <br />This is the number of entries to be stored in the bloom filter. We assume the maxParquetFileSize is 128MB and averageRecordSize is 1024B and hence we approx a total of 130K records in a file. The default (60000) is roughly half of this approximation. <a href="https://github.com/uber/hoodie/issues/70">#70</a> tracks computing this dynamically. Warning: Setting this very low, will generate a lot of false positives and in [...]
-          <li><a href="#bloomFilterFPP">bloomFilterFPP</a> (0.000000001) <br />
+      <li><a href="#bloomFilterNumEntries">bloomFilterNumEntries</a> (numEntries = 60000) <br />
+  Property: <code class="highlighter-rouge">hoodie.index.bloom.num_entries</code> <br />
+  <span style="color:grey">Only applies if index type is BLOOM. <br />This is the number of entries to be stored in the bloom filter. We assume the maxParquetFileSize is 128MB and averageRecordSize is 1024B and hence we approx a total of 130K records in a file. The default (60000) is roughly half of this approximation. <a href="https://issues.apache.org/jira/browse/HUDI-56">HUDI-56</a> tracks computing this dynamically. Warning: Setting this very low, will generate a lot of false positiv [...]
+      <li><a href="#bloomFilterFPP">bloomFilterFPP</a> (fpp = 0.000000001) <br />
+  Property: <code class="highlighter-rouge">hoodie.index.bloom.fpp</code> <br />
   <span style="color:grey">Only applies if index type is BLOOM. <br /> Error rate allowed given the number of entries. This is used to calculate how many bits should be assigned for the bloom filter and the number of hash functions. This is usually set very low (default: 0.000000001), we like to tradeoff disk space for lower false positives</span></li>
-          <li><a href="#bloomIndexPruneByRanges">bloomIndexPruneByRanges</a> (true) <br />
+      <li><a href="#bloomIndexPruneByRanges">bloomIndexPruneByRanges</a> (pruneRanges = true) <br />
+  Property: <code class="highlighter-rouge">hoodie.bloom.index.prune.by.ranges</code> <br />
   <span style="color:grey">Only applies if index type is BLOOM. <br /> When true, range information from files to leveraged speed up index lookups. Particularly helpful, if the key has a monotonously increasing prefix, such as timestamp.</span></li>
-          <li><a href="#bloomIndexUseCaching">bloomIndexUseCaching</a> (true) <br />
+      <li><a href="#bloomIndexUseCaching">bloomIndexUseCaching</a> (useCaching = true) <br />
+  Property: <code class="highlighter-rouge">hoodie.bloom.index.use.caching</code> <br />
   <span style="color:grey">Only applies if index type is BLOOM. <br /> When true, the input RDD will cached to speed up index lookup by reducing IO for computing parallelism or affected partitions</span></li>
-          <li><a href="#bloomIndexParallelism">bloomIndexParallelism</a> (0) <br />
+      <li><a href="#bloomIndexParallelism">bloomIndexParallelism</a> (0) <br />
+  Property: <code class="highlighter-rouge">hoodie.bloom.index.parallelism</code> <br />
   <span style="color:grey">Only applies if index type is BLOOM. <br /> This is the amount of parallelism for index lookup, which involves a Spark Shuffle. By default, this is auto computed based on input workload characteristics</span></li>
-          <li><a href="#hbaseZkQuorum">hbaseZkQuorum</a> (zkString) <br />
+      <li><a href="#hbaseZkQuorum">hbaseZkQuorum</a> (zkString) [Required]<br />
+  Property: <code class="highlighter-rouge">hoodie.index.hbase.zkquorum</code> <br />
   <span style="color:grey">Only application if index type is HBASE. HBase ZK Quorum url to connect to.</span></li>
-          <li><a href="#hbaseZkPort">hbaseZkPort</a> (port) <br />
+      <li><a href="#hbaseZkPort">hbaseZkPort</a> (port) [Required]<br />
+  Property: <code class="highlighter-rouge">hoodie.index.hbase.zkport</code> <br />
   <span style="color:grey">Only application if index type is HBASE. HBase ZK Quorum port to connect to.</span></li>
-          <li><a href="#hbaseTableName">hbaseTableName</a> (tableName) <br />
-  <span style="color:grey">Only application if index type is HBASE. HBase Table name to use as the index. Hoodie stores the row_key and [partition_path, fileID, commitTime] mapping in the table.</span></li>
-        </ul>
-      </li>
-      <li><a href="#withStorageConfig">withStorageConfig</a> (HoodieStorageConfig) <br />
-  <span style="color:grey">Storage related configs</span>
-        <ul>
-          <li><a href="#limitFileSize">limitFileSize</a> (size = 120MB) <br />
-  <span style="color:grey">Hoodie re-writes a single file during update (copy_on_write) or a compaction (merge_on_read). This is fundamental unit of parallelism. It is important that this is aligned with the underlying filesystem block size. </span></li>
-          <li><a href="#parquetBlockSize">parquetBlockSize</a> (rowgroupsize = 120MB) <br />
-  <span style="color:grey">Parquet RowGroup size. Its better than this is aligned with the file size, so that a single column within a file is stored continuously on disk</span></li>
-          <li><a href="#parquetPageSize">parquetPageSize</a> (pagesize = 1MB) <br />
+      <li><a href="#hbaseTableName">hbaseTableName</a> (tableName) [Required]<br />
+  Property: <code class="highlighter-rouge">hoodie.index.hbase.table</code> <br />
+  <span style="color:grey">Only application if index type is HBASE. HBase Table name to use as the index. Hudi stores the row_key and [partition_path, fileID, commitTime] mapping in the table.</span></li>
+    </ul>
+  </li>
+</ul>
+
+<h4 id="storage-configs">Storage configs</h4>
+<p>Controls aspects around sizing parquet and log files.</p>
+
+<ul>
+  <li><a href="#withStorageConfig">withStorageConfig</a> (HoodieStorageConfig) <br />
+    <ul>
+      <li><a href="#limitFileSize">limitFileSize</a> (size = 120MB) <br />
+  Property: <code class="highlighter-rouge">hoodie.parquet.max.file.size</code> <br />
+  <span style="color:grey">Target size for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance. </span></li>
+      <li><a href="#parquetBlockSize">parquetBlockSize</a> (rowgroupsize = 120MB) <br />
+  Property: <code class="highlighter-rouge">hoodie.parquet.block.size</code> <br />
+  <span style="color:grey">Parquet RowGroup size. Its better this is same as the file size, so that a single column within a file is stored continuously on disk</span></li>
+      <li><a href="#parquetPageSize">parquetPageSize</a> (pagesize = 1MB) <br />
+  Property: <code class="highlighter-rouge">hoodie.parquet.page.size</code> <br />
   <span style="color:grey">Parquet page size. Page is the unit of read within a parquet file. Within a block, pages are compressed seperately. </span></li>
-          <li><a href="#logFileMaxSize">logFileMaxSize</a> (logFileSize = 1GB) <br />
+      <li><a href="#parquetCompressionRatio">parquetCompressionRatio</a> (parquetCompressionRatio = 0.1) <br />
+  Property: <code class="highlighter-rouge">hoodie.parquet.compression.ratio</code> <br />
+  <span style="color:grey">Expected compression of parquet data used by Hudi, when it tries to size new parquet files. Increase this value, if bulk_insert is producing smaller than expected sized files</span></li>
+      <li><a href="#logFileMaxSize">logFileMaxSize</a> (logFileSize = 1GB) <br />
+  Property: <code class="highlighter-rouge">hoodie.logfile.max.size</code> <br />
   <span style="color:grey">LogFile max size. This is the maximum size allowed for a log file before it is rolled over to the next version. </span></li>
-          <li><a href="#logFileDataBlockMaxSize">logFileDataBlockMaxSize</a> (dataBlockSize = 256MB) <br />
+      <li><a href="#logFileDataBlockMaxSize">logFileDataBlockMaxSize</a> (dataBlockSize = 256MB) <br />
+  Property: <code class="highlighter-rouge">hoodie.logfile.data.block.max.size</code> <br />
   <span style="color:grey">LogFile Data block max size. This is the maximum size allowed for a single data block to be appended to a log file. This helps to make sure the data appended to the log file is broken up into sizable blocks to prevent from OOM errors. This size should be greater than the JVM memory. </span></li>
-        </ul>
-      </li>
-      <li><a href="#withCompactionConfig">withCompactionConfig</a> (HoodieCompactionConfig) <br />
-  <span style="color:grey">Cleaning and configurations related to compaction techniques</span>
-        <ul>
-          <li><a href="#withCleanerPolicy">withCleanerPolicy</a> (policy = KEEP_LATEST_COMMITS) <br />
-  <span style="color:grey">Hoodie Cleaning policy. Hoodie will delete older versions of parquet files to re-claim space. Any Query/Computation referring to this version of the file will fail. It is good to make sure that the data is retained for more than the maximum query execution time.</span></li>
-          <li><a href="#retainCommits">retainCommits</a> (no_of_commits_to_retain = 24) <br />
+      <li><a href="#logFileToParquetCompressionRatio">logFileToParquetCompressionRatio</a> (logFileToParquetCompressionRatio = 0.35) <br />
+  Property: <code class="highlighter-rouge">hoodie.logfile.to.parquet.compression.ratio</code> <br />
+  <span style="color:grey">Expected additional compression as records move from log files to parquet. Used for merge_on_read storage to send inserts into log files &amp; control the size of compacted parquet file.</span></li>
+    </ul>
+  </li>
+</ul>
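+
+<p>Below is a minimal sketch of wiring these storage knobs in through the builder methods listed above. It is illustrative only: the import package names, the base path and the <code class="highlighter-rouge">withPath</code> call are assumptions, and the sizes simply mirror the documented defaults.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>// Illustrative sketch, not a drop-in snippet.
+import com.uber.hoodie.config.HoodieStorageConfig;
+import com.uber.hoodie.config.HoodieWriteConfig;
+
+HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
+    .withPath("/path/to/hudi/dataset")            // assumed base path
+    .withStorageConfig(HoodieStorageConfig.newBuilder()
+        .limitFileSize(120 * 1024 * 1024)         // hoodie.parquet.max.file.size
+        .parquetBlockSize(120 * 1024 * 1024)      // hoodie.parquet.block.size
+        .parquetPageSize(1024 * 1024)             // hoodie.parquet.page.size
+        .logFileMaxSize(1024 * 1024 * 1024)       // hoodie.logfile.max.size
+        .build())
+    .build();
+</code></pre>
+</div>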
+
+<h4 id="compaction-configs">Compaction configs</h4>
+<p>Configs that control compaction (merging of log files onto a new parquet base file) and cleaning (reclamation of older/unused file groups). A short builder example follows the list below.</p>
+
+<ul>
+  <li><a href="#withCompactionConfig">withCompactionConfig</a> (HoodieCompactionConfig) <br />
+    <ul>
+      <li><a href="#withCleanerPolicy">withCleanerPolicy</a> (policy = KEEP_LATEST_COMMITS) <br />
+  Property: <code class="highlighter-rouge">hoodie.cleaner.policy</code> <br />
+  <span style="color:grey">Cleaning policy to be used. Hudi will delete older versions of parquet files to reclaim space. Any query/computation referring to a deleted file version will fail, so make sure data is retained for longer than the maximum query execution time.</span></li>
+      <li><a href="#retainCommits">retainCommits</a> (no_of_commits_to_retain = 24) <br />
+  Property: <code class="highlighter-rouge">hoodie.cleaner.commits.retained</code> <br />
   <span style="color:grey">Number of commits to retain. Data will thus be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how far back you can incrementally pull on this dataset.</span></li>
-          <li><a href="#archiveCommitsWith">archiveCommitsWith</a> (minCommits = 96, maxCommits = 128) <br />
-  <span style="color:grey">Each commit is a small file in the .hoodie directory. Since HDFS is not designed to handle multiple small files, hoodie archives older commits into a sequential log. A commit is published atomically by a rename of the commit file.</span></li>
-          <li><a href="#compactionSmallFileSize">compactionSmallFileSize</a> (size = 0) <br />
-  <span style="color:grey">Small files can always happen because of the number of insert records in a paritition in a batch. Hoodie has an option to auto-resolve small files by masking inserts into this partition as updates to existing small files. The size here is the minimum file size considered as a “small file size”. This should be less &lt; maxFileSize and setting it to 0, turns off this feature. </span></li>
-          <li><a href="#insertSplitSize">insertSplitSize</a> (size = 500000) <br />
+      <li><a href="#archiveCommitsWith">archiveCommitsWith</a> (minCommits = 96, maxCommits = 128) <br />
+  Property: <code class="highlighter-rouge">hoodie.keep.min.commits</code>, <code class="highlighter-rouge">hoodie.keep.max.commits</code> <br />
+  <span style="color:grey">Each commit is a small file in the <code class="highlighter-rouge">.hoodie</code> directory. Since DFS typically does not favor lots of small files, Hudi archives older commits into a sequential log. A commit is published atomically by a rename of the commit file.</span></li>
+      <li><a href="#compactionSmallFileSize">compactionSmallFileSize</a> (size = 0) <br />
+  Property: <code class="highlighter-rouge">hoodie.parquet.small.file.limit</code> <br />
+  <span style="color:grey">This should be less than maxFileSize; setting it to 0 turns off this feature. Small files can always happen because of the number of insert records in a partition in a batch. Hudi has an option to auto-resolve small files by masking inserts into this partition as updates to existing small files. The size here is the minimum file size considered as a “small file size”.</span></li>
+      <li><a href="#insertSplitSize">insertSplitSize</a> (size = 500000) <br />
+  Property: <code class="highlighter-rouge">hoodie.copyonwrite.insert.split.size</code> <br />
   <span style="color:grey">Insert Write Parallelism. Number of inserts grouped for a single partition. Writing out 100MB files, with at least 1KB records, means 100K records per file. Default is to overprovision to 500K. To improve insert latency, tune this to match the number of records in a single file. Setting this to a low number will result in small files (particularly when compactionSmallFileSize is 0).</span></li>
-          <li><a href="#autoTuneInsertSplits">autoTuneInsertSplits</a> (true) <br />
-  <span style="color:grey">Should hoodie dynamically compute the insertSplitSize based on the last 24 commit’s metadata. Turned off by default. </span></li>
-          <li><a href="#approxRecordSize">approxRecordSize</a> () <br />
-  <span style="color:grey">The average record size. If specified, hoodie will use this and not compute dynamically based on the last 24 commit’s metadata. No value set as default. This is critical in computing the insert parallelism and bin-packing inserts into small files. See above.</span></li>
-          <li><a href="#withCompactionLazyBlockReadEnabled">withCompactionLazyBlockReadEnabled</a> (true) <br />
-  <span style="color:grey">When a CompactedLogScanner merges all log files, this config helps to choose whether the logblocks should be read lazily or not. Choose true to use I/O intensive lazy block reading (low memory usage) or false for Memory intensive immediate block read (high memory usage)</span></li>
-          <li><a href="#withMaxNumDeltaCommitsBeforeCompaction">withMaxNumDeltaCommitsBeforeCompaction</a> (maxNumDeltaCommitsBeforeCompaction = 10) <br />
+      <li><a href="#autoTuneInsertSplits">autoTuneInsertSplits</a> (true) <br />
+  Property: <code class="highlighter-rouge">hoodie.copyonwrite.insert.auto.split</code> <br />
+  <span style="color:grey">Should Hudi dynamically compute the insertSplitSize based on the last 24 commits’ metadata. Turned off by default.</span></li>
+      <li><a href="#approxRecordSize">approxRecordSize</a> () <br />
+  Property: <code class="highlighter-rouge">hoodie.copyonwrite.record.size.estimate</code> <br />
+  <span style="color:grey">The average record size. If specified, Hudi will use this and not compute dynamically based on the last 24 commits’ metadata. No value is set by default. This is critical in computing the insert parallelism and bin-packing inserts into small files. See above.</span></li>
+      <li><a href="#withInlineCompaction">withInlineCompaction</a> (inlineCompaction = false) <br />
+  Property: <code class="highlighter-rouge">hoodie.compact.inline</code> <br />
+  <span style="color:grey">When set to true, compaction is triggered by the ingestion itself, right after a commit/deltacommit action as part of insert/upsert/bulk_insert</span></li>
+      <li><a href="#withMaxNumDeltaCommitsBeforeCompaction">withMaxNumDeltaCommitsBeforeCompaction</a> (maxNumDeltaCommitsBeforeCompaction = 10) <br />
+  Property: <code class="highlighter-rouge">hoodie.compact.inline.max.delta.commits</code> <br />
   <span style="color:grey">Maximum number of delta commits to accumulate before triggering an inline compaction.</span></li>
-          <li><a href="#withCompactionReverseLogReadEnabled">withCompactionReverseLogReadEnabled</a> (false) <br />
+      <li><a href="#withCompactionLazyBlockReadEnabled">withCompactionLazyBlockReadEnabled</a> (true) <br />
+  Property: <code class="highlighter-rouge">hoodie.compaction.lazy.block.read</code> <br />
+  <span style="color:grey">When a CompactedLogScanner merges all log files, this config helps to choose whether the log blocks should be read lazily or not. Choose true to use I/O-intensive lazy block reading (low memory usage) or false for memory-intensive immediate block reading (high memory usage).</span></li>
+      <li><a href="#withCompactionReverseLogReadEnabled">withCompactionReverseLogReadEnabled</a> (false) <br />
+  Property: <code class="highlighter-rouge">hoodie.compaction.reverse.log.read</code> <br />
   <span style="color:grey">HoodieLogFormatReader reads a logfile in the forward direction, from pos=0 to pos=file_length. If this config is set to true, the reader reads the logfile in the reverse direction, from pos=file_length to pos=0.</span></li>
-        </ul>
-      </li>
-      <li><a href="#withMetricsConfig">withMetricsConfig</a> (HoodieMetricsConfig) <br />
-  <span style="color:grey">Hoodie publishes metrics on every commit, clean, rollback etc.</span>
-        <ul>
-          <li><a href="#on">on</a> (true) <br />
+      <li><a href="#withCleanerParallelism">withCleanerParallelism</a> (cleanerParallelism = 200) <br />
+  Property: <code class="highlighter-rouge">hoodie.cleaner.parallelism</code> <br />
+  <span style="color:grey">Increase this if cleaning becomes slow.</span></li>
+      <li><a href="#withCompactionStrategy">withCompactionStrategy</a> (compactionStrategy = com.uber.hoodie.io.compact.strategy.LogFileSizeBasedCompactionStrategy) <br />
+  Property: <code class="highlighter-rouge">hoodie.compaction.strategy</code> <br />
+  <span style="color:grey">Compaction strategy decides which file groups are picked up for compaction during each compaction run. By default, Hudi picks the log file with the most accumulated unmerged data.</span></li>
+      <li><a href="#withTargetIOPerCompactionInMB">withTargetIOPerCompactionInMB</a> (targetIOPerCompactionInMB = 500000) <br />
+  Property: <code class="highlighter-rouge">hoodie.compaction.target.io</code> <br />
+  <span style="color:grey">Amount of MBs to spend during a compaction run for the LogFileSizeBasedCompactionStrategy. This value helps bound ingestion latency while compaction is run in inline mode.</span></li>
+      <li><a href="#withTargetPartitionsPerDayBasedCompaction">withTargetPartitionsPerDayBasedCompaction</a> (targetPartitionsPerCompaction = 10) <br />
+  Property: <code class="highlighter-rouge">hoodie.compaction.daybased.target</code> <br />
+  <span style="color:grey">Used by com.uber.hoodie.io.compact.strategy.DayBasedCompactionStrategy to denote the number of latest partitions to compact during a compaction run.</span></li>
+      <li><a href="#payloadClassName">withPayloadClass</a> (payloadClassName = com.uber.hoodie.common.model.HoodieAvroPayload) <br />
+  Property: <code class="highlighter-rouge">hoodie.compaction.payload.class</code> <br />
+  <span style="color:grey">This needs to be the same as the class used during inserts/upserts. Just like writing, compaction also uses the record payload class to merge records in the log against each other, merge again with the base file and produce the final record to be written after compaction.</span></li>
+    </ul>
+  </li>
+</ul>
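+
+<p>A minimal sketch of the compaction/cleaning builder, illustrative only: the import package names, the base path and the <code class="highlighter-rouge">withPath</code> call are assumptions, and the values simply mirror the documented defaults.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>// Illustrative sketch, not a drop-in snippet.
+import com.uber.hoodie.config.HoodieCompactionConfig;
+import com.uber.hoodie.config.HoodieWriteConfig;
+
+HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
+    .withPath("/path/to/hudi/dataset")                // assumed base path
+    .withCompactionConfig(HoodieCompactionConfig.newBuilder()
+        .retainCommits(24)                            // hoodie.cleaner.commits.retained
+        .archiveCommitsWith(96, 128)                  // hoodie.keep.min.commits, hoodie.keep.max.commits
+        .compactionSmallFileSize(100 * 1024 * 1024)   // hoodie.parquet.small.file.limit (0 disables)
+        .withInlineCompaction(false)                  // hoodie.compact.inline
+        .withMaxNumDeltaCommitsBeforeCompaction(10)   // hoodie.compact.inline.max.delta.commits
+        .build())
+    .build();
+</code></pre>
+</div>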
+
+<h4 id="metrics-configs">Metrics configs</h4>
+<p>Enables reporting of Hudi metrics to Graphite. A short builder example follows the list below.</p>
+
+<ul>
+  <li><a href="#withMetricsConfig">withMetricsConfig</a> (HoodieMetricsConfig) <br />
+<span style="color:grey">Hudi publishes metrics on every commit, clean, rollback etc.</span>
+    <ul>
+      <li><a href="#on">on</a> (metricsOn = true) <br />
+  Property: <code class="highlighter-rouge">hoodie.metrics.on</code> <br />
   <span style="color:grey">Turn sending metrics on/off. On by default.</span></li>
-          <li><a href="#withReporterType">withReporterType</a> (GRAPHITE) <br />
+      <li><a href="#withReporterType">withReporterType</a> (reporterType = GRAPHITE) <br />
+  Property: <code class="highlighter-rouge">hoodie.metrics.reporter.type</code> <br />
   <span style="color:grey">Type of metrics reporter. Graphite is the default and the only value supported.</span></li>
-          <li><a href="#toGraphiteHost">toGraphiteHost</a> () <br />
+      <li><a href="#toGraphiteHost">toGraphiteHost</a> (host = localhost) <br />
+  Property: <code class="highlighter-rouge">hoodie.metrics.graphite.host</code> <br />
   <span style="color:grey">Graphite host to connect to</span></li>
-          <li><a href="#onGraphitePort">onGraphitePort</a> () <br />
+      <li><a href="#onGraphitePort">onGraphitePort</a> (port = 4756) <br />
+  Property: <code class="highlighter-rouge">hoodie.metrics.graphite.port</code> <br />
   <span style="color:grey">Graphite port to connect to</span></li>
-          <li><a href="#usePrefix">usePrefix</a> () <br />
-  <span style="color:grey">Standard prefix for all metrics</span></li>
-        </ul>
-      </li>
-      <li><a href="#withMemoryConfig">withMemoryConfig</a> (HoodieMemoryConfig) <br />
-  <span style="color:grey">Memory related configs</span>
-        <ul>
-          <li><a href="#withMaxMemoryFractionPerPartitionMerge">withMaxMemoryFractionPerPartitionMerge</a> (maxMemoryFractionPerPartitionMerge = 0.6) <br />
-  <span style="color:grey">This fraction is multiplied with the user memory fraction (1 - spark.memory.fraction) to get a final fraction of heap space to use during merge </span></li>
-          <li><a href="#withMaxMemorySizePerCompactionInBytes">withMaxMemorySizePerCompactionInBytes</a> (maxMemorySizePerCompactionInBytes = 1GB) <br />
-  <span style="color:grey">HoodieCompactedLogScanner reads logblocks, converts records to HoodieRecords and then merges these log blocks and records. At any point, the number of entries in a log block can be less than or equal to the number of entries in the corresponding parquet file. This can lead to OOM in the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use this config to set the max allowable inMemory footprint of the spillable map.</span></li>
-        </ul>
-      </li>
-      <li>
-        <p><a href="s3_hoodie.html">S3Configs</a> (Hoodie S3 Configs) <br />
-  <span style="color:grey">Configurations required for S3 and Hoodie co-operability.</span></p>
-      </li>
-      <li><a href="gcs_hoodie.html">GCSConfigs</a> (Hoodie GCS Configs) <br />
-  <span style="color:grey">Configurations required for GCS and Hoodie co-operability.</span></li>
+      <li><a href="#usePrefix">usePrefix</a> (prefix = “”) <br />
+  Property: <code class="highlighter-rouge">hoodie.metrics.graphite.metric.prefix</code> <br />
+  <span style="color:grey">Standard prefix applied to all metrics. This helps to add datacenter or environment information to metric names, for example.</span></li>
     </ul>
   </li>
-  <li><a href="#datasource">Hoodie Datasource</a> <br />
-<span style="color:grey">Configs for datasource</span>
+</ul>
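+
+<p>A minimal sketch of enabling Graphite metrics via the builder, illustrative only: the import package names, the base path, the <code class="highlighter-rouge">withPath</code> call, and the host/prefix values are assumptions.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>// Illustrative sketch, not a drop-in snippet.
+import com.uber.hoodie.config.HoodieMetricsConfig;
+import com.uber.hoodie.config.HoodieWriteConfig;
+
+HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
+    .withPath("/path/to/hudi/dataset")            // assumed base path
+    .withMetricsConfig(HoodieMetricsConfig.newBuilder()
+        .on(true)                                 // hoodie.metrics.on
+        .toGraphiteHost("graphite.example.com")   // hoodie.metrics.graphite.host (placeholder)
+        .onGraphitePort(4756)                     // hoodie.metrics.graphite.port
+        .usePrefix("prod.hudi")                   // hoodie.metrics.graphite.metric.prefix (placeholder)
+        .build())
+    .build();
+</code></pre>
+</div>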
+
+<h4 id="memory-configs">Memory configs</h4>
+<p>Controls memory usage for compactions and merges performed internally by Hudi. A short builder example follows the list below.</p>
+
+<ul>
+  <li><a href="#withMemoryConfig">withMemoryConfig</a> (HoodieMemoryConfig) <br />
+<span style="color:grey">Memory related configs</span>
     <ul>
-      <li><a href="#writeoptions">write options</a> (write.format.option(…)) <br />
-  <span style="color:grey"> Options useful for writing datasets </span>
-        <ul>
-          <li><a href="#OPERATION_OPT_KEY">OPERATION_OPT_KEY</a> (Default: upsert) <br />
-  <span style="color:grey">whether to do upsert, insert or bulkinsert for the write operation</span></li>
-          <li><a href="#STORAGE_TYPE_OPT_KEY">STORAGE_TYPE_OPT_KEY</a> (Default: COPY_ON_WRITE) <br />
-  <span style="color:grey">The storage type for the underlying data, for this write. This can’t change between writes.</span></li>
-          <li><a href="#TABLE_NAME_OPT_KEY">TABLE_NAME_OPT_KEY</a> (Default: None (mandatory)) <br />
-  <span style="color:grey">Hive table name, to register the dataset into.</span></li>
-          <li><a href="#PRECOMBINE_FIELD_OPT_KEY">PRECOMBINE_FIELD_OPT_KEY</a> (Default: ts) <br />
-  <span style="color:grey">Field used in preCombining before actual write. When two records have the same key value,
-  we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)</span></li>
-          <li><a href="#PAYLOAD_CLASS_OPT_KEY">PAYLOAD_CLASS_OPT_KEY</a> (Default: com.uber.hoodie.OverwriteWithLatestAvroPayload) <br />
-  <span style="color:grey">Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting.
-  This will render any value set for <code class="highlighter-rouge">PRECOMBINE_FIELD_OPT_VAL</code> in-effective</span></li>
-          <li><a href="#RECORDKEY_FIELD_OPT_KEY">RECORDKEY_FIELD_OPT_KEY</a> (Default: uuid) <br />
-  <span style="color:grey">Record key field. Value to be used as the <code class="highlighter-rouge">recordKey</code> component of <code class="highlighter-rouge">HoodieKey</code>. Actual value
-  will be obtained by invoking .toString() on the field value. Nested fields can be specified using
-  the dot notation eg: <code class="highlighter-rouge">a.b.c</code></span></li>
-          <li><a href="#PARTITIONPATH_FIELD_OPT_KEY">PARTITIONPATH_FIELD_OPT_KEY</a> (Default: partitionpath) <br />
-  <span style="color:grey">Partition path field. Value to be used at the <code class="highlighter-rouge">partitionPath</code> component of <code class="highlighter-rouge">HoodieKey</code>.
-  Actual value ontained by invoking .toString()</span></li>
-          <li><a href="#KEYGENERATOR_CLASS_OPT_KEY">KEYGENERATOR_CLASS_OPT_KEY</a> (Default: com.uber.hoodie.SimpleKeyGenerator) <br />
-  <span style="color:grey">Key generator class, that implements will extract the key out of incoming <code class="highlighter-rouge">Row</code> object</span></li>
-          <li><a href="#COMMIT_METADATA_KEYPREFIX_OPT_KEY">COMMIT_METADATA_KEYPREFIX_OPT_KEY</a> (Default: <code class="highlighter-rouge">_</code>) <br />
-  <span style="color:grey">Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata.
-  This is useful to store checkpointing information, in a consistent way with the hoodie timeline</span></li>
-        </ul>
-      </li>
-      <li><a href="#readoptions">read options</a> (read.format.option(…)) <br />
-  <span style="color:grey">Options useful for reading datasets</span>
-        <ul>
-          <li><a href="#VIEW_TYPE_OPT_KEY">VIEW_TYPE_OPT_KEY</a> (Default:  = read_optimized) <br />
-  <span style="color:grey">Whether data needs to be read, in incremental mode (new data since an instantTime)
-  (or) Read Optimized mode (obtain latest view, based on columnar data)
-  (or) Real time mode (obtain latest view, based on row &amp; columnar data)</span></li>
-          <li><a href="#BEGIN_INSTANTTIME_OPT_KEY">BEGIN_INSTANTTIME_OPT_KEY</a> (Default: None (Mandatory in incremental mode)) <br />
-  <span style="color:grey">Instant time to start incrementally pulling data from. The instanttime here need not
-  necessarily correspond to an instant on the timeline. New data written with an
-   <code class="highlighter-rouge">instant_time &gt; BEGIN_INSTANTTIME</code> are fetched out. For e.g: ‘20170901080000’ will get
-   all new data written after Sep 1, 2017 08:00AM.</span></li>
-          <li><a href="#END_INSTANTTIME_OPT_KEY">END_INSTANTTIME_OPT_KEY</a> (Default: latest instant (i.e fetches all new data since begin instant time)) <br />
-  <span style="color:grey"> Instant time to limit incrementally fetched data to. New data written with an
-  <code class="highlighter-rouge">instant_time &lt;= END_INSTANTTIME</code> are fetched out.</span></li>
-        </ul>
-      </li>
+      <li><a href="#withMaxMemoryFractionPerPartitionMerge">withMaxMemoryFractionPerPartitionMerge</a> (maxMemoryFractionPerPartitionMerge = 0.6) <br />
+  Property: <code class="highlighter-rouge">hoodie.memory.merge.fraction</code> <br />
+  <span style="color:grey">This fraction is multiplied with the user memory fraction (1 - spark.memory.fraction) to get a final fraction of heap space to use during merge </span></li>
+      <li><a href="#withMaxMemorySizePerCompactionInBytes">withMaxMemorySizePerCompactionInBytes</a> (maxMemorySizePerCompactionInBytes = 1GB) <br />
+  Property: <code class="highlighter-rouge">hoodie.memory.compaction.fraction</code> <br />
+  <span style="color:grey">HoodieCompactedLogScanner reads logblocks, converts records to HoodieRecords and then merges these log blocks and records. At any point, the number of entries in a log block can be less than or equal to the number of entries in the corresponding parquet file. This can lead to OOM in the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use this config to set the max allowable inMemory footprint of the spillable map.</span></li>
     </ul>
   </li>
 </ul>
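+
+<p>A minimal sketch of the memory config builder, illustrative only: the import package names, the base path and the <code class="highlighter-rouge">withPath</code> call are assumptions, and the values simply mirror the documented defaults.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>// Illustrative sketch, not a drop-in snippet.
+import com.uber.hoodie.config.HoodieMemoryConfig;
+import com.uber.hoodie.config.HoodieWriteConfig;
+
+HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
+    .withPath("/path/to/hudi/dataset")                              // assumed base path
+    .withMemoryConfig(HoodieMemoryConfig.newBuilder()
+        .withMaxMemoryFractionPerPartitionMerge(0.6)                // fraction of user memory used for merges
+        .withMaxMemorySizePerCompactionInBytes(1024 * 1024 * 1024L) // spillable map budget for compaction
+        .build())
+    .build();
+</code></pre>
+</div>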
@@ -519,14 +709,11 @@
 
 <p>Writing data via Hudi happens as a Spark job, and thus the general rules of Spark debugging apply here too. Below is a list of things to keep in mind if you are looking to improve performance or reliability; a sketch of the corresponding Spark settings follows these tips.</p>
 
-<p><strong>Write operations</strong> : Use <code class="highlighter-rouge">bulkinsert</code> to load new data into a table, and there on use <code class="highlighter-rouge">upsert</code>/<code class="highlighter-rouge">insert</code>.
- Difference between them is that bulk insert uses a disk based write path to scale to load large inputs without need to cache it.</p>
-
-<p><strong>Input Parallelism</strong> : By default, Hoodie tends to over-partition input (i.e <code class="highlighter-rouge">withParallelism(1500)</code>), to ensure each Spark partition stays within the 2GB limit for inputs upto 500GB. Bump this up accordingly if you have larger inputs. We recommend having shuffle parallelism <code class="highlighter-rouge">hoodie.[insert|upsert|bulkinsert].shuffle.parallelism</code> such that its atleast input_data_size/500MB</p>
+<p><strong>Input Parallelism</strong> : By default, Hudi tends to over-partition input (i.e. <code class="highlighter-rouge">withParallelism(1500)</code>), to ensure each Spark partition stays within the 2GB limit for inputs up to 500GB. Bump this up accordingly if you have larger inputs. We recommend setting the shuffle parallelism <code class="highlighter-rouge">hoodie.[insert|upsert|bulkinsert].shuffle.parallelism</code> such that it is at least input_data_size/500MB.</p>
 
-<p><strong>Off-heap memory</strong> : Hoodie writes parquet files and that needs good amount of off-heap memory proportional to schema width. Consider setting something like <code class="highlighter-rouge">spark.yarn.executor.memoryOverhead</code> or <code class="highlighter-rouge">spark.yarn.driver.memoryOverhead</code>, if you are running into such failures.</p>
+<p><strong>Off-heap memory</strong> : Hudi writes parquet files, which needs a good amount of off-heap memory proportional to schema width. Consider setting something like <code class="highlighter-rouge">spark.yarn.executor.memoryOverhead</code> or <code class="highlighter-rouge">spark.yarn.driver.memoryOverhead</code> if you are running into such failures.</p>
 
-<p><strong>Spark Memory</strong> : Typically, hoodie needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some <code class="highlighter-rouge">spark.storage.memoryFraction</code> will generally help boost performance.</p>
+<p><strong>Spark Memory</strong> : Typically, Hudi needs to be able to read a single file into memory to perform merges or compactions, and thus the executor memory should be sufficient to accommodate this. In addition, Hudi caches the input to be able to intelligently place data, so leaving some <code class="highlighter-rouge">spark.storage.memoryFraction</code> will generally help boost performance.</p>
 
 <p><strong>Sizing files</strong> : Set <code class="highlighter-rouge">limitFileSize</code> above judiciously, to balance ingest/write latency vs the number of files and, consequently, the metadata overhead associated with them.</p>
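+
+<p>A sketch of the Spark-side settings described in these tips, illustrative only: the values are placeholders to be tuned per workload, and the Hudi shuffle parallelism keys are writer properties rather than Spark settings.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>// Illustrative sketch, values are placeholders to be tuned per workload.
+import org.apache.spark.SparkConf;
+
+SparkConf sparkConf = new SparkConf()
+    .set("spark.yarn.executor.memoryOverhead", "3072")   // off-heap headroom for parquet writing
+    .set("spark.storage.memoryFraction", "0.5");         // keep some storage memory so Hudi can cache input
+
+// Hudi shuffle parallelism is passed to the writer (not SparkConf),
+// sized roughly as input_data_size / 500MB:
+//   hoodie.upsert.shuffle.parallelism = 1500
+//   hoodie.insert.shuffle.parallelism = 1500
+//   hoodie.bulkinsert.shuffle.parallelism = 1500
+</code></pre>
+</div>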
 
diff --git a/content/contributing.html b/content/contributing.html
index 1901952..9c9e61d 100644
--- a/content/contributing.html
+++ b/content/contributing.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" developer setup">
+<meta name="keywords" content="hudi, ide, developer, setup">
 <title>Developer Setup | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -380,6 +384,8 @@ have an open source license <a href="https://www.apache.org/legal/resolved.html#
       <li>Add adequate tests for your new functionality</li>
       <li>[Optional] For involved changes, its best to also run the entire integration test suite using <code class="highlighter-rouge">mvn clean install</code></li>
       <li>For website changes, please build the site locally &amp; test navigation, formatting &amp; links thoroughly</li>
+      <li>If your code change changes some aspect of documentation (e.g. new config, default value change),
+please ensure there is another PR to <a href="https://github.com/apache/incubator-hudi/blob/asf-site/docs/README.md">update the docs</a> as well.</li>
     </ul>
   </li>
   <li>Format commit messages and the pull request title like <code class="highlighter-rouge">[HUDI-XXX] Fixes bug in Spark Datasource</code>,
diff --git a/content/css/customstyles.css b/content/css/customstyles.css
index d6667a5..56dcdba 100644
--- a/content/css/customstyles.css
+++ b/content/css/customstyles.css
@@ -1,5 +1,5 @@
 body {
-    font-size:15px;
+    font-size:14px;
 }
 
 .bs-callout {
@@ -607,7 +607,7 @@ a.fa.fa-envelope-o.mailto {
     font-weight: 600;
 }
 
-h3 {color: #ED1951; font-weight:normal; font-size:130%;}
+h3 {color: #545253; font-weight:normal; font-size:130%;}
 h4 {color: #808080; font-weight:normal; font-size:120%; font-style:italic;}
 
 .alert, .callout {
diff --git a/content/css/theme-blue.css b/content/css/theme-blue.css
index 9a923ef..46fbd0d 100644
--- a/content/css/theme-blue.css
+++ b/content/css/theme-blue.css
@@ -5,7 +5,7 @@
 }
 
 
-h3 {color: #ED1951; }
+h3 {color: #545253; }
 h4 {color: #808080; }
 
 .nav-tabs > li.active > a, .nav-tabs > li.active > a:hover, .nav-tabs > li.active > a:focus {
diff --git a/content/feed.xml b/content/feed.xml
index b21704e..cd76d50 100644
--- a/content/feed.xml
+++ b/content/feed.xml
@@ -5,8 +5,8 @@
         <description>Apache Hudi (pronounced “Hoodie”) provides upserts and incremental processing capabilities on Big Data</description>
         <link>http://0.0.0.0:4000/</link>
         <atom:link href="http://0.0.0.0:4000/feed.xml" rel="self" type="application/rss+xml"/>
-        <pubDate>Mon, 25 Feb 2019 20:49:33 +0000</pubDate>
-        <lastBuildDate>Mon, 25 Feb 2019 20:49:33 +0000</lastBuildDate>
+        <pubDate>Sat, 09 Mar 2019 21:08:53 +0000</pubDate>
+        <lastBuildDate>Sat, 09 Mar 2019 21:08:53 +0000</lastBuildDate>
         <generator>Jekyll v3.3.1</generator>
         
         <item>
@@ -25,7 +25,7 @@
         
         <item>
             <title>Connect with us at Strata San Jose March 2017</title>
-            <description>&lt;p&gt;We will be presenting Hoodie &amp;amp; general concepts around how incremental processing works at Uber.
+            <description>&lt;p&gt;We will be presenting Hudi &amp;amp; general concepts around how incremental processing works at Uber.
 Catch our talk &lt;strong&gt;“Incremental Processing on Hadoop At Uber”&lt;/strong&gt;&lt;/p&gt;
 
 </description>
diff --git a/content/gcs_hoodie.html b/content/gcs_hoodie.html
index f90992d..cb96011 100644
--- a/content/gcs_hoodie.html
+++ b/content/gcs_hoodie.html
@@ -4,8 +4,8 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="In this page, we go over how to configure hudi with Google Cloud Storage.">
-<meta name="keywords" content=" sql hive gcs spark presto">
-<title>GCS Filesystem (experimental) | Hudi</title>
+<meta name="keywords" content="hudi, hive, google cloud, storage, spark, presto">
+<title>GCS Filesystem | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -158,7 +162,7 @@
 
 
 
-  <a class="email" title="Submit feedback" href="#" onclick="javascript:window.location='mailto:dev@hudi.apache.org?subject=Hudi Documentation feedback&body=I have some feedback about the GCS Filesystem (experimental) page: ' + window.location.href;"><i class="fa fa-envelope-o"></i> Feedback</a>
+  <a class="email" title="Submit feedback" href="#" onclick="javascript:window.location='mailto:dev@hudi.apache.org?subject=Hudi Documentation feedback&body=I have some feedback about the GCS Filesystem page: ' + window.location.href;"><i class="fa fa-envelope-o"></i> Feedback</a>
 
 <li>
 
@@ -176,7 +180,7 @@
                                 searchInput: document.getElementById('search-input'),
                                 resultsContainer: document.getElementById('results-container'),
                                 dataSource: 'search.json',
-                                searchResultTemplate: '<li><a href="{url}" title="GCS Filesystem (experimental)">{title}</a></li>',
+                                searchResultTemplate: '<li><a href="{url}" title="GCS Filesystem">{title}</a></li>',
                     noResultsText: 'No results found.',
                             limit: 10,
                             fuzzy: true,
@@ -327,7 +331,7 @@
     <!-- Content Column -->
     <div class="col-md-9">
         <div class="post-header">
-   <h1 class="post-title-main">GCS Filesystem (experimental)</h1>
+   <h1 class="post-title-main">GCS Filesystem</h1>
 </div>
 
 
@@ -343,7 +347,7 @@
 
     
 
-  <p>Hudi works with HDFS by default and GCS <strong>regional</strong> buckets provide an HDFS API with strong consistency.</p>
+  <p>For Hudi storage on GCS, <strong>regional</strong> buckets provide a DFS API with strong consistency.</p>
 
 <h2 id="gcs-configs">GCS Configs</h2>
 
diff --git a/content/images/hoodie_commit_duration.png b/content/images/hudi_commit_duration.png
similarity index 100%
rename from content/images/hoodie_commit_duration.png
rename to content/images/hudi_commit_duration.png
diff --git a/content/images/hoodie_intro_1.png b/content/images/hudi_intro_1.png
similarity index 100%
rename from content/images/hoodie_intro_1.png
rename to content/images/hudi_intro_1.png
diff --git a/content/images/hoodie_log_format_v2.png b/content/images/hudi_log_format_v2.png
similarity index 100%
rename from content/images/hoodie_log_format_v2.png
rename to content/images/hudi_log_format_v2.png
diff --git a/content/images/hoodie_query_perf_hive.png b/content/images/hudi_query_perf_hive.png
similarity index 100%
rename from content/images/hoodie_query_perf_hive.png
rename to content/images/hudi_query_perf_hive.png
diff --git a/content/images/hoodie_query_perf_presto.png b/content/images/hudi_query_perf_presto.png
similarity index 100%
rename from content/images/hoodie_query_perf_presto.png
rename to content/images/hudi_query_perf_presto.png
diff --git a/content/images/hoodie_query_perf_spark.png b/content/images/hudi_query_perf_spark.png
similarity index 100%
rename from content/images/hoodie_query_perf_spark.png
rename to content/images/hudi_query_perf_spark.png
diff --git a/content/images/hoodie_upsert_dag.png b/content/images/hudi_upsert_dag.png
similarity index 100%
rename from content/images/hoodie_upsert_dag.png
rename to content/images/hudi_upsert_dag.png
diff --git a/content/images/hoodie_upsert_perf1.png b/content/images/hudi_upsert_perf1.png
similarity index 100%
rename from content/images/hoodie_upsert_perf1.png
rename to content/images/hudi_upsert_perf1.png
diff --git a/content/images/hoodie_upsert_perf2.png b/content/images/hudi_upsert_perf2.png
similarity index 100%
rename from content/images/hoodie_upsert_perf2.png
rename to content/images/hudi_upsert_perf2.png
diff --git a/content/implementation.html b/content/implementation.html
index d649a70..e524ec6 100644
--- a/content/implementation.html
+++ b/content/implementation.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" implementation">
+<meta name="keywords" content="hudi, index, storage, compaction, cleaning, implementation">
 <title>Implementation | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -347,7 +351,7 @@ Hudi upsert/insert is merely a Spark DAG, that can be broken into two big pieces
 
 <ul>
   <li>
-    <p><strong>Indexing</strong> :  A big part of Hoodie’s efficiency comes from indexing the mapping from record keys to the file ids, to which they belong to.
+    <p><strong>Indexing</strong> : A big part of Hudi’s efficiency comes from indexing the mapping from record keys to the file ids they belong to.
  This index also helps the <code class="highlighter-rouge">HoodieWriteClient</code> separate upserted records into inserts and updates, so they can be treated differently.
  <code class="highlighter-rouge">HoodieReadClient</code> supports operations such as <code class="highlighter-rouge">filterExists</code> (used for de-duplication of table) and an efficient batch <code class="highlighter-rouge">read(keys)</code> api, that
 can read out the records corresponding to the keys using the index much more quickly than a typical scan via a query. The index is also atomically
@@ -406,7 +410,7 @@ Any remaining records after that, are again packed into new file id groups, agai
 <p>In the case of Copy-On-Write, a single parquet file constitutes one <code class="highlighter-rouge">file slice</code> which contains one complete version of
 the file</p>
 
-<figure><img class="docimage" src="images/hoodie_log_format_v2.png" alt="hoodie_log_format_v2.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_log_format_v2.png" alt="hudi_log_format_v2.png" style="max-width: 1000px" /></figure>
 
 <h4 id="merge-on-read">Merge On Read</h4>
 
@@ -575,7 +579,7 @@ incremental ingestion (writer at DC6) happened before the compaction (some time
 The below description is with regards to compaction from file-group perspective.
     <ul>
       <li><code class="highlighter-rouge">Reader querying at time between ingestion completion time for DC6 and compaction finish “Tc”</code>:
-Hoodie’s implementation will be changed to become aware of file-groups currently waiting for compaction and
+Hudi’s implementation will be changed to become aware of file-groups currently waiting for compaction and
 merge log-files corresponding to DC2-DC6 with the base-file corresponding to SC1. In essence, Hudi will create
 a pseudo file-slice by combining the 2 file-slices starting at base-commits SC1 and SC5 to one.
 For file-groups not waiting for compaction, the reader behavior is essentially the same - read latest file-slice
@@ -602,12 +606,12 @@ the conventional alternatives for achieving these tasks.</p>
 <p>Following shows the speed up obtained for NoSQL ingestion, by switching from bulk loads off HBase to Parquet to incrementally upserting
 on a Hudi dataset, on 5 tables ranging from small to huge.</p>
 
-<figure><img class="docimage" src="images/hoodie_upsert_perf1.png" alt="hoodie_upsert_perf1.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_upsert_perf1.png" alt="hudi_upsert_perf1.png" style="max-width: 1000px" /></figure>
 
 <p>Given Hudi can build the dataset incrementally, it opens doors for also scheduling ingesting more frequently thus reducing latency, with
 significant savings on the overall compute cost.</p>
 
-<figure><img class="docimage" src="images/hoodie_upsert_perf2.png" alt="hoodie_upsert_perf2.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_upsert_perf2.png" alt="hudi_upsert_perf2.png" style="max-width: 1000px" /></figure>
 
 <p>Hudi upserts have been stress tested upto 4TB in a single commit across the t1 table.</p>
 
@@ -618,15 +622,15 @@ with no impact on queries. Following charts compare the Hudi vs non-Hudi dataset
 
 <p><strong>Hive</strong></p>
 
-<figure><img class="docimage" src="images/hoodie_query_perf_hive.png" alt="hoodie_query_perf_hive.png" style="max-width: 800px" /></figure>
+<figure><img class="docimage" src="images/hudi_query_perf_hive.png" alt="hudi_query_perf_hive.png" style="max-width: 800px" /></figure>
 
 <p><strong>Spark</strong></p>
 
-<figure><img class="docimage" src="images/hoodie_query_perf_spark.png" alt="hoodie_query_perf_spark.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_query_perf_spark.png" alt="hudi_query_perf_spark.png" style="max-width: 1000px" /></figure>
 
 <p><strong>Presto</strong></p>
 
-<figure><img class="docimage" src="images/hoodie_query_perf_presto.png" alt="hoodie_query_perf_presto.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_query_perf_presto.png" alt="hudi_query_perf_presto.png" style="max-width: 1000px" /></figure>
 
 
 
diff --git a/content/incremental_processing.html b/content/incremental_processing.html
index a694881..c487368 100644
--- a/content/incremental_processing.html
+++ b/content/incremental_processing.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="In this page, we will discuss some available tools for ingesting data incrementally & consuming the changes.">
-<meta name="keywords" content=" incremental processing">
+<meta name="keywords" content="hudi, incremental, batch, stream, processing, Hive, ETL, Spark SQL">
 <title>Incremental Processing | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -349,7 +353,7 @@ discusses a few tools that can be used to achieve these on different contexts.</
 
 <h2 id="incremental-ingestion">Incremental Ingestion</h2>
 
-<p>Following means can be used to apply a delta or an incremental change to a Hudi dataset. For e.g, the incremental changes could be from a Kafka topic or files uploaded to HDFS or
+<p>The following means can be used to apply a delta or an incremental change to a Hudi dataset. For example, the incremental changes could be from a Kafka topic, files uploaded to DFS, or
 even changes pulled from another Hudi dataset.</p>
 
 <h4 id="deltastreamer-tool">DeltaStreamer Tool</h4>
@@ -360,9 +364,10 @@ from different sources such as DFS or Kafka.</p>
 <p>The tool is a Spark job (part of hoodie-utilities) that provides the following functionality:</p>
 
 <ul>
-  <li>Ability to consume new events from Kafka, incremental imports from Sqoop or output of <code class="highlighter-rouge">HiveIncrementalPuller</code> or files under a folder on HDFS</li>
+  <li>Ability to consume new events from Kafka, incremental imports from Sqoop, output of <code class="highlighter-rouge">HiveIncrementalPuller</code>, or files under a folder on DFS</li>
   <li>Support json, avro or a custom payload types for the incoming data</li>
-  <li>New data is written to a Hudi dataset, with support for checkpointing &amp; schemas and registered onto Hive</li>
+  <li>Pick up avro schemas from DFS or Confluent <a href="https://github.com/confluentinc/schema-registry">schema registry</a>.</li>
+  <li>New data is written to a Hudi dataset, with support for checkpointing, and registered onto Hive</li>
 </ul>
 
 <p>Command line options describe capabilities in more detail (first build hoodie-utilities using <code class="highlighter-rouge">mvn clean package</code>).</p>
@@ -423,10 +428,10 @@ Usage: &lt;main class&gt; [options]
   * --target-table
       name of the target table in Hive
     --transformer-class
-      subclass of com.uber.hoodie.utilities.transform.Transformer. UDF to 
-      transform raw source dataset to a target dataset (conforming to target 
-      schema) before writing. Default : Not set. E:g - 
-      com.uber.hoodie.utilities.transform.SqlQueryBasedTransformer (which 
+      subclass of com.uber.hoodie.utilities.transform.Transformer. UDF to
+      transform raw source dataset to a target dataset (conforming to target
+      schema) before writing. Default : Not set. E:g -
+      com.uber.hoodie.utilities.transform.SqlQueryBasedTransformer (which
       allows a SQL query template to be passed as a transformation function)
 
 </code></pre>
@@ -453,7 +458,7 @@ provided under <code class="highlighter-rouge">hoodie-utilities/src/test/resourc
 </code></pre>
 </div>
 
-<p>In some cases, you may want to convert your existing dataset into Hoodie, before you can begin ingesting new data. This can be accomplished using the <code class="highlighter-rouge">hdfsparquetimport</code> command on the <code class="highlighter-rouge">hoodie-cli</code>.
+<p>In some cases, you may want to convert your existing dataset into Hudi, before you can begin ingesting new data. This can be accomplished using the <code class="highlighter-rouge">hdfsparquetimport</code> command on the <code class="highlighter-rouge">hoodie-cli</code>.
 Currently, there is support for converting parquet datasets.</p>
 
 <h4 id="via-custom-spark-job">Via Custom Spark Job</h4>
@@ -503,8 +508,6 @@ Usage: &lt;main class&gt; [options]
 </code></pre>
 </div>
 
-<div class="bs-callout bs-callout-info">Note that for now, due to jar mismatches between Spark &amp; Hive, its recommended to run this as a separate Java task in your workflow manager/cron. This is getting fix <a href="https://github.com/uber/hoodie/issues/123">here</a></div>
-
 <h2 id="incrementally-pulling">Incrementally Pulling</h2>
 
 <p>Hudi datasets can be pulled incrementally, which means you can get ALL and ONLY the updated &amp; new rows since a specified commit timestamp.
@@ -530,7 +533,7 @@ This class can be used within existing Spark jobs and offers the following funct
 
 <p>Please refer to <a href="configurations.html">configurations</a> section, to view all datasource options.</p>
 
-<p>Additionally, <code class="highlighter-rouge">HoodieReadClient</code> offers the following functionality using Hoodie’s implicit indexing.</p>
+<p>Additionally, <code class="highlighter-rouge">HoodieReadClient</code> offers the following functionality using Hudi’s implicit indexing.</p>
 
 <table>
   <tbody>
@@ -540,7 +543,7 @@ This class can be used within existing Spark jobs and offers the following funct
     </tr>
     <tr>
       <td>read(keys)</td>
-      <td>Read out the data corresponding to the keys as a DataFrame, using Hoodie’s own index for faster lookup</td>
+      <td>Read out the data corresponding to the keys as a DataFrame, using Hudi’s own index for faster lookup</td>
     </tr>
     <tr>
       <td>filterExists()</td>
@@ -590,7 +593,7 @@ e.g: <code class="highlighter-rouge">/app/incremental-hql/intermediate/{source_t
     </tr>
     <tr>
       <td>tmp</td>
-      <td>Directory where the temporary delta data is stored in HDFS. The directory structure will follow conventions. Please see the below section.</td>
+      <td>Directory where the temporary delta data is stored in DFS. The directory structure will follow conventions. Please see the below section.</td>
       <td> </td>
     </tr>
     <tr>
@@ -610,12 +613,12 @@ e.g: <code class="highlighter-rouge">/app/incremental-hql/intermediate/{source_t
     </tr>
     <tr>
       <td>sourceDataPath</td>
-      <td>Source HDFS Base Path. This is where the Hudi metadata will be read.</td>
+      <td>Source DFS Base Path. This is where the Hudi metadata will be read.</td>
       <td> </td>
     </tr>
     <tr>
       <td>targetDataPath</td>
-      <td>Target HDFS Base path. This is needed to compute the fromCommitTime. This is not needed if fromCommitTime is specified explicitly.</td>
+      <td>Target DFS Base path. This is needed to compute the fromCommitTime. This is not needed if fromCommitTime is specified explicitly.</td>
       <td> </td>
     </tr>
     <tr>
@@ -647,7 +650,6 @@ it will automatically use the backfill configuration, since applying the last 24
 is the lack of support for self-joining the same table in mixed mode (normal and incremental modes).</p>
 
 
-
     <div class="tags">
         
     </div>
diff --git a/content/index.html b/content/index.html
index bd31b4d..1a1c5ff 100644
--- a/content/index.html
+++ b/content/index.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="Hudi brings stream processing to big data, providing fresh data while being an order of magnitude efficient over traditional batch processing.">
-<meta name="keywords" content="getting_started,  homepage">
+<meta name="keywords" content="big data, stream processing, cloud, hdfs, storage, upserts, change capture">
 <title>What is Hudi? | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -366,7 +370,7 @@ $('#toc').on('click', 'a', function() {
 
     
 
-  <p>Hudi (pronounced “Hoodie”) ingests &amp; manages storage of large analytical datasets on <a href="http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html">HDFS</a> or cloud stores and provides three logical views for query access.</p>
+  <p>Hudi (pronounced “Hoodie”) ingests &amp; manages storage of large analytical datasets over DFS (<a href="http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html">HDFS</a> or cloud stores) and provides three logical views for query access.</p>
 
 <ul>
   <li><strong>Read Optimized View</strong> - Provides excellent query performance on pure columnar storage, much like plain <a href="https://parquet.apache.org/">Parquet</a> tables.</li>
@@ -374,7 +378,7 @@ $('#toc').on('click', 'a', function() {
   <li><strong>Near-Real time Table</strong> - Provides queries on real-time data, using a combination of columnar &amp; row based storage (e.g Parquet + <a href="http://avro.apache.org/docs/current/mr.html">Avro</a>)</li>
 </ul>
 
-<figure><img class="docimage" src="images/hoodie_intro_1.png" alt="hoodie_intro_1.png" /></figure>
+<figure><img class="docimage" src="images/hudi_intro_1.png" alt="hudi_intro_1.png" /></figure>
 
 <p>By carefully managing how data is laid out in storage &amp; how it’s exposed to queries, Hudi is able to power a rich data ecosystem where external sources can be ingested in near real-time and made available for interactive SQL Engines like <a href="https://prestodb.io">Presto</a> &amp; <a href="https://spark.apache.org/sql/">Spark</a>, while at the same time capable of being consumed incrementally from processing/ETL frameworks like <a href="https://hive.apache.org/">Hive</a> &amp;  [...]
 
diff --git a/content/js/mydoc_scroll.html b/content/js/mydoc_scroll.html
index b23a6ad..ee70719 100644
--- a/content/js/mydoc_scroll.html
+++ b/content/js/mydoc_scroll.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="This page demonstrates how you the integration of a script called ScrollTo, which is used here to link definitions of a JSON code sample to a list of definit...">
-<meta name="keywords" content="special_layouts,  json, scrolling, scrollto, jquery plugin">
+<meta name="keywords" content="json, scrolling, scrollto, jquery plugin">
 <title>Scroll layout | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
diff --git a/content/migration_guide.html b/content/migration_guide.html
index 7bcfa1d..03ea8a1 100644
--- a/content/migration_guide.html
+++ b/content/migration_guide.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="In this page, we will discuss some available tools for migrating your existing dataset into a Hudi dataset">
-<meta name="keywords" content=" migration guide">
+<meta name="keywords" content="hudi, migration, use case">
 <title>Migration Guide | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -362,7 +366,7 @@ Take this approach if your dataset is an append only type of dataset and you do
 
 <p>Import your existing dataset into a Hudi managed dataset. Since all the data is Hudi managed, none of the limitations
  of Approach 1 apply here. Updates spanning any partitions can be applied to this dataset and Hudi will efficiently
- make the update available to queries. Note that not only do you get to use all Hoodie primitives on this dataset,
+ make the update available to queries. Note that not only do you get to use all Hudi primitives on this dataset,
  there are other additional advantages of doing this. Hudi automatically manages file sizes of a Hudi managed dataset
  . You can define the desired file size when converting this dataset and Hudi will ensure it writes out files
  adhering to the config. It will also ensure that smaller files later get corrected by routing some new inserts into
@@ -371,9 +375,8 @@ Take this approach if your dataset is an append only type of dataset and you do
 <p>There are a few options when choosing this approach.</p>
 
 <h4 id="option-1">Option 1</h4>
-<p>Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing dataset is in
-parquet file
-format. This tool essentially starts a Spark Job to read the existing parquet dataset and converts it into a HUDI managed dataset by re-writing all the data.</p>
+<p>Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing dataset is in parquet file format.
+This tool essentially starts a Spark job to read the existing parquet dataset and converts it into a Hudi managed dataset by rewriting all the data.</p>
 
 <h4 id="option-2">Option 2</h4>
 <p>For huge datasets, this could be as simple as : for partition in [list of partitions in source dataset] {
@@ -385,7 +388,7 @@ format. This tool essentially starts a Spark Job to read the existing parquet da
 <p>Write your own custom logic of how to load an existing dataset into a Hudi managed one. Please read about the RDD API
  <a href="quickstart.html">here</a>.</p>
 
-<div class="highlighter-rouge"><pre class="highlight"><code>Using the HDFSParquetImporter Tool. Once hoodie has been built via `mvn clean install -DskipTests`, the shell can be
+<div class="highlighter-rouge"><pre class="highlight"><code>Using the HDFSParquetImporter Tool. Once Hudi has been built via `mvn clean install -DskipTests`, the shell can be
 fired up via `cd hoodie-cli &amp;&amp; ./hoodie-cli.sh`.
 
 hoodie-&gt;hdfsparquetimport
diff --git a/content/news.html b/content/news.html
index 645bae0..43d92a3 100644
--- a/content/news.html
+++ b/content/news.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" news, blog, updates, release notes, announcements">
+<meta name="keywords" content="apache, hudi, news, blog, updates, release notes, announcements">
 <title>News | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -266,7 +270,7 @@
                 <a href="tag_news.html">news</a>
 
                 </span>
-        <p> We will be presenting Hoodie &amp; general concepts around how incremental processing works at Uber.
+        <p> We will be presenting Hudi &amp; general concepts around how incremental processing works at Uber.
 Catch our talk “Incremental Processing on Hadoop At Uber”
 
  </p>
diff --git a/content/news_archive.html b/content/news_archive.html
index 4d80715..d1986b5 100644
--- a/content/news_archive.html
+++ b/content/news_archive.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" news, blog, updates, release notes, announcements">
+<meta name="keywords" content="news, blog, updates, release notes, announcements">
 <title>News | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
diff --git a/content/powered_by.html b/content/powered_by.html
index 8f4b0d4..99991ca 100644
--- a/content/powered_by.html
+++ b/content/powered_by.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" talks">
+<meta name="keywords" content="hudi, talks, presentation">
 <title>Talks & Powered By | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -383,7 +387,6 @@ October 2018, Spark+AI Summit Europe, London, UK</p>
 </ol>
 
 
-
     <div class="tags">
         
     </div>
diff --git a/content/privacy.html b/content/privacy.html
index 704bd3d..1804b9f 100644
--- a/content/privacy.html
+++ b/content/privacy.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" privacy">
+<meta name="keywords" content="hudi, privacy">
 <title>Privacy Policy | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
diff --git a/content/quickstart.html b/content/quickstart.html
index a73534d..b7781b3 100644
--- a/content/quickstart.html
+++ b/content/quickstart.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content="quickstart,  quickstart">
+<meta name="keywords" content="hudi, quickstart">
 <title>Quickstart | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -362,7 +366,8 @@ $ mvn clean install -DskipTests -DskipITs -Dhive11
 
 <h2 id="version-compatibility">Version Compatibility</h2>
 
-<p>Hudi requires Java 8 to be installed. Hudi works with Spark-2.x versions. We have verified that Hudi works with the following combination of Hadoop/Hive/Spark.</p>
+<p>Hudi requires Java 8 to be installed on a *nix system. Hudi works with Spark-2.x versions. 
+Further, we have verified that Hudi works with the following combinations of Hadoop/Hive/Spark.</p>
 
 <table>
   <thead>
@@ -395,8 +400,9 @@ $ mvn clean install -DskipTests -DskipITs -Dhive11
   </tbody>
 </table>
 
-<p>If your environment has other versions of hadoop/hive/spark, please try out Hudi and let us know if there are any issues. We are limited by our bandwidth to certify other combinations.
-It would be of great help if you can reach out to us with your setup and experience with hoodie.</p>
+<p>If your environment has other versions of Hadoop/Hive/Spark, please try out Hudi and let us know if there are any issues.
+We are limited by our bandwidth to certify other combinations (e.g. Docker on Windows).
+It would be of great help if you could reach out to us with your setup and experience with Hudi.</p>
 
 <h2 id="generate-a-hudi-dataset">Generate a Hudi Dataset</h2>
 
@@ -424,7 +430,7 @@ Use the RDD API to perform more involved actions on a Hudi dataset</p>
 
 <h4 id="datasource-api">DataSource API</h4>
 
-<p>Run <strong>hoodie-spark/src/test/java/HoodieJavaApp.java</strong> class, to place a two commits (commit 1 =&gt; 100 inserts, commit 2 =&gt; 100 updates to previously inserted 100 records) onto your HDFS/local filesystem. Use the wrapper script
+<p>Run the <strong>hoodie-spark/src/test/java/HoodieJavaApp.java</strong> class to place two commits (commit 1 =&gt; 100 inserts, commit 2 =&gt; 100 updates to the previously inserted 100 records) onto your DFS/local filesystem. Use the wrapper script
 to run from command-line</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>cd hoodie-spark
@@ -679,9 +685,9 @@ data infrastructure is brought up in a local docker cluster within your computer
 
 <h3 id="setting-up-docker-cluster">Setting up Docker Cluster</h3>
 
-<h4 id="build-hoodie">Build Hoodie</h4>
+<h4 id="build-hudi">Build Hudi</h4>
 
-<p>The first step is to build hoodie
+<p>The first step is to build Hudi
 <code class="highlighter-rouge">
 cd &lt;HUDI_WORKSPACE&gt;
 mvn package -DskipTests
@@ -801,7 +807,7 @@ automatically initializes the datasets in the file-system if they do not exist y
 <div class="highlighter-rouge"><pre class="highlight"><code>docker exec -it adhoc-2 /bin/bash
 
 # Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_cow dataset in HDFS
-spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties
+spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties --schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
 ....
 ....
 2018-09-24 22:20:00 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
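
Once the delta-streamer run completes, the ingested dataset can also be sanity-checked straight from a spark-shell via the Hudi DataSource. A sketch follows; the path glob must cover the dataset's partition directories and may need adjusting to your layout.

    val stockDF = spark.read.format("com.uber.hoodie")
      .load("/user/hive/warehouse/stock_ticks_cow/*/*/*/*")   // adjust wildcards to the partition depth
    stockDF.select("_hoodie_commit_time", "symbol", "ts", "volume").show(10, false)
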
@@ -1329,7 +1335,7 @@ scala&gt; spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, clo
 Again, you can use the Hudi CLI to manually schedule and run compaction</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>docker exec -it adhoc-1 /bin/bash
-^[[Aroot@adhoc-1:/opt#   /var/hoodie/ws/hoodie-cli/hoodie-cli.sh
+root@adhoc-1:/opt#   /var/hoodie/ws/hoodie-cli/hoodie-cli.sh
 ============================================
 *                                          *
 *     _    _                 _ _           *
@@ -1514,7 +1520,7 @@ scala&gt; spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, clo
 
 <h2 id="testing-hudi-in-local-docker-environment">Testing Hudi in Local Docker environment</h2>
 
-<p>You can bring up a hadoop docker environment containing Hadoop, Hive and Spark services with support for hoodie.
+<p>You can bring up a Hadoop Docker environment containing Hadoop, Hive and Spark services with support for Hudi.
 <code class="highlighter-rouge">
 $ mvn pre-integration-test -DskipTests
 </code>
diff --git a/content/s3_hoodie.html b/content/s3_hoodie.html
index 217005c..0366721 100644
--- a/content/s3_hoodie.html
+++ b/content/s3_hoodie.html
@@ -4,8 +4,8 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="In this page, we go over how to configure Hudi with S3 filesystem.">
-<meta name="keywords" content=" sql hive s3 spark presto">
-<title>S3 Filesystem (experimental) | Hudi</title>
+<meta name="keywords" content="hudi, hive, aws, s3, spark, presto">
+<title>S3 Filesystem | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -158,7 +162,7 @@
 
 
 
-  <a class="email" title="Submit feedback" href="#" onclick="javascript:window.location='mailto:dev@hudi.apache.org?subject=Hudi Documentation feedback&body=I have some feedback about the S3 Filesystem (experimental) page: ' + window.location.href;"><i class="fa fa-envelope-o"></i> Feedback</a>
+  <a class="email" title="Submit feedback" href="#" onclick="javascript:window.location='mailto:dev@hudi.apache.org?subject=Hudi Documentation feedback&body=I have some feedback about the S3 Filesystem page: ' + window.location.href;"><i class="fa fa-envelope-o"></i> Feedback</a>
 
 <li>
 
@@ -176,7 +180,7 @@
                                 searchInput: document.getElementById('search-input'),
                                 resultsContainer: document.getElementById('results-container'),
                                 dataSource: 'search.json',
-                                searchResultTemplate: '<li><a href="{url}" title="S3 Filesystem (experimental)">{title}</a></li>',
+                                searchResultTemplate: '<li><a href="{url}" title="S3 Filesystem">{title}</a></li>',
                     noResultsText: 'No results found.',
                             limit: 10,
                             fuzzy: true,
@@ -327,7 +331,7 @@
     <!-- Content Column -->
     <div class="col-md-9">
         <div class="post-header">
-   <h1 class="post-title-main">S3 Filesystem (experimental)</h1>
+   <h1 class="post-title-main">S3 Filesystem</h1>
 </div>
 
 
@@ -343,11 +347,11 @@
 
     
 
-  <p>Hudi works with HDFS by default. There is an experimental work going on Hoodie-S3 compatibility.</p>
+  <p>In this page, we explain how to get your Hudi Spark job to store data into AWS S3.</p>
 
 <h2 id="aws-configs">AWS configs</h2>
 
-<p>There are two configurations required for Hoodie-S3 compatibility:</p>
+<p>There are two configurations required for Hudi-S3 compatibility:</p>
 
 <ul>
   <li>Adding AWS Credentials for Hudi</li>
@@ -415,7 +419,6 @@ export HOODIE_ENV_fs_DOT_s3n_DOT_impl=org.apache.hadoop.fs.s3a.S3AFileSystem
 </ul>
 
 
-
     <div class="tags">
         
     </div>
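
As a concrete (hypothetical) illustration of the credentials piece, the same settings can also be applied programmatically on the Spark job's Hadoop configuration. The key names below are the standard s3a ones; the bucket name is a placeholder.

    import org.apache.spark.sql.SparkSession

    // Assumes the hadoop-aws / s3a connector jars are on the classpath
    val spark = SparkSession.builder().appName("hudi-on-s3").getOrCreate()
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hadoopConf.set("fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
    hadoopConf.set("fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))

    // Hudi datasets can then be written to / read from an s3a:// base path,
    // e.g. "s3a://my-bucket/hudi/stock_ticks_cow"
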
diff --git a/content/search.json b/content/search.json
index 3f7eb15..0473b34 100644
--- a/content/search.json
+++ b/content/search.json
@@ -6,7 +6,7 @@
 {
 "title": "Admin Guide",
 "tags": "",
-"keywords": "admin",
+"keywords": "hudi, administration, operation, devops",
 "url": "admin_guide.html",
 "summary": "This section offers an overview of tools available to operate an ecosystem of Hudi datasets"
 }
@@ -17,7 +17,7 @@
 {
 "title": "Community",
 "tags": "",
-"keywords": "usecases",
+"keywords": "hudi, use cases, big data, apache",
 "url": "community.html",
 "summary": ""
 }
@@ -28,7 +28,7 @@
 {
 "title": "Comparison",
 "tags": "",
-"keywords": "usecases",
+"keywords": "apache, hudi, kafka, kudu, hive, hbase, stream processing",
 "url": "comparison.html",
 "summary": ""
 }
@@ -39,7 +39,7 @@
 {
 "title": "Concepts",
 "tags": "",
-"keywords": "concepts",
+"keywords": "hudi, design, storage, views, timeline",
 "url": "concepts.html",
 "summary": "Here we introduce some basic concepts & give a broad technical overview of Hudi"
 }
@@ -50,7 +50,7 @@
 {
 "title": "Configurations",
 "tags": "",
-"keywords": "configurations",
+"keywords": "garbage collection, hudi, jvm, configs, tuning",
 "url": "configurations.html",
 "summary": "Here we list all possible configurations and what they mean"
 }
@@ -61,7 +61,7 @@
 {
 "title": "Developer Setup",
 "tags": "",
-"keywords": "developer setup",
+"keywords": "hudi, ide, developer, setup",
 "url": "contributing.html",
 "summary": ""
 }
@@ -72,9 +72,9 @@
 
 
 {
-"title": "GCS Filesystem (experimental)",
+"title": "GCS Filesystem",
 "tags": "",
-"keywords": "sql hive gcs spark presto",
+"keywords": "hudi, hive, google cloud, storage, spark, presto",
 "url": "gcs_hoodie.html",
 "summary": "In this page, we go over how to configure hudi with Google Cloud Storage."
 }
@@ -85,7 +85,7 @@
 {
 "title": "Implementation",
 "tags": "",
-"keywords": "implementation",
+"keywords": "hudi, index, storage, compaction, cleaning, implementation",
 "url": "implementation.html",
 "summary": ""
 }
@@ -96,7 +96,7 @@
 {
 "title": "Incremental Processing",
 "tags": "",
-"keywords": "incremental processing",
+"keywords": "hudi, incremental, batch, stream, processing, Hive, ETL, Spark SQL",
 "url": "incremental_processing.html",
 "summary": "In this page, we will discuss some available tools for ingesting data incrementally & consuming the changes."
 }
@@ -107,7 +107,7 @@
 {
 "title": "What is Hudi?",
 "tags": "getting_started",
-"keywords": "homepage",
+"keywords": "big data, stream processing, cloud, hdfs, storage, upserts, change capture",
 "url": "index.html",
 "summary": "Hudi brings stream processing to big data, providing fresh data while being an order of magnitude efficient over traditional batch processing."
 }
@@ -118,7 +118,7 @@
 {
 "title": "Migration Guide",
 "tags": "",
-"keywords": "migration guide",
+"keywords": "hudi, migration, use case",
 "url": "migration_guide.html",
 "summary": "In this page, we will discuss some available tools for migrating your existing dataset into a Hudi dataset"
 }
@@ -140,7 +140,7 @@
 {
 "title": "News",
 "tags": "",
-"keywords": "news, blog, updates, release notes, announcements",
+"keywords": "apache, hudi, news, blog, updates, release notes, announcements",
 "url": "news.html",
 "summary": ""
 }
@@ -162,7 +162,7 @@
 {
 "title": "Talks &amp; Powered By",
 "tags": "",
-"keywords": "talks",
+"keywords": "hudi, talks, presentation",
 "url": "powered_by.html",
 "summary": ""
 }
@@ -173,7 +173,7 @@
 {
 "title": "Privacy Policy",
 "tags": "",
-"keywords": "privacy",
+"keywords": "hudi, privacy",
 "url": "privacy.html",
 "summary": ""
 }
@@ -184,7 +184,7 @@
 {
 "title": "Quickstart",
 "tags": "quickstart",
-"keywords": "quickstart",
+"keywords": "hudi, quickstart",
 "url": "quickstart.html",
 "summary": ""
 }
@@ -193,9 +193,9 @@
 
 
 {
-"title": "S3 Filesystem (experimental)",
+"title": "S3 Filesystem",
 "tags": "",
-"keywords": "sql hive s3 spark presto",
+"keywords": "hudi, hive, aws, s3, spark, presto",
 "url": "s3_hoodie.html",
 "summary": "In this page, we go over how to configure Hudi with S3 filesystem."
 }
@@ -210,7 +210,7 @@
 {
 "title": "SQL Queries",
 "tags": "",
-"keywords": "sql hive spark presto",
+"keywords": "hudi, hive, spark, sql, presto",
 "url": "sql_queries.html",
 "summary": "In this page, we go over how to enable SQL queries on Hudi built tables."
 }
@@ -221,7 +221,7 @@
 {
 "title": "Use Cases",
 "tags": "",
-"keywords": "usecases",
+"keywords": "hudi, data ingestion, etl, real time, use cases",
 "url": "use_cases.html",
 "summary": "Following are some sample use-cases for Hudi, which illustrate the benefits in terms of faster processing & increased efficiency"
 }
diff --git a/content/sql_queries.html b/content/sql_queries.html
index 6936191..d7fa8cc 100644
--- a/content/sql_queries.html
+++ b/content/sql_queries.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="In this page, we go over how to enable SQL queries on Hudi built tables.">
-<meta name="keywords" content=" sql hive spark presto">
+<meta name="keywords" content="hudi, hive, spark, sql, presto">
 <title>SQL Queries | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -368,8 +372,6 @@ to using the Hive Serde to read the data (planning/executions is still Spark). T
 towards Parquet reading, which we will address in the next method based on path filters.
 However benchmarks have not revealed any real performance degradation with Hudi &amp; SparkSQL, compared to native support.</p>
 
-<div class="bs-callout bs-callout-info">Get involved to improve this integration <a href="https://github.com/uber/hoodie/issues/7">here</a> and <a href="https://issues.apache.org/jira/browse/SPARK-19351">here</a> </div>
-
 <p>A sample command is provided below to spin up the Spark shell</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>$ spark-shell --jars hoodie-spark-bundle-x.y.z-SNAPSHOT.jar --driver-class-path /etc/hive/conf  --packages com.databricks:spark-avro_2.11:4.0.0 --conf spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 7g --executor-memory 2g  --master yarn-client
diff --git a/content/strata-talk.html b/content/strata-talk.html
index 13a8375..58b6f8a 100644
--- a/content/strata-talk.html
+++ b/content/strata-talk.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content="news,  ">
+<meta name="keywords" content="">
 <title>Hudi entered Apache Incubator | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
diff --git a/content/use_cases.html b/content/use_cases.html
index 6df8c34..dcdf403 100644
--- a/content/use_cases.html
+++ b/content/use_cases.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="Following are some sample use-cases for Hudi, which illustrate the benefits in terms of faster processing & increased efficiency">
-<meta name="keywords" content=" usecases">
+<meta name="keywords" content="hudi, data ingestion, etl, real time, use cases">
 <title>Use Cases | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI" target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a href="https://projects.apache.org/project.html?incubator-hudi" target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -350,7 +354,7 @@ In most (if not all) Hadoop deployments, it is unfortunately solved in a pieceme
 even though this data is arguably the most valuable for the entire organization.</p>
 
 <p>For RDBMS ingestion, Hudi provides <strong>faster loads via Upserts</strong>, as opposed to costly &amp; inefficient bulk loads. For example, you can read the MySQL BIN log or <a href="https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports">Sqoop Incremental Import</a> and apply them to an
-equivalent Hudi table on HDFS. This would be much faster/efficient than a <a href="https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457">bulk merge job</a>
+equivalent Hudi table on DFS. This would be much faster and more efficient than a <a href="https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457">bulk merge job</a>
 or <a href="http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/">complicated handcrafted merge workflows</a></p>
 
 <p>For NoSQL datastores like <a href="http://cassandra.apache.org/">Cassandra</a> / <a href="http://www.project-voldemort.com/voldemort/">Voldemort</a> / <a href="https://hbase.apache.org/">HBase</a>, even moderately big installations store billions of rows.
@@ -367,13 +371,13 @@ This is absolutely perfect for lower scale (<a href="https://blog.twitter.com/20
 But typically, these systems end up getting abused for less interactive queries as well, since data on Hadoop is intolerably stale. This leads to under-utilization &amp; wasteful hardware/license costs.</p>
 
 <p>On the other hand, interactive SQL solutions on Hadoop such as Presto &amp; SparkSQL excel in <strong>queries that finish within a few seconds</strong>.
-By bringing <strong>data freshness to a few minutes</strong>, Hudi can provide a much efficient alternative, as well unlock real-time analytics on <strong>several magnitudes larger datasets</strong> stored in HDFS.
+By bringing <strong>data freshness to a few minutes</strong>, Hudi can provide a much more efficient alternative, as well as unlock real-time analytics on <strong>several magnitudes larger datasets</strong> stored in DFS.
 Also, Hudi has no external dependencies (like a dedicated HBase cluster, purely used for real-time analytics) and thus enables faster analytics on much fresher data, without increasing the operational overhead.</p>
 
 <h2 id="incremental-processing-pipelines">Incremental Processing Pipelines</h2>
 
 <p>One fundamental ability Hadoop provides is to build a chain of datasets derived from each other via DAGs expressed as workflows.
-Workflows often depend on new data being output by multiple upstream workflows and traditionally, availability of new data is indicated by a new HDFS Folder/Hive Partition.
+Workflows often depend on new data being output by multiple upstream workflows and traditionally, availability of new data is indicated by a new DFS Folder/Hive Partition.
 Let’s take a concrete example to illustrate this. An upstream workflow <code class="highlighter-rouge">U</code> can create a Hive partition for every hour, with data for that hour (event_time) at the end of each hour (processing_time), providing effective freshness of 1 hour.
 Then, a downstream workflow <code class="highlighter-rouge">D</code>, kicks off immediately after <code class="highlighter-rouge">U</code> finishes, and does its own processing for the next hour, increasing the effective latency to 2 hours.</p>
 
@@ -388,19 +392,18 @@ like 15 mins, and providing an end-end latency of 30 mins at <code class="highli
 
 <div class="bs-callout bs-callout-info">To achieve this, Hudi has embraced similar concepts from stream processing frameworks like <a href="https://spark.apache.org/docs/latest/streaming-programming-guide.html#join-operations">Spark Streaming</a> , Pub/Sub systems like <a href="http://kafka.apache.org/documentation/#theconsumer">Kafka</a>
 or database replication technologies like <a href="https://docs.oracle.com/cd/E11882_01/server.112/e16545/xstrm_cncpt.htm#XSTRM187">Oracle XStream</a>.
-For the more curious, a more detailed explanation of the benefits of Incremetal Processing (compared to Stream Processing &amp; Batch Processing) can be found <a href="https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop">here</a></div>
+For the more curious, a more detailed explanation of the benefits of Incremental Processing (compared to Stream Processing &amp; Batch Processing) can be found <a href="https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop">here</a></div>
 
-<h2 id="data-dispersal-from-hadoop">Data Dispersal From Hadoop</h2>
+<h2 id="data-dispersal-from-dfs">Data Dispersal From DFS</h2>
 
 <p>A popular use-case for Hadoop is to crunch data and then disperse it back to an online serving store, to be used by an application.
 For example, a Spark pipeline can <a href="https://eng.uber.com/telematics/">determine hard braking events on Hadoop</a> and load them into a serving store like ElasticSearch, to be used by the Uber application to increase safe driving. Typical architectures for this employ a <code class="highlighter-rouge">queue</code> between Hadoop and serving store, to prevent overwhelming the target serving store.
-A popular choice for this queue is Kafka and this model often results in <strong>redundant storage of same data on HDFS (for offline analysis on computed results) and Kafka (for dispersal)</strong></p>
+A popular choice for this queue is Kafka, and this model often results in <strong>redundant storage of the same data on DFS (for offline analysis on computed results) and Kafka (for dispersal)</strong></p>
 
 <p>Once again Hudi can efficiently solve this problem, by having the Spark Pipeline upsert output from
 each run into a Hudi dataset, which can then be incrementally tailed (just like a Kafka topic) for new data &amp; written into the serving store.</p>
 
 
-
     <div class="tags">
         
     </div>