Posted to commits@hudi.apache.org by vi...@apache.org on 2019/03/22 18:19:34 UTC

[incubator-hudi] branch asf-site updated: Major cleanup of docs structure/content

This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 42f8482  Major cleanup of docs structure/content
42f8482 is described below

commit 42f848217f3c29745ee3946516647ca60b85afb4
Author: Vinoth Chandar <vi...@uber.com>
AuthorDate: Thu Mar 21 19:15:05 2019 -0700

    Major cleanup of docs structure/content
    
     - Reworked Concepts, Querying/Writing data pages
     - Added a section on storage management
     - Trimmed down quickstart and separated out demo page
     - Made all configs linkable
     - Added link to HIP process
     - Point code to asf repo
---
 docs/_data/topnav.yml                  |    2 +-
 docs/community.md                      |    3 +-
 docs/concepts.md                       |  173 ++---
 docs/configurations.md                 |  397 ++++++-----
 docs/css/customstyles.css              |    4 +
 docs/{quickstart.md => docker_demo.md} |  311 +--------
 docs/performance.md                    |   13 +-
 docs/querying_data.md                  |  128 +++-
 docs/quickstart.md                     | 1124 ++------------------------------
 docs/writing_data.md                   |  129 +---
 10 files changed, 529 insertions(+), 1755 deletions(-)

diff --git a/docs/_data/topnav.yml b/docs/_data/topnav.yml
index 3b9eca8..167ffb8 100644
--- a/docs/_data/topnav.yml
+++ b/docs/_data/topnav.yml
@@ -8,7 +8,7 @@ topnav:
     - title: Community
       url: /community.html
     - title: Code
-      external_url: https://github.com/uber/hoodie
+      external_url: https://github.com/apache/incubator-hudi
 
 #Topnav dropdowns
 topnav_dropdowns:
diff --git a/docs/community.md b/docs/community.md
index fd39c0f..76b276b 100644
--- a/docs/community.md
+++ b/docs/community.md
@@ -15,7 +15,8 @@ There are several ways to get in touch with the Hudi community.
 | For any general questions, user support, development discussions | Dev Mailing list ([Subscribe](mailto:dev-subscribe@hudi.apache.org), [Unsubscribe](mailto:dev-unsubscribe@hudi.apache.org), [Archives](https://lists.apache.org/list.html?dev@hudi.apache.org)). Empty email works for subscribe/unsubscribe. Please use [gists](https://gist.github.com) to share code/stacktraces on the email. |
 | For reporting bugs or issues or discovering known issues | Please use [ASF Hudi JIRA](https://issues.apache.org/jira/projects/HUDI/summary). See [#here](#accounts) for access |
 | For quick pings & 1-1 chats | Join our [slack group](https://join.slack.com/t/apache-hudi/signup) |
-| For proposing large features, changes | Start a Hudi Improvement Process (HIP). Instructions coming soon. See [#here](#accounts) for access |
+| For proposing large features, changes | Start a Hudi Improvement Process (HIP). Instructions [here](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103091449#ApacheHudi(Incubating)-Designdocuments/HIPs). See [#here](#accounts) for wiki access |
 | For stream of commits, pull requests etc | Commits Mailing list ([Subscribe](mailto:commits-subscribe@hudi.apache.org), [Unsubscribe](mailto:commits-unsubscribe@hudi.apache.org), [Archives](https://lists.apache.org/list.html?commits@hudi.apache.org)) |
 
 If you wish to report a security vulnerability, please contact [security@apache.org](mailto:security@apache.org).
diff --git a/docs/concepts.md b/docs/concepts.md
index a2f4322..6f22978 100644
--- a/docs/concepts.md
+++ b/docs/concepts.md
@@ -7,25 +7,40 @@ toc: false
 summary: "Here we introduce some basic concepts & give a broad technical overview of Hudi"
 ---
 
-Apache Hudi (pronounced “Hudi”) provides the following primitives over datasets on DFS
+Apache Hudi (pronounced “Hoodie”) provides the following streaming primitives over datasets on DFS
 
  * Upsert                     (how do I change the dataset?)
- * Incremental consumption    (how do I fetch data that changed?)
+ * Incremental pull           (how do I fetch data that changed?)
 
+In this section, we will discuss key concepts & terminology that are important to understand in order to use these primitives effectively.
 
-In order to achieve this, Hudi maintains a `timeline` of all activity performed on the dataset, that helps provide `instantaenous` views of the dataset,
-while also efficiently supporting retrieval of data in the order of arrival into the dataset.
-Such key activities include
+## Timeline
+At its core, Hudi maintains a `timeline` of all actions performed on the dataset at different `instants` of time that helps provide instantaneous views of the dataset,
+while also efficiently supporting retrieval of data in the order of arrival. A Hudi instant consists of the following components 
 
- * `COMMITS` - A single commit captures information about an **atomic write** of a batch of records into a dataset.
-       Commits are identified by a monotonically increasing timestamp, denoting the start of the write operation.
+ * `Action type` : Type of action performed on the dataset
+ * `Instant time` : Instant time is typically a timestamp (e.g. 20190117010349), which monotonically increases in the order of the action's begin time.
+ * `State` : Current state of the instant
+ 
+Hudi guarantees that the actions performed on the timeline are atomic & timeline consistent based on the instant time.
+
+Key actions performed include
+
+ * `COMMITS` - A commit denotes an **atomic write** of a batch of records into a dataset.
  * `CLEANS` - Background activity that gets rid of older versions of files in the dataset that are no longer needed.
- * `DELTA_COMMITS` - A single commit captures information about an **atomic write** of a batch of records into a
- MergeOnRead storage type of dataset
- * `COMPACTIONS` - Background activity to reconcile differential data structures within Hudi e.g: moving updates from row based log files to columnar formats.
+ * `DELTA_COMMIT` - A delta commit refers to an **atomic write** of a batch of records into a MergeOnRead storage type dataset, where some/all of the data could be written just to delta logs.
+ * `COMPACTION` - Background activity to reconcile differential data structures within Hudi, e.g. moving updates from row-based log files to columnar formats. Internally, compaction manifests as a special commit on the timeline.
+ * `ROLLBACK` - Indicates that a commit/delta commit was unsuccessful & rolled back, removing any partial files produced during such a write
+ * `SAVEPOINT` - Marks certain file groups as "saved", such that cleaner will not delete them. It helps restore the dataset to a point on the timeline, in case of disaster/data recovery scenarios.
 
+Any given instant can be in one of the following states
 
-{% include image.html file="Hudi_timeline.png" alt="Hudi_timeline.png" %}
+ * `REQUESTED` - Denotes an action has been scheduled, but has not yet been initiated
+ * `INFLIGHT` - Denotes that the action is currently being performed
+ * `COMPLETED` - Denotes completion of an action on the timeline
+
+{% include image.html file="hudi_timeline.png" alt="hudi_timeline.png" %}
 
 The example above shows upserts happening between 10:00 and 10:20 on a Hudi dataset, roughly every 5 mins, leaving commit metadata on the Hudi timeline, along
 with other background cleaning/compactions. One key observation to make is that the commit time indicates the `arrival time` of the data (10:20AM), while the actual data
@@ -35,57 +50,68 @@ When there is late arriving data (data intended for 9:00 arriving >1 hr late at
 With the help of the timeline, an incremental query attempting to get all new data that was committed successfully since 10:00 hours is able to very efficiently consume
 only the changed files without, say, scanning all the time buckets > 07:00.
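For instance, such an incremental pull could be expressed through the Spark datasource, using the incremental read options documented on the configurations page (a minimal sketch; the base path and instant time are placeholders, and the `incremental` view-type value is an assumption based on the documented view modes):

```java
// Illustrative sketch only: consume records committed after instant 20190322100000 (~10:00),
// via the incremental view. The path and instant values are placeholders.
Dataset<Row> changes = spark.read()
    .format("com.uber.hoodie")
    .option("hoodie.datasource.view.type", "incremental")                 // VIEW_TYPE_OPT_KEY
    .option("hoodie.datasource.read.begin.instanttime", "20190322100000") // BEGIN_INSTANTTIME_OPT_KEY
    .load("/path/to/hudi/dataset");
```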
 
-## Terminologies
+## File management
+Hudi organizes a dataset into a directory structure under a `basepath` on DFS. The dataset is broken up into partitions, which are folders containing data files for that partition,
+very similar to Hive tables. Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath.
 
- * `Hudi Dataset`
-    A structured hive/spark dataset managed by Hudi. Hudi supports both partitioned and non-partitioned Hive tables.
- * `Commit`
-    A commit marks a new batch of data applied to a dataset. Hudi maintains  monotonically increasing timestamps to track commits and guarantees that a commit is atomically
-    published.
- * `Commit Timeline`
-    Commit Timeline refers to the sequence of Commits that was applied in order on a dataset over its lifetime.
- * `File Slice`
-    Hudi provides efficient handling of updates by having a fixed mapping between record key to a logical file Id.
-    Hudi uses MVCC to provide atomicity and isolation of readers from a writer. This means that a logical fileId will
-    have many physical versions of it. Each of these physical version of a file represents a complete view of the
-    file as of a commit and is called File Slice
- * `File Group`
-    A file-group is a file-slice timeline. It is a list of file-slices in commit order. It is identified by `file id`
+Within each partition, files are organized into `file groups`, uniquely identified by a `file id`. Each file group contains several
+`file slices`, where each slice contains a base columnar file (`*.parquet`) produced at a certain commit/compaction instant time,
+ along with a set of log files (`*.log.*`) that contain inserts/updates to the base file since the base file was produced. 
+Hudi adopts an MVCC design, where compaction action merges logs and base files to produce new file slices and cleaning action gets rid of 
+unused/older file slices to reclaim space on DFS. 
 
+Hudi provides efficient upserts by mapping a given hoodie key (record key + partition path) consistently to a file group, via an indexing mechanism. 
+This mapping between record key and file group/file id never changes once the first version of a record has been written to a file. In short, the 
+mapped file group contains all versions of a group of records.
 
-## Storage Types
+## Storage Types & Views
+Hudi storage types define how data is indexed & laid out on the DFS and how the above primitives and timeline activities are implemented on top of such organization (i.e how data is written). 
+In turn, `views` define how the underlying data is exposed to the queries (i.e how data is read). 
 
-Hudi storage types capture how data is indexed & laid out on the filesystem, and how the above primitives and timeline activities are implemented on top of
-such organization (i.e how data is written). This is not to be confused with the notion of Read Optimized & Near-Real time tables, which are merely how the underlying data is exposed
-to the queries (i.e how data is read).
+| Storage Type  | Supported Views |
+|-------------- |------------------|
+| Copy On Write | Read Optimized + Incremental   |
+| Merge On Read | Read Optimized + Incremental + Near Real-time |
 
-Hudi (will) supports the following storage types.
+### Storage Types
+Hudi supports the following storage types.
 
-| Storage Type  | Supported Tables |
-|-------------- |------------------|
-| Copy On Write | Read Optimized   |
-| Merge On Read | Read Optimized + Near Real-time |
+  - [Copy On Write](#copy-on-write-storage) : Stores data using exclusively columnar file formats (e.g parquet). Updates simply version & rewrite the files by performing a synchronous merge during write.
+  - [Merge On Read](#merge-on-read-storage) : Stores data using a combination of columnar (e.g parquet) + row based (e.g avro) file formats. Updates are logged to delta files & later compacted to produce new versions of columnar files synchronously or asynchronously.
+    
+The following table summarizes the trade-offs between these two storage types
+
+| Trade-off | CopyOnWrite | MergeOnRead |
+|-------------- |------------------| ------------------|
+| Data Latency | Higher   | Lower |
+| Update cost (I/O) | Higher (rewrite entire parquet) | Lower (append to delta log) |
+| Parquet File Size | Smaller (high update (I/O) cost) | Larger (low update cost) |
+| Write Amplification | Higher | Lower (depending on compaction strategy) |
 
-  - Copy On Write : A heavily read optimized storage type, that simply creates new versions of files corresponding to the records that changed.
-  - Merge On Read : Also provides a near-real time datasets in the order of 5 mins, by shifting some of the write cost, to the reads and merging incoming and on-disk data on-the-fly
 
-Regardless of the storage type, Hudi organizes a datasets into a directory structure under a `basepath`,
-very similar to Hive tables. Dataset is broken up into partitions, which are folders containing files for that partition.
-Each partition uniquely identified by its `partitionpath`, which is relative to the basepath.
+### Views
+Hudi supports the following views of stored data
 
-Within each partition, records are distributed into multiple files. Each file is identified by an unique `file id` and the `commit` that
-produced the file. Multiple files can share the same file id but written at different commits, in case of updates.
+ - **Read Optimized View** : Queries on this view see the latest snapshot of the dataset as of a given commit or compaction action. 
+    This view exposes only the base/columnar files in the latest file slices to the queries and guarantees the same columnar query performance as a non-hudi columnar dataset. 
+ - **Incremental View** : Queries on this view only see new data written to the dataset, since a given commit/compaction. This view effectively provides change streams to enable incremental data pipelines. 
+ - **Realtime View** : Queries on this view see the latest snapshot of the dataset as of a given delta commit action. This view provides near-real time datasets (few minutes of latency)
+     by merging the base and delta files of the latest file slice on-the-fly.
 
-Each record is uniquely identified by a `record key` and mapped to a file id forever. This mapping between record key
-and file id, never changes once the first version of a record has been written to a file. In short, the
- `file id` identifies a group of files, that contain all versions of a group of records.
+The following table summarizes the trade-offs between the different views.
+
+| Trade-off | ReadOptimized | RealTime |
+|-------------- |------------------| ------------------|
+| Data Latency | Higher   | Lower |
+| Query Latency | Lower (raw columnar performance) | Higher (merge columnar + row based delta) |
 
 
-## Copy On Write
+## Copy On Write Storage
 
-As mentioned above, each commit on Copy On Write storage, produces new versions of files. In other words, we implicitly compact every
-commit, such that only columnar data exists. As a result, the write amplification (number of bytes written for 1 byte of incoming data)
- is much higher, where read amplification is close to zero. This is a much desired property for a system like Hadoop, which is predominantly read-heavy.
+File slices in Copy-On-Write storage only contain the base/columnar file and each commit produces new versions of base files. 
+In other words, we implicitly compact on every commit, such that only columnar data exists. As a result, the write amplification 
+(number of bytes written for 1 byte of incoming data) is much higher, where read amplification is zero. 
+This is a much desired property for analytical workloads, which are predominantly read-heavy.
 
 The following illustrates how this works conceptually, when data is written into copy-on-write storage and two queries run on top of it.
 
@@ -93,39 +119,39 @@ Following illustrates how this works conceptually, when  data written into copy-
 {% include image.html file="hudi_cow.png" alt="hudi_cow.png" %}
 
 
-As data gets written, updates to existing file ids, produce a new version for that file id stamped with the commit and
-inserts allocate a new file id and write its first version for that file id. These file versions and their commits are color coded above.
-Normal SQL queries running against such dataset (eg: `select count(*)` counting the total records in that partition), first checks the timeline for latest commit
-and filters all but latest versions of each file id. As you can see, an old query does not see the current inflight commit's files colored in pink,
+As data gets written, updates to existing file groups produce a new slice for that file group stamped with the commit instant time, 
+while inserts allocate a new file group and write its first slice for that file group. These file slices and their commit instant times are color coded above.
+SQL queries running against such a dataset (e.g. `select count(*)` counting the total records in that partition), first check the timeline for the latest commit
+and filter out all but the latest file slices of each file group. As you can see, an old query does not see the current inflight commit's files color coded in pink,
 but a new query starting after the commit picks up the new data. Thus queries are immune to any write failures/partial writes and only run on committed data.
 
-The intention of copy on write storage, is to fundamentally improve how datasets are managed today on Hadoop through
+The intention of copy on write storage is to fundamentally improve how datasets are managed today through
 
   - First class support for atomically updating data at file-level, instead of rewriting whole tables/partitions
-  - Ability to incremental consume changes, as opposed to wasteful scans or fumbling with heuristical approaches
+  - Ability to incrementally consume changes, as opposed to wasteful scans or fumbling with heuristics
   - Tight control of file sizes to keep query performance excellent (small files hurt query performance considerably).
 
 
-## Merge On Read
+## Merge On Read Storage
 
 Merge on read storage is a superset of copy on write, in the sense that it still provides a read optimized view of the dataset via the Read Optimized table.
-But, additionally stores incoming upserts for each file id, onto a `row based append log`, that enables providing near real-time data to the queries
- by applying the append log, onto the latest version of each file id on-the-fly during query time. Thus, this storage type attempts to balance read and write amplication intelligently, to provide near real-time queries.
-The most significant change here, would be to the compactor, which now carefully chooses which append logs need to be compacted onto
-their columnar base data, to keep the query performance in check (larger append logs would incur longer merge times with merge data on query side)
+Additionally, it stores incoming upserts for each file group onto a row-based delta log, which enables providing near real-time data to queries
+ by applying the delta log onto the latest version of each file id on-the-fly during query time. Thus, this storage type attempts to balance read and write amplification intelligently, to provide near real-time queries.
+The most significant change here would be to the compactor, which now carefully chooses which delta logs need to be compacted onto
+their columnar base file, to keep the query performance in check (larger delta logs would incur longer merge times with merge data on query side).
 
 Following illustrates how the storage works, and shows queries on both near-real time table and read optimized table.
 
 {% include image.html file="hudi_mor.png" alt="hudi_mor.png" max-width="1000" %}
 
 
-There are lot of interesting things happening in this example, which bring out the subleties in the approach.
+There are a lot of interesting things happening in this example, which bring out the subtleties in the approach.
 
  - We now have commits every 1 minute or so, something we could not do in the other storage type.
- - Within each file id group, now there is an append log, which holds incoming updates to records in the base columnar files. In the example, the append logs hold
+ - Within each file id group, now there is a delta log, which holds incoming updates to records in the base columnar files. In the example, the delta logs hold
  all the data from 10:05 to 10:10. The base columnar files are still versioned with the commit, as before.
  Thus, if one were to simply look at base files alone, then the storage layout looks exactly like a copy on write table.
- - A periodic compaction process reconciles these changes from the append log and produces a new version of base file, just like what happened at 10:05 in the example.
+ - A periodic compaction process reconciles these changes from the delta log and produces a new version of base file, just like what happened at 10:05 in the example.
 - There are two ways of querying the same underlying storage: ReadOptimized (RO) Table and Near-Realtime (RT) table, depending on whether we choose query performance or freshness of data.
 - The semantics around when data from a commit is available to a query change in a subtle way for the RO table. Note that such a query
 running at 10:10 won't see data after 10:05 above, while a query on the RT table always sees the freshest data.
@@ -133,23 +159,8 @@ There are lot of interesting things happening in this example, which bring out t
  strategy, where we aggressively compact the latest partitions compared to older partitions, we could ensure the RO Table sees data
  published within X minutes in a consistent fashion.
 
-The intention of merge on read storage, is to enable near real-time processing directly on top of Hadoop, as opposed to copying
-data out to specialized systems, which may not be able to handle the data volume.
+The intention of merge on read storage is to enable near real-time processing directly on top of DFS, as opposed to copying
+data out to specialized systems, which may not be able to handle the data volume. There are also a few secondary benefits to 
+this storage, such as reduced write amplification (the amount of data written per byte of incoming data), by avoiding a synchronous merge of data during writes.
 
-## Trade offs when choosing different storage types and views
-
-### Storage Types
 
-| Trade-off | CopyOnWrite | MergeOnRead |
-|-------------- |------------------| ------------------|
-| Data Latency | Higher   | Lower |
-| Update cost (I/O) | Higher (rewrite entire parquet) | Lower (append to delta file) |
-| Parquet File Size | Smaller (high update(I/0) cost) | Larger (low update cost) |
-| Write Amplification | Higher | Lower (depending on compaction strategy) |
-
-### Hudi Views
-
-| Trade-off | ReadOptimized | RealTime |
-|-------------- |------------------| ------------------|
-| Data Latency | Higher   | Lower |
-| Query Latency | Lower (raw columnar performance) | Higher (merge columnar + row based delta) |
diff --git a/docs/configurations.md b/docs/configurations.md
index cc1cc09..e8cab52 100644
--- a/docs/configurations.md
+++ b/docs/configurations.md
@@ -3,7 +3,7 @@ title: Configurations
 keywords: garbage collection, hudi, jvm, configs, tuning
 sidebar: mydoc_sidebar
 permalink: configurations.html
-toc: false
+toc: true
 summary: "Here we list all possible configurations and what they mean"
 ---
 This page covers the different ways of configuring your job to write/read Hudi datasets. 
@@ -31,6 +31,10 @@ to cloud stores.
 Spark jobs using the datasource can be configured by passing the below options into the `option(k,v)` method as usual.
 The actual datasource level configs are listed below.
 
+
+
+
+
 #### Write Options
 
 Additionally, you can pass down any of the WriteClient level configs directly using `options()` or `option(k,v)` methods.
@@ -49,68 +53,86 @@ inputDF.write()
 
 Options useful for writing datasets via `write.format.option(...)`
 
-- [TABLE_NAME_OPT_KEY](#TABLE_NAME_OPT_KEY)<br/>
+##### TABLE_NAME_OPT_KEY {#TABLE_NAME_OPT_KEY}
   Property: `hoodie.datasource.write.table.name` [Required]<br/>
   <span style="color:grey">Hive table name, to register the dataset into.</span>
-- [OPERATION_OPT_KEY](#OPERATION_OPT_KEY)<br/>
+  
+##### OPERATION_OPT_KEY {#OPERATION_OPT_KEY}
   Property: `hoodie.datasource.write.operation`, Default: `upsert`<br/>
   <span style="color:grey">whether to do upsert, insert or bulkinsert for the write operation. Use `bulkinsert` to load new data into a table, and there on use `upsert`/`insert`. 
   bulk insert uses a disk based write path to scale to load large inputs without need to cache it.</span>
-- [STORAGE_TYPE_OPT_KEY](#STORAGE_TYPE_OPT_KEY)<br/>
+  
+##### STORAGE_TYPE_OPT_KEY {#STORAGE_TYPE_OPT_KEY}
   Property: `hoodie.datasource.write.storage.type`, Default: `COPY_ON_WRITE` <br/>
   <span style="color:grey">The storage type for the underlying data, for this write. This can't change between writes.</span>
-- [PRECOMBINE_FIELD_OPT_KEY](#PRECOMBINE_FIELD_OPT_KEY)<br/>
+  
+##### PRECOMBINE_FIELD_OPT_KEY {#PRECOMBINE_FIELD_OPT_KEY}
   Property: `hoodie.datasource.write.precombine.field`, Default: `ts` <br/>
   <span style="color:grey">Field used in preCombining before actual write. When two records have the same key value,
 we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)</span>
-- [PAYLOAD_CLASS_OPT_KEY](#PAYLOAD_CLASS_OPT_KEY)<br/>
+
+##### PAYLOAD_CLASS_OPT_KEY {#PAYLOAD_CLASS_OPT_KEY}
   Property: `hoodie.datasource.write.payload.class`, Default: `com.uber.hoodie.OverwriteWithLatestAvroPayload` <br/>
   <span style="color:grey">Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. 
   This will render any value set for `PRECOMBINE_FIELD_OPT_VAL` in-effective</span>
-- [RECORDKEY_FIELD_OPT_KEY](#RECORDKEY_FIELD_OPT_KEY)<br/>
+  
+##### RECORDKEY_FIELD_OPT_KEY {#RECORDKEY_FIELD_OPT_KEY}
   Property: `hoodie.datasource.write.recordkey.field`, Default: `uuid` <br/>
   <span style="color:grey">Record key field. Value to be used as the `recordKey` component of `HoodieKey`. Actual value
 will be obtained by invoking .toString() on the field value. Nested fields can be specified using
 the dot notation eg: `a.b.c`</span>
-- [PARTITIONPATH_FIELD_OPT_KEY](#PARTITIONPATH_FIELD_OPT_KEY)<br/>
+
+##### PARTITIONPATH_FIELD_OPT_KEY {#PARTITIONPATH_FIELD_OPT_KEY}
   Property: `hoodie.datasource.write.partitionpath.field`, Default: `partitionpath` <br/>
   <span style="color:grey">Partition path field. Value to be used at the `partitionPath` component of `HoodieKey`.
 Actual value ontained by invoking .toString()</span>
-- [KEYGENERATOR_CLASS_OPT_KEY](#KEYGENERATOR_CLASS_OPT_KEY)<br/>
+
+##### KEYGENERATOR_CLASS_OPT_KEY {#KEYGENERATOR_CLASS_OPT_KEY}
   Property: `hoodie.datasource.write.keygenerator.class`, Default: `com.uber.hoodie.SimpleKeyGenerator` <br/>
   <span style="color:grey">Key generator class, that implements will extract the key out of incoming `Row` object</span>
-- [COMMIT_METADATA_KEYPREFIX_OPT_KEY](#COMMIT_METADATA_KEYPREFIX_OPT_KEY)<br/>
+  
+##### COMMIT_METADATA_KEYPREFIX_OPT_KEY {#COMMIT_METADATA_KEYPREFIX_OPT_KEY}
   Property: `hoodie.datasource.write.commitmeta.key.prefix`, Default: `_` <br/>
   <span style="color:grey">Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata.
 This is useful to store checkpointing information, in a consistent way with the hudi timeline</span>
-- [INSERT_DROP_DUPS_OPT_KEY](#INSERT_DROP_DUPS_OPT_KEY)<br/>
+
+##### INSERT_DROP_DUPS_OPT_KEY {#INSERT_DROP_DUPS_OPT_KEY}
   Property: `hoodie.datasource.write.insert.drop.duplicates`, Default: `false` <br/>
   <span style="color:grey">If set to true, filters out all duplicate records from incoming dataframe, during insert operations. </span>
-- [HIVE_SYNC_ENABLED_OPT_KEY](#HIVE_SYNC_ENABLED_OPT_KEY)<br/>
+  
+##### HIVE_SYNC_ENABLED_OPT_KEY {#HIVE_SYNC_ENABLED_OPT_KEY}
   Property: `hoodie.datasource.hive_sync.enable`, Default: `false` <br/>
   <span style="color:grey">When set to true, register/sync the dataset to Apache Hive metastore</span>
-- [HIVE_DATABASE_OPT_KEY](#HIVE_DATABASE_OPT_KEY)<br/>
+  
+##### HIVE_DATABASE_OPT_KEY {#HIVE_DATABASE_OPT_KEY}
   Property: `hoodie.datasource.hive_sync.database`, Default: `default` <br/>
   <span style="color:grey">database to sync to</span>
-- [HIVE_TABLE_OPT_KEY](#HIVE_TABLE_OPT_KEY)<br/>
+  
+##### HIVE_TABLE_OPT_KEY {#HIVE_TABLE_OPT_KEY}
   Property: `hoodie.datasource.hive_sync.table`, [Required] <br/>
   <span style="color:grey">table to sync to</span>
-- [HIVE_USER_OPT_KEY](#HIVE_USER_OPT_KEY)<br/>
+  
+##### HIVE_USER_OPT_KEY {#HIVE_USER_OPT_KEY}
   Property: `hoodie.datasource.hive_sync.username`, Default: `hive` <br/>
   <span style="color:grey">hive user name to use</span>
-- [HIVE_PASS_OPT_KEY](#HIVE_PASS_OPT_KEY)<br/>
+  
+##### HIVE_PASS_OPT_KEY {#HIVE_PASS_OPT_KEY}
   Property: `hoodie.datasource.hive_sync.password`, Default: `hive` <br/>
   <span style="color:grey">hive password to use</span>
-- [HIVE_URL_OPT_KEY](#HIVE_URL_OPT_KEY)<br/>
+  
+##### HIVE_URL_OPT_KEY {#HIVE_URL_OPT_KEY}
   Property: `hoodie.datasource.hive_sync.jdbcurl`, Default: `jdbc:hive2://localhost:10000` <br/>
   <span style="color:grey">Hive metastore url</span>
-- [HIVE_PARTITION_FIELDS_OPT_KEY](#HIVE_PARTITION_FIELDS_OPT_KEY)<br/>
+  
+##### HIVE_PARTITION_FIELDS_OPT_KEY {#HIVE_PARTITION_FIELDS_OPT_KEY}
   Property: `hoodie.datasource.hive_sync.partition_fields`, Default: ` ` <br/>
   <span style="color:grey">field in the dataset to use for determining hive partition columns.</span>
-- [HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY](#HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY)<br/>
+  
+##### HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY {#HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY}
   Property: `hoodie.datasource.hive_sync.partition_extractor_class`, Default: `com.uber.hoodie.hive.SlashEncodedDayPartitionValueExtractor` <br/>
   <span style="color:grey">Class used to extract partition field values into hive partition columns.</span>
-- [HIVE_ASSUME_DATE_PARTITION_OPT_KEY](#HIVE_ASSUME_DATE_PARTITION_OPT_KEY)<br/>
+  
+##### HIVE_ASSUME_DATE_PARTITION_OPT_KEY {#HIVE_ASSUME_DATE_PARTITION_OPT_KEY}
   Property: `hoodie.datasource.hive_sync.assume_date_partitioning`, Default: `false` <br/>
   <span style="color:grey">Assume partitioning is yyyy/mm/dd</span>
 
@@ -118,22 +140,25 @@ This is useful to store checkpointing information, in a consistent way with the
 
 Options useful for reading datasets via `read.format.option(...)`
 
-- [VIEW_TYPE_OPT_KEY](#VIEW_TYPE_OPT_KEY) <br/>
+##### VIEW_TYPE_OPT_KEY {#VIEW_TYPE_OPT_KEY}
 Property: `hoodie.datasource.view.type`, Default: `read_optimized` <br/>
 <span style="color:grey">Whether data needs to be read, in incremental mode (new data since an instantTime)
 (or) Read Optimized mode (obtain latest view, based on columnar data)
 (or) Real time mode (obtain latest view, based on row & columnar data)</span>
-- [BEGIN_INSTANTTIME_OPT_KEY](#BEGIN_INSTANTTIME_OPT_KEY) <br/> 
+
+##### BEGIN_INSTANTTIME_OPT_KEY {#BEGIN_INSTANTTIME_OPT_KEY} 
 Property: `hoodie.datasource.read.begin.instanttime`, [Required in incremental mode] <br/>
 <span style="color:grey">Instant time to start incrementally pulling data from. The instanttime here need not
 necessarily correspond to an instant on the timeline. New data written with an
  `instant_time > BEGIN_INSTANTTIME` are fetched out. For e.g: '20170901080000' will get
  all new data written after Sep 1, 2017 08:00AM.</span>
-- [END_INSTANTTIME_OPT_KEY](#END_INSTANTTIME_OPT_KEY) <br/>
+ 
+##### END_INSTANTTIME_OPT_KEY {#END_INSTANTTIME_OPT_KEY}
 Property: `hoodie.datasource.read.end.instanttime`, Default: latest instant (i.e fetches all new data since begin instant time) <br/>
 <span style="color:grey"> Instant time to limit incrementally fetched data to. New data written with an
 `instant_time <= END_INSTANTTIME` are fetched out.</span>
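For example, the two instant-time options can be combined with the incremental view to pull a bounded window of changes (a sketch only; the path and instant values are placeholders, and the `incremental` value string is an assumption based on the documented view modes):

```java
// Illustrative sketch: fetch only records written with an instant time in
// (20170901080000, 20170902080000]. The path and instant values are placeholders.
Dataset<Row> window = spark.read()
    .format("com.uber.hoodie")
    .option("hoodie.datasource.view.type", "incremental")                  // VIEW_TYPE_OPT_KEY
    .option("hoodie.datasource.read.begin.instanttime", "20170901080000")  // BEGIN_INSTANTTIME_OPT_KEY
    .option("hoodie.datasource.read.end.instanttime", "20170902080000")    // END_INSTANTTIME_OPT_KEY
    .load("/path/to/hudi/dataset");
```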
 
+
 ### WriteClient Configs {#writeclient-configs}
 
 Jobs programming directly against the RDD level apis can build a `HoodieWriteConfig` object and pass it in to the `HoodieWriteClient` constructor. 
@@ -153,184 +178,232 @@ HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
 
 The following subsections go over different aspects of write configs, explaining the most important configs with their property names and default values.
 
-- [withPath](#withPath) (hoodie_base_path) 
+##### withPath(hoodie_base_path) {#withPath}
 Property: `hoodie.base.path` [Required] <br/>
 <span style="color:grey">Base DFS path under which all the data partitions are created. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under the base directory. </span>
-- [withSchema](#withSchema) (schema_str) <br/> 
+
+##### withSchema(schema_str) {#withSchema} 
 Property: `hoodie.avro.schema` [Required]<br/>
 <span style="color:grey">This is the current reader avro schema for the dataset. This is a string of the entire schema. HoodieWriteClient uses this schema to pass on to implementations of HoodieRecordPayload to convert from the source format to avro record. This is also used when re-writing records during an update. </span>
-- [forTable](#forTable) (table_name)<br/> 
+
+##### forTable(table_name) {#forTable} 
 Property: `hoodie.table.name` [Required] <br/>
  <span style="color:grey">Table name for the dataset, will be used for registering with Hive. Needs to be same across runs.</span>
-- [withBulkInsertParallelism](#withBulkInsertParallelism) (bulk_insert_parallelism = 1500) <br/> 
+
+##### withBulkInsertParallelism(bulk_insert_parallelism = 1500) {#withBulkInsertParallelism} 
 Property: `hoodie.bulkinsert.shuffle.parallelism`<br/>
 <span style="color:grey">Bulk insert is meant to be used for large initial imports and this parallelism determines the initial number of files in your dataset. Tune this to achieve a desired optimal size during initial import.</span>
-- [withParallelism](#withParallelism) (insert_shuffle_parallelism = 1500, upsert_shuffle_parallelism = 1500)<br/> 
+
+##### withParallelism(insert_shuffle_parallelism = 1500, upsert_shuffle_parallelism = 1500) {#withParallelism} 
 Property: `hoodie.insert.shuffle.parallelism`, `hoodie.upsert.shuffle.parallelism`<br/>
 <span style="color:grey">Once data has been initially imported, this parallelism controls initial parallelism for reading input records. Ensure this value is high enough say: 1 partition for 1 GB of input data</span>
-- [combineInput](#combineInput) (on_insert = false, on_update=true)<br/> 
+
+##### combineInput(on_insert = false, on_update=true) {#combineInput} 
 Property: `hoodie.combine.before.insert`, `hoodie.combine.before.upsert`<br/>
 <span style="color:grey">Flag which first combines the input RDD and merges multiple partial records into a single record before inserting or updating in DFS</span>
-- [withWriteStatusStorageLevel](#withWriteStatusStorageLevel) (level = MEMORY_AND_DISK_SER)<br/> 
+
+##### withWriteStatusStorageLevel(level = MEMORY_AND_DISK_SER) {#withWriteStatusStorageLevel} 
 Property: `hoodie.write.status.storage.level`<br/>
 <span style="color:grey">HoodieWriteClient.insert and HoodieWriteClient.upsert returns a persisted RDD[WriteStatus], this is because the Client can choose to inspect the WriteStatus and choose and commit or not based on the failures. This is a configuration for the storage level for this RDD </span>
-- [withAutoCommit](#withAutoCommit) (autoCommit = true)<br/> 
+
+##### withAutoCommit(autoCommit = true) {#withAutoCommit} 
 Property: `hoodie.auto.commit`<br/>
 <span style="color:grey">Should HoodieWriteClient autoCommit after insert and upsert. The client can choose to turn off auto-commit and commit on a "defined success condition"</span>
-- [withAssumeDatePartitioning](#withAssumeDatePartitioning) (assumeDatePartitioning = false)<br/> 
-Property: ` hoodie.assume.date.partitioning`<br/>
+
+##### withAssumeDatePartitioning(assumeDatePartitioning = false) {#withAssumeDatePartitioning} 
+Property: `hoodie.assume.date.partitioning`<br/>
 <span style="color:grey">Should HoodieWriteClient assume the data is partitioned by dates, i.e three levels from base path. This is a stop-gap to support tables created by versions < 0.3.1. Will be removed eventually </span>
-- [withConsistencyCheckEnabled](#withConsistencyCheckEnabled) (enabled = false)<br/> 
+
+##### withConsistencyCheckEnabled(enabled = false) {#withConsistencyCheckEnabled} 
 Property: `hoodie.consistency.check.enabled`<br/>
 <span style="color:grey">Should HoodieWriteClient perform additional checks to ensure written files' are listable on the underlying filesystem/storage. Set this to true, to workaround S3's eventual consistency model and ensure all data written as a part of a commit is faithfully available for queries. </span>
 
 #### Index configs
 Following configs control indexing behavior, which tags incoming records as either inserts or updates to older records. 
 
-- [withIndexConfig](#withIndexConfig) (HoodieIndexConfig) <br/>
-    <span style="color:grey">This is pluggable to have a external index (HBase) or use the default bloom filter stored in the Parquet files</span>
-    - [withIndexType](#withIndexType) (indexType = BLOOM) <br/>
-    Property: `hoodie.index.type` <br/>
-    <span style="color:grey">Type of index to use. Default is Bloom filter. Possible options are [BLOOM | HBASE | INMEMORY]. Bloom filters removes the dependency on a external system and is stored in the footer of the Parquet Data Files</span>
-    - [bloomFilterNumEntries](#bloomFilterNumEntries) (numEntries = 60000) <br/>
-    Property: `hoodie.index.bloom.num_entries` <br/>
-    <span style="color:grey">Only applies if index type is BLOOM. <br/>This is the number of entries to be stored in the bloom filter. We assume the maxParquetFileSize is 128MB and averageRecordSize is 1024B and hence we approx a total of 130K records in a file. The default (60000) is roughly half of this approximation. [HUDI-56](https://issues.apache.org/jira/browse/HUDI-56) tracks computing this dynamically. Warning: Setting this very low, will generate a lot of false positives and ind [...]
-    - [bloomFilterFPP](#bloomFilterFPP) (fpp = 0.000000001) <br/>
-    Property: `hoodie.index.bloom.fpp` <br/>
-    <span style="color:grey">Only applies if index type is BLOOM. <br/> Error rate allowed given the number of entries. This is used to calculate how many bits should be assigned for the bloom filter and the number of hash functions. This is usually set very low (default: 0.000000001), we like to tradeoff disk space for lower false positives</span>
-    - [bloomIndexPruneByRanges](#bloomIndexPruneByRanges) (pruneRanges = true) <br/>
-    Property: `hoodie.bloom.index.prune.by.ranges` <br/>
-    <span style="color:grey">Only applies if index type is BLOOM. <br/> When true, range information from files to leveraged speed up index lookups. Particularly helpful, if the key has a monotonously increasing prefix, such as timestamp.</span>
-    - [bloomIndexUseCaching](#bloomIndexUseCaching) (useCaching = true) <br/>
-    Property: `hoodie.bloom.index.use.caching` <br/>
-    <span style="color:grey">Only applies if index type is BLOOM. <br/> When true, the input RDD will cached to speed up index lookup by reducing IO for computing parallelism or affected partitions</span>
-    - [bloomIndexParallelism](#bloomIndexParallelism) (0) <br/>
-    Property: `hoodie.bloom.index.parallelism` <br/>
-    <span style="color:grey">Only applies if index type is BLOOM. <br/> This is the amount of parallelism for index lookup, which involves a Spark Shuffle. By default, this is auto computed based on input workload characteristics</span>
-    - [hbaseZkQuorum](#hbaseZkQuorum) (zkString) [Required]<br/>
-    Property: `hoodie.index.hbase.zkquorum` <br/>
-    <span style="color:grey">Only application if index type is HBASE. HBase ZK Quorum url to connect to.</span>
-    - [hbaseZkPort](#hbaseZkPort) (port) [Required]<br/>
-    Property: `hoodie.index.hbase.zkport` <br/>
-    <span style="color:grey">Only application if index type is HBASE. HBase ZK Quorum port to connect to.</span>
-    - [hbaseTableName](#hbaseTableName) (tableName) [Required]<br/>
-    Property: `hoodie.index.hbase.table` <br/>
-    <span style="color:grey">Only application if index type is HBASE. HBase Table name to use as the index. Hudi stores the row_key and [partition_path, fileID, commitTime] mapping in the table.</span>
+[withIndexConfig](#withIndexConfig) (HoodieIndexConfig) <br/>
+<span style="color:grey">This is pluggable to have a external index (HBase) or use the default bloom filter stored in the Parquet files</span>
+        
+##### withIndexType(indexType = BLOOM) {#withIndexType}
+Property: `hoodie.index.type` <br/>
+<span style="color:grey">Type of index to use. Default is Bloom filter. Possible options are [BLOOM | HBASE | INMEMORY]. Bloom filters removes the dependency on a external system and is stored in the footer of the Parquet Data Files</span>
+
+##### bloomFilterNumEntries(numEntries = 60000) {#bloomFilterNumEntries}
+Property: `hoodie.index.bloom.num_entries` <br/>
+<span style="color:grey">Only applies if index type is BLOOM. <br/>This is the number of entries to be stored in the bloom filter. We assume the maxParquetFileSize is 128MB and averageRecordSize is 1024B and hence we approx a total of 130K records in a file. The default (60000) is roughly half of this approximation. [HUDI-56](https://issues.apache.org/jira/browse/HUDI-56) tracks computing this dynamically. Warning: Setting this very low, will generate a lot of false positives and index l [...]
+
+##### bloomFilterFPP(fpp = 0.000000001) {#bloomFilterFPP}
+Property: `hoodie.index.bloom.fpp` <br/>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> Error rate allowed given the number of entries. This is used to calculate how many bits should be assigned for the bloom filter and the number of hash functions. This is usually set very low (default: 0.000000001), we like to tradeoff disk space for lower false positives</span>
+
+##### bloomIndexPruneByRanges(pruneRanges = true) {#bloomIndexPruneByRanges}
+Property: `hoodie.bloom.index.prune.by.ranges` <br/>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> When true, range information from files to leveraged speed up index lookups. Particularly helpful, if the key has a monotonously increasing prefix, such as timestamp.</span>
+
+##### bloomIndexUseCaching(useCaching = true) {#bloomIndexUseCaching}
+Property: `hoodie.bloom.index.use.caching` <br/>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> When true, the input RDD will cached to speed up index lookup by reducing IO for computing parallelism or affected partitions</span>
 
+##### bloomIndexParallelism(0) {#bloomIndexParallelism}
+Property: `hoodie.bloom.index.parallelism` <br/>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> This is the amount of parallelism for index lookup, which involves a Spark Shuffle. By default, this is auto computed based on input workload characteristics</span>
+
+##### hbaseZkQuorum(zkString) [Required] {#hbaseZkQuorum}  
+Property: `hoodie.index.hbase.zkquorum` <br/>
+<span style="color:grey">Only application if index type is HBASE. HBase ZK Quorum url to connect to.</span>
+
+##### hbaseZkPort(port) [Required] {#hbaseZkPort}  
+Property: `hoodie.index.hbase.zkport` <br/>
+<span style="color:grey">Only application if index type is HBASE. HBase ZK Quorum port to connect to.</span>
+
+##### hbaseTableName(tableName)  [Required] {#hbaseTableName}
+Property: `hoodie.index.hbase.table` <br/>
+<span style="color:grey">Only application if index type is HBASE. HBase Table name to use as the index. Hudi stores the row_key and [partition_path, fileID, commitTime] mapping in the table.</span>
+
+    
 #### Storage configs
 Controls aspects around sizing parquet and log files.
 
-- [withStorageConfig](#withStorageConfig) (HoodieStorageConfig) <br/>
-    - [limitFileSize](#limitFileSize) (size = 120MB) <br/>
-    Property: `hoodie.parquet.max.file.size` <br/>
-    <span style="color:grey">Target size for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance. </span>
-    - [parquetBlockSize](#parquetBlockSize) (rowgroupsize = 120MB) <br/>
-    Property: `hoodie.parquet.block.size` <br/>
-    <span style="color:grey">Parquet RowGroup size. Its better this is same as the file size, so that a single column within a file is stored continuously on disk</span>
-    - [parquetPageSize](#parquetPageSize) (pagesize = 1MB) <br/>
-    Property: `hoodie.parquet.page.size` <br/>
-    <span style="color:grey">Parquet page size. Page is the unit of read within a parquet file. Within a block, pages are compressed seperately. </span>
-    - [parquetCompressionRatio](#parquetCompressionRatio) (parquetCompressionRatio = 0.1) <br/>
-    Property: `hoodie.parquet.compression.ratio` <br/>
-    <span style="color:grey">Expected compression of parquet data used by Hudi, when it tries to size new parquet files. Increase this value, if bulk_insert is producing smaller than expected sized files</span>
-    - [parquetCompressionCodec](#parquetCompressionCodec) (parquetCompressionCodec = gzip) <br/>
-    Property: `hoodie.parquet.compression.codec` <br/>
-    <span style="color:grey">Parquet compression codec name. Default is gzip. Possible options are [gzip | snappy | uncompressed | lzo]</span>
-    - [logFileMaxSize](#logFileMaxSize) (logFileSize = 1GB) <br/>
-    Property: `hoodie.logfile.max.size` <br/>
-    <span style="color:grey">LogFile max size. This is the maximum size allowed for a log file before it is rolled over to the next version. </span>
-    - [logFileDataBlockMaxSize](#logFileDataBlockMaxSize) (dataBlockSize = 256MB) <br/>
-    Property: `hoodie.logfile.data.block.max.size` <br/>
-    <span style="color:grey">LogFile Data block max size. This is the maximum size allowed for a single data block to be appended to a log file. This helps to make sure the data appended to the log file is broken up into sizable blocks to prevent from OOM errors. This size should be greater than the JVM memory. </span>
-    - [logFileToParquetCompressionRatio](#logFileToParquetCompressionRatio) (logFileToParquetCompressionRatio = 0.35) <br/>
-    Property: `hoodie.logfile.to.parquet.compression.ratio` <br/>
-    <span style="color:grey">Expected additional compression as records move from log files to parquet. Used for merge_on_read storage to send inserts into log files & control the size of compacted parquet file.</span>
+[withStorageConfig](#withStorageConfig) (HoodieStorageConfig) <br/>
+
+##### limitFileSize (size = 120MB) {#limitFileSize}
+Property: `hoodie.parquet.max.file.size` <br/>
+<span style="color:grey">Target size for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance. </span>
+
+##### parquetBlockSize(rowgroupsize = 120MB) {#parquetBlockSize} 
+Property: `hoodie.parquet.block.size` <br/>
+<span style="color:grey">Parquet RowGroup size. Its better this is same as the file size, so that a single column within a file is stored continuously on disk</span>
+
+##### parquetPageSize(pagesize = 1MB) {#parquetPageSize} 
+Property: `hoodie.parquet.page.size` <br/>
+<span style="color:grey">Parquet page size. Page is the unit of read within a parquet file. Within a block, pages are compressed seperately. </span>
+
+##### parquetCompressionRatio(parquetCompressionRatio = 0.1) {#parquetCompressionRatio} 
+Property: `hoodie.parquet.compression.ratio` <br/>
+<span style="color:grey">Expected compression of parquet data used by Hudi, when it tries to size new parquet files. Increase this value, if bulk_insert is producing smaller than expected sized files</span>
+
+##### parquetCompressionCodec(parquetCompressionCodec = gzip) {#parquetCompressionCodec} 
+Property: `hoodie.parquet.compression.codec` <br/>
+<span style="color:grey">Parquet compression codec name. Default is gzip. Possible options are [gzip | snappy | uncompressed | lzo]</span>
+
+##### logFileMaxSize(logFileSize = 1GB) {#logFileMaxSize} 
+Property: `hoodie.logfile.max.size` <br/>
+<span style="color:grey">LogFile max size. This is the maximum size allowed for a log file before it is rolled over to the next version. </span>
+
+##### logFileDataBlockMaxSize(dataBlockSize = 256MB) {#logFileDataBlockMaxSize} 
+Property: `hoodie.logfile.data.block.max.size` <br/>
+<span style="color:grey">LogFile Data block max size. This is the maximum size allowed for a single data block to be appended to a log file. This helps to make sure the data appended to the log file is broken up into sizable blocks to prevent from OOM errors. This size should be greater than the JVM memory. </span>
+
+##### logFileToParquetCompressionRatio(logFileToParquetCompressionRatio = 0.35) {#logFileToParquetCompressionRatio} 
+Property: `hoodie.logfile.to.parquet.compression.ratio` <br/>
+<span style="color:grey">Expected additional compression as records move from log files to parquet. Used for merge_on_read storage to send inserts into log files & control the size of compacted parquet file.</span>
  
+    
 #### Compaction configs
 Configs that control compaction (merging of log files onto a new parquet base file), cleaning (reclamation of older/unused file groups).
+[withCompactionConfig](#withCompactionConfig) (HoodieCompactionConfig) <br/>
+
+##### withCleanerPolicy(policy = KEEP_LATEST_COMMITS) {#withCleanerPolicy} 
+Property: `hoodie.cleaner.policy` <br/>
+<span style="color:grey"> Cleaning policy to be used. Hudi will delete older versions of parquet files to re-claim space. Any Query/Computation referring to this version of the file will fail. It is good to make sure that the data is retained for more than the maximum query execution time.</span>
+
+##### retainCommits(no_of_commits_to_retain = 24) {#retainCommits} 
+Property: `hoodie.cleaner.commits.retained` <br/>
+<span style="color:grey">Number of commits to retain. So data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much you can incrementally pull on this dataset</span>
+
+##### archiveCommitsWith(minCommits = 96, maxCommits = 128) {#archiveCommitsWith} 
+Property: `hoodie.keep.min.commits`, `hoodie.keep.max.commits` <br/>
+<span style="color:grey">Each commit is a small file in the `.hoodie` directory. Since DFS typically does not favor lots of small files, Hudi archives older commits into a sequential log. A commit is published atomically by a rename of the commit file.</span>
+
+##### compactionSmallFileSize(size = 0) {#compactionSmallFileSize} 
+Property: `hoodie.parquet.small.file.limit` <br/>
+<span style="color:grey">This should be less < maxFileSize and setting it to 0, turns off this feature. Small files can always happen because of the number of insert records in a partition in a batch. Hudi has an option to auto-resolve small files by masking inserts into this partition as updates to existing small files. The size here is the minimum file size considered as a "small file size".</span>
+
+##### insertSplitSize(size = 500000) {#insertSplitSize} 
+Property: `hoodie.copyonwrite.insert.split.size` <br/>
+<span style="color:grey">Insert Write Parallelism. Number of inserts grouped for a single partition. Writing out 100MB files, with atleast 1kb records, means 100K records per file. Default is to overprovision to 500K. To improve insert latency, tune this to match the number of records in a single file. Setting this to a low number, will result in small files (particularly when compactionSmallFileSize is 0)</span>
+
+##### autoTuneInsertSplits(true) {#autoTuneInsertSplits} 
+Property: `hoodie.copyonwrite.insert.auto.split` <br/>
+<span style="color:grey">Should hudi dynamically compute the insertSplitSize based on the last 24 commit's metadata. Turned off by default. </span>
+
+##### approxRecordSize(size = 1024) {#approxRecordSize} 
+Property: `hoodie.copyonwrite.record.size.estimate` <br/>
+<span style="color:grey">The average record size. If specified, hudi will use this and not compute dynamically based on the last 24 commit's metadata. No value set as default. This is critical in computing the insert parallelism and bin-packing inserts into small files. See above.</span>
+
+##### withInlineCompaction(inlineCompaction = false) {#withInlineCompaction} 
+Property: `hoodie.compact.inline` <br/>
+<span style="color:grey">When set to true, compaction is triggered by the ingestion itself, right after a commit/deltacommit action as part of insert/upsert/bulk_insert</span>
+
+##### withMaxNumDeltaCommitsBeforeCompaction(maxNumDeltaCommitsBeforeCompaction = 10) {#withMaxNumDeltaCommitsBeforeCompaction} 
+Property: `hoodie.compact.inline.max.delta.commits` <br/>
+<span style="color:grey">Number of max delta commits to keep before triggering an inline compaction</span>
+
+##### withCompactionLazyBlockReadEnabled(true) {#withCompactionLazyBlockReadEnabled} 
+Property: `hoodie.compaction.lazy.block.read` <br/>
+<span style="color:grey">When a CompactedLogScanner merges all log files, this config helps to choose whether the logblocks should be read lazily or not. Choose true to use I/O intensive lazy block reading (low memory usage) or false for Memory intensive immediate block read (high memory usage)</span>
+
+##### withCompactionReverseLogReadEnabled(false) {#withCompactionReverseLogReadEnabled} 
+Property: `hoodie.compaction.reverse.log.read` <br/>
+<span style="color:grey">HoodieLogFormatReader reads a logfile in the forward direction starting from pos=0 to pos=file_length. If this config is set to true, the Reader reads the logfile in reverse direction, from pos=file_length to pos=0</span>
+
+##### withCleanerParallelism(cleanerParallelism = 200) {#withCleanerParallelism} 
+Property: `hoodie.cleaner.parallelism` <br/>
+<span style="color:grey">Increase this if cleaning becomes slow.</span>
+
+##### withCompactionStrategy(compactionStrategy = com.uber.hoodie.io.compact.strategy.LogFileSizeBasedCompactionStrategy) {#withCompactionStrategy} 
+Property: `hoodie.compaction.strategy` <br/>
+<span style="color:grey">Compaction strategy decides which file groups are picked up for compaction during each compaction run. By default. Hudi picks the log file with most accumulated unmerged data</span>
+
+##### withTargetIOPerCompactionInMB(targetIOPerCompactionInMB = 500000) {#withTargetIOPerCompactionInMB} 
+Property: `hoodie.compaction.target.io` <br/>
+<span style="color:grey">Amount of MBs to spend during compaction run for the LogFileSizeBasedCompactionStrategy. This value helps bound ingestion latency while compaction is run inline mode.</span>
+
+##### withTargetPartitionsPerDayBasedCompaction(targetPartitionsPerCompaction = 10) {#withTargetPartitionsPerDayBasedCompaction} 
+Property: `hoodie.compaction.daybased.target` <br/>
+<span style="color:grey">Used by com.uber.hoodie.io.compact.strategy.DayBasedCompactionStrategy to denote the number of latest partitions to compact during a compaction run.</span>    
+
+##### withPayloadClass(payloadClassName = com.uber.hoodie.common.model.HoodieAvroPayload) {#payloadClassName} 
+Property: `hoodie.compaction.payload.class` <br/>
+<span style="color:grey">This needs to be same as class used during insert/upserts. Just like writing, compaction also uses the record payload class to merge records in the log against each other, merge again with the base file and produce the final record to be written after compaction.</span>
 
-- [withCompactionConfig](#withCompactionConfig) (HoodieCompactionConfig) <br/>
-    - [withCleanerPolicy](#withCleanerPolicy) (policy = KEEP_LATEST_COMMITS) <br/>
-    Property: `hoodie.cleaner.policy` <br/>
-    <span style="color:grey"> Cleaning policy to be used. Hudi will delete older versions of parquet files to re-claim space. Any Query/Computation referring to this version of the file will fail. It is good to make sure that the data is retained for more than the maximum query execution time.</span>
-    - [retainCommits](#retainCommits) (no_of_commits_to_retain = 24) <br/>
-    Property: `hoodie.cleaner.commits.retained` <br/>
-    <span style="color:grey">Number of commits to retain. So data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much you can incrementally pull on this dataset</span>
-    - [archiveCommitsWith](#archiveCommitsWith) (minCommits = 96, maxCommits = 128) <br/>
-    Property: `hoodie.keep.min.commits`, `hoodie.keep.max.commits` <br/>
-    <span style="color:grey">Each commit is a small file in the `.hoodie` directory. Since DFS typically does not favor lots of small files, Hudi archives older commits into a sequential log. A commit is published atomically by a rename of the commit file.</span>
-    - [compactionSmallFileSize](#compactionSmallFileSize) (size = 0) <br/>
-    Property: `hoodie.parquet.small.file.limit` <br/>
-    <span style="color:grey">This should be less < maxFileSize and setting it to 0, turns off this feature. Small files can always happen because of the number of insert records in a partition in a batch. Hudi has an option to auto-resolve small files by masking inserts into this partition as updates to existing small files. The size here is the minimum file size considered as a "small file size".</span>
-    - [insertSplitSize](#insertSplitSize) (size = 500000) <br/>
-    Property: `hoodie.copyonwrite.insert.split.size` <br/>
-    <span style="color:grey">Insert Write Parallelism. Number of inserts grouped for a single partition. Writing out 100MB files, with atleast 1kb records, means 100K records per file. Default is to overprovision to 500K. To improve insert latency, tune this to match the number of records in a single file. Setting this to a low number, will result in small files (particularly when compactionSmallFileSize is 0)</span>
-    - [autoTuneInsertSplits](#autoTuneInsertSplits) (true) <br/>
-    Property: `hoodie.copyonwrite.insert.auto.split` <br/>
-    <span style="color:grey">Should hudi dynamically compute the insertSplitSize based on the last 24 commit's metadata. Turned off by default. </span>
-    - [approxRecordSize](#approxRecordSize) () <br/>
-    Property: `hoodie.copyonwrite.record.size.estimate` <br/>
-    <span style="color:grey">The average record size. If specified, hudi will use this and not compute dynamically based on the last 24 commit's metadata. No value set as default. This is critical in computing the insert parallelism and bin-packing inserts into small files. See above.</span>
-    - [withInlineCompaction](#withInlineCompaction) (inlineCompaction = false) <br/>
-    Property: `hoodie.compact.inline` <br/>
-    <span style="color:grey">When set to true, compaction is triggered by the ingestion itself, right after a commit/deltacommit action as part of insert/upsert/bulk_insert</span>
-    - [withMaxNumDeltaCommitsBeforeCompaction](#withMaxNumDeltaCommitsBeforeCompaction) (maxNumDeltaCommitsBeforeCompaction = 10) <br/>
-    Property: `hoodie.compact.inline.max.delta.commits` <br/>
-    <span style="color:grey">Number of max delta commits to keep before triggering an inline compaction</span>
-    - [withCompactionLazyBlockReadEnabled](#withCompactionLazyBlockReadEnabled) (true) <br/>
-    Property: `hoodie.compaction.lazy.block.read` <br/>
-    <span style="color:grey">When a CompactedLogScanner merges all log files, this config helps to choose whether the logblocks should be read lazily or not. Choose true to use I/O intensive lazy block reading (low memory usage) or false for Memory intensive immediate block read (high memory usage)</span>
-    - [withCompactionReverseLogReadEnabled](#withCompactionReverseLogReadEnabled) (false) <br/>
-    Property: `hoodie.compaction.reverse.log.read` <br/>
-    <span style="color:grey">HoodieLogFormatReader reads a logfile in the forward direction starting from pos=0 to pos=file_length. If this config is set to true, the Reader reads the logfile in reverse direction, from pos=file_length to pos=0</span>
-    - [withCleanerParallelism](#withCleanerParallelism) (cleanerParallelism = 200) <br/>
-    Property: `hoodie.cleaner.parallelism` <br/>
-    <span style="color:grey">Increase this if cleaning becomes slow.</span>
-    - [withCompactionStrategy](#withCompactionStrategy) (compactionStrategy = com.uber.hoodie.io.compact.strategy.LogFileSizeBasedCompactionStrategy) <br/>
-    Property: `hoodie.compaction.strategy` <br/>
-    <span style="color:grey">Compaction strategy decides which file groups are picked up for compaction during each compaction run. By default. Hudi picks the log file with most accumulated unmerged data</span>
-    - [withTargetIOPerCompactionInMB](#withTargetIOPerCompactionInMB) (targetIOPerCompactionInMB = 500000) <br/>
-    Property: `hoodie.compaction.target.io` <br/>
-    <span style="color:grey">Amount of MBs to spend during compaction run for the LogFileSizeBasedCompactionStrategy. This value helps bound ingestion latency while compaction is run inline mode.</span>
-    - [withTargetPartitionsPerDayBasedCompaction](#withTargetPartitionsPerDayBasedCompaction) (targetPartitionsPerCompaction = 10) <br/>
-    Property: `hoodie.compaction.daybased.target` <br/>
-    <span style="color:grey">Used by com.uber.hoodie.io.compact.strategy.DayBasedCompactionStrategy to denote the number of latest partitions to compact during a compaction run.</span>    
-    - [withPayloadClass](#payloadClassName) (payloadClassName = com.uber.hoodie.common.model.HoodieAvroPayload) <br/>
-    Property: `hoodie.compaction.payload.class` <br/>
-    <span style="color:grey">This needs to be same as class used during insert/upserts. Just like writing, compaction also uses the record payload class to merge records in the log against each other, merge again with the base file and produce the final record to be written after compaction.</span>
 
     
 #### Metrics configs
 Enables reporting of Hudi metrics to graphite.
-
-- [withMetricsConfig](#withMetricsConfig) (HoodieMetricsConfig) <br/>
+[withMetricsConfig](#withMetricsConfig) (HoodieMetricsConfig) <br/>
 <span style="color:grey">Hudi publishes metrics on every commit, clean, rollback etc.</span>
-    - [on](#on) (metricsOn = true) <br/>
-    Property: `hoodie.metrics.on` <br/>
-    <span style="color:grey">Turn sending metrics on/off. on by default.</span>
-    - [withReporterType](#withReporterType) (reporterType = GRAPHITE) <br/>
-    Property: `hoodie.metrics.reporter.type` <br/>
-    <span style="color:grey">Type of metrics reporter. Graphite is the default and the only value suppported.</span>
-    - [toGraphiteHost](#toGraphiteHost) (host = localhost) <br/>
-    Property: `hoodie.metrics.graphite.host` <br/>
-    <span style="color:grey">Graphite host to connect to</span>
-    - [onGraphitePort](#onGraphitePort) (port = 4756) <br/>
-    Property: `hoodie.metrics.graphite.port` <br/>
-    <span style="color:grey">Graphite port to connect to</span>
-    - [usePrefix](#usePrefix) (prefix = "") <br/>
-    Property: `hoodie.metrics.graphite.metric.prefix` <br/>
-    <span style="color:grey">Standard prefix applied to all metrics. This helps to add datacenter, environment information for e.g</span>
 
+##### on(metricsOn = true) {#on} 
+Property: `hoodie.metrics.on` <br/>
+<span style="color:grey">Turn sending metrics on/off. on by default.</span>
+
+##### withReporterType(reporterType = GRAPHITE) {#withReporterType} 
+Property: `hoodie.metrics.reporter.type` <br/>
+<span style="color:grey">Type of metrics reporter. Graphite is the default and the only value suppported.</span>
+
+##### toGraphiteHost(host = localhost) {#toGraphiteHost} 
+Property: `hoodie.metrics.graphite.host` <br/>
+<span style="color:grey">Graphite host to connect to</span>
+
+##### onGraphitePort(port = 4756) {#onGraphitePort} 
+Property: `hoodie.metrics.graphite.port` <br/>
+<span style="color:grey">Graphite port to connect to</span>
+
+##### usePrefix(prefix = "") {#usePrefix} 
+Property: `hoodie.metrics.graphite.metric.prefix` <br/>
+<span style="color:grey">Standard prefix applied to all metrics. This helps to add datacenter, environment information for e.g</span>
+    
 #### Memory configs
 Controls memory usage for compaction and merges, performed internally by Hudi
-
-- [withMemoryConfig](#withMemoryConfig) (HoodieMemoryConfig) <br/>
+[withMemoryConfig](#withMemoryConfig) (HoodieMemoryConfig) <br/>
 <span style="color:grey">Memory related configs</span>
-    - [withMaxMemoryFractionPerPartitionMerge](#withMaxMemoryFractionPerPartitionMerge) (maxMemoryFractionPerPartitionMerge = 0.6) <br/>
-    Property: `hoodie.memory.merge.fraction` <br/>
-    <span style="color:grey">This fraction is multiplied with the user memory fraction (1 - spark.memory.fraction) to get a final fraction of heap space to use during merge </span>
-    - [withMaxMemorySizePerCompactionInBytes](#withMaxMemorySizePerCompactionInBytes) (maxMemorySizePerCompactionInBytes = 1GB) <br/>
-    Property: `hoodie.memory.compaction.fraction` <br/>
-    <span style="color:grey">HoodieCompactedLogScanner reads logblocks, converts records to HoodieRecords and then merges these log blocks and records. At any point, the number of entries in a log block can be less than or equal to the number of entries in the corresponding parquet file. This can lead to OOM in the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use this config to set the max allowable inMemory footprint of the spillable map.</span>
 
+##### withMaxMemoryFractionPerPartitionMerge(maxMemoryFractionPerPartitionMerge = 0.6) {#withMaxMemoryFractionPerPartitionMerge} 
+Property: `hoodie.memory.merge.fraction` <br/>
+<span style="color:grey">This fraction is multiplied with the user memory fraction (1 - spark.memory.fraction) to get a final fraction of heap space to use during merge </span>
+
+##### withMaxMemorySizePerCompactionInBytes(maxMemorySizePerCompactionInBytes = 1GB) {#withMaxMemorySizePerCompactionInBytes} 
+Property: `hoodie.memory.compaction.fraction` <br/>
+<span style="color:grey">HoodieCompactedLogScanner reads logblocks, converts records to HoodieRecords and then merges these log blocks and records. At any point, the number of entries in a log block can be less than or equal to the number of entries in the corresponding parquet file. This can lead to OOM in the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use this config to set the max allowable inMemory footprint of the spillable map.</span>
 
diff --git a/docs/css/customstyles.css b/docs/css/customstyles.css
index 56dcdba..8ef7e65 100644
--- a/docs/css/customstyles.css
+++ b/docs/css/customstyles.css
@@ -610,6 +610,10 @@ a.fa.fa-envelope-o.mailto {
 h3 {color: #545253; font-weight:normal; font-size:130%;}
 h4 {color: #808080; font-weight:normal; font-size:120%; font-style:italic;}
 
+h5 {
+    font-weight: normal;
+}
+
 .alert, .callout {
     overflow: hidden;
 }
diff --git a/docs/quickstart.md b/docs/docker_demo.md
similarity index 83%
copy from docs/quickstart.md
copy to docs/docker_demo.md
index 317bb17..23a5a4f 100644
--- a/docs/quickstart.md
+++ b/docs/docker_demo.md
@@ -1,314 +1,21 @@
 ---
-title: Quickstart
-keywords: hudi, quickstart
-tags: [quickstart]
+title: Docker Demo
+keywords: hudi, docker, demo
+tags: [hudi, demo]
 sidebar: mydoc_sidebar
 toc: false
-permalink: quickstart.html
+permalink: docker_demo.html
 ---
 
 
 
-## Download Hudi
-
-Check out code and pull it into Intellij as a normal maven project. Normally build the maven project, from command line
-
-```
-$ mvn clean install -DskipTests -DskipITs
-```
-
-To work with older version of Hive (pre Hive-1.2.1), use
-```
-$ mvn clean install -DskipTests -DskipITs -Dhive11
-```
-
-{% include callout.html content="You might want to add your spark jars folder to project dependencies under 'Module Setttings', to be able to run Spark from IDE. 
-Setup your local hadoop/hive test environment, so you can play with entire ecosystem." type="info" %}
-
-<br/>Please refer to [migration guide](migration_guide.html), for recommended ways to migrate your existing dataset to Hudi.
-
-## Version Compatibility
-
-Hudi requires Java 8 to be installed on a *nix system. Hudi works with Spark-2.x versions. 
-Further, we have verified that Hudi works with the following combination of Hadoop/Hive/Spark.
-
-| Hadoop | Hive  | Spark | Instructions to Build Hudi |
-| ---- | ----- | ---- | ---- |
-| 2.6.0-cdh5.7.2 | 1.1.0-cdh5.7.2 | spark-2.[1-3].x | Use “mvn clean install -DskipTests -Dhadoop.version=2.6.0-cdh5.7.2 -Dhive.version=1.1.0-cdh5.7.2” |
-| Apache hadoop-2.8.4 | Apache hive-2.3.3 | spark-2.[1-3].x | Use "mvn clean install -DskipTests" |
-| Apache hadoop-2.7.3 | Apache hive-1.2.1 | spark-2.[1-3].x | Use "mvn clean install -DskipTests" |
-
-If your environment has other versions of hadoop/hive/spark, please try out Hudi and let us know if there are any issues.
-We are limited by our bandwidth to certify other combinations (e.g Docker on Windows).
-It would be of great help if you can reach out to us with your setup and experience with hudi.
-
-## Generate a Hudi Dataset
-
-### Requirements & Environment Variable
-
-Please set the following environment variablies according to your setup. We have given an example setup with CDH version
-
-```
-export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
-export HIVE_HOME=/var/hadoop/setup/apache-hive-1.1.0-cdh5.7.2-bin
-export HADOOP_HOME=/var/hadoop/setup/hadoop-2.6.0-cdh5.7.2
-export HADOOP_INSTALL=/var/hadoop/setup/hadoop-2.6.0-cdh5.7.2
-export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
-export SPARK_HOME=/var/hadoop/setup/spark-2.3.1-bin-hadoop2.7
-export SPARK_INSTALL=$SPARK_HOME
-export SPARK_CONF_DIR=$SPARK_HOME/conf
-export PATH=$JAVA_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$SPARK_INSTALL/bin:$PATH
-```
-
-### Supported API's
-
-Use the DataSource API to quickly start reading or writing Hudi datasets in few lines of code. Ideal for most
-ingestion use-cases.
-Use the RDD API to perform more involved actions on a Hudi dataset
-
-#### DataSource API
-
-Run __hoodie-spark/src/test/java/HoodieJavaApp.java__ class, to place a two commits (commit 1 => 100 inserts, commit 2 => 100 updates to previously inserted 100 records) onto your DFS/local filesystem. Use the wrapper script
-to run from command-line
-
-```
-cd hoodie-spark
-./run_hoodie_app.sh --help
-Usage: <main class> [options]
-  Options:
-    --help, -h
-       Default: false
-    --table-name, -n
-       table name for Hudi sample table
-       Default: hoodie_rt
-    --table-path, -p
-       path for Hudi sample table
-       Default: file:///tmp/hoodie/sample-table
-    --table-type, -t
-       One of COPY_ON_WRITE or MERGE_ON_READ
-       Default: COPY_ON_WRITE
-
-
-```
-
-The class lets you choose table names, output paths and one of the storage types. In your own applications, be sure to include the `hoodie-spark` module as dependency
-and follow a similar pattern to write/read datasets via the datasource.
-
-#### RDD API
-
-RDD level APIs give you more power and control over things, via the `hoodie-client` module .
-Refer to  __hoodie-client/src/test/java/HoodieClientExample.java__ class for an example.
-
-
-
-## Query a Hudi dataset
-
-### Register Dataset to Hive Metastore
-
-Now, lets see how we can publish this data into Hive.
-
-#### Starting up Hive locally
-
-```
-hdfs namenode # start name node
-hdfs datanode # start data node
-
-bin/hive --service metastore  # start metastore
-bin/hiveserver2 \
-  --hiveconf hive.root.logger=INFO,console \
-  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
-  --hiveconf ive.stats.autogather=false \
-  --hiveconf hive.aux.jars.path=hoodie/packaging/hoodie-hadoop-mr-bundle/target/hoodie-hadoop-mr-bundle-0.4.3-SNAPSHOT.jar
-
-```
-
-
-#### Hive Sync Tool
-
-Hive Sync Tool will update/create the necessary metadata(schema and partitions) in hive metastore.
-This allows for schema evolution and incremental addition of new partitions written to.
-It uses an incremental approach by storing the last commit time synced in the TBLPROPERTIES and only syncing the commits from the last sync commit time stored.
-This can be run as frequently as the ingestion pipeline to make sure new partitions and schema evolution changes are reflected immediately.
-
-```
-cd hoodie-hive
-./run_sync_tool.sh
-  --user hive
-  --pass hive
-  --database default
-  --jdbc-url "jdbc:hive2://localhost:10010/"
-  --base-path tmp/hoodie/sample-table/
-  --table hoodie_test
-  --partitioned-by field1,field2
-
-```
-
-
-
-#### Manually via Beeline
-Add in the hoodie-hadoop-mr-bundler jar so, Hive can read the Hudi dataset and answer the query.
-Also, For reading Hudi tables using hive, the following configs needs to be setup
-
-```
-hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
-hive> set hive.stats.autogather=false;
-hive> add jar file:///tmp/hoodie-hadoop-mr-bundle-0.4.3.jar;
-Added [file:///tmp/hoodie-hadoop-mr-bundle-0.4.3.jar] to class path
-Added resources: [file:///tmp/hoodie-hadoop-mr-bundle-0.4.3.jar]
-```
-
-Then, you need to create a __ReadOptimized__ Hive table as below (only type supported as of now)and register the sample partitions
-
-```
-drop table hoodie_test;
-CREATE EXTERNAL TABLE hoodie_test(`_row_key`  string,
-`_hoodie_commit_time` string,
-`_hoodie_commit_seqno` string,
- rider string,
- driver string,
- begin_lat double,
- begin_lon double,
- end_lat double,
- end_lon double,
- fare double)
-PARTITIONED BY (`datestr` string)
-ROW FORMAT SERDE
-   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
-STORED AS INPUTFORMAT
-   'com.uber.hoodie.hadoop.HoodieInputFormat'
-OUTPUTFORMAT
-   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
-LOCATION
-   'hdfs:///tmp/hoodie/sample-table';
-
-ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2016-03-15') LOCATION 'hdfs:///tmp/hoodie/sample-table/2016/03/15';
-ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-16') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/16';
-ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-17') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/17';
-
-set mapreduce.framework.name=yarn;
-```
-
-And you can generate a __Realtime__ Hive table, as below
-
-```
-DROP TABLE hoodie_rt;
-CREATE EXTERNAL TABLE hoodie_rt(
-`_hoodie_commit_time` string,
-`_hoodie_commit_seqno` string,
-`_hoodie_record_key` string,
-`_hoodie_partition_path` string,
-`_hoodie_file_name` string,
- timestamp double,
- `_row_key` string,
- rider string,
- driver string,
- begin_lat double,
- begin_lon double,
- end_lat double,
- end_lon double,
- fare double)
-PARTITIONED BY (`datestr` string)
-ROW FORMAT SERDE
-   'com.uber.hoodie.hadoop.realtime.HoodieParquetSerde'
-STORED AS INPUTFORMAT
-   'com.uber.hoodie.hadoop.realtime.HoodieRealtimeInputFormat'
-OUTPUTFORMAT
-   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
-LOCATION
-   'file:///tmp/hoodie/sample-table';
-
-ALTER TABLE `hoodie_rt` ADD IF NOT EXISTS PARTITION (datestr='2016-03-15') LOCATION 'file:///tmp/hoodie/sample-table/2016/03/15';
-ALTER TABLE `hoodie_rt` ADD IF NOT EXISTS PARTITION (datestr='2015-03-16') LOCATION 'file:///tmp/hoodie/sample-table/2015/03/16';
-ALTER TABLE `hoodie_rt` ADD IF NOT EXISTS PARTITION (datestr='2015-03-17') LOCATION 'file:///tmp/hoodie/sample-table/2015/03/17';
-
-```
-
-
-
-### Using different query engines
-
-Now, we can proceed to query the dataset, as we would normally do across all the three query engines supported.
-
-#### HiveQL
-
-Let's first perform a query on the latest committed snapshot of the table
-
-```
-hive> select count(*) from hoodie_test;
-...
-OK
-100
-Time taken: 18.05 seconds, Fetched: 1 row(s)
-hive>
-```
-
-#### SparkSQL
-
-Spark is super easy, once you get Hive working as above. Just spin up a Spark Shell as below
-
-```
-$ cd $SPARK_INSTALL
-$ spark-shell --jars $HUDI_SRC/packaging/hoodie-spark-bundle/target/hoodie-spark-bundle-0.4.3-SNAPSHOT.jar --driver-class-path $HADOOP_CONF_DIR  --conf spark.sql.hive.convertMetastoreParquet=false --packages com.databricks:spark-avro_2.11:4.0.0
-
-scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
-scala> sqlContext.sql("show tables").show(10000)
-scala> sqlContext.sql("describe hoodie_test").show(10000)
-scala> sqlContext.sql("describe hoodie_rt").show(10000)
-scala> sqlContext.sql("select count(*) from hoodie_test").show(10000)
-```
-
-You can also use the sample queries in __hoodie-utilities/src/test/java/HoodieSparkSQLExample.java__ for running on `hoodie_rt`
-
-#### Presto
-
-Checkout the 'master' branch on OSS Presto, build it, and place your installation somewhere.
-
-* Copy the hudi/packaging/hoodie-presto-bundle/target/hoodie-presto-bundle-*.jar into $PRESTO_INSTALL/plugin/hive-hadoop2/
-* Startup your server and you should be able to query the same Hive table via Presto
-
-```
-show columns from hive.default.hoodie_test;
-select count(*) from hive.default.hoodie_test
-```
-
-
-
-## Incremental Queries of a Hudi dataset
-
-Let's now perform a query, to obtain the __ONLY__ changed rows since a commit in the past.
-
-```
-hive> set hoodie.hoodie_test.consume.mode=INCREMENTAL;
-hive> set hoodie.hoodie_test.consume.start.timestamp=001;
-hive> set hoodie.hoodie_test.consume.max.commits=10;
-hive> select `_hoodie_commit_time`, rider, driver from hoodie_test where `_hoodie_commit_time` > '001' limit 10;
-OK
-All commits :[001, 002]
-002	rider-001	driver-001
-002	rider-001	driver-001
-002	rider-002	driver-002
-002	rider-001	driver-001
-002	rider-001	driver-001
-002	rider-002	driver-002
-002	rider-001	driver-001
-002	rider-002	driver-002
-002	rider-002	driver-002
-002	rider-001	driver-001
-Time taken: 0.056 seconds, Fetched: 10 row(s)
-hive>
-hive>
-```
-
-
-{% include note.html content="This is only supported for Read-optimized tables for now." %}
-
 
 ## A Demo using docker containers
 
 Lets use a real world example to see how hudi works end to end. For this purpose, a self contained
 data infrastructure is brought up in a local docker cluster within your computer.
 
-The steps assume you are using Mac laptop
+The steps have been tested on a Mac laptop
 
 ### Prerequisites
 
@@ -316,7 +23,8 @@ The steps assume you are using Mac laptop
   * kafkacat : A command-line utility to publish/consume from kafka topics. Use `brew install kafkacat` to install kafkacat
   * /etc/hosts : The demo references many services running in container by the hostname. Add the following settings to /etc/hosts
 
-  ```
+
+```
    127.0.0.1 adhoc-1
    127.0.0.1 adhoc-2
    127.0.0.1 namenode
@@ -326,7 +34,10 @@ The steps assume you are using Mac laptop
    127.0.0.1 kafkabroker
    127.0.0.1 sparkmaster
    127.0.0.1 zookeeper
-  ```
+```
+
+Also, note that the demo has not been tested on some environments, such as Docker on Windows.
+
 
 ### Setting up Docker Cluster
 
diff --git a/docs/performance.md b/docs/performance.md
index 6795171..a4c0f27 100644
--- a/docs/performance.md
+++ b/docs/performance.md
@@ -1,14 +1,17 @@
 ---
-title: Implementation
+title: Performance
 keywords: hudi, index, storage, compaction, cleaning, implementation
 sidebar: mydoc_sidebar
 toc: false
 permalink: performance.html
 ---
-## Performance
 
 In this section, we go over some real world performance numbers for Hudi upserts, incremental pull and compare them against
-the conventional alternatives for achieving these tasks. Following shows the speed up obtained for NoSQL ingestion, 
+the conventional alternatives for achieving these tasks. 
+
+## Upserts
+
+The following shows the speedup obtained for NoSQL ingestion, 
 by switching from bulk loads off HBase to Parquet to incrementally upserting on a Hudi dataset, on 5 tables ranging from small to huge.
 
 {% include image.html file="hudi_upsert_perf1.png" alt="hudi_upsert_perf1.png" max-width="1000" %}
@@ -20,7 +23,7 @@ significant savings on the overall compute cost.
 
 Hudi upserts have been stress tested upto 4TB in a single commit across the t1 table.
 
-### Tuning
+## Tuning
 
 Writing data via Hudi happens as a Spark job and thus general rules of spark debugging applies here too. Below is a list of things to keep in mind, if you are looking to improving performance or reliability.
 
@@ -76,7 +79,7 @@ Below is a full working production config
 ````
 
 
-#### Read Optimized Query Performance
+## Read Optimized Query Performance
 
 The major design goal for read optimized view is to achieve the latency reduction & efficiency gains in previous section,
 with no impact on queries. Following charts compare the Hudi vs non-Hudi datasets across Hive/Presto/Spark queries and demonstrate this.
diff --git a/docs/querying_data.md b/docs/querying_data.md
index 452c92d..81186e3 100644
--- a/docs/querying_data.md
+++ b/docs/querying_data.md
@@ -7,57 +7,131 @@ toc: false
 summary: In this page, we go over how to enable SQL queries on Hudi built tables.
 ---
 
-Hudi registers the dataset into the Hive metastore backed by `HoodieInputFormat`. This makes the data accessible to
-Hive & Spark & Presto automatically. To be able to perform normal SQL queries on such a dataset, we need to get the individual query engines
-to call `HoodieInputFormat.getSplits()`, during query planning such that the right versions of files are exposed to it.
+Conceptually, Hudi stores data physically once on DFS, while providing 3 logical views on top, as explained [before](concepts.html#views). 
+Once the dataset is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom input formats. With the proper Hudi
+bundle provided, the dataset can then be queried by popular query engines like Hive, Spark and Presto.
 
+Specifically, there are two Hive tables, named off the [table name](configurations.html#TABLE_NAME_OPT_KEY) passed during write. 
+For e.g., if `table name = hudi_tbl`, then we get  
 
-In the following sections, we cover the configs needed across different query engines to achieve this.
+ - `hudi_tbl` realizes the read optimized view of the dataset backed by `HoodieInputFormat`, exposing purely columnar data.
+ - `hudi_tbl_rt` realizes the real time view of the dataset backed by `HoodieRealtimeInputFormat`, exposing a merged view of base and log data.
 
-{% include callout.html content="Instructions are currently only for Copy-on-write storage" type="info" %}
+As discussed in the concepts section, the one key primitive needed for [incremental processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)
+is `incremental pulls` (to obtain a change stream/log from a dataset). Hudi datasets can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows 
+since a specified instant time. This, together with upserts, is particularly useful for building data pipelines where one or more source Hudi tables are incrementally pulled (streams/facts),
+joined with other tables (datasets/dimensions), to [write out deltas](writing_data.html) to a target Hudi dataset. The incremental view is realized by querying one of the tables above, 
+with special configurations that indicate to query planning that only incremental data needs to be fetched out of the dataset. 
 
+In the sections below, we discuss in detail how to access all 3 views on each query engine.
 
 ## Hive
 
-For HiveServer2 access, [install](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr)
-the hoodie-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar into the aux jars path and we should be able to recognize the Hudi tables and query them correctly.
-
-For beeline access, the `hive.input.format` variable needs to be set to the fully qualified path name of the inputformat `com.uber.hoodie.hadoop.HoodieInputFormat`
-For Tez, additionally the `hive.tez.input.format` needs to be set to `org.apache.hadoop.hive.ql.io.HiveInputFormat`
+In order for Hive to recognize Hudi datasets and query them correctly, the HiveServer2 needs to be provided with the `hoodie-hadoop-hive-bundle-x.y.z-SNAPSHOT.jar` 
+in its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr). This will ensure the input format 
+classes and their dependencies are available for query planning & execution. 
+
+### Read Optimized table {#hive-ro-view}
+In addition to the setup above, for beeline cli access, the `hive.input.format` variable needs to be set to the fully qualified class name of the 
+input format `com.uber.hoodie.hadoop.HoodieInputFormat`. For Tez, additionally the `hive.tez.input.format` needs to be set 
+to `org.apache.hadoop.hive.ql.io.HiveInputFormat`
+
+### Real time table {#hive-rt-view}
+In addition to installing the hive bundle jar on the HiveServer2, it needs to be put on the hadoop/hive installation across the cluster, so that
+queries can pick up the custom RecordReader as well.
+
+### Incremental Pulling {#hive-incr-pull}
+
+`HiveIncrementalPuller` allows incrementally extracting changes from large fact/dimension tables via HiveQL, combining the benefits of Hive (reliably process complex SQL queries) and 
+incremental primitives (speed up queries by pulling tables incrementally instead of scanning fully). The tool uses Hive JDBC to run the hive query and saves its results in a temp table
+that can later be upserted. The upsert utility (`HoodieDeltaStreamer`) has all the state it needs from the directory structure to know what should be the commit time on the target table,
+e.g: `/app/incremental-hql/intermediate/{source_table_name}_temp/{last_commit_included}`. The delta Hive table registered will be of the form `{tmpdb}.{source_table}_{last_commit_included}`.
+
+The following are the configuration options for HiveIncrementalPuller
+
+| **Config** | **Description** | **Default** |
+| ---- | ----- | ---- |
+|hiveUrl| Hive Server 2 URL to connect to |  |
+|hiveUser| Hive Server 2 Username |  |
+|hivePass| Hive Server 2 Password |  |
+|queue| YARN Queue name |  |
+|tmp| Directory where the temporary delta data is stored in DFS. The directory structure will follow conventions. Please see the below section.  |  |
+|extractSQLFile| The SQL to execute on the source table to extract the data. The data extracted will be all the rows that changed since a particular point in time. |  |
+|sourceTable| Source Table Name. Needed to set hive environment properties. |  |
+|targetTable| Target Table Name. Needed for the intermediate storage directory structure.  |  |
+|sourceDataPath| Source DFS Base Path. This is where the Hudi metadata will be read. |  |
+|targetDataPath| Target DFS Base path. This is needed to compute the fromCommitTime. This is not needed if fromCommitTime is specified explicitly. |  |
+|tmpdb| The database to which the intermediate temp delta table will be created | hoodie_temp |
+|fromCommitTime| This is the most important parameter. This is the point in time from which the changed records are pulled from.  |  |
+|maxCommits| Number of commits to include in the pull. Setting this to -1 will include all the commits from fromCommitTime. Setting this to a value > 0, will include records that ONLY changed in the specified number of commits after fromCommitTime. This may be needed if you need to catch up say 2 commits at a time. | 3 |
+|help| Utility Help |  |
+
+
+Setting fromCommitTime=0 and maxCommits=-1 will pull in the entire source dataset and can be used to initiate backfills. If the target dataset is a Hudi dataset,
+then the utility can determine whether the target dataset has no commits or is behind by more than 24 hours (this is configurable); in that case,
+it will automatically use the backfill configuration, since applying the last 24 hours incrementally could take more time than doing a backfill. The current limitation of the tool
+is the lack of support for self-joining the same table in mixed mode (normal and incremental modes).
 
 ## Spark
 
-There are two ways of running Spark SQL on Hudi datasets.
+Spark provides much easier deployment & management of Hudi jars and bundles into jobs/notebooks. At a high level, there are two ways to access Hudi datasets in Spark.
 
-First method involves, setting `spark.sql.hive.convertMetastoreParquet=false`, forcing Spark to fallback
-to using the Hive Serde to read the data (planning/executions is still Spark). This turns off optimizations in Spark
-towards Parquet reading, which we will address in the next method based on path filters.
-However benchmarks have not revealed any real performance degradation with Hudi & SparkSQL, compared to native support.
+ - **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to how standard datasources (e.g: `spark.read.parquet`) work.
+ - **Read as Hive tables** : Supports all three views, including the real time view, relying on the custom Hudi input formats again like Hive.
+ 
+ In general, your Spark job needs a dependency on `hoodie-spark`, or the `hoodie-spark-bundle-x.y.z.jar` needs to be on the classpath of the driver & executors (hint: use the `--jars` argument)
+ 
+### Read Optimized table {#spark-ro-view}
 
-Sample command is provided below to spin up Spark Shell
+To read the RO table as a Hive table using SparkSQL, simply push a path filter into the sparkContext as follows. 
+This method retains Spark's built-in optimizations for reading Parquet files, such as vectorized reading, on Hudi tables.
 
 ```
-$ spark-shell --jars hoodie-spark-bundle-x.y.z-SNAPSHOT.jar --driver-class-path /etc/hive/conf  --packages com.databricks:spark-avro_2.11:4.0.0 --conf spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 7g --executor-memory 2g  --master yarn-client
+spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter], classOf[org.apache.hadoop.fs.PathFilter]);
+```
 
-scala> sqlContext.sql("select count(*) from uber.trips where datestr = '2016-10-02'").show()
+If you prefer to glob paths on DFS via the datasource, you can simply do something like below to get a Spark dataframe to work with. 
 
 ```
+Dataset<Row> hoodieROViewDF = spark.read().format("com.uber.hoodie")
+    // pass any path glob, can include hudi & non-hudi datasets
+    .load("/glob/path/pattern");
+```
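+
+The dataframe obtained above can then be processed like any other Spark dataframe. As a small illustrative follow-up (the view name below is arbitrary):
+
+```
+// register the RO dataframe as a temporary view and query it with Spark SQL
+hoodieROViewDF.createOrReplaceTempView("hudi_ro_table");
+spark.sql("select count(*) from hudi_ro_table").show();
+```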
+ 
+### Real time table {#spark-rt-view}
+Currently, the real time table can only be queried as a Hive table in Spark. In order to do this, set `spark.sql.hive.convertMetastoreParquet=false`, forcing Spark to fall back 
+to using the Hive Serde to read the data (planning/execution is still handled by Spark). 
 
+```
+$ spark-shell --jars hoodie-spark-bundle-x.y.z-SNAPSHOT.jar --driver-class-path /etc/hive/conf  --packages com.databricks:spark-avro_2.11:4.0.0 --conf spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 7g --executor-memory 2g  --master yarn-client
 
-For scheduled Spark jobs, a dependency to [hoodie-hadoop-mr](https://mvnrepository.com/artifact/com.uber.hoodie/hoodie-hadoop-mr) and [hoodie-client](https://mvnrepository.com/artifact/com.uber.hoodie/hoodie-client) modules needs to be added
-and the same config needs to be set on `SparkConf` or conveniently via `HoodieReadClient.addHoodieSupport(conf)`
-
-{% include callout.html content="Don't instantiate a HoodieWriteClient against a table you don't own. Hudi is a single writer & multiple reader system as of now. You may accidentally cause incidents otherwise.
-" type="warning" %}
+scala> sqlContext.sql("select count(*) from hudi_rt where datestr = '2016-10-02'").show()
+```
 
-The second method uses a new feature in Spark 2.x, which allows for the work of HoodieInputFormat to be done via a path filter as below. This method uses Spark built-in optimizations for
-reading Parquet files, just like queries on non-Hudi tables.
+### Incremental Pulling {#spark-incr-pull}
+The `hoodie-spark` module offers the DataSource API, a more elegant way to pull data from a Hudi dataset and process it via Spark.
+A sample incremental pull, which obtains all records written since `beginInstantTime`, looks like below.
 
 ```
-spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter], classOf[org.apache.hadoop.fs.PathFilter]);
+ Dataset<Row> hoodieIncViewDF = spark.read()
+     .format("com.uber.hoodie")
+     .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY(),
+             DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL())
+     .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(),
+            <beginInstantTime>)
+     .load(tablePath); // For incremental view, pass in the root/base path of dataset
 ```
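+
+The resulting dataframe contains only the records written after `beginInstantTime`, so it can be treated like a change stream. A small illustrative follow-up (the view name below is arbitrary):
+
+```
+// group the incremental records by the commit that produced them
+hoodieIncViewDF.createOrReplaceTempView("hudi_incr_table");
+spark.sql("select `_hoodie_commit_time`, count(*) from hudi_incr_table group by `_hoodie_commit_time`").show();
+```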
 
+Please refer to [configurations](configurations.html#spark-datasource) section, to view all datasource options.
+
+Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing.
+
+| **API** | **Description** |
+| ---- | ----- |
+| read(keys) | Read out the data corresponding to the keys as a DataFrame, using Hudi's own index for faster lookup |
+| filterExists() | Filter out already existing records from the provided RDD[HoodieRecord]. Useful for de-duplication |
+| checkExists(keys) | Check if the provided keys exist in a Hudi dataset |
+
 
 ## Presto
 
-Presto requires the `hoodie-presto-bundle` jar to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation.
+Presto is a popular query engine, providing interactive query performance. Hudi RO tables can be queried seamlessly in Presto. 
+This requires the `hoodie-presto-bundle` jar to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation.
diff --git a/docs/quickstart.md b/docs/quickstart.md
index 317bb17..416aca9 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -6,12 +6,18 @@ sidebar: mydoc_sidebar
 toc: false
 permalink: quickstart.html
 ---
+<br/>
+To get a quick peek at Hudi's capabilities, we have put together a [demo video](https://www.youtube.com/watch?v=VhNgUsxdrD0) 
+that showcases this on a docker based setup with all dependent systems running locally. We recommend you replicate the same setup 
+and run the demo yourself, by following steps [here](docker_demo.html). Also, if you are looking for ways to migrate your existing data to Hudi, 
+refer to [migration guide](migration_guide.html).
 
-
+If you have Hive, Hadoop, Spark installed already & prefer to do it on your own setup, read on.
 
 ## Download Hudi
 
-Check out code and pull it into Intellij as a normal maven project. Normally build the maven project, from command line
+Check out [code](https://github.com/apache/incubator-hudi) or download [latest release](https://github.com/apache/incubator-hudi/archive/hoodie-0.4.5.zip) 
+and normally build the maven project, from command line
 
 ```
 $ mvn clean install -DskipTests -DskipITs
@@ -22,12 +28,12 @@ To work with older version of Hive (pre Hive-1.2.1), use
 $ mvn clean install -DskipTests -DskipITs -Dhive11
 ```
 
-{% include callout.html content="You might want to add your spark jars folder to project dependencies under 'Module Setttings', to be able to run Spark from IDE. 
-Setup your local hadoop/hive test environment, so you can play with entire ecosystem." type="info" %}
+{% include callout.html content="For IDE, you can pull in the code into IntelliJ as a normal maven project. 
+You might want to add your spark jars folder to project dependencies under 'Module Setttings', to be able to run from IDE." 
+type="info" %}
 
-<br/>Please refer to [migration guide](migration_guide.html), for recommended ways to migrate your existing dataset to Hudi.
 
-## Version Compatibility
+### Version Compatibility
 
 Hudi requires Java 8 to be installed on a *nix system. Hudi works with Spark-2.x versions. 
 Further, we have verified that Hudi works with the following combination of Hadoop/Hive/Spark.
@@ -38,17 +44,17 @@ Further, we have verified that Hudi works with the following combination of Hado
 | Apache hadoop-2.8.4 | Apache hive-2.3.3 | spark-2.[1-3].x | Use "mvn clean install -DskipTests" |
 | Apache hadoop-2.7.3 | Apache hive-1.2.1 | spark-2.[1-3].x | Use "mvn clean install -DskipTests" |
 
-If your environment has other versions of hadoop/hive/spark, please try out Hudi and let us know if there are any issues.
-We are limited by our bandwidth to certify other combinations (e.g Docker on Windows).
-It would be of great help if you can reach out to us with your setup and experience with hudi.
+{% include callout.html content="If your environment has other versions of hadoop/hive/spark, please try out Hudi 
+and let us know if there are any issues. "  type="info" %}
 
-## Generate a Hudi Dataset
+## Generate Sample Dataset
 
-### Requirements & Environment Variable
+### Environment Variables
 
-Please set the following environment variablies according to your setup. We have given an example setup with CDH version
+Please set the following environment variables according to your setup. We have given an example setup with CDH versions
 
 ```
+cd incubator-hudi 
 export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
 export HIVE_HOME=/var/hadoop/setup/apache-hive-1.1.0-cdh5.7.2-bin
 export HADOOP_HOME=/var/hadoop/setup/hadoop-2.6.0-cdh5.7.2
@@ -60,13 +66,7 @@ export SPARK_CONF_DIR=$SPARK_HOME/conf
 export PATH=$JAVA_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$SPARK_INSTALL/bin:$PATH
 ```
 
-### Supported API's
-
-Use the DataSource API to quickly start reading or writing Hudi datasets in few lines of code. Ideal for most
-ingestion use-cases.
-Use the RDD API to perform more involved actions on a Hudi dataset
-
-#### DataSource API
+### Run HoodieJavaApp
 
 Run __hoodie-spark/src/test/java/HoodieJavaApp.java__ class, to place two commits (commit 1 => 100 inserts, commit 2 => 100 updates to previously inserted 100 records) onto your DFS/local filesystem. Use the wrapper script
 to run from command-line
@@ -87,27 +87,16 @@ Usage: <main class> [options]
     --table-type, -t
        One of COPY_ON_WRITE or MERGE_ON_READ
        Default: COPY_ON_WRITE
-
-
 ```
 
 The class lets you choose table names, output paths and one of the storage types. In your own applications, be sure to include the `hoodie-spark` module as dependency
-and follow a similar pattern to write/read datasets via the datasource.
-
-#### RDD API
-
-RDD level APIs give you more power and control over things, via the `hoodie-client` module .
-Refer to  __hoodie-client/src/test/java/HoodieClientExample.java__ class for an example.
-
-
+and follow a similar pattern to write/read datasets via the datasource. 
 
 ## Query a Hudi dataset
 
-### Register Dataset to Hive Metastore
+Next, we will register the sample dataset into the Hive metastore and try to query it using [Hive](#hive), [Spark](#spark) & [Presto](#presto).
 
-Now, lets see how we can publish this data into Hive.
-
-#### Starting up Hive locally
+### Start Hive Server locally
 
 ```
 hdfs namenode # start name node
@@ -117,18 +106,15 @@ bin/hive --service metastore  # start metastore
 bin/hiveserver2 \
   --hiveconf hive.root.logger=INFO,console \
   --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
-  --hiveconf ive.stats.autogather=false \
-  --hiveconf hive.aux.jars.path=hoodie/packaging/hoodie-hadoop-mr-bundle/target/hoodie-hadoop-mr-bundle-0.4.3-SNAPSHOT.jar
+  --hiveconf hive.stats.autogather=false \
+  --hiveconf hive.aux.jars.path=/path/to/packaging/hoodie-hive-bundle/target/hoodie-hive-bundle-0.4.6-SNAPSHOT.jar
 
 ```
 
-
-#### Hive Sync Tool
-
-Hive Sync Tool will update/create the necessary metadata(schema and partitions) in hive metastore.
-This allows for schema evolution and incremental addition of new partitions written to.
+### Run Hive Sync Tool
+Hive Sync Tool will update/create the necessary metadata (schema and partitions) in the Hive metastore. This allows for schema evolution and incremental addition of new partitions written to.
 It uses an incremental approach by storing the last commit time synced in the TBLPROPERTIES and only syncing the commits from the last sync commit time stored.
-This can be run as frequently as the ingestion pipeline to make sure new partitions and schema evolution changes are reflected immediately.
+Both [Spark Datasource](writing_data.html#datasource-writer) & [DeltaStreamer](writing_data.html#deltastreamer) have the capability to do this after each write.
 
 ```
 cd hoodie-hive
@@ -142,98 +128,19 @@ cd hoodie-hive
   --partitioned-by field1,field2
 
 ```
+{% include callout.html content="For some reason, if you want to do this by hand. Please 
+follow [this](https://cwiki.apache.org/confluence/display/HUDI/Registering+sample+dataset+to+Hive+via+beeline)." 
+type="info" %}
 
 
+### HiveQL {#hive}
 
-#### Manually via Beeline
-Add in the hoodie-hadoop-mr-bundler jar so, Hive can read the Hudi dataset and answer the query.
-Also, For reading Hudi tables using hive, the following configs needs to be setup
+Let's first perform a query on the latest committed snapshot of the table
 
 ```
 hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
 hive> set hive.stats.autogather=false;
-hive> add jar file:///tmp/hoodie-hadoop-mr-bundle-0.4.3.jar;
-Added [file:///tmp/hoodie-hadoop-mr-bundle-0.4.3.jar] to class path
-Added resources: [file:///tmp/hoodie-hadoop-mr-bundle-0.4.3.jar]
-```
-
-Then, you need to create a __ReadOptimized__ Hive table as below (only type supported as of now)and register the sample partitions
-
-```
-drop table hoodie_test;
-CREATE EXTERNAL TABLE hoodie_test(`_row_key`  string,
-`_hoodie_commit_time` string,
-`_hoodie_commit_seqno` string,
- rider string,
- driver string,
- begin_lat double,
- begin_lon double,
- end_lat double,
- end_lon double,
- fare double)
-PARTITIONED BY (`datestr` string)
-ROW FORMAT SERDE
-   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
-STORED AS INPUTFORMAT
-   'com.uber.hoodie.hadoop.HoodieInputFormat'
-OUTPUTFORMAT
-   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
-LOCATION
-   'hdfs:///tmp/hoodie/sample-table';
-
-ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2016-03-15') LOCATION 'hdfs:///tmp/hoodie/sample-table/2016/03/15';
-ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-16') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/16';
-ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-17') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/17';
-
-set mapreduce.framework.name=yarn;
-```
-
-And you can generate a __Realtime__ Hive table, as below
-
-```
-DROP TABLE hoodie_rt;
-CREATE EXTERNAL TABLE hoodie_rt(
-`_hoodie_commit_time` string,
-`_hoodie_commit_seqno` string,
-`_hoodie_record_key` string,
-`_hoodie_partition_path` string,
-`_hoodie_file_name` string,
- timestamp double,
- `_row_key` string,
- rider string,
- driver string,
- begin_lat double,
- begin_lon double,
- end_lat double,
- end_lon double,
- fare double)
-PARTITIONED BY (`datestr` string)
-ROW FORMAT SERDE
-   'com.uber.hoodie.hadoop.realtime.HoodieParquetSerde'
-STORED AS INPUTFORMAT
-   'com.uber.hoodie.hadoop.realtime.HoodieRealtimeInputFormat'
-OUTPUTFORMAT
-   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
-LOCATION
-   'file:///tmp/hoodie/sample-table';
-
-ALTER TABLE `hoodie_rt` ADD IF NOT EXISTS PARTITION (datestr='2016-03-15') LOCATION 'file:///tmp/hoodie/sample-table/2016/03/15';
-ALTER TABLE `hoodie_rt` ADD IF NOT EXISTS PARTITION (datestr='2015-03-16') LOCATION 'file:///tmp/hoodie/sample-table/2015/03/16';
-ALTER TABLE `hoodie_rt` ADD IF NOT EXISTS PARTITION (datestr='2015-03-17') LOCATION 'file:///tmp/hoodie/sample-table/2015/03/17';
-
-```
-
-
-
-### Using different query engines
-
-Now, we can proceed to query the dataset, as we would normally do across all the three query engines supported.
-
-#### HiveQL
-
-Let's first perform a query on the latest committed snapshot of the table
-
-```
+hive> add jar file:///path/to/hoodie-hive-bundle-0.4.6-SNAPSHOT.jar;
 hive> select count(*) from hoodie_test;
 ...
 OK
@@ -242,24 +149,22 @@ Time taken: 18.05 seconds, Fetched: 1 row(s)
 hive>
 ```
 
-#### SparkSQL
+### SparkSQL {#spark}
 
 Spark is super easy, once you get Hive working as above. Just spin up a Spark Shell as below
 
 ```
 $ cd $SPARK_INSTALL
-$ spark-shell --jars $HUDI_SRC/packaging/hoodie-spark-bundle/target/hoodie-spark-bundle-0.4.3-SNAPSHOT.jar --driver-class-path $HADOOP_CONF_DIR  --conf spark.sql.hive.convertMetastoreParquet=false --packages com.databricks:spark-avro_2.11:4.0.0
+$ spark-shell --jars $HUDI_SRC/packaging/hoodie-spark-bundle/target/hoodie-spark-bundle-0.4.6-SNAPSHOT.jar --driver-class-path $HADOOP_CONF_DIR  --conf spark.sql.hive.convertMetastoreParquet=false --packages com.databricks:spark-avro_2.11:4.0.0
 
 scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 scala> sqlContext.sql("show tables").show(10000)
 scala> sqlContext.sql("describe hoodie_test").show(10000)
-scala> sqlContext.sql("describe hoodie_rt").show(10000)
+scala> sqlContext.sql("describe hoodie_test_rt").show(10000)
 scala> sqlContext.sql("select count(*) from hoodie_test").show(10000)
 ```
 
-You can also use the sample queries in __hoodie-utilities/src/test/java/HoodieSparkSQLExample.java__ for running on `hoodie_rt`
-
-#### Presto
+### Presto {#presto}
 
 Checkout the 'master' branch on OSS Presto, build it, and place your installation somewhere.
 
@@ -271,9 +176,7 @@ show columns from hive.default.hoodie_test;
 select count(*) from hive.default.hoodie_test
 ```
 
-
-
-## Incremental Queries of a Hudi dataset
+### Incremental HiveQL
 
 Let's now perform a query, to obtain the __ONLY__ changed rows since a commit in the past.
 
@@ -299,953 +202,4 @@ hive>
 hive>
 ```
 
-
-{% include note.html content="This is only supported for Read-optimized tables for now." %}
-
-
-## A Demo using docker containers
-
-Lets use a real world example to see how hudi works end to end. For this purpose, a self contained
-data infrastructure is brought up in a local docker cluster within your computer.
-
-The steps assume you are using Mac laptop
-
-### Prerequisites
-
-  * Docker Setup :  For Mac, Please follow the steps as defined in [https://docs.docker.com/v17.12/docker-for-mac/install/]. For running Spark-SQL queries, please ensure atleast 6 GB and 4 CPUs are allocated to Docker (See Docker -> Preferences -> Advanced). Otherwise, spark-SQL queries could be killed because of memory issues.
-  * kafkacat : A command-line utility to publish/consume from kafka topics. Use `brew install kafkacat` to install kafkacat
-  * /etc/hosts : The demo references many services running in container by the hostname. Add the following settings to /etc/hosts
-
-  ```
-   127.0.0.1 adhoc-1
-   127.0.0.1 adhoc-2
-   127.0.0.1 namenode
-   127.0.0.1 datanode1
-   127.0.0.1 hiveserver
-   127.0.0.1 hivemetastore
-   127.0.0.1 kafkabroker
-   127.0.0.1 sparkmaster
-   127.0.0.1 zookeeper
-  ```
-
-### Setting up Docker Cluster
-
-
-#### Build Hudi
-
-The first step is to build hudi
-```
-cd <HUDI_WORKSPACE>
-mvn package -DskipTests
-```
-
-#### Bringing up Demo Cluster
-
-The next step is to run the docker compose script and setup configs for bringing up the cluster.
-This should pull the docker images from docker hub and setup docker cluster.
-
-```
-cd docker
-./setup_demo.sh
-....
-....
-....
-Stopping spark-worker-1            ... done
-Stopping hiveserver                ... done
-Stopping hivemetastore             ... done
-Stopping historyserver             ... done
-.......
-......
-Creating network "hudi_demo" with the default driver
-Creating hive-metastore-postgresql ... done
-Creating namenode                  ... done
-Creating zookeeper                 ... done
-Creating kafkabroker               ... done
-Creating hivemetastore             ... done
-Creating historyserver             ... done
-Creating hiveserver                ... done
-Creating datanode1                 ... done
-Creating sparkmaster               ... done
-Creating adhoc-1                   ... done
-Creating adhoc-2                   ... done
-Creating spark-worker-1            ... done
-Copying spark default config and setting up configs
-Copying spark default config and setting up configs
-Copying spark default config and setting up configs
-varadarb-C02SG7Q3G8WP:docker varadarb$ docker ps
-```
-
-At this point, the docker cluster will be up and running. The demo cluster brings up the following services
-
-   * HDFS Services (NameNode, DataNode)
-   * Spark Master and Worker
-   * Hive Services (Metastore, HiveServer2 along with PostgresDB)
-   * Kafka Broker and a Zookeeper Node (Kakfa will be used as upstream source for the demo)
-   * Adhoc containers to run Hudi/Hive CLI commands
-
-### Demo
-
-Stock Tracker data will be used to showcase both different Hudi Views and the effects of Compaction.
-
-Take a look at the directory `docker/demo/data`. There are 2 batches of stock data - each at 1 minute granularity.
-The first batch contains stocker tracker data for some stock symbols during the first hour of trading window
-(9:30 a.m to 10:30 a.m). The second batch contains tracker data for next 30 mins (10:30 - 11 a.m). Hudi will
-be used to ingest these batches to a dataset which will contain the latest stock tracker data at hour level granularity.
-The batches are windowed intentionally so that the second batch contains updates to some of the rows in the first batch.
-
-#### Step 1 : Publish the first batch to Kafka
-
-Upload the first batch to Kafka topic 'stock ticks'
-
-```
-cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P
-
-To check if the new topic shows up, use
-kafkacat -b kafkabroker -L -J | jq .
-{
-  "originating_broker": {
-    "id": 1001,
-    "name": "kafkabroker:9092/1001"
-  },
-  "query": {
-    "topic": "*"
-  },
-  "brokers": [
-    {
-      "id": 1001,
-      "name": "kafkabroker:9092"
-    }
-  ],
-  "topics": [
-    {
-      "topic": "stock_ticks",
-      "partitions": [
-        {
-          "partition": 0,
-          "leader": 1001,
-          "replicas": [
-            {
-              "id": 1001
-            }
-          ],
-          "isrs": [
-            {
-              "id": 1001
-            }
-          ]
-        }
-      ]
-    }
-  ]
-}
-
-```
-
-#### Step 2: Incrementally ingest data from Kafka topic
-
-Hudi comes with a tool named DeltaStreamer. This tool can connect to variety of data sources (including Kafka) to
-pull changes and apply to Hudi dataset using upsert/insert primitives. Here, we will use the tool to download
-json data from kafka topic and ingest to both COW and MOR tables we initialized in the previous step. This tool
-automatically initializes the datasets in the file-system if they do not exist yet.
-
-```
-docker exec -it adhoc-2 /bin/bash
-
-# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_cow dataset in HDFS
-spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties --schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
-....
-....
-2018-09-24 22:20:00 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
-2018-09-24 22:20:00 INFO  SparkContext:54 - Successfully stopped SparkContext
-# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_mor --target-table stock_ticks_mor --props /var/demo/config/kafka-source.properties
-....
-2018-09-24 22:22:01 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
-2018-09-24 22:22:01 INFO  SparkContext:54 - Successfully stopped SparkContext
-....
-
-# As part of the setup (Look at setup_demo.sh), the configs needed for DeltaStreamer is uploaded to HDFS. The configs
-# contain mostly Kafa connectivity settings, the avro-schema to be used for ingesting along with key and partitioning fields.
-
-exit
-```
-
-You can use the HDFS web UI to look at the datasets:
-`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow`.
-
-You can explore the new partition folder created in the dataset, along with a "commit"
-file under .hoodie which signals a successful commit.
-
-There will be a similar setup when you browse the MOR dataset (with a "deltacommit" file instead):
-`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_mor`
-
-
-#### Step 3: Sync with Hive
-
-At this step, the datasets are available in HDFS. We need to sync with Hive to create new Hive tables and add partitions
-in order to run Hive queries against those datasets.
-
-```
-docker exec -it adhoc-2 /bin/bash
-
-# This command takes in the HiveServer2 URL and the COW Hudi dataset location in HDFS, and syncs the HDFS state to Hive
-/var/hoodie/ws/hoodie-hive/run_sync_tool.sh  --jdbc-url jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by dt --base-path /user/hive/warehouse/stock_ticks_cow --database default --table stock_ticks_cow
-.....
-2018-09-24 22:22:45,568 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_cow
-.....
-
-# Now run hive-sync for the second dataset in HDFS, which uses Merge-On-Read (MOR) storage
-/var/hoodie/ws/hoodie-hive/run_sync_tool.sh  --jdbc-url jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by dt --base-path /user/hive/warehouse/stock_ticks_mor --database default --table stock_ticks_mor
-...
-2018-09-24 22:23:09,171 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_mor
-...
-2018-09-24 22:23:09,559 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_mor_rt
-....
-exit
-```
-After executing the above commands, you will notice
-
-1. A Hive table named `stock_ticks_cow` created, which provides the Read-Optimized view for the Copy On Write dataset.
-2. Two new tables `stock_ticks_mor` and `stock_ticks_mor_rt` created for the Merge On Read dataset. The former
-provides the ReadOptimized view for the Hudi dataset and the latter provides the Realtime view for the dataset.
-
-
-#### Step 4 (a): Run Hive Queries
-
-Run a Hive query to find the latest timestamp ingested for stock symbol 'GOOG'. You will notice that both the read-optimized views
-(for the COW and MOR datasets) and the realtime view (for the MOR dataset) give the same value "10:29 a.m.", as Hudi creates a
-parquet file for the first batch of data.
-
-```
-docker exec -it adhoc-2 /bin/bash
-beeline -u jdbc:hive2://hiveserver:10000 --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf hive.stats.autogather=false
-# List Tables
-0: jdbc:hive2://hiveserver:10000> show tables;
-+---------------------+--+
-|      tab_name       |
-+---------------------+--+
-| stock_ticks_cow     |
-| stock_ticks_mor     |
-| stock_ticks_mor_rt  |
-+---------------------+--+
-3 rows selected (0.801 seconds)
-0: jdbc:hive2://hiveserver:10000>
-
-
-# Look at partitions that were added
-0: jdbc:hive2://hiveserver:10000> show partitions stock_ticks_mor_rt;
-+----------------+--+
-|   partition    |
-+----------------+--+
-| dt=2018-08-31  |
-+----------------+--+
-1 row selected (0.24 seconds)
-
-
-# COPY-ON-WRITE Queries:
-=========================
-
-
-0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
-+---------+----------------------+--+
-| symbol  |         _c1          |
-+---------+----------------------+--+
-| GOOG    | 2018-08-31 10:29:00  |
-+---------+----------------------+--+
-
-Now, run a projection query:
-
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| 20180924221953       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
-| 20180924221953       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-
-
-# Merge-On-Read Queries:
-==========================
-
-Let's run similar queries against the M-O-R dataset, looking at both the
-ReadOptimized and Realtime views supported by the M-O-R dataset.
-
-# Run against ReadOptimized View. Notice that the latest timestamp is 10:29
-0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';
-WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
-+---------+----------------------+--+
-| symbol  |         _c1          |
-+---------+----------------------+--+
-| GOOG    | 2018-08-31 10:29:00  |
-+---------+----------------------+--+
-1 row selected (6.326 seconds)
-
-
-# Run against Realtime View. Notice that the latest timestamp is again 10:29
-
-0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
-WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
-+---------+----------------------+--+
-| symbol  |         _c1          |
-+---------+----------------------+--+
-| GOOG    | 2018-08-31 10:29:00  |
-+---------+----------------------+--+
-1 row selected (1.606 seconds)
-
-
-# Run projection query against Read Optimized and Realtime tables
-
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG';
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
-| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG';
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
-| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-
-exit
-exit
-```
-
-#### Step 4 (b): Run Spark-SQL Queries
-Hudi supports Spark as a query engine, just like Hive. Here are the same Hive queries
-running in Spark SQL:
-
-```
-docker exec -it adhoc-1 /bin/bash
-$SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --master local[2] --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --executor-memory 3G --num-executors 1  --packages com.databricks:spark-avro_2.11:4.0.0
-...
-
-Welcome to
-      ____              __
-     / __/__  ___ _____/ /__
-    _\ \/ _ \/ _ `/ __/  '_/
-   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
-      /_/
-
-Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
-Type in expressions to have them evaluated.
-Type :help for more information.
-
-scala>
-scala> spark.sql("show tables").show(100, false)
-+--------+------------------+-----------+
-|database|tableName         |isTemporary|
-+--------+------------------+-----------+
-|default |stock_ticks_cow   |false      |
-|default |stock_ticks_mor   |false      |
-|default |stock_ticks_mor_rt|false      |
-+--------+------------------+-----------+
-
-# Copy-On-Write Table
-
-## Run max timestamp query against COW table
-
-scala> spark.sql("select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG'").show(100, false)
-[Stage 0:>                                                          (0 + 1) / 1]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
-SLF4J: Defaulting to no-operation (NOP) logger implementation
-SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
-+------+-------------------+
-|symbol|max(ts)            |
-+------+-------------------+
-|GOOG  |2018-08-31 10:29:00|
-+------+-------------------+
-
-## Projection Query
-
-scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG'").show(100, false)
-+-------------------+------+-------------------+------+---------+--------+
-|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
-+-------------------+------+-------------------+------+---------+--------+
-|20180924221953     |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
-|20180924221953     |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
-+-------------------+------+-------------------+------+---------+--------+
-
-# Merge-On-Read Queries:
-==========================
-
-Let's run similar queries against the M-O-R dataset, looking at both the
-ReadOptimized and Realtime views supported by the M-O-R dataset.
-
-# Run against ReadOptimized View. Notice that the latest timestamp is 10:29
-scala> spark.sql("select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG'").show(100, false)
-+------+-------------------+
-|symbol|max(ts)            |
-+------+-------------------+
-|GOOG  |2018-08-31 10:29:00|
-+------+-------------------+
-
-
-# Run against Realtime View. Notice that the latest timestamp is again 10:29
-
-scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
-+------+-------------------+
-|symbol|max(ts)            |
-+------+-------------------+
-|GOOG  |2018-08-31 10:29:00|
-+------+-------------------+
-
-# Run projection query against Read Optimized and Realtime tables
-
-scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG'").show(100, false)
-+-------------------+------+-------------------+------+---------+--------+
-|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
-+-------------------+------+-------------------+------+---------+--------+
-|20180924222155     |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
-|20180924222155     |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
-+-------------------+------+-------------------+------+---------+--------+
-
-scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG'").show(100, false)
-+-------------------+------+-------------------+------+---------+--------+
-|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
-+-------------------+------+-------------------+------+---------+--------+
-|20180924222155     |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
-|20180924222155     |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
-+-------------------+------+-------------------+------+---------+--------+
-
-```
-
-
-#### Step 5: Upload second batch to Kafka and run DeltaStreamer to ingest
-
-Upload the second batch of data and ingest it using the delta-streamer. As this batch does not bring in any new
-partitions, there is no need to run hive-sync.
-
-```
-cat docker/demo/data/batch_2.json | kafkacat -b kafkabroker -t stock_ticks -P
-
-# Within Docker container, run the ingestion command
-docker exec -it adhoc-2 /bin/bash
-
-# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_cow dataset in HDFS
-spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties
-
-# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_mor --target-table stock_ticks_mor --props /var/demo/config/kafka-source.properties
-
-exit
-```
-With the Copy-On-Write table, the second ingestion by DeltaStreamer resulted in a new version of the Parquet file being created.
-See `http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow/2018/08/31`
-
-With the Merge-On-Read table, the second ingestion merely appended the batch to an unmerged delta (log) file.
-Take a look at the HDFS filesystem to get an idea: `http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_mor/2018/08/31`
-
-#### Step 6(a): Run Hive Queries
-
-With the Copy-On-Write table, the read-optimized view immediately sees the changes from the second batch once the batch
-is committed, since each ingestion creates newer versions of parquet files.
-
-With the Merge-On-Read table, the second ingestion merely appended the batch to an unmerged delta (log) file.
-This is when the ReadOptimized and Realtime views provide different results. The ReadOptimized view will still
-return "10:29 a.m." as it only reads from the Parquet file. The Realtime view will do an on-the-fly merge and return the
-latest committed data, which is "10:59 a.m.".
-
-```
-docker exec -it adhoc-2 /bin/bash
-beeline -u jdbc:hive2://hiveserver:10000 --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf hive.stats.autogather=false
-
-# Copy On Write Table:
-
-0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
-WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
-+---------+----------------------+--+
-| symbol  |         _c1          |
-+---------+----------------------+--+
-| GOOG    | 2018-08-31 10:59:00  |
-+---------+----------------------+--+
-1 row selected (1.932 seconds)
-
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| 20180924221953       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
-| 20180924224524       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-
-As you can see, the above queries now reflect the changes that came in as part of the second batch.
-
-
-# Merge On Read Table:
-
-# Read Optimized View
-0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';
-WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
-+---------+----------------------+--+
-| symbol  |         _c1          |
-+---------+----------------------+--+
-| GOOG    | 2018-08-31 10:29:00  |
-+---------+----------------------+--+
-1 row selected (1.6 seconds)
-
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG';
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
-| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-
-# Realtime View
-0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
-WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
-+---------+----------------------+--+
-| symbol  |         _c1          |
-+---------+----------------------+--+
-| GOOG    | 2018-08-31 10:59:00  |
-+---------+----------------------+--+
-
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG';
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
-| 20180924224537       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-
-exit
-exit
-```
-
-#### Step 6(b): Run Spark SQL Queries
-
-Running the same queries in Spark-SQL:
-
-```
-docker exec -it adhoc-1 /bin/bash
-bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1  --packages com.databricks:spark-avro_2.11:4.0.0
-
-# Copy On Write Table:
-
-scala> spark.sql("select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG'").show(100, false)
-+------+-------------------+
-|symbol|max(ts)            |
-+------+-------------------+
-|GOOG  |2018-08-31 10:59:00|
-+------+-------------------+
-
-scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG'").show(100, false)
-
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| 20180924221953       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
-| 20180924224524       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-
-As you can see, the above queries now reflect the changes that came in as part of the second batch.
-
-
-# Merge On Read Table:
-
-# Read Optimized View
-scala> spark.sql("select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG'").show(100, false)
-+---------+----------------------+--+
-| symbol  |         _c1          |
-+---------+----------------------+--+
-| GOOG    | 2018-08-31 10:29:00  |
-+---------+----------------------+--+
-1 row selected (1.6 seconds)
-
-scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG'").show(100, false)
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
-| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-
-# Realtime View
-scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
-+---------+----------------------+--+
-| symbol  |         _c1          |
-+---------+----------------------+--+
-| GOOG    | 2018-08-31 10:59:00  |
-+---------+----------------------+--+
-
-scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG'").show(100, false)
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
-| 20180924224537       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-
-exit
-exit
-```
-
-#### Step 7 : Incremental Query for COPY-ON-WRITE Table
-
-With 2 batches of data ingested, let's showcase the support for incremental queries in Hudi Copy-On-Write datasets.
-
-Let's take the same projection query example:
-```
-docker exec -it adhoc-2 /bin/bash
-beeline -u jdbc:hive2://hiveserver:10000 --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf hive.stats.autogather=false
-
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| 20180924064621       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
-| 20180924065039       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-
-```
-
-As you can see from the above queries, there are 2 commits - 20180924064621 and 20180924065039 - in timeline order.
-When you follow the steps yourself, you will get different timestamps for the commits. Substitute them
-in place of the above timestamps.
-
-To show the effect of an incremental query, let us assume that a reader has already seen the changes from
-ingesting the first batch. Now, for the reader to see the effect of the second batch, they have to set the start timestamp to
-the commit time of the first batch (20180924064621) and run an incremental query.
-
-`Hudi incremental mode` provides efficient scanning for incremental queries by filtering out files that do not have any
-candidate rows, using Hudi-managed metadata.
-
-```
-docker exec -it adhoc-2 /bin/bash
-beeline -u jdbc:hive2://hiveserver:10000 --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf hive.stats.autogather=false
-0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_cow.consume.mode=INCREMENTAL;
-No rows affected (0.009 seconds)
-0: jdbc:hive2://hiveserver:10000>  set hoodie.stock_ticks_cow.consume.max.commits=3;
-No rows affected (0.009 seconds)
-0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_cow.consume.start.timestamp=20180924064621;
-```
-
-With the above settings, file-ids that do not have any updates from the commit 20180924065039 are filtered out without scanning.
-Here is the incremental query:
-
-```
-0: jdbc:hive2://hiveserver:10000>
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621';
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| 20180924065039       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-1 row selected (0.83 seconds)
-0: jdbc:hive2://hiveserver:10000>
-```
-
-##### Incremental Query with Spark SQL:
-```
-docker exec -it adhoc-1 /bin/bash
-bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1  --packages com.databricks:spark-avro_2.11:4.0.0
-Welcome to
-      ____              __
-     / __/__  ___ _____/ /__
-    _\ \/ _ \/ _ `/ __/  '_/
-   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
-      /_/
-
-Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
-Type in expressions to have them evaluated.
-Type :help for more information.
-
-scala> import com.uber.hoodie.DataSourceReadOptions
-import com.uber.hoodie.DataSourceReadOptions
-
-# In the below query, 20180924064621 is the first commit's timestamp
-scala> val hoodieIncViewDF =  spark.read.format("com.uber.hoodie").option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY, DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL).option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "20180924064621").load("/user/hive/warehouse/stock_ticks_cow")
-SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
-SLF4J: Defaulting to no-operation (NOP) logger implementation
-SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
-hoodieIncViewDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 15 more fields]
-
-scala> hoodieIncViewDF.registerTempTable("stock_ticks_cow_incr_tmp1")
-warning: there was one deprecation warning; re-run with -deprecation for details
-
-scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow_incr_tmp1 where  symbol = 'GOOG'").show(100, false);
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| 20180924065039       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-
-```
-
-
-#### Step 8: Schedule and Run Compaction for Merge-On-Read dataset
-
-Let's schedule and run a compaction to create a new version of the columnar file, so that read-optimized readers will see fresher data.
-Again, you can use the Hudi CLI to manually schedule and run the compaction.
-
-```
-docker exec -it adhoc-1 /bin/bash
-root@adhoc-1:/opt#   /var/hoodie/ws/hoodie-cli/hoodie-cli.sh
-============================================
-*                                          *
-*     _    _                 _ _           *
-*    | |  | |               | (_)          *
-*    | |__| | ___   ___   __| |_  ___      *
-*    |  __  |/ _ \ / _ \ / _` | |/ _ \     *
-*    | |  | | (_) | (_) | (_| | |  __/     *
-*    |_|  |_|\___/ \___/ \__,_|_|\___|     *
-*                                          *
-============================================
-
-Welcome to Hoodie CLI. Please type help if you are looking for help.
-hoodie->connect --path /user/hive/warehouse/stock_ticks_mor
-18/09/24 06:59:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-18/09/24 06:59:35 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
-18/09/24 06:59:35 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root (auth:SIMPLE)]]]
-18/09/24 06:59:35 INFO table.HoodieTableConfig: Loading dataset properties from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
-18/09/24 06:59:36 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ from /user/hive/warehouse/stock_ticks_mor
-Metadata for table stock_ticks_mor loaded
-
-# Ensure no compactions are present
-
-hoodie:stock_ticks_mor->compactions show all
-18/09/24 06:59:54 INFO timeline.HoodieActiveTimeline: Loaded instants [[20180924064636__clean__COMPLETED], [20180924064636__deltacommit__COMPLETED], [20180924065057__clean__COMPLETED], [20180924065057__deltacommit__COMPLETED]]
-    ___________________________________________________________________
-    | Compaction Instant Time| State    | Total FileIds to be Compacted|
-    |==================================================================|
-
-# Schedule a compaction. This will use Spark Launcher to schedule compaction
-hoodie:stock_ticks_mor->compaction schedule
-....
-Compaction successfully completed for 20180924070031
-
-# Now refresh and check again. You will see that there is a new compaction requested
-
-hoodie:stock_ticks->connect --path /user/hive/warehouse/stock_ticks_mor
-18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
-18/09/24 07:01:16 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root (auth:SIMPLE)]]]
-18/09/24 07:01:16 INFO table.HoodieTableConfig: Loading dataset properties from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
-18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ from /user/hive/warehouse/stock_ticks_mor
-Metadata for table stock_ticks_mor loaded
-
-hoodie:stock_ticks_mor->compactions show all
-18/09/24 06:34:12 INFO timeline.HoodieActiveTimeline: Loaded instants [[20180924041125__clean__COMPLETED], [20180924041125__deltacommit__COMPLETED], [20180924042735__clean__COMPLETED], [20180924042735__deltacommit__COMPLETED], [==>20180924063245__compaction__REQUESTED]]
-    ___________________________________________________________________
-    | Compaction Instant Time| State    | Total FileIds to be Compacted|
-    |==================================================================|
-    | 20180924070031         | REQUESTED| 1                            |
-
-# Execute the compaction. The compaction instant value passed below must be the one displayed in the above "compactions show all" query
-hoodie:stock_ticks_mor->compaction run --compactionInstant  20180924070031 --parallelism 2 --sparkMemory 1G  --schemaFilePath /var/demo/config/schema.avsc --retry 1  
-....
-Compaction successfully completed for 20180924070031
-
-
-## Now check if compaction is completed
-
-hoodie:stock_ticks_mor->connect --path /user/hive/warehouse/stock_ticks_mor
-18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
-18/09/24 07:03:00 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root (auth:SIMPLE)]]]
-18/09/24 07:03:00 INFO table.HoodieTableConfig: Loading dataset properties from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
-18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ from /user/hive/warehouse/stock_ticks_mor
-Metadata for table stock_ticks_mor loaded
-
-hoodie:stock_ticks->compactions show all
-18/09/24 07:03:15 INFO timeline.HoodieActiveTimeline: Loaded instants [[20180924064636__clean__COMPLETED], [20180924064636__deltacommit__COMPLETED], [20180924065057__clean__COMPLETED], [20180924065057__deltacommit__COMPLETED], [20180924070031__commit__COMPLETED]]
-    ___________________________________________________________________
-    | Compaction Instant Time| State    | Total FileIds to be Compacted|
-    |==================================================================|
-    | 20180924070031         | COMPLETED| 1                            |
-
-```
-
-#### Step 9: Run Hive Queries including incremental queries
-
-You will see that both the ReadOptimized and Realtime views show the latest committed data.
-Let's also run the incremental query for the MOR table.
-From the below query output, it will be clear that the first commit time for the MOR table is 20180924064636
-and the second commit time is 20180924070031.
-
-```
-docker exec -it adhoc-2 /bin/bash
-beeline -u jdbc:hive2://hiveserver:10000 --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf hive.stats.autogather=false
-
-# Read Optimized View
-0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';
-WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
-+---------+----------------------+--+
-| symbol  |         _c1          |
-+---------+----------------------+--+
-| GOOG    | 2018-08-31 10:59:00  |
-+---------+----------------------+--+
-1 row selected (1.6 seconds)
-
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG';
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| 20180924064636       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
-| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-
-# Realtime View
-0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
-WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
-+---------+----------------------+--+
-| symbol  |         _c1          |
-+---------+----------------------+--+
-| GOOG    | 2018-08-31 10:59:00  |
-+---------+----------------------+--+
-
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG';
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| 20180924064636       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
-| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-
-# Incremental View:
-
-0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_mor.consume.mode=INCREMENTAL;
-No rows affected (0.008 seconds)
-# Max-Commits covers both second batch and compaction commit
-0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_mor.consume.max.commits=3;
-No rows affected (0.007 seconds)
-0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_mor.consume.start.timestamp=20180924064636;
-No rows affected (0.013 seconds)
-# Query:
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064636';
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-exit
-exit
-```
-
-##### Read Optimized and Realtime Views for MOR with Spark-SQL after compaction
-
-```
-docker exec -it adhoc-1 /bin/bash
-bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1  --packages com.databricks:spark-avro_2.11:4.0.0
-
-# Read Optimized View
-scala> spark.sql("select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG'").show(100, false)
-+---------+----------------------+--+
-| symbol  |         _c1          |
-+---------+----------------------+--+
-| GOOG    | 2018-08-31 10:59:00  |
-+---------+----------------------+--+
-1 row selected (1.6 seconds)
-
-scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG'").show(100, false)
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| 20180924064636       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
-| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-
-# Realtime View
-scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
-+---------+----------------------+--+
-| symbol  |         _c1          |
-+---------+----------------------+--+
-| GOOG    | 2018-08-31 10:59:00  |
-+---------+----------------------+--+
-
-scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG'").show(100, false)
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-| 20180924064636       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
-| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
-+----------------------+---------+----------------------+---------+------------+-----------+--+
-```
-
-
-This brings the demo to an end.
-
-## Testing Hudi in Local Docker environment
-
-You can bring up a Docker environment containing Hadoop, Hive and Spark services, with support for Hudi.
-```
-$ mvn pre-integration-test -DskipTests
-```
-The above command builds docker images for all the services, with the
-current Hudi source installed at /var/hoodie/ws, and also brings up the services using a compose file. We
-currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.3.1) in the docker images.
-
-To bring down the containers
-```
-$ cd hoodie-integ-test
-$ mvn docker-compose:down
-```
-
-If you want to bring the docker containers back up, use
-```
-$ cd hoodie-integ-test
-$  mvn docker-compose:up -DdetachedMode=true
-```
-
-Hudi is a library that operates in a broader data analytics/ingestion environment
-involving Hadoop, Hive and Spark. Interoperability with all these systems is a key objective for us. We are
-actively adding integration tests under __hoodie-integ-test/src/test/java__ that make use of this
-docker environment (see __hoodie-integ-test/src/test/java/com/uber/hoodie/integ/ITTestHoodieSanity.java__).
-
-
-#### Building Local Docker Containers:
-
-The docker images required for the demo and for running integration tests are already available on Docker Hub. The docker images
-and compose scripts are carefully implemented so that they serve a dual purpose:
-
-1. The docker images have inbuilt hudi jar files, with environment variables (HUDI_HADOOP_BUNDLE, ...) pointing to those jars.
-2. For running integration tests, we need the locally generated jars to be used for running services within docker. The
-   docker-compose scripts (see `docker/compose/docker-compose_hadoop284_hive233_spark231.yml`) ensure local jars override the
-   inbuilt jars by mounting the local HUDI workspace over the docker location.
-
-This helps avoid maintaining separate docker images and avoids the costly step of building HUDI docker images locally.
-But if users want to test Hudi from locations with lower network bandwidth, they can still build local docker images
-by running the script `docker/build_local_docker_images.sh` before running `docker/setup_demo.sh`.
-
-Here are the commands:
-
-```
-cd docker
-./build_local_docker_images.sh
-.....
-
-[INFO] Reactor Summary:
-[INFO]
-[INFO] hoodie ............................................. SUCCESS [  1.709 s]
-[INFO] hoodie-common ...................................... SUCCESS [  9.015 s]
-[INFO] hoodie-hadoop-mr ................................... SUCCESS [  1.108 s]
-[INFO] hoodie-client ...................................... SUCCESS [  4.409 s]
-[INFO] hoodie-hive ........................................ SUCCESS [  0.976 s]
-[INFO] hoodie-spark ....................................... SUCCESS [ 26.522 s]
-[INFO] hoodie-utilities ................................... SUCCESS [ 16.256 s]
-[INFO] hoodie-cli ......................................... SUCCESS [ 11.341 s]
-[INFO] hoodie-hadoop-mr-bundle ............................ SUCCESS [  1.893 s]
-[INFO] hoodie-hive-bundle ................................. SUCCESS [ 14.099 s]
-[INFO] hoodie-spark-bundle ................................ SUCCESS [ 58.252 s]
-[INFO] hoodie-hadoop-docker ............................... SUCCESS [  0.612 s]
-[INFO] hoodie-hadoop-base-docker .......................... SUCCESS [04:04 min]
-[INFO] hoodie-hadoop-namenode-docker ...................... SUCCESS [  6.142 s]
-[INFO] hoodie-hadoop-datanode-docker ...................... SUCCESS [  7.763 s]
-[INFO] hoodie-hadoop-history-docker ....................... SUCCESS [  5.922 s]
-[INFO] hoodie-hadoop-hive-docker .......................... SUCCESS [ 56.152 s]
-[INFO] hoodie-hadoop-sparkbase-docker ..................... SUCCESS [01:18 min]
-[INFO] hoodie-hadoop-sparkmaster-docker ................... SUCCESS [  2.964 s]
-[INFO] hoodie-hadoop-sparkworker-docker ................... SUCCESS [  3.032 s]
-[INFO] hoodie-hadoop-sparkadhoc-docker .................... SUCCESS [  2.764 s]
-[INFO] hoodie-integ-test .................................. SUCCESS [  1.785 s]
-[INFO] ------------------------------------------------------------------------
-[INFO] BUILD SUCCESS
-[INFO] ------------------------------------------------------------------------
-[INFO] Total time: 09:15 min
-[INFO] Finished at: 2018-09-10T17:47:37-07:00
-[INFO] Final Memory: 236M/1848M
-[INFO] ------------------------------------------------------------------------
-```
+{% include note.html content="This is only supported for Read-optimized view for now." %}
diff --git a/docs/writing_data.md b/docs/writing_data.md
index 28fd03e..c060134 100644
--- a/docs/writing_data.md
+++ b/docs/writing_data.md
@@ -4,31 +4,23 @@ keywords: hudi, incremental, batch, stream, processing, Hive, ETL, Spark SQL
 sidebar: mydoc_sidebar
 permalink: writing_data.html
 toc: false
-summary: In this page, we will discuss some available tools for ingesting data incrementally & consuming the changes.
+summary: In this page, we will discuss some available tools for incrementally ingesting & storing data.
 ---
 
-As discussed in the concepts section, the two basic primitives needed for [incrementally processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
-data using Hudi are `upserts` (to apply changes to a dataset) and `incremental pulls` (to obtain a change stream/log from a dataset). This section
-discusses a few tools that can be used to achieve these on different contexts.
+In this section, we will cover ways to ingest new changes from external sources or even other Hudi datasets using the [DeltaStreamer](#deltastreamer) tool, as well as
+ways to speed up large Spark jobs via upserts using the [Hudi datasource](#datasource-writer). Such datasets can then be [queried](querying_data.html) using various query engines.
 
-## Incremental Ingestion
+## DeltaStreamer
 
-Following means can be used to apply a delta or an incremental change to a Hudi dataset. For e.g, the incremental changes could be from a Kafka topic or files uploaded to DFS or
-even changes pulled from another Hudi dataset.
+The `HoodieDeltaStreamer` utility (part of hoodie-utilities) provides a way to ingest data from different sources such as DFS or Kafka, with the following capabilities.
 
-#### DeltaStreamer Tool
+ - Exactly once ingestion of new events from Kafka, [incremental imports](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports) from Sqoop or output of `HiveIncrementalPuller` or files under a DFS folder
+ - Support for json, avro or custom record types for the incoming data
+ - Manage checkpoints, rollback & recovery 
+ - Leverage Avro schemas from DFS or Confluent [schema registry](https://github.com/confluentinc/schema-registry).
+ - Support for plugging in transformations
 
-The `HoodieDeltaStreamer` utility provides the way to achieve all of these, by using the capabilities of `HoodieWriteClient`, and support simply row-row ingestion (no transformations)
-from different sources such as DFS or Kafka.
-
-The tool is a spark job (part of hoodie-utilities), that provides the following functionality
-
- - Ability to consume new events from Kafka, incremental imports from Sqoop or output of `HiveIncrementalPuller` or files under a folder on DFS
- - Support json, avro or a custom payload types for the incoming data
- - Pick up avro schemas from DFS or Confluent [schema registry](https://github.com/confluentinc/schema-registry).
- - New data is written to a Hudi dataset, with support for checkpointing and registered onto Hive
-
-Command line options describe capabilities in more detail (first build hoodie-utilities using `mvn clean package`).
+Command line options describe capabilities in more detail
 
 ```
 [hoodie]$ spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls hoodie-utilities/target/hoodie-utilities-*-SNAPSHOT.jar` --help
@@ -95,7 +87,6 @@ Usage: <main class> [options]
 
 ```
 
-
 The tool takes a hierarchically composed property file and has pluggable interfaces for extracting data, key generation and providing schema. Sample configs for ingesting from kafka and dfs are
 provided under `hoodie-utilities/src/test/resources/delta-streamer-config`.
 
@@ -117,12 +108,12 @@ and then ingest it as follows.
   --op BULK_INSERT
 ```
 
-In some cases, you may want to convert your existing dataset into Hudi, before you can begin ingesting new data. This can be accomplished using the `hdfsparquetimport` command on the `hoodie-cli`.
-Currently, there is support for converting parquet datasets.
+In some cases, you may want to migrate your existing dataset into Hudi beforehand. Please refer to the [migration guide](migration_guide.html).
 
-#### Via Custom Spark Job
+## Datasource Writer
 
-The `hoodie-spark` module offers the DataSource API to write any data frame into a Hudi dataset. Following is how we can upsert a dataframe, while specifying the field names that need to be used
+The `hoodie-spark` module offers the DataSource API to write (and also read) any data frame into a Hudi dataset.
+Following is how we can upsert a dataframe, while specifying the field names that need to be used
 for `recordKey => _row_key`, `partitionPath => partition` and `precombineKey => timestamp`
 
 
@@ -138,15 +129,16 @@ inputDF.write()
        .save(basePath);
 ```
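+
+For reference, such an upsert roughly follows the pattern sketched below (a minimal sketch only; `inputDF`, `tableName` and `basePath` are placeholders,
+and the option keys are assumed to be the `DataSourceWriteOptions` / `HoodieWriteConfig` constants listed on the [configurations](configurations.html) page):
+
+```
+// sketch: upsert a dataframe keyed by _row_key, partitioned by partition, de-duplicated on timestamp
+inputDF.write()
+       .format("com.uber.hoodie")
+       .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
+       .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
+       .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+       .option(HoodieWriteConfig.TABLE_NAME, tableName)
+       .mode(SaveMode.Append)
+       .save(basePath);
+```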
 
-Please refer to [configurations](configurations.html) section, to view all datasource options.
-
-#### Syncing to Hive
+## Syncing to Hive
 
-Once new data is written to a Hudi dataset, via tools like above, we need the ability to sync with Hive and reflect the table schema such that queries can pick up new columns and partitions. To do this, Hudi provides a `HiveSyncTool`, which can be
-invoked as below, once you have built the hoodie-hive module.
+Both tools above support syncing of the dataset's latest schema to the Hive metastore, such that queries can pick up new columns and partitions.
+In case it's preferable to run this from the command line or in an independent JVM, Hudi provides a `HiveSyncTool`, which can be invoked as below,
+once you have built the hoodie-hive module.
 
 ```
- [hoodie-hive]$ java -cp target/hoodie-hive-0.3.6-SNAPSHOT-jar-with-dependencies.jar:target/jars/* com.uber.hoodie.hive.HiveSyncTool --help
+cd hoodie-hive
+ [hoodie-hive]$ ./run_sync_tool.sh --help
 Usage: <main class> [options]
   Options:
   * --base-path
@@ -154,7 +146,6 @@ Usage: <main class> [options]
   * --database
        name of the target database in Hive
     --help, -h
-
        Default: false
   * --jdbc-url
        Hive jdbc connect url
@@ -164,70 +155,22 @@ Usage: <main class> [options]
        name of the target table in Hive
   * --user
        Hive username
-
-
 ```
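+
+When writing through the datasource, the same Hive sync can alternatively be enabled inline via write options. The sketch below assumes
+the Hive sync option keys exposed by `DataSourceWriteOptions` (see the [configurations](configurations.html) page for the exact names); all values are illustrative placeholders:
+
+```
+// sketch: enable Hive sync as part of the datasource write itself (option keys assumed, values illustrative)
+inputDF.write()
+       .format("com.uber.hoodie")
+       .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY(), "true")
+       .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY(), "default")
+       .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY(), tableName)
+       .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), "jdbc:hive2://hiveserver:10000")
+       .option(DataSourceWriteOptions.HIVE_USER_OPT_KEY(), "hive")
+       .option(DataSourceWriteOptions.HIVE_PASS_OPT_KEY(), "hive")
+       .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY(), "dt")
+       .mode(SaveMode.Append)
+       .save(basePath);
+```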
 
-## Incrementally Pulling
-
-Hudi datasets can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows since a specified commit timestamp.
-This, together with upserts, are particularly useful for building data pipelines where 1 or more source Hudi tables are incrementally pulled (streams/facts),
-joined with other tables (datasets/dimensions), to produce deltas to a target Hudi dataset. Then, using the delta streamer tool these deltas can be upserted into the
-target Hudi dataset to complete the pipeline.
-
-#### Via Spark Job
-The `hoodie-spark` module offers the DataSource API, offers a more elegant way to pull data from Hudi dataset (plus more) and process it via Spark.
-This class can be used within existing Spark jobs and offers the following functionality.
-
-A sample incremental pull, that will obtain all records written since `beginInstantTime`, looks like below.
-
-```
- Dataset<Row> hoodieIncViewDF = spark.read()
-     .format("com.uber.hoodie")
-     .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY(),
-             DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL())
-     .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(),
-            <beginInstantTime>)
-     .load(tablePath); // For incremental view, pass in the root/base path of dataset
-```
-
-Please refer to [configurations](configurations.html) section, to view all datasource options.
-
-
-Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing.
-
-| **API** | **Description** |
-| read(keys) | Read out the data corresponding to the keys as a DataFrame, using Hudi's own index for faster lookup |
-| filterExists() | Filter out already existing records from the provided RDD[HoodieRecord]. Useful for de-duplication |
-| checkExists(keys) | Check if the provided keys exist in a Hudi dataset |
-
-
-#### HiveIncrementalPuller Tool
-`HiveIncrementalPuller` allows the above to be done via HiveQL, combining the benefits of Hive (reliably process complex SQL queries) and incremental primitives
-(speed up query by pulling tables incrementally instead of scanning fully). The tool uses Hive JDBC to run the Hive query saving its results in a temp table.
-that can later be upserted. Upsert utility (`HoodieDeltaStreamer`) has all the state it needs from the directory structure to know what should be the commit time on the target table.
-e.g: `/app/incremental-hql/intermediate/{source_table_name}_temp/{last_commit_included}`.The Delta Hive table registered will be of the form `{tmpdb}.{source_table}_{last_commit_included}`.
-
-The following are the configuration options for HiveIncrementalPuller
+## Storage Management
 
-| **Config** | **Description** | **Default** |
-|hiveUrl| Hive Server 2 URL to connect to |  |
-|hiveUser| Hive Server 2 Username |  |
-|hivePass| Hive Server 2 Password |  |
-|queue| YARN Queue name |  |
-|tmp| Directory where the temporary delta data is stored in DFS. The directory structure will follow conventions. Please see the below section.  |  |
-|extractSQLFile| The SQL to execute on the source table to extract the data. The data extracted will be all the rows that changed since a particular point in time. |  |
-|sourceTable| Source Table Name. Needed to set hive environment properties. |  |
-|targetTable| Target Table Name. Needed for the intermediate storage directory structure.  |  |
-|sourceDataPath| Source DFS Base Path. This is where the Hudi metadata will be read. |  |
-|targetDataPath| Target DFS Base path. This is needed to compute the fromCommitTime. This is not needed if fromCommitTime is specified explicitly. |  |
-|tmpdb| The database to which the intermediate temp delta table will be created | hoodie_temp |
-|fromCommitTime| This is the most important parameter. This is the point in time from which the changed records are pulled from.  |  |
-|maxCommits| Number of commits to include in the pull. Setting this to -1 will include all the commits from fromCommitTime. Setting this to a value > 0, will include records that ONLY changed in the specified number of commits after fromCommitTime. This may be needed if you need to catch up say 2 commits at a time. | 3 |
-|help| Utility Help |  |
+Hudi also performs several key storage management functions on the data stored in a Hudi dataset. A key aspect of storing data on DFS is managing file sizes and counts
+and reclaiming storage space. For example, HDFS is infamous for its handling of small files, which exerts memory/RPC pressure on the Name Node and can potentially destabilize
+the entire cluster. In general, query engines provide much better performance on adequately sized columnar files, since they can effectively amortize the cost of obtaining
+column statistics etc. Even on some cloud data stores, there is often a cost to listing directories with a large number of small files.
 
+Here are some ways to efficiently manage the storage of your Hudi datasets (a configuration sketch follows the list below).
 
-Setting the fromCommitTime=0 and maxCommits=-1 will pull in the entire source dataset and can be used to initiate backfills. If the target dataset is a Hudi dataset,
-then the utility can determine if the target dataset has no commits or is behind more than 24 hour (this is configurable),
-it will automatically use the backfill configuration, since applying the last 24 hours incrementally could take more time than doing a backfill. The current limitation of the tool
-is the lack of support for self-joining the same table in mixed mode (normal and incremental modes).
+ - The [small file handling feature](configurations.html#compactionSmallFileSize) in Hudi profiles the incoming workload
+   and distributes inserts to existing file groups, instead of creating new file groups which can lead to small files.
+ - The cleaner can be [configured](configurations.html#retainCommits) to clean up older file slices more or less aggressively, depending on the maximum time for queries to run & the lookback needed for incremental pull
+ - Users can also tune the size of the [base/parquet file](configurations.html#limitFileSize), [log files](configurations.html#logFileMaxSize) & the expected [compression ratio](configurations.html#parquetCompressionRatio),
+   such that a sufficient number of inserts are grouped into the same file group, ultimately resulting in well-sized base files.
+ - Intelligently tuning the [bulk insert parallelism](configurations.html#withBulkInsertParallelism) can again result in nicely sized initial file groups. It is in fact critical to get this right, since file groups,
+   once created, cannot be deleted, but only expanded as explained before.
+ - For workloads with heavy updates, the [merge-on-read storage](concepts.html#merge-on-read-storage) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
\ No newline at end of file
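+
+As a concrete illustration of the knobs above, here is a sketch of setting them together when building a write config programmatically.
+The builder method names are assumed to match the configuration anchors linked above (these builders live in the hoodie-client config classes); `basePath` is a placeholder and all values are illustrative only:
+
+```
+// sketch: tuning file sizing, cleaning & bulk insert knobs on the write config (values illustrative)
+HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
+    .withPath(basePath)
+    .withCompactionConfig(HoodieCompactionConfig.newBuilder()
+        .compactionSmallFileSize(100 * 1024 * 1024)  // treat files under ~100MB as "small" and route inserts into them
+        .retainCommits(10)                           // cleaner retains file slices needed by the last 10 commits
+        .build())
+    .withStorageConfig(HoodieStorageConfig.newBuilder()
+        .limitFileSize(120 * 1024 * 1024)            // target size for base/parquet files
+        .logFileMaxSize(1024 * 1024 * 1024)          // cap individual log file size before rolling over
+        .parquetCompressionRatio(0.1)                // expected compression ratio used when sizing files
+        .build())
+    .withBulkInsertParallelism(500)                  // controls how many (and how large) initial file groups are created
+    .build();
+```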