Posted to commits@hudi.apache.org by vi...@apache.org on 2019/03/13 22:41:15 UTC

[incubator-hudi-site] 12/19: Revised community, contributing pages

This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi-site.git

commit 94f64a7c7e8c8ec0307fbd78c5ec13ec0a7b9175
Author: Vinoth Chandar <vi...@uber.com>
AuthorDate: Mon Feb 25 07:01:53 2019 -0800

    Revised community, contributing pages
    
     - Community engagement instructions
     - Strawman contribution guide, to get us going
     - Fixed broken image urls from the hudi renames
     - Fixed broken code formatting on a couple of pages
     - Removed api_setup, roadmap pages and cleaned up structure
---
 .gitignore                                         |   1 +
 docs/README.md                                     |   5 +
 docs/_config.yml                                   |   2 +-
 docs/_data/topnav.yml                              |  24 ++-
 docs/_includes/footer.html                         |   6 +
 docs/_posts/2019-01-18-asf-incubation.md           |  10 ++
 docs/admin_guide.md                                |  22 ++-
 docs/api_docs.md                                   |  10 --
 docs/code_and_design.md                            |  38 -----
 docs/community.md                                  |  38 +++--
 docs/concepts.md                                   |  28 ++--
 docs/configurations.md                             |  38 +++--
 docs/contributing.md                               | 101 +++++++++++++
 docs/dev_setup.md                                  |  13 --
 docs/images/hoodie_cow.png                         | Bin 31136 -> 0 bytes
 docs/images/hoodie_mor.png                         | Bin 56002 -> 0 bytes
 docs/images/hudi_cow.png                           | Bin 0 -> 48994 bytes
 docs/images/hudi_mor.png                           | Bin 0 -> 92073 bytes
 .../{hoodie_timeline.png => hudi_timeline.png}     | Bin
 docs/implementation.md                             | 165 +++++++++++----------
 docs/index.md                                      |   7 +-
 docs/migration_guide.md                            |  70 ++++-----
 docs/quickstart.md                                 |  89 +++++------
 docs/roadmap.md                                    |  14 --
 docs/sql_queries.md                                |   5 +-
 25 files changed, 383 insertions(+), 303 deletions(-)

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..e43b0f9
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1 @@
+.DS_Store
diff --git a/docs/README.md b/docs/README.md
index 0995250..8593206 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -11,6 +11,11 @@ The site is based on a [Jekyll](https://jekyllrb.com/) theme hosted [here](idrat
 
 Simply run `docker-compose build --no-cache && docker-compose up` from the `docs` folder and the site should be up & running at `http://localhost:4000`
 
+To see your edits reflected on the site, you may have to bounce the container:
+
+ - Stop the existing container by `ctrl+c`-ing the docker-compose program
+ - (or) alternatively, stop it via `docker stop docs_server_1`
+ - Bring the container up again using `docker-compose up`
 
 #### Host OS
 
diff --git a/docs/_config.yml b/docs/_config.yml
index 781bdb6..9f0effd 100644
--- a/docs/_config.yml
+++ b/docs/_config.yml
@@ -77,7 +77,7 @@ defaults:
 sidebars:
 - mydoc_sidebar
 
-description: "Apache Hudi (pronounced “Hoodie”) is a Spark Library, that provides upserts and incremental processing capaibilities on Hadoop datasets"
+description: "Apache Hudi (pronounced “Hoodie”) provides upserts and incremental processing capabilities on Big Data"
 # the description is used in the feed.xml file
 
 # needed for sitemap.xml file only
diff --git a/docs/_data/topnav.yml b/docs/_data/topnav.yml
index 190573a..0042feb 100644
--- a/docs/_data/topnav.yml
+++ b/docs/_data/topnav.yml
@@ -7,24 +7,22 @@ topnav:
       url: /news
     - title: Community
       url: /community.html
-    - title: Github
+    - title: Code
       external_url: https://github.com/uber/hoodie
 
 #Topnav dropdowns
 topnav_dropdowns:
 - title: Topnav dropdowns
   folders:
-    - title: Developer Resources
+    - title: Developers
       folderitems:
-          - title: Setup
-            url: /dev_setup.html
-            output: web
-          - title: API Docs
-            url: /api_docs.html
-            output: web
-          - title: Code Structure
-            url: /code_and_design.html
-            output: web
-          - title: Roadmap
-            url: /roadmap.html
+          - title: Contributing
+            url: /contributing.html
             output: web
+          - title: Wiki/Designs
+            external_url: https://cwiki.apache.org/confluence/display/HUDI
+          - title: Issues
+            external_url: https://issues.apache.org/jira/projects/HUDI/summary
+          - title: Blog
+            external_url: https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI
+      
diff --git a/docs/_includes/footer.html b/docs/_includes/footer.html
index 00605db..c920c5c 100755
--- a/docs/_includes/footer.html
+++ b/docs/_includes/footer.html
@@ -8,6 +8,12 @@
                   <a class="footer-link-img" href="https://apache.org">
                     <img src="images/asf_logo.svg" alt="The Apache Software Foundation" height="100px" widh="50px"></a>
                   </p>
+                  <p>
+                  Apache Hudi is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the <a href="http://incubator.apache.org/">Apache Incubator</a>.
+                  Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have
+                  stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a
+                  reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
+                  </p>
                 </div>
             </div>
 </footer>
diff --git a/docs/_posts/2019-01-18-asf-incubation.md b/docs/_posts/2019-01-18-asf-incubation.md
new file mode 100644
index 0000000..79de37c
--- /dev/null
+++ b/docs/_posts/2019-01-18-asf-incubation.md
@@ -0,0 +1,10 @@
+---
+title:  "Hudi entered Apache Incubator"
+categories:  update
+permalink: strata-talk.html
+tags: [news]
+---
+
+In the coming weeks, we will be moving into our new home at the Apache Incubator.
+
+{% include links.html %}
diff --git a/docs/admin_guide.md b/docs/admin_guide.md
index 7f7e610..3d37d22 100644
--- a/docs/admin_guide.md
+++ b/docs/admin_guide.md
@@ -43,7 +43,9 @@ hoodie->create --path /user/hive/warehouse/table1 --tableName hoodie_table_1 --t
 ```
 
 To see the description of hoodie table, use the command:
+
 ```
+
 hoodie:hoodie_table_1->desc
 18/09/06 15:57:19 INFO timeline.HoodieActiveTimeline: Loaded instants []
     _________________________________________________________
@@ -55,6 +57,7 @@ hoodie:hoodie_table_1->desc
     | hoodie.table.name       | hoodie_table_1               |
     | hoodie.table.type       | COPY_ON_WRITE                |
     | hoodie.archivelog.folder|                              |
+
 ```
 
 Following is a sample command to connect to a Hoodie dataset contains uber trips.
@@ -183,7 +186,7 @@ order (See Concepts). The below commands allow users to view the file-slices for
  | Partition | FileId | Base-Instant | Data-File | Data-File Size| Num Delta Files| Total Delta Size| Delta Size - compaction scheduled| Delta Size - compaction unscheduled| Delta To Base Ratio - compaction scheduled| Delta To Base Ratio - compaction unscheduled| Delta Files - compaction scheduled | Delta Files - compaction unscheduled|
  |========================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================== [...]
  | 2018/08/31| 111415c3-f26d-4639-86c8-f9956f245ac3| 20181002180759| hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/111415c3-f26d-4639-86c8-f9956f245ac3_0_20181002180759.parquet| 432.5 KB | 1 | 20.8 KB | 20.8 KB | 0.0 B | 0.0 B | 0.0 B | [HoodieLogFile {hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/.111415c3-f26d-4639-86c8-f9956f245ac3_20181002180759.log.1}]| [] |
- 
+
  hoodie:stock_ticks_mor->
 ```
 
@@ -224,7 +227,7 @@ This is a sequence file that contains a mapping from commitNumber => json with r
 
 #### Compactions
 
-To get an idea of the lag between compaction and writer applications, use the below command to list down all 
+To get an idea of the lag between compaction and writer applications, use the below command to list down all
 pending compactions.
 
 ```
@@ -316,7 +319,7 @@ hoodie:stock_ticks_mor->compaction validate --instant 20181005222611
 ...
 
    COMPACTION PLAN VALID
-   
+
     ___________________________________________________________________________________________________________________________________________________________________________________________________________________________
     | File Id                             | Base Instant Time| Base Data File                                                                                                                   | Num Delta Files| Valid| Error|
     |==========================================================================================================================================================================================================================|
@@ -340,14 +343,15 @@ hoodie:stock_ticks_mor->compaction validate --instant 20181005222601
 
 The following commands must be executed without any other writer/ingestion application running.
 
-Sometimes, it becomes necessary to remove a fileId from a compaction-plan inorder to speed-up or unblock compaction 
-operation. Any new log-files that happened on this file after the compaction got scheduled will be safely renamed 
+Sometimes, it becomes necessary to remove a fileId from a compaction plan in order to speed up or unblock a compaction
+operation. Any new log-files added to this file after the compaction was scheduled will be safely renamed
 so that they are preserved. Hudi provides the following CLI to support this.
 
 
 ##### UnScheduling Compaction
 
 ```
+
 hoodie:trips->compaction unscheduleFileId --fileId <FileUUID>
 ....
 No File renames needed to unschedule file from pending compaction. Operation successful.
@@ -356,24 +360,28 @@ No File renames needed to unschedule file from pending compaction. Operation suc
 
 In other cases, an entire compaction plan needs to be reverted. This is supported by the following CLI
 ```
+
 hoodie:trips->compaction unschedule --compactionInstant <compactionInstant>
 .....
 No File renames needed to unschedule pending compaction. Operation successful.
+
 ```
-  
+
 ##### Repair Compaction
 
 The above compaction unscheduling operations could sometimes fail partially (e:g -> HDFS temporarily unavailable). With
-partial failures, the compaction operation could become inconsistent with the state of file-slices. When you run 
+partial failures, the compaction operation could become inconsistent with the state of file-slices. When you run
 `compaction validate`, you can notice invalid compaction operations if there is one.  In these cases, the repair
 command comes to the rescue, it will rearrange the file-slices so that there is no loss and the file-slices are
 consistent with the compaction plan
 
 ```
+
 hoodie:stock_ticks_mor->compaction repair --instant 20181005222611
 ......
 Compaction successfully repaired
 .....
+
 ```
 
 
diff --git a/docs/api_docs.md b/docs/api_docs.md
deleted file mode 100644
index 24bfd6b..0000000
--- a/docs/api_docs.md
+++ /dev/null
@@ -1,10 +0,0 @@
----
-title: API Docs
-keywords: usecases
-sidebar: mydoc_sidebar
-permalink: api_docs.html
----
-
-Work In Progress
-
-
diff --git a/docs/code_and_design.md b/docs/code_and_design.md
deleted file mode 100644
index 3baaa97..0000000
--- a/docs/code_and_design.md
+++ /dev/null
@@ -1,38 +0,0 @@
----
-title: Code Structure
-keywords: usecases
-sidebar: mydoc_sidebar
-permalink: code_and_design.html
----
-
-## Code & Project Structure
-
- * hoodie-client     : Spark client library to take a bunch of inserts + updates and apply them to a Hoodie table
- * hoodie-common     : Common code shared between different artifacts of Hoodie
-
- ## HoodieLogFormat
-
- The following diagram depicts the LogFormat for Hoodie MergeOnRead. Each logfile consists of one or more log blocks.
- Each logblock follows the format shown below.
-
- | Field  | Description |
- |-------------- |------------------|
- | MAGIC    | A magic header that marks the start of a block |
- | VERSION  | The version of the LogFormat, this helps define how to switch between different log format as it evolves |
- | TYPE     | The type of the log block |
- | HEADER LENGTH | The length of the headers, 0 if no headers |
- | HEADER        | Metadata needed for a log block. For eg. INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA etc. |
- | CONTENT LENGTH |  The length of the content of the log block |
- | CONTENT        | The content of the log block, for example, for a DATA_BLOCK, the content is (number of records + actual records) in byte [] |
- | FOOTER LENGTH  | The length of the footers, 0 if no footers |
- | FOOTER         | Metadata needed for a log block. For eg. index entries, a bloom filter for records in a DATA_BLOCK etc. |
- | LOGBLOCK LENGTH | The total number of bytes written for a log block, typically the SUM(everything_above). This is a LONG. This acts as a reverse pointer to be able to traverse the log in reverse.|
-
-
- {% include image.html file="hoodie_log_format_v2.png" alt="hoodie_log_format_v2.png" %}
-
-
-
-
-
-
diff --git a/docs/community.md b/docs/community.md
index c508191..c16dc92 100644
--- a/docs/community.md
+++ b/docs/community.md
@@ -6,17 +6,35 @@ toc: false
 permalink: community.html
 ---
 
+## Engage with us
+
+There are several ways to get in touch with the Hudi community.
+
+| When? | Channel to use |
+|-------|--------|
+| For any general questions, user support, development discussions | Dev Mailing list ([Subscribe](mailto:dev-subscribe@hudi.apache.org), [Unsubscribe](mailto:dev-unsubscribe@hudi.apache.org), [Archives](https://lists.apache.org/list.html?dev@hudi.apache.org)). Empty email works for subscribe/unsubscribe |
+| For reporting bugs or issues, or discovering known issues | Please use [ASF Hudi JIRA](https://issues.apache.org/jira/projects/HUDI/summary) |
+| For quick pings & 1-1 chats | Join our [slack group](https://join.slack.com/t/apache-hudi/signup) |
+| For proposing large features or changes | Start a Hudi Improvement Process (HIP). Instructions coming soon.|
+| For stream of commits, pull requests etc | Commits Mailing list ([Subscribe](mailto:commits-subscribe@hudi.apache.org), [Unsubscribe](mailto:commits-unsubscribe@hudi.apache.org), [Archives](https://lists.apache.org/list.html?commits@hudi.apache.org)) |
+
+If you wish to report a security vulnerability, please contact [security@apache.org](mailto:security@apache.org).
+Apache Hudi follows the typical Apache vulnerability handling [process](https://apache.org/security/committers.html#vulnerability-handling).
+
 ## Contributing
-We :heart: contributions. If you find a bug in the library or would like to add new features, go ahead and open
-issues or pull requests against this repo. Before you do so, please sign the
-[Apache CLA](https://www.apache.org/licenses/icla.pdf).
-Also, be sure to write unit tests for your bug fix or feature to show that it works as expected.
-If the reviewer feels this contributions needs to be in the release notes, please add it to CHANGELOG.md as well.
 
-If you want to participate in day-day conversations, please join our [slack group](https://join.slack.com/t/apache-hudi/signup).
-If you are from select pre-listed email domains, you can self signup. Others, please subscribe to dev@hudi.apache.org
+The Apache Hudi community welcomes contributions from anyone!
+
+Here are a few ways you can get involved.
+
+ - Ask and/or answer questions on our support channels listed above.
+ - Review code or HIPs
+ - Help improve documentation
+ - Test releases and improve the out-of-box experience by reporting bugs
+ - Share new ideas/directions to pursue, or propose a new HIP
+ - Contribute code to the project
 
-## Becoming a Committer
+#### Code Contributions
 
-Hoodie has adopted a lot of guidelines set forth in [Google Chromium project](https://www.chromium.org/getting-involved/become-a-committer), to determine committership proposals. However, given this is a much younger project, we would have the contribution bar to be 10-15 non-trivial patches instead.
-Additionally, we expect active engagement with the community over a few months, in terms of conference/meetup talks, helping out with issues/questions on slack/github.
+Useful resources for contributing can be found under the "Developers" top menu.
+Specifically, please refer to the detailed [contribution guide](contributing.html).
diff --git a/docs/concepts.md b/docs/concepts.md
index 5ce3fc6..845228a 100644
--- a/docs/concepts.md
+++ b/docs/concepts.md
@@ -20,7 +20,7 @@ Such key activities include
  * `COMMITS` - A single commit captures information about an **atomic write** of a batch of records into a dataset.
        Commits are identified by a monotonically increasing timestamp, denoting the start of the write operation.
  * `CLEANS` - Background activity that gets rid of older versions of files in the dataset, that are no longer needed.
- * `DELTA_COMMITS` - A single commit captures information about an **atomic write** of a batch of records into a 
+ * `DELTA_COMMITS` - A single commit captures information about an **atomic write** of a batch of records into a
  MergeOnRead storage type of dataset
  * `COMPACTIONS` - Background activity to reconcile differential data structures within Hudi e.g: moving updates from row based log files to columnar formats.
 
@@ -37,15 +37,15 @@ only the changed files without say scanning all the time buckets > 07:00.
 
 ## Terminologies
 
- * `Hudi Dataset` 
-    A structured hive/spark dataset managed by Hudi. Hudi supports both partitioned and non-partitioned Hive tables. 
- * `Commit` 
-    A commit marks a new batch of data applied to a dataset. Hudi maintains  monotonically increasing timestamps to track commits and guarantees that a commit is atomically 
+ * `Hudi Dataset`
+    A structured hive/spark dataset managed by Hudi. Hudi supports both partitioned and non-partitioned Hive tables.
+ * `Commit`
+    A commit marks a new batch of data applied to a dataset. Hudi maintains  monotonically increasing timestamps to track commits and guarantees that a commit is atomically
     published.
  * `Commit Timeline`
-    Commit Timeline refers to the sequence of Commits that was applied in order on a dataset over its lifetime. 
- * `File Slice` 
-    Hudi provides efficient handling of updates by having a fixed mapping between record key to a logical file Id. 
+    Commit Timeline refers to the sequence of Commits that were applied in order on a dataset over its lifetime.
+ * `File Slice`
+    Hudi provides efficient handling of updates by maintaining a fixed mapping from a record key to a logical file Id.
     Hudi uses MVCC to provide atomicity and isolation of readers from a writer. This means that a logical fileId will
     have many physical versions of it. Each of these physical version of a file represents a complete view of the
     file as of a commit and is called File Slice
@@ -69,8 +69,6 @@ Hudi (will) supports the following storage types.
   - Copy On Write : A heavily read optimized storage type, that simply creates new versions of files corresponding to the records that changed.
   - Merge On Read : Also provides a near-real time datasets in the order of 5 mins, by shifting some of the write cost, to the reads and merging incoming and on-disk data on-the-fly
 
-{% include callout.html content="Hudi is a young project. merge-on-read is currently underway. Get involved [here](https://github.com/uber/Hudi/projects/1)" type="info" %}
-
 Regardless of the storage type, Hudi organizes a datasets into a directory structure under a `basepath`,
 very similar to Hive tables. Dataset is broken up into partitions, which are folders containing files for that partition.
 Each partition uniquely identified by its `partitionpath`, which is relative to the basepath.
@@ -92,12 +90,12 @@ commit, such that only columnar data exists. As a result, the write amplificatio
 Following illustrates how this works conceptually, when  data written into copy-on-write storage  and two queries running on top of it.
 
 
-{% include image.html file="Hudi_cow.png" alt="Hudi_cow.png" %}
+{% include image.html file="hudi_cow.png" alt="hudi_cow.png" %}
 
 
 As data gets written, updates to existing file ids, produce a new version for that file id stamped with the commit and
 inserts allocate a new file id and write its first version for that file id. These file versions and their commits are color coded above.
-Normal SQL queries running against such dataset (eg: select count(*) counting the total records in that partition), first checks the timeline for latest commit
+Normal SQL queries running against such a dataset (eg: `select count(*)` counting the total records in that partition) first check the timeline for the latest commit
 and filter all but the latest versions of each file id. As you can see, an old query does not see the current inflight commit's files colored in pink,
 but a new query starting after the commit picks up the new data. Thus queries are immune to any write failures/partial writes and only run on committed data.
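+
+To make this file-version filtering concrete, the following is a minimal, hypothetical sketch (illustration only, not Hudi's actual reader code) of picking the latest committed version of each file id, given (fileId, commitTime) pairs and the set of completed commit times from the timeline:
+
+```
+import java.util.*;
+
+class LatestFileVersionResolver {
+  // For each file id, keep the greatest commit time that belongs to a completed
+  // commit; versions written by inflight/failed commits are skipped entirely.
+  static Map<String, String> latestCommittedVersions(List<String[]> fileVersions,
+                                                     Set<String> completedCommits) {
+    Map<String, String> latest = new HashMap<>();
+    for (String[] version : fileVersions) {      // each entry: {fileId, commitTime}
+      String fileId = version[0];
+      String commitTime = version[1];
+      if (!completedCommits.contains(commitTime)) {
+        continue;                                // ignore uncommitted file versions
+      }
+      // commit timestamps are monotonically increasing strings, so string
+      // comparison picks the most recent committed version
+      latest.merge(fileId, commitTime, (a, b) -> a.compareTo(b) >= 0 ? a : b);
+    }
+    return latest;                               // fileId -> commit time of the version to read
+  }
+}
+```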
 
@@ -118,7 +116,7 @@ their columnar base data, to keep the query performance in check (larger append
 
 Following illustrates how the storage works, and shows queries on both near-real time table and read optimized table.
 
-{% include image.html file="Hudi_mor.png" alt="Hudi_mor.png" max-width="1000" %}
+{% include image.html file="hudi_mor.png" alt="hudi_mor.png" max-width="1000" %}
 
 
 There are lot of interesting things happening in this example, which bring out the subleties in the approach.
@@ -135,8 +133,6 @@ There are lot of interesting things happening in this example, which bring out t
  strategy, where we aggressively compact the latest partitions compared to older partitions, we could ensure the RO Table sees data
  published within X minutes in a consistent fashion.
 
-{% include callout.html content="Hudi is a young project. merge-on-read is currently underway. Get involved [here](https://github.com/uber/hoodie/projects/1)" type="info" %}
-
 The intention of merge on read storage, is to enable near real-time processing directly on top of Hadoop, as opposed to copying
 data out to specialized systems, which may not be able to handle the data volume.
 
@@ -156,4 +152,4 @@ data out to specialized systems, which may not be able to handle the data volume
 | Trade-off | ReadOptimized | RealTime |
 |-------------- |------------------| ------------------|
 | Data Latency | Higher   | Lower |
-| Query Latency | Lower (raw columnar performance) | Higher (merge columnar + row based delta) |
\ No newline at end of file
+| Query Latency | Lower (raw columnar performance) | Higher (merge columnar + row based delta) |
diff --git a/docs/configurations.md b/docs/configurations.md
index 50a7e5f..e6602e6 100644
--- a/docs/configurations.md
+++ b/docs/configurations.md
@@ -136,7 +136,7 @@ summary: "Here we list all possible configurations and what they mean"
         Actual value ontained by invoking .toString()</span>
         - [KEYGENERATOR_CLASS_OPT_KEY](#KEYGENERATOR_CLASS_OPT_KEY) (Default: com.uber.hoodie.SimpleKeyGenerator) <br/>
         <span style="color:grey">Key generator class, that implements will extract the key out of incoming `Row` object</span>
-        - [COMMIT_METADATA_KEYPREFIX_OPT_KEY](#COMMIT_METADATA_KEYPREFIX_OPT_KEY) (Default: _) <br/>
+        - [COMMIT_METADATA_KEYPREFIX_OPT_KEY](#COMMIT_METADATA_KEYPREFIX_OPT_KEY) (Default: `_`) <br/>
         <span style="color:grey">Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata.
         This is useful to store checkpointing information, in a consistent way with the hoodie timeline</span>
 
@@ -160,22 +160,33 @@ summary: "Here we list all possible configurations and what they mean"
 
 Writing data via Hudi happens as a Spark job and thus general rules of spark debugging applies here too. Below is a list of things to keep in mind, if you are looking to improving performance or reliability.
 
- - **Write operations** : Use `bulkinsert` to load new data into a table, and there on use `upsert`/`insert`. 
+**Write operations** : Use `bulkinsert` to load new data into a table, and thereafter use `upsert`/`insert`.
  Difference between them is that bulk insert uses a disk based write path to scale to load large inputs without need to cache it.
- - **Input Parallelism** : By default, Hoodie tends to over-partition input (i.e `withParallelism(1500)`), to ensure each Spark partition stays within the 2GB limit for inputs upto 500GB. Bump this up accordingly if you have larger inputs. We recommend having shuffle parallelism `hoodie.[insert|upsert|bulkinsert].shuffle.parallelism` such that its atleast input_data_size/500MB
- - **Off-heap memory** : Hoodie writes parquet files and that needs good amount of off-heap memory proportional to schema width. Consider setting something like `spark.yarn.executor.memoryOverhead` or `spark.yarn.driver.memoryOverhead`, if you are running into such failures.
- - **Spark Memory** : Typically, hoodie needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.storage.memoryFraction` will generally help boost performance.
- - **Sizing files** : Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
- - **Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases
+
+**Input Parallelism** : By default, Hoodie tends to over-partition input (i.e. `withParallelism(1500)`), to ensure each Spark partition stays within the 2GB limit for inputs up to 500GB. Bump this up accordingly if you have larger inputs. We recommend having shuffle parallelism `hoodie.[insert|upsert|bulkinsert].shuffle.parallelism` such that it is at least input_data_size/500MB
+
+**Off-heap memory** : Hoodie writes parquet files and that needs a good amount of off-heap memory proportional to schema width. Consider setting something like `spark.yarn.executor.memoryOverhead` or `spark.yarn.driver.memoryOverhead`, if you are running into such failures.
+
+**Spark Memory** : Typically, hoodie needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accommodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.storage.memoryFraction` will generally help boost performance.
+
+**Sizing files** : Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
+
+**Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more voluminous with a lot more records per partition. In such cases
     - Consider tuning the bloom filter accuracy via `.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look up time
     - Consider making a key that is prefixed with time of the event, which will enable range pruning & significantly speeding up index lookup.
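+
+As a concrete illustration of the time-prefixed key suggestion above, here is a small, hypothetical helper (illustration only, not an actual Hudi `KeyGenerator` implementation) that builds record keys whose prefix sorts by event time, so that a time-range lookup only has to consider files whose key ranges overlap that prefix:
+
+```
+import java.time.Instant;
+import java.time.ZoneOffset;
+import java.time.format.DateTimeFormatter;
+
+class TimePrefixedKeys {
+  private static final DateTimeFormatter FMT =
+      DateTimeFormatter.ofPattern("yyyyMMddHHmmss").withZone(ZoneOffset.UTC);
+
+  // Builds keys like "20190225070153_event-42"; the fixed-width time prefix
+  // makes keys for the same time window cluster together for range pruning.
+  static String recordKey(long eventTimeMillis, String eventId) {
+    return FMT.format(Instant.ofEpochMilli(eventTimeMillis)) + "_" + eventId;
+  }
+}
+```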
- - **GC Tuning** : Please be sure to follow garbage collection tuning tips from Spark tuning guide to avoid OutOfMemory errors
-    - [Must] Use G1/CMS Collector. Sample CMS Flags to add to spark.executor.extraJavaOptions : ``-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/ho [...]
-    - If it keeps OOMing still, reduce spark memory conservatively: `spark.memory.fraction=0.2, spark.memory.storageFraction=0.2` allowing it to spill rather than OOM. (reliably slow vs crashing intermittently)
 
- Below is a full working production config
+**GC Tuning** : Please be sure to follow the garbage collection tuning tips from the Spark tuning guide to avoid OutOfMemory errors.
+[Must] Use the G1/CMS collector. Sample CMS flags to add to `spark.executor.extraJavaOptions`:
 
- ```
+```
+-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
+```
+
+If it keeps OOMing still, reduce spark memory conservatively: `spark.memory.fraction=0.2, spark.memory.storageFraction=0.2` allowing it to spill rather than OOM. (reliably slow vs crashing intermittently)
+
+Below is a full working production config
+
+```
  spark.driver.extraClassPath    /etc/hive/conf
  spark.driver.extraJavaOptions    -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
  spark.driver.maxResultSize    2g
@@ -200,4 +211,5 @@ Writing data via Hudi happens as a Spark job and thus general rules of spark deb
  spark.yarn.driver.memoryOverhead    1024
  spark.yarn.executor.memoryOverhead    3072
  spark.yarn.max.executor.failures    100
- ```
+
+```
diff --git a/docs/contributing.md b/docs/contributing.md
new file mode 100644
index 0000000..a93ba54
--- /dev/null
+++ b/docs/contributing.md
@@ -0,0 +1,101 @@
+---
+title: Developer Setup
+keywords: developer setup
+sidebar: mydoc_sidebar
+toc: false
+permalink: contributing.html
+---
+## Pre-requisites
+
+To contribute code, you need
+
+ - a GitHub account
+ - a Linux (or) macOS development environment with Java JDK 8, Apache Maven (3.x+) installed
+ - [Docker](https://www.docker.com/) installed for running demo, integ tests or building website
+ - for large contributions, a signed [Individual Contributor License
+   Agreement](https://www.apache.org/licenses/icla.pdf) (ICLA) to the Apache
+   Software Foundation (ASF).
+ - (Recommended) Create an account on [JIRA](https://issues.apache.org/jira/projects/HUDI/summary) to open issues/find similar issues.
+ - (Recommended) Join our dev mailing list & slack channel, listed on [community](community.html) page.
+
+
+## IDE Setup
+
+To contribute, you will need to fork the Hudi code on Github & then clone your own fork locally. Once cloned, we recommend building as per the instructions on the [quickstart](quickstart.html) page.
+
+We have embraced a code style largely based on [google format](https://google.github.io/styleguide/javaguide.html). Please set up your IDE with the style files from [here](../style/).
+These instructions have been tested on IntelliJ. We also recommend setting up the [Save Action Plugin](https://plugins.jetbrains.com/plugin/7642-save-actions) to auto format & organize imports on save. The Maven Compilation life-cycle will fail if there are checkstyle violations.
+
+
+## Lifecycle
+
+Here's a typical lifecycle of events to contribute to Hudi.
+
+ - [Recommended] Share your intent on the mailing list, so that the community can provide early feedback and point out any similar JIRAs or HIPs.
+ - [Optional] If you want to get involved, but don't have a project in mind, please check JIRA for small, quick-starters.
+ - [Optional] Familiarize yourself with internals of Hudi using content on this page, as well as [wiki](https://cwiki.apache.org/confluence/display/HUDI)
+ - Once you finalize on a project/task, please open a new JIRA or assign an existing one to yourself. (If you don't have perms to do this, please email the dev mailing list with your JIRA id and a small intro for yourself. We'd be happy to add you as a contributor)
+ - Make your code change
+   - Every source file needs to include the Apache license header. Every new dependency needs to
+     have an open source license [compatible](https://www.apache.org/legal/resolved.html#criteria) with Apache.
+   - Get existing tests to pass using `mvn clean install -DskipITs`
+   - Add adequate tests for your new functionality
+   - [Optional] For involved changes, it's best to also run the entire integration test suite using `mvn clean install`
+   - For website changes, please build the site locally & test navigation, formatting & links thoroughly
+ - Format commit messages and the pull request title like `[HUDI-XXX] Fixes bug in Spark Datasource`,
+   where you replace HUDI-XXX with the appropriate JIRA issue.
+ - Push your commit to your own fork/branch & create a pull request (PR) against the Hudi repo.
+ - If you don't hear back on the PR within 3 days, please send an email to the dev mailing list.
+ - Address code review comments & keep pushing changes to your fork/branch, which automatically updates the PR
+ - Before your change can be merged, it should be squashed into a single commit for cleaner commit history.
+
+
+## Releases
+
+ - The Apache Hudi community plans to do minor version releases every 6 weeks or so.
+ - If your contribution is merged onto the `master` branch after the last release, it will become part of the next release.
+ - Website changes are regenerated once a week (until automation is in place to reflect them immediately).
+
+
+## Accounts and Permissions
+
+ - [Hudi issue tracker (JIRA)](https://issues.apache.org/jira/projects/HUDI/issues):
+   Anyone can access it and browse issues. Anyone can register an account and login
+   to create issues or add comments. Only contributors can be assigned issues. If
+   you want to be assigned issues, a PMC member can add you to the project contributor
+   group.  Email the dev mailing list to ask to be added as a contributor, and include your ASF Jira username.
+
+ - [Hudi Wiki Space](https://cwiki.apache.org/confluence/display/HUDI):
+   Anyone has read access. If you wish to contribute changes, please create an account and
+   request edit access on the dev@ mailing list (include your Wiki account user ID).
+
+ - Pull requests can only be merged by a HUDI committer, listed [here](https://incubator.apache.org/projects/hudi.html)
+
+ - [Voting on a release](https://www.apache.org/foundation/voting.html): Everyone can vote.
+   Only Hudi PMC members should mark their votes as binding.
+
+## Communication
+
+All communication is expected to align with the [Code of Conduct](https://www.apache.org/foundation/policies/conduct).
+Discussion about contributing code to Hudi happens on the [dev@ mailing list](community.html). Introduce yourself!
+
+
+## Code & Project Structure
+
+  * `docker` : Docker containers used by demo and integration tests. Brings up a mini data ecosystem locally
+  * `hoodie-cli` : CLI to inspect, manage and administer datasets
+  * `hoodie-client` : Spark client library to take a bunch of inserts + updates and apply them to a Hoodie table
+  * `hoodie-common` : Common classes used across modules
+  * `hoodie-hadoop-mr` : InputFormat implementations for ReadOptimized, Incremental, Realtime views
+  * `hoodie-hive` : Manage hive tables off Hudi datasets and houses the HiveSyncTool
+  * `hoodie-integ-test` : Longer running integration test processes
+  * `hoodie-spark` : Spark datasource for writing and reading Hudi datasets. Streaming sink.
+  * `hoodie-utilities` : Houses tools like DeltaStreamer, SnapshotCopier
+  * `packaging` : Poms for building out bundles for easier drop in to Spark, Hive, Presto
+  * `style`  : Code formatting, checkstyle files
+
+
+## Website
+
+[Apache Hudi site](https://hudi.apache.org) is hosted on a special `asf-site` branch. Please follow the `README` file under `docs` on that branch for
+instructions on making changes to the website.
diff --git a/docs/dev_setup.md b/docs/dev_setup.md
deleted file mode 100644
index 1bdeec7..0000000
--- a/docs/dev_setup.md
+++ /dev/null
@@ -1,13 +0,0 @@
----
-title: Developer Setup
-keywords: developer setup
-sidebar: mydoc_sidebar
-permalink: dev_setup.html
----
-
-### Code Style
-
- We have embraced the code style largely based on [google format](https://google.github.io/styleguide/javaguide.html).
- Please setup your IDE with style files from [here](../style/)
- We also recommend setting up the [Save Action Plugin](https://plugins.jetbrains.com/plugin/7642-save-actions) to auto format & organize imports on save.
- The Maven Compilation life-cycle will fail if there are checkstyle violations.
diff --git a/docs/images/hoodie_cow.png b/docs/images/hoodie_cow.png
deleted file mode 100644
index bad15a8..0000000
Binary files a/docs/images/hoodie_cow.png and /dev/null differ
diff --git a/docs/images/hoodie_mor.png b/docs/images/hoodie_mor.png
deleted file mode 100644
index 8d7d902..0000000
Binary files a/docs/images/hoodie_mor.png and /dev/null differ
diff --git a/docs/images/hudi_cow.png b/docs/images/hudi_cow.png
new file mode 100644
index 0000000..40aca71
Binary files /dev/null and b/docs/images/hudi_cow.png differ
diff --git a/docs/images/hudi_mor.png b/docs/images/hudi_mor.png
new file mode 100644
index 0000000..100b8f0
Binary files /dev/null and b/docs/images/hudi_mor.png differ
diff --git a/docs/images/hoodie_timeline.png b/docs/images/hudi_timeline.png
similarity index 100%
rename from docs/images/hoodie_timeline.png
rename to docs/images/hudi_timeline.png
diff --git a/docs/implementation.md b/docs/implementation.md
index 6215155..e87a541 100644
--- a/docs/implementation.md
+++ b/docs/implementation.md
@@ -23,7 +23,7 @@ Hudi upsert/insert is merely a Spark DAG, that can be broken into two big pieces
 
 Hudi currently provides two choices for indexes : `BloomIndex` and `HBaseIndex` to map a record key into the file id to which it belongs to. This enables
 us to speed up upserts significantly, without scanning over every record in the dataset. Hudi Indices can be classified based on
-their ability to lookup records across partition. A `global` index does not need partition information for finding the file-id for a record key 
+their ability to look up records across partitions. A `global` index does not need partition information for finding the file-id for a record key
 but a `non-global` does.
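+
+To make the distinction concrete, the following is a hypothetical sketch (illustration only, not Hudi's actual index API) of how the two lookup styles differ in the information they need:
+
+```
+import java.util.Optional;
+
+// A global index can resolve the file-id from the record key alone ...
+interface GlobalIndexLookup {
+  Optional<String> fileIdFor(String recordKey);
+}
+
+// ... while a non-global index also needs the partition the record belongs to.
+interface NonGlobalIndexLookup {
+  Optional<String> fileIdFor(String recordKey, String partitionPath);
+}
+```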
 
 #### HBase Index (global)
@@ -63,8 +63,8 @@ records such that
 
 In this storage, index updation is a no-op, since the bloom filters are already written as a part of committing data.
 
-In the case of Copy-On-Write, a single parquet file constitutes one `file slice` which contains one complete version of 
-the file 
+In the case of Copy-On-Write, a single parquet file constitutes one `file slice` which contains one complete version of
+the file.
 
 {% include image.html file="hoodie_log_format_v2.png" alt="hoodie_log_format_v2.png" max-width="1000" %}
 
@@ -73,27 +73,27 @@ the file
 In the Merge-On-Read storage model, there are 2 logical components - one for ingesting data (both inserts/updates) into the dataset
  and another for creating compacted views. The former is hereby referred to as `Writer` while the later
  is referred as `Compactor`.
- 
+
 ##### Merge On Read Writer
- 
+
  At a high level, Merge-On-Read Writer goes through same stages as Copy-On-Write writer in ingesting data.
- The key difference here is that updates are appended to latest log (delta) file belonging to the latest file slice 
+ The key difference here is that updates are appended to the latest log (delta) file belonging to the latest file slice
  without merging. For inserts, Hudi supports 2 modes:
 
    1. Inserts to Log Files - This is done for datasets that have an indexable log files (for eg global index)
    2. Inserts to parquet files - This is done for datasets that do not have indexable log files, for eg bloom index
       embedded in parquer files. Hudi treats writing new records in the same way as inserting to Copy-On-Write files.
 
-As in the case of Copy-On-Write, the input tagged records are partitioned such that all upserts destined to 
+As in the case of Copy-On-Write, the input tagged records are partitioned such that all upserts destined to
 a `file id` are grouped together. This upsert-batch is written as one or more log-blocks written to log-files.
 Hudi allows clients to control log file sizes (See [Storage Configs](../configurations))
 
 The WriteClient API is same for both Copy-On-Write and Merge-On-Read writers.
- 
+
 With Merge-On-Read, several rounds of data-writes would have resulted in accumulation of one or more log-files.
 All these log-files along with base-parquet (if exists) constitute a `file slice` which represents one complete version
-of the file. 
-  
+of the file.
+
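+To visualize this, a `file slice` can be thought of as roughly the following structure (a hypothetical sketch of the concept, not Hudi's actual class):
+
+```
+import java.util.List;
+import java.util.Optional;
+
+// One complete version of a logical file: an optional columnar base file plus
+// the delta log files accumulated on top of it since the base commit.
+class FileSliceSketch {
+  String fileId;                     // logical file id, stable across versions
+  String baseCommitTime;             // commit that started this slice
+  Optional<String> baseParquetFile;  // may be absent for log-only slices
+  List<String> deltaLogFiles;        // ordered log files to merge on read
+}
+```
+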
 #### Compactor
 
 Realtime Readers will perform in-situ merge of these delta log-files to provide the most recent (committed) view of
@@ -106,48 +106,52 @@ Asynchronous Compaction involves 2 steps:
     to be compacted atomically in a single compaction commit. Hudi allows pluggable strategies for choosing
     file slices for each compaction runs. This step is typically done inline by Writer process as Hudi expects
     only one schedule is being generated at a time which allows Hudi to enforce the constraint that pending compaction
-    plans do not step on each other file-slices. This constraint allows for multiple concurrent `Compactors` to run at 
+    plans do not step on each other's file-slices. This constraint allows for multiple concurrent `Compactors` to run at
     the same time. Some of the common strategies used for choosing `file slice` for compaction are:
-    * BoundedIO - Limit the number of file slices chosen for a compaction plan by expected total IO (read + write) 
-    needed to complete compaction run 
+    * BoundedIO - Limit the number of file slices chosen for a compaction plan by expected total IO (read + write)
+    needed to complete compaction run
     * Log File Size - Prefer file-slices with larger amounts of delta log data to be merged
     * Day Based - Prefer file slice belonging to latest day partitions
-    ```
-        API for scheduling compaction
-          /**
-           * Schedules a new compaction instant
-           * @param extraMetadata
-           * @return Compaction Instant timestamp if a new compaction plan is scheduled
-           */
-           Optional<String> scheduleCompaction(Optional<Map<String, String>> extraMetadata) throws IOException;
-     ```
+
   * `Compactor` : Hudi provides a separate API in Write Client to execute a compaction plan. The compaction
     plan (just like a commit) is identified by a timestamp. Most of the design and implementation complexities for Async
     Compaction is for guaranteeing snapshot isolation to readers and writer when
     multiple concurrent compactors are running. Typical compactor deployment involves launching a separate
     spark application which executes pending compactions when they become available. The core logic of compacting
     file slices in the Compactor is very similar to that of merging updates in a Copy-On-Write table. The only
-    difference being in the case of compaction, there is an additional step of merging the records in delta log-files. 
-    
-    Here are the main API to lookup and execute a compaction plan.
-    ```
-      Main API in HoodieWriteClient for running Compaction:
-       /**
-        * Performs Compaction corresponding to instant-time
-        * @param compactionInstantTime   Compaction Instant Time
-        * @return
-        * @throws IOException
-        */
-        public JavaRDD<WriteStatus> compact(String compactionInstantTime) throws IOException;
-    
-      To lookup all pending compactions, use the API defined in HoodieReadClient
-    
-      /**
-       * Return all pending compactions with instant time for clients to decide what to compact next.
-       * @return
-       */
-      public List<Pair<String, HoodieCompactionPlan>> getPendingCompactions();
-    ```
+    difference being that, in the case of compaction, there is an additional step of merging the records in the delta log-files.
+
+Here are the main APIs to look up and execute a compaction plan.
+
+```
+   Main API in HoodieWriteClient for running Compaction:
+   /**
+    * Performs Compaction corresponding to instant-time
+    * @param compactionInstantTime   Compaction Instant Time
+    * @return
+    * @throws IOException
+    */
+  public JavaRDD<WriteStatus> compact(String compactionInstantTime) throws IOException;
+
+  To lookup all pending compactions, use the API defined in HoodieReadClient
+
+  /**
+   * Return all pending compactions with instant time for clients to decide what to compact next.
+   * @return
+   */
+   public List<Pair<String, HoodieCompactionPlan>> getPendingCompactions();
+```
+API for scheduling compaction
+
+```
+
+          /**
+           * Schedules a new compaction instant
+           * @param extraMetadata
+           * @return Compaction Instant timestamp if a new compaction plan is scheduled
+           */
+           Optional<String> scheduleCompaction(Optional<Map<String, String>> extraMetadata) throws IOException;
+```
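+
+Putting these together, a compaction deployment would roughly chain the three calls quoted above. The fragment below is only a sketch: it assumes `writeClient` and `readClient` objects have already been constructed (client construction and Spark setup are omitted), and the `getLeft()` accessor on the returned pair is an assumption about the `Pair` type in use.
+
+```
+  // Writer side: schedule a new compaction plan (empty result if nothing was scheduled)
+  Optional<String> scheduledInstant = writeClient.scheduleCompaction(Optional.empty());
+
+  // Compactor side: discover pending plans and execute them one by one
+  for (Pair<String, HoodieCompactionPlan> pending : readClient.getPendingCompactions()) {
+    String compactionInstantTime = pending.getLeft();   // assumed accessor for the instant time
+    JavaRDD<WriteStatus> statuses = writeClient.compact(compactionInstantTime);
+    // inspect the returned statuses before treating the compaction as successful
+  }
+```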
 
 Refer to  __hoodie-client/src/test/java/HoodieClientExample.java__ class for an example of how compaction
 is scheduled and executed.
@@ -172,65 +176,65 @@ plan to be run to figure out the number of file slices being compacted and choos
 
 ## Async Compaction Design Deep-Dive (Optional)
 
-For the purpose of this section, it is important to distinguish between 2 types of commits as pertaining to the file-group: 
+For the purpose of this section, it is important to distinguish between 2 types of commits as pertaining to the file-group:
 
 A commit which generates a merged and read-optimized file-slice is called `snapshot commit` (SC) with respect to that file-group.
-A commit which merely appended the new/updated records assigned to the file-group into a new log block is called `delta commit` (DC) 
+A commit which merely appends the new/updated records assigned to the file-group into a new log block is called `delta commit` (DC)
 with respect to that file-group.
 
 ### Algorithm
 
 The algorithm is described with an illustration. Let us assume a scenario where there are commits SC1, DC2, DC3 that have
-already completed on a data-set. Commit DC4 is currently ongoing with the writer (ingestion) process using it to upsert data. 
-Let us also imagine there are a set of file-groups (FG1 … FGn) in the data-set whose latest version (`File-Slice`) 
-contains the base file created by commit SC1 (snapshot-commit in columnar format) and a log file containing row-based 
-log blocks of 2 delta-commits (DC2 and DC3). 
+already completed on a data-set. Commit DC4 is currently ongoing with the writer (ingestion) process using it to upsert data.
+Let us also imagine there are a set of file-groups (FG1 … FGn) in the data-set whose latest version (`File-Slice`)
+contains the base file created by commit SC1 (snapshot-commit in columnar format) and a log file containing row-based
+log blocks of 2 delta-commits (DC2 and DC3).
 
 {% include image.html file="async_compac_1.png" alt="async_compac_1.png" max-width="1000" %}
 
- * Writer (Ingestion) that is going to commit "DC4" starts. The record updates in this batch are grouped by file-groups 
-   and appended in row formats to the corresponding log file as delta commit. Let us imagine a subset of file-groups has 
+ * Writer (Ingestion) that is going to commit "DC4" starts. The record updates in this batch are grouped by file-groups
+   and appended in row formats to the corresponding log file as delta commit. Let us imagine a subset of file-groups has
    this new log block (delta commit) DC4 added.
- * Before the writer job completes, it runs the compaction strategy to decide which file-group to compact by compactor 
-   and creates a new compaction-request commit SC5. This commit file is marked as “requested” with metadata denoting 
-   which fileIds to compact (based on selection policy). Writer completes without running compaction (will be run async). 
- 
+ * Before the writer job completes, it runs the compaction strategy to decide which file-groups should be compacted by the compactor
+   and creates a new compaction-request commit SC5. This commit file is marked as “requested” with metadata denoting
+   which fileIds to compact (based on selection policy). Writer completes without running compaction (will be run async).
+
    {% include image.html file="async_compac_2.png" alt="async_compac_2.png" max-width="1000" %}
- 
- * Writer job runs again ingesting next batch. It starts with commit DC6. It reads the earliest inflight compaction 
-   request marker commit in timeline order and collects the (fileId, Compaction Commit Id “CcId” ) pairs from meta-data. 
-   Ingestion DC6 ensures a new file-slice with base-commit “CcId” gets allocated for the file-group. 
-   The Writer will simply append records in row-format to the first log-file (as delta-commit) assuming the 
+
+ * Writer job runs again ingesting next batch. It starts with commit DC6. It reads the earliest inflight compaction
+   request marker commit in timeline order and collects the (fileId, Compaction Commit Id “CcId” ) pairs from meta-data.
+   Ingestion DC6 ensures a new file-slice with base-commit “CcId” gets allocated for the file-group.
+   The Writer will simply append records in row-format to the first log-file (as delta-commit) assuming the
    base-file (“Phantom-Base-File”) will be created eventually by the compactor.
-   
+
    {% include image.html file="async_compac_3.png" alt="async_compac_3.png" max-width="1000" %}
- 
- * Compactor runs at some time  and commits at “Tc” (concurrently or before/after Ingestion DC6). It reads the commit-timeline 
-   and finds the first unprocessed compaction request marker commit. Compactor reads the commit’s metadata finding the 
-   file-slices to be compacted. It compacts the file-slice and creates the missing base-file (“Phantom-Base-File”) 
-   with “CCId” as the commit-timestamp. Compactor then marks the compaction commit timestamp as completed. 
-   It is important to realize that at data-set level, there could be different file-groups requesting compaction at 
+
+ * Compactor runs at some time  and commits at “Tc” (concurrently or before/after Ingestion DC6). It reads the commit-timeline
+   and finds the first unprocessed compaction request marker commit. Compactor reads the commit’s metadata finding the
+   file-slices to be compacted. It compacts the file-slice and creates the missing base-file (“Phantom-Base-File”)
+   with “CCId” as the commit-timestamp. Compactor then marks the compaction commit timestamp as completed.
+   It is important to realize that at data-set level, there could be different file-groups requesting compaction at
    different commit timestamps.
- 
+
     {% include image.html file="async_compac_4.png" alt="async_compac_4.png" max-width="1000" %}
 
- * Near Real-time reader interested in getting the latest snapshot will have 2 cases. Let us assume that the 
+ * A Near Real-time reader interested in getting the latest snapshot will have 2 cases. Let us assume that the
    incremental ingestion (writer at DC6) happened before the compaction (some time “Tc”’).  
-   The below description is with regards to compaction from file-group perspective. 
-   * `Reader querying at time between ingestion completion time for DC6 and compaction finish “Tc”`: 
-     Hoodie’s implementation will be changed to become aware of file-groups currently waiting for compaction and 
-     merge log-files corresponding to DC2-DC6 with the base-file corresponding to SC1. In essence, Hudi will create 
-     a pseudo file-slice by combining the 2 file-slices starting at base-commits SC1 and SC5 to one. 
-     For file-groups not waiting for compaction, the reader behavior is essentially the same - read latest file-slice 
+   The below description is with regard to compaction from a file-group perspective.
+   * `Reader querying at time between ingestion completion time for DC6 and compaction finish “Tc”`:
+     Hoodie’s implementation will be changed to become aware of file-groups currently waiting for compaction and
+     merge log-files corresponding to DC2-DC6 with the base-file corresponding to SC1. In essence, Hudi will create
+     a pseudo file-slice by combining the 2 file-slices starting at base-commits SC1 and SC5 to one.
+     For file-groups not waiting for compaction, the reader behavior is essentially the same - read latest file-slice
      and merge on the fly.
-   * `Reader querying at time after compaction finished (> “Tc”)` : In this case, reader will not find any pending 
-     compactions in the timeline and will simply have the current behavior of reading the latest file-slice and 
+   * `Reader querying at time after compaction finished (> “Tc”)` : In this case, reader will not find any pending
+     compactions in the timeline and will simply have the current behavior of reading the latest file-slice and
      merging on-the-fly.
-     
- * Read-Optimized View readers will query against the latest columnar base-file for each file-groups. 
+
+ * Read-Optimized View readers will query against the latest columnar base-file for each file-group.
 
 The above algorithm explains Async compaction w.r.t a single compaction run on a single file-group. It is important
-to note that multiple compaction plans can be run concurrently as they are essentially operating on different 
+to note that multiple compaction plans can be run concurrently as they are essentially operating on different
 file-groups.
 
 ## Performance
@@ -272,4 +276,3 @@ with no impact on queries. Following charts compare the Hudi vs non-Hudi dataset
 **Presto**
 
 {% include image.html file="hoodie_query_perf_presto.png" alt="hoodie_query_perf_presto.png" max-width="1000" %}
-
diff --git a/docs/index.md b/docs/index.md
index b5b9da7..ad87933 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -4,12 +4,9 @@ keywords: homepage
 tags: [getting_started]
 sidebar: mydoc_sidebar
 permalink: index.html
-summary: "Hudi lowers data latency across the board, while simultaneously achieving orders of magnitude of efficiency over traditional batch processing."
+summary: "Hudi brings stream processing to big data, providing fresh data while being an order of magnitude more efficient than traditional batch processing."
 ---
 
-
-
-
 Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical datasets on [HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) or cloud stores and provides three logical views for query access.
 
  * **Read Optimized View** - Provides excellent query performance on pure columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
@@ -21,4 +18,4 @@ Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical dat
 
 By carefully managing how data is laid out in storage & how it’s exposed to queries, Hudi is able to power a rich data ecosystem where external sources can be ingested in near real-time and made available for interactive SQL Engines like [Presto](https://prestodb.io) & [Spark](https://spark.apache.org/sql/), while at the same time capable of being consumed incrementally from processing/ETL frameworks like [Hive](https://hive.apache.org/) & [Spark](https://spark.apache.org/docs/latest/) t [...]
 
-Hudi broadly consists of a self contained Spark library to build datasets and integrations with existing query engines for data access.
+Hudi broadly consists of a self-contained Spark library to build datasets, and integrations with existing query engines for data access. See [quickstart](quickstart.html) for a demo.
diff --git a/docs/migration_guide.md b/docs/migration_guide.md
index a5d5506..13c27ac 100644
--- a/docs/migration_guide.md
+++ b/docs/migration_guide.md
@@ -4,9 +4,8 @@ keywords: migration guide
 sidebar: mydoc_sidebar
 permalink: migration_guide.html
 toc: false
-summary: In this page, we will discuss some available tools for migrating your existing dataset into a Hudi managed 
-dataset
-
+summary: In this page, we will discuss some available tools for migrating your existing dataset into a Hudi dataset
+---
 
 Hudi maintains metadata such as commit timeline and indexes to manage a dataset. The commit timeline helps to understand the actions happening on a dataset as well as the current state of a dataset. Indexes are used by Hudi to maintain a record key to file id mapping to efficiently locate a record. At the moment, Hudi supports writing only the parquet columnar format.
 To be able to start using Hudi for your existing dataset, you will need to migrate your existing dataset into a Hudi managed dataset. There are a couple of ways to achieve this.
@@ -15,57 +14,60 @@ To be able to start using Hudi for your existing dataset, you will need to migra
 ## Approaches
 
 
-### Approach 1
+#### Use Hudi for new partitions alone
 
-Hudi can be used to manage an existing dataset without affecting/altering the historical data already present in the 
-dataset. Hudi has been implemented to be compatible with such a mixed dataset with a caveat that either the complete 
-Hive partition is Hudi managed or not. Thus the lowest granularity at which Hudi manages a dataset is a Hive 
-partition. Start using the datasource API or the WriteClient to write to the dataset and make sure you start writing 
+Hudi can be used to manage an existing dataset without affecting/altering the historical data already present in the
+dataset. Hudi has been implemented to be compatible with such a mixed dataset, with the caveat that a given Hive
+partition is either entirely Hudi managed or not at all. Thus the lowest granularity at which Hudi manages a dataset
+is a Hive partition. Start using the datasource API or the WriteClient to write to the dataset, and make sure you start writing
 to a new partition or convert your last N partitions into Hudi instead of the entire table. Note, since the historical
- partitions are not managed by HUDI, none of the primitives provided by HUDI work on the data in those partitions. More concretely, one cannot perform upserts or incremental pull on such older partitions not managed by the HUDI dataset. 
+ partitions are not managed by Hudi, none of the primitives provided by Hudi work on the data in those partitions. More concretely, one cannot perform upserts or incremental pull on such older, non-Hudi-managed partitions.
 Take this approach if your dataset is an append only type of dataset and you do not expect to perform any updates to existing (or non Hudi managed) partitions.
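+
+As an illustration, writing a new partition through the Hudi datasource could look roughly like the snippet below.
+The paths, table name and column names are placeholders; the option keys are the documented datasource write options,
+but do verify them against the Hudi version you are running:
+
+```
+// Illustrative only: bring a single new partition under Hudi management,
+// leaving older (non-Hudi) partitions of the table untouched.
+val newPartitionDF = spark.read.parquet("/data/events/date=2019-02-25")   // hypothetical source partition
+
+newPartitionDF.write
+  .format("com.uber.hoodie")
+  .option("hoodie.table.name", "events")                                  // placeholder table name
+  .option("hoodie.datasource.write.recordkey.field", "event_id")          // assumed record key column
+  .option("hoodie.datasource.write.partitionpath.field", "date")          // assumed partition column
+  .option("hoodie.datasource.write.precombine.field", "ts")               // assumed ordering column
+  .mode("append")
+  .save("/data/events_hudi")                                              // Hudi base path
+```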
 
 
-### Approach 2
+#### Convert existing dataset to Hudi
 
 Import your existing dataset into a Hudi managed dataset. Since all the data is Hudi managed, none of the limitations
- of Approach 1 apply here. Updates spanning any partitions can be applied to this dataset and Hudi will efficiently 
- make the update available to queries. Note that not only do you get to use all Hoodie primitives on this dataset, 
+ of the previous approach apply here. Updates spanning any partitions can be applied to this dataset and Hudi will efficiently
+ make the updates available to queries. Note that not only do you get to use all Hudi primitives on this dataset,
  there are other additional advantages of doing this. Hudi automatically manages file sizes of a Hudi managed dataset
- . You can define the desired file size when converting this dataset and Hudi will ensure it writes out files 
- adhering to the config. It will also ensure that smaller files later get corrected by routing some new inserts into 
+ . You can define the desired file size when converting this dataset and Hudi will ensure it writes out files
+ adhering to the config. It will also ensure that smaller files later get corrected by routing some new inserts into
  small files rather than writing new small ones thus maintaining the health of your cluster.
 
 There are a few options when choosing this approach.
+
 #### Option 1
-Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing dataset is in 
-parquet file 
-format. This tool essentially starts a Spark Job to read the existing parquet dataset and converts it into a HUDI managed dataset by re-writing all the data. 
-#### Option 2 
+Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing dataset is in the
+parquet file format. This tool essentially starts a Spark job to read the existing parquet dataset and converts it
+into a Hudi managed dataset by re-writing all the data.
+
+#### Option 2
 For huge datasets, this could be as simple as : for partition in [list of partitions in source dataset] {
         val inputDF = spark.read.format("any_input_format").load("partition_path")
         inputDF.write.format("com.uber.hoodie").option()....save("basePath")
         }      
+
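+A slightly fuller sketch of that loop, using placeholder partition paths and the column names from the importer
+example further below, might look like:
+
+```
+// Illustrative only: bulk-load each source partition into the Hudi managed dataset.
+val partitions = Seq("2018/08/30", "2018/08/31")                           // hypothetical partition paths
+
+for (partition <- partitions) {
+  val inputDF = spark.read.format("parquet").load(s"/user/parquet/dataset/basepath/$partition")
+  inputDF.write
+    .format("com.uber.hoodie")
+    .option("hoodie.table.name", "hoodie_table")
+    .option("hoodie.datasource.write.recordkey.field", "_row_key")         // record key column
+    .option("hoodie.datasource.write.partitionpath.field", "partitionStr") // partition path column
+    .mode("append")
+    .save("/user/hoodie/dataset/basepath")
+}
+```
+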
 #### Option 3
 Write your own custom logic of how to load an existing dataset into a Hudi managed one. Please read about the RDD API
- [here](quickstart.md).
+ [here](quickstart.html).
 
 ```
-Using the HDFSParquetImporter Tool. Once hoodie has been built via `mvn clean install -DskipTests`, the shell can be 
+Using the HDFSParquetImporter Tool. Once hoodie has been built via `mvn clean install -DskipTests`, the shell can be
 fired via `cd hoodie-cli && ./hoodie-cli.sh`.
 
-hoodie->hdfsparquetimport 
-        --upsert false 
-        --srcPath /user/parquet/dataset/basepath 
-        --targetPath 
-        /user/hoodie/dataset/basepath 
-        --tableName hoodie_table 
-        --tableType COPY_ON_WRITE 
-        --rowKeyField _row_key 
-        --partitionPathField partitionStr 
-        --parallelism 1500 
-        --schemaFilePath /user/table/schema 
-        --format parquet 
-        --sparkMemory 6g 
+hoodie->hdfsparquetimport
+        --upsert false
+        --srcPath /user/parquet/dataset/basepath
+        --targetPath /user/hoodie/dataset/basepath
+        --tableName hoodie_table
+        --tableType COPY_ON_WRITE
+        --rowKeyField _row_key
+        --partitionPathField partitionStr
+        --parallelism 1500
+        --schemaFilePath /user/table/schema
+        --format parquet
+        --sparkMemory 6g
         --retry 2
-```
\ No newline at end of file
+```
diff --git a/docs/quickstart.md b/docs/quickstart.md
index f1516ae..1e6fa49 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -13,13 +13,14 @@ permalink: quickstart.html
 Check out code and pull it into Intellij as a normal maven project.
 
 Normally build the maven project, from command line
+
 ```
 $ mvn clean install -DskipTests -DskipITs
+```
 
 To work with older version of Hive (pre Hive-1.2.1), use
-
+```
 $ mvn clean install -DskipTests -DskipITs -Dhive11
-
 ```
 
 {% include callout.html content="You might want to add your spark jars folder to project dependencies under 'Module Settings', to be able to run Spark from IDE" type="info" %}
@@ -31,13 +32,13 @@ $ mvn clean install -DskipTests -DskipITs -Dhive11
 
 Hudi requires Java 8 to be installed. Hudi works with Spark-2.x versions. We have verified that Hudi works with the following combination of Hadoop/Hive/Spark.
 
-| Hadoop | Hive  | Spark | Instructions to Build Hudi | 
+| Hadoop | Hive  | Spark | Instructions to Build Hudi |
 | ---- | ----- | ---- | ---- |
 | 2.6.0-cdh5.7.2 | 1.1.0-cdh5.7.2 | spark-2.[1-3].x | Use “mvn clean install -DskipTests -Dhadoop.version=2.6.0-cdh5.7.2 -Dhive.version=1.1.0-cdh5.7.2” |
 | Apache hadoop-2.8.4 | Apache hive-2.3.3 | spark-2.[1-3].x | Use "mvn clean install -DskipTests" |
 | Apache hadoop-2.7.3 | Apache hive-1.2.1 | spark-2.[1-3].x | Use "mvn clean install -DskipTests" |
 
-If your environment has other versions of hadoop/hive/spark, please try out Hudi and let us know if there are any issues. We are limited by our bandwidth to certify other combinations. 
+If your environment has other versions of hadoop/hive/spark, please try out Hudi and let us know if there are any issues. We are limited by our bandwidth to certify other combinations.
 It would be of great help if you can reach out to us with your setup and experience with hoodie.
 
 ## Generate a Hudi Dataset
@@ -60,7 +61,7 @@ export PATH=$JAVA_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$SPARK_INSTALL/bin:$P
 
 ### Supported API's
 
-Use the DataSource API to quickly start reading or writing Hudi datasets in few lines of code. Ideal for most 
+Use the DataSource API to quickly start reading or writing Hudi datasets in a few lines of code. Ideal for most
 ingestion use-cases.
 Use the RDD API to perform more involved actions on a Hudi dataset
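+
+For instance, a Hudi dataset written via the datasource can be read back from spark-shell roughly as below
+(the base path and partition-glob depth are placeholders taken from the demo later in this page):
+
+```
+// Illustrative only: snapshot-read a Hudi Copy-On-Write dataset via the DataSource API.
+val hudiDF = spark.read
+  .format("com.uber.hoodie")
+  .load("/user/hive/warehouse/stock_ticks_cow/*/*/*")   // glob over the partition folders
+hudiDF.createOrReplaceTempView("stock_ticks_cow")
+spark.sql("select symbol, max(ts) from stock_ticks_cow group by symbol").show()
+```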
 
@@ -132,11 +133,11 @@ This can be run as frequently as the ingestion pipeline to make sure new partiti
 cd hoodie-hive
 ./run_sync_tool.sh
   --user hive
-  --pass hive 
-  --database default 
-  --jdbc-url "jdbc:hive2://localhost:10010/" 
-  --base-path tmp/hoodie/sample-table/ 
-  --table hoodie_test 
+  --pass hive
+  --database default
+  --jdbc-url "jdbc:hive2://localhost:10010/"
+  --base-path tmp/hoodie/sample-table/
+  --table hoodie_test
   --partitioned-by field1,field2
 
 ```
@@ -304,7 +305,7 @@ hive>
 ## A Demo using docker containers
 
 Lets use a real world example to see how hudi works end to end. For this purpose, a self contained
-data infrastructure is brought up in a local docker cluster within your computer. 
+data infrastructure is brought up in a local docker cluster within your computer.
 
 The steps assume you are using Mac laptop
 
@@ -313,7 +314,7 @@ The steps assume you are using Mac laptop
   * Docker Setup :  For Mac, Please follow the steps as defined in [https://docs.docker.com/v17.12/docker-for-mac/install/]. For running Spark-SQL queries, please ensure atleast 6 GB and 4 CPUs are allocated to Docker (See Docker -> Preferences -> Advanced). Otherwise, spark-SQL queries could be killed because of memory issues.
   * kafkacat : A command-line utility to publish/consume from kafka topics. Use `brew install kafkacat` to install kafkacat
   * /etc/hosts : The demo references many services running in container by the hostname. Add the following settings to /etc/hosts
-  
+
   ```
    127.0.0.1 adhoc-1
    127.0.0.1 adhoc-2
@@ -378,15 +379,15 @@ At this point, the docker cluster will be up and running. The demo cluster bring
    * HDFS Services (NameNode, DataNode)
    * Spark Master and Worker
    * Hive Services (Metastore, HiveServer2 along with PostgresDB)
-   * Kafka Broker and a Zookeeper Node (Kakfa will be used as upstream source for the demo) 
+   * Kafka Broker and a Zookeeper Node (Kafka will be used as the upstream source for the demo)
    * Adhoc containers to run Hudi/Hive CLI commands
 
 ### Demo
 
-Stock Tracker data will be used to showcase both different Hudi Views and the effects of Compaction. 
+Stock Tracker data will be used to showcase both different Hudi Views and the effects of Compaction.
 
-Take a look at the directory `docker/demo/data`. There are 2 batches of stock data - each at 1 minute granularity. 
-The first batch contains stocker tracker data for some stock symbols during the first hour of trading window 
+Take a look at the directory `docker/demo/data`. There are 2 batches of stock data - each at 1 minute granularity.
+The first batch contains stock tracker data for some stock symbols during the first hour of the trading window
 (9:30 a.m to 10:30 a.m). The second batch contains tracker data for next 30 mins (10:30 - 11 a.m). Hudi will
 be used to ingest these batches to a dataset which will contain the latest stock tracker data at hour level granularity.
 The batches are windowed intentionally so that the second batch contains updates to some of the rows in the first batch.
@@ -396,7 +397,7 @@ The batches are windowed intentionally so that the second batch contains updates
 Upload the first batch to Kafka topic 'stock ticks'
 
 ```
-cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P 
+cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P
 
 To check if the new topic shows up, use
 kafkacat -b kafkabroker -L -J | jq .
@@ -443,7 +444,7 @@ kafkacat -b kafkabroker -L -J | jq .
 
 Hudi comes with a tool named DeltaStreamer. This tool can connect to variety of data sources (including Kafka) to
 pull changes and apply to Hudi dataset using upsert/insert primitives. Here, we will use the tool to download
-json data from kafka topic and ingest to both COW and MOR tables we initialized in the previous step. This tool 
+json data from the Kafka topic and ingest it into both the COW and MOR tables we initialized in the previous step. This tool
 automatically initializes the datasets in the file-system if they do not exist yet.
 
 ```
@@ -468,8 +469,8 @@ spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
 exit
 ```
 
-You can use HDFS web-browser to look at the datasets 
-`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow`. 
+You can use the HDFS web UI to look at the datasets:
+`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow`.
 
 You can explore the new partition folder created in the dataset along with a "deltacommit"
 file under .hoodie which signals a successful commit.
@@ -501,7 +502,7 @@ docker exec -it adhoc-2 /bin/bash
 ....
 exit
 ```
-After executing the above command, you will notice 
+After executing the above command, you will notice
 
 1. A hive table named `stock_ticks_cow` created which provides Read-Optimized view for the Copy On Write dataset.
 2. Two new tables `stock_ticks_mor` and `stock_ticks_mor_rt` created for the Merge On Read dataset. The former
@@ -511,7 +512,7 @@ provides the ReadOptimized view for the Hudi dataset and the later provides the
 #### Step 4 (a): Run Hive Queries
 
 Run a hive query to find the latest timestamp ingested for stock symbol 'GOOG'. You will notice that both read-optimized
-(for both COW and MOR dataset)and realtime views (for MOR dataset)give the same value "10:29 a.m" as Hudi create a 
+(for both COW and MOR datasets) and realtime views (for MOR dataset) give the same value "10:29 a.m" as Hudi creates a
 parquet file for the first batch of data.
 
 ```
@@ -565,7 +566,7 @@ Now, run a projection query:
 # Merge-On-Read Queries:
 ==========================
 
-Lets run similar queries against M-O-R dataset. Lets look at both 
+Lets run similar queries against M-O-R dataset. Lets look at both
 ReadOptimized and Realtime views supported by M-O-R dataset
 
 # Run against ReadOptimized View. Notice that the latest timestamp is 10:29
@@ -670,7 +671,7 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close
 # Merge-On-Read Queries:
 ==========================
 
-Lets run similar queries against M-O-R dataset. Lets look at both 
+Lets run similar queries against M-O-R dataset. Lets look at both
 ReadOptimized and Realtime views supported by M-O-R dataset
 
 # Run against ReadOptimized View. Notice that the latest timestamp is 10:29
@@ -718,7 +719,7 @@ Upload the second batch of data and ingest this batch using delta-streamer. As t
 partitions, there is no need to run hive-sync
 
 ```
-cat docker/demo/data/batch_2.json | kafkacat -b kafkabroker -t stock_ticks -P 
+cat docker/demo/data/batch_2.json | kafkacat -b kafkabroker -t stock_ticks -P
 
 # Within Docker container, run the ingestion command
 docker exec -it adhoc-2 /bin/bash
@@ -734,15 +735,15 @@ exit
 With Copy-On-Write table, the second ingestion by DeltaStreamer resulted in a new version of Parquet file getting created.
 See `http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow/2018/08/31`
 
-With Merge-On-Read table, the second ingestion merely appended the batch to an unmerged delta (log) file. 
+With Merge-On-Read table, the second ingestion merely appended the batch to an unmerged delta (log) file.
 Take a look at the HDFS filesystem to get an idea: `http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_mor/2018/08/31`
 
 #### Step 6(a): Run Hive Queries
 
-With Copy-On-Write table, the read-optimized view immediately sees the changes as part of second batch once the batch 
-got committed as each ingestion creates newer versions of parquet files. 
+With Copy-On-Write table, the read-optimized view immediately sees the changes as part of the second batch once the batch
+got committed, as each ingestion creates newer versions of parquet files.
 
-With Merge-On-Read table, the second ingestion merely appended the batch to an unmerged delta (log) file. 
+With Merge-On-Read table, the second ingestion merely appended the batch to an unmerged delta (log) file.
 This is the time, when ReadOptimized and Realtime views will provide different results. ReadOptimized view will still
 return "10:29 am" as it will only read from the Parquet file. Realtime View will do on-the-fly merge and return
 latest committed data which is "10:59 a.m".
@@ -773,7 +774,7 @@ WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the futu
 As you can notice, the above queries now reflect the changes that came as part of ingesting second batch.
 
 
-# Merge On Read Table: 
+# Merge On Read Table:
 
 # Read Optimized View
 0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';
@@ -843,7 +844,7 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close
 As you can notice, the above queries now reflect the changes that came as part of ingesting second batch.
 
 
-# Merge On Read Table: 
+# Merge On Read Table:
 
 # Read Optimized View
 scala> spark.sql("select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG'").show(100, false)
@@ -909,8 +910,8 @@ To show the effects of incremental-query, let us assume that a reader has alread
 ingesting first batch. Now, for the reader to see effect of the second batch, he/she has to keep the start timestamp to
 the commit time of the first batch (20180924064621) and run incremental query
 
-`Hudi incremental mode` provides efficient scanning for incremental queries by filtering out files that do not have any 
-candidate rows using hudi-managed metadata. 
+`Hudi incremental mode` provides efficient scanning for incremental queries by filtering out files that do not have any
+candidate rows using hudi-managed metadata.
 
 ```
 docker exec -it adhoc-2 /bin/bash
@@ -1008,7 +1009,7 @@ hoodie:stock_ticks_mor->compactions show all
     ___________________________________________________________________
     | Compaction Instant Time| State    | Total FileIds to be Compacted|
     |==================================================================|
-    
+
 # Schedule a compaction. This will use Spark Launcher to schedule compaction
 hoodie:stock_ticks_mor->compaction schedule
 ....
@@ -1028,7 +1029,7 @@ hoodie:stock_ticks_mor->compactions show all
     ___________________________________________________________________
     | Compaction Instant Time| State    | Total FileIds to be Compacted|
     |==================================================================|
-    | 20180924070031         | REQUESTED| 1                            | 
+    | 20180924070031         | REQUESTED| 1                            |
 
 # Execute the compaction. The compaction instant value passed below must be the one displayed in the above "compactions show all" query
 hoodie:stock_ticks_mor->compaction run --compactionInstant  20180924070031 --parallelism 2 --sparkMemory 1G  --schemaFilePath /var/demo/config/schema.avsc --retry 1  
@@ -1052,7 +1053,7 @@ hoodie:stock_ticks->compactions show all
     |==================================================================|
     | 20180924070031         | COMPLETED| 1                            |
 
-``` 
+```
 
 #### Step 9: Run Hive Queries including incremental queries
 
@@ -1169,9 +1170,9 @@ You can bring up a hadoop docker environment containing Hadoop, Hive and Spark s
 ```
 $ mvn pre-integration-test -DskipTests
 ```
-The above command builds docker images for all the services with 
-current Hudi source installed at /var/hoodie/ws and also brings up the services using a compose file. We 
-currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.3.1) in docker images. 
+The above command builds docker images for all the services with
+current Hudi source installed at /var/hoodie/ws and also brings up the services using a compose file. We
+currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.3.1) in docker images.
 
 To bring down the containers
 ```
@@ -1185,9 +1186,9 @@ $ cd hoodie-integ-test
 $  mvn docker-compose:up -DdetachedMode=true
 ```
 
-Hudi is a library that is operated in a broader data analytics/ingestion environment 
+Hudi is a library that is operated in a broader data analytics/ingestion environment
 involving Hadoop, Hive and Spark. Interoperability with all these systems is a key objective for us. We are
-actively adding integration-tests under __hoodie-integ-test/src/test/java__ that makes use of this 
+actively adding integration-tests under __hoodie-integ-test/src/test/java__ that make use of this
 docker environment (See __hoodie-integ-test/src/test/java/com/uber/hoodie/integ/ITTestHoodieSanity.java__ )
 
 
@@ -1202,10 +1203,10 @@ and compose scripts are carefully implemented so that they serve dual-purpose
    inbuilt jars by mounting local HUDI workspace over the docker location
 
 This helps avoid maintaining separate docker images and avoids the costly step of building HUDI docker images locally.
-But if users want to test hudi from locations with lower network bandwidth, they can still build local images 
-run the script 
+But if users want to test Hudi from locations with lower network bandwidth, they can still build local images
+by running the script
 `docker/build_local_docker_images.sh` to build local docker images before running `docker/setup_demo.sh`
- 
+
 Here are the commands:
 
 ```
diff --git a/docs/roadmap.md b/docs/roadmap.md
deleted file mode 100644
index c65c3a9..0000000
--- a/docs/roadmap.md
+++ /dev/null
@@ -1,14 +0,0 @@
----
-title: Roadmap
-keywords: usecases
-sidebar: mydoc_sidebar
-permalink: roadmap.html
----
-
-## Planned Features
-
-* Support for Self Joins - As of now, you cannot incrementally consume the same table more than once, since the InputFormat does not understand the QueryPlan.
-* Hudi Spark Datasource -  Allows for reading and writing data back using Apache Spark natively (without falling back to InputFormat), which can be more performant
-* Hudi Presto Connector - Allows for querying data managed by Hudi using Presto natively, which can again boost [performance](https://prestodb.io/docs/current/release/release-0.138.html)
-
-
diff --git a/docs/sql_queries.md b/docs/sql_queries.md
index 955e794..44848eb 100644
--- a/docs/sql_queries.md
+++ b/docs/sql_queries.md
@@ -62,7 +62,4 @@ spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.clas
 
 ## Presto
 
-Presto requires a [patch](https://github.com/prestodb/presto/pull/7002) (until the PR is merged) and the hoodie-hadoop-mr-bundle jar to be placed
-into `<presto_install>/plugin/hive-hadoop2/`.
-
-{% include callout.html content="Get involved to improve this integration [here](https://github.com/uber/hoodie/issues/81)" type="info" %}
+Presto requires the `hoodie-presto-bundle` jar to be placed into `<presto_install>/plugin/hive-hadoop2/`, on every node of the installation.