Posted to commits@hudi.apache.org by vi...@apache.org on 2019/03/13 22:41:19 UTC

[incubator-hudi-site] 16/19: Bunch of cleanups based on recent issues/feedback

This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi-site.git

commit 5507c617b96f6839ba37e20586391bd1e1124a31
Author: Vinoth Chandar <vi...@uber.com>
AuthorDate: Fri Mar 8 18:12:55 2019 -0800

    Bunch of cleanups based on recent issues/feedback
    
     - Deltastreamer/quickstart mistake
     - Clarify docker/windows support
     - Replace confusing references to HDFS with DFS
     - Remove all lingering use of "hoodie"
     - Links to picking up newbie tickets
     - Instructions on updating docs with code changes
---
 docs/_posts/2016-12-30-strata-talk-2017.md         |   2 +-
 docs/admin_guide.md                                |  44 ++++++++++-----------
 docs/community.md                                  |   2 +-
 docs/comparison.md                                 |   8 ++--
 docs/concepts.md                                   |   2 +-
 docs/configurations.md                             |  30 +++++++-------
 docs/contributing.md                               |   2 +
 docs/gcs_filesystem.md                             |   2 +-
 ...ommit_duration.png => hudi_commit_duration.png} | Bin
 .../{hoodie_intro_1.png => hudi_intro_1.png}       | Bin
 ...ie_log_format_v2.png => hudi_log_format_v2.png} | Bin
 ...uery_perf_hive.png => hudi_query_perf_hive.png} | Bin
 ..._perf_presto.png => hudi_query_perf_presto.png} | Bin
 ...ry_perf_spark.png => hudi_query_perf_spark.png} | Bin
 .../{hoodie_upsert_dag.png => hudi_upsert_dag.png} | Bin
 ...odie_upsert_perf1.png => hudi_upsert_perf1.png} | Bin
 ...odie_upsert_perf2.png => hudi_upsert_perf2.png} | Bin
 docs/implementation.md                             |  16 ++++----
 docs/incremental_processing.md                     |  22 +++++------
 docs/index.md                                      |   4 +-
 docs/migration_guide.md                            |   9 ++---
 docs/quickstart.md                                 |  20 +++++-----
 docs/s3_filesystem.md                              |   2 +-
 docs/sql_queries.md                                |   2 -
 docs/use_cases.md                                  |  12 +++---
 25 files changed, 89 insertions(+), 90 deletions(-)

diff --git a/docs/_posts/2016-12-30-strata-talk-2017.md b/docs/_posts/2016-12-30-strata-talk-2017.md
index 88b923e..39cfb65 100644
--- a/docs/_posts/2016-12-30-strata-talk-2017.md
+++ b/docs/_posts/2016-12-30-strata-talk-2017.md
@@ -5,7 +5,7 @@ permalink: strata-talk.html
 tags: [news]
 ---
 
-We will be presenting Hoodie & general concepts around how incremental processing works at Uber.
+We will be presenting Hudi & general concepts around how incremental processing works at Uber.
 Catch our talk **"Incremental Processing on Hadoop At Uber"**
 
 {% include links.html %}
diff --git a/docs/admin_guide.md b/docs/admin_guide.md
index 7757d04..58dbd15 100644
--- a/docs/admin_guide.md
+++ b/docs/admin_guide.md
@@ -17,11 +17,11 @@ This section provides a glimpse into each of these, with some general guidance o
 
 ## Admin CLI
 
-Once hoodie has been built via `mvn clean install -DskipTests`, the shell can be fired by via  `cd hoodie-cli && ./hoodie-cli.sh`.
-A hoodie dataset resides on HDFS, in a location referred to as the **basePath** and we would need this location in order to connect to a Hoodie dataset.
-Hoodie library effectively manages this HDFS dataset internally, using .hoodie subfolder to track all metadata
+Once Hudi has been built, the shell can be fired up via `cd hoodie-cli && ./hoodie-cli.sh`.
+A Hudi dataset resides on DFS, in a location referred to as the **basePath**, and we need this location in order to connect to it.
+The Hudi library manages this dataset internally, using the .hoodie subfolder to track all metadata
 
-To initialize a hoodie table, use the following command.
+To initialize a Hudi table, use the following command.
 
 ```
 18/09/06 15:56:52 INFO annotation.AutowiredAnnotationBeanPostProcessor: JSR-330 'javax.inject.Inject' annotation found and supported for autowiring
@@ -42,7 +42,7 @@ hoodie->create --path /user/hive/warehouse/table1 --tableName hoodie_table_1 --t
 18/09/06 15:57:15 INFO table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from ...
 ```
 
-To see the description of hoodie table, use the command:
+To see the description of a Hudi table, use the command:
 
 ```
 
@@ -60,7 +60,7 @@ hoodie:hoodie_table_1->desc
 
 ```
 
-Following is a sample command to connect to a Hoodie dataset contains uber trips.
+Following is a sample command to connect to a Hudi dataset containing Uber trips.
 
 ```
 hoodie:trips->connect --path /app/uber/trips
@@ -111,7 +111,7 @@ hoodie:trips->
 
 #### Inspecting Commits
 
-The task of upserting or inserting a batch of incoming records is known as a **commit** in Hoodie. A commit provides basic atomicity guarantees such that only commited data is available for querying.
+The task of upserting or inserting a batch of incoming records is known as a **commit** in Hudi. A commit provides basic atomicity guarantees such that only committed data is available for querying.
 Each commit has a monotonically increasing string/number called the **commit number**. Typically, this is the time at which we started the commit.
 
 To view some basic information about the last 10 commits,
@@ -129,7 +129,7 @@ hoodie:trips->
 
 ```
 
-At the start of each write, Hoodie also writes a .inflight commit to the .hoodie folder. You can use the timestamp there to estimate how long the commit has been inflight
+At the start of each write, Hudi also writes a .inflight commit to the .hoodie folder. You can use the timestamp there to estimate how long the commit has been inflight.
 
 
 ```
@@ -193,7 +193,7 @@ order (See Concepts). The below commands allow users to view the file-slices for
 
 #### Statistics
 
-Since Hoodie directly manages file sizes for HDFS dataset, it might be good to get an overall picture
+Since Hudi directly manages file sizes for datasets on DFS, it might be good to get an overall picture.
 
 
 ```
@@ -206,7 +206,7 @@ hoodie:trips->stats filesizes --partitionPath 2016/09/01 --sortBy "95th" --desc
     ....
 ```
 
-In case of Hoodie write taking much longer, it might be good to see the write amplification for any sudden increases
+In case a Hudi write is taking much longer than usual, it might be good to check the write amplification for any sudden increases.
 
 
 ```
@@ -221,7 +221,7 @@ hoodie:trips->stats wa
 
 #### Archived Commits
 
-In order to limit the amount of growth of .commit files on HDFS, Hoodie archives older .commit files (with due respect to the cleaner policy) into a commits.archived file.
+In order to limit the growth of .commit files on DFS, Hudi archives older .commit files (with due respect to the cleaner policy) into a commits.archived file.
 This is a sequence file that contains a mapping from commitNumber => json with raw information about the commit (the same information that is nicely rolled up above).
 
 
@@ -369,7 +369,7 @@ No File renames needed to unschedule pending compaction. Operation successful.
 
 ##### Repair Compaction
 
-The above compaction unscheduling operations could sometimes fail partially (e:g -> HDFS temporarily unavailable). With
+The above compaction unscheduling operations could sometimes fail partially (e.g. DFS temporarily unavailable). With
 partial failures, the compaction operation could become inconsistent with the state of file-slices. When you run
 `compaction validate`, you can spot any invalid compaction operations. In these cases, the repair
 command comes to the rescue; it will rearrange the file-slices so that there is no loss and the file-slices are
@@ -387,7 +387,7 @@ Compaction successfully repaired
 
 ## Metrics
 
-Once the Hoodie Client is configured with the right datasetname and environment for metrics, it produces the following graphite metrics, that aid in debugging hoodie datasets
+Once the Hudi client is configured with the right dataset name and environment for metrics, it produces the following Graphite metrics that aid in debugging Hudi datasets:
 
 - **Commit Duration** - This is the amount of time it took to successfully commit a batch of records
 - **Rollback Duration** - Similarly, the amount of time taken to undo partial data left over by a failed commit (happens automatically every time after a failing write)
@@ -397,29 +397,29 @@ Once the Hoodie Client is configured with the right datasetname and environment
 
 These metrics can then be plotted on a standard tool like Grafana. Below is a sample commit duration chart.
 
-{% include image.html file="hoodie_commit_duration.png" alt="hoodie_commit_duration.png" max-width="1000" %}
+{% include image.html file="hudi_commit_duration.png" alt="hudi_commit_duration.png" max-width="1000" %}
 
 
 ## Troubleshooting Failures
 
-Section below generally aids in debugging Hoodie failures. Off the bat, the following metadata is added to every record to help triage  issues easily using standard Hadoop SQL engines (Hive/Presto/Spark)
+The section below generally aids in debugging Hudi failures. Off the bat, the following metadata is added to every record to help triage issues easily using standard Hadoop SQL engines (Hive/Presto/Spark):
 
- - **_hoodie_record_key** - Treated as a primary key within each HDFS partition, basis of all updates/inserts
+ - **_hoodie_record_key** - Treated as a primary key within each DFS partition, basis of all updates/inserts
  - **_hoodie_commit_time** - Last commit that touched this record
  - **_hoodie_file_name** - Actual file name containing the record (super useful to triage duplicates)
  - **_hoodie_partition_path** - Path from basePath that identifies the partition containing this record
 
-{% include callout.html content="Note that as of now, Hoodie assumes the application passes in the same deterministic partitionpath for a given recordKey. i.e the uniqueness of record key is only enforced within each partition" type="warning" %}
+{% include callout.html content="Note that as of now, Hudi assumes the application passes in the same deterministic partitionpath for a given recordKey. i.e. the uniqueness of the record key is only enforced within each partition" type="warning" %}
 
 
 #### Missing records
 
 Please check if there were any write errors using the admin commands above, during the window at which the record could have been written.
-If you do find errors, then the record was not actually written by Hoodie, but handed back to the application to decide what to do with it.
+If you do find errors, then the record was not actually written by Hudi, but handed back to the application to decide what to do with it.
 
 #### Duplicates
 
-First of all, please confirm if you do indeed have duplicates **AFTER** ensuring the query is accessing the Hoodie datasets [properly](sql_queries.html) .
+First of all, please confirm that you do indeed have duplicates **AFTER** ensuring the query is accessing the Hudi dataset [properly](sql_queries.html).
 
 - If confirmed, please use the metadata fields above to identify the physical files & partition paths containing the records.
 - If duplicates span files across partition paths, then this means your application is generating different partitionPaths for the same recordKey; please fix your app
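For illustration, one quick way to surface such duplicates is to group on the metadata fields above. The snippet below is a hedged sketch to run from spark-shell; it assumes the Hudi dataset is registered in Hive as a table named `trips` (substitute your own table name).

```
// Hedged sketch: find record keys that appear more than once, and the
// files/partitions they land in. Assumes a Hive-registered table named `trips`.
val dupes = spark.sql("""
  SELECT _hoodie_record_key,
         collect_set(_hoodie_partition_path) AS partition_paths,
         collect_set(_hoodie_file_name)      AS file_names,
         COUNT(*)                            AS occurrences
  FROM trips
  GROUP BY _hoodie_record_key
  HAVING COUNT(*) > 1
""")

// Keys with more than one partition_path indicate the application is
// generating different partitionPaths for the same recordKey.
dupes.show(20, false)
```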
@@ -427,11 +427,11 @@ First of all, please confirm if you do indeed have duplicates **AFTER** ensuring
 
 #### Spark failures
 
-Typical upsert() DAG looks like below. Note that Hoodie client also caches intermediate RDDs to intelligently profile workload and size files and spark parallelism.
+A typical upsert() DAG looks like below. Note that the Hudi client also caches intermediate RDDs to intelligently profile the workload and size files & Spark parallelism.
 Also, the Spark UI shows sortByKey twice due to the probe job also being shown; nonetheless it's just a single sort.
 
 
-{% include image.html file="hoodie_upsert_dag.png" alt="hoodie_upsert_dag.png" max-width="1000" %}
+{% include image.html file="hudi_upsert_dag.png" alt="hudi_upsert_dag.png" max-width="1000" %}
 
 
 At a high level, there are two steps
@@ -448,5 +448,5 @@ At a high level, there are two steps
  - Job 6 : Lazy join of incoming records against recordKey, location to provide a final set of HoodieRecord which now contain the information about which file/partitionpath they are found at (or null if insert). Then also profile the workload again to determine sizing of files
  - Job 7 : Actual writing of data (update + insert + insert turned to updates to maintain file size)
 
-Depending on the exception source (Hoodie/Spark), the above knowledge of the DAG can be used to pinpoint the actual issue. The most often encountered failures result from YARN/HDFS temporary failures.
+Depending on the exception source (Hudi/Spark), the above knowledge of the DAG can be used to pinpoint the actual issue. The most often encountered failures result from YARN/DFS temporary failures.
 In the future, a more sophisticated debug/management UI will be added to the project, which can help automate some of this debugging.
diff --git a/docs/community.md b/docs/community.md
index 5708f3d..dd6d8ea 100644
--- a/docs/community.md
+++ b/docs/community.md
@@ -33,7 +33,7 @@ Here are few ways, you can get involved.
  - Author blogs on our wiki
  - Testing; Improving out-of-box experience by reporting bugs
  - Share new ideas/directions to pursue or propose a new HIP
- - Contributing code to the project
+ - Contributing code to the project ([newbie JIRAs](https://issues.apache.org/jira/issues/?jql=project+%3D+HUDI+AND+component+%3D+newbie))
 
 #### Code Contributions
 
diff --git a/docs/comparison.md b/docs/comparison.md
index 0862c26..a606c94 100644
--- a/docs/comparison.md
+++ b/docs/comparison.md
@@ -6,7 +6,7 @@ permalink: comparison.html
 toc: false
 ---
 
-Apache Hudi fills a big void for processing data on top of HDFS, and thus mostly co-exists nicely with these technologies. However,
+Apache Hudi fills a big void for processing data on top of DFS, and thus mostly co-exists nicely with these technologies. However,
 it would be useful to understand how Hudi fits into the current big data ecosystem, contrasting it with a few related systems
 and bringing out the different tradeoffs these systems have accepted in their design.
 
@@ -47,12 +47,12 @@ just for analytics. Finally, HBase does not support incremental processing primi
 A popular question we get is: "How does Hudi relate to stream processing systems?", which we will try to answer here. Simply put, Hudi can integrate with
 batch (`copy-on-write storage`) and streaming (`merge-on-read storage`) jobs of today, to store the computed results in Hadoop. For Spark apps, this can happen via direct
 integration of the Hudi library with Spark/Spark streaming DAGs. In the case of non-Spark processing systems (e.g. Flink, Hive), the processing can be done in the respective systems
-and later sent into a Hudi table via a Kafka topic/HDFS intermediate file. In more conceptual level, data processing
+and later sent into a Hudi table via a Kafka topic/DFS intermediate file. At a more conceptual level, data processing
 pipelines just consist of three components : `source`, `processing`, `sink`, with users ultimately running queries against the sink to use the results of the pipeline.
-Hudi can act as either a source or sink, that stores data on HDFS. Applicability of Hudi to a given stream processing pipeline ultimately boils down to suitability
+Hudi can act as either a source or a sink that stores data on DFS. Applicability of Hudi to a given stream processing pipeline ultimately boils down to the suitability
 of Presto/SparkSQL/Hive for your queries.
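To illustrate the "sink" case, the computed results of a batch or micro-batch job can be written into a Hudi dataset through the Spark datasource. The snippet below is only a hedged sketch: `inputDF`, the field names and the base path are placeholders, and the exact option keys should be checked against the datasource configs on the configurations page.

```
// Hedged sketch (spark-shell / Spark job): write pipeline output into a Hudi
// dataset so it can be queried via Presto/SparkSQL/Hive. Paths, field names
// and option keys below are illustrative placeholders.
val inputDF = spark.read.json("/path/to/new/events")             // placeholder source

inputDF.write
  .format("com.uber.hoodie")                                      // Hudi Spark datasource
  .option("hoodie.datasource.write.recordkey.field", "uuid")      // record key field
  .option("hoodie.datasource.write.partitionpath.field", "date")  // partition path field
  .option("hoodie.datasource.write.precombine.field", "ts")       // pick latest on key collision
  .option("hoodie.table.name", "trips")
  .mode("append")
  .save("/path/to/basePath")
```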
 
 More advanced use cases revolve around the concepts of [incremental processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop), which effectively
 uses Hudi even inside the `processing` engine to speed up typical batch pipelines. For example, Hudi can be used as a state store inside a processing DAG (similar
 to how [rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend) is used by Flink). This is an item on the roadmap
-and will eventually happen as a [Beam Runner](https://github.com/uber/hoodie/issues/8)
+and will eventually happen as a [Beam Runner](https://issues.apache.org/jira/browse/HUDI-60)
diff --git a/docs/concepts.md b/docs/concepts.md
index 7532631..a2f4322 100644
--- a/docs/concepts.md
+++ b/docs/concepts.md
@@ -7,7 +7,7 @@ toc: false
 summary: "Here we introduce some basic concepts & give a broad technical overview of Hudi"
 ---
 
-Apache Hudi (pronounced “Hudi”) provides the following primitives over datasets on HDFS
+Apache Hudi (pronounced “Hoodie”) provides the following primitives over datasets on DFS
 
  * Upsert                     (how do I change the dataset?)
  * Incremental consumption    (how do I fetch data that changed?)
diff --git a/docs/configurations.md b/docs/configurations.md
index 3a9b88e..dd2fa8a 100644
--- a/docs/configurations.md
+++ b/docs/configurations.md
@@ -22,9 +22,9 @@ Immaterial of whether RDD/WriteClient APIs or Datasource is used, the following
 to cloud stores.
 
  * [AWS S3](s3_hoodie.html) <br/>
-   Configurations required for S3 and Hoodie co-operability.
+   Configurations required for S3 and Hudi co-operability.
  * [Google Cloud Storage](gcs_hoodie.html) <br/>
-   Configurations required for GCS and Hoodie co-operability.
+   Configurations required for GCS and Hudi co-operability.
 
 ### Spark Datasource Configs {#spark-datasource}
 
@@ -155,10 +155,10 @@ Following subsections go over different aspects of write configs, explaining mos
 
 - [withPath](#withPath) (hoodie_base_path) 
 Property: `hoodie.base.path` [Required] <br/>
-<span style="color:grey">Base HDFS path under which all the data partitions are created. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under the base directory. </span>
+<span style="color:grey">Base DFS path under which all the data partitions are created. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under the base directory. </span>
 - [withSchema](#withSchema) (schema_str) <br/> 
 Property: `hoodie.avro.schema` [Required]<br/>
-<span style="color:grey">This is the current reader avro schema for the Hoodie Dataset. This is a string of the entire schema. HoodieWriteClient uses this schema to pass on to implementations of HoodieRecordPayload to convert from the source format to avro record. This is also used when re-writing records during an update. </span>
+<span style="color:grey">This is the current reader avro schema for the dataset. This is a string of the entire schema. HoodieWriteClient uses this schema to pass on to implementations of HoodieRecordPayload to convert from the source format to avro record. This is also used when re-writing records during an update. </span>
 - [forTable](#forTable) (table_name)<br/> 
 Property: `hoodie.table.name` [Required] <br/>
  <span style="color:grey">Table name for the dataset, will be used for registering with Hive. Needs to be same across runs.</span>
@@ -170,7 +170,7 @@ Property: `hoodie.insert.shuffle.parallelism`, `hoodie.upsert.shuffle.parallelis
 <span style="color:grey">Once data has been initially imported, this parallelism controls initial parallelism for reading input records. Ensure this value is high enough say: 1 partition for 1 GB of input data</span>
 - [combineInput](#combineInput) (on_insert = false, on_update=true)<br/> 
 Property: `hoodie.combine.before.insert`, `hoodie.combine.before.upsert`<br/>
-<span style="color:grey">Flag which first combines the input RDD and merges multiple partial records into a single record before inserting or updating in HDFS</span>
+<span style="color:grey">Flag which first combines the input RDD and merges multiple partial records into a single record before inserting or updating in DFS</span>
 - [withWriteStatusStorageLevel](#withWriteStatusStorageLevel) (level = MEMORY_AND_DISK_SER)<br/> 
 Property: `hoodie.write.status.storage.level`<br/>
 <span style="color:grey">HoodieWriteClient.insert and HoodieWriteClient.upsert returns a persisted RDD[WriteStatus], this is because the Client can choose to inspect the WriteStatus and choose and commit or not based on the failures. This is a configuration for the storage level for this RDD </span>
@@ -215,7 +215,7 @@ Following configs control indexing behavior, which tags incoming records as eith
     <span style="color:grey">Only application if index type is HBASE. HBase ZK Quorum port to connect to.</span>
     - [hbaseTableName](#hbaseTableName) (tableName) [Required]<br/>
     Property: `hoodie.index.hbase.table` <br/>
-    <span style="color:grey">Only application if index type is HBASE. HBase Table name to use as the index. Hoodie stores the row_key and [partition_path, fileID, commitTime] mapping in the table.</span>
+    <span style="color:grey">Only application if index type is HBASE. HBase Table name to use as the index. Hudi stores the row_key and [partition_path, fileID, commitTime] mapping in the table.</span>
 
 #### Storage configs
 Controls aspects around sizing parquet and log files.
@@ -223,7 +223,7 @@ Controls aspects around sizing parquet and log files.
 - [withStorageConfig](#withStorageConfig) (HoodieStorageConfig) <br/>
     - [limitFileSize](#limitFileSize) (size = 120MB) <br/>
     Property: `hoodie.parquet.max.file.size` <br/>
-    <span style="color:grey">Target size for parquet files produced by Hudi write phases. For HDFS, this needs to be aligned with the underlying filesystem block size for optimal performance. </span>
+    <span style="color:grey">Target size for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance. </span>
     - [parquetBlockSize](#parquetBlockSize) (rowgroupsize = 120MB) <br/>
     Property: `hoodie.parquet.block.size` <br/>
     <span style="color:grey">Parquet RowGroup size. Its better this is same as the file size, so that a single column within a file is stored continuously on disk</span>
@@ -249,13 +249,13 @@ Configs that control compaction (merging of log files onto a new parquet base fi
 - [withCompactionConfig](#withCompactionConfig) (HoodieCompactionConfig) <br/>
     - [withCleanerPolicy](#withCleanerPolicy) (policy = KEEP_LATEST_COMMITS) <br/>
     Property: `hoodie.cleaner.policy` <br/>
-    <span style="color:grey">Hoodie Cleaning policy. Hoodie will delete older versions of parquet files to re-claim space. Any Query/Computation referring to this version of the file will fail. It is good to make sure that the data is retained for more than the maximum query execution time.</span>
+    <span style="color:grey"> Cleaning policy to be used. Hudi will delete older versions of parquet files to re-claim space. Any Query/Computation referring to this version of the file will fail. It is good to make sure that the data is retained for more than the maximum query execution time.</span>
     - [retainCommits](#retainCommits) (no_of_commits_to_retain = 24) <br/>
     Property: `hoodie.cleaner.commits.retained` <br/>
     <span style="color:grey">Number of commits to retain. So data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much you can incrementally pull on this dataset</span>
     - [archiveCommitsWith](#archiveCommitsWith) (minCommits = 96, maxCommits = 128) <br/>
     Property: `hoodie.keep.min.commits`, `hoodie.keep.max.commits` <br/>
-    <span style="color:grey">Each commit is a small file in the `.hoodie` directory. Since HDFS is not designed to handle multiple small files, Hudi archives older commits into a sequential log. A commit is published atomically by a rename of the commit file.</span>
+    <span style="color:grey">Each commit is a small file in the `.hoodie` directory. Since DFS typically does not favor lots of small files, Hudi archives older commits into a sequential log. A commit is published atomically by a rename of the commit file.</span>
     - [compactionSmallFileSize](#compactionSmallFileSize) (size = 0) <br/>
     Property: `hoodie.parquet.small.file.limit` <br/>
     <span style="color:grey">This should be less < maxFileSize and setting it to 0, turns off this feature. Small files can always happen because of the number of insert records in a partition in a batch. Hudi has an option to auto-resolve small files by masking inserts into this partition as updates to existing small files. The size here is the minimum file size considered as a "small file size".</span>
@@ -264,10 +264,10 @@ Configs that control compaction (merging of log files onto a new parquet base fi
     <span style="color:grey">Insert Write Parallelism. Number of inserts grouped for a single partition. Writing out 100MB files, with atleast 1kb records, means 100K records per file. Default is to overprovision to 500K. To improve insert latency, tune this to match the number of records in a single file. Setting this to a low number, will result in small files (particularly when compactionSmallFileSize is 0)</span>
     - [autoTuneInsertSplits](#autoTuneInsertSplits) (true) <br/>
     Property: `hoodie.copyonwrite.insert.auto.split` <br/>
-    <span style="color:grey">Should hoodie dynamically compute the insertSplitSize based on the last 24 commit's metadata. Turned off by default. </span>
+    <span style="color:grey">Should hudi dynamically compute the insertSplitSize based on the last 24 commit's metadata. Turned off by default. </span>
     - [approxRecordSize](#approxRecordSize) () <br/>
     Property: `hoodie.copyonwrite.record.size.estimate` <br/>
-    <span style="color:grey">The average record size. If specified, hoodie will use this and not compute dynamically based on the last 24 commit's metadata. No value set as default. This is critical in computing the insert parallelism and bin-packing inserts into small files. See above.</span>
+    <span style="color:grey">The average record size. If specified, hudi will use this and not compute dynamically based on the last 24 commit's metadata. No value set as default. This is critical in computing the insert parallelism and bin-packing inserts into small files. See above.</span>
     - [withInlineCompaction](#withInlineCompaction) (inlineCompaction = false) <br/>
     Property: `hoodie.compact.inline` <br/>
     <span style="color:grey">When set to true, compaction is triggered by the ingestion itself, right after a commit/deltacommit action as part of insert/upsert/bulk_insert</span>
@@ -301,7 +301,7 @@ Configs that control compaction (merging of log files onto a new parquet base fi
 Enables reporting of Hudi metrics to graphite.
 
 - [withMetricsConfig](#withMetricsConfig) (HoodieMetricsConfig) <br/>
-<span style="color:grey">Hoodie publishes metrics on every commit, clean, rollback etc.</span>
+<span style="color:grey">Hudi publishes metrics on every commit, clean, rollback etc.</span>
     - [on](#on) (metricsOn = true) <br/>
     Property: `hoodie.metrics.on` <br/>
     <span style="color:grey">Turn sending metrics on/off. on by default.</span>
@@ -335,11 +335,11 @@ Controls memory usage for compaction and merges, performed internally by Hudi
 
 Writing data via Hudi happens as a Spark job and thus the general rules of Spark debugging apply here too. Below is a list of things to keep in mind, if you are looking to improve performance or reliability.
 
-**Input Parallelism** : By default, Hoodie tends to over-partition input (i.e `withParallelism(1500)`), to ensure each Spark partition stays within the 2GB limit for inputs upto 500GB. Bump this up accordingly if you have larger inputs. We recommend having shuffle parallelism `hoodie.[insert|upsert|bulkinsert].shuffle.parallelism` such that its atleast input_data_size/500MB
+**Input Parallelism** : By default, Hudi tends to over-partition input (i.e. `withParallelism(1500)`), to ensure each Spark partition stays within the 2GB limit for inputs up to 500GB. Bump this up accordingly if you have larger inputs. We recommend setting the shuffle parallelism `hoodie.[insert|upsert|bulkinsert].shuffle.parallelism` such that it's at least input_data_size/500MB
 
-**Off-heap memory** : Hoodie writes parquet files and that needs good amount of off-heap memory proportional to schema width. Consider setting something like `spark.yarn.executor.memoryOverhead` or `spark.yarn.driver.memoryOverhead`, if you are running into such failures.
+**Off-heap memory** : Hudi writes parquet files and that needs a good amount of off-heap memory proportional to schema width. Consider setting something like `spark.yarn.executor.memoryOverhead` or `spark.yarn.driver.memoryOverhead`, if you are running into such failures.
 
-**Spark Memory** : Typically, hoodie needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.storage.memoryFraction` will generally help boost performance.
+**Spark Memory** : Typically, Hudi needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accommodate this. In addition, Hudi caches the input to be able to intelligently place data, and thus leaving some `spark.storage.memoryFraction` will generally help boost performance.
 
 **Sizing files** : Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
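As a rough illustration of wiring the Spark-side knobs above into a job (a hedged sketch; the values are placeholders to be sized from your input volume and cluster):

```
// Hedged sketch: Spark-side tuning knobs from the section above.
// Values are placeholders, not recommendations.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-ingest")
  // Off-heap headroom for parquet writing ("Off-heap memory" above)
  .config("spark.yarn.executor.memoryOverhead", "3072")
  .config("spark.yarn.driver.memoryOverhead", "1024")
  // Leave some storage memory for Hudi's cached input ("Spark Memory" above)
  .config("spark.storage.memoryFraction", "0.4")
  .getOrCreate()

// Note: hoodie.[insert|upsert|bulkinsert].shuffle.parallelism is a Hudi write
// config, not a Spark conf -- pass it to the writer (e.g. as a datasource
// option), sized roughly as input_data_size / 500MB.
```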
 
diff --git a/docs/contributing.md b/docs/contributing.md
index 028ab00..cf449f3 100644
--- a/docs/contributing.md
+++ b/docs/contributing.md
@@ -42,6 +42,8 @@ Here's a typical lifecycle of events to contribute to Hudi.
    - Add adequate tests for your new functionality
    - [Optional] For involved changes, it's best to also run the entire integration test suite using `mvn clean install`
    - For website changes, please build the site locally & test navigation, formatting & links thoroughly
+   - If your change affects some aspect of the documentation (e.g. a new config, a default value change), 
+     please ensure there is another PR to [update the docs](https://github.com/apache/incubator-hudi/blob/asf-site/docs/README.md) as well.
  - Format commit messages and the pull request title like `[HUDI-XXX] Fixes bug in Spark Datasource`,
    where you replace HUDI-XXX with the appropriate JIRA issue.
  - Push your commit to your own fork/branch & create a pull request (PR) against the Hudi repo.
diff --git a/docs/gcs_filesystem.md b/docs/gcs_filesystem.md
index 26c4401..3919fdf 100644
--- a/docs/gcs_filesystem.md
+++ b/docs/gcs_filesystem.md
@@ -6,7 +6,7 @@ permalink: gcs_hoodie.html
 toc: false
 summary: In this page, we go over how to configure hudi with Google Cloud Storage.
 ---
-For Hudi storage on GCS, **regional** buckets provide an HDFS API with strong consistency.
+For Hudi storage on GCS, **regional** buckets provide a DFS API with strong consistency.
 
 ## GCS Configs
 
diff --git a/docs/images/hoodie_commit_duration.png b/docs/images/hudi_commit_duration.png
similarity index 100%
rename from docs/images/hoodie_commit_duration.png
rename to docs/images/hudi_commit_duration.png
diff --git a/docs/images/hoodie_intro_1.png b/docs/images/hudi_intro_1.png
similarity index 100%
rename from docs/images/hoodie_intro_1.png
rename to docs/images/hudi_intro_1.png
diff --git a/docs/images/hoodie_log_format_v2.png b/docs/images/hudi_log_format_v2.png
similarity index 100%
rename from docs/images/hoodie_log_format_v2.png
rename to docs/images/hudi_log_format_v2.png
diff --git a/docs/images/hoodie_query_perf_hive.png b/docs/images/hudi_query_perf_hive.png
similarity index 100%
rename from docs/images/hoodie_query_perf_hive.png
rename to docs/images/hudi_query_perf_hive.png
diff --git a/docs/images/hoodie_query_perf_presto.png b/docs/images/hudi_query_perf_presto.png
similarity index 100%
rename from docs/images/hoodie_query_perf_presto.png
rename to docs/images/hudi_query_perf_presto.png
diff --git a/docs/images/hoodie_query_perf_spark.png b/docs/images/hudi_query_perf_spark.png
similarity index 100%
rename from docs/images/hoodie_query_perf_spark.png
rename to docs/images/hudi_query_perf_spark.png
diff --git a/docs/images/hoodie_upsert_dag.png b/docs/images/hudi_upsert_dag.png
similarity index 100%
rename from docs/images/hoodie_upsert_dag.png
rename to docs/images/hudi_upsert_dag.png
diff --git a/docs/images/hoodie_upsert_perf1.png b/docs/images/hudi_upsert_perf1.png
similarity index 100%
rename from docs/images/hoodie_upsert_perf1.png
rename to docs/images/hudi_upsert_perf1.png
diff --git a/docs/images/hoodie_upsert_perf2.png b/docs/images/hudi_upsert_perf2.png
similarity index 100%
rename from docs/images/hoodie_upsert_perf2.png
rename to docs/images/hudi_upsert_perf2.png
diff --git a/docs/implementation.md b/docs/implementation.md
index cbf394e..54966e2 100644
--- a/docs/implementation.md
+++ b/docs/implementation.md
@@ -10,7 +10,7 @@ Hudi (pronounced “Hoodie”) is implemented as a Spark library, which makes it
 libraries (which we will refer to as `Hudi clients`). Hudi Clients prepare an `RDD[HoodieRecord]` that contains the data to be upserted and
 Hudi upsert/insert is merely a Spark DAG that can be broken into two big pieces.
 
- - **Indexing** :  A big part of Hoodie's efficiency comes from indexing the mapping from record keys to the file ids, to which they belong to.
+ - **Indexing** :  A big part of Hudi's efficiency comes from indexing the mapping from record keys to the file ids they belong to.
  This index also helps the `HoodieWriteClient` separate upserted records into inserts and updates, so they can be treated differently.
 `HoodieReadClient` supports operations such as `filterExists` (used for de-duplication of a table) and an efficient batch `read(keys)` API that
 can read out the records corresponding to the keys using the index much more quickly than a typical scan via a query. The index is also atomically
@@ -66,7 +66,7 @@ In this storage, index updation is a no-op, since the bloom filters are already
 In the case of Copy-On-Write, a single parquet file constitutes one `file slice` which contains one complete version of
 the file
 
-{% include image.html file="hoodie_log_format_v2.png" alt="hoodie_log_format_v2.png" max-width="1000" %}
+{% include image.html file="hudi_log_format_v2.png" alt="hudi_log_format_v2.png" max-width="1000" %}
 
 #### Merge On Read
 
@@ -222,7 +222,7 @@ log blocks of 2 delta-commits (DC2 and DC3).
    incremental ingestion (writer at DC6) happened before the compaction (some time “Tc”).
    The below description is with regards to compaction from file-group perspective.
    * `Reader querying at time between ingestion completion time for DC6 and compaction finish “Tc”`:
-     Hoodie’s implementation will be changed to become aware of file-groups currently waiting for compaction and
+     Hudi’s implementation will be changed to become aware of file-groups currently waiting for compaction and
      merge log-files corresponding to DC2-DC6 with the base-file corresponding to SC1. In essence, Hudi will create
      a pseudo file-slice by combining the 2 file-slices starting at base-commits SC1 and SC5 to one.
      For file-groups not waiting for compaction, the reader behavior is essentially the same - read latest file-slice
@@ -247,14 +247,14 @@ the conventional alternatives for achieving these tasks.
 The following shows the speed up obtained for NoSQL ingestion, by switching from bulk loads off HBase to Parquet, to incrementally upserting
 on a Hudi dataset, on 5 tables ranging from small to huge.
 
-{% include image.html file="hoodie_upsert_perf1.png" alt="hoodie_upsert_perf1.png" max-width="1000" %}
+{% include image.html file="hudi_upsert_perf1.png" alt="hudi_upsert_perf1.png" max-width="1000" %}
 
 
 Given Hudi can build the dataset incrementally, it opens doors for also scheduling ingestion more frequently, thus reducing latency, with
 significant savings on the overall compute cost.
 
 
-{% include image.html file="hoodie_upsert_perf2.png" alt="hoodie_upsert_perf2.png" max-width="1000" %}
+{% include image.html file="hudi_upsert_perf2.png" alt="hudi_upsert_perf2.png" max-width="1000" %}
 
 Hudi upserts have been stress tested up to 4TB in a single commit across the t1 table.
 
@@ -267,12 +267,12 @@ with no impact on queries. Following charts compare the Hudi vs non-Hudi dataset
 
 **Hive**
 
-{% include image.html file="hoodie_query_perf_hive.png" alt="hoodie_query_perf_hive.png" max-width="800" %}
+{% include image.html file="hudi_query_perf_hive.png" alt="hudi_query_perf_hive.png" max-width="800" %}
 
 **Spark**
 
-{% include image.html file="hoodie_query_perf_spark.png" alt="hoodie_query_perf_spark.png" max-width="1000" %}
+{% include image.html file="hudi_query_perf_spark.png" alt="hudi_query_perf_spark.png" max-width="1000" %}
 
 **Presto**
 
-{% include image.html file="hoodie_query_perf_presto.png" alt="hoodie_query_perf_presto.png" max-width="1000" %}
+{% include image.html file="hudi_query_perf_presto.png" alt="hudi_query_perf_presto.png" max-width="1000" %}
diff --git a/docs/incremental_processing.md b/docs/incremental_processing.md
index 7c97cc9..63f4f39 100644
--- a/docs/incremental_processing.md
+++ b/docs/incremental_processing.md
@@ -13,7 +13,7 @@ discusses a few tools that can be used to achieve these on different contexts.
 
 ## Incremental Ingestion
 
-Following means can be used to apply a delta or an incremental change to a Hudi dataset. For e.g, the incremental changes could be from a Kafka topic or files uploaded to HDFS or
+The following means can be used to apply a delta or an incremental change to a Hudi dataset. For example, the incremental changes could be from a Kafka topic or files uploaded to DFS or
 even changes pulled from another Hudi dataset.
 
 #### DeltaStreamer Tool
@@ -23,9 +23,10 @@ from different sources such as DFS or Kafka.
 
 The tool is a Spark job (part of hoodie-utilities) that provides the following functionality
 
- - Ability to consume new events from Kafka, incremental imports from Sqoop or output of `HiveIncrementalPuller` or files under a folder on HDFS
+ - Ability to consume new events from Kafka, incremental imports from Sqoop or output of `HiveIncrementalPuller` or files under a folder on DFS
  - Support for json, avro or custom payload types for the incoming data
- - New data is written to a Hudi dataset, with support for checkpointing & schemas and registered onto Hive
+ - Pick up avro schemas from DFS or Confluent [schema registry](https://github.com/confluentinc/schema-registry).
+ - New data is written to a Hudi dataset, with support for checkpointing, and registered onto Hive
 
 Command line options describe capabilities in more detail (first build hoodie-utilities using `mvn clean package`).
 
@@ -116,7 +117,7 @@ and then ingest it as follows.
   --op BULK_INSERT
 ```
 
-In some cases, you may want to convert your existing dataset into Hoodie, before you can begin ingesting new data. This can be accomplished using the `hdfsparquetimport` command on the `hoodie-cli`.
+In some cases, you may want to convert your existing dataset into Hudi, before you can begin ingesting new data. This can be accomplished using the `hdfsparquetimport` command on the `hoodie-cli`.
 Currently, there is support for converting parquet datasets.
 
 #### Via Custom Spark Job
@@ -167,9 +168,6 @@ Usage: <main class> [options]
 
 ```
 
-{% include callout.html content="Note that for now, due to jar mismatches between Spark & Hive, its recommended to run this as a separate Java task in your workflow manager/cron. This is getting fix [here](https://github.com/uber/hoodie/issues/123)" type="info" %}
-
-
 ## Incrementally Pulling
 
 Hudi datasets can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows since a specified commit timestamp.
@@ -196,10 +194,10 @@ A sample incremental pull, that will obtain all records written since `beginInst
 Please refer to [configurations](configurations.html) section, to view all datasource options.
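For reference, a hedged sketch of such an incremental pull through the Spark datasource; the option keys and values shown are assumptions and should be verified against the datasource configs, and the base path and `beginInstantTime` are placeholders.

```
// Hedged sketch: pull only the records written after beginInstantTime.
// Option keys are assumptions -- verify them on the configurations page.
val beginInstantTime = "20190301000000"   // placeholder commit time
val incrementalDF = spark.read
  .format("com.uber.hoodie")
  .option("hoodie.datasource.view.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", beginInstantTime)
  .load("/path/to/basePath")

incrementalDF.createOrReplaceTempView("trips_incremental")
spark.sql("SELECT _hoodie_commit_time, count(*) FROM trips_incremental GROUP BY _hoodie_commit_time").show()
```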
 
 
-Additionally, `HoodieReadClient` offers the following functionality using Hoodie's implicit indexing.
+Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing.
 
 | **API** | **Description** |
 | ------- | --------------- |
-| read(keys) | Read out the data corresponding to the keys as a DataFrame, using Hoodie's own index for faster lookup |
+| read(keys) | Read out the data corresponding to the keys as a DataFrame, using Hudi's own index for faster lookup |
 | filterExists() | Filter out already existing records from the provided RDD[HoodieRecord]. Useful for de-duplication |
 | checkExists(keys) | Check if the provided keys exist in a Hudi dataset |
 
@@ -217,12 +215,12 @@ The following are the configuration options for HiveIncrementalPuller
 |hiveUser| Hive Server 2 Username |  |
 |hivePass| Hive Server 2 Password |  |
 |queue| YARN Queue name |  |
-|tmp| Directory where the temporary delta data is stored in HDFS. The directory structure will follow conventions. Please see the below section.  |  |
+|tmp| Directory where the temporary delta data is stored in DFS. The directory structure will follow conventions. Please see the below section.  |  |
 |extractSQLFile| The SQL to execute on the source table to extract the data. The data extracted will be all the rows that changed since a particular point in time. |  |
 |sourceTable| Source Table Name. Needed to set hive environment properties. |  |
 |targetTable| Target Table Name. Needed for the intermediate storage directory structure.  |  |
-|sourceDataPath| Source HDFS Base Path. This is where the Hudi metadata will be read. |  |
-|targetDataPath| Target HDFS Base path. This is needed to compute the fromCommitTime. This is not needed if fromCommitTime is specified explicitly. |  |
+|sourceDataPath| Source DFS Base Path. This is where the Hudi metadata will be read. |  |
+|targetDataPath| Target DFS Base path. This is needed to compute the fromCommitTime. This is not needed if fromCommitTime is specified explicitly. |  |
 |tmpdb| The database in which the intermediate temp delta table will be created | hoodie_temp |
 |fromCommitTime| This is the most important parameter. This is the point in time from which the changed records are pulled.  |  |
 |maxCommits| Number of commits to include in the pull. Setting this to -1 will include all the commits from fromCommitTime. Setting this to a value > 0, will include records that ONLY changed in the specified number of commits after fromCommitTime. This may be needed if you need to catch up say 2 commits at a time. | 3 |
diff --git a/docs/index.md b/docs/index.md
index 22e1174..0ae003c 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -7,14 +7,14 @@ permalink: index.html
 summary: "Hudi brings stream processing to big data, providing fresh data while being an order of magnitude efficient over traditional batch processing."
 ---
 
-Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical datasets on [HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) or cloud stores and provides three logical views for query access.
+Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical datasets over DFS ([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) or cloud stores) and provides three logical views for query access.
 
  * **Read Optimized View** - Provides excellent query performance on pure columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
  * **Incremental View** - Provides a change stream out of the dataset to feed downstream jobs/ETLs.
  * **Near-Real time Table** - Provides queries on real-time data, using a combination of columnar & row based storage (e.g Parquet + [Avro](http://avro.apache.org/docs/current/mr.html))
 
 
-{% include image.html file="hoodie_intro_1.png" alt="hoodie_intro_1.png" %}
+{% include image.html file="hudi_intro_1.png" alt="hudi_intro_1.png" %}
 
 By carefully managing how data is laid out in storage & how it’s exposed to queries, Hudi is able to power a rich data ecosystem where external sources can be ingested in near real-time and made available for interactive SQL Engines like [Presto](https://prestodb.io) & [Spark](https://spark.apache.org/sql/), while at the same time capable of being consumed incrementally from processing/ETL frameworks like [Hive](https://hive.apache.org/) & [Spark](https://spark.apache.org/docs/latest/) t [...]
 
diff --git a/docs/migration_guide.md b/docs/migration_guide.md
index f785251..e415ed3 100644
--- a/docs/migration_guide.md
+++ b/docs/migration_guide.md
@@ -29,7 +29,7 @@ Take this approach if your dataset is an append only type of dataset and you do
 
 Import your existing dataset into a Hudi managed dataset. Since all the data is Hudi managed, none of the limitations
  of Approach 1 apply here. Updates spanning any partitions can be applied to this dataset and Hudi will efficiently
- make the update available to queries. Note that not only do you get to use all Hoodie primitives on this dataset,
+ make the update available to queries. Note that not only do you get to use all Hudi primitives on this dataset,
  there are other additional advantages of doing this. Hudi automatically manages file sizes of a Hudi managed dataset.
  You can define the desired file size when converting this dataset and Hudi will ensure it writes out files
  adhering to the config. It will also ensure that smaller files later get corrected by routing some new inserts into
@@ -38,9 +38,8 @@ Import your existing dataset into a Hudi managed dataset. Since all the data is
 There are a few options when choosing this approach.
 
 #### Option 1
-Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing dataset is in
-parquet file
-format. This tool essentially starts a Spark Job to read the existing parquet dataset and converts it into a HUDI managed dataset by re-writing all the data.
+Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing dataset is in parquet file format.
+This tool essentially starts a Spark Job to read the existing parquet dataset and converts it into a HUDI managed dataset by re-writing all the data.
 
 #### Option 2
 For huge datasets, this could be as simple as : for partition in [list of partitions in source dataset] {
@@ -53,7 +52,7 @@ Write your own custom logic of how to load an existing dataset into a Hudi manag
  [here](quickstart.html).
 
 ```
-Using the HDFSParquetImporter Tool. Once hoodie has been built via `mvn clean install -DskipTests`, the shell can be
+Using the HDFSParquetImporter Tool. Once Hudi has been built via `mvn clean install -DskipTests`, the shell can be
 fired up via `cd hoodie-cli && ./hoodie-cli.sh`.
 
 hoodie->hdfsparquetimport
diff --git a/docs/quickstart.md b/docs/quickstart.md
index 5a9193a..e19bedf 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -30,7 +30,8 @@ $ mvn clean install -DskipTests -DskipITs -Dhive11
 
 ## Version Compatibility
 
-Hudi requires Java 8 to be installed. Hudi works with Spark-2.x versions. We have verified that Hudi works with the following combination of Hadoop/Hive/Spark.
+Hudi requires Java 8 to be installed on a *nix system. Hudi works with Spark-2.x versions. 
+Further, we have verified that Hudi works with the following combinations of Hadoop/Hive/Spark.
 
 | Hadoop | Hive  | Spark | Instructions to Build Hudi |
 | ---- | ----- | ---- | ---- |
@@ -38,8 +39,9 @@ Hudi requires Java 8 to be installed. Hudi works with Spark-2.x versions. We hav
 | Apache hadoop-2.8.4 | Apache hive-2.3.3 | spark-2.[1-3].x | Use "mvn clean install -DskipTests" |
 | Apache hadoop-2.7.3 | Apache hive-1.2.1 | spark-2.[1-3].x | Use "mvn clean install -DskipTests" |
 
-If your environment has other versions of hadoop/hive/spark, please try out Hudi and let us know if there are any issues. We are limited by our bandwidth to certify other combinations.
-It would be of great help if you can reach out to us with your setup and experience with hoodie.
+If your environment has other versions of hadoop/hive/spark, please try out Hudi and let us know if there are any issues.
+We are limited by our bandwidth to certify other combinations (e.g. Docker on Windows).
+It would be of great help if you could reach out to us with your setup and experience with Hudi.
 
 ## Generate a Hudi Dataset
 
@@ -67,7 +69,7 @@ Use the RDD API to perform more involved actions on a Hudi dataset
 
 #### DataSource API
 
-Run __hoodie-spark/src/test/java/HoodieJavaApp.java__ class, to place a two commits (commit 1 => 100 inserts, commit 2 => 100 updates to previously inserted 100 records) onto your HDFS/local filesystem. Use the wrapper script
+Run the __hoodie-spark/src/test/java/HoodieJavaApp.java__ class, to place two commits (commit 1 => 100 inserts, commit 2 => 100 updates to the previously inserted 100 records) onto your DFS/local filesystem. Use the wrapper script
 to run from command-line
 
 ```
@@ -330,9 +332,9 @@ The steps assume you are using Mac laptop
 ### Setting up Docker Cluster
 
 
-#### Build Hoodie
+#### Build Hudi
 
-The first step is to build hoodie
+The first step is to build Hudi
 ```
 cd <HUDI_WORKSPACE>
 mvn package -DskipTests
@@ -451,7 +453,7 @@ automatically initializes the datasets in the file-system if they do not exist y
 docker exec -it adhoc-2 /bin/bash
 
 # Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_cow dataset in HDFS
-spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties
+spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties --schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
 ....
 ....
 2018-09-24 22:20:00 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
@@ -981,7 +983,7 @@ Again, You can use Hudi CLI to manually schedule and run compaction
 
 ```
 docker exec -it adhoc-1 /bin/bash
-^[[Aroot@adhoc-1:/opt#   /var/hoodie/ws/hoodie-cli/hoodie-cli.sh
+root@adhoc-1:/opt#   /var/hoodie/ws/hoodie-cli/hoodie-cli.sh
 ============================================
 *                                          *
 *     _    _                 _ _           *
@@ -1166,7 +1168,7 @@ This brings the demo to an end.
 
 ## Testing Hudi in Local Docker environment
 
-You can bring up a hadoop docker environment containing Hadoop, Hive and Spark services with support for hoodie.
+You can bring up a Docker environment containing Hadoop, Hive and Spark services with support for Hudi.
 ```
 $ mvn pre-integration-test -DskipTests
 ```
diff --git a/docs/s3_filesystem.md b/docs/s3_filesystem.md
index 6c7b636..de16123 100644
--- a/docs/s3_filesystem.md
+++ b/docs/s3_filesystem.md
@@ -10,7 +10,7 @@ In this page, we explain how to get your Hudi spark job to store into AWS S3.
 
 ## AWS configs
 
-There are two configurations required for Hoodie-S3 compatibility:
+There are two configurations required for Hudi-S3 compatibility:
 
 - Adding AWS Credentials for Hudi
 - Adding required Jars to classpath
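For instance, the credentials piece can be handled by setting the standard hadoop-aws properties on the Hadoop configuration used by the Spark job (a hedged sketch; the `fs.s3a.*` keys are generic Hadoop settings, not Hudi-specific, and the bucket path is a placeholder).

```
// Hedged sketch: supply AWS credentials to the Hadoop FileSystem layer that
// the Hudi Spark job uses. fs.s3a.* are standard hadoop-aws properties.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-on-s3").getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// ...then point the Hudi basePath at an s3a:// location, e.g. s3a://my-bucket/hudi/trips
```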
diff --git a/docs/sql_queries.md b/docs/sql_queries.md
index 4fa795f..4dc7493 100644
--- a/docs/sql_queries.md
+++ b/docs/sql_queries.md
@@ -34,8 +34,6 @@ to using the Hive Serde to read the data (planning/executions is still Spark). T
 towards Parquet reading, which we will address in the next method based on path filters.
 However, benchmarks have not revealed any real performance degradation with Hudi & SparkSQL, compared to native support.
 
-{% include callout.html content="Get involved to improve this integration [here](https://github.com/uber/hoodie/issues/7) and [here](https://issues.apache.org/jira/browse/SPARK-19351) " type="info" %}
-
 A sample command is provided below to spin up the Spark shell
 
 ```
diff --git a/docs/use_cases.md b/docs/use_cases.md
index 0040bc1..9846aa0 100644
--- a/docs/use_cases.md
+++ b/docs/use_cases.md
@@ -18,7 +18,7 @@ even though this data is arguably the most valuable for the entire organization.
 
 
 For RDBMS ingestion, Hudi provides __faster loads via Upserts__, as opposed to costly & inefficient bulk loads. For example, you can read the MySQL BIN log or [Sqoop Incremental Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports) and apply them to an
-equivalent Hudi table on HDFS. This would be much faster/efficient than a [bulk merge job](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)
+equivalent Hudi table on DFS. This would be much faster and more efficient than a [bulk merge job](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)
 or [complicated handcrafted merge workflows](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
 
 
@@ -39,14 +39,14 @@ But, typically these systems end up getting abused for less interactive queries
 
 
 On the other hand, interactive SQL solutions on Hadoop such as Presto & SparkSQL excel in __queries that finish within a few seconds__.
-By bringing __data freshness to a few minutes__, Hudi can provide a much efficient alternative, as well unlock real-time analytics on __several magnitudes larger datasets__ stored in HDFS.
+By bringing __data freshness to a few minutes__, Hudi can provide a much more efficient alternative, as well as unlock real-time analytics on __several magnitudes larger datasets__ stored in DFS.
 Also, Hudi has no external dependencies (like a dedicated HBase cluster, purely used for real-time analytics) and thus enables faster analytics on much fresher data, without increasing the operational overhead.
 
 
 ## Incremental Processing Pipelines
 
 One fundamental ability Hadoop provides is to build a chain of datasets derived from each other via DAGs expressed as workflows.
-Workflows often depend on new data being output by multiple upstream workflows and traditionally, availability of new data is indicated by a new HDFS Folder/Hive Partition.
+Workflows often depend on new data being output by multiple upstream workflows and traditionally, availability of new data is indicated by a new DFS Folder/Hive Partition.
 Let's take a concrete example to illustrate this. An upstream workflow `U` can create a Hive partition for every hour, with data for that hour (event_time) at the end of each hour (processing_time), providing effective freshness of 1 hour.
 Then, a downstream workflow `D`, kicks off immediately after `U` finishes, and does its own processing for the next hour, increasing the effective latency to 2 hours.
 
@@ -63,14 +63,14 @@ like 15 mins, and providing an end-end latency of 30 mins at `HD`.
 
 {% include callout.html content="To achieve this, Hudi has embraced similar concepts from stream processing frameworks like [Spark Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html#join-operations) , Pub/Sub systems like [Kafka](http://kafka.apache.org/documentation/#theconsumer)
 or database replication technologies like [Oracle XStream](https://docs.oracle.com/cd/E11882_01/server.112/e16545/xstrm_cncpt.htm#XSTRM187).
-For the more curious, a more detailed explanation of the benefits of Incremetal Processing (compared to Stream Processing & Batch Processing) can be found [here](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)" type="info" %}
+For the more curious, a more detailed explanation of the benefits of Incremental Processing (compared to Stream Processing & Batch Processing) can be found [here](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)" type="info" %}
 
 
-## Data Dispersal From Hadoop
+## Data Dispersal From DFS
 
 A popular use-case for Hadoop is to crunch data and then disperse it back to an online serving store, to be used by an application.
 For e.g, a Spark Pipeline can [determine hard braking events on Hadoop](https://eng.uber.com/telematics/) and load them into a serving store like ElasticSearch, to be used by the Uber application to increase safe driving. Typical architectures for this employ a `queue` between Hadoop and serving store, to prevent overwhelming the target serving store.
-A popular choice for this queue is Kafka and this model often results in __redundant storage of same data on HDFS (for offline analysis on computed results) and Kafka (for dispersal)__
+A popular choice for this queue is Kafka and this model often results in __redundant storage of the same data on DFS (for offline analysis on computed results) and Kafka (for dispersal)__.
 
 Once again Hudi can efficiently solve this problem, by having the Spark Pipeline upsert output from
 each run into a Hudi dataset, which can then be incrementally tailed (just like a Kafka topic) for new data & written into the serving store.