Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/01/21 01:25:11 UTC

[GitHub] [incubator-hudi] bhasudha opened a new pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

bhasudha opened a new pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [x] Has a corresponding JIRA in PR title & commit
    
    - [x] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368783760
 
 

 ##########
 File path: docs/_docs/1_3_use_cases.md
 ##########
 @@ -20,7 +20,7 @@ or [complicated handcrafted merge workflows](http://hortonworks.com/blog/four-st
 For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / [Voldemort](http://www.project-voldemort.com/voldemort/) / [HBase](https://hbase.apache.org/), even moderately big installations store billions of rows.
 It goes without saying that __full bulk loads are simply infeasible__ and more efficient approaches are needed if ingestion is to keep up with the typically high update volumes.
 
-Even for immutable data sources like [Kafka](kafka.apache.org) , Hudi helps __enforces a minimum file size on HDFS__, which improves NameNode health by solving one of the [age old problems in Hadoop land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a holistic way. This is all the more important for event streams, since typically its higher volume (eg: click streams) and if not managed well, can cause serious damage to your Hadoop cluster.
+Even for immutable data sources like [Kafka](http://kafka.apache.org) , Hudi helps __enforces a minimum file size on HDFS__, which improves NameNode health by solving one of the [age old problems in Hadoop land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a holistic way. This is all the more important for event streams, since typically its higher volume (eg: click streams) and if not managed well, can cause serious damage to your Hadoop cluster.
 
 Review comment:
   ah. good catch


[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
bhasudha commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r369171184
 
 

 ##########
 File path: docs/_docs/2_1_concepts.md
 ##########
 @@ -53,69 +53,70 @@ With the help of the timeline, an incremental query attempting to get all new da
 only the changed files without say scanning all the time buckets > 07:00.
 
 ## File management
-Hudi organizes a datasets into a directory structure under a `basepath` on DFS. Dataset is broken up into partitions, which are folders containing data files for that partition,
+Hudi organizes a table into a directory structure under a `basepath` on DFS. Table is broken up into partitions, which are folders containing data files for that partition,
 very similar to Hive tables. Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath.
 
 Within each partition, files are organized into `file groups`, uniquely identified by a `file id`. Each file group contains several
-`file slices`, where each slice contains a base columnar file (`*.parquet`) produced at a certain commit/compaction instant time,
+`file slices`, where each slice contains a base file (`*.parquet`) produced at a certain commit/compaction instant time,
  along with set of log files (`*.log.*`) that contain inserts/updates to the base file since the base file was produced. 
 Hudi adopts a MVCC design, where compaction action merges logs and base files to produce new file slices and cleaning action gets rid of 
 unused/older file slices to reclaim space on DFS. 
 
-Hudi provides efficient upserts, by mapping a given hoodie key (record key + partition path) consistently to a file group, via an indexing mechanism. 
+## Index
+Hudi provides efficient upserts, by mapping a given hoodie key (record key + partition path) consistently to a file id, via an indexing mechanism. 
 This mapping between record key and file group/file id, never changes once the first version of a record has been written to a file. In short, the 
 mapped file group contains all versions of a group of records.
 
-## Storage Types & Views
-Hudi storage types define how data is indexed & laid out on the DFS and how the above primitives and timeline activities are implemented on top of such organization (i.e how data is written). 
-In turn, `views` define how the underlying data is exposed to the queries (i.e how data is read). 
+## Table Types & Querying
 
 Review comment:
   done
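
For context, the upsert path described in this hunk boils down to supplying the record key and partition path (the hoodie key) when writing. A minimal sketch through the Hudi Spark datasource — assuming a DataFrame `inputDF` with `uuid`, `partitionpath` and `ts` columns and a hypothetical base path; option keys can vary slightly across Hudi versions:

```scala
// Sketch only: upsert via the Hudi Spark datasource.
// The record key + partition path form the hoodie key, which the index
// maps to a file id / file group as described in the hunk above.
import org.apache.spark.sql.SaveMode

inputDF.write
  .format("org.apache.hudi")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "uuid")              // record key (assumed column)
  .option("hoodie.datasource.write.partitionpath.field", "partitionpath") // partition path (assumed column)
  .option("hoodie.datasource.write.precombine.field", "ts")               // picks latest among duplicate keys (assumed column)
  .option("hoodie.table.name", "hudi_trips")
  .mode(SaveMode.Append)
  .save("/tmp/hudi_trips")                                                // hypothetical basepath
```

Once the first version of a record is written this way, the index keeps routing the same hoodie key to the same file group on subsequent upserts, as the hunk notes.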


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368784018
 
 

 ##########
 File path: docs/_docs/2_1_concepts.md
 ##########
 @@ -53,69 +53,70 @@ With the help of the timeline, an incremental query attempting to get all new da
 only the changed files without say scanning all the time buckets > 07:00.
 
 ## File management
-Hudi organizes a datasets into a directory structure under a `basepath` on DFS. Dataset is broken up into partitions, which are folders containing data files for that partition,
+Hudi organizes a table into a directory structure under a `basepath` on DFS. Table is broken up into partitions, which are folders containing data files for that partition,
 very similar to Hive tables. Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath.
 
 Within each partition, files are organized into `file groups`, uniquely identified by a `file id`. Each file group contains several
-`file slices`, where each slice contains a base columnar file (`*.parquet`) produced at a certain commit/compaction instant time,
+`file slices`, where each slice contains a base file (`*.parquet`) produced at a certain commit/compaction instant time,
  along with set of log files (`*.log.*`) that contain inserts/updates to the base file since the base file was produced. 
 Hudi adopts a MVCC design, where compaction action merges logs and base files to produce new file slices and cleaning action gets rid of 
 unused/older file slices to reclaim space on DFS. 
 
-Hudi provides efficient upserts, by mapping a given hoodie key (record key + partition path) consistently to a file group, via an indexing mechanism. 
+## Index
+Hudi provides efficient upserts, by mapping a given hoodie key (record key + partition path) consistently to a file id, via an indexing mechanism. 
 This mapping between record key and file group/file id, never changes once the first version of a record has been written to a file. In short, the 
 mapped file group contains all versions of a group of records.
 
-## Storage Types & Views
-Hudi storage types define how data is indexed & laid out on the DFS and how the above primitives and timeline activities are implemented on top of such organization (i.e how data is written). 
-In turn, `views` define how the underlying data is exposed to the queries (i.e how data is read). 
+## Table Types & Querying
 
 Review comment:
   and Queries (instead of Querying)? 


[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
bhasudha commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r369171951
 
 

 ##########
 File path: docs/_docs/1_3_use_cases.md
 ##########
 @@ -20,7 +20,7 @@ or [complicated handcrafted merge workflows](http://hortonworks.com/blog/four-st
 For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / [Voldemort](http://www.project-voldemort.com/voldemort/) / [HBase](https://hbase.apache.org/), even moderately big installations store billions of rows.
 It goes without saying that __full bulk loads are simply infeasible__ and more efficient approaches are needed if ingestion is to keep up with the typically high update volumes.
 
-Even for immutable data sources like [Kafka](kafka.apache.org) , Hudi helps __enforces a minimum file size on HDFS__, which improves NameNode health by solving one of the [age old problems in Hadoop land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a holistic way. This is all the more important for event streams, since typically its higher volume (eg: click streams) and if not managed well, can cause serious damage to your Hadoop cluster.
+Even for immutable data sources like [Kafka](http://kafka.apache.org) , Hudi helps __enforces a minimum file size on HDFS__, which improves NameNode health by solving one of the [age old problems in Hadoop land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a holistic way. This is all the more important for event streams, since typically its higher volume (eg: click streams) and if not managed well, can cause serious damage to your Hadoop cluster.
 
 Review comment:
   sure


[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
bhasudha commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r369171577
 
 

 ##########
 File path: docs/_docs/2_3_querying_data.md
 ##########
 @@ -1,47 +1,52 @@
 ---
-title: Querying Hudi Datasets
+title: Querying Hudi Tables
 keywords: hudi, hive, spark, sql, presto
 permalink: /docs/querying_data.html
 summary: In this page, we go over how to enable SQL queries on Hudi built tables.
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Conceptually, Hudi stores data physically once on DFS, while providing 3 logical views on top, as explained [before](/docs/concepts.html#views). 
-Once the dataset is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the dataset can be queried by popular query engines like Hive, Spark and Presto.
+Conceptually, Hudi stores data physically once on DFS, while providing 3 different ways of querying, as explained [before](/docs/concepts.html#query-types). 
+Once the table is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
+bundle has been provided, the table can be queried by popular query engines like Hive, Spark and Presto.
 
-Specifically, there are two Hive tables named off [table name](/docs/configurations.html#TABLE_NAME_OPT_KEY) passed during write. 
-For e.g, if `table name = hudi_tbl`, then we get  
+Specifically, following Hive tables are registered based off [table name](/docs/configurations.html#TABLE_NAME_OPT_KEY) 
+and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) passed during write.   
 
- - `hudi_tbl` realizes the read optimized view of the dataset backed by `HoodieParquetInputFormat`, exposing purely columnar data.
- - `hudi_tbl_rt` realizes the real time view of the dataset  backed by `HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+If `table name = hudi_trips` and `table type = COPY_ON_WRITE`, then we get: 
+ - `hudi_trips` supports snapshot querying and incremental querying of the table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+
+
+If `table name = hudi_trips` and `table type = MERGE_ON_READ`, then we get:
+ - `hudi_trips_rt` supports snapshot querying and incremental querying (providing near-real time data) of the table  backed by `HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+ - `hudi_trips_ro` supports read optimized querying of the table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+ 
 
 As discussed in the concepts section, the one key primitive needed for [incrementally processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
-is `incremental pulls` (to obtain a change stream/log from a dataset). Hudi datasets can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows 
+is `incremental pulls` (to obtain a change stream/log from a table). Hudi tables can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows 
 since a specified instant time. This, together with upserts, are particularly useful for building data pipelines where 1 or more source Hudi tables are incrementally pulled (streams/facts),
-joined with other tables (datasets/dimensions), to [write out deltas](/docs/writing_data.html) to a target Hudi dataset. Incremental view is realized by querying one of the tables above, 
-with special configurations that indicates to query planning that only incremental data needs to be fetched out of the dataset. 
+joined with other tables (tables/dimensions), to [write out deltas](/docs/writing_data.html) to a target Hudi table. Incremental view is realized by querying one of the tables above, 
+with special configurations that indicates to query planning that only incremental data needs to be fetched out of the table. 
 
-In sections, below we will discuss in detail how to access all the 3 views on each query engine.
+In sections, below we will discuss how to access these query types from different query engines.
 
 ## Hive
 
-In order for Hive to recognize Hudi datasets and query correctly, the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` 
+In order for Hive to recognize Hudi tables and query correctly, the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` 
 in its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr). This will ensure the input format 
 classes with its dependencies are available for query planning & execution. 
 
-### Read Optimized table
+### Read optimized querying
 In addition to setup above, for beeline cli access, the `hive.input.format` variable needs to be set to the  fully qualified path name of the 
 inputformat `org.apache.hudi.hadoop.HoodieParquetInputFormat`. For Tez, additionally the `hive.tez.input.format` needs to be set 
 to `org.apache.hadoop.hive.ql.io.HiveInputFormat`
 
-### Real time table
+### Snapshot querying
 In addition to installing the hive bundle jar on the HiveServer2, it needs to be put on the hadoop/hive installation across the cluster, so that
 queries can pick up the custom RecordReader as well.
 
-### Incremental Pulling
-
+### Incremental pulling
 
 Review comment:
   done!
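
For context, the incremental pull described in this hunk can also be issued from Spark rather than Hive. A minimal sketch, assuming the post-rename datasource read option keys (`hoodie.datasource.query.type`, `hoodie.datasource.read.begin.instanttime`) — which may differ in older Hudi versions — and a hypothetical table path:

```scala
// Sketch only: pull ALL and ONLY rows changed after a given instant time.
val basePath     = "/tmp/hudi_trips"     // hypothetical basepath
val beginInstant = "20200121000000"      // hypothetical commit instant to read from (exclusive)

val incrementalDF = spark.read
  .format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", beginInstant)
  .load(basePath)

incrementalDF.createOrReplaceTempView("hudi_trips_incremental")
spark.sql("select _hoodie_commit_time, _hoodie_record_key from hudi_trips_incremental").show()
```

Only records committed after `beginInstant` come back, which is what makes the stream/fact side of the incremental pipelines described above cheap to build.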


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368784217
 
 

 ##########
 File path: docs/_docs/2_2_writing_data.md
 ##########
 @@ -156,41 +157,31 @@ inputDF.write()
 
 ## Syncing to Hive
 
-Both tools above support syncing of the dataset's latest schema to Hive metastore, such that queries can pick up new columns and partitions.
+Both tools above support syncing of the table's latest schema to Hive metastore, such that queries can pick up new columns and partitions.
 In case, its preferable to run this from commandline or in an independent jvm, Hudi provides a `HiveSyncTool`, which can be invoked as below, 
-once you have built the hudi-hive module.
+once you have built the hudi-hive module. Following is how we sync the above Datasource Writer written table to Hive metastore.
+
+```java
+cd hudi-hive
+./run_sync_tool.sh  --jdbc-url jdbc:hive2:\/\/hiveserver:10000 --user hive --pass hive --partitioned-by partition --base-path <basePath> --database default --table <tableName>
+```
+
+Starting with Hudi 0.5.1 version read optimized version of merge-on-read tables are suffixed '_ro' by default. For backwards compatibility with older Hudi versions, 
+an optional HiveSyncConfig - `--skip-ro-suffix`, has been provided to turn off '_ro' suffixing if desired. Explore other hive sync options using the following command:
 
 ```java
 cd hudi-hive
 ./run_sync_tool.sh
  [hudi-hive]$ ./run_sync_tool.sh --help
-Usage: <main class> [options]
-  Options:
-  * --base-path
-       Basepath of Hudi dataset to sync
-  * --database
-       name of the target database in Hive
-    --help, -h
-       Default: false
-  * --jdbc-url
-       Hive jdbc connect url
-  * --use-jdbc
-       Whether to use jdbc connection or hive metastore (via thrift)
-  * --pass
-       Hive password
-  * --table
-       name of the target table in Hive
-  * --user
-       Hive username
 ```
 
 ## Deletes 
 
-Hudi supports implementing two types of deletes on data stored in Hudi datasets, by enabling the user to specify a different record payload implementation. 
+Hudi supports implementing two types of deletes on data stored in Hudi tables, by enabling the user to specify a different record payload implementation. 
 
 Review comment:
   let's link to the delete blog from here? 
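
For context, the payload-based delete mentioned at the end of this hunk amounts to writing the keys of the records to remove with an empty record payload. A minimal sketch through the Spark datasource — assuming a DataFrame `deleteDF` holding those keys and the `EmptyHoodieRecordPayload` class shipped with your Hudi version (its package has moved between releases, so verify the fully qualified name):

```scala
// Sketch only: hard delete by upserting the keys with an empty payload,
// i.e. the "different record payload implementation" mentioned above.
import org.apache.spark.sql.SaveMode

deleteDF.write                                                            // DataFrame holding the keys to delete (assumed)
  .format("org.apache.hudi")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "uuid")              // assumed column
  .option("hoodie.datasource.write.partitionpath.field", "partitionpath") // assumed column
  .option("hoodie.datasource.write.precombine.field", "ts")               // assumed column
  .option("hoodie.datasource.write.payload.class",
          "org.apache.hudi.EmptyHoodieRecordPayload")                     // package differs by release; verify for your version
  .option("hoodie.table.name", "hudi_trips")
  .mode(SaveMode.Append)
  .save("/tmp/hudi_trips")                                                // hypothetical basepath
```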


[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
bhasudha commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r369187185
 
 

 ##########
 File path: docs/_docs/2_2_writing_data.md
 ##########
 @@ -156,41 +157,31 @@ inputDF.write()
 
 ## Syncing to Hive
 
-Both tools above support syncing of the dataset's latest schema to Hive metastore, such that queries can pick up new columns and partitions.
+Both tools above support syncing of the table's latest schema to Hive metastore, such that queries can pick up new columns and partitions.
 In case, its preferable to run this from commandline or in an independent jvm, Hudi provides a `HiveSyncTool`, which can be invoked as below, 
-once you have built the hudi-hive module.
+once you have built the hudi-hive module. Following is how we sync the above Datasource Writer written table to Hive metastore.
+
+```java
+cd hudi-hive
+./run_sync_tool.sh  --jdbc-url jdbc:hive2:\/\/hiveserver:10000 --user hive --pass hive --partitioned-by partition --base-path <basePath> --database default --table <tableName>
+```
+
+Starting with Hudi 0.5.1 version read optimized version of merge-on-read tables are suffixed '_ro' by default. For backwards compatibility with older Hudi versions, 
+an optional HiveSyncConfig - `--skip-ro-suffix`, has been provided to turn off '_ro' suffixing if desired. Explore other hive sync options using the following command:
 
 ```java
 cd hudi-hive
 ./run_sync_tool.sh
  [hudi-hive]$ ./run_sync_tool.sh --help
-Usage: <main class> [options]
-  Options:
-  * --base-path
-       Basepath of Hudi dataset to sync
-  * --database
-       name of the target database in Hive
-    --help, -h
-       Default: false
-  * --jdbc-url
-       Hive jdbc connect url
-  * --use-jdbc
-       Whether to use jdbc connection or hive metastore (via thrift)
-  * --pass
-       Hive password
-  * --table
-       name of the target table in Hive
-  * --user
-       Hive username
 ```
 
 ## Deletes 
 
-Hudi supports implementing two types of deletes on data stored in Hudi datasets, by enabling the user to specify a different record payload implementation. 
+Hudi supports implementing two types of deletes on data stored in Hudi tables, by enabling the user to specify a different record payload implementation. 
 
 Review comment:
   will do!


[GitHub] [incubator-hudi] bhasudha commented on issue #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
bhasudha commented on issue #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#issuecomment-576481756
 
 
   @leesf / @yanghua can you please help review this PR. Also, this might be needed in the corresponding cn pages too. Need your help there as well. Thanks!


[GitHub] [incubator-hudi] bhasudha commented on issue #1260: [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
bhasudha commented on issue #1260: [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#issuecomment-576859029
 
 
   Merging this. Will send a separate PR for other doc related changes.


[GitHub] [incubator-hudi] bhasudha merged pull request #1260: [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
bhasudha merged pull request #1260: [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260
 
 
   


[GitHub] [incubator-hudi] bhasudha commented on issue #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
bhasudha commented on issue #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#issuecomment-576480047
 
 
   @vinothchandar I kept it as WIP as I am still working on other changes such as scala version, quickstart fix etc. But wanted to send out a PR so the renaming part can be reviewed.


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368783925
 
 

 ##########
 File path: docs/_docs/2_1_concepts.md
 ##########
 @@ -1,37 +1,37 @@
 ---
 title: "Concepts"
-keywords: hudi, design, storage, views, timeline
+keywords: hudi, design, table, queries, timeline
 permalink: /docs/concepts.html
 summary: "Here we introduce some basic concepts & give a broad technical overview of Hudi"
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Apache Hudi (pronounced “Hudi”) provides the following streaming primitives over datasets on DFS
+Apache Hudi (pronounced “Hudi”) provides the following streaming primitives over hadoop compatible storages
 
- * Upsert                     (how do I change the dataset?)
- * Incremental pull           (how do I fetch data that changed?)
+ * Update/Delete Records      (how do I change records in a table?)
+ * Change Streams             (how do I fetch data that changed?)
 
 Review comment:
   how do I fetch `records` that changed ? 


[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368818803
 
 

 ##########
 File path: docs/_docs/1_3_use_cases.md
 ##########
 @@ -20,7 +20,7 @@ or [complicated handcrafted merge workflows](http://hortonworks.com/blog/four-st
 For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / [Voldemort](http://www.project-voldemort.com/voldemort/) / [HBase](https://hbase.apache.org/), even moderately big installations store billions of rows.
 It goes without saying that __full bulk loads are simply infeasible__ and more efficient approaches are needed if ingestion is to keep up with the typically high update volumes.
 
-Even for immutable data sources like [Kafka](kafka.apache.org) , Hudi helps __enforces a minimum file size on HDFS__, which improves NameNode health by solving one of the [age old problems in Hadoop land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a holistic way. This is all the more important for event streams, since typically its higher volume (eg: click streams) and if not managed well, can cause serious damage to your Hadoop cluster.
+Even for immutable data sources like [Kafka](http://kafka.apache.org) , Hudi helps __enforces a minimum file size on HDFS__, which improves NameNode health by solving one of the [age old problems in Hadoop land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a holistic way. This is all the more important for event streams, since typically its higher volume (eg: click streams) and if not managed well, can cause serious damage to your Hadoop cluster.
 
 Review comment:
   Good catch, `https://kafka.apache.org` is better.


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368784404
 
 

 ##########
 File path: docs/_docs/2_3_querying_data.md
 ##########
 @@ -1,47 +1,52 @@
 ---
-title: Querying Hudi Datasets
+title: Querying Hudi Tables
 keywords: hudi, hive, spark, sql, presto
 permalink: /docs/querying_data.html
 summary: In this page, we go over how to enable SQL queries on Hudi built tables.
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Conceptually, Hudi stores data physically once on DFS, while providing 3 logical views on top, as explained [before](/docs/concepts.html#views). 
-Once the dataset is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the dataset can be queried by popular query engines like Hive, Spark and Presto.
+Conceptually, Hudi stores data physically once on DFS, while providing 3 different ways of querying, as explained [before](/docs/concepts.html#query-types). 
+Once the table is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
+bundle has been provided, the table can be queried by popular query engines like Hive, Spark and Presto.
 
-Specifically, there are two Hive tables named off [table name](/docs/configurations.html#TABLE_NAME_OPT_KEY) passed during write. 
-For e.g, if `table name = hudi_tbl`, then we get  
+Specifically, following Hive tables are registered based off [table name](/docs/configurations.html#TABLE_NAME_OPT_KEY) 
+and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) passed during write.   
 
- - `hudi_tbl` realizes the read optimized view of the dataset backed by `HoodieParquetInputFormat`, exposing purely columnar data.
- - `hudi_tbl_rt` realizes the real time view of the dataset  backed by `HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+If `table name = hudi_trips` and `table type = COPY_ON_WRITE`, then we get: 
+ - `hudi_trips` supports snapshot querying and incremental querying of the table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+
+
+If `table name = hudi_trips` and `table type = MERGE_ON_READ`, then we get:
+ - `hudi_trips_rt` supports snapshot querying and incremental querying (providing near-real time data) of the table  backed by `HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+ - `hudi_trips_ro` supports read optimized querying of the table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+ 
 
 As discussed in the concepts section, the one key primitive needed for [incrementally processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
-is `incremental pulls` (to obtain a change stream/log from a dataset). Hudi datasets can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows 
+is `incremental pulls` (to obtain a change stream/log from a table). Hudi tables can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows 
 since a specified instant time. This, together with upserts, are particularly useful for building data pipelines where 1 or more source Hudi tables are incrementally pulled (streams/facts),
-joined with other tables (datasets/dimensions), to [write out deltas](/docs/writing_data.html) to a target Hudi dataset. Incremental view is realized by querying one of the tables above, 
-with special configurations that indicates to query planning that only incremental data needs to be fetched out of the dataset. 
+joined with other tables (tables/dimensions), to [write out deltas](/docs/writing_data.html) to a target Hudi table. Incremental view is realized by querying one of the tables above, 
+with special configurations that indicates to query planning that only incremental data needs to be fetched out of the table. 
 
-In sections, below we will discuss in detail how to access all the 3 views on each query engine.
+In sections, below we will discuss how to access these query types from different query engines.
 
 ## Hive
 
-In order for Hive to recognize Hudi datasets and query correctly, the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` 
+In order for Hive to recognize Hudi tables and query correctly, the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` 
 in its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr). This will ensure the input format 
 classes with its dependencies are available for query planning & execution. 
 
-### Read Optimized table
+### Read optimized querying
 In addition to setup above, for beeline cli access, the `hive.input.format` variable needs to be set to the  fully qualified path name of the 
 inputformat `org.apache.hudi.hadoop.HoodieParquetInputFormat`. For Tez, additionally the `hive.tez.input.format` needs to be set 
 to `org.apache.hadoop.hive.ql.io.HiveInputFormat`
 
-### Real time table
+### Snapshot querying
 In addition to installing the hive bundle jar on the HiveServer2, it needs to be put on the hadoop/hive installation across the cluster, so that
 queries can pick up the custom RecordReader as well.
 
-### Incremental Pulling
-
+### Incremental pulling
 
 Review comment:
   Incremental Query instead of Incremental Pull?  (again Query instead of Querying)


[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
bhasudha commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r369170642
 
 

 ##########
 File path: docs/_docs/1_2_structure.md
 ##########
 @@ -6,16 +6,16 @@ summary: "Hudi brings stream processing to big data, providing fresh data while
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical datasets over DFS ([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) or cloud stores) and provides three logical views for query access.
+Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical tables over DFS ([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) or cloud stores) and provides three types of querying.
 
- * **Read Optimized View** - Provides excellent query performance on pure columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
- * **Incremental View** - Provides a change stream out of the dataset to feed downstream jobs/ETLs.
- * **Near-Real time Table** - Provides queries on real-time data, using a combination of columnar & row based storage (e.g Parquet + [Avro](http://avro.apache.org/docs/current/mr.html))
+ * **Read Optimized querying** - Provides excellent query performance on pure columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
 
 Review comment:
   sure
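
For context, the difference between these query types is easiest to see from the read side. A minimal sketch of querying the same MERGE_ON_READ table in read-optimized and snapshot mode via the Spark datasource — assuming the post-rename query-type option key and values (`read_optimized`, `snapshot`), which may differ in older Hudi versions, and a hypothetical base path:

```scala
// Sketch only: the same MERGE_ON_READ table, queried two ways.
val basePath = "/tmp/hudi_trips"                 // hypothetical basepath

// Read optimized: only the columnar base files, best raw query performance.
val roDF = spark.read
  .format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load(basePath + "/*/*")                       // glob depth depends on partition layout (one level assumed)

// Snapshot (near real-time): base files merged with the latest log files.
val rtDF = spark.read
  .format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "snapshot")
  .load(basePath + "/*/*")
```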


[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
bhasudha commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r369170965
 
 

 ##########
 File path: docs/_docs/2_1_concepts.md
 ##########
 @@ -1,37 +1,37 @@
 ---
 title: "Concepts"
-keywords: hudi, design, storage, views, timeline
+keywords: hudi, design, table, queries, timeline
 permalink: /docs/concepts.html
 summary: "Here we introduce some basic concepts & give a broad technical overview of Hudi"
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Apache Hudi (pronounced “Hudi”) provides the following streaming primitives over datasets on DFS
+Apache Hudi (pronounced “Hudi”) provides the following streaming primitives over hadoop compatible storages
 
- * Upsert                     (how do I change the dataset?)
- * Incremental pull           (how do I fetch data that changed?)
+ * Update/Delete Records      (how do I change records in a table?)
+ * Change Streams             (how do I fetch data that changed?)
 
 Review comment:
   yes sure


[GitHub] [incubator-hudi] leesf commented on issue #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
leesf commented on issue #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#issuecomment-576524810
 
 
   > @leesf / @yanghua can you please help review this PR. Also, this might be needed in the corresponding cn pages too. Need your help there as well. Thanks!
   
  Hi @bhasudha. Please go ahead; we will make a follow-up PR for the cn pages.


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368783683
 
 

 ##########
 File path: docs/_docs/1_2_structure.md
 ##########
 @@ -6,16 +6,16 @@ summary: "Hudi brings stream processing to big data, providing fresh data while
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical datasets over DFS ([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) or cloud stores) and provides three logical views for query access.
+Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical tables over DFS ([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) or cloud stores) and provides three types of querying.
 
- * **Read Optimized View** - Provides excellent query performance on pure columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
- * **Incremental View** - Provides a change stream out of the dataset to feed downstream jobs/ETLs.
- * **Near-Real time Table** - Provides queries on real-time data, using a combination of columnar & row based storage (e.g Parquet + [Avro](http://avro.apache.org/docs/current/mr.html))
+ * **Read Optimized querying** - Provides excellent query performance on pure columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
 
 Review comment:
   just `Query` and not `querying`? 
