Posted to commits@hudi.apache.org by "codope (via GitHub)" <gi...@apache.org> on 2023/02/15 12:57:29 UTC

[GitHub] [hudi] codope opened a new pull request, #7965: Merge query engine setup and querying data docs

codope opened a new pull request, #7965:
URL: https://github.com/apache/hudi/pull/7965

   ### Change Logs
   
   * Merge query engine setup docs into querying data docs.
   * Add ClickHouse to the list of supported query engines.
   * Update support matrix.
   
   ### Impact
   
   Public docs change.
   
   ### Risk level (write none, low, medium or high below)
   
   low
   
   ### Documentation Update
   
   As stated above. Pages affected:
   https://hudi.apache.org/docs/querying_data
   https://hudi.apache.org/docs/query_engine_setup
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nfarah86 commented on a diff in pull request #7965: [DOCS] Merge query engine setup and querying data docs

Posted by "nfarah86 (via GitHub)" <gi...@apache.org>.
nfarah86 commented on code in PR #7965:
URL: https://github.com/apache/hudi/pull/7965#discussion_r1108034298


##########
website/docs/querying_data.md:
##########
@@ -17,7 +17,11 @@ In sections, below we will discuss specific setup to access different query type
 The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`.
 See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries. 
 
-To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#Spark-DataSource) page.
+**Setup**
+
+If your Spark environment does not have the Hudi jars installed, add `hudi-spark-bundle_2.11-<hudi.version>.jar` to the

Review Comment:
   can we add a link where they can find the versions?
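
(Illustration, not part of the PR diff: a minimal sketch of the Spark setup the hunk above describes, end to end. The bundle coordinates, base path and table name below are assumptions chosen for the example; Hudi bundles are published under org.apache.hudi on Maven Central.)

```scala
// Launch spark-shell with a Hudi bundle, e.g. (example coordinates only):
//   spark-shell --packages org.apache.hudi:hudi-spark-bundle_2.11:0.10.1 \
//     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-snapshot-read")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// Snapshot query through the Spark datasource: point the reader at the table base path.
val df = spark.read
  .format("hudi")
  .load("/data/hudi_trips")   // hypothetical base path

df.createOrReplaceTempView("hudi_trips_snapshot")
spark.sql("SELECT * FROM hudi_trips_snapshot LIMIT 10").show(false)
```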



##########
website/docs/querying_data.md:
##########
@@ -205,7 +209,19 @@ And for these use cases you should test the stability first.
 | `hoodie.metadata.index.column.stats.column.list` | `false` | N/A | Columns(separated by comma) to collect the column statistics  |
 
 ## Hive
-To setup Hive for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#hive) page.
+
+In order for Hive to recognize Hudi tables and query correctly,
+
+- the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-<hudi.version>.jar` in
+  its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr)
+  . This will ensure the input format classes with its dependencies are available for query planning & execution.

Review Comment:
   its =>  with their 



##########
website/docs/querying_data.md:
##########
@@ -246,10 +262,87 @@ would ensure Map Reduce execution is chosen for a Hive query, which combines par
 separated) and calls InputFormat.listStatus() only once with all those partitions.
 
 ## PrestoDB
-To setup PrestoDB for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#prestodb) page.
+
+PrestoDB is a popular query engine, providing interactive query performance. One can use both Hive or Hudi connector (
+Presto version 0.275 onwards) for querying Hudi tables. Both connectors currently support snapshot querying on
+COPY_ON_WRITE tables, and snapshot and read optimized queries on MERGE_ON_READ Hudi tables.
+
+Since PrestoDB-Hudi integration has evolved over time, the installation instructions for PrestoDB would vary based on
+versions. Please check the below table for query types supported and installation instructions for different versions of
+PrestoDB.
+
+| **PrestoDB Version** | **Installation description** | **Query types supported** |
+|----------------------|------------------------------|---------------------------|
+| < 0.233              | Requires the `hudi-presto-bundle` jar to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
+| > = 0.233             | No action needed. Hudi (0.5.1-incubating) is a compile time dependency. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
+| > = 0.240             | No action needed. Hudi 0.5.3 version is a compile time dependency. | Snapshot querying on both COW and MOR tables. |
+| > = 0.268             | No action needed. Hudi 0.9.0 version is a compile time dependency. | Snapshot querying on bootstrap tables. |
+| > = 0.272             | No action needed. Hudi 0.10.1 version is a compile time dependency. | File listing optimizations. Improved query performance. |
+| > = 0.275             | No action needed. Hudi 0.11.0 version is a compile time dependency. | All of the above. Native Hudi connector that is on par with Hive connector. |
+
+To learn more about the usage of Hudi connector, please
+checkout [prestodb documentation](https://prestodb.io/docs/current/connector/hudi.html).
+
+:::note Incremental queries and point in time queries are not supported either through the Hive connector or Hudi
+connector. However, it is in our roadmap and you can track the development

Review Comment:
   , and you
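
(Illustration, not part of the PR diff: a snapshot query against PrestoDB over its JDBC driver, matching the support matrix above. Assumes PrestoDB >= 0.275 with the native Hudi connector configured as a catalog named "hudi" (an etc/catalog/hudi.properties pointing at the Hive metastore) and com.facebook.presto:presto-jdbc on the classpath; catalog, schema and table names are example values.)

```scala
import java.sql.DriverManager
import java.util.Properties

val props = new Properties()
props.setProperty("user", "hudi_reader")   // Presto requires a user on every connection

// URL layout: jdbc:presto://<coordinator>:<port>/<catalog>/<schema>
val conn = DriverManager.getConnection("jdbc:presto://localhost:8080/hudi/default", props)
val stmt = conn.createStatement()

// Snapshot query; incremental and point-in-time queries are not supported through
// either connector yet, per the note above (HUDI-3210).
val rs = stmt.executeQuery("SELECT * FROM hudi_trips LIMIT 10")
while (rs.next()) {
  println(rs.getString(1))
}
conn.close()
```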



##########
website/docs/querying_data.md:
##########
@@ -205,7 +209,19 @@ And for these use cases you should test the stability first.
 | `hoodie.metadata.index.column.stats.column.list` | `false` | N/A | Columns(separated by comma) to collect the column statistics  |
 
 ## Hive
-To setup Hive for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#hive) page.
+
+In order for Hive to recognize Hudi tables and query correctly,
+
+- the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-<hudi.version>.jar` in
+  its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr)
+  . This will ensure the input format classes with its dependencies are available for query planning & execution.
+- For MERGE_ON_READ tables, additionally the bundle needs to be put on the hadoop/hive installation across the cluster,
+  so that queries can pick up the custom RecordReader as well.
+
+In addition to setup above, for beeline cli access, the `hive.input.format` variable needs to be set to the fully
+qualified path name of the inputformat `org.apache.hudi.hadoop.HoodieParquetInputFormat`. For Tez, additionally
+the `hive.tez.input.format` needs to be set to `org.apache.hadoop.hive.ql.io.HiveInputFormat`. Then proceed to query the

Review Comment:
   , additionally,
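
(Illustration, not part of the PR diff: the session-level settings described in the hunk above, issued over Hive JDBC rather than the beeline CLI. Assumes a HiveServer2 at localhost:10000 whose aux jars path already carries hudi-hadoop-mr-bundle, a Hive-synced Hudi table named hudi_trips, and the hive-jdbc driver on the classpath; all of these names are example values.)

```scala
import java.sql.DriverManager

// The driver is usually auto-registered; the explicit load is kept for older hive-jdbc versions.
Class.forName("org.apache.hive.jdbc.HiveDriver")

val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "")
val stmt = conn.createStatement()

// Fully qualified input format, as the doc text above requires for beeline/JDBC access.
stmt.execute("SET hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat")
// Additionally required when the execution engine is Tez.
stmt.execute("SET hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat")

val rs = stmt.executeQuery("SELECT * FROM hudi_trips LIMIT 10")
while (rs.next()) {
  println(rs.getString(1))
}
conn.close()
```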



##########
website/docs/querying_data.md:
##########
@@ -246,10 +262,87 @@ would ensure Map Reduce execution is chosen for a Hive query, which combines par
 separated) and calls InputFormat.listStatus() only once with all those partitions.
 
 ## PrestoDB
-To setup PrestoDB for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#prestodb) page.
+
+PrestoDB is a popular query engine, providing interactive query performance. One can use both Hive or Hudi connector (

Review Comment:
   One can use both the Hive or Hudi connector



##########
website/docs/querying_data.md:
##########
@@ -205,7 +209,19 @@ And for these use cases you should test the stability first.
 | `hoodie.metadata.index.column.stats.column.list` | `false` | N/A | Columns(separated by comma) to collect the column statistics  |
 
 ## Hive
-To setup Hive for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#hive) page.
+
+In order for Hive to recognize Hudi tables and query correctly,
+
+- the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-<hudi.version>.jar` in
+  its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr)
+  . This will ensure the input format classes with its dependencies are available for query planning & execution.
+- For MERGE_ON_READ tables, additionally the bundle needs to be put on the hadoop/hive installation across the cluster,

Review Comment:
   ,additionally, 





[GitHub] [hudi] codope commented on pull request #7965: [DOCS] Merge query engine setup and querying data docs

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on PR #7965:
URL: https://github.com/apache/hudi/pull/7965#issuecomment-1434553921

   @nfarah86 Thanks for reviewing the content. I have addressed your feedback.




[GitHub] [hudi] bhasudha commented on a diff in pull request #7965: [DOCS] Merge query engine setup and querying data docs

Posted by "bhasudha (via GitHub)" <gi...@apache.org>.
bhasudha commented on code in PR #7965:
URL: https://github.com/apache/hudi/pull/7965#discussion_r1112639624


##########
website/docs/querying_data.md:
##########
@@ -246,10 +262,87 @@ would ensure Map Reduce execution is chosen for a Hive query, which combines par
 separated) and calls InputFormat.listStatus() only once with all those partitions.
 
 ## PrestoDB
-To setup PrestoDB for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#prestodb) page.
+
+PrestoDB is a popular query engine, providing interactive query performance. One can use both the Hive or Hudi connector (
+Presto version 0.275 onwards) for querying Hudi tables. Both connectors currently support snapshot querying on
+COPY_ON_WRITE tables, and snapshot and read optimized queries on MERGE_ON_READ Hudi tables.
+
+Since PrestoDB-Hudi integration has evolved over time, the installation instructions for PrestoDB would vary based on
+versions. Please check the below table for query types supported and installation instructions for different versions of
+PrestoDB.
+
+| **PrestoDB Version** | **Installation description** | **Query types supported** |
+|----------------------|------------------------------|---------------------------|
+| < 0.233              | Requires the `hudi-presto-bundle` jar to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
+| > = 0.233             | No action needed. Hudi (0.5.1-incubating) is a compile time dependency. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
+| > = 0.240             | No action needed. Hudi 0.5.3 version is a compile time dependency. | Snapshot querying on both COW and MOR tables. |
+| > = 0.268             | No action needed. Hudi 0.9.0 version is a compile time dependency. | Snapshot querying on bootstrap tables. |
+| > = 0.272             | No action needed. Hudi 0.10.1 version is a compile time dependency. | File listing optimizations. Improved query performance. |
+| > = 0.275             | No action needed. Hudi 0.11.0 version is a compile time dependency. | All of the above. Native Hudi connector that is on par with Hive connector. |
+
+To learn more about the usage of Hudi connector, please
+checkout [prestodb documentation](https://prestodb.io/docs/current/connector/hudi.html).
+
+:::note Incremental queries and point in time queries are not supported either through the Hive connector or Hudi

Review Comment:
   Styling comment. Move "Incremental queries ..." to a new line. Otherwise, the whole first line appears bold, which is inconsistent with the other notes on the same page.





[GitHub] [hudi] codope commented on a diff in pull request #7965: [DOCS] Merge query engine setup and querying data docs

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on code in PR #7965:
URL: https://github.com/apache/hudi/pull/7965#discussion_r1112970944


##########
website/docs/querying_data.md:
##########
@@ -283,14 +283,10 @@ PrestoDB.
 To learn more about the usage of Hudi connector, please
 checkout [prestodb documentation](https://prestodb.io/docs/current/connector/hudi.html).
 
-:::note Incremental queries and point in time queries are not supported either through the Hive connector or Hudi
+:::note 
+Incremental queries and point in time queries are not supported either through the Hive connector or Hudi
 connector. However, it is in our roadmap, and you can track the development
 under [HUDI-3210](https://issues.apache.org/jira/browse/HUDI-3210).
-
-There is a known issue ([HUDI-4290](https://issues.apache.org/jira/browse/HUDI-4290)) for a clustered Hudi table. Presto
-query using version 0.272 or later may contain duplicates in results if clustering is enabled. This issue has been fixed
-in Hudi version 0.12.0 and we need to upgrade `hudi-presto-bundle`
-in presto to version 0.12.0. It is tracked in [HUDI-4605](https://issues.apache.org/jira/browse/HUDI-4605).

Review Comment:
   This issue is fixed in Presto version 0.278.





[GitHub] [hudi] codope commented on a diff in pull request #7965: [DOCS] Merge query engine setup and querying data docs

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on code in PR #7965:
URL: https://github.com/apache/hudi/pull/7965#discussion_r1112970167


##########
website/docs/querying_data.md:
##########
@@ -246,10 +262,87 @@ would ensure Map Reduce execution is chosen for a Hive query, which combines par
 separated) and calls InputFormat.listStatus() only once with all those partitions.
 
 ## PrestoDB
-To setup PrestoDB for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#prestodb) page.
+
+PrestoDB is a popular query engine, providing interactive query performance. One can use both the Hive or Hudi connector (
+Presto version 0.275 onwards) for querying Hudi tables. Both connectors currently support snapshot querying on
+COPY_ON_WRITE tables, and snapshot and read optimized queries on MERGE_ON_READ Hudi tables.
+
+Since PrestoDB-Hudi integration has evolved over time, the installation instructions for PrestoDB would vary based on
+versions. Please check the below table for query types supported and installation instructions for different versions of
+PrestoDB.
+
+| **PrestoDB Version** | **Installation description** | **Query types supported** |
+|----------------------|------------------------------|---------------------------|
+| < 0.233              | Requires the `hudi-presto-bundle` jar to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
+| > = 0.233             | No action needed. Hudi (0.5.1-incubating) is a compile time dependency. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
+| > = 0.240             | No action needed. Hudi 0.5.3 version is a compile time dependency. | Snapshot querying on both COW and MOR tables. |
+| > = 0.268             | No action needed. Hudi 0.9.0 version is a compile time dependency. | Snapshot querying on bootstrap tables. |
+| > = 0.272             | No action needed. Hudi 0.10.1 version is a compile time dependency. | File listing optimizations. Improved query performance. |
+| > = 0.275             | No action needed. Hudi 0.11.0 version is a compile time dependency. | All of the above. Native Hudi connector that is on par with Hive connector. |
+
+To learn more about the usage of Hudi connector, please
+checkout [prestodb documentation](https://prestodb.io/docs/current/connector/hudi.html).
+
+:::note Incremental queries and point in time queries are not supported either through the Hive connector or Hudi

Review Comment:
   Done.





[GitHub] [hudi] codope merged pull request #7965: [DOCS] Merge query engine setup and querying data docs

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope merged PR #7965:
URL: https://github.com/apache/hudi/pull/7965

