Posted to commits@hudi.apache.org by "nfarah86 (via GitHub)" <gi...@apache.org> on 2023/02/16 06:18:15 UTC

[GitHub] [hudi] nfarah86 commented on a diff in pull request #7965: [DOCS] Merge query engine setup and querying data docs

nfarah86 commented on code in PR #7965:
URL: https://github.com/apache/hudi/pull/7965#discussion_r1108034298


##########
website/docs/querying_data.md:
##########
@@ -17,7 +17,11 @@ In sections, below we will discuss specific setup to access different query type
 The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple `spark.read.parquet`.
 See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark datasource reading queries. 
 
-To setup Spark for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#Spark-DataSource) page.
+**Setup**
+
+If your Spark environment does not have the Hudi jars installed, add `hudi-spark-bundle_2.11-<hudi.version>.jar` to the

Review Comment:
   can we add a link where they can find the versions?
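
   For the docs, a minimal read sketch could sit right under that setup note. This is a hedged illustration, not text from the PR: the base path, table, and column names are hypothetical; the `hudi` format name and the `spark.read` datasource API are standard.

   ```scala
   // Launch spark-shell with the bundle on the classpath, e.g.:
   //   spark-shell --jars hudi-spark-bundle_2.11-<hudi.version>.jar

   // Snapshot query of a Hudi table via the Spark datasource.
   val basePath = "s3://my-bucket/hudi/trips"   // hypothetical path
   val tripsDF = spark.read
     .format("hudi")
     .load(basePath)

   tripsDF.createOrReplaceTempView("trips")
   spark.sql("SELECT uuid, rider, fare FROM trips WHERE fare > 20.0").show()
   ```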



##########
website/docs/querying_data.md:
##########
@@ -205,7 +209,19 @@ And for these use cases you should test the stability first.
 | `hoodie.metadata.index.column.stats.column.list` | `false` | N/A | Columns(separated by comma) to collect the column statistics  |
 
 ## Hive
-To setup Hive for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#hive) page.
+
+In order for Hive to recognize Hudi tables and query correctly,
+
+- the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-<hudi.version>.jar` in
+  its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr)
+  . This will ensure the input format classes with its dependencies are available for query planning & execution.

Review Comment:
   its => with their



##########
website/docs/querying_data.md:
##########
@@ -246,10 +262,87 @@ would ensure Map Reduce execution is chosen for a Hive query, which combines par
 separated) and calls InputFormat.listStatus() only once with all those partitions.
 
 ## PrestoDB
-To setup PrestoDB for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#prestodb) page.
+
+PrestoDB is a popular query engine, providing interactive query performance. One can use both Hive or Hudi connector (
+Presto version 0.275 onwards) for querying Hudi tables. Both connectors currently support snapshot querying on
+COPY_ON_WRITE tables, and snapshot and read optimized queries on MERGE_ON_READ Hudi tables.
+
+Since PrestoDB-Hudi integration has evolved over time, the installation instructions for PrestoDB would vary based on
+versions. Please check the below table for query types supported and installation instructions for different versions of
+PrestoDB.
+
+| **PrestoDB Version** | **Installation description** | **Query types supported** |
+|----------------------|------------------------------|---------------------------|
+| < 0.233              | Requires the `hudi-presto-bundle` jar to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
+| > = 0.233             | No action needed. Hudi (0.5.1-incubating) is a compile time dependency. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
+| > = 0.240             | No action needed. Hudi 0.5.3 version is a compile time dependency. | Snapshot querying on both COW and MOR tables. |
+| > = 0.268             | No action needed. Hudi 0.9.0 version is a compile time dependency. | Snapshot querying on bootstrap tables. |
+| > = 0.272             | No action needed. Hudi 0.10.1 version is a compile time dependency. | File listing optimizations. Improved query performance. |
+| > = 0.275             | No action needed. Hudi 0.11.0 version is a compile time dependency. | All of the above. Native Hudi connector that is on par with Hive connector. |
+
+To learn more about the usage of Hudi connector, please
+checkout [prestodb documentation](https://prestodb.io/docs/current/connector/hudi.html).
+
+:::note Incremental queries and point in time queries are not supported either through the Hive connector or Hudi
+connector. However, it is in our roadmap and you can track the development

Review Comment:
   , and you
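
   To make the connector usage concrete, the doc could also include a short query sketch like the one below. The catalog, schema, and table names are hypothetical; the syntax is standard Presto `catalog.schema.table` addressing.

   ```sql
   -- 'hudi' is a hypothetical catalog configured for the Presto Hudi connector (0.275+).
   SHOW TABLES FROM hudi.default;

   -- Snapshot query on a COW table; read optimized queries on MOR tables use the same syntax.
   SELECT uuid, rider, fare
   FROM hudi.default.trips
   WHERE fare > 20.0;
   ```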



##########
website/docs/querying_data.md:
##########
@@ -205,7 +209,19 @@ And for these use cases you should test the stability first.
 | `hoodie.metadata.index.column.stats.column.list` | `false` | N/A | Columns(separated by comma) to collect the column statistics  |
 
 ## Hive
-To setup Hive for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#hive) page.
+
+In order for Hive to recognize Hudi tables and query correctly,
+
+- the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-<hudi.version>.jar` in
+  its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr)
+  . This will ensure the input format classes with its dependencies are available for query planning & execution.
+- For MERGE_ON_READ tables, additionally the bundle needs to be put on the hadoop/hive installation across the cluster,
+  so that queries can pick up the custom RecordReader as well.
+
+In addition to setup above, for beeline cli access, the `hive.input.format` variable needs to be set to the fully
+qualified path name of the inputformat `org.apache.hudi.hadoop.HoodieParquetInputFormat`. For Tez, additionally
+the `hive.tez.input.format` needs to be set to `org.apache.hadoop.hive.ql.io.HiveInputFormat`. Then proceed to query the

Review Comment:
   ,additionally,
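
   A short beeline snippet could make those settings concrete. The two input format values are the ones named in the paragraph above; the table name is hypothetical.

   ```sql
   -- In a beeline session, before querying the Hudi table:
   SET hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat;
   -- Additionally, when running on Tez:
   SET hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

   -- 'trips' is a hypothetical Hive table synced from a Hudi dataset.
   SELECT COUNT(*) FROM trips;
   ```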



##########
website/docs/querying_data.md:
##########
@@ -246,10 +262,87 @@ would ensure Map Reduce execution is chosen for a Hive query, which combines par
 separated) and calls InputFormat.listStatus() only once with all those partitions.
 
 ## PrestoDB
-To setup PrestoDB for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#prestodb) page.
+
+PrestoDB is a popular query engine, providing interactive query performance. One can use both Hive or Hudi connector (

Review Comment:
   One can use either the Hive or the Hudi connector



##########
website/docs/querying_data.md:
##########
@@ -205,7 +209,19 @@ And for these use cases you should test the stability first.
 | `hoodie.metadata.index.column.stats.column.list` | `false` | N/A | Columns(separated by comma) to collect the column statistics  |
 
 ## Hive
-To setup Hive for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#hive) page.
+
+In order for Hive to recognize Hudi tables and query correctly,
+
+- the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-<hudi.version>.jar` in
+  its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr)
+  . This will ensure the input format classes with its dependencies are available for query planning & execution.
+- For MERGE_ON_READ tables, additionally the bundle needs to be put on the hadoop/hive installation across the cluster,

Review Comment:
   ,additionally, 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
