Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/07 00:23:21 UTC

[GitHub] [hudi] alexeykudinkin opened a new pull request, #5244: [WIP][HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

alexeykudinkin opened a new pull request, #5244:
URL: https://github.com/apache/hudi/pull/5244

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   Addresses the problem of Data Skipping not respecting Metadata Table configs, which might differ between the write and read paths. More details can be found in HUDI-3812.
   
   ## Brief change log
   
    - Fixing Data Skipping configuration to respect Metadata Table (MT) configs on the read path
    - Tightening up Data Skipping handling of cases where the target query references no top-level columns
    - Enhancing tests to cover all possible cases
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   This change added tests and can be verified as follows:
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1093374069

   ## CI report:
   
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   * 23a10d80255247f32c77cade8b15d9a8711f7ee1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7907) 
   * 6097520274d65ff9a9c2734a25eaf253ef78529d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7925) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5244: [WIP][HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5244:
URL: https://github.com/apache/hudi/pull/5244#discussion_r845488181


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
    * @return list of pruned (data-skipped) candidate base-files' names
    */
   private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)
-      .contains(HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS)) {
+    // NOTE: Data Skipping is only effective when it references columns that are indexed w/in
+    //       the Column Stats Index (CSI). Following cases could not be effectively handled by Data Skipping:
+    //          - Expressions on top-level column's fields (ie, for ex filters like "struct.field > 0", since
+    //          CSI only contains stats for top-level columns, in this case for "struct")
+    //          - Any expression not directly referencing top-level column (for ex, sub-queries, since there's
+    //          nothing CSI in particular could be applied for)
+    lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+    if (!isMetadataTableEnabled || !isColumnStatsIndexEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {

Review Comment:
   As discussed offline: the Metadata Table (MT) might be enabled on the write path, and the Column Stats index therefore available, but since we deliberately split the configs between the write and read paths, we have to check whether they are also enabled on the read path.
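
To make the condition in the hunk above concrete, here is a minimal, self-contained sketch of the same guard. `ReadPathFlags` and `DataSkippingGuard` are hypothetical names used for illustration only, not Hudi classes; the four fields simply mirror the flags checked in the diff, assuming each is resolved from the reader's own configuration rather than the writer's.

```scala
// Hypothetical, simplified model of the read-path guard; not Hudi code.
final case class ReadPathFlags(
  metadataTableEnabled: Boolean,      // reader-side `hoodie.metadata.enable`
  columnStatsIndexEnabled: Boolean,   // reader-side `hoodie.metadata.index.column.stats.enable`
  columnStatsIndexAvailable: Boolean, // col_stats partition is ready to read in the metadata table
  dataSkippingEnabled: Boolean        // reader requested Data Skipping
)

object DataSkippingGuard {
  // Data Skipping is attempted only when every flag holds; this is the
  // negation of the `if (!a || !b || !c || !d)` early exit in the diff.
  def canUseColumnStats(f: ReadPathFlags): Boolean =
    f.metadataTableEnabled &&
      f.columnStatsIndexEnabled &&
      f.columnStatsIndexAvailable &&
      f.dataSkippingEnabled
}
```

In this model, a table whose writer built the Column Stats index but whose reader left the Metadata Table disabled would fall back to reading without Data Skipping.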



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
    * @return list of pruned (data-skipped) candidate base-files' names
    */
   private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)

Review Comment:
   Please check my comment above





[GitHub] [hudi] yihua commented on a diff in pull request #5244: [WIP][HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5244:
URL: https://github.com/apache/hudi/pull/5244#discussion_r845618339


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
    * @return list of pruned (data-skipped) candidate base-files' names
    */
   private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)
-      .contains(HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS)) {
+    // NOTE: Data Skipping is only effective when it references columns that are indexed w/in
+    //       the Column Stats Index (CSI). Following cases could not be effectively handled by Data Skipping:
+    //          - Expressions on top-level column's fields (ie, for ex filters like "struct.field > 0", since
+    //          CSI only contains stats for top-level columns, in this case for "struct")
+    //          - Any expression not directly referencing top-level column (for ex, sub-queries, since there's
+    //          nothing CSI in particular could be applied for)
+    lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+    if (!isMetadataTableEnabled || !isColumnStatsIndexEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {

Review Comment:
   Yes.  Per discussion, `hoodie.metadata.enable` is still needed to make sure the right API for fetching column stats is called, which prevents any exception.  `hoodie.metadata.index.column.stats.enable` might not be needed.  We need to revisit the abstraction and configs for reading the metadata table as a whole in a separate effort.



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
    * @return list of pruned (data-skipped) candidate base-files' names
    */
   private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)

Review Comment:
   Synced up offline.  The concern is resolved.  See the comment below.





[GitHub] [hudi] nsivabalan commented on a diff in pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #5244:
URL: https://github.com/apache/hudi/pull/5244#discussion_r846676319


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
    * @return list of pruned (data-skipped) candidate base-files' names
    */
   private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)
-      .contains(HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS)) {
+    // NOTE: Data Skipping is only effective when it references columns that are indexed w/in
+    //       the Column Stats Index (CSI). Following cases could not be effectively handled by Data Skipping:
+    //          - Expressions on top-level column's fields (ie, for ex filters like "struct.field > 0", since
+    //          CSI only contains stats for top-level columns, in this case for "struct")
+    //          - Any expression not directly referencing top-level column (for ex, sub-queries, since there's
+    //          nothing CSI in particular could be applied for)
+    lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+    if (!isMetadataTableEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {

Review Comment:
   Given that CSI only has stats for top-level columns, if a predicate references both top-level and non-top-level columns, are we going to skip leveraging CSI? For the non-top-level column we have to visit all data files anyway.





[GitHub] [hudi] hudi-bot commented on pull request #5244: [WIP][HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1090963705

   ## CI report:
   
   * 02a7aa62d7cc6cfa624253f50785c337aee7cde2 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7879) 
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   * 4071e00dcd1bfedbda291add17478b15495864a2 UNKNOWN
   


[GitHub] [hudi] yihua commented on a diff in pull request #5244: [WIP][HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #5244:
URL: https://github.com/apache/hudi/pull/5244#discussion_r844550905


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -48,7 +48,8 @@
       .sinceVersion("0.7.0")
       .withDocumentation("Enable the internal metadata table which serves table metadata like level file listings");
 
-  public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = false;
+  // TODO rectify
+  public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = true;

Review Comment:
   A reminder here to remove this change before merging.



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
    * @return list of pruned (data-skipped) candidate base-files' names
    */
   private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)
-      .contains(HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS)) {
+    // NOTE: Data Skipping is only effective when it references columns that are indexed w/in
+    //       the Column Stats Index (CSI). Following cases could not be effectively handled by Data Skipping:
+    //          - Expressions on top-level column's fields (ie, for ex filters like "struct.field > 0", since
+    //          CSI only contains stats for top-level columns, in this case for "struct")
+    //          - Any expression not directly referencing top-level column (for ex, sub-queries, since there's
+    //          nothing CSI in particular could be applied for)
+    lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+    if (!isMetadataTableEnabled || !isColumnStatsIndexEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {

Review Comment:
   The check here should not rely on `isMetadataTableEnabled` (`hoodie.metadata.enable`) and `isColumnStatsIndexEnabled` (`hoodie.metadata.index.column.stats.enable`), which may not be the source of truth on the query side.  `isColumnStatsIndexAvailable` should be the only source of truth for whether the col_stats partition is ready to read in the metadata table.



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
    * @return list of pruned (data-skipped) candidate base-files' names
    */
   private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)

Review Comment:
   The existing logic looks fine to me.  What's the gap here?





[GitHub] [hudi] hudi-bot commented on pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1093584649

   ## CI report:
   
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   * 6097520274d65ff9a9c2734a25eaf253ef78529d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7925) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7929) 
   


[GitHub] [hudi] nsivabalan commented on a diff in pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #5244:
URL: https://github.com/apache/hudi/pull/5244#discussion_r846676628


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
    * @return list of pruned (data-skipped) candidate base-files' names
    */
   private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)
-      .contains(HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS)) {
+    // NOTE: Data Skipping is only effective when it references columns that are indexed w/in
+    //       the Column Stats Index (CSI). Following cases could not be effectively handled by Data Skipping:
+    //          - Expressions on top-level column's fields (ie, for ex filters like "struct.field > 0", since
+    //          CSI only contains stats for top-level columns, in this case for "struct")
+    //          - Any expression not directly referencing top-level column (for ex, sub-queries, since there's
+    //          nothing CSI in particular could be applied for)
+    lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+    if (!isMetadataTableEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {

Review Comment:
   Also, how do we deduce which columns have been indexed in the MDT CSI? 
   For example, we have two flows: 
   a. `hoodie.metadata.index.column.stats.all_columns.enable` = true, in which case all columns are indexed. 
   b. `hoodie.metadata.index.column.stats.column.list` is set to the list of columns to be indexed. 
   
   So, when we are looking to apply data skipping on the query side, should we check these configs and decide whether a particular column is indexed by CSI or not?
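
The question above can be made concrete with a small, self-contained sketch of what such a config-based check might look like. `IndexedColumns.isIndexed` is a hypothetical helper, not a Hudi API, and the sketch assumes the column list is a comma-separated string of top-level column names. It also assumes the values passed in reflect what the writer actually used; as noted elsewhere in this thread, read-side configs may differ from write-side ones, so this is an illustration of the idea rather than a proposed fix.

```scala
object IndexedColumns {
  // Hypothetical helper: neither the name nor the signature comes from Hudi.
  //   allColumnsEnabled ~ hoodie.metadata.index.column.stats.all_columns.enable
  //   columnList        ~ hoodie.metadata.index.column.stats.column.list
  def isIndexed(column: String, allColumnsEnabled: Boolean, columnList: String): Boolean =
    allColumnsEnabled ||
      columnList
        .split(",")
        .map(_.trim)
        .filter(_.nonEmpty) // an empty config value yields no indexed columns
        .contains(column)
}
```

A query-side caller could then apply Data Skipping only to filters whose referenced columns all pass this check.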
   





[GitHub] [hudi] hudi-bot commented on pull request #5244: [WIP][HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1090962254

   ## CI report:
   
   * 02a7aa62d7cc6cfa624253f50785c337aee7cde2 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7879) 
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #5244: [WIP][HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1090957509

   ## CI report:
   
   * 02a7aa62d7cc6cfa624253f50785c337aee7cde2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7879) 
   


[GitHub] [hudi] hudi-bot commented on pull request #5244: [WIP][HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1092341251

   ## CI report:
   
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   * 4071e00dcd1bfedbda291add17478b15495864a2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7881) 
   * 23a10d80255247f32c77cade8b15d9a8711f7ee1 UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1094087138

   ## CI report:
   
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   * 6097520274d65ff9a9c2734a25eaf253ef78529d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7925) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7929) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7935) 
   * cf0ecea0b15e98a0d41f830c3b26e6b498d279d6 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1093447210

   ## CI report:
   
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   * 6097520274d65ff9a9c2734a25eaf253ef78529d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7925) 
   


[GitHub] [hudi] hudi-bot commented on pull request #5244: [WIP][HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1090985304

   ## CI report:
   
   * 02a7aa62d7cc6cfa624253f50785c337aee7cde2 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7879) 
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   * 4071e00dcd1bfedbda291add17478b15495864a2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7881) 
   


[GitHub] [hudi] hudi-bot commented on pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1092370692

   ## CI report:
   
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   * 4071e00dcd1bfedbda291add17478b15495864a2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7881) 
   * 23a10d80255247f32c77cade8b15d9a8711f7ee1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7907) 
   


[GitHub] [hudi] hudi-bot commented on pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1093586844

   ## CI report:
   
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   * 6097520274d65ff9a9c2734a25eaf253ef78529d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7925) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7929) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7935) 
   


[GitHub] [hudi] alexeykudinkin commented on pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1093502499

   @hudi-bot run azure




[GitHub] [hudi] nsivabalan merged pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
nsivabalan merged PR #5244:
URL: https://github.com/apache/hudi/pull/5244




[GitHub] [hudi] hudi-bot commented on pull request #5244: [WIP][HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1090960783

   ## CI report:
   
   * 02a7aa62d7cc6cfa624253f50785c337aee7cde2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7879) 
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1093504604

   ## CI report:
   
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   * 6097520274d65ff9a9c2734a25eaf253ef78529d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7925) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7929) 
   


[GitHub] [hudi] hudi-bot commented on pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1093358630

   ## CI report:
   
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   * 23a10d80255247f32c77cade8b15d9a8711f7ee1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7907) 
   * 6097520274d65ff9a9c2734a25eaf253ef78529d UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1093672960

   ## CI report:
   
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   * 6097520274d65ff9a9c2734a25eaf253ef78529d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7925) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7929) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7935) 
   


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5244:
URL: https://github.com/apache/hudi/pull/5244#discussion_r846678629


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
    * @return list of pruned (data-skipped) candidate base-files' names
    */
   private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)
-      .contains(HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS)) {
+    // NOTE: Data Skipping is only effective when it references columns that are indexed w/in
+    //       the Column Stats Index (CSI). Following cases could not be effectively handled by Data Skipping:
+    //          - Expressions on top-level column's fields (ie, for ex filters like "struct.field > 0", since
+    //          CSI only contains stats for top-level columns, in this case for "struct")
+    //          - Any expression not directly referencing top-level column (for ex, sub-queries, since there's
+    //          nothing CSI in particular could be applied for)
+    lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+    if (!isMetadataTableEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {

Review Comment:
   > given that CSI does not have stats for top level columns, if predicate references both top level and non-top level columns, we gonna skip leveraging CSI is it? since anyways, for non top level column, we have to visit all data files?
   
   It depends on the predicate, but we will at least try to leverage the CSI to filter on the top-level columns.
   
   > So, when we are looking to apply data skipping on the query side, should we check for these configs and decided whether a particular col is indexed by CSI or not ?
   
   We can't do that; we have to go by what's actually in the index. This is handled when we execute the filter against the lookup table: if the table doesn't contain the column referenced by the filter, the filter simply matches all of the files.
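
   As a rough illustration of the lookup behaviour described above, here is a minimal sketch under stated assumptions. It is a simplified, hypothetical model and not Hudi's actual implementation: `ColumnRange`, `ColumnStatsIndex` and `candidateFiles` are made-up names, and only a single `column > threshold` filter on `Long` values is modelled.

```scala
// Hypothetical, simplified model of a column-stats lookup table; not Hudi's API.
final case class ColumnRange(min: Long, max: Long)

object DataSkippingSketch {
  // file name -> (indexed column name -> min/max of that column within the file)
  type ColumnStatsIndex = Map[String, Map[String, ColumnRange]]

  /** Files that may contain rows satisfying `column > threshold`. */
  def candidateFiles(index: ColumnStatsIndex, column: String, threshold: Long): Set[String] =
    index.filter { case (_, stats) =>
      // Option.forall is true for None: if the index has no stats for the
      // column, the file cannot be ruled out and is kept as a candidate.
      stats.get(column).forall(_.max > threshold)
    }.keySet
}
```

   With stats for `id` only, a filter on `id` prunes the files whose maximum is too low, while a filter on a column that is absent from the index keeps every file, which is the "match all of the files" case.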



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
    * @return list of pruned (data-skipped) candidate base-files' names
    */
   private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)
-      .contains(HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS)) {
+    // NOTE: Data Skipping is only effective when it references columns that are indexed w/in
+    //       the Column Stats Index (CSI). Following cases could not be effectively handled by Data Skipping:
+    //          - Expressions on top-level column's fields (ie, for ex filters like "struct.field > 0", since
+    //          CSI only contains stats for top-level columns, in this case for "struct")
+    //          - Any expression not directly referencing top-level column (for ex, sub-queries, since there's
+    //          nothing CSI in particular could be applied for)
+    lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+    if (!isMetadataTableEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {

Review Comment:
   However, your question made me realize that we're currently deriving the index schema incorrectly. Let me address that.





[GitHub] [hudi] hudi-bot commented on pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1092416411

   ## CI report:
   
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   * 23a10d80255247f32c77cade8b15d9a8711f7ee1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7907) 
   


[GitHub] [hudi] hudi-bot commented on pull request #5244: [WIP][HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1091024080

   ## CI report:
   
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   * 4071e00dcd1bfedbda291add17478b15495864a2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7881) 
   


[GitHub] [hudi] hudi-bot commented on pull request #5244: [WIP][HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1090955932

   ## CI report:
   
   * 02a7aa62d7cc6cfa624253f50785c337aee7cde2 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] nsivabalan commented on a diff in pull request #5244: [WIP][HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #5244:
URL: https://github.com/apache/hudi/pull/5244#discussion_r845614938


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
    * @return list of pruned (data-skipped) candidate base-files' names
    */
   private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)
-      .contains(HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS)) {
+    // NOTE: Data Skipping is only effective when it references columns that are indexed w/in
+    //       the Column Stats Index (CSI). Following cases could not be effectively handled by Data Skipping:
+    //          - Expressions on top-level column's fields (ie, for ex filters like "struct.field > 0", since
+    //          CSI only contains stats for top-level columns, in this case for "struct")
+    //          - Any expression not directly referencing top-level column (for ex, sub-queries, since there's
+    //          nothing CSI in particular could be applied for)
+    lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+    if (!isMetadataTableEnabled || !isColumnStatsIndexEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {

Review Comment:
   We can probably go with three guards here:
   !isMetadataTableEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled
   
   To use the base metadata table at all, one has to enable it explicitly on the read path, so I prefer to guard on that first. Then check whether data skipping is enabled, and only then whether the col stats partition is available to be used.
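
A minimal sketch of the three-guard check suggested in this comment, written in Java for illustration. The class and method names are hypothetical and are not part of Hudi; only the three flag names come from the diff. It evaluates the guards in the order the comment describes and reports which one, if any, disables Data Skipping.

```java
import java.util.Optional;

// Hypothetical helper, not Hudi's API: models the three read-path guards
// from the review comment as plain booleans.
class DataSkippingGuards {

    // Returns the name of the first failing guard, or empty if Data Skipping may proceed.
    // Order follows the comment: metadata table first, then data skipping,
    // then availability of the column stats partition.
    static Optional<String> firstFailingGuard(boolean isMetadataTableEnabled,
                                              boolean isDataSkippingEnabled,
                                              boolean isColumnStatsIndexAvailable) {
        if (!isMetadataTableEnabled) {
            return Optional.of("isMetadataTableEnabled");
        }
        if (!isDataSkippingEnabled) {
            return Optional.of("isDataSkippingEnabled");
        }
        if (!isColumnStatsIndexAvailable) {
            return Optional.of("isColumnStatsIndexAvailable");
        }
        return Optional.empty();
    }

    // Equivalent to negating the condition quoted in the comment:
    // !isMetadataTableEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled
    static boolean canSkipData(boolean isMetadataTableEnabled,
                               boolean isDataSkippingEnabled,
                               boolean isColumnStatsIndexAvailable) {
        return !firstFailingGuard(isMetadataTableEnabled,
                                  isDataSkippingEnabled,
                                  isColumnStatsIndexAvailable).isPresent();
    }
}
```

Returning the name of the failing guard, rather than a bare boolean, is only a design choice of this sketch: it makes it easier to log why the index was not consulted.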
   
   



##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -48,7 +48,8 @@
       .sinceVersion("0.7.0")
       .withDocumentation("Enable the internal metadata table which serves table metadata like level file listings");
 
-  public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = false;
+  // TODO rectify
+  public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = true;

Review Comment:
   +1 






[GitHub] [hudi] hudi-bot commented on pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1094106725

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "02a7aa62d7cc6cfa624253f50785c337aee7cde2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7879",
       "triggerID" : "02a7aa62d7cc6cfa624253f50785c337aee7cde2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c252f8d6d9a6b38adcebec0ba857d5aafae823cf",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "c252f8d6d9a6b38adcebec0ba857d5aafae823cf",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4071e00dcd1bfedbda291add17478b15495864a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7881",
       "triggerID" : "4071e00dcd1bfedbda291add17478b15495864a2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "23a10d80255247f32c77cade8b15d9a8711f7ee1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7907",
       "triggerID" : "23a10d80255247f32c77cade8b15d9a8711f7ee1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6097520274d65ff9a9c2734a25eaf253ef78529d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7925",
       "triggerID" : "6097520274d65ff9a9c2734a25eaf253ef78529d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6097520274d65ff9a9c2734a25eaf253ef78529d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7929",
       "triggerID" : "1093502499",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "6097520274d65ff9a9c2734a25eaf253ef78529d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7935",
       "triggerID" : "1093585813",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "cf0ecea0b15e98a0d41f830c3b26e6b498d279d6",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7948",
       "triggerID" : "cf0ecea0b15e98a0d41f830c3b26e6b498d279d6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   * cf0ecea0b15e98a0d41f830c3b26e6b498d279d6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7948) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1094094556

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "02a7aa62d7cc6cfa624253f50785c337aee7cde2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7879",
       "triggerID" : "02a7aa62d7cc6cfa624253f50785c337aee7cde2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c252f8d6d9a6b38adcebec0ba857d5aafae823cf",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "c252f8d6d9a6b38adcebec0ba857d5aafae823cf",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4071e00dcd1bfedbda291add17478b15495864a2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7881",
       "triggerID" : "4071e00dcd1bfedbda291add17478b15495864a2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "23a10d80255247f32c77cade8b15d9a8711f7ee1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7907",
       "triggerID" : "23a10d80255247f32c77cade8b15d9a8711f7ee1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6097520274d65ff9a9c2734a25eaf253ef78529d",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7925",
       "triggerID" : "6097520274d65ff9a9c2734a25eaf253ef78529d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6097520274d65ff9a9c2734a25eaf253ef78529d",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7929",
       "triggerID" : "1093502499",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "6097520274d65ff9a9c2734a25eaf253ef78529d",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7935",
       "triggerID" : "1093585813",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "cf0ecea0b15e98a0d41f830c3b26e6b498d279d6",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7948",
       "triggerID" : "cf0ecea0b15e98a0d41f830c3b26e6b498d279d6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * c252f8d6d9a6b38adcebec0ba857d5aafae823cf UNKNOWN
   * 6097520274d65ff9a9c2734a25eaf253ef78529d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7925) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7929) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7935) 
   * cf0ecea0b15e98a0d41f830c3b26e6b498d279d6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7948) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #5244:
URL: https://github.com/apache/hudi/pull/5244#discussion_r846686802


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
    * @return list of pruned (data-skipped) candidate base-files' names
    */
   private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)
-      .contains(HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS)) {
+    // NOTE: Data Skipping is only effective when it references columns that are indexed w/in
+    //       the Column Stats Index (CSI). Following cases could not be effectively handled by Data Skipping:
+    //          - Expressions on top-level column's fields (ie, for ex filters like "struct.field > 0", since
+    //          CSI only contains stats for top-level columns, in this case for "struct")
+    //          - Any expression not directly referencing top-level column (for ex, sub-queries, since there's
+    //          nothing CSI in particular could be applied for)
+    lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+    if (!isMetadataTableEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {

Review Comment:
   @nsivabalan I'm addressing this problem in a separate PR to avoid overloading this one: #5275
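
For context on the NOTE in the hunk above: the Column Stats Index only holds stats for top-level columns, so a filter on `struct.field` gives Data Skipping nothing to work with. The following Java sketch illustrates that idea only. The class, the method, and the representation of references as dotted path strings are assumptions made for this example; this is not how `collectReferencedColumns` is implemented.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: keep only references that name a top-level column directly.
class TopLevelColumnRefs {

    // referencedPaths: attribute paths found in the query filters, e.g. "id" or "struct.field".
    // topLevelColumns: names of the table schema's top-level columns.
    // Assumes column names themselves contain no dots.
    static Set<String> indexableColumns(List<String> referencedPaths, Set<String> topLevelColumns) {
        Set<String> result = new LinkedHashSet<>();
        for (String path : referencedPaths) {
            boolean isNested = path.contains(".");
            if (!isNested && topLevelColumns.contains(path)) {
                result.add(path);
            }
        }
        return result;
    }
}
```

If the returned set is empty, a caller in this sketch would fall back to listing all files, since no filter references a column the index could cover.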





[GitHub] [hudi] alexeykudinkin commented on pull request #5244: [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #5244:
URL: https://github.com/apache/hudi/pull/5244#issuecomment-1093585813

   @hudi-bot run azure

