Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/03/03 22:19:34 UTC

[GitHub] [hudi] alexeykudinkin opened a new pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

alexeykudinkin opened a new pull request #4948:
URL: https://github.com/apache/hudi/pull/4948


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   This PR rebases the Data Skipping flow from relying on the bespoke Column Stats Index implementation to instead leveraging the Metadata Table (MT) Column Stats Index.
   
   ## Brief change log
   
    - Added `HoodieDatasetUtils`
    - Rebased `HoodieFileIndex` to use the MT Column Stats Index instead of the bespoke one
    - Fixed tests
    - Cleaned up
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1064705848


   ## CI report:
   
   * 4421752bef3dd3b53cd896f7d3ca23bb49d22034 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6669) 
   * 14366cac6e233cb85ee94307a7f62f6184ed5b34 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] alexeykudinkin commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065798300


   hudi-bot run azure





[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065812263


   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] xiarixiaoyao commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058729769


   @alexeykudinkin great work





[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r822169323



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {
+      Option.empty
+    } else {
+      val targetColStatsIndexColumns = Seq(
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT)
+
+      val requiredMetadataIndexColumns =
+        (targetColStatsIndexColumns :+ HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).map(colName =>
+          s"${HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS}.${colName}")
+
+      // Read Metadata Table's Column Stats Index into Spark's [[DataFrame]]
+      val metadataTableDF = spark.read.format("org.apache.hudi")
+        .load(s"$metadataTablePath/${MetadataPartitionType.COLUMN_STATS.getPartitionPath}")
+
+      // TODO filter on (column, partition) prefix
+      val colStatsDF = metadataTableDF.where(col(HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS).isNotNull)
+        .select(requiredMetadataIndexColumns.map(col): _*)
+
+      val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+      // Persist DF to avoid re-computing column statistics unraveling
+      withPersistence(colStatsDF) {
+        // Metadata Table bears rows in the following format
+        //
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  |        fileName           | columnName |  minValue  |  maxValue  |  num_nulls  |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          A |          1 |         10 |           0 |
+        //  | another_base_file.parquet |          A |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //
+        // While Data Skipping utils are expecting following (transposed) format, where per-column stats are
+        // essentially transposed (from rows to columns):
+        //
+        //  +---------------------------+------------+------------+-------------+
+        //  |          file             | A_minValue | A_maxValue | A_num_nulls |
+        //  +---------------------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          1 |         10 |           0 |
+        //  | another_base_file.parquet |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+-------------+
+        //
+        // NOTE: Column Stats Index might potentially contain statistics for many columns (if not all), while
+        //       query at hand might only be referencing a handful of those. As such, we collect all the
+        //       column references from the filtering expressions, and only transpose records corresponding to the
+        //       columns referenced in those
+        val transposedColStatsDF =
+        queryReferencedColumns.map(colName =>
+          colStatsDF.filter(col(HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).equalTo(colName))
+            .select(targetColStatsIndexColumns.map(col): _*)
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT, getNumNullsColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE, getMinColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE, getMaxColumnNameFor(colName))
+        )
+          .reduceLeft((left, right) =>
+            left.join(right, usingColumn = HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME))

Review comment:
       Keep in mind that here we only join columns that the table is clustered by, so the number of joins is likely bounded by roughly 10. So, frankly, I don't think this will be a bottleneck unless we're talking about gargantuan tables (with tens of millions of files).
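The transposition discussed in this thread — turning the Metadata Table's one-row-per-(file, column) stats layout into the one-row-per-file layout the Data Skipping utils expect, one join per referenced column — can be illustrated outside Spark. The following is a hedged pandas mock-up of that reshaping, not the Hudi code; the helper name and the sample data are placeholders:

```python
import pandas as pd

# Column Stats Index rows as laid out in the Metadata Table:
# one row per (file, column), matching the table sketched in the review comment.
col_stats = pd.DataFrame({
    "fileName":   ["one_base_file.parquet", "another_base_file.parquet"],
    "columnName": ["A", "A"],
    "minValue":   [1, -10],
    "maxValue":   [10, 0],
    "num_nulls":  [0, 5],
})

def transpose_col_stats(df, query_columns):
    """Transpose per-column stats rows into one row per file by
    slicing per column, renaming stats columns, and joining on fileName."""
    slices = []
    for col in query_columns:
        s = (df[df["columnName"] == col]
             .drop(columns=["columnName"])
             .rename(columns={"minValue":  f"{col}_minValue",
                              "maxValue":  f"{col}_maxValue",
                              "num_nulls": f"{col}_num_nulls"}))
        slices.append(s)
    out = slices[0]
    for s in slices[1:]:
        out = out.merge(s, on="fileName")  # one join per referenced column
    return out

transposed = transpose_col_stats(col_stats, ["A"])
print(transposed)
```

Since only the columns referenced by the query filters are transposed, the number of joins stays small even for wide tables.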
   







[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1061300452


   ## CI report:
   
   * 2779d95457c76f4726615a153a1acf26b24836e2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523) 
   * 4421752bef3dd3b53cd896f7d3ca23bb49d22034 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065812263


   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] yihua commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
yihua commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r823184442



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {
+      Option.empty
+    } else {
+      val targetColStatsIndexColumns = Seq(
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT)
+
+      val requiredMetadataIndexColumns =
+        (targetColStatsIndexColumns :+ HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).map(colName =>
+          s"${HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS}.${colName}")
+
+      // Read Metadata Table's Column Stats Index into Spark's [[DataFrame]]
+      val metadataTableDF = spark.read.format("org.apache.hudi")
+        .load(s"$metadataTablePath/${MetadataPartitionType.COLUMN_STATS.getPartitionPath}")
+
+      // TODO filter on (column, partition) prefix
+      val colStatsDF = metadataTableDF.where(col(HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS).isNotNull)
+        .select(requiredMetadataIndexColumns.map(col): _*)

Review comment:
       Got it, this is fine for now. I'm thinking from the perspective of whether this can be reused for indexing on the write path.







[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r823399696



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {
+      Option.empty
+    } else {
+      val targetColStatsIndexColumns = Seq(
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT)
+
+      val requiredMetadataIndexColumns =
+        (targetColStatsIndexColumns :+ HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).map(colName =>
+          s"${HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS}.${colName}")
+
+      // Read Metadata Table's Column Stats Index into Spark's [[DataFrame]]
+      val metadataTableDF = spark.read.format("org.apache.hudi")
+        .load(s"$metadataTablePath/${MetadataPartitionType.COLUMN_STATS.getPartitionPath}")
+
+      // TODO filter on (column, partition) prefix
+      val colStatsDF = metadataTableDF.where(col(HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS).isNotNull)
+        .select(requiredMetadataIndexColumns.map(col): _*)
+
+      val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+      // Persist DF to avoid re-computing column statistics unraveling
+      withPersistence(colStatsDF) {
+        // Metadata Table bears rows in the following format
+        //
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  |        fileName           | columnName |  minValue  |  maxValue  |  num_nulls  |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          A |          1 |         10 |           0 |
+        //  | another_base_file.parquet |          A |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //
+        // While Data Skipping utils are expecting following (transposed) format, where per-column stats are
+        // essentially transposed (from rows to columns):
+        //
+        //  +---------------------------+------------+------------+-------------+
+        //  |          file             | A_minValue | A_maxValue | A_num_nulls |
+        //  +---------------------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          1 |         10 |           0 |
+        //  | another_base_file.parquet |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+-------------+
+        //
+        // NOTE: Column Stats Index might potentially contain statistics for many columns (if not all), while
+        //       query at hand might only be referencing a handful of those. As such, we collect all the
+        //       column references from the filtering expressions, and only transpose records corresponding to the
+        //       columns referenced in those
+        val transposedColStatsDF =
+        queryReferencedColumns.map(colName =>
+          colStatsDF.filter(col(HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).equalTo(colName))
+            .select(targetColStatsIndexColumns.map(col): _*)
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT, getNumNullsColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE, getMinColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE, getMaxColumnNameFor(colName))
+        )
+          .reduceLeft((left, right) =>
+            left.join(right, usingColumn = HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME))

Review comment:
       > Based on the code, I thought all columns from predicates are going to trigger joining (i.e., n number of columns -> n-1 joins), not just the clustering columns, since column stats index in metadata table can contain all columns from the schema.
   
    Correct, but even in the current setup we only join the M columns that are directly _referenced in the predicates_. So even for wide tables with thousands of columns, this is unlikely to be a problem, since M << N practically at all times.
   
   > I understand that some kind of "joining" is needed here, but the spark table join in the current scheme expands the table after each join and adds additional col stats column. If for each of the df from a column from the following applies the filter first and generate a boolean for each file, then the next step is going to do AND, which does not require expanding columns and an additional cached table, reducing memory pressure and possible shuffling. Then that is much less costly than spark table/df join.
   
    I understand your point. Such slicing, however, would (a) require revisiting essentially the whole flow, and (b) blend index reshaping with the actual querying, and I think we'd be optimizing prematurely at this point. We can certainly fine-tune this flow, but I'd much rather focus on its correctness right now and follow up on performance tuning after proper testing/profiling is done. WDYT?
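The alternative being floated here — evaluating each column's pruning predicate against its own stats slice to produce a per-file boolean, then AND-ing those booleans instead of joining min/max columns into one wide frame — could look roughly like this. This is a hypothetical pandas sketch of the idea under discussion, not code from the PR; the equality predicates and sample files are illustrative assumptions:

```python
import pandas as pd

# Per-(file, column) stats, as in the Metadata Table's Column Stats partition
col_stats = pd.DataFrame({
    "fileName":   ["f1.parquet", "f2.parquet", "f1.parquet", "f2.parquet"],
    "columnName": ["A", "A", "B", "B"],
    "minValue":   [1, -10, 100, 5],
    "maxValue":   [10, 0, 200, 50],
})

# Hypothetical point-lookup predicates derived from query filters: A = 5 AND B = 120
predicates = {"A": 5, "B": 120}

# For each referenced column, compute a boolean per file:
# can this file possibly contain the looked-up value?
verdicts = []
for col, value in predicates.items():
    s = col_stats[col_stats["columnName"] == col]
    keep = (s["minValue"] <= value) & (value <= s["maxValue"])
    verdicts.append(pd.Series(keep.values, index=s["fileName"].values, name=col))

# AND the per-column verdicts; no wide frame of renamed min/max columns
# is ever materialized, which is the memory-pressure point being raised.
candidate_mask = pd.concat(verdicts, axis=1).all(axis=1)
candidate_files = sorted(candidate_mask[candidate_mask].index)
print(candidate_files)
```

The trade-off the reply describes: this fuses index reshaping with predicate evaluation, so it optimizes the current two-stage flow at the cost of restructuring it.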







[GitHub] [hudi] yihua commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
yihua commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r824919440



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {
+      Option.empty
+    } else {
+      val targetColStatsIndexColumns = Seq(
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT)
+
+      val requiredMetadataIndexColumns =
+        (targetColStatsIndexColumns :+ HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).map(colName =>
+          s"${HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS}.${colName}")
+
+      // Read Metadata Table's Column Stats Index into Spark's [[DataFrame]]
+      val metadataTableDF = spark.read.format("org.apache.hudi")
+        .load(s"$metadataTablePath/${MetadataPartitionType.COLUMN_STATS.getPartitionPath}")
+
+      // TODO filter on (column, partition) prefix
+      val colStatsDF = metadataTableDF.where(col(HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS).isNotNull)
+        .select(requiredMetadataIndexColumns.map(col): _*)
+
+      val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+      // Persist DF to avoid re-computing column statistics unraveling
+      withPersistence(colStatsDF) {
+        // Metadata Table bears rows in the following format
+        //
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  |        fileName           | columnName |  minValue  |  maxValue  |  num_nulls  |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          A |          1 |         10 |           0 |
+        //  | another_base_file.parquet |          A |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //
+        // While Data Skipping utils are expecting following (transposed) format, where per-column stats are
+        // essentially transposed (from rows to columns):
+        //
+        //  +---------------------------+------------+------------+-------------+
+        //  |          file             | A_minValue | A_maxValue | A_num_nulls |
+        //  +---------------------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          1 |         10 |           0 |
+        //  | another_base_file.parquet |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+-------------+
+        //
+        // NOTE: Column Stats Index might potentially contain statistics for many columns (if not all), while
+        //       query at hand might only be referencing a handful of those. As such, we collect all the
+        //       column references from the filtering expressions, and only transpose records corresponding to the
+        //       columns referenced in those
+        val transposedColStatsDF =
+        queryReferencedColumns.map(colName =>
+          colStatsDF.filter(col(HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).equalTo(colName))
+            .select(targetColStatsIndexColumns.map(col): _*)
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT, getNumNullsColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE, getMinColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE, getMaxColumnNameFor(colName))
+        )
+          .reduceLeft((left, right) =>
+            left.join(right, usingColumn = HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME))

Review comment:
       That makes sense to me.  As synced offline, optimizing the join flow will be a follow-up, not part of this PR.  Before actually optimizing it, we still need a good understanding of what percentage of the overall query planning/execution time is spent in the joining stage across different table sizes (small and medium to start with), to check whether this is really the bottleneck.
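
    The row-to-column transposition described in the diff's comment can be sketched in plain Python (a conceptual illustration with assumed toy data, not the actual Spark code): each (file, column) stats record becomes a set of `<col>_minValue` / `<col>_maxValue` / `<col>_num_nulls` fields on a single wide row per file, and only query-referenced columns are transposed.

    ```python
    from collections import defaultdict

    # Toy records in the Metadata Table's row-per-(file, column) format
    rows = [
        {"fileName": "one_base_file.parquet",     "columnName": "A",
         "minValue":   1, "maxValue": 10, "num_nulls": 0},
        {"fileName": "another_base_file.parquet", "columnName": "A",
         "minValue": -10, "maxValue":  0, "num_nulls": 5},
    ]

    def transpose(col_stats_rows, referenced_columns):
        """Pivot per-(file, column) stats rows into one wide row per file."""
        wide = defaultdict(dict)
        for r in col_stats_rows:
            col = r["columnName"]
            if col not in referenced_columns:
                continue  # only transpose columns the query actually references
            entry = wide[r["fileName"]]
            entry[f"{col}_minValue"] = r["minValue"]
            entry[f"{col}_maxValue"] = r["maxValue"]
            entry[f"{col}_num_nulls"] = r["num_nulls"]
        return dict(wide)

    wide = transpose(rows, {"A"})
    ```

    In the PR this pivot is expressed as per-column filters joined on the file name, which is where the join cost under discussion comes from.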







[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r822166241



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {
+      Option.empty
+    } else {
+      val targetColStatsIndexColumns = Seq(
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT)
+
+      val requiredMetadataIndexColumns =
+        (targetColStatsIndexColumns :+ HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).map(colName =>
+          s"${HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS}.${colName}")
+
+      // Read Metadata Table's Column Stats Index into Spark's [[DataFrame]]
+      val metadataTableDF = spark.read.format("org.apache.hudi")
+        .load(s"$metadataTablePath/${MetadataPartitionType.COLUMN_STATS.getPartitionPath}")
+
+      // TODO filter on (column, partition) prefix
+      val colStatsDF = metadataTableDF.where(col(HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS).isNotNull)
+        .select(requiredMetadataIndexColumns.map(col): _*)
+
+      val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+      // Persist DF to avoid re-computing column statistics unraveling
+      withPersistence(colStatsDF) {
+        // Metadata Table bears rows in the following format
+        //
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  |        fileName           | columnName |  minValue  |  maxValue  |  num_nulls  |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          A |          1 |         10 |           0 |
+        //  | another_base_file.parquet |          A |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //
+        // While Data Skipping utils are expecting following (transposed) format, where per-column stats are
+        // essentially transposed (from rows to columns):
+        //
+        //  +---------------------------+------------+------------+-------------+
+        //  |          file             | A_minValue | A_maxValue | A_num_nulls |
+        //  +---------------------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          1 |         10 |           0 |
+        //  | another_base_file.parquet |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+-------------+
+        //
+        // NOTE: Column Stats Index might potentially contain statistics for many columns (if not all), while
+        //       query at hand might only be referencing a handful of those. As such, we collect all the
+        //       column references from the filtering expressions, and only transpose records corresponding to the
+        //       columns referenced in those
+        val transposedColStatsDF =

Review comment:
       Correct. It also makes the code much simpler (otherwise you'd need to intersect rows for every column, which makes the code more involved).







[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058684245


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 89490cb2d45a5b9a4a097a14ddcfa38016f84db1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515) 
   * 29d309812b4f7118dc2fdbe7d558fa4f2f697739 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058736338


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 29d309812b4f7118dc2fdbe7d558fa4f2f697739 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065775688


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523",
       "triggerID" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6669",
       "triggerID" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "triggerType" : "PUSH"
     }, {
       "hash" : "14366cac6e233cb85ee94307a7f62f6184ed5b34",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6812",
       "triggerID" : "14366cac6e233cb85ee94307a7f62f6184ed5b34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] alexeykudinkin commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065783105


   @hudi-bot run azure





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1067209908


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864",
       "triggerID" : "1065798362",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6932",
       "triggerID" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864) 
   * fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6932) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1067451750


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864",
       "triggerID" : "1065798362",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6932",
       "triggerID" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4db7b511ee9609859f4fac1a24ad960e14dddf2e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "4db7b511ee9609859f4fac1a24ad960e14dddf2e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6932) 
   * 4db7b511ee9609859f4fac1a24ad960e14dddf2e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1061300452


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523",
       "triggerID" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2779d95457c76f4726615a153a1acf26b24836e2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523) 
   * 4421752bef3dd3b53cd896f7d3ca23bb49d22034 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065795525


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058682395


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 89490cb2d45a5b9a4a097a14ddcfa38016f84db1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515) 
   * 29d309812b4f7118dc2fdbe7d558fa4f2f697739 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] yihua commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
yihua commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r823178174



##########
File path: hudi-common/src/main/avro/HoodieMetadata.avsc
##########
@@ -109,6 +109,14 @@
                                 "string"
                             ]
                         },
+                        {
+                            "doc": "Column name for which this column statistics applies",

Review comment:
       In this case, we need an upgrade step if the column stats index is enabled in the metadata table, and it should be somewhat automatic.  @nsivabalan @vinothchandar wdyt?







[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065783433


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523",
       "triggerID" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6669",
       "triggerID" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "triggerType" : "PUSH"
     }, {
       "hash" : "14366cac6e233cb85ee94307a7f62f6184ed5b34",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6812",
       "triggerID" : "14366cac6e233cb85ee94307a7f62f6184ed5b34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r822042168



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/DataSkippingUtils.scala
##########
@@ -259,34 +260,4 @@ object DataSkippingUtils extends Logging {
         throw new AnalysisException(s"convert reference to name failed,  Found unsupported expression ${other}")
     }
   }
-
-  def getIndexFiles(conf: Configuration, indexPath: String): Seq[FileStatus] = {

Review comment:
       Correct




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065798887


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864",
       "triggerID" : "1065798362",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065798887


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864",
       "triggerID" : "1065798362",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058568643


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 89490cb2d45a5b9a4a097a14ddcfa38016f84db1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058736338


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 29d309812b4f7118dc2fdbe7d558fa4f2f697739 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058787817


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523",
       "triggerID" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 29d309812b4f7118dc2fdbe7d558fa4f2f697739 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516) 
   * 2779d95457c76f4726615a153a1acf26b24836e2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r825144635



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {
+      Option.empty
+    } else {
+      val targetColStatsIndexColumns = Seq(
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT)
+
+      val requiredMetadataIndexColumns =
+        (targetColStatsIndexColumns :+ HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).map(colName =>
+          s"${HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS}.${colName}")
+
+      // Read Metadata Table's Column Stats Index into Spark's [[DataFrame]]
+      val metadataTableDF = spark.read.format("org.apache.hudi")
+        .load(s"$metadataTablePath/${MetadataPartitionType.COLUMN_STATS.getPartitionPath}")
+
+      // TODO filter on (column, partition) prefix
+      val colStatsDF = metadataTableDF.where(col(HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS).isNotNull)
+        .select(requiredMetadataIndexColumns.map(col): _*)
+
+      val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+      // Persist DF to avoid re-computing column statistics unraveling
+      withPersistence(colStatsDF) {
+        // Metadata Table bears rows in the following format
+        //
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  |        fileName           | columnName |  minValue  |  maxValue  |  num_nulls  |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          A |          1 |         10 |           0 |
+        //  | another_base_file.parquet |          A |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //
+        // While Data Skipping utils are expecting following (transposed) format, where per-column stats are
+        // essentially transposed (from rows to columns):
+        //
+        //  +---------------------------+------------+------------+-------------+
+        //  |          file             | A_minValue | A_maxValue | A_num_nulls |
+        //  +---------------------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          1 |         10 |           0 |
+        //  | another_base_file.parquet |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+-------------+
+        //
+        // NOTE: Column Stats Index might potentially contain statistics for many columns (if not all), while
+        //       query at hand might only be referencing a handful of those. As such, we collect all the
+        //       column references from the filtering expressions, and only transpose records corresponding to the
+        //       columns referenced in those
+        val transposedColStatsDF =
+        queryReferencedColumns.map(colName =>
+          colStatsDF.filter(col(HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).equalTo(colName))
+            .select(targetColStatsIndexColumns.map(col): _*)
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT, getNumNullsColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE, getMinColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE, getMaxColumnNameFor(colName))
+        )
+          .reduceLeft((left, right) =>
+            left.join(right, usingColumn = HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME))

Review comment:
       Created HUDI-3611
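   The row-to-column transposition described by the ASCII tables in the diff above can be sketched in a few lines. This is an illustrative Python sketch using plain dicts in place of Spark DataFrames; the file names and stat values are invented, and the field names only mirror the metadata-table layout:

   ```python
   # Row-oriented column stats, one record per (file, column) pair, as laid
   # out in the metadata table's column_stats partition (illustrative values).
   col_stats = [
       {"fileName": "one_base_file.parquet",     "columnName": "A",
        "minValue": 1,   "maxValue": 10, "nullCount": 0},
       {"fileName": "another_base_file.parquet", "columnName": "A",
        "minValue": -10, "maxValue": 0,  "nullCount": 5},
       {"fileName": "one_base_file.parquet",     "columnName": "B",
        "minValue": 5,   "maxValue": 9,  "nullCount": 1},
   ]

   def transpose(stats, referenced_columns):
       """Pivot the per-(file, column) records into one row per file, keeping
       only the columns actually referenced by the query's filters."""
       by_file = {}
       for rec in stats:
           c = rec["columnName"]
           if c not in referenced_columns:
               continue  # skip stats for columns the query never touches
           row = by_file.setdefault(rec["fileName"], {"file": rec["fileName"]})
           row[c + "_minValue"] = rec["minValue"]
           row[c + "_maxValue"] = rec["maxValue"]
           row[c + "_nullCount"] = rec["nullCount"]
       return list(by_file.values())

   rows = transpose(col_stats, {"A"})
   # Each row now carries A_minValue / A_maxValue / A_nullCount for one file,
   # matching the transposed shape the Data Skipping utils expect.
   ```

   In the PR itself this pivot is expressed as one filtered/renamed DataFrame per referenced column, joined on the file name; the sketch only shows the shape of the transformation, not its Spark execution plan.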




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] yihua commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
yihua commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r823200671



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {
+      Option.empty
+    } else {
+      val targetColStatsIndexColumns = Seq(
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT)
+
+      val requiredMetadataIndexColumns =
+        (targetColStatsIndexColumns :+ HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).map(colName =>
+          s"${HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS}.${colName}")
+
+      // Read Metadata Table's Column Stats Index into Spark's [[DataFrame]]
+      val metadataTableDF = spark.read.format("org.apache.hudi")
+        .load(s"$metadataTablePath/${MetadataPartitionType.COLUMN_STATS.getPartitionPath}")
+
+      // TODO filter on (column, partition) prefix
+      val colStatsDF = metadataTableDF.where(col(HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS).isNotNull)
+        .select(requiredMetadataIndexColumns.map(col): _*)
+
+      val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+      // Persist DF to avoid re-computing column statistics unraveling
+      withPersistence(colStatsDF) {
+        // Metadata Table bears rows in the following format
+        //
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  |        fileName           | columnName |  minValue  |  maxValue  |  num_nulls  |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          A |          1 |         10 |           0 |
+        //  | another_base_file.parquet |          A |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //
+        // While Data Skipping utils are expecting following (transposed) format, where per-column stats are
+        // essentially transposed (from rows to columns):
+        //
+        //  +---------------------------+------------+------------+-------------+
+        //  |          file             | A_minValue | A_maxValue | A_num_nulls |
+        //  +---------------------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          1 |         10 |           0 |
+        //  | another_base_file.parquet |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+-------------+
+        //
+        // NOTE: Column Stats Index might potentially contain statistics for many columns (if not all), while
+        //       query at hand might only be referencing a handful of those. As such, we collect all the
+        //       column references from the filtering expressions, and only transpose records corresponding to the
+        //       columns referenced in those
+        val transposedColStatsDF =
+        queryReferencedColumns.map(colName =>
+          colStatsDF.filter(col(HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).equalTo(colName))
+            .select(targetColStatsIndexColumns.map(col): _*)
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT, getNumNullsColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE, getMinColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE, getMaxColumnNameFor(colName))
+        )
+          .reduceLeft((left, right) =>
+            left.join(right, usingColumn = HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME))

Review comment:
       Based on the code, I thought all columns referenced by the predicates are going to trigger joining (i.e., n columns -> n-1 joins), not just the clustering columns, since the column stats index in the metadata table can contain all columns from the schema.
   
   There are cases where the table is wide (1k to 10k+ columns, see [this blog](https://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance/)) and queries at Uber and ByteDance can have more than 10 predicates.  ByteDance has PB-level tables which can easily have tens of millions of files in a few partitions.  I worry that the joins could become a bottleneck at this scale.
   
   I understand that some kind of "joining" is needed here, but the Spark table join in the current scheme expands the table after each join, adding extra col-stats columns.  If instead each per-column DataFrame below applied its filter first and produced a boolean per file, the next step would only need to AND those booleans.  That requires neither expanding columns nor caching an additional table, which reduces memory pressure and possible shuffling, and is much less costly than a Spark table/DataFrame join.
   
   ```
   queryReferencedColumns.map(colName =>
             colStatsDF.filter(col(HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).equalTo(colName))
               .select(targetColStatsIndexColumns.map(col): _*)
               .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT, getNumNullsColumnNameFor(colName))
               .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE, getMinColumnNameFor(colName))
               .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE, getMaxColumnNameFor(colName))
           )
   ```
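   To make the suggested alternative concrete, here is a hedged sketch in plain Python (invented file names, stats, and predicates; not the PR's Spark code): each column's predicate is evaluated against that column's min/max range independently, producing a per-file boolean, and the booleans are AND-ed together with no join at all.

   ```python
   from collections import defaultdict

   # Per-(file, column) stats, mirroring the metadata table layout (made-up values).
   col_stats = [
       {"fileName": "one_base_file.parquet",     "columnName": "A", "minValue": 1,   "maxValue": 10},
       {"fileName": "another_base_file.parquet", "columnName": "A", "minValue": -10, "maxValue": 0},
       {"fileName": "one_base_file.parquet",     "columnName": "B", "minValue": 5,   "maxValue": 9},
       {"fileName": "another_base_file.parquet", "columnName": "B", "minValue": 7,   "maxValue": 8},
   ]

   # Hypothetical query predicates rewritten against (min, max) ranges:
   # "A = 3" can only match files whose range contains 3; "B >= 8" can only
   # match files whose max is at least 8.
   predicates = {
       "A": lambda lo, hi: lo <= 3 <= hi,
       "B": lambda lo, hi: hi >= 8,
   }

   might_match = defaultdict(lambda: True)
   for rec in col_stats:
       p = predicates.get(rec["columnName"])
       if p is not None:
           # AND this column's verdict into the file's running verdict
           might_match[rec["fileName"]] &= p(rec["minValue"], rec["maxValue"])

   candidate_files = {f for f, ok in might_match.items() if ok}
   # candidate_files now holds only files that can possibly satisfy every predicate
   ```

   The per-file verdicts stay one boolean wide regardless of how many columns the query references, which is the memory/shuffle saving being argued for above.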




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1067207423


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864",
       "triggerID" : "1065798362",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864) 
   * fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065783433


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523",
       "triggerID" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6669",
       "triggerID" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "triggerType" : "PUSH"
     }, {
       "hash" : "14366cac6e233cb85ee94307a7f62f6184ed5b34",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6812",
       "triggerID" : "14366cac6e233cb85ee94307a7f62f6184ed5b34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1067209908


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864",
       "triggerID" : "1065798362",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6932",
       "triggerID" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864) 
   * fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6932) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] alexeykudinkin removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065798300









[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1067495120


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864",
       "triggerID" : "1065798362",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6932",
       "triggerID" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4db7b511ee9609859f4fac1a24ad960e14dddf2e",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6945",
       "triggerID" : "4db7b511ee9609859f4fac1a24ad960e14dddf2e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 4db7b511ee9609859f4fac1a24ad960e14dddf2e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6945) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1067207423


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864",
       "triggerID" : "1065798362",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864) 
   * fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1067453094


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864",
       "triggerID" : "1065798362",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6932",
       "triggerID" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4db7b511ee9609859f4fac1a24ad960e14dddf2e",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6945",
       "triggerID" : "4db7b511ee9609859f4fac1a24ad960e14dddf2e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6932) 
   * 4db7b511ee9609859f4fac1a24ad960e14dddf2e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6945) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058568643


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 89490cb2d45a5b9a4a097a14ddcfa38016f84db1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058786543


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 29d309812b4f7118dc2fdbe7d558fa4f2f697739 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516) 
   * 2779d95457c76f4726615a153a1acf26b24836e2 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065796347


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r822169323



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {
+      Option.empty
+    } else {
+      val targetColStatsIndexColumns = Seq(
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT)
+
+      val requiredMetadataIndexColumns =
+        (targetColStatsIndexColumns :+ HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).map(colName =>
+          s"${HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS}.${colName}")
+
+      // Read Metadata Table's Column Stats Index into Spark's [[DataFrame]]
+      val metadataTableDF = spark.read.format("org.apache.hudi")
+        .load(s"$metadataTablePath/${MetadataPartitionType.COLUMN_STATS.getPartitionPath}")
+
+      // TODO filter on (column, partition) prefix
+      val colStatsDF = metadataTableDF.where(col(HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS).isNotNull)
+        .select(requiredMetadataIndexColumns.map(col): _*)
+
+      val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+      // Persist DF to avoid re-computing column statistics unraveling
+      withPersistence(colStatsDF) {
+        // Metadata Table bears rows in the following format
+        //
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  |        fileName           | columnName |  minValue  |  maxValue  |  num_nulls  |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          A |          1 |         10 |           0 |
+        //  | another_base_file.parquet |          A |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //
+        // While Data Skipping utils are expecting following (transposed) format, where per-column stats are
+        // essentially transposed (from rows to columns):
+        //
+        //  +---------------------------+------------+------------+-------------+
+        //  |          file             | A_minValue | A_maxValue | A_num_nulls |
+        //  +---------------------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          1 |         10 |           0 |
+        //  | another_base_file.parquet |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+-------------+
+        //
+        // NOTE: Column Stats Index might potentially contain statistics for many columns (if not all), while
+        //       query at hand might only be referencing a handful of those. As such, we collect all the
+        //       column references from the filtering expressions, and only transpose records corresponding to the
+        //       columns referenced in those
+        val transposedColStatsDF =
+        queryReferencedColumns.map(colName =>
+          colStatsDF.filter(col(HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).equalTo(colName))
+            .select(targetColStatsIndexColumns.map(col): _*)
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT, getNumNullsColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE, getMinColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE, getMaxColumnNameFor(colName))
+        )
+          .reduceLeft((left, right) =>
+            left.join(right, usingColumn = HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME))

Review comment:
       Keep in mind that here we only join on columns that the table is clustered by, so the number of joins is likely bounded by ~10.
   
   So, frankly, I don't think this will be a bottleneck unless we're talking about gargantuan tables (with tens of millions of files).
   
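The row-to-column transposition discussed above (one Column Stats Index row per file+column pair, reshaped into one row per file with `<col>_minValue`/`<col>_maxValue`/`<col>_num_nulls` columns) can be sketched outside Spark. The following is a minimal pandas illustration, not the actual Hudi implementation; the column names mirror the metadata payload fields shown in the diff but the DataFrame contents are made-up sample data:

```python
import pandas as pd

# Column Stats Index rows as the Metadata Table stores them:
# one row per (file, column) pair.
col_stats = pd.DataFrame({
    "fileName":   ["one_base_file.parquet", "another_base_file.parquet"],
    "columnName": ["A", "A"],
    "minValue":   [1, -10],
    "maxValue":   [10, 0],
    "num_nulls":  [0, 5],
})

# Transpose into one row per file, the shape the data-skipping
# utilities expect.
transposed = col_stats.pivot(index="fileName", columns="columnName",
                             values=["minValue", "maxValue", "num_nulls"])
# pivot() yields a (stat, columnName) MultiIndex; flatten it into
# names like "A_minValue", matching getMinColumnNameFor(colName) etc.
transposed.columns = [f"{c}_{stat}" for stat, c in transposed.columns]
transposed = transposed.reset_index()
```

In the actual Scala code this reshaping is done per referenced column with `filter`/`select`/`withColumnRenamed` and a `reduceLeft` of joins on the file name, which is why the bound on the number of joined columns matters.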

##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {
+      Option.empty
+    } else {
+      val targetColStatsIndexColumns = Seq(
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT)
+
+      val requiredMetadataIndexColumns =
+        (targetColStatsIndexColumns :+ HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).map(colName =>
+          s"${HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS}.${colName}")
+
+      // Read Metadata Table's Column Stats Index into Spark's [[DataFrame]]
+      val metadataTableDF = spark.read.format("org.apache.hudi")
+        .load(s"$metadataTablePath/${MetadataPartitionType.COLUMN_STATS.getPartitionPath}")
+
+      // TODO filter on (column, partition) prefix
+      val colStatsDF = metadataTableDF.where(col(HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS).isNotNull)
+        .select(requiredMetadataIndexColumns.map(col): _*)
+
+      val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+      // Persist DF to avoid re-computing column statistics unraveling
+      withPersistence(colStatsDF) {
+        // Metadata Table bears rows in the following format
+        //
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  |        fileName           | columnName |  minValue  |  maxValue  |  num_nulls  |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          A |          1 |         10 |           0 |
+        //  | another_base_file.parquet |          A |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //
+        // While Data Skipping utils are expecting following (transposed) format, where per-column stats are
+        // essentially transposed (from rows to columns):
+        //
+        //  +---------------------------+------------+------------+-------------+
+        //  |          file             | A_minValue | A_maxValue | A_num_nulls |
+        //  +---------------------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          1 |         10 |           0 |
+        //  | another_base_file.parquet |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+-------------+
+        //
+        // NOTE: Column Stats Index might potentially contain statistics for many columns (if not all), while
+        //       query at hand might only be referencing a handful of those. As such, we collect all the
+        //       column references from the filtering expressions, and only transpose records corresponding to the
+        //       columns referenced in those
+        val transposedColStatsDF =
+        queryReferencedColumns.map(colName =>
+          colStatsDF.filter(col(HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).equalTo(colName))
+            .select(targetColStatsIndexColumns.map(col): _*)
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT, getNumNullsColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE, getMinColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE, getMaxColumnNameFor(colName))
+        )
+          .reduceLeft((left, right) =>
+            left.join(right, usingColumn = HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME))

Review comment:
       Keep in mind that here we only join columns that the table is clustered by, so the number of joined columns is likely bounded by ~10.
   
   So, frankly, I don't think this will be a bottleneck unless we're talking about gargantuan tables (with tens of millions of files).
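   The row-to-column transposition described in the diff above, and the per-column join it implies, can be illustrated with a small self-contained sketch (plain Python with hypothetical field names, not Hudi's actual code):
   
   ```python
   # Illustrative sketch of the transpose discussed above: the Column Stats Index
   # stores one row per (file, column), while data skipping wants one record per
   # file with per-column min/max/null-count fields. Field names are hypothetical.
   
   rows = [
       {"fileName": "one_base_file.parquet",     "columnName": "A", "minValue": 1,   "maxValue": 10, "nullCount": 0},
       {"fileName": "another_base_file.parquet", "columnName": "A", "minValue": -10, "maxValue": 0,  "nullCount": 5},
   ]
   
   def transpose(stat_rows, referenced_columns):
       """Pivot (file, column) stat rows into one record per file, keeping only
       columns actually referenced by the query filters -- mirroring the
       reduceLeft of per-column joins in the Spark flow, which is bounded by
       the number of referenced columns."""
       by_file = {}
       for row in stat_rows:
           col = row["columnName"]
           if col not in referenced_columns:
               continue
           record = by_file.setdefault(row["fileName"], {"file": row["fileName"]})
           record[col + "_minValue"] = row["minValue"]
           record[col + "_maxValue"] = row["maxValue"]
           record[col + "_nullCount"] = row["nullCount"]
       return list(by_file.values())
   
   transposed = transpose(rows, {"A"})
   ```
   
   Each referenced column contributes one join in the real Spark flow, which is why a bound of roughly ten clustered-by columns keeps the cost modest.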
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r822163833



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/util/TablePathUtils.java
##########
@@ -92,12 +92,21 @@ private static boolean isInsideTableMetadataFolder(String path) {
         HoodiePartitionMetadata metadata = new HoodiePartitionMetadata(fs, partitionPath);
         metadata.readFromFS();
         return Option.of(getNthParent(partitionPath, metadata.getPartitionDepth()));
+      } else {
+        // Simply traverse directory structure until found .hoodie folder
+        Path current = partitionPath;
+        while (current != null) {
+          if (hasTableMetadataFolder(fs, current)) {

Review comment:
       Correct. This is only useful in the discovery phase.
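
   As a rough illustration, the fallback traversal under discussion amounts to walking up the directory tree until a directory containing the `.hoodie` metadata folder is found (a Python sketch under assumed semantics, not the actual Java implementation):
   
   ```python
   from pathlib import Path
   
   def find_table_base_path(start: Path):
       """Walk upward from a partition path until a directory containing the
       `.hoodie` metadata folder is found; return None on reaching the
       filesystem root without finding one."""
       current = start
       while True:
           if (current / ".hoodie").is_dir():
               return current
           if current.parent == current:  # reached the filesystem root
               return None
           current = current.parent
   ```
   
   As the comment notes, this is only appropriate for discovery: once the table base path is known, callers should not pay the per-level directory checks again.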







[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1064707297


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523",
       "triggerID" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6669",
       "triggerID" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "triggerType" : "PUSH"
     }, {
       "hash" : "14366cac6e233cb85ee94307a7f62f6184ed5b34",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6812",
       "triggerID" : "14366cac6e233cb85ee94307a7f62f6184ed5b34",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 4421752bef3dd3b53cd896f7d3ca23bb49d22034 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6669) 
   * 14366cac6e233cb85ee94307a7f62f6184ed5b34 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6812) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065795937


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] codope commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r821266033



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {

Review comment:
       `fs.exists` is going to be invoked on every call if data skipping is enabled. This will hurt performance, as we observed in Presto. We should try to avoid it; I think we should just assume that the metadata table exists and error out if it doesn't.
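
   One way to address this concern — paying the existence check at most once per index instance — is to memoize it (a hypothetical sketch, not necessarily the approach taken in the PR; note the cached result can go stale if the metadata table is created later):
   
   ```python
   class MetadataTableChecker:
       """Caches the result of a potentially expensive filesystem existence
       check so that repeated file-listing calls do not re-hit storage."""
   
       def __init__(self, exists_fn, metadata_table_path):
           self._exists_fn = exists_fn          # e.g. a wrapper around fs.exists
           self._path = metadata_table_path
           self._exists = None                  # unknown until first use
   
       def exists(self):
           if self._exists is None:
               self._exists = self._exists_fn(self._path)
           return self._exists
   
   # Demonstrate that the underlying check runs only once.
   calls = []
   def fake_exists(path):
       calls.append(path)
       return True
   
   checker = MetadataTableChecker(fake_exists, "/tbl/.hoodie/metadata")
   first = checker.exists()
   second = checker.exists()
   ```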

##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/DataSkippingUtils.scala
##########
@@ -259,34 +260,4 @@ object DataSkippingUtils extends Logging {
         throw new AnalysisException(s"convert reference to name failed,  Found unsupported expression ${other}")
     }
   }
-
-  def getIndexFiles(conf: Configuration, indexPath: String): Seq[FileStatus] = {

Review comment:
       Why remove these two methods? Are they not being used anywhere?

##########
File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieFileIndex.scala
##########
@@ -333,6 +333,57 @@ class TestHoodieFileIndex extends HoodieClientTestBase {
     assert(fileIndex.getAllQueryPartitionPaths.get(0).path.equals("c"))
   }
 
+  @Test
+  def testDataSkippingWhileFileListing(): Unit = {
+    val r = new Random(0xDEED)
+    val tuples = for (i <- 1 to 1000) yield (i, 1000 - i, r.nextString(5), r.nextInt(4))
+
+    val _spark = spark
+    import _spark.implicits._
+    val inputDF = tuples.toDF("id", "inv_id", "str", "rand")
+
+    val opts = Map(
+      "hoodie.insert.shuffle.parallelism" -> "4",
+      "hoodie.upsert.shuffle.parallelism" -> "4",
+      HoodieWriteConfig.TBL_NAME.key -> "hoodie_test",
+      RECORDKEY_FIELD.key -> "id",
+      PRECOMBINE_FIELD.key -> "id",
+      HoodieMetadataConfig.ENABLE.key -> "true",
+      HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key -> "true",
+      HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS_FOR_ALL_COLUMNS.key -> "true",
+      HoodieTableConfig.POPULATE_META_FIELDS.key -> "true"
+    )
+
+    inputDF.repartition(4)
+      .write
+      .format("hudi")
+      .options(opts)
+      .option(DataSourceWriteOptions.OPERATION.key, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+      .option(HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key, 100 * 1024)
+      .mode(SaveMode.Overwrite)
+      .save(basePath)
+
+    metaClient = HoodieTableMetaClient.reload(metaClient)
+
+    val props = Map[String, String](
+      "path" -> basePath,
+      QUERY_TYPE.key -> QUERY_TYPE_SNAPSHOT_OPT_VAL,
+      DataSourceReadOptions.ENABLE_DATA_SKIPPING.key -> "true"
+    )
+
+    val fileIndex = HoodieFileIndex(spark, metaClient, Option.empty, props, NoopCache)
+
+    val allFilesPartitions = fileIndex.listFiles(Seq(), Seq())
+    assertEquals(10, allFilesPartitions.head.files.length)
+
+    // We're selecting a single file that contains "id" == 1 row, which there should be
+    // strictly 1. Given that 1 is minimal possible value, Data Skipping should be able to
+    // truncate search space to just a single file
+    val dataFilter = EqualTo(AttributeReference("id", IntegerType, nullable = false)(), Literal(1))

Review comment:
       Perhaps we can add tests for more expressions apart from `EqualTo`.
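
   For context, the pruning this test exercises — and what covering expressions beyond `EqualTo` would check — reduces to conservative interval tests against each file's min/max stats (a simplified sketch, not Hudi's actual `DataSkippingUtils`):
   
   ```python
   def may_contain(stats, predicate):
       """Conservatively decide whether a file may contain matching rows,
       given per-file (min_value, max_value) stats. Returning True never
       loses correctness; returning False prunes the file."""
       lo, hi = stats
       op, value = predicate
       if op == "==":
           return lo <= value <= hi
       if op == "<":
           return lo < value
       if op == ">":
           return hi > value
       return True  # unknown predicate: cannot safely prune
   
   # id == 1 can only match the file whose range covers 1.
   files = {"f1.parquet": (1, 10), "f2.parquet": (11, 1000)}
   candidates = [f for f, s in files.items() if may_contain(s, ("==", 1))]
   ```
   
   Tests for `LessThan`, `GreaterThan`, and null-count-based predicates would each exercise a different branch of this logic.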

##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {
+      Option.empty
+    } else {
+      val targetColStatsIndexColumns = Seq(
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT)
+
+      val requiredMetadataIndexColumns =
+        (targetColStatsIndexColumns :+ HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).map(colName =>
+          s"${HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS}.${colName}")
+
+      // Read Metadata Table's Column Stats Index into Spark's [[DataFrame]]
+      val metadataTableDF = spark.read.format("org.apache.hudi")
+        .load(s"$metadataTablePath/${MetadataPartitionType.COLUMN_STATS.getPartitionPath}")
+
+      // TODO filter on (column, partition) prefix
+      val colStatsDF = metadataTableDF.where(col(HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS).isNotNull)
+        .select(requiredMetadataIndexColumns.map(col): _*)
+
+      val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+      // Persist DF to avoid re-computing column statistics unraveling
+      withPersistence(colStatsDF) {

Review comment:
       +1 for persistence

##########
File path: hudi-common/src/main/avro/HoodieMetadata.avsc
##########
@@ -109,6 +109,14 @@
                                 "string"
                             ]
                         },
+                        {
+                            "doc": "Column name for which this column statistics applies",

Review comment:
       We'll have to revisit this change when we tackle schema evolution. Can you please track it in a JIRA?

##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -119,18 +121,14 @@ case class HoodieFileIndex(spark: SparkSession,
     //    - Col-Stats Index is present
     //    - List of predicates (filters) is present
     val candidateFilesNamesOpt: Option[Set[String]] =
-      lookupCandidateFilesInColStatsIndex(dataFilters) match {
+      lookupCandidateFilesInMetadataTable(dataFilters) match {
         case Success(opt) => opt
         case Failure(e) =>
-          if (e.isInstanceOf[AnalysisException]) {
-            logDebug("Failed to relay provided data filters to Z-index lookup", e)
-          } else {
-            logError("Failed to lookup candidate files in Z-index", e)
-          }
+          logError("Failed to lookup candidate files in Z-index", e)

Review comment:
       Maybe change `Z-index` to `column stats index` in this error message as well.







[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058571835


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 89490cb2d45a5b9a4a097a14ddcfa38016f84db1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058682395


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 89490cb2d45a5b9a4a097a14ddcfa38016f84db1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515) 
   * 29d309812b4f7118dc2fdbe7d558fa4f2f697739 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1061302026


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523",
       "triggerID" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6669",
       "triggerID" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2779d95457c76f4726615a153a1acf26b24836e2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523) 
   * 4421752bef3dd3b53cd896f7d3ca23bb49d22034 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6669) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1061332043


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523",
       "triggerID" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6669",
       "triggerID" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 4421752bef3dd3b53cd896f7d3ca23bb49d22034 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6669) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058818634


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523",
       "triggerID" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2779d95457c76f4726615a153a1acf26b24836e2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065775688


   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065618195


   ## CI report:
   
   * 14366cac6e233cb85ee94307a7f62f6184ed5b34 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6812) 
   * 97c07beea71959b984fc69e8f5c0da2b251217fc UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065619464


   ## CI report:
   
   * 14366cac6e233cb85ee94307a7f62f6184ed5b34 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6812) 
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065795525


   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] yihua commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
yihua commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r827244755



##########
File path: hudi-common/src/main/avro/HoodieMetadata.avsc
##########
@@ -109,6 +109,14 @@
                                 "string"
                             ]
                         },
+                        {
+                            "doc": "Column name for which this column statistics applies",

Review comment:
       I'm not saying it would break, and based on the context provided it should be supported by schema evolution, so I'm good with it.







[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r822043839



##########
File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieFileIndex.scala
##########
@@ -333,6 +333,57 @@ class TestHoodieFileIndex extends HoodieClientTestBase {
     assert(fileIndex.getAllQueryPartitionPaths.get(0).path.equals("c"))
   }
 
+  @Test
+  def testDataSkippingWhileFileListing(): Unit = {
+    val r = new Random(0xDEED)
+    val tuples = for (i <- 1 to 1000) yield (i, 1000 - i, r.nextString(5), r.nextInt(4))
+
+    val _spark = spark
+    import _spark.implicits._
+    val inputDF = tuples.toDF("id", "inv_id", "str", "rand")
+
+    val opts = Map(
+      "hoodie.insert.shuffle.parallelism" -> "4",
+      "hoodie.upsert.shuffle.parallelism" -> "4",
+      HoodieWriteConfig.TBL_NAME.key -> "hoodie_test",
+      RECORDKEY_FIELD.key -> "id",
+      PRECOMBINE_FIELD.key -> "id",
+      HoodieMetadataConfig.ENABLE.key -> "true",
+      HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key -> "true",
+      HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS_FOR_ALL_COLUMNS.key -> "true",
+      HoodieTableConfig.POPULATE_META_FIELDS.key -> "true"
+    )
+
+    inputDF.repartition(4)
+      .write
+      .format("hudi")
+      .options(opts)
+      .option(DataSourceWriteOptions.OPERATION.key, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+      .option(HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key, 100 * 1024)
+      .mode(SaveMode.Overwrite)
+      .save(basePath)
+
+    metaClient = HoodieTableMetaClient.reload(metaClient)
+
+    val props = Map[String, String](
+      "path" -> basePath,
+      QUERY_TYPE.key -> QUERY_TYPE_SNAPSHOT_OPT_VAL,
+      DataSourceReadOptions.ENABLE_DATA_SKIPPING.key -> "true"
+    )
+
+    val fileIndex = HoodieFileIndex(spark, metaClient, Option.empty, props, NoopCache)
+
+    val allFilesPartitions = fileIndex.listFiles(Seq(), Seq())
+    assertEquals(10, allFilesPartitions.head.files.length)
+
+    // We're selecting the single file that contains the "id" == 1 row, of which there should be
+    // strictly one. Given that 1 is the minimal possible value, Data Skipping should be able to
+    // truncate the search space to just a single file
+    val dataFilter = EqualTo(AttributeReference("id", IntegerType, nullable = false)(), Literal(1))

Review comment:
       This is tested separately in the `TestDataSkippingUtils` unit tests.




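[Editor's note] The assertion in the test above hinges on the Column Stats Index keeping per-file min/max values, so an equality predicate can prune the file list down to the files whose value range could contain the literal. A minimal, self-contained Scala sketch of that pruning logic follows; the `ColStats` case class and the ten-file layout are illustrative assumptions, not Hudi's actual classes:

```scala
// Hypothetical per-file column statistics, mirroring the shape the MT Column
// Stats Index stores (file name, min value, max value). Illustrative only.
case class ColStats(fileName: String, minValue: Int, maxValue: Int)

// A file is a candidate for the predicate `col == v` only if v falls within
// the file's [min, max] range; all other files can be skipped.
def candidateFiles(stats: Seq[ColStats], v: Int): Seq[String] =
  stats.filter(s => s.minValue <= v && v <= s.maxValue).map(_.fileName)

// Ten files covering ids 1..1000 in disjoint ranges of 100, roughly
// matching the test's data layout.
val stats = (0 until 10).map(i => ColStats(s"file-$i", i * 100 + 1, (i + 1) * 100))

// id == 1 is the minimal possible value, so pruning keeps only file-0.
val pruned = candidateFiles(stats, 1)
```

With overlapping or unsorted file ranges the same check still holds; it simply returns every file whose range covers the literal, which is why the test can assert an exact count only because its ranges are disjoint.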



[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1061332043


   ## CI report:
   
   * 4421752bef3dd3b53cd896f7d3ca23bb49d22034 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6669) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1064705848


   ## CI report:
   
   * 4421752bef3dd3b53cd896f7d3ca23bb49d22034 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6669) 
   * 14366cac6e233cb85ee94307a7f62f6184ed5b34 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1064773990


   ## CI report:
   
   * 14366cac6e233cb85ee94307a7f62f6184ed5b34 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6812) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065795937


   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] alexeykudinkin commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065798362


   @hudi-bot run azure





[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058564344


   ## CI report:
   
   * 89490cb2d45a5b9a4a097a14ddcfa38016f84db1 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058684245


   ## CI report:
   
   * 89490cb2d45a5b9a4a097a14ddcfa38016f84db1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515) 
   * 29d309812b4f7118dc2fdbe7d558fa4f2f697739 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r822165683



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {
+      Option.empty
+    } else {
+      val targetColStatsIndexColumns = Seq(
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT)
+
+      val requiredMetadataIndexColumns =
+        (targetColStatsIndexColumns :+ HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).map(colName =>
+          s"${HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS}.${colName}")
+
+      // Read Metadata Table's Column Stats Index into Spark's [[DataFrame]]
+      val metadataTableDF = spark.read.format("org.apache.hudi")
+        .load(s"$metadataTablePath/${MetadataPartitionType.COLUMN_STATS.getPartitionPath}")
+
+      // TODO filter on (column, partition) prefix
+      val colStatsDF = metadataTableDF.where(col(HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS).isNotNull)
+        .select(requiredMetadataIndexColumns.map(col): _*)

Review comment:
       I'm going to start with the latter: I don't think we're planning to support both of these, since the bespoke ColStats index was purely a stop-gap solution until we got the primary MT index.
   
   Having said that, I don't see this being used commonly enough for us to promote it into the `HoodieTableMetadata` API: keep in mind that this table format is very Data Skipping specific, and I don't think it is very useful outside of that.




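[Editor's note] The diff discussed above reads the metadata table's Column Stats partition, which stores one record per (file, column) pair; before predicates can be evaluated, those records have to be transposed into one entry per file carrying each column's min/max. A rough, self-contained Scala sketch of that transposition under assumed simplified types (`ColStatsRecord` is illustrative, not the actual `HoodieMetadataPayload` schema):

```scala
// Hypothetical flattened Column Stats Index record: one per (file, column)
// pair, as the COLUMN_STATS metadata partition stores them. Illustrative only.
case class ColStatsRecord(fileName: String, columnName: String, min: Int, max: Int)

// Transpose into one entry per file, mapping column name -> (min, max),
// which is the shape data-skipping predicates are evaluated against.
def transpose(records: Seq[ColStatsRecord]): Map[String, Map[String, (Int, Int)]] =
  records.groupBy(_.fileName).map { case (file, rows) =>
    file -> rows.map(r => r.columnName -> (r.min, r.max)).toMap
  }

val records = Seq(
  ColStatsRecord("f1", "id", 1, 100), ColStatsRecord("f1", "inv_id", 901, 1000),
  ColStatsRecord("f2", "id", 101, 200), ColStatsRecord("f2", "inv_id", 801, 900))

val byFile = transpose(records)
// byFile("f1") now holds min/max for every indexed column of file f1.
```

In the actual PR this reshaping happens on Spark DataFrames rather than in-memory collections, but the grouping-by-file-then-pivoting-by-column structure is the same idea.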



[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1061302026


   ## CI report:
   
   * 2779d95457c76f4726615a153a1acf26b24836e2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523) 
   * 4421752bef3dd3b53cd896f7d3ca23bb49d22034 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6669) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058818634


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523",
       "triggerID" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2779d95457c76f4726615a153a1acf26b24836e2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] codope commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r826527777



##########
File path: hudi-common/src/main/avro/HoodieMetadata.avsc
##########
@@ -109,6 +109,14 @@
                                 "string"
                             ]
                         },
+                        {
+                            "doc": "Column name for which this column statistics applies",

Review comment:
       Why would writes break? Adding a field is a valid schema evolution that we support, right?
   For reads, maybe we just handle this gracefully: if this field is not present in the metadata table, then fall back to the usual query path (without data skipping). 
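   A minimal sketch of the graceful-fallback idea for reads (all names here — `readColStatsSchema`, `pruneWithColStats` — are hypothetical helpers for illustration, not actual Hudi APIs):

```scala
// Hypothetical sketch: if the new "columnName" field is missing from the
// column-stats schema (e.g. the table was written by an older release),
// disable data skipping instead of failing the read.
def candidateFiles(allFiles: Set[String], filters: Seq[String]): Set[String] = {
  val schemaFields: Set[String] = readColStatsSchema() // hypothetical helper
  if (!schemaFields.contains("columnName")) {
    // Fall back to the usual query path: no pruning, scan all files.
    allFiles
  } else {
    pruneWithColStats(allFiles, filters) // hypothetical helper
  }
}
```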

##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {

Review comment:
       We can use the table config to determine which MT partitions are available for reading. Can you please track this in a JIRA?







[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065619464


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523",
       "triggerID" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6669",
       "triggerID" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "triggerType" : "PUSH"
     }, {
       "hash" : "14366cac6e233cb85ee94307a7f62f6184ed5b34",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6812",
       "triggerID" : "14366cac6e233cb85ee94307a7f62f6184ed5b34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 14366cac6e233cb85ee94307a7f62f6184ed5b34 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6812) 
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1064773990


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523",
       "triggerID" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6669",
       "triggerID" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "triggerType" : "PUSH"
     }, {
       "hash" : "14366cac6e233cb85ee94307a7f62f6184ed5b34",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6812",
       "triggerID" : "14366cac6e233cb85ee94307a7f62f6184ed5b34",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 14366cac6e233cb85ee94307a7f62f6184ed5b34 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6812) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058787817


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523",
       "triggerID" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 29d309812b4f7118dc2fdbe7d558fa4f2f697739 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516) 
   * 2779d95457c76f4726615a153a1acf26b24836e2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058564344


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 89490cb2d45a5b9a4a097a14ddcfa38016f84db1 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] yihua commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
yihua commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r822042309



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)

Review comment:
       Is the plan to deprecate existing column stats under `.hoodie/.colstatsindex` and remove all usage of it in 0.11?  If not, should we have two modes where metadata col stats index is used when metadata table is enabled, and `.hoodie/.colstatsindex` is used if metadata table is disabled?

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/util/TablePathUtils.java
##########
@@ -92,12 +92,21 @@ private static boolean isInsideTableMetadataFolder(String path) {
         HoodiePartitionMetadata metadata = new HoodiePartitionMetadata(fs, partitionPath);
         metadata.readFromFS();
         return Option.of(getNthParent(partitionPath, metadata.getPartitionDepth()));
+      } else {
+        // Simply traverse directory structure until found .hoodie folder
+        Path current = partitionPath;
+        while (current != null) {
+          if (hasTableMetadataFolder(fs, current)) {

Review comment:
       One caveat is that this may incur more than one `fs.exists()` call. Is this only used for initialization (which is fine), e.g., getting the table path from config, and not for the core per-data-file read/write logic?

##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {
+      Option.empty
+    } else {
+      val targetColStatsIndexColumns = Seq(
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT)
+
+      val requiredMetadataIndexColumns =
+        (targetColStatsIndexColumns :+ HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).map(colName =>
+          s"${HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS}.${colName}")
+
+      // Read Metadata Table's Column Stats Index into Spark's [[DataFrame]]
+      val metadataTableDF = spark.read.format("org.apache.hudi")
+        .load(s"$metadataTablePath/${MetadataPartitionType.COLUMN_STATS.getPartitionPath}")
+
+      // TODO filter on (column, partition) prefix
+      val colStatsDF = metadataTableDF.where(col(HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS).isNotNull)
+        .select(requiredMetadataIndexColumns.map(col): _*)
+
+      val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+      // Persist DF to avoid re-computing column statistics unraveling
+      withPersistence(colStatsDF) {
+        // Metadata Table bears rows in the following format
+        //
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  |        fileName           | columnName |  minValue  |  maxValue  |  num_nulls  |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          A |          1 |         10 |           0 |
+        //  | another_base_file.parquet |          A |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //
+        // While Data Skipping utils are expecting following (transposed) format, where per-column stats are
+        // essentially transposed (from rows to columns):
+        //
+        //  +---------------------------+------------+------------+-------------+
+        //  |          file             | A_minValue | A_maxValue | A_num_nulls |
+        //  +---------------------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          1 |         10 |           0 |
+        //  | another_base_file.parquet |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+-------------+
+        //
+        // NOTE: Column Stats Index might potentially contain statistics for many columns (if not all), while
+        //       query at hand might only be referencing a handful of those. As such, we collect all the
+        //       column references from the filtering expressions, and only transpose records corresponding to the
+        //       columns referenced in those
+        val transposedColStatsDF =
+        queryReferencedColumns.map(colName =>
+          colStatsDF.filter(col(HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).equalTo(colName))
+            .select(targetColStatsIndexColumns.map(col): _*)
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT, getNumNullsColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE, getMinColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE, getMaxColumnNameFor(colName))
+        )
+          .reduceLeft((left, right) =>
+            left.join(right, usingColumn = HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME))

Review comment:
       This may not scale well with a large number of columns referenced by predicates, as DataFrame joins are expensive even with caching. I'm wondering if a different DAG should be written for the metadata table col stats, i.e., one row of col stats per file + column. Conceptually, I think such joining can be avoided when pruning the files.
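   One possible shape for such a DAG (a sketch only; the column names follow the ASCII tables in the comment being reviewed and are not necessarily the real payload field names): instead of joining one filtered DataFrame per referenced column, group the per-(file, column) stats rows by file and pivot on the column name, producing the transposed layout in a single shuffle:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch: transpose (fileName, columnName, minValue, maxValue, num_nulls)
// rows into one row per file, without a join per referenced column.
def transposeColStats(colStatsDF: DataFrame, referencedCols: Seq[String]): DataFrame =
  colStatsDF
    // Keep only the columns the query's predicates actually reference
    .filter(col("columnName").isin(referencedCols: _*))
    .groupBy(col("fileName"))
    // Supplying the pivot values up front avoids an extra distinct() pass
    .pivot("columnName", referencedCols)
    .agg(
      first("minValue").as("minValue"),
      first("maxValue").as("maxValue"),
      first("num_nulls").as("num_nulls"))
// Spark names the pivoted output columns e.g. A_minValue, A_maxValue,
// A_num_nulls -- one shuffle in total, versus one join per column.
```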

##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)

Review comment:
       I also need clarification on how `.hoodie/.colstatsindex` is generated. Does it come from clustering, or is it also updated on every write?

##########
File path: hudi-common/src/main/avro/HoodieMetadata.avsc
##########
@@ -109,6 +109,14 @@
                                 "string"
                             ]
                         },
+                        {
+                            "doc": "Column name for which this column statistics applies",

Review comment:
       Adding to @codope's point, does this break the read/write of metadata records in an existing metadata table if users enabled it in older releases, e.g., 0.10.0 and 0.10.1?

##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {
+      Option.empty
+    } else {
+      val targetColStatsIndexColumns = Seq(
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT)
+
+      val requiredMetadataIndexColumns =
+        (targetColStatsIndexColumns :+ HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).map(colName =>
+          s"${HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS}.${colName}")
+
+      // Read Metadata Table's Column Stats Index into Spark's [[DataFrame]]
+      val metadataTableDF = spark.read.format("org.apache.hudi")
+        .load(s"$metadataTablePath/${MetadataPartitionType.COLUMN_STATS.getPartitionPath}")
+
+      // TODO filter on (column, partition) prefix
+      val colStatsDF = metadataTableDF.where(col(HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS).isNotNull)
+        .select(requiredMetadataIndexColumns.map(col): _*)

Review comment:
       Should the logic of fetching column stats into a DataFrame be incorporated into `BaseTableMetadata`, since there is already another `getColumnStats()` API? That way it may also be possible to make the logic here metadata-table-agnostic, and instead rely on BaseTableMetadata/HoodieTableMetadata to decide which source (`.hoodie/.colstatsindex` on the filesystem vs. the metadata table) to use.

##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {
+      Option.empty
+    } else {
+      val targetColStatsIndexColumns = Seq(
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT)
+
+      val requiredMetadataIndexColumns =
+        (targetColStatsIndexColumns :+ HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).map(colName =>
+          s"${HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS}.${colName}")
+
+      // Read Metadata Table's Column Stats Index into Spark's [[DataFrame]]
+      val metadataTableDF = spark.read.format("org.apache.hudi")
+        .load(s"$metadataTablePath/${MetadataPartitionType.COLUMN_STATS.getPartitionPath}")
+
+      // TODO filter on (column, partition) prefix
+      val colStatsDF = metadataTableDF.where(col(HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS).isNotNull)
+        .select(requiredMetadataIndexColumns.map(col): _*)
+
+      val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+      // Persist DF to avoid re-computing column statistics unraveling
+      withPersistence(colStatsDF) {
+        // Metadata Table bears rows in the following format
+        //
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  |        fileName           | columnName |  minValue  |  maxValue  |  num_nulls  |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          A |          1 |         10 |           0 |
+        //  | another_base_file.parquet |          A |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //
+        // While Data Skipping utils are expecting following (transposed) format, where per-column stats are
+        // essentially transposed (from rows to columns):
+        //
+        //  +---------------------------+------------+------------+-------------+
+        //  |          file             | A_minValue | A_maxValue | A_num_nulls |
+        //  +---------------------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          1 |         10 |           0 |
+        //  | another_base_file.parquet |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+-------------+
+        //
+        // NOTE: Column Stats Index might potentially contain statistics for many columns (if not all), while
+        //       query at hand might only be referencing a handful of those. As such, we collect all the
+        //       column references from the filtering expressions, and only transpose records corresponding to the
+        //       columns referenced in those
+        val transposedColStatsDF =

Review comment:
       I guess the purpose of the transposition here is to adapt to the expected input format of the existing data-skipping APIs?

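To make the transposition concrete, here is a minimal, self-contained sketch in plain Python (field names mirror the diff above, but the helper and its shapes are illustrative, not Hudi APIs): it pivots one-row-per-(file, column) stats into one-row-per-file with `<col>_minValue` / `<col>_maxValue` / `<col>_nullCount` columns, only for columns the query references.

```python
# Column Stats Index rows: one record per (file, column) pair,
# mirroring the table in the code comment above.
rows = [
    {"fileName": "one_base_file.parquet", "columnName": "A",
     "minValue": 1, "maxValue": 10, "nullCount": 0},
    {"fileName": "another_base_file.parquet", "columnName": "A",
     "minValue": -10, "maxValue": 0, "nullCount": 5},
]

def transpose(stats_rows, referenced_columns):
    """Pivot (file, column) stats rows into one row per file, with
    <col>_minValue / <col>_maxValue / <col>_nullCount columns."""
    by_file = {}
    for r in stats_rows:
        c = r["columnName"]
        if c not in referenced_columns:
            # Only transpose columns the query filters actually reference
            continue
        row = by_file.setdefault(r["fileName"], {"file": r["fileName"]})
        row[f"{c}_minValue"] = r["minValue"]
        row[f"{c}_maxValue"] = r["maxValue"]
        row[f"{c}_nullCount"] = r["nullCount"]
    return list(by_file.values())

transposed = transpose(rows, {"A"})
```

In the actual PR the same pivot is expressed as a per-column `filter`/`withColumnRenamed` followed by joins on the file name.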



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r822170796



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {
+      Option.empty
+    } else {
+      val targetColStatsIndexColumns = Seq(
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE,
+        HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT)
+
+      val requiredMetadataIndexColumns =
+        (targetColStatsIndexColumns :+ HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).map(colName =>
+          s"${HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS}.${colName}")
+
+      // Read Metadata Table's Column Stats Index into Spark's [[DataFrame]]
+      val metadataTableDF = spark.read.format("org.apache.hudi")
+        .load(s"$metadataTablePath/${MetadataPartitionType.COLUMN_STATS.getPartitionPath}")
+
+      // TODO filter on (column, partition) prefix
+      val colStatsDF = metadataTableDF.where(col(HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS).isNotNull)
+        .select(requiredMetadataIndexColumns.map(col): _*)
+
+      val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+      // Persist the DF to avoid re-computing the column statistics unraveling for every referenced column
+      withPersistence(colStatsDF) {
+        // Metadata Table bears rows in the following format
+        //
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  |        fileName           | columnName |  minValue  |  maxValue  |  num_nulls  |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          A |          1 |         10 |           0 |
+        //  | another_base_file.parquet |          A |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+------------+-------------+
+        //
+        // While the Data Skipping utils expect the following (transposed) format, where per-column
+        // stats are pivoted from rows into columns:
+        //
+        //  +---------------------------+------------+------------+-------------+
+        //  |          file             | A_minValue | A_maxValue | A_num_nulls |
+        //  +---------------------------+------------+------------+-------------+
+        //  | one_base_file.parquet     |          1 |         10 |           0 |
+        //  | another_base_file.parquet |        -10 |          0 |           5 |
+        //  +---------------------------+------------+------------+-------------+
+        //
+        // NOTE: Column Stats Index might potentially contain statistics for many columns (if not all), while
+        //       query at hand might only be referencing a handful of those. As such, we collect all the
+        //       column references from the filtering expressions, and only transpose records corresponding to the
+        //       columns referenced in those
+        val transposedColStatsDF =
+        queryReferencedColumns.map(colName =>
+          colStatsDF.filter(col(HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).equalTo(colName))
+            .select(targetColStatsIndexColumns.map(col): _*)
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT, getNumNullsColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE, getMinColumnNameFor(colName))
+            .withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE, getMaxColumnNameFor(colName))
+        )
+          .reduceLeft((left, right) =>
+            left.join(right, usingColumn = HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME))

Review comment:
       Conceptually, we can't avoid the join: ultimately, to decide whether a file is accepted, we have to AND the verdicts of the individual columns (i.e. every referenced column has to satisfy its respective filter), which implicitly requires a join by the file name one way or the other.

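The AND-of-per-column-verdicts point can be sketched in a few lines of plain Python (names and shapes are illustrative only): each column's filter yields a set of candidate files, and a file survives pruning only if it appears in every set, which is effectively a join on the file name.

```python
# Per-(file, column) stats, as stored in the Column Stats Index:
# (file, column, min, max)
stats = [
    ("f1.parquet", "A", 1, 10),
    ("f2.parquet", "A", -10, 0),
    ("f1.parquet", "B", 5, 7),
    ("f2.parquet", "B", 100, 200),
]

def candidates(column, predicate):
    """Files whose [min, max] range for `column` might satisfy `predicate`."""
    return {f for (f, c, mn, mx) in stats if c == column and predicate(mn, mx)}

# Query: A >= 0 AND B <= 10. A file must pass BOTH column checks,
# so we intersect the per-column candidate sets (a join on file name).
pruned = candidates("A", lambda mn, mx: mx >= 0) & \
         candidates("B", lambda mn, mx: mn <= 10)
```

Here `f2.parquet` passes the `A` range check but not the `B` one, so only `f1.parquet` remains a candidate.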






[GitHub] [hudi] yihua commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
yihua commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r823207445



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)

Review comment:
       When you say "current bespoke implementation will be removed in a follow-up", is this before or after the 0.11.0 release?  I think we still need to keep `.hoodie/.colstatsindex` and the data skipping logic based on it, and there should be a flag to choose between that and MT col stats for data skipping.  Because, if a user doesn't choose to enable MT col stats in 0.11.0 and there is no data skipping logic based on `.hoodie/.colstatsindex`, data skipping cannot be done unless the user goes back to 0.10.x.  The old logic can be removed one release after 0.11.0.

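The compatibility concern above amounts to a source selector: a flag decides whether pruning reads the legacy `.hoodie/.colstatsindex` or the MT col stats partition, with a graceful fallback when neither is available. A hedged sketch (flag and source names are hypothetical, not actual Hudi configs):

```python
def choose_skipping_source(use_mt_col_stats, mt_available, legacy_index_available):
    """Pick which column-stats source data skipping should read.

    Flag and source names are illustrative, not actual Hudi configs."""
    if use_mt_col_stats and mt_available:
        return "metadata-table-col-stats"
    if legacy_index_available:
        return "legacy-colstatsindex"
    return None  # no stats available: fall back to reading all files

# User kept MT col stats disabled, but the legacy index exists on disk.
src = choose_skipping_source(False, False, True)
```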






[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r822043217



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {

Review comment:
       Fair enough, we can replace it with a config check of whether MT is enabled







[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1067280735


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864",
       "triggerID" : "1065798362",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6932",
       "triggerID" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6932) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] yihua commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
yihua commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r827244755



##########
File path: hudi-common/src/main/avro/HoodieMetadata.avsc
##########
@@ -109,6 +109,14 @@
                                 "string"
                             ]
                         },
+                        {
+                            "doc": "Column name to which these column statistics apply",

Review comment:
       Not saying it would break, and based on the context provided it should be supported by schema evolution, so I'm good with it.

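For reference, Avro's schema-resolution rules make an added field backward compatible when it is declared as a nullable union with a default, so readers resolving old records against the new schema see `null` instead of failing. An illustrative (not verbatim) shape for the new field:

```json
{
    "name": "columnName",
    "doc": "Column name to which these column statistics apply",
    "type": ["null", "string"],
    "default": null
}
```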






[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r822164221



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)

Review comment:
       Yeah, the current bespoke implementation will be removed in a follow-up. It's currently updated only after clustering completes.







[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r822163559



##########
File path: hudi-common/src/main/avro/HoodieMetadata.avsc
##########
@@ -109,6 +109,14 @@
                                 "string"
                             ]
                         },
+                        {
+                            "doc": "Column name to which these column statistics apply",

Review comment:
       Good call. Data Skipping won't be functional w/o this column, so we will have to call out that folks would need to flush and rebuild their MT if they want to use it with Data Skipping.







[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1067453094


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864",
       "triggerID" : "1065798362",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6932",
       "triggerID" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4db7b511ee9609859f4fac1a24ad960e14dddf2e",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6945",
       "triggerID" : "4db7b511ee9609859f4fac1a24ad960e14dddf2e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6932) 
   * 4db7b511ee9609859f4fac1a24ad960e14dddf2e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6945) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] yihua merged pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
yihua merged pull request #4948:
URL: https://github.com/apache/hudi/pull/4948


   





[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#discussion_r825138951



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)

Review comment:
       Discussed offline: the bespoke implementation of the Col Stats Index will be removed in 0.11

##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -194,77 +192,102 @@ case class HoodieFileIndex(spark: SparkSession,
    * @param queryFilters list of original data filters passed down from querying engine
    * @return list of pruned (data-skipped) candidate base-files' names
    */
-  private def lookupCandidateFilesInColStatsIndex(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    val indexPath = metaClient.getColumnStatsIndexPath
+  private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
     val fs = metaClient.getFs
+    val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
 
-    if (!enableDataSkipping() || !fs.exists(new Path(indexPath)) || queryFilters.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val completedCommits = getActiveTimeline.filterCompletedInstants().getInstants.iterator.asScala.toList.map(_.getTimestamp)
-
-    // Collect all index tables present in `.zindex` folder
-    val candidateIndexTables =
-      fs.listStatus(new Path(indexPath))
-        .filter(_.isDirectory)
-        .map(_.getPath.getName)
-        .filter(completedCommits.contains(_))
-        .sortBy(x => x)
-
-    if (candidateIndexTables.isEmpty) {
-      // scalastyle:off return
-      return Success(Option.empty)
-      // scalastyle:on return
-    }
-
-    val dataFrameOpt = try {
-      Some(spark.read.load(new Path(indexPath, candidateIndexTables.last).toString))
-    } catch {
-      case t: Throwable =>
-        logError("Failed to read col-stats index; skipping", t)
-        None
+    if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {

Review comment:
       @codope on second thought -- there could still be a case where MT is enabled but not bootstrapped yet, so we can't equate MT being enabled in the config with its presence on the FS. Frankly, I don't see a way around `fs.exists` in some shape or form -- if not here, it would happen w/in Spark's Data Source.

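The guard being discussed (attempt MT-backed data skipping only when it is enabled, there are filters to apply, and the metadata table actually exists on storage) can be sketched as follows; the helper and parameter names are hypothetical, and `fs_exists` is passed as a callable precisely because config-enabled does not imply bootstrapped:

```python
def should_attempt_data_skipping(data_skipping_enabled, fs_exists,
                                 metadata_table_path, query_filters):
    """Return True only when an MT-backed Column Stats lookup can help.

    `fs_exists` probes storage, because a table can have MT enabled in
    config before the metadata table is actually bootstrapped."""
    return (data_skipping_enabled
            and bool(query_filters)          # nothing to prune without filters
            and fs_exists(metadata_table_path))

# Example: MT enabled in config but not yet bootstrapped on storage.
r = should_attempt_data_skipping(True, lambda p: False,
                                 "/tbl/.hoodie/metadata", ["A > 0"])
```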






[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065796347


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1064707297


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6523",
       "triggerID" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6669",
       "triggerID" : "4421752bef3dd3b53cd896f7d3ca23bb49d22034",
       "triggerType" : "PUSH"
     }, {
       "hash" : "14366cac6e233cb85ee94307a7f62f6184ed5b34",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6812",
       "triggerID" : "14366cac6e233cb85ee94307a7f62f6184ed5b34",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 4421752bef3dd3b53cd896f7d3ca23bb49d22034 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6669) 
   * 14366cac6e233cb85ee94307a7f62f6184ed5b34 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6812) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1067451750


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864",
       "triggerID" : "1065798362",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6932",
       "triggerID" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "4db7b511ee9609859f4fac1a24ad960e14dddf2e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "4db7b511ee9609859f4fac1a24ad960e14dddf2e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6932) 
   * 4db7b511ee9609859f4fac1a24ad960e14dddf2e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065798462


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864",
       "triggerID" : "1065798362",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065798462


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864",
       "triggerID" : "1065798362",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 97c07beea71959b984fc69e8f5c0da2b251217fc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] alexeykudinkin commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1065798322


   @hudi-bot run azure





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1067280735


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6851",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6863",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6859",
       "triggerID" : "1065783105",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "97c07beea71959b984fc69e8f5c0da2b251217fc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6864",
       "triggerID" : "1065798362",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6932",
       "triggerID" : "fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fff47ce78d9bcd3e01c0bde609af2cb3bb802e3d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6932) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058571835


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 89490cb2d45a5b9a4a097a14ddcfa38016f84db1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4948: [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4948:
URL: https://github.com/apache/hudi/pull/4948#issuecomment-1058786543


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6515",
       "triggerID" : "89490cb2d45a5b9a4a097a14ddcfa38016f84db1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516",
       "triggerID" : "29d309812b4f7118dc2fdbe7d558fa4f2f697739",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "2779d95457c76f4726615a153a1acf26b24836e2",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 29d309812b4f7118dc2fdbe7d558fa4f2f697739 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6516) 
   * 2779d95457c76f4726615a153a1acf26b24836e2 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>

