
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/09 03:37:17 UTC

[GitHub] [hudi] Reimus opened a new issue, #5808: [SUPPORT] Data skipping using Column Stats Bloom does not seem to work at all

Reimus opened a new issue, #5808:
URL: https://github.com/apache/hudi/issues/5808

   After writing a Hudi table with the following Spark command:
   ```
    ds.write
         .format("hudi")
         .mode(SaveMode.Append)
         .option(DataSourceWriteOptions.PRECOMBINE_FIELD.key, "ts")
         .option(DataSourceWriteOptions.RECORDKEY_FIELD.key, "id")
         .option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key, "ym")
         .option(DataSourceWriteOptions.OPERATION.key, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
         .option(HoodieWriteConfig.TBL_NAME.key, tableName)
         .option(DataSourceWriteOptions.RECONCILE_SCHEMA.key, "true")
         .option(DataSourceWriteOptions.TABLE_TYPE.key, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
         .option(HoodieTableConfig.TIMELINE_TIMEZONE.key, HoodieTimelineTimeZone.UTC.name)
         .option(HoodieWriteConfig.SCHEMA_EVOLUTION_ENABLE.key, "true")
         .option(HoodieWriteConfig.AVRO_SCHEMA_VALIDATE_ENABLE.key, "true")
         .option(HoodieWriteConfig.WRITE_CONCURRENCY_MODE.key, WriteConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL.name())
         .option(HoodieIndexConfig.BLOOM_FILTER_TYPE.key, BloomFilterTypeCode.DYNAMIC_V0.name)
         .option(HoodieIndexConfig.BLOOM_FILTER_NUM_ENTRIES_VALUE.key, String.valueOf(100000))
         .option(HoodieIndexConfig.BLOOM_INDEX_USE_METADATA.key, "true")
         .option(HoodieIndexConfig.BLOOM_INDEX_FILTER_DYNAMIC_MAX_ENTRIES.key, String.valueOf(1000000))
         .option(HoodieLockConfig.HIVE_DATABASE_NAME.key, databaseName)
         .option(HoodieLockConfig.HIVE_TABLE_NAME.key, tableName)
         .option(HoodieLockConfig.HIVE_METASTORE_URI.key, env.spark.hiveMetastore)
         .option(HoodieLockConfig.LOCK_PROVIDER_CLASS_NAME.key, classOf[org.apache.hudi.hive.HiveMetastoreBasedLockProvider].getName)
         .option(HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key, String.valueOf(256 * 1024 * 1024))
         .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE.key, String.valueOf(256 * 1024 * 1024))
         .option(HoodieCompactionConfig.AUTO_CLEAN.key, "true")
         .option(HoodieCompactionConfig.FAILED_WRITES_CLEANER_POLICY.key, HoodieFailedWritesCleaningPolicy.LAZY.name)
   
         .option(HoodieCompactionConfig.CLEANER_POLICY.key, HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS.name())
         .option(HoodieCompactionConfig.CLEANER_HOURS_RETAINED.key, String.valueOf(24))
         .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT.key, String.valueOf(104857600))
   
         .option(HoodieMetadataConfig.COLUMN_STATS_INDEX_FOR_COLUMNS.key, "ym,ymd,date,ts,lvl1.ymd,lvl1.lvl2.date")
         .option(HoodieMetadataConfig.BLOOM_FILTER_INDEX_FOR_COLUMNS.key, "id,col1,col2")
         .option(HoodieMetadataConfig.POPULATE_META_FIELDS.key, "true")
         .option(HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key, "true")
         .option(HoodieMetadataConfig.ENABLE_METADATA_INDEX_BLOOM_FILTER.key, "true")
         .option(HoodieMetadataConfig.ENABLE.key, "true")
         .save("/tmp/hudi")
   
   ```
   And reading said table using:
   
   ```
   val s = spark.read.format("hudi").option("hoodie.datasource.query.type","read_optimized").option("hoodie.file.index.enable","true").option("hoodie.enable.data.skipping","true").option("hoodie.metadata.enable","true").
   option("hoodie.metadata.index.column.stats.enable","true").option("","true").option("hoodie.datasource.read.extract.partition.values.from.path","true").load("/tmp/hudi")
   
   s.where('col1==="values").show
   s.where('col3==="values").show
   ```
   Here col1 is listed in BLOOM_FILTER_INDEX_FOR_COLUMNS while col3 is not, and "values" is not expected to be found in the table.
   Both queries scan the same number of files.
   
   **Expected behavior**
   
   Since the value is not expected to be found, the first query should scan only a small number of files (Bloom filter false positives).
   A full table scan is expected for the second query.
   
   **Environment Description**
   
   * Hudi version : 0.11
   
   * Spark version : 3.1.2
   
   * Hadoop version : 3.0.X
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on issue #5808: [SUPPORT] Data skipping using Column Stats Bloom does not seem to work at all

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #5808:
URL: https://github.com/apache/hudi/issues/5808#issuecomment-1217015237

   thanks! 



[GitHub] [hudi] alexeykudinkin commented on issue #5808: [SUPPORT] Data skipping using Column Stats Bloom does not seem to work at all

Posted by GitBox <gi...@apache.org>.

alexeykudinkin commented on issue #5808:
URL: https://github.com/apache/hudi/issues/5808#issuecomment-1152772715

   Alright, on a second pass I think I understand your concern.
   
   Here is my explanation: Data Skipping is currently integrated only with the Column Stats Index and does not leverage the Bloom Filter Index at all. Therefore, the Column Stats Index can effectively prune the search space only for files whose range of "col1" values does not include the value "values".
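   
   The range check described above can be sketched as follows. This is a minimal illustration of min/max pruning, not Hudi's actual implementation: `FileStats`, `MinMaxPruning`, and the file names are hypothetical, and the statistics are assumed to be plain lexicographic string bounds.
   
   ```scala
   // Hypothetical per-file column statistics; not Hudi's metadata table schema.
   final case class FileStats(file: String, min: String, max: String)
   
   object MinMaxPruning {
     // A file can be skipped for `col = value` only when value lies outside [min, max].
     def mayContain(stats: FileStats, value: String): Boolean =
       stats.min.compareTo(value) <= 0 && value.compareTo(stats.max) <= 0
   
     // Files that still have to be read for an equality predicate on the column.
     def candidates(files: Seq[FileStats], value: String): Seq[String] =
       files.filter(mayContain(_, value)).map(_.file)
   }
   ```
   
   With ranges `["a", "m"]`, `["n", "z"]`, and `["b", "y"]`, a lookup for "values" can skip only the first file, even if the value is present in none of them.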



[GitHub] [hudi] alexeykudinkin commented on issue #5808: [SUPPORT] Data skipping using Column Stats Bloom does not seem to work at all

Posted by GitBox <gi...@apache.org>.

alexeykudinkin commented on issue #5808:
URL: https://github.com/apache/hudi/issues/5808#issuecomment-1152771115

   @Reimus thanks for reporting this!
   
   Can you please add the schema of your table and elaborate on why you believe Data Skipping isn't working as expected? 



[GitHub] [hudi] alexeykudinkin commented on issue #5808: [SUPPORT] Data skipping using Column Stats Bloom does not seem to work at all

Posted by GitBox <gi...@apache.org>.

alexeykudinkin commented on issue #5808:
URL: https://github.com/apache/hudi/issues/5808#issuecomment-1167715468

   @Zhangshunyu awesome! Glad that it got resolved for you



[GitHub] [hudi] alexeykudinkin commented on issue #5808: [SUPPORT] Data skipping using Column Stats Bloom does not seem to work at all

Posted by GitBox <gi...@apache.org>.

alexeykudinkin commented on issue #5808:
URL: https://github.com/apache/hudi/issues/5808#issuecomment-1154216217

   No worries! We are actually looking to support the Bloom-filter index in Data Skipping eventually as well, but this will def be a non-trivial challenge given the sheer difference in size b/w the Bloom-filter index and the Column Stats Index, even for moderately sized tables.
   
   > think of a customer uuid column, for example: since it is a random string, column stats would be relatively useless, but a bloom filter could skip 99% of all files when looking for a particular uuid.
   > Or am I missing something about how the column stats work? Reading the code/metadata, they seem useful for monotonic or slowly changing columns, like dates or DB FKs, where min/max stats in combination with clustering/sorting can do proper data skipping.
   
   You're right: Data Skipping effectiveness is correlated with how disjoint the individual files' value ranges are for a particular column. The opposite is also true: if the ranges for column A are exactly the same in every file, Data Skipping effectiveness will be practically 0 (we often call it "pruning potential"). As you rightly noticed, it is most effective with ordered or semi-ordered columns, and therefore we usually recommend that folks consider clustering on the columns they query most often to leverage the full potential of Data Skipping (especially given that since 0.10 Hudi supports space-filling curves such as Z-order and Hilbert in its clustering suite).
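   
   To make the effect of ordering concrete, here is a small sketch that lays the same values out in files with and without sorting and counts how many files a point lookup still has to read. The `PruningPotential` object and the toy data are assumptions for illustration only and do not model Hudi's clustering.
   
   ```scala
   object PruningPotential {
     // Split values into files of `perFile` values each and record every file's (min, max).
     def fileRanges(values: Seq[Int], perFile: Int): Seq[(Int, Int)] =
       values.grouped(perFile).map(g => (g.min, g.max)).toVector
   
     // Number of files whose [min, max] range includes `value`, i.e. files that cannot be skipped.
     def filesToScan(ranges: Seq[(Int, Int)], value: Int): Int =
       ranges.count { case (lo, hi) => lo <= value && value <= hi }
   }
   
   object PruningPotentialExample {
     val unsorted = Seq(1, 90, 5, 95, 3, 99, 2, 97)
     // Every unsorted file spans nearly the whole domain, so nothing can be pruned.
     val scannedUnsorted = PruningPotential.filesToScan(PruningPotential.fileRanges(unsorted, 2), 50)
     // After sorting, the file ranges are disjoint and no file range includes 50.
     val scannedSorted = PruningPotential.filesToScan(PruningPotential.fileRanges(unsorted.sorted, 2), 50)
   }
   ```
   
   In this toy layout the unsorted data forces all 4 files to be read for the value 50, while the sorted layout allows all of them to be skipped.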



[GitHub] [hudi] Zhangshunyu commented on issue #5808: [SUPPORT] Data skipping using Column Stats Bloom does not seem to work at all

Posted by GitBox <gi...@apache.org>.

Zhangshunyu commented on issue #5808:
URL: https://github.com/apache/hudi/issues/5808#issuecomment-1163076874

   Hi @alexeykudinkin, I am now using data skipping with column stats and the metadata table enabled. It does prune files, but I find that the number of matched rows differs from what file-based metadata returns.
   For example, we have three levels of partitioning ('year, month, day'), and if we run count(*) with filters on both partition columns and other columns, the results differ (the count with file-based metadata is correct, but the count with data skipping is lower).



[GitHub] [hudi] Zhangshunyu commented on issue #5808: [SUPPORT] Data skipping using Column Stats Bloom does not seem to work at all

Posted by GitBox <gi...@apache.org>.

Zhangshunyu commented on issue #5808:
URL: https://github.com/apache/hudi/issues/5808#issuecomment-1166234264

   Hi @alexeykudinkin, thanks for your reply. This problem is fixed by merging the MR for [HUDI-4200]; now that the records read from the HFile are sorted, it is OK.
    [HUDI-4200] Fixing sorting of keys fetched from metadata table 
   



[GitHub] [hudi] alexeykudinkin commented on issue #5808: [SUPPORT] Data skipping using Column Stats Bloom does not seem to work at all

Posted by GitBox <gi...@apache.org>.

alexeykudinkin commented on issue #5808:
URL: https://github.com/apache/hudi/issues/5808#issuecomment-1165981219

   @Zhangshunyu that def sounds like a bug to me. Can you please create a separate issue (it doesn't really overlap with this one) and provide steps to reproduce the issue (as much as possible)?



[GitHub] [hudi] nsivabalan closed issue #5808: [SUPPORT] Data skipping using Column Stats Bloom does not seem to work at all

Posted by GitBox <gi...@apache.org>.

nsivabalan closed issue #5808: [SUPPORT] Data skipping using Column Stats Bloom does not seem to work at all
URL: https://github.com/apache/hudi/issues/5808



[GitHub] [hudi] nsivabalan commented on issue #5808: [SUPPORT] Data skipping using Column Stats Bloom does not seem to work at all

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #5808:
URL: https://github.com/apache/hudi/issues/5808#issuecomment-1216239571

   @alexeykudinkin @Reimus: If things are taken care of, can we close out this issue? If not, can you follow up to see what's pending?
   



[GitHub] [hudi] Reimus commented on issue #5808: [SUPPORT] Data skipping using Column Stats Bloom does not seem to work at all

Posted by GitBox <gi...@apache.org>.

Reimus commented on issue #5808:
URL: https://github.com/apache/hudi/issues/5808#issuecomment-1152845213

   Thank you for the explanation.
   
   The column stats index / data skipping is already an awesome addition to 0.11.0. Given that the docs mention it in the same breath as the bloom index, I assumed there is a use for bloom-based secondary indexes too.
   Think of a customer uuid column, for example: since it is a random string, column stats would be relatively useless, but a bloom filter could skip 99% of all files when looking for a particular uuid.
   Or am I missing something about how the column stats work? Reading the code/metadata, they seem useful for monotonic or slowly changing columns, like dates or DB FKs, where min/max stats in combination with clustering/sorting can do proper data skipping.
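   
   The idea of Bloom-filter-based file skipping can be sketched like this. It is only an illustration of the concept: per this thread, Data Skipping in 0.11 does not consult the Bloom Filter Index, and `SimpleBloom` and `BloomSkipping` are hypothetical stand-ins rather than Hudi classes.
   
   ```scala
   import java.util.BitSet
   import scala.util.hashing.MurmurHash3
   
   // Minimal Bloom filter over strings, for illustration only.
   final class SimpleBloom(numBits: Int, numHashes: Int) {
     private val bits = new BitSet(numBits)
   
     // Double hashing: position_i = (h1 + i * h2) mod numBits.
     private def positions(key: String): Seq[Int] = {
       val h1 = MurmurHash3.stringHash(key, 0)
       val h2 = MurmurHash3.stringHash(key, 1)
       (0 until numHashes).map(i => Math.floorMod(h1 + i * h2, numBits))
     }
   
     def add(key: String): Unit = positions(key).foreach(i => bits.set(i))
   
     // false: the key is definitely absent; true: the key may be present (false positives possible).
     def mightContain(key: String): Boolean = positions(key).forall(i => bits.get(i))
   }
   
   object BloomSkipping {
     // Keep only the files whose filter cannot rule the key out.
     def candidates(filters: Map[String, SimpleBloom], key: String): Set[String] =
       filters.collect { case (file, bf) if bf.mightContain(key) => file }.toSet
   }
   ```
   
   A lookup for a random uuid would then read only the files whose filter reports a possible match, plus whatever false positives the filter sizing allows.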

