Posted to commits@hudi.apache.org by "Ethan Guo (Jira)" <ji...@apache.org> on 2023/01/10 02:37:00 UTC

[jira] [Updated] (HUDI-5323) Decouple virtual key with writing bloom filters to parquet files

     [ https://issues.apache.org/jira/browse/HUDI-5323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-5323:
----------------------------
    Story Points: 2  (was: 4)

> Decouple virtual key with writing bloom filters to parquet files
> ----------------------------------------------------------------
>
>                 Key: HUDI-5323
>                 URL: https://issues.apache.org/jira/browse/HUDI-5323
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: index, writer-core
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>
> When the virtual key feature is enabled by setting hoodie.populate.meta.fields to false, bloom filters are not written to the parquet base files during write transactions. The relevant logic lives in the HoodieFileWriterFactory class:
> {code:java}
> private static <T extends HoodieRecordPayload, R extends IndexedRecord> HoodieFileWriter<R> newParquetFileWriter(
>     String instantTime, Path path, HoodieWriteConfig config, Schema schema, HoodieTable hoodieTable,
>     TaskContextSupplier taskContextSupplier, boolean populateMetaFields) throws IOException {
>   // Note: populateMetaFields is passed twice here, also serving as the
>   // enableBloomFilter flag, which ties bloom filter writing to virtual keys.
>   return newParquetFileWriter(instantTime, path, config, schema, hoodieTable.getHadoopConf(),
>       taskContextSupplier, populateMetaFields, populateMetaFields);
> }
>
> private static <T extends HoodieRecordPayload, R extends IndexedRecord> HoodieFileWriter<R> newParquetFileWriter(
>     String instantTime, Path path, HoodieWriteConfig config, Schema schema, Configuration conf,
>     TaskContextSupplier taskContextSupplier, boolean populateMetaFields, boolean enableBloomFilter) throws IOException {
>   Option<BloomFilter> filter = enableBloomFilter ? Option.of(createBloomFilter(config)) : Option.empty();
>   HoodieAvroWriteSupport writeSupport = new HoodieAvroWriteSupport(new AvroSchemaConverter(conf).convert(schema), schema, filter);
>   HoodieParquetConfig<HoodieAvroWriteSupport> parquetConfig = new HoodieParquetConfig<>(writeSupport, config.getParquetCompressionCodec(),
>       config.getParquetBlockSize(), config.getParquetPageSize(), config.getParquetMaxFileSize(),
>       conf, config.getParquetCompressionRatio(), config.parquetDictionaryEnabled());
>   return new HoodieAvroParquetWriter<>(path, parquetConfig, instantTime, taskContextSupplier, populateMetaFields);
> }
> {code}
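> One way to observe the effect is to read the footer of a freshly written base file. Below is a minimal sketch using parquet-hadoop; the footer key org.apache.hudi.bloomfilter matches the constant in HoodieAvroWriteSupport, and the file path argument is a placeholder:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.format.converter.ParquetMetadataConverter;
> import org.apache.parquet.hadoop.ParquetFileReader;
> import org.apache.parquet.hadoop.metadata.ParquetMetadata;
>
> public class BloomFilterFooterCheck {
>   public static void main(String[] args) throws Exception {
>     // args[0]: path to a base file written with hoodie.populate.meta.fields=false
>     ParquetMetadata footer = ParquetFileReader.readFooter(
>         new Configuration(), new Path(args[0]), ParquetMetadataConverter.NO_FILTER);
>     // HoodieAvroWriteSupport stores the serialized bloom filter under this key
>     String filter = footer.getFileMetaData().getKeyValueMetaData()
>         .get("org.apache.hudi.bloomfilter");
>     // Prints null for such files: the footer entry was never written
>     System.out.println(filter);
>   }
> }
> {code}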
> Because the bloom filters are absent, the writer hits an NPE (HUDI-5319) when the Bloom Index is used on the same table.
> We should decouple the virtual key feature from bloom filter writing and always write the bloom filters to the parquet files.
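> One possible shape of the fix, sketched against the first factory method above (the hardcoded flag mirrors the "always write" direction and is illustrative, not the final implementation):
> {code:java}
> private static <T extends HoodieRecordPayload, R extends IndexedRecord> HoodieFileWriter<R> newParquetFileWriter(
>     String instantTime, Path path, HoodieWriteConfig config, Schema schema, HoodieTable hoodieTable,
>     TaskContextSupplier taskContextSupplier, boolean populateMetaFields) throws IOException {
>   // Decoupled: enableBloomFilter no longer reuses populateMetaFields, so
>   // virtual-key tables (populateMetaFields=false) still get bloom filters.
>   return newParquetFileWriter(instantTime, path, config, schema, hoodieTable.getHadoopConf(),
>       taskContextSupplier, populateMetaFields, /* enableBloomFilter */ true);
> }
> {code}
> A fuller fix could derive the flag from the configured index type instead of hardcoding it.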



--
This message was sent by Atlassian Jira
(v8.20.10#820010)