Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/10/15 09:18:57 UTC

[GitHub] [hudi] KarthickAN opened a new issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

KarthickAN opened a new issue #2178:
URL: https://github.com/apache/hudi/issues/2178


   Hi,
   I tried inspecting the parquet files produced by Hudi using parquet-tools. Each parquet file produced by Hudi contains around 10MB of data for the field **extra: org.apache.hudi.bloomfilter**, while the actual record data is only in the KBs. As per the doc, every 50000 bloom entries should take about 4KB. Is this expected behavior or am I missing something here? Below are the configs I am currently using.
   
   
   SmallFileSize = 104857600
   MaxFileSize = 125829120
   RecordSize = 35
   CompressionRatio = 5
   InsertSplitSize = 3500000
   IndexBloomNumEntries = 1500000
   KeyGenClass = org.apache.hudi.keygen.ComplexKeyGenerator
   RecordKeyFields = sourceid,sourceassetid,sourceeventid,value,timestamp
   TableType = COPY_ON_WRITE
   PartitionPathFields = date,sourceid
   HiveStylePartitioning = True
   WriteOperation = insert
   CompressionCodec = snappy
   CommitsRetained = 1
   CombineBeforeInsert = True
   PrecombineField = timestamp
   InsertDropDuplicates = True
   InsertShuffleParallelism = 100 
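   
   For context, here is a rough sketch of how settings like these are usually passed to the Hudi Spark datasource. The hoodie.* option keys below are my best mapping of the shorthand names above onto the 0.6.0 configuration names and should be treated as assumptions, not as taken from this issue; df and basePath are placeholders.
   
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   
   // Sketch only: write a Dataset<Row> to a COPY_ON_WRITE Hudi table with the
   // key/partition/file-size settings listed above (option keys are assumed).
   public class HudiInsertSketch {
     static void write(Dataset<Row> df, String basePath) {
       df.write().format("hudi")
           .option("hoodie.table.name", "events")                                  // placeholder table name
           .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
           .option("hoodie.datasource.write.operation", "insert")
           .option("hoodie.datasource.write.keygenerator.class",
               "org.apache.hudi.keygen.ComplexKeyGenerator")
           .option("hoodie.datasource.write.recordkey.field",
               "sourceid,sourceassetid,sourceeventid,value,timestamp")
           .option("hoodie.datasource.write.partitionpath.field", "date,sourceid")
           .option("hoodie.datasource.write.precombine.field", "timestamp")
           .option("hoodie.datasource.write.hive_style_partitioning", "true")
           .option("hoodie.datasource.write.insert.drop.duplicates", "true")
           .option("hoodie.parquet.small.file.limit", "104857600")                 // SmallFileSize
           .option("hoodie.parquet.max.file.size", "125829120")                    // MaxFileSize
           .option("hoodie.parquet.compression.codec", "snappy")
           .option("hoodie.index.bloom.num_entries", "1500000")                    // IndexBloomNumEntries
           .option("hoodie.insert.shuffle.parallelism", "100")
           .mode(SaveMode.Append)
           .save(basePath);
     }
   }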
   
   
   
   
   
   **Environment Description**
   
   Hudi version : 0.6.0
   
   Spark version : 2.4.3
   
   Hadoop version : 2.8.5-amzn-1
   
   Storage (HDFS/S3/GCS..) : S3
   
   Running on Docker? (yes/no) : No. Running on AWS Glue


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] KarthickAN edited a comment on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
KarthickAN edited a comment on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-710747888


   @nsivabalan I tried out the Dynamic filter. It seems to be fine. It's growing dynamically along with the number of entries. That's a good feature. Thanks.
   
   However, what's the recommended approach in terms of indexing here? I see various features are available out of the box. Given the record size (35 bytes) I could have more than 3.5 million records in a file with a max size of 120MB. Since the doc recommended using approximately half of the total number of records, I went with 1.5M for the bloom filter. But with that approach it looks like the storage size increases by 10MB per file.
   
   Regarding the index type (hoodie.index.type): how does the SIMPLE type work?
   
   I see hoodie.bloom.index.prune.by.ranges, hoodie.bloom.index.use.caching, hoodie.bloom.index.use.treebased.filter and hoodie.bloom.index.bucketized.checking are all enabled by default. Do these really help regardless of the hoodie key types used? In my case I am using ComplexKeyGenerator with five different fields, one of which is the timestamp.





[GitHub] [hudi] KarthickAN commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
KarthickAN commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-710747888


   @nsivabalan I tried out the Dynamic filter. It seems to be fine. It's growing dynamically along with the number of entries. That's a good feature. Thanks.
   
   However, what's the recommended approach in terms of indexing here? I see various features are available out of the box.
   
   Regarding the index type (hoodie.index.type): how does the SIMPLE type work?
   
   I see hoodie.bloom.index.prune.by.ranges, hoodie.bloom.index.use.caching, hoodie.bloom.index.use.treebased.filter and hoodie.bloom.index.bucketized.checking are all enabled by default. Do these really help regardless of the hoodie key types used? In my case I am using ComplexKeyGenerator with five different fields, one of which is the timestamp.





[GitHub] [hudi] bvaradar commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-709446368


   @nsivabalan : Can you take a look at this?





[GitHub] [hudi] nsivabalan edited a comment on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-710031904


   If you wish to have a dynamic bloom filter that scales its size as the number of entries increases, you can try it out.
   Remember this is different from hoodie.index.type, which refers to BLOOM/GLOBAL_BLOOM, etc.
   The config of interest is
   hoodie.bloom.index.filter.type = SIMPLE/DYNAMIC_V0
   
   For DYNAMIC_V0, you need to set an extra config:
   hoodie.bloom.index.filter.dynamic.max.entries
   
   Basically the bloom filter will be initialized based on hoodie.index.bloom.num_entries, but as the number of entries added reaches this value, the bloom filter dynamically scales up and increases its bit size. This goes on up to "hoodie.bloom.index.filter.dynamic.max.entries". So, until that point the fpp will be honored. After that, the fpp may not be honored, as the bloom filter cannot grow beyond this limit.
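   
   As a minimal sketch, this is how the two configs above could be wired into a Spark write (assuming df is a Dataset<Row>, basePath is the table path, and the usual table/key options are already set; the numeric values are placeholders only):
   
   // Sketch: enable the dynamic bloom filter (config keys as named above).
   df.write().format("hudi")
       .option("hoodie.index.bloom.num_entries", "60000")                    // starting size (placeholder)
       .option("hoodie.bloom.index.filter.type", "DYNAMIC_V0")
       .option("hoodie.bloom.index.filter.dynamic.max.entries", "1500000")   // cap for dynamic scaling (placeholder)
       .mode(org.apache.spark.sql.SaveMode.Append)
       .save(basePath);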
   





[GitHub] [hudi] nsivabalan commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-711168471


   I guess the small record size of 35 bytes throws it off, so let's see what we can do.
   Before I go further, let me recap the SIMPLE bloom filter.
   A bloom filter statically allocates its size based on numEntries and fpp. So, irrespective of whether you add 50k, 500k or 1.5M entries (as per your config), the bloom filter size is going to be 10MB or so. But it will guarantee the 1 * 10^-9 false positive probability. That's how a typical bloom filter works. If you initialize it for 1.5M entries with 1 * 10^-9 as the fpp, it is going to allocate that many buckets.
   Once you exceed the number of entries, the fpp may no longer be guaranteed. In other words, lookups will return more false positives than 1 * 10^-9 would suggest.
   
   Coming back to the problem: with the SIMPLE bloom filter, I guess we can't do much given the small record size. And yes, all the configs you mentioned will help in reducing the time spent during index lookup.
   At a high level, these are the steps performed during index lookup (a conceptual sketch follows the list):
   - Do a range lookup and filter out those data files whose key range does not match the input records (for each input record).
   - From the filtered ones, do a bloom filter lookup to further trim down the data files to be checked.
   - After all this filtering, go ahead and look up each record key in the remaining data files and return the matched location if found.
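   
   Here is that conceptual sketch. It only illustrates the first two pruning steps; the types are placeholders of my own, not Hudi's actual classes:
   
   import java.util.ArrayList;
   import java.util.List;
   import java.util.function.Predicate;
   
   // Conceptual sketch of candidate-file pruning (placeholder types, not Hudi code).
   class CandidateFilePruning {
     static class DataFile {
       String path;
       String minKey;                   // min record key from the file footer
       String maxKey;                   // max record key from the file footer
       Predicate<String> mightContain;  // stands in for the file's bloom filter
     }
   
     // Steps 1 and 2: keep only files whose key range covers the record key AND
     // whose bloom filter says the key might be present. Step 3 (actually reading
     // the surviving files) is omitted here.
     static List<DataFile> candidateFiles(String recordKey, List<DataFile> files) {
       List<DataFile> candidates = new ArrayList<>();
       for (DataFile f : files) {
         boolean inRange = recordKey.compareTo(f.minKey) >= 0
             && recordKey.compareTo(f.maxKey) <= 0;          // range pruning
         if (inRange && f.mightContain.test(recordKey)) {    // bloom filter check
           candidates.add(f);
         }
       }
       return candidates;
     }
   }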
   
   So, a few things to note here:
   - The range filtering will only work if your data is laid out such that each data file holds a distinct key subset of the entire dataset. If every data file's range has more or less similar min and max values, then this filtering may not help much.
   - Bloom filter lookup: again, depending on the fpp it was initialized with, it will trim out most data files when a key is not present.
   
   Having said all this, here is a rough idea of the bloom filter size for different values of numEntries and fpp.
   
   numEntries / fpp | 1 * 10^-6 | 1 * 10^-7 | 1 * 10^-8 | 1 * 10^-9 |
   -----------------|-----------|-----------|-----------|-----------|
   100k  | 400KB | 560KB | 640KB | 710KB |
   250k  | 1.2MB | 1.4MB | 1.6MB | 1.8MB |
   500k  | 2.3MB | 2.8MB | 3.1MB | 3.6MB |
   750k  | 3.6MB | 4.1MB | 4.8MB | 5.4MB |
   1M    | 4.8MB | 5.6MB | 6.4MB | 7.2MB |
   1.25M | 6MB   | 7MB   | 8MB   | 9MB   |
   1.5M  | 7.2MB | 8.4MB | 9.6MB | 10MB  |
   
   So, maybe you can try 250k/500k with 1 * 10^-6 or 1 * 10^-7.
   Or a better option is to run some workloads and determine which one best fits your case.
   
   By the way, may I know what perf impact you are seeing?





[GitHub] [hudi] nsivabalan commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-709715455


   I will try to repro tomorrow. A few questions in the meantime:
   1. By recordSize = 35 do you mean 35 bytes? That seems too low.
   2. By "IndexBloomNumEntries = 1500000", do you mean you have set "hoodie.index.bloom.num_entries" to this value?
   3. Did you set a value for any of these configs?
   hoodie.copyonwrite.insert.auto.split 
   hoodie.index.bloom.num_entries
   hoodie.index.bloom.fpp
   hoodie.bloom.index.filter.type
   4. What was the size of the parquet file, or how many entries did it have, that resulted in a 10MB bloom index?
   





[GitHub] [hudi] KarthickAN edited a comment on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
KarthickAN edited a comment on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-710747888


   @nsivabalan I tried out the Dynamic filter. It seems to be fine. It's growing dynamically along with the number of entries. That's a good feature. Thanks.
   
   However, what's the recommended approach in terms of indexing here? I see various features are available out of the box. Given the record size (35 bytes) I could have more than 3.5 million records in a file with a max size of 120MB. Since the doc recommended using approximately half of the total number of records, I went with 1.5M for the bloom filter. But with that approach it looks like the storage size increases by 10MB per file.
   
   Regarding the index type (hoodie.index.type): how does the SIMPLE type work?
   
   I see hoodie.bloom.index.prune.by.ranges, hoodie.bloom.index.use.caching, hoodie.bloom.index.use.treebased.filter and hoodie.bloom.index.bucketized.checking are all enabled by default. Do these really help regardless of the hoodie key types used? In my case I am using ComplexKeyGenerator with five different fields, one of which is the timestamp. Is it recommended to enable all of these for performance?





[GitHub] [hudi] nsivabalan edited a comment on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-711168471


   I guess the small record size of 35 bytes throws it off, so let's see what we can do.
   Before I go further, let me recap the SIMPLE bloom filter.
   A bloom filter statically allocates its size based on numEntries and fpp. So, irrespective of whether you add 50k, 500k or 1.5M entries (as per your config), the bloom filter size is going to be 10MB or so. But it will guarantee the 1 * 10^-9 false positive probability. That's how a typical bloom filter works. If you initialize it for 1.5M entries with 1 * 10^-9 as the fpp, it is going to allocate that many buckets.
   Once you exceed the number of entries, the fpp may no longer be guaranteed. In other words, lookups will return more false positives than 1 * 10^-9 would suggest.
   
   Coming back to the problem: with the SIMPLE bloom filter, I guess we can't do much given the small record size. And yes, all the configs you mentioned will help in reducing the time spent during index lookup.
   At a high level, these are the steps performed during index lookup:
   - Do a range lookup and filter out those data files whose key range does not match the input records (for each input record).
   - From the filtered ones, do a bloom filter lookup to further trim down the data files to be checked.
   - After all this filtering, go ahead and look up each record key in the remaining data files and return the matched location if found.
   
   So, a few things to note here:
   - The range filtering will only work if your data is laid out such that each data file holds a distinct key subset of the entire dataset. If every data file's range has more or less similar min and max values, then this filtering may not help much.
   - Bloom filter lookup: again, depending on the fpp it was initialized with, it will trim out most data files when a key is not present.
   
   Having said all this, here is a rough idea of the bloom filter size for different values of numEntries and fpp.
   
   numEntries / fpp | 1 * 10^-6 | 1 * 10^-7 | 1 * 10^-8 | 1 * 10^-9 |
   -----------------|-----------|-----------|-----------|-----------|
   100k  | 400KB | 560KB | 640KB | 710KB |
   250k  | 1.2MB | 1.4MB | 1.6MB | 1.8MB |
   500k  | 2.3MB | 2.8MB | 3.1MB | 3.6MB |
   750k  | 3.6MB | 4.1MB | 4.8MB | 5.4MB |
   1M    | 4.8MB | 5.6MB | 6.4MB | 7.2MB |
   1.25M | 6MB   | 7MB   | 8MB   | 9MB   |
   1.5M  | 7.2MB | 8.4MB | 9.6MB | 10MB  |
   
   So, maybe you can try 250k/500k with 1 * 10^-6 or 1 * 10^-7.
   Or a better option is to run some workloads and determine which one best fits your case.
   
   By the way, may I know what perf impact you are seeing?





[GitHub] [hudi] KarthickAN edited a comment on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
KarthickAN edited a comment on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-709875380


   @bvaradar @nsivabalan I ran some tests around this issue. First I ran the job with hoodie.index.bloom.num_entries set to 1500000 and inspected the file produced. There were 1000 records in total with a total size of 165381 bytes, plus 10.2MB of data for the bloom filter, and the total size of the file was 10.2MB.
   
   After that I removed the hoodie.index.bloom.num_entries config and ran the job with the default. This time I see the same 1000 records with size 165381 bytes and only 422KB of data for the bloom filter, and the total size of the file was 428KB.
   
   So this issue happens when I set hoodie.index.bloom.num_entries to 1500000.





[GitHub] [hudi] nsivabalan commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-710031904


   If you wish to have a dynamic bloom filter that scales its size as the number of entries increases, you can try it out.
   Remember this is different from hoodie.index.type, which refers to BLOOM/GLOBAL_BLOOM, etc.
   The config of interest is
   hoodie.bloom.index.filter.type = SIMPLE/DYNAMIC_V0
   
   For DYNAMIC_V0, you need to set an extra config:
   hoodie.bloom.index.filter.dynamic.max.entries
   
   Basically the bloom filter will be initialized based on hoodie.index.bloom.num_entries, but as the number of entries added reaches this value, the bloom filter dynamically scales up and increases its bit size. This goes on up to "hoodie.bloom.index.filter.dynamic.max.entries". So, until that point the fpp will be honored. After that, the fpp may not be honored, as the bloom filter cannot grow beyond this limit.
   





[GitHub] [hudi] KarthickAN commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
KarthickAN commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-709875380


   @nsivabalan I ran some tests around this issue. First I ran the job with hoodie.index.bloom.num_entries set to 1500000 and inspected the file produced. There were 1000 records in total with a total size of 165381 bytes, plus 10.2MB of data for the bloom filter, and the total size of the file was 10.2MB.
   
   After that I removed the hoodie.index.bloom.num_entries config and ran the job with the default. This time I see the same 1000 records with size 165381 bytes and only 422KB of data for the bloom filter, and the total size of the file was 428KB.
   
   So this issue happens when I set hoodie.index.bloom.num_entries to 1500000.





[GitHub] [hudi] nsivabalan commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-998389487


   https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/





[GitHub] [hudi] KarthickAN edited a comment on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
KarthickAN edited a comment on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-710747888


   @nsivabalan I tried out the Dynamic filter. It seems to be fine. It's growing dynamically along with the number of entries. That's a good feature. Thanks.
   
   However, what's the recommended approach in terms of indexing here? I see various features are available out of the box. Given the record size (35 bytes) I could have more than 3.5 million records in a file with a max size of 120MB. Since the doc recommended using approximately half of the total number of records, I went with 1.5M for the bloom filter.
   
   Regarding the index type (hoodie.index.type): how does the SIMPLE type work?
   
   I see hoodie.bloom.index.prune.by.ranges, hoodie.bloom.index.use.caching, hoodie.bloom.index.use.treebased.filter and hoodie.bloom.index.bucketized.checking are all enabled by default. Do these really help regardless of the hoodie key types used? In my case I am using ComplexKeyGenerator with five different fields, one of which is the timestamp.





[GitHub] [hudi] nsivabalan commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-766438221


   @KarthickAN : hope you got a chance to go through our [blog on indexes in Hudi](https://hudi.apache.org/blog/hudi-indexing-mechanisms/). With regard to this GitHub issue, please do let us know if you have any more specific questions. If not, we will close this out in a week's time.





[GitHub] [hudi] Gatsby-Lee commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
Gatsby-Lee commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-998381511


   > @KarthickAN : hope you got a chance to go through our [blog on indexes in Hudi](https://hudi.apache.org/blog/hudi-indexing-mechanisms/). Wrt this gh issue, please do let us know if you have any more specific questions. If not, will close this out in a weeks time.
   
   @nsivabalan hi, the link is broken. Can you share the updated link?
   Thank you





[GitHub] [hudi] nsivabalan commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-710029348


   Yes, you are right. The bit size used to initialize the bloom filter is a function of both numEntries and fpp:
   
   (int) Math.ceil(numEntries * (-Math.log(errorRate) / (Math.log(2) * Math.log(2))))
   
   You can check [this](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/bloom/BloomFilterUtils.java) file for the math.
   
   NumEntries 60000, with error rate 0.000000001
   Bitsize 2587966, numHashes 30
   Bloom filter ser string size : 431348
   
   NumEntries 1500000, with error rate 0.000000001
   Bitsize 64699145, numHashes 30
   Bloom filter ser string size : 10783212
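   
   As a worked example, here is a small sketch that roughly reproduces those numbers using the formula above. The hash-count formula and the ~4/3 serialization factor (modelling a Base64-style string encoding of the bit array, plus a few bytes of header) are my assumptions, not taken from the Hudi code:
   
   // Sketch: estimate bloom filter bit size, hash count and approximate serialized size.
   public class BloomSizeEstimate {
     static void estimate(int numEntries, double errorRate) {
       // Same formula as BloomFilterUtils above.
       int bitSize = (int) Math.ceil(numEntries * (-Math.log(errorRate) / (Math.log(2) * Math.log(2))));
       // Optimal number of hash functions: ln(2) * bits / entries (assumption).
       int numHashes = (int) Math.ceil(Math.log(2) * bitSize / numEntries);
       // Approximate serialized size: raw bytes expanded by ~4/3 for string encoding (assumption).
       long approxSerializedBytes = Math.round(Math.ceil(bitSize / 8.0) * 4.0 / 3.0);
       System.out.printf("numEntries=%d, fpp=%s -> bitSize=%d, numHashes=%d, ~serialized=%d bytes%n",
           numEntries, Double.toString(errorRate), bitSize, numHashes, approxSerializedBytes);
     }
   
     public static void main(String[] args) {
       estimate(60_000, 1e-9);     // ~2.59M bits, 30 hashes, ~431 KB serialized
       estimate(1_500_000, 1e-9);  // ~64.7M bits, 30 hashes, ~10.8 MB serialized
     }
   }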
   
   
   





[GitHub] [hudi] vinothchandar commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-711466156


   > However, what's the recommended approach in terms of indexing here? I see various features are available out of the box.
   
   I've been meaning to write a blog that walks through the options here. Interested in being a reviewer? That will also help explain this better for you as well :) 
   
   In short, you can pick options based on your workload (we intend to make dynamic_bloom the default in 0.7.0 going forward, which should help with this issue). 
   
   - If your records have ordered keys (e.g. a timestamp prefix), then the bloom index with range pruning will do an excellent job. It will be able to quickly prune out a large number of files to compare against and just use bloom filters for the rest.
   - If your records have no ordering in them (e.g. uuid), but the pattern is such that mostly the recent partitions are updated, with a long tail of updates/deletes to the older partitions, then the bloom index will still be faster. But it's better to turn off range pruning, since it does not help and just incurs the cost of checking.  
   
   - If your update patterns are totally random, i.e. each commit affects almost every file, you can use the SIMPLE index (which will join against the entire table) or, if you have HBase, the HBASE index (you can write your own index as well, it's pluggable). We are working on adding a record level index natively within Hudi, hopefully in the next major release.
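   
   As a small sketch, the knobs mentioned in this thread that steer those choices look roughly like this when passed to the Spark datasource (df and basePath as in the earlier sketches; values are illustrative only):
   
   // Illustrative only: index-related options discussed in this thread.
   df.write().format("hudi")
       .option("hoodie.index.type", "BLOOM")                    // or GLOBAL_BLOOM / SIMPLE / HBASE
       .option("hoodie.bloom.index.prune.by.ranges", "false")   // disable range pruning for unordered (e.g. uuid) keys
       .mode(org.apache.spark.sql.SaveMode.Append)
       .save(basePath);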
   
   
   





[GitHub] [hudi] nsivabalan edited a comment on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-710031904


   If you wish to scale the bloom filter size along with the number of entries, you can try out the dynamic bloom filter.
   Remember this is different from hoodie.index.type, which refers to BLOOM/GLOBAL_BLOOM, etc.
   The config of interest is
   hoodie.bloom.index.filter.type = SIMPLE/DYNAMIC_V0
   
   For DYNAMIC_V0, you need to set an extra config:
   hoodie.bloom.index.filter.dynamic.max.entries
   
   Basically the bloom filter will be initialized based on hoodie.index.bloom.num_entries, but as the number of entries added reaches this value, the bloom filter dynamically scales up and increases its bit size. This goes on up to "hoodie.bloom.index.filter.dynamic.max.entries". So, until that point the fpp will be honored. After that, the fpp may not be honored, as the bloom filter cannot grow beyond this limit.
   





[GitHub] [hudi] Gatsby-Lee commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
Gatsby-Lee commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-998389972


   @nsivabalan 
   
   Thank you very much!!





[GitHub] [hudi] KarthickAN commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
KarthickAN commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-709817321


   @nsivabalan Please find my answers below:
   
   1. That's the average record size. I inspected the parquet files produced and calculated it based on the metrics I found there.
   2. Yes.
   3. hoodie.copyonwrite.insert.split.size - didn't set it manually; it's enabled by default. But we don't retain 24 commits, it's just 1.
   hoodie.index.bloom.num_entries = set to 1500000
   hoodie.index.bloom.fpp = didn't set it manually; the default is 0.000000001
   hoodie.bloom.index.filter.type = didn't set it manually; the default is BLOOM
   
   In fact, except for the configs I mentioned in the issue description, I didn't set any other config explicitly; I left everything at the defaults.
   
   4. All the files I inspected so far had this issue regardless of their size. This is consistent.  





[GitHub] [hudi] nsivabalan closed issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #2178:
URL: https://github.com/apache/hudi/issues/2178


   





[GitHub] [hudi] KarthickAN commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

Posted by GitBox <gi...@apache.org>.
KarthickAN commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-711645166


   @nsivabalan @vinothchandar Thank you so much for all the explanations. Thinking about it, having 10MB worth of index data may not be an issue as long as the file contains a considerable number of records. In my case there was a scenario where I had only 1000 records but 10MB for the index. So I have switched to the dynamic bloom filter now, which is really helpful in this case. 
   
   We are dealing with two different types of data, one of which doesn't have much volume. That's where it threw things off, whereas for the other type, where we do have a good volume of data, this didn't come out as an issue since we'd already have around 110-120MB worth of data plus index. As of now I've configured it like below:
   
   IndexBloomNumEntries = 35000
   BloomIndexFilterType = DYNAMIC_V0
   BloomIndexFilterDynamicMaxEntries = 1400000
   
   I'm starting off with 35k (about 1% of the max number of entries in a file) as a base and scaling out to 1.4M entries (about 40% of the max number of entries in a file) as the file grows. That should possibly solve the problem. Anyway, we need to test this out for the volume we are seeing right now and tune it further if required.
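   
   As a quick sanity check of those percentages (a sketch using the sizes from the issue description; the 1% and 40% ratios are simply the ones chosen above):
   
   // Rough check: how many 35-byte records fit in a ~120MB file, and what 1% / 40% of that is.
   public class BloomSizingCheck {
     public static void main(String[] args) {
       long maxFileSizeBytes = 125_829_120L;                           // MaxFileSize from the issue description
       long avgRecordSizeBytes = 35L;                                  // average record size
       long maxRecordsPerFile = maxFileSizeBytes / avgRecordSizeBytes; // ~3,595,117 records
       System.out.println("max records/file  : " + maxRecordsPerFile);
       System.out.println("~1% (start size)  : " + maxRecordsPerFile / 100);           // ~35,951 -> 35000 chosen
       System.out.println("~40% (dynamic cap): " + (long) (maxRecordsPerFile * 0.40));  // ~1,438,046 -> 1400000 chosen
     }
   }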
   
   @vinothchandar Yes, having a blog around this will definitely be very helpful. I feel Hudi has a lot of features that could be used efficiently with some more in-depth explanations than what the documentation provides right now. 
   
   

