You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "chenjunjiedada (via GitHub)" <gi...@apache.org> on 2023/05/16 08:16:13 UTC

[GitHub] [iceberg] chenjunjiedada opened a new pull request, #7617: Parquet: skip writing bloom filter for deletes

chenjunjiedada opened a new pull request, #7617:
URL: https://github.com/apache/iceberg/pull/7617

   Right now, the Flink upsert job writes bloom filter for equality deletes when enabling bloom filter for equality columns. This is useless and should be avoided.
   
   <img width="1515" alt="image" src="https://github.com/apache/iceberg/assets/3960228/b4a305ea-887b-436f-b3ce-4c4577a55491">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on pull request #7617: Parquet: skip writing bloom filter for deletes

Posted by "stevenzwu (via GitHub)" <gi...@apache.org>.
stevenzwu commented on PR #7617:
URL: https://github.com/apache/iceberg/pull/7617#issuecomment-1550717223

   I would get @huaxingao 's input on this one


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on a diff in pull request #7617: Parquet: skip writing bloom filter for deletes

Posted by "stevenzwu (via GitHub)" <gi...@apache.org>.
stevenzwu commented on code in PR #7617:
URL: https://github.com/apache/iceberg/pull/7617#discussion_r1205971970


##########
parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java:
##########
@@ -519,8 +512,8 @@ static Context deleteContext(Map<String, String> config) {
             compressionLevel,
             rowGroupCheckMinRecordCount,
             rowGroupCheckMaxRecordCount,
-            bloomFilterMaxBytes,
-            columnBloomFilterEnabled,
+            PARQUET_BLOOM_FILTER_MAX_BYTES_DEFAULT,
+            ImmutableMap.of(),

Review Comment:
   or delete files are expected to be compacted/consolidated anyway. Hence the bloom filter on delete files never makes sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu merged pull request #7617: Parquet: skip writing bloom filter for deletes

Posted by "stevenzwu (via GitHub)" <gi...@apache.org>.
stevenzwu merged PR #7617:
URL: https://github.com/apache/iceberg/pull/7617


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on pull request #7617: Parquet: skip writing bloom filter for deletes

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on PR #7617:
URL: https://github.com/apache/iceberg/pull/7617#issuecomment-1550884548

   LGTM 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a diff in pull request #7617: Parquet: skip writing bloom filter for deletes

Posted by "chenjunjiedada (via GitHub)" <gi...@apache.org>.
chenjunjiedada commented on code in PR #7617:
URL: https://github.com/apache/iceberg/pull/7617#discussion_r1206109480


##########
parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java:
##########
@@ -519,8 +512,8 @@ static Context deleteContext(Map<String, String> config) {
             compressionLevel,
             rowGroupCheckMinRecordCount,
             rowGroupCheckMaxRecordCount,
-            bloomFilterMaxBytes,
-            columnBloomFilterEnabled,
+            PARQUET_BLOOM_FILTER_MAX_BYTES_DEFAULT,
+            ImmutableMap.of(),

Review Comment:
   > If some datasets have high updates rate and generates a lot of large delete files. would the bloom filter for delete file be useful too?
   
   First, we don't have filter logic for the bloom filter right now.  Second, high-rate updates mostly produce position delete which doesn't contains a column to build a bloom filter unless the upstream is deleting a set of records. In that case, the records of the filtered data files should always pass the bloom filter of equality delete.
   
   > delete files are expected to be compacted/consolidated anyway. Hence the bloom filter on delete files never makes sense.
   
   I think so, the deletes impact read performance, as far as I know, all real productions need the async tasks to compact them to achieve good performance.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on pull request #7617: Parquet: skip writing bloom filter for deletes

Posted by "chenjunjiedada (via GitHub)" <gi...@apache.org>.
chenjunjiedada commented on PR #7617:
URL: https://github.com/apache/iceberg/pull/7617#issuecomment-1550558408

   @stevenzwu @hililiwei You guys may wanna take a look on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on pull request #7617: Parquet: skip writing bloom filter for deletes

Posted by "chenjunjiedada (via GitHub)" <gi...@apache.org>.
chenjunjiedada commented on PR #7617:
URL: https://github.com/apache/iceberg/pull/7617#issuecomment-1558374712

   I just rebased this since it has conflicts. @stevenzwu could you take another look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on a diff in pull request #7617: Parquet: skip writing bloom filter for deletes

Posted by "stevenzwu (via GitHub)" <gi...@apache.org>.
stevenzwu commented on code in PR #7617:
URL: https://github.com/apache/iceberg/pull/7617#discussion_r1205969186


##########
parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java:
##########
@@ -519,8 +512,8 @@ static Context deleteContext(Map<String, String> config) {
             compressionLevel,
             rowGroupCheckMinRecordCount,
             rowGroupCheckMaxRecordCount,
-            bloomFilterMaxBytes,
-            columnBloomFilterEnabled,
+            PARQUET_BLOOM_FILTER_MAX_BYTES_DEFAULT,
+            ImmutableMap.of(),

Review Comment:
   for most use cases, this change makes sense.
   
   If some datasets have high updates rate and generates a lot of large delete files. would the bloom filter for delete file be useful too?
   
   if yes, we can introduce a config to enable/disable bloom filter for delete files only. If not, this change is good to me.
   
   @huaxingao @hililiwei @chenjunjiedada 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a diff in pull request #7617: Parquet: skip writing bloom filter for deletes

Posted by "chenjunjiedada (via GitHub)" <gi...@apache.org>.
chenjunjiedada commented on code in PR #7617:
URL: https://github.com/apache/iceberg/pull/7617#discussion_r1206096245


##########
parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java:
##########
@@ -519,8 +512,8 @@ static Context deleteContext(Map<String, String> config) {
             compressionLevel,
             rowGroupCheckMinRecordCount,
             rowGroupCheckMaxRecordCount,
-            bloomFilterMaxBytes,
-            columnBloomFilterEnabled,
+            PARQUET_BLOOM_FILTER_MAX_BYTES_DEFAULT,
+            ImmutableMap.of(),

Review Comment:
   > If some datasets have high updates rate and generates a lot of large delete files. would the bloom filter for delete file be useful too?
    That should produce position deletes 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org