You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/05/27 07:00:02 UTC

[GitHub] [iceberg] jshmchenxi opened a new pull request #2642: Core: Support writing parquet bloom filter

jshmchenxi opened a new pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642


   Split #2582 into several PRs.
   This part adds support for writing parquet bloom filter.
   
   Add 3 new TableProperties. The definition is similar to [apache/parquet-mr](https://github.com/apache/parquet-mr/tree/master/parquet-hadoop)
   
   Property | Default | Description
   -- | -- | --
   | write.parquet.bloom-filter-enabled      | false          | Whether to enable writing bloom filter; If it is true, the bloom filter will be enable for all columns; If it is false, it will be disabled for all columns; It is also possible to enable it for some columns by specifying the column name within the property followed by #; For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` will enable bloom filter for all columns except `some_column` |
   | write.parquet.bloom-filter-max-bytes    | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset        |
   | write.parquet.bloom-filter-expected-ndv | (not set)      | The expected number of distinct values in a column, it is used to compute the optimal size of the bloom filter; Note that if this property is not set, the bloom filter will use the maximum size; If this property is set for a column, then no need to enable the bloom filter with `write.parquet.bloom-filter-enabled` property; For example, setting `write.parquet.bloom-filter-expected-ndv#some_column=200` will enable bloom filter for `some_column` with expected number of distinct values equals to 200 |
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zinking commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
zinking commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r803269617



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java
##########
@@ -259,17 +271,31 @@ private WriteBuilder createContextFunc(Function<Map<String, String>, Context> ne
           conf.set(entry.getKey(), entry.getValue());
         }
 
-        ParquetProperties parquetProperties = ParquetProperties.builder()
+        ParquetProperties.Builder propsBuilder = ParquetProperties.builder()
             .withWriterVersion(writerVersion)
             .withPageSize(pageSize)
             .withDictionaryPageSize(dictionaryPageSize)
-            .build();
+            .withMaxBloomFilterBytes(bloomFilterMaxBytes)
+            .withBloomFilterEnabled(bloomFilterEnabled);
+
+        new ColumnConfigParser()

Review comment:
       I know it's probably not a big deal, but I mean could new instances be avoided ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r802318164



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java
##########
@@ -259,17 +271,31 @@ private WriteBuilder createContextFunc(Function<Map<String, String>, Context> ne
           conf.set(entry.getKey(), entry.getValue());
         }
 
-        ParquetProperties parquetProperties = ParquetProperties.builder()
+        ParquetProperties.Builder propsBuilder = ParquetProperties.builder()
             .withWriterVersion(writerVersion)
             .withPageSize(pageSize)
             .withDictionaryPageSize(dictionaryPageSize)
-            .build();
+            .withMaxBloomFilterBytes(bloomFilterMaxBytes)
+            .withBloomFilterEnabled(bloomFilterEnabled);
+
+        new ColumnConfigParser()

Review comment:
       This ColumnConfigParser instance is created to parse the config and apply to propsBuilder




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-958691189


   Thanks for rebasing the code! @jshmchenxi 
   
   cc @aokolnychyi @RussellSpitzer @flyrain Could you please take a look when you have time? Thanks a lot in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r755439102



##########
File path: site/docs/configuration.md
##########
@@ -40,6 +40,9 @@ Iceberg tables support table properties to configure table behavior, like the de
 | write.parquet.dict-size-bytes      | 2097152 (2 MB)     | Parquet dictionary page size                       |
 | write.parquet.compression-codec    | gzip               | Parquet compression codec                          |
 | write.parquet.compression-level    | null               | Parquet compression level                          |
+| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filter; If it is true, the bloom filter will be enable for all columns; If it is false, it will be disabled for all columns; It is also possible to enable it for some columns by specifying the column name within the property followed by #; For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` will enable bloom filter for all columns except `some_column` |
+| write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset |
+| write.parquet.bloom-filter-expected-ndv | (not set) | The expected number of distinct values in a column, it is used to compute the optimal size of the bloom filter; Note that if this property is not set, the bloom filter will use the maximum size; If this property is set for a column, then no need to enable the bloom filter with `write.parquet.bloom-filter-enabled` property; For example, setting `write.parquet.bloom-filter-expected-ndv#some_column=200` will enable bloom filter for `some_column` with expected number of distinct values equals to 200 |

Review comment:
       Did you check Trino source code? Cc @jackye1995 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-958691189


   Thanks for rebasing the code! @jshmchenxi 
   
   cc @aokolnychyi @RussellSpitzer @flyrain Could you please take a look when you have time? Thanks a lot in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r746529384



##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -111,6 +111,15 @@ private TableProperties() {
   public static final String DELETE_AVRO_COMPRESSION = "write.delete.avro.compression-codec";
   public static final String AVRO_COMPRESSION_DEFAULT = "gzip";
 
+  public static final String PARQUET_BLOOM_FILTER_ENABLED = "write.parquet.bloom-filter-enabled";
+  public static final boolean PARQUET_BLOOM_FILTER_ENABLED_DEFAULT = false;

Review comment:
       Hi, Yufei, thanks for the review. The performance impact of writing bloom filter should be negligible, though we didn't do a performance benchmark. The cost of bloom filter is space. The default size of bloom filter for one column is 1 MB in each parquet file. If there are N columns in the table, then the extra space cost is **N MB in each file** to enable bloom filter for all columns. It is more reasonable to enable bloom filter only for columns that is of high cardinality and often used in filter expressions.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r747966473



##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -111,6 +111,15 @@ private TableProperties() {
   public static final String DELETE_AVRO_COMPRESSION = "write.delete.avro.compression-codec";
   public static final String AVRO_COMPRESSION_DEFAULT = "gzip";
 
+  public static final String PARQUET_BLOOM_FILTER_ENABLED = "write.parquet.bloom-filter-enabled";
+  public static final boolean PARQUET_BLOOM_FILTER_ENABLED_DEFAULT = false;

Review comment:
       I will update the Parquet bloom filter benchmark PR so it can be merged.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r667669697



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -297,6 +297,10 @@ public static boolean hasNonDictionaryPages(ColumnChunkMetaData meta) {
     }
   }
 
+  public static boolean hasNonBloomFilterPages(ColumnChunkMetaData meta) {
+    return meta.getBloomFilterOffset() == -1;

Review comment:
       As the `getBloomFilterOffset` is marked as `Private` method,  I would prefer not to use it in the production files , in case of breaking the compatibility when upgrading the parquet version in future. 
   
   Is there any other approach to check whether it generate bloom filter binary or not ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r750648681



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestParquet.java
##########
@@ -127,4 +140,80 @@ public void testNumberOfBytesWritten() throws IOException {
         records.toArray(new GenericData.Record[]{}));
     return Pair.of(file, size);
   }
+
+  @Test
+  public void testBloomFilterWriteRead() throws IOException {
+    File parquetFile = generateFileWithBloomFilter();
+
+    try (ParquetFileReader reader = ParquetFileReader.open(ParquetIO.file(localInput(parquetFile)))) {
+      BlockMetaData rowGroup = reader.getRowGroups().get(0);
+      BloomFilterReader bloomFilterDataReader = reader.getBloomFilterDataReader(rowGroup);
+
+      ColumnChunkMetaData intColumn = rowGroup.getColumns().get(0);
+      BloomFilter intBloomFilter = bloomFilterDataReader.readBloomFilter(intColumn);
+      Assert.assertTrue(intBloomFilter.findHash(intBloomFilter.hash(30)));

Review comment:
       I think we should add in a few negative tests as well since all of these check for inclusion and none for exclusion. I'm not super worried about this since I assume this code path is already well tested in Parquet.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r746558113



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java
##########
@@ -197,9 +198,11 @@ private void startRowGroup() {
 
     PageWriteStore pageStore = pageStoreCtorParquet.newInstance(
         compressor, parquetSchema, props.getAllocator(), this.columnIndexTruncateLength);
+    Preconditions.checkState(pageStore instanceof BloomFilterWriteStore,
+        "pageStore must be an instance of BloomFilterWriteStore");

Review comment:
       The code would be more readable when we add the check here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r745201062



##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -111,6 +111,15 @@ private TableProperties() {
   public static final String DELETE_AVRO_COMPRESSION = "write.delete.avro.compression-codec";
   public static final String AVRO_COMPRESSION_DEFAULT = "gzip";
 
+  public static final String PARQUET_BLOOM_FILTER_ENABLED = "write.parquet.bloom-filter-enabled";
+  public static final boolean PARQUET_BLOOM_FILTER_ENABLED_DEFAULT = false;

Review comment:
       What's the perf impact of writing bloom filer? Does it make sense to enable it by default? Would be nice to include benchmarks?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kingeasternsun commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
kingeasternsun commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r755032658



##########
File path: site/docs/configuration.md
##########
@@ -40,6 +40,9 @@ Iceberg tables support table properties to configure table behavior, like the de
 | write.parquet.dict-size-bytes      | 2097152 (2 MB)     | Parquet dictionary page size                       |
 | write.parquet.compression-codec    | gzip               | Parquet compression codec                          |
 | write.parquet.compression-level    | null               | Parquet compression level                          |
+| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filter; If it is true, the bloom filter will be enable for all columns; If it is false, it will be disabled for all columns; It is also possible to enable it for some columns by specifying the column name within the property followed by #; For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` will enable bloom filter for all columns except `some_column` |
+| write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset |
+| write.parquet.bloom-filter-expected-ndv | (not set) | The expected number of distinct values in a column, it is used to compute the optimal size of the bloom filter; Note that if this property is not set, the bloom filter will use the maximum size; If this property is set for a column, then no need to enable the bloom filter with `write.parquet.bloom-filter-enabled` property; For example, setting `write.parquet.bloom-filter-expected-ndv#some_column=200` will enable bloom filter for `some_column` with expected number of distinct values equals to 200 |

Review comment:
       hi everyone, does anybody know how to set these configuration  when i  use trino with iceberg . I have searched many places ,but does not found any useful info.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-958691189


   Thanks for rebasing the code! @jshmchenxi 
   
   cc @aokolnychyi @RussellSpitzer @flyrain Could you please take a look when you have time? Thanks a lot in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi edited a comment on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi edited a comment on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-958686405


   Sorry for the late reply, I was busy with something else. I've rebased the code and please help review it. I have time to complete the feature now. And thanks for reminding me! @huaxingao @ConeyLiu 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-958686405


   Sorry for the late reply, I was busy with something else. I've rebased the code and please help review it. I have time to complete the feature now. @huaxingao @ConeyLiu 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r745201062



##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -111,6 +111,15 @@ private TableProperties() {
   public static final String DELETE_AVRO_COMPRESSION = "write.delete.avro.compression-codec";
   public static final String AVRO_COMPRESSION_DEFAULT = "gzip";
 
+  public static final String PARQUET_BLOOM_FILTER_ENABLED = "write.parquet.bloom-filter-enabled";
+  public static final boolean PARQUET_BLOOM_FILTER_ENABLED_DEFAULT = false;

Review comment:
       What's the perf impact of writing bloom filer? Does it make sense to enable it by default if the perf impact is minor? Would be nice to include benchmarks? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r750641857



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java
##########
@@ -197,9 +198,11 @@ private void startRowGroup() {
 
     PageWriteStore pageStore = pageStoreCtorParquet.newInstance(
         compressor, parquetSchema, props.getAllocator(), this.columnIndexTruncateLength);
+    Preconditions.checkState(pageStore instanceof BloomFilterWriteStore,
+        "pageStore must be an instance of BloomFilterWriteStore");

Review comment:
       I'm a little confused on how the type might not match here? We get the class due to reflection so do we have a chance here of getting the wrong class?
   
   Seems like we may benefit from moving this into the DynamicConstructor code at the top of the file? I may misunderstand this but it seems like we should try to do this in the reflection itself rather than after we instantiate?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r753761266



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestParquet.java
##########
@@ -127,4 +140,80 @@ public void testNumberOfBytesWritten() throws IOException {
         records.toArray(new GenericData.Record[]{}));
     return Pair.of(file, size);
   }
+
+  @Test
+  public void testBloomFilterWriteRead() throws IOException {
+    File parquetFile = generateFileWithBloomFilter();
+
+    try (ParquetFileReader reader = ParquetFileReader.open(ParquetIO.file(localInput(parquetFile)))) {
+      BlockMetaData rowGroup = reader.getRowGroups().get(0);
+      BloomFilterReader bloomFilterDataReader = reader.getBloomFilterDataReader(rowGroup);
+
+      ColumnChunkMetaData intColumn = rowGroup.getColumns().get(0);
+      BloomFilter intBloomFilter = bloomFilterDataReader.readBloomFilter(intColumn);
+      Assert.assertTrue(intBloomFilter.findHash(intBloomFilter.hash(30)));

Review comment:
       The hash values might conflict so I didn't add exclusion tests.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r750626466



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ColumnConfigParser.java
##########
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.function.BiConsumer;
+import java.util.function.Function;
+import org.apache.hadoop.conf.Configuration;
+
+/**
+ * TODO: Once org.apache.parquet.hadoop.ColumnConfigParser is made public, should replace this class.

Review comment:
       Is there any plan for actually making this public?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r752328938



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java
##########
@@ -197,9 +198,11 @@ private void startRowGroup() {
 
     PageWriteStore pageStore = pageStoreCtorParquet.newInstance(
         compressor, parquetSchema, props.getAllocator(), this.columnIndexTruncateLength);
+    Preconditions.checkState(pageStore instanceof BloomFilterWriteStore,
+        "pageStore must be an instance of BloomFilterWriteStore");

Review comment:
       Yes, the check is redundant here. I'll remove it.
   
   > Seems like we may benefit from moving this into the DynamicConstructor code at the top of the file?
   
   We are checking the constructed instance against some interface type. Maybe it will help if we add constraint to `DynamicConstructor `.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-958686405


   Sorry for the late reply, I was busy with something else. I've rebased the code and please help review it. I have time to complete the feature now. @huaxingao @ConeyLiu 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r745214311



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java
##########
@@ -197,9 +198,11 @@ private void startRowGroup() {
 
     PageWriteStore pageStore = pageStoreCtorParquet.newInstance(
         compressor, parquetSchema, props.getAllocator(), this.columnIndexTruncateLength);
+    Preconditions.checkState(pageStore instanceof BloomFilterWriteStore,
+        "pageStore must be an instance of BloomFilterWriteStore");

Review comment:
       Do we need this check? It throws a `ClassCastException` at line 205 anyway.

##########
File path: site/docs/configuration.md
##########
@@ -40,6 +40,9 @@ Iceberg tables support table properties to configure table behavior, like the de
 | write.parquet.dict-size-bytes      | 2097152 (2 MB)     | Parquet dictionary page size                       |
 | write.parquet.compression-codec    | gzip               | Parquet compression codec                          |
 | write.parquet.compression-level    | null               | Parquet compression level                          |
+| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filter; If it is true, the bloom filter will be enable for all columns; If it is false, it will be disabled for all columns; It is also possible to enable it for some columns by specifying the column name within the property followed by #; For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` will enable bloom filter for all columns except `some_column` |

Review comment:
       BTW, how does a user specify multiple columns? like this or something else?
   ```
   write.parquet.bloom-filter-enabled#column1=false
   write.parquet.bloom-filter-enabled#column2=false
   ```

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ColumnConfigParser.java
##########
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.function.BiConsumer;
+import java.util.function.Function;
+import org.apache.hadoop.conf.Configuration;
+
+/**
+ * TODO: Once org.apache.parquet.hadoop.ColumnConfigParser is made public, should replace this class.
+ * Parses the specified key-values in the format of root.key#column.path from a {@link Configuration} object.
+ */
+class ColumnConfigParser {
+
+  private static class ConfigHelper<T> {
+    private final String prefix;
+    private final Function<String, T> function;
+    private final BiConsumer<String, T> consumer;
+
+    ConfigHelper(String prefix, Function<String, T> function, BiConsumer<String, T> consumer) {
+      this.prefix = prefix;
+      this.function = function;
+      this.consumer = consumer;
+    }
+
+    public void processKey(String key) {
+      if (key.startsWith(prefix)) {
+        String columnPath = key.substring(prefix.length());
+        T value = function.apply(key);
+        consumer.accept(columnPath, value);
+      }
+    }
+  }
+
+  private final List<ConfigHelper<?>> helpers = new ArrayList<>();
+
+  public <T> ColumnConfigParser withColumnConfig(String rootKey, Function<String, T> function,
+      BiConsumer<String, T> consumer) {
+    helpers.add(new ConfigHelper<T>(rootKey + '#', function, consumer));
+    return this;
+  }
+
+  public void parseConfig(Configuration conf) {
+    for (Map.Entry<String, String> entry : conf) {
+      for (ConfigHelper<?> helper : helpers) {
+        // We retrieve the value from function instead of parsing from the string here to use the exact implementations
+        // in Configuration
+        helper.processKey(entry.getKey());
+      }
+    }
+  }
+
+  public void parseConfig(Map<String, String> conf) {
+    for (Map.Entry<String, String> entry : conf.entrySet()) {
+      for (ConfigHelper<?> helper : helpers) {
+        // We retrieve the value from function instead of parsing from the string here to use the exact implementations
+        // in Configuration
+        helper.processKey(entry.getKey());
+      }
+    }

Review comment:
       extract it to a function `parseConfig(Iterable<Map.Entry<String,String>> entrySet)` to avoid duplication?

##########
File path: site/docs/configuration.md
##########
@@ -40,6 +40,9 @@ Iceberg tables support table properties to configure table behavior, like the de
 | write.parquet.dict-size-bytes      | 2097152 (2 MB)     | Parquet dictionary page size                       |
 | write.parquet.compression-codec    | gzip               | Parquet compression codec                          |
 | write.parquet.compression-level    | null               | Parquet compression level                          |
+| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filter; If it is true, the bloom filter will be enable for all columns; If it is false, it will be disabled for all columns; It is also possible to enable it for some columns by specifying the column name within the property followed by #; For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` will enable bloom filter for all columns except `some_column` |

Review comment:
       Minor suggestions. How about this?
   ```
   To enable or disable bloom filter for all columns by default. Partial enabling and disabling are possible by specifying the column name. For example, ...
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-958686405






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi edited a comment on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi edited a comment on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-958686405


   Sorry for the late reply, I was busy with something else. I've rebased the code and please help review it. I have time to complete the feature now. And thanks for reminding me! @huaxingao @ConeyLiu 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r640342954



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ColumnConfigParser.java
##########
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.function.BiConsumer;
+import java.util.function.Function;
+import org.apache.hadoop.conf.Configuration;
+
+/**
+ * TODO: Once org.apache.parquet.hadoop.ColumnConfigParser is made public, should replace this class.
+ * Parses the specified key-values in the format of root.key#column.path from a {@link Configuration} object.
+ */
+class ColumnConfigParser {

Review comment:
       This class is used to parse column configs. Mostly copied from [apache/parquet-mr](https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnConfigParser.java).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r667996047



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -297,6 +297,10 @@ public static boolean hasNonDictionaryPages(ColumnChunkMetaData meta) {
     }
   }
 
+  public static boolean hasNonBloomFilterPages(ColumnChunkMetaData meta) {
+    return meta.getBloomFilterOffset() == -1;

Review comment:
       I did some search and didn't find another approach. `ColumnChunkMetaData#getBloomFilterOffset` is also used when reading bloom filter to judge if bloom filter is enabled, see [ParquetFileReader#readBloomFilter](https://github.com/apache/parquet-mr/blob/db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1182).

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -297,6 +297,10 @@ public static boolean hasNonDictionaryPages(ColumnChunkMetaData meta) {
     }
   }
 
+  public static boolean hasNonBloomFilterPages(ColumnChunkMetaData meta) {
+    return meta.getBloomFilterOffset() == -1;

Review comment:
       I did some search and didn't find another approach. `ColumnChunkMetaData#getBloomFilterOffset` is also used when reading bloom filter to judge if bloom filter is enabled, see [ParquetFileReader#readBloomFilter](https://github.com/apache/parquet-mr/blob/db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1182).
   However, I should change `== -1` to `< 0`

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java
##########
@@ -38,16 +38,17 @@
 import org.apache.parquet.bytes.ByteBufferAllocator;
 import org.apache.parquet.column.ColumnWriteStore;
 import org.apache.parquet.column.ParquetProperties;
-import org.apache.parquet.column.page.PageWriteStore;
+import org.apache.parquet.column.values.bloomfilter.BloomFilterWriteStore;
 import org.apache.parquet.hadoop.CodecFactory;
+import org.apache.parquet.hadoop.ColumnChunkPageWriteStore;
 import org.apache.parquet.hadoop.ParquetFileWriter;
 import org.apache.parquet.hadoop.metadata.CompressionCodecName;
 import org.apache.parquet.schema.MessageType;
 
 class ParquetWriter<T> implements FileAppender<T>, Closeable {
 
-  private static DynConstructors.Ctor<PageWriteStore> pageStoreCtorParquet = DynConstructors
-          .builder(PageWriteStore.class)
+  private static DynConstructors.Ctor<ColumnChunkPageWriteStore> pageStoreCtorParquet = DynConstructors
+          .builder(ColumnChunkPageWriteStore.class)

Review comment:
       Yes, `ColumnChunkPageWriteStore` used to be a package-private class and has to be created by reflection. I didn't pay much attension to the `InterfaceAudience.Private` annotation and thanks for pointing it out.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r750653128



##########
File path: site/docs/configuration.md
##########
@@ -40,6 +40,9 @@ Iceberg tables support table properties to configure table behavior, like the de
 | write.parquet.dict-size-bytes      | 2097152 (2 MB)     | Parquet dictionary page size                       |
 | write.parquet.compression-codec    | gzip               | Parquet compression codec                          |
 | write.parquet.compression-level    | null               | Parquet compression level                          |
+| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filter; If it is true, the bloom filter will be enable for all columns; If it is false, it will be disabled for all columns; It is also possible to enable it for some columns by specifying the column name within the property followed by #; For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` will enable bloom filter for all columns except `some_column` |
+| write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset |
+| write.parquet.bloom-filter-expected-ndv | (not set) | The expected number of distinct values in a column, it is used to compute the optimal size of the bloom filter; Note that if this property is not set, the bloom filter will use the maximum size; If this property is set for a column, then no need to enable the bloom filter with `write.parquet.bloom-filter-enabled` property; For example, setting `write.parquet.bloom-filter-expected-ndv#some_column=200` will enable bloom filter for `some_column` with expected number of distinct values equals to 200 |

Review comment:
       A few minor suggestions : 
   "The expected number of distinct values in a column, it is used to compute the optimal size of bytes of the bloom filter; Note that if this property is not set, the bloom filter will use the maximum size set in`bloom-filter-max-bytes`"
   
   "This property overrides `write.parquet.bloom-filter-enabled`, automatically enabling bloom filters for any columns specified." then the example?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-957039915


   @jshmchenxi
   Do you still have time to work on this PR? If not, is it OK with you that either I or other person submit a new PR based on your work? We will list you as co-author and you will get commit credits. Please let me know if this is OK with you. Thank you very much!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-957124675


   Hi @jshmchenxi, thanks for the contribution. We have some updates based on your works. Would you mind we submit another PR based on your works?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-957124675


   Hi @jshmchenxi, thanks for the contribution. We have some updates based on your works. Would you mind we submit another PR based on your works?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r746545287



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ColumnConfigParser.java
##########
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.function.BiConsumer;
+import java.util.function.Function;
+import org.apache.hadoop.conf.Configuration;
+
+/**
+ * TODO: Once org.apache.parquet.hadoop.ColumnConfigParser is made public, should replace this class.
+ * Parses the specified key-values in the format of root.key#column.path from a {@link Configuration} object.
+ */
+class ColumnConfigParser {
+
+  private static class ConfigHelper<T> {
+    private final String prefix;
+    private final Function<String, T> function;
+    private final BiConsumer<String, T> consumer;
+
+    ConfigHelper(String prefix, Function<String, T> function, BiConsumer<String, T> consumer) {
+      this.prefix = prefix;
+      this.function = function;
+      this.consumer = consumer;
+    }
+
+    public void processKey(String key) {
+      if (key.startsWith(prefix)) {
+        String columnPath = key.substring(prefix.length());
+        T value = function.apply(key);
+        consumer.accept(columnPath, value);
+      }
+    }
+  }
+
+  private final List<ConfigHelper<?>> helpers = new ArrayList<>();
+
+  public <T> ColumnConfigParser withColumnConfig(String rootKey, Function<String, T> function,
+      BiConsumer<String, T> consumer) {
+    helpers.add(new ConfigHelper<T>(rootKey + '#', function, consumer));
+    return this;
+  }
+
+  public void parseConfig(Configuration conf) {
+    for (Map.Entry<String, String> entry : conf) {
+      for (ConfigHelper<?> helper : helpers) {
+        // We retrieve the value from function instead of parsing from the string here to use the exact implementations
+        // in Configuration
+        helper.processKey(entry.getKey());
+      }
+    }
+  }
+
+  public void parseConfig(Map<String, String> conf) {
+    for (Map.Entry<String, String> entry : conf.entrySet()) {
+      for (ConfigHelper<?> helper : helpers) {
+        // We retrieve the value from function instead of parsing from the string here to use the exact implementations
+        // in Configuration
+        helper.processKey(entry.getKey());
+      }
+    }

Review comment:
       Good point.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r803419870



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java
##########
@@ -259,17 +271,31 @@ private WriteBuilder createContextFunc(Function<Map<String, String>, Context> ne
           conf.set(entry.getKey(), entry.getValue());
         }
 
-        ParquetProperties parquetProperties = ParquetProperties.builder()
+        ParquetProperties.Builder propsBuilder = ParquetProperties.builder()
             .withWriterVersion(writerVersion)
             .withPageSize(pageSize)
             .withDictionaryPageSize(dictionaryPageSize)
-            .build();
+            .withMaxBloomFilterBytes(bloomFilterMaxBytes)
+            .withBloomFilterEnabled(bloomFilterEnabled);
+
+        new ColumnConfigParser()

Review comment:
       The parser is designed to be used in this way, just like we need creating builder instances to build objects.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-958691189


   Thanks for rebasing the code! @jshmchenxi 
   
   cc @aokolnychyi @RussellSpitzer @flyrain Could you please take a look when you have time? Thanks a lot in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-954357142


   @jshmchenxi 
   Are you still working on this PR? Could you please update the PR so it can be reviewed? Thanks a lot!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-957124675


   Hi @jshmchenxi, thanks for the contribution. We have some updates based on your works. Would you mind we submit another PR based on your works?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r753761007



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ColumnConfigParser.java
##########
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.function.BiConsumer;
+import java.util.function.Function;
+import org.apache.hadoop.conf.Configuration;
+
+/**
+ * TODO: Once org.apache.parquet.hadoop.ColumnConfigParser is made public, should replace this class.

Review comment:
       Thanks for the review. I've updated the configurations to be simialr to metrics.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r753761007



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ColumnConfigParser.java
##########
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.function.BiConsumer;
+import java.util.function.Function;
+import org.apache.hadoop.conf.Configuration;
+
+/**
+ * TODO: Once org.apache.parquet.hadoop.ColumnConfigParser is made public, should replace this class.

Review comment:
       Thanks for the review. I've updated the configurations to be simialr to metrics.

##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestParquet.java
##########
@@ -127,4 +140,80 @@ public void testNumberOfBytesWritten() throws IOException {
         records.toArray(new GenericData.Record[]{}));
     return Pair.of(file, size);
   }
+
+  @Test
+  public void testBloomFilterWriteRead() throws IOException {
+    File parquetFile = generateFileWithBloomFilter();
+
+    try (ParquetFileReader reader = ParquetFileReader.open(ParquetIO.file(localInput(parquetFile)))) {
+      BlockMetaData rowGroup = reader.getRowGroups().get(0);
+      BloomFilterReader bloomFilterDataReader = reader.getBloomFilterDataReader(rowGroup);
+
+      ColumnChunkMetaData intColumn = rowGroup.getColumns().get(0);
+      BloomFilter intBloomFilter = bloomFilterDataReader.readBloomFilter(intColumn);
+      Assert.assertTrue(intBloomFilter.findHash(intBloomFilter.hash(30)));

Review comment:
       The hash values might conflict so I didn't add exclusion tests.

##########
File path: site/docs/configuration.md
##########
@@ -40,6 +40,9 @@ Iceberg tables support table properties to configure table behavior, like the de
 | write.parquet.dict-size-bytes      | 2097152 (2 MB)     | Parquet dictionary page size                       |
 | write.parquet.compression-codec    | gzip               | Parquet compression codec                          |
 | write.parquet.compression-level    | null               | Parquet compression level                          |
+| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filter; If it is true, the bloom filter will be enable for all columns; If it is false, it will be disabled for all columns; It is also possible to enable it for some columns by specifying the column name within the property followed by #; For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` will enable bloom filter for all columns except `some_column` |
+| write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset |
+| write.parquet.bloom-filter-expected-ndv | (not set) | The expected number of distinct values in a column, it is used to compute the optimal size of the bloom filter; Note that if this property is not set, the bloom filter will use the maximum size; If this property is set for a column, then no need to enable the bloom filter with `write.parquet.bloom-filter-enabled` property; For example, setting `write.parquet.bloom-filter-expected-ndv#some_column=200` will enable bloom filter for `some_column` with expected number of distinct values equals to 200 |

Review comment:
       Got it!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zinking commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
zinking commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r800501311



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java
##########
@@ -259,17 +271,31 @@ private WriteBuilder createContextFunc(Function<Map<String, String>, Context> ne
           conf.set(entry.getKey(), entry.getValue());
         }
 
-        ParquetProperties parquetProperties = ParquetProperties.builder()
+        ParquetProperties.Builder propsBuilder = ParquetProperties.builder()
             .withWriterVersion(writerVersion)
             .withPageSize(pageSize)
             .withDictionaryPageSize(dictionaryPageSize)
-            .build();
+            .withMaxBloomFilterBytes(bloomFilterMaxBytes)
+            .withBloomFilterEnabled(bloomFilterEnabled);
+
+        new ColumnConfigParser()

Review comment:
       looks a bit weird, is this intended, new instance without reference to it ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r640340662



##########
File path: versions.props
##########
@@ -6,7 +6,7 @@ org.apache.hadoop:* = 2.7.3
 org.apache.hive:hive-metastore = 2.3.8
 org.apache.hive:hive-serde = 2.3.8
 org.apache.orc:* = 1.6.8
-org.apache.parquet:* = 1.11.1
+org.apache.parquet:* = 1.12.0

Review comment:
       Will remove this when #2441 is merged




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r755683704



##########
File path: site/docs/configuration.md
##########
@@ -40,6 +40,9 @@ Iceberg tables support table properties to configure table behavior, like the de
 | write.parquet.dict-size-bytes      | 2097152 (2 MB)     | Parquet dictionary page size                       |
 | write.parquet.compression-codec    | gzip               | Parquet compression codec                          |
 | write.parquet.compression-level    | null               | Parquet compression level                          |
+| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filter; If it is true, the bloom filter will be enable for all columns; If it is false, it will be disabled for all columns; It is also possible to enable it for some columns by specifying the column name within the property followed by #; For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` will enable bloom filter for all columns except `some_column` |
+| write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset |
+| write.parquet.bloom-filter-expected-ndv | (not set) | The expected number of distinct values in a column, it is used to compute the optimal size of the bloom filter; Note that if this property is not set, the bloom filter will use the maximum size; If this property is set for a column, then no need to enable the bloom filter with `write.parquet.bloom-filter-enabled` property; For example, setting `write.parquet.bloom-filter-expected-ndv#some_column=200` will enable bloom filter for `some_column` with expected number of distinct values equals to 200 |

Review comment:
       You have to do some adaption in Trino likewise. @kingeasternsun 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r745214311



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java
##########
@@ -197,9 +198,11 @@ private void startRowGroup() {
 
     PageWriteStore pageStore = pageStoreCtorParquet.newInstance(
         compressor, parquetSchema, props.getAllocator(), this.columnIndexTruncateLength);
+    Preconditions.checkState(pageStore instanceof BloomFilterWriteStore,
+        "pageStore must be an instance of BloomFilterWriteStore");

Review comment:
       Do we need this check? It throws a `ClassCastException` at line 205 anyway if the type doesn't match.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r802318164



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java
##########
@@ -259,17 +271,31 @@ private WriteBuilder createContextFunc(Function<Map<String, String>, Context> ne
           conf.set(entry.getKey(), entry.getValue());
         }
 
-        ParquetProperties parquetProperties = ParquetProperties.builder()
+        ParquetProperties.Builder propsBuilder = ParquetProperties.builder()
             .withWriterVersion(writerVersion)
             .withPageSize(pageSize)
             .withDictionaryPageSize(dictionaryPageSize)
-            .build();
+            .withMaxBloomFilterBytes(bloomFilterMaxBytes)
+            .withBloomFilterEnabled(bloomFilterEnabled);
+
+        new ColumnConfigParser()

Review comment:
       This ColumnConfigParser instance is created to parse column configs and apply to propsBuilder




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r750656506



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ColumnConfigParser.java
##########
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.function.BiConsumer;
+import java.util.function.Function;
+import org.apache.hadoop.conf.Configuration;
+
+/**
+ * TODO: Once org.apache.parquet.hadoop.ColumnConfigParser is made public, should replace this class.

Review comment:
       I think I agree with Ryan's comments that we should strive to keep this as similar to the way we setup parquet metrics as possible. I know this would actually effect the writer, while our other properties effect the metric level, but I think it makes sense to keep all the configurations alike.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r667683062



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java
##########
@@ -38,16 +38,17 @@
 import org.apache.parquet.bytes.ByteBufferAllocator;
 import org.apache.parquet.column.ColumnWriteStore;
 import org.apache.parquet.column.ParquetProperties;
-import org.apache.parquet.column.page.PageWriteStore;
+import org.apache.parquet.column.values.bloomfilter.BloomFilterWriteStore;
 import org.apache.parquet.hadoop.CodecFactory;
+import org.apache.parquet.hadoop.ColumnChunkPageWriteStore;
 import org.apache.parquet.hadoop.ParquetFileWriter;
 import org.apache.parquet.hadoop.metadata.CompressionCodecName;
 import org.apache.parquet.schema.MessageType;
 
 class ParquetWriter<T> implements FileAppender<T>, Closeable {
 
-  private static DynConstructors.Ctor<PageWriteStore> pageStoreCtorParquet = DynConstructors
-          .builder(PageWriteStore.class)
+  private static DynConstructors.Ctor<ColumnChunkPageWriteStore> pageStoreCtorParquet = DynConstructors
+          .builder(ColumnChunkPageWriteStore.class)

Review comment:
       More context:  I think we use a `DynConstructors.Ctor` before because we are trying to avoid the code dependency from the private `ColumnChunkPageWriteStore` and preparing for the parquet version upgradtion.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r747017041



##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -111,6 +111,15 @@ private TableProperties() {
   public static final String DELETE_AVRO_COMPRESSION = "write.delete.avro.compression-codec";
   public static final String AVRO_COMPRESSION_DEFAULT = "gzip";
 
+  public static final String PARQUET_BLOOM_FILTER_ENABLED = "write.parquet.bloom-filter-enabled";
+  public static final boolean PARQUET_BLOOM_FILTER_ENABLED_DEFAULT = false;

Review comment:
       Thanks for the explanation. Make sense to disable it by default. Is there any perf test from Parquet or Spark side we can refer to?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r747960248



##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -111,6 +111,15 @@ private TableProperties() {
   public static final String DELETE_AVRO_COMPRESSION = "write.delete.avro.compression-codec";
   public static final String AVRO_COMPRESSION_DEFAULT = "gzip";
 
+  public static final String PARQUET_BLOOM_FILTER_ENABLED = "write.parquet.bloom-filter-enabled";
+  public static final boolean PARQUET_BLOOM_FILTER_ENABLED_DEFAULT = false;

Review comment:
       Yes, I found [SPARK-35345](https://github.com/apache/spark/pull/32473) is adding perf test for Parquet bloom filter with Spark. And the author @huaxingao happens to be in this PR 😄 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-957039915


   @jshmchenxi
   Do you still have time to work on this PR? If not, is it OK with you that either I or other person submit a new PR based on your work? We will list you as co-author and you will get commit credits. Please let me know if this is OK with you. Thank you very much!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi edited a comment on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi edited a comment on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-958686405


   Sorry for the late reply, I was busy with something else. I've rebased the code and please help review it. I have time to complete the feature now. And thanks for reminding me! @huaxingao @ConeyLiu 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-957039915


   @jshmchenxi
   Do you still have time to work on this PR? If not, is it OK with you that either I or other person submit a new PR based on your work? We will list you as co-author and you will get commit credits. Please let me know if this is OK with you. Thank you very much!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r802830823



##########
File path: site/docs/configuration.md
##########
@@ -40,6 +40,9 @@ Iceberg tables support table properties to configure table behavior, like the de
 | write.parquet.dict-size-bytes      | 2097152 (2 MB)     | Parquet dictionary page size                       |
 | write.parquet.compression-codec    | gzip               | Parquet compression codec                          |
 | write.parquet.compression-level    | null               | Parquet compression level                          |
+| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filter; If it is true, the bloom filter will be enable for all columns; If it is false, it will be disabled for all columns; It is also possible to enable it for some columns by specifying the column name within the property followed by #; For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` will enable bloom filter for all columns except `some_column` |
+| write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset |
+| write.parquet.bloom-filter-expected-ndv | (not set) | The expected number of distinct values in a column, it is used to compute the optimal size of the bloom filter; Note that if this property is not set, the bloom filter will use the maximum size; If this property is set for a column, then no need to enable the bloom filter with `write.parquet.bloom-filter-enabled` property; For example, setting `write.parquet.bloom-filter-expected-ndv#some_column=200` will enable bloom filter for `some_column` with expected number of distinct values equals to 200 |

Review comment:
       Trino uses completely different parquet readers so the properties don't all translate over. Basically any changes we make in Iceberg Parquet readers will not effect trino.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r746566703



##########
File path: site/docs/configuration.md
##########
@@ -40,6 +40,9 @@ Iceberg tables support table properties to configure table behavior, like the de
 | write.parquet.dict-size-bytes      | 2097152 (2 MB)     | Parquet dictionary page size                       |
 | write.parquet.compression-codec    | gzip               | Parquet compression codec                          |
 | write.parquet.compression-level    | null               | Parquet compression level                          |
+| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filter; If it is true, the bloom filter will be enable for all columns; If it is false, it will be disabled for all columns; It is also possible to enable it for some columns by specifying the column name within the property followed by #; For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` will enable bloom filter for all columns except `some_column` |

Review comment:
       Yes, the format references properties defined in [parquet-hadoop](https://github.com/apache/parquet-mr/tree/master/parquet-hadoop), like `parquet.bloom.filter.enabled` and `parquet.bloom.filter.expected.ndv`. 
   However, according to Ryan's [comment](https://github.com/apache/iceberg/pull/2582#discussion_r650448248), I will change the property pattern to `write.parquet.bloom-filter.col1.enabled`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r747111782



##########
File path: site/docs/configuration.md
##########
@@ -40,6 +40,9 @@ Iceberg tables support table properties to configure table behavior, like the de
 | write.parquet.dict-size-bytes      | 2097152 (2 MB)     | Parquet dictionary page size                       |
 | write.parquet.compression-codec    | gzip               | Parquet compression codec                          |
 | write.parquet.compression-level    | null               | Parquet compression level                          |
+| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filter; If it is true, the bloom filter will be enable for all columns; If it is false, it will be disabled for all columns; It is also possible to enable it for some columns by specifying the column name within the property followed by #; For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` will enable bloom filter for all columns except `some_column` |

Review comment:
       OK, it requires changes for the class `ColumnConfigParser` since the "#" isn't used.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r667681542



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java
##########
@@ -38,16 +38,17 @@
 import org.apache.parquet.bytes.ByteBufferAllocator;
 import org.apache.parquet.column.ColumnWriteStore;
 import org.apache.parquet.column.ParquetProperties;
-import org.apache.parquet.column.page.PageWriteStore;
+import org.apache.parquet.column.values.bloomfilter.BloomFilterWriteStore;
 import org.apache.parquet.hadoop.CodecFactory;
+import org.apache.parquet.hadoop.ColumnChunkPageWriteStore;
 import org.apache.parquet.hadoop.ParquetFileWriter;
 import org.apache.parquet.hadoop.metadata.CompressionCodecName;
 import org.apache.parquet.schema.MessageType;
 
 class ParquetWriter<T> implements FileAppender<T>, Closeable {
 
-  private static DynConstructors.Ctor<PageWriteStore> pageStoreCtorParquet = DynConstructors
-          .builder(PageWriteStore.class)
+  private static DynConstructors.Ctor<ColumnChunkPageWriteStore> pageStoreCtorParquet = DynConstructors
+          .builder(ColumnChunkPageWriteStore.class)

Review comment:
       The `ColumnChunkPageWriteStore`  is also marked as `InterfaceAudience.Private` in `parquet-mr` project,  so we should not depend on this private interface explicty in java code in case of upgrading issues.
   
   I can understand that we change it from `PageWriterStore` to `ColumnChunkPageWriteStore` because we will need to pass a `BloomFilterWriteStore`  for the following `newColumnWriteStore`. Maybe we could change the way to express this logic: 
   
   ```java
   Precondition.checkState(pageStore instance of BloomFilterWriteStore, "pageStore must be an instance of BloomFilterWriteStore");
   this.writeStore = props.newColumnWriteStore(parquetSchema, pageStore,(BloomFilterWriteStore) pageStore);
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi edited a comment on pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi edited a comment on pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#issuecomment-958686405


   Sorry for the late reply, I was busy with something else. I've rebased the code and please help review it. I have time to complete the feature now. And thanks for reminding me! @huaxingao @ConeyLiu 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jshmchenxi commented on a change in pull request #2642: Core: Support writing parquet bloom filter

Posted by GitBox <gi...@apache.org>.
jshmchenxi commented on a change in pull request #2642:
URL: https://github.com/apache/iceberg/pull/2642#discussion_r753761339



##########
File path: site/docs/configuration.md
##########
@@ -40,6 +40,9 @@ Iceberg tables support table properties to configure table behavior, like the de
 | write.parquet.dict-size-bytes      | 2097152 (2 MB)     | Parquet dictionary page size                       |
 | write.parquet.compression-codec    | gzip               | Parquet compression codec                          |
 | write.parquet.compression-level    | null               | Parquet compression level                          |
+| write.parquet.bloom-filter-enabled | false | Whether to enable writing bloom filter; If it is true, the bloom filter will be enable for all columns; If it is false, it will be disabled for all columns; It is also possible to enable it for some columns by specifying the column name within the property followed by #; For example, setting both `write.parquet.bloom-filter-enabled=true` and `write.parquet.bloom-filter-enabled#some_column=false` will enable bloom filter for all columns except `some_column` |
+| write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset |
+| write.parquet.bloom-filter-expected-ndv | (not set) | The expected number of distinct values in a column, it is used to compute the optimal size of the bloom filter; Note that if this property is not set, the bloom filter will use the maximum size; If this property is set for a column, then no need to enable the bloom filter with `write.parquet.bloom-filter-enabled` property; For example, setting `write.parquet.bloom-filter-expected-ndv#some_column=200` will enable bloom filter for `some_column` with expected number of distinct values equals to 200 |

Review comment:
       Got it!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org