You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/06/02 02:55:27 UTC

[GitHub] [iceberg] huaxingao opened a new pull request, #4938: Support reading Parquet row group bloom filter

huaxingao opened a new pull request, #4938:
URL: https://github.com/apache/iceberg/pull/4938

   This is the read path of parquet row group bloom filter. The original PR is [here](https://github.com/apache/iceberg/pull/4831/)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#discussion_r890267725


##########
api/src/main/java/org/apache/iceberg/expressions/Binder.java:
##########
@@ -91,12 +91,21 @@ static Expression bind(StructType struct,
   }
 
   public static Set<Integer> boundReferences(StructType struct, List<Expression> exprs, boolean caseSensitive) {
+    return references(struct, exprs, caseSensitive, false);
+  }
+
+  public static Set<Integer> references(
+      StructType struct, List<Expression> exprs, boolean caseSensitive, boolean alreadyBound) {

Review Comment:
   Not all expressions are bound or unbound. What about just trying to bind the expression and catching the exception that is thrown if it's already bound? Then you can just move on and return the references.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on a diff in pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#discussion_r891321693


##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetBloomRowGroupFilter.java:
##########
@@ -0,0 +1,331 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;

Review Comment:
   Should I set the Parquet properties to `encodingPropsBuilder`? If the same properties is also set by iceberg properties, I will reset to overwrite.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#issuecomment-1145164701

   cc @rdblue @RussellSpitzer @kbendick @chenjunjiedada @hililiwei 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on a diff in pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#discussion_r890363938


##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetBloomRowGroupFilter.java:
##########
@@ -0,0 +1,331 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;

Review Comment:
   @rdblue Thanks for your quick reply!
   
   Seems the properties need to be passed to `InternalParquetRecordWriter` through this [encodingProps](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L305). We need to call the withXXX method explicitly e.g. [withDictionaryPageSize](https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L294) to set the property to [encodingPropsBuilder](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L483)
   
   So it seems to me that we have to call this [withBloomFilterEnabled](https://github.com/apache/iceberg/blob/1fabb2db59d6db33609f549024c4371873ad3b10/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L312) explicitly to set the bloom filter property to`encodingPropsBuilder`.  Otherwise, Parquet's `InternalParquetRecordWriter` won't be able to take it.



##########
api/src/main/java/org/apache/iceberg/expressions/Binder.java:
##########
@@ -91,12 +91,21 @@ static Expression bind(StructType struct,
   }
 
   public static Set<Integer> boundReferences(StructType struct, List<Expression> exprs, boolean caseSensitive) {
+    return references(struct, exprs, caseSensitive, false);
+  }
+
+  public static Set<Integer> references(
+      StructType struct, List<Expression> exprs, boolean caseSensitive, boolean alreadyBound) {

Review Comment:
   Sounds good to me!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
kbendick commented on PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#issuecomment-1153424648

   Thank you for making support for Parquet bloom filters happen @huaxingao! Your work is highly appreciated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on a diff in pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#discussion_r893698144


##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetBloomRowGroupFilter.java:
##########
@@ -0,0 +1,331 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;

Review Comment:
   I made some minimal write-side changes to get the test in. Hope this is OK.



##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetBloomRowGroupFilter.java:
##########
@@ -0,0 +1,331 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.math.BigDecimal;
+import java.nio.ByteBuffer;
+import java.util.Map;
+import java.util.Set;
+import java.util.UUID;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.expressions.Binder;
+import org.apache.iceberg.expressions.BoundReference;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.ExpressionVisitors;
+import org.apache.iceberg.expressions.ExpressionVisitors.BoundExpressionVisitor;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.expressions.Literal;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.TypeUtil;
+import org.apache.iceberg.types.Types.StructType;
+import org.apache.iceberg.util.DecimalUtil;
+import org.apache.iceberg.util.UUIDUtil;
+import org.apache.parquet.column.values.bloomfilter.BloomFilter;
+import org.apache.parquet.hadoop.BloomFilterReader;
+import org.apache.parquet.hadoop.metadata.BlockMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.DecimalMetadata;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.PrimitiveType;
+
+public class ParquetBloomRowGroupFilter {
+  private final Schema schema;
+  private final Expression expr;
+  private final boolean caseSensitive;
+
+  public ParquetBloomRowGroupFilter(Schema schema, Expression unbound) {
+    this(schema, unbound, true);
+  }
+
+  public ParquetBloomRowGroupFilter(Schema schema, Expression unbound, boolean caseSensitive) {
+    this.schema = schema;
+    StructType struct = schema.asStruct();
+    this.expr = Binder.bind(struct, Expressions.rewriteNot(unbound), caseSensitive);
+    this.caseSensitive = caseSensitive;
+  }
+
+  /**
+   * Tests whether the bloom for a row group may contain records that match the expression.
+   *
+   * @param fileSchema  schema for the Parquet file
+   * @param bloomReader a bloom filter reader
+   * @return false if the file cannot contain rows that match the expression, true otherwise.
+   */
+  public boolean shouldRead(MessageType fileSchema, BlockMetaData rowGroup,
+      BloomFilterReader bloomReader) {

Review Comment:
   Thanks for the reminder. I will pay attention to the style next time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#discussion_r892646688


##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetBloomRowGroupFilter.java:
##########
@@ -0,0 +1,331 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;

Review Comment:
   I see. Is it possible to get some tests working without the write-side changes? Maybe write a Parquet file directly and use name mapping? If not then let's try to make the minimal write-side changes to get the test in.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
rdblue merged PR #4938:
URL: https://github.com/apache/iceberg/pull/4938


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#discussion_r889026929


##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetBloomRowGroupFilter.java:
##########
@@ -0,0 +1,331 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.math.BigDecimal;
+import java.nio.ByteBuffer;
+import java.util.Map;
+import java.util.Set;
+import java.util.UUID;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.expressions.Binder;
+import org.apache.iceberg.expressions.BoundReference;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.ExpressionVisitors;
+import org.apache.iceberg.expressions.ExpressionVisitors.BoundExpressionVisitor;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.expressions.Literal;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.TypeUtil;
+import org.apache.iceberg.types.Types.StructType;
+import org.apache.iceberg.util.DecimalUtil;
+import org.apache.iceberg.util.UUIDUtil;
+import org.apache.parquet.column.values.bloomfilter.BloomFilter;
+import org.apache.parquet.hadoop.BloomFilterReader;
+import org.apache.parquet.hadoop.metadata.BlockMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.DecimalMetadata;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.PrimitiveType;
+
+public class ParquetBloomRowGroupFilter {
+  private final Schema schema;
+  private final Expression expr;
+  private final boolean caseSensitive;
+
+  public ParquetBloomRowGroupFilter(Schema schema, Expression unbound) {
+    this(schema, unbound, true);
+  }
+
+  public ParquetBloomRowGroupFilter(Schema schema, Expression unbound, boolean caseSensitive) {
+    this.schema = schema;
+    StructType struct = schema.asStruct();
+    this.expr = Binder.bind(struct, Expressions.rewriteNot(unbound), caseSensitive);
+    this.caseSensitive = caseSensitive;
+  }
+
+  /**
+   * Tests whether the bloom for a row group may contain records that match the expression.
+   *
+   * @param fileSchema  schema for the Parquet file
+   * @param bloomReader a bloom filter reader
+   * @return false if the file cannot contain rows that match the expression, true otherwise.
+   */
+  public boolean shouldRead(MessageType fileSchema, BlockMetaData rowGroup,
+      BloomFilterReader bloomReader) {

Review Comment:
   Style nit: the starting point of all argument lines should be aligned. That can be at 2 indents from the start of the method OR after the opening `(` to start method arguments. There should not be two different indentation levels like you have here (both after `(` and 2 indents from the method definition line).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#discussion_r890266766


##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetBloomRowGroupFilter.java:
##########
@@ -0,0 +1,331 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;

Review Comment:
   I think that properties will be passed through to the Hadoop `Configuration` automatically. Is that no longer true?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#issuecomment-1153423578

   Thank you very much @rdblue!  Also thank you @RussellSpitzer @kbendick @chenjunjiedada @hililiwei @stevenzwu 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#issuecomment-1145161321

   I don't have a good way to test the read path, but I have tested these changes with the write path on my local.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on a diff in pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#discussion_r890210450


##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetBloomRowGroupFilter.java:
##########
@@ -0,0 +1,331 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;

Review Comment:
   I tried to config using  `ParquetOutputFormat.BLOOM_FILTER_ENABLED` by replacing this [line](
   https://github.com/apache/iceberg/blob/1fabb2db59d6db33609f549024c4371873ad3b10/parquet/src/test/java/org/apache/iceberg/parquet/TestBloomRowGroupFilter.java#L201) with `.set(ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#_id", "true") `, but it doesn't work.
   
   Seems Iceberg only honors the Iceberg's properties (which are set by `Context` at [here](https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L233)), but it doesn't really take the properties set by Parquet.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#issuecomment-1153385758

   Thanks, @huaxingao!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#discussion_r889023604


##########
api/src/main/java/org/apache/iceberg/expressions/Binder.java:
##########
@@ -91,12 +91,21 @@ static Expression bind(StructType struct,
   }
 
   public static Set<Integer> boundReferences(StructType struct, List<Expression> exprs, boolean caseSensitive) {
+    return references(struct, exprs, caseSensitive, false);
+  }
+
+  public static Set<Integer> references(
+      StructType struct, List<Expression> exprs, boolean caseSensitive, boolean alreadyBound) {

Review Comment:
   Can we detect that an expression is unbound? Maybe identify named refs and return? Then we could just use one method for everything.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#discussion_r895214802


##########
api/src/main/java/org/apache/iceberg/expressions/Binder.java:
##########
@@ -96,7 +96,15 @@ public static Set<Integer> boundReferences(StructType struct, List<Expression> e
     }
     ReferenceVisitor visitor = new ReferenceVisitor();
     for (Expression expr : exprs) {
-      ExpressionVisitors.visit(bind(struct, expr, caseSensitive), visitor);
+      try {
+        ExpressionVisitors.visit(bind(struct, expr, caseSensitive), visitor);
+      } catch (IllegalStateException e) {
+        if (e.getMessage().contains("Found already bound predicate")) {

Review Comment:
   We shouldn't rely on exception messages like this.
   
   I think instead we should just add a utility to detect whether an expression is bound. Here's an implementation:
   
   ```java
     /**
      * Returns whether an expression is bound.
      * <p>
      * An expression is bound if all of its predicates are bound.
      *
      * @param expr an {@link Expression}
      * @return true if the expression is bound
      * @throws IllegalArgumentException if the expression has both bound and unbound predicates.
      */
     public static boolean isBound(Expression expr) {
       Boolean isBound = ExpressionVisitors.visit(expr, new IsBoundVisitor());
       return isBound != null ? isBound : false; // assume unbound if undetermined
     }
   
     private static class IsBoundVisitor extends ExpressionVisitors.ExpressionVisitor<Boolean> {
       @Override
       public Boolean not(Boolean result) {
         return result;
       }
   
       @Override
       public Boolean and(Boolean leftResult, Boolean rightResult) {
         return combineResults(leftResult, rightResult);
       }
   
       @Override
       public Boolean or(Boolean leftResult, Boolean rightResult) {
         return combineResults(leftResult, rightResult);
       }
   
       @Override
       public <T> Boolean predicate(BoundPredicate<T> pred) {
         return true;
       }
   
       @Override
       public <T> Boolean predicate(UnboundPredicate<T> pred) {
         return false;
       }
   
       private Boolean combineResults(Boolean isLeftBound, Boolean isRightBound) {
         if (isLeftBound != null) {
           Preconditions.checkArgument(isRightBound == null || isLeftBound.equals(isRightBound),
               "Found partially bound expression");
           return isLeftBound;
         } else {
           return isRightBound;
         }
       }
     }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#discussion_r895246637


##########
api/src/main/java/org/apache/iceberg/expressions/Binder.java:
##########
@@ -96,7 +96,15 @@ public static Set<Integer> boundReferences(StructType struct, List<Expression> e
     }
     ReferenceVisitor visitor = new ReferenceVisitor();
     for (Expression expr : exprs) {
-      ExpressionVisitors.visit(bind(struct, expr, caseSensitive), visitor);
+      try {
+        ExpressionVisitors.visit(bind(struct, expr, caseSensitive), visitor);
+      } catch (IllegalStateException e) {
+        if (e.getMessage().contains("Found already bound predicate")) {

Review Comment:
   No problem! Thanks for all your work on this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#discussion_r889022817


##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetBloomRowGroupFilter.java:
##########
@@ -0,0 +1,331 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;

Review Comment:
   Is it possible to include the unit test in this PR? I think all you'd need to do is configure the bloom filter settings using the Parquet settings in a Hadoop Configuration rather than through the Iceberg write settings.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on a diff in pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#discussion_r890211588


##########
api/src/main/java/org/apache/iceberg/expressions/Binder.java:
##########
@@ -91,12 +91,21 @@ static Expression bind(StructType struct,
   }
 
   public static Set<Integer> boundReferences(StructType struct, List<Expression> exprs, boolean caseSensitive) {
+    return references(struct, exprs, caseSensitive, false);
+  }
+
+  public static Set<Integer> references(
+      StructType struct, List<Expression> exprs, boolean caseSensitive, boolean alreadyBound) {

Review Comment:
   Can i just use `instanceof Unbound` to detect if an expression is unbound? Something like this:
   
   ```
     private static boolean isUnbound(Expression expr) {
       switch (expr.op()) {
         case TRUE:
           return false;
         case FALSE:
           return false;
         case NOT:
           Not not = (Not) expr;
           return isUnbound(not.child());
         case AND:
           And and = (And) expr;
           return isUnbound(and.left()) || isUnbound(and.right());
         case OR:
           Or or = (Or) expr;
           return isUnbound(or.left()) || isUnbound(or.right());
         default:
           return expr instanceof Unbound;
       }
     }
   ```
   Then the method `boundReferences` can be
   ```
     public static Set<Integer> boundReferences(StructType struct, List<Expression> exprs, boolean caseSensitive) {
       if (exprs == null) {
         return ImmutableSet.of();
       }
       ReferenceVisitor visitor = new ReferenceVisitor();
       for (Expression expr : exprs) {
         if (isUnbound(expr)) {
           ExpressionVisitors.visit(bind(struct, expr, caseSensitive), visitor);
         } else {
           ExpressionVisitors.visit(expr, visitor);
         }
       }
       return visitor.references;
     }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#discussion_r895215105


##########
parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java:
##########
@@ -291,9 +292,17 @@ public <D> FileAppender<D> build() throws IOException {
             .withWriteMode(writeMode)
             .withRowGroupSize(rowGroupSize)
             .withPageSize(pageSize)
-            .withDictionaryPageSize(dictionaryPageSize)
-            .build(),
-            metricsConfig);
+            .withDictionaryPageSize(dictionaryPageSize);
+        // Todo: The following code needs to be improved in the bloom filter write path PR.

Review Comment:
   I think this is okay for now to get this in with tests. Thanks, @huaxingao!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huaxingao commented on a diff in pull request #4938: Support reading Parquet row group bloom filter

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #4938:
URL: https://github.com/apache/iceberg/pull/4938#discussion_r895244568


##########
api/src/main/java/org/apache/iceberg/expressions/Binder.java:
##########
@@ -96,7 +96,15 @@ public static Set<Integer> boundReferences(StructType struct, List<Expression> e
     }
     ReferenceVisitor visitor = new ReferenceVisitor();
     for (Expression expr : exprs) {
-      ExpressionVisitors.visit(bind(struct, expr, caseSensitive), visitor);
+      try {
+        ExpressionVisitors.visit(bind(struct, expr, caseSensitive), visitor);
+      } catch (IllegalStateException e) {
+        if (e.getMessage().contains("Found already bound predicate")) {

Review Comment:
   Sounds great! Thank you very much for the implementation!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org