You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/04/12 20:59:29 UTC

[GitHub] [iceberg] yyanyy opened a new pull request #2464: [Core] exclude NaN from upper/lower bound of floating columns in Parquet/ORC

yyanyy opened a new pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464


   - This PR implements the behavior described in #348, to exclude NaN from upper/lower bound of double/float type columns in parquet/orc. 
   - Support for Avro metrics can be found in #1963, and has some overlap with this PR (especially on `FieldMetrics` related classes)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#discussion_r658338522



##########
File path: data/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java
##########
@@ -235,24 +235,30 @@ private FloatWriter(int id) {
     @Override
     public void nonNullWrite(int rowId, Float data, ColumnVector output) {
       ((DoubleColumnVector) output).vector[rowId] = data;
-      if (Float.isNaN(data)) {
-        nanCount++;
-      }
+      floatFieldMetricsBuilder.addValue(data);
+    }
+
+    @Override
+    public void nullWrite() {
+      nullValueCount++;

Review comment:
       It seems like it may be a good idea to move null tracking into the builder, so you could just run `metricsBuilder.addValue(null)` or `metricsBuilder.addNullValue()`. Then you wouldn't need to rebuild the metrics later.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#issuecomment-868013996


   I made a few comments, but I don't think there are any issues with this that need to be fixed so I merged it. Thanks @yyanyy!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#discussion_r655732423



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -147,9 +158,39 @@ public static Metrics footerMetrics(ParquetMetadata metadata, Stream<FieldMetric
       upperBounds.remove(fieldId);
     }
 
+    updateFromFieldMetrics(fieldMetricsMap, metricsConfig, fileSchema, lowerBounds, upperBounds);
+
     return new Metrics(rowCount, columnSizes, valueCounts, nullValueCounts,
-        MetricsUtil.createNanValueCounts(fieldMetrics, metricsConfig, fileSchema),
-        toBufferMap(fileSchema, lowerBounds), toBufferMap(fileSchema, upperBounds));
+        MetricsUtil.createNanValueCounts(fieldMetricsMap.values().stream(), metricsConfig, fileSchema),
+        toBufferMap(fileSchema, lowerBounds),
+        toBufferMap(fileSchema, upperBounds));
+  }
+
+  private static void updateFromFieldMetrics(
+      Map<Integer, FieldMetrics> idToFieldMetricsMap, MetricsConfig metricsConfig, Schema schema,
+      Map<Integer, Literal<?>> lowerBounds, Map<Integer, Literal<?>> upperBounds) {
+    idToFieldMetricsMap.entrySet().forEach(entry -> {
+      int fieldId = entry.getKey();
+      FieldMetrics metrics = entry.getValue();
+      MetricsMode metricsMode = MetricsUtil.metricsMode(schema, metricsConfig, fieldId);
+
+      // only check for MetricsModes.None, since we don't truncate float/double values.
+      if (metricsMode != MetricsModes.None.get()) {
+        if (metrics.upperBound() == null) {

Review comment:
       Can `metrics` expose a better way to detect this than `null`? What about `hasBounds()`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#discussion_r657529477



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -99,6 +105,9 @@ public static Metrics footerMetrics(ParquetMetadata metadata, Stream<FieldMetric
     MessageType parquetTypeWithIds = getParquetTypeWithIds(metadata, nameMapping);
     Schema fileSchema = ParquetSchemaUtil.convertAndPrune(parquetTypeWithIds);
 
+    Map<Integer, FieldMetrics> fieldMetricsMap = fieldMetrics.collect(

Review comment:
       I will modify the ones for Parquet and ORC, and leave the Avro related ones as is, since we are going to modify them in the Avro metrics support PR anyway. Let me know if it works! 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#discussion_r655731203



##########
File path: core/src/main/java/org/apache/iceberg/DoubleFieldMetrics.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+/**
+ * Iceberg internally tracked field level metrics, used by Parquet and ORC writers only.
+ * <p>
+ * Parquet/ORC keeps track of most metrics in file statistics, and only NaN counter is actually tracked by writers.
+ * This wrapper ensures that metrics not being updated by those writers will not be incorrectly used, by throwing
+ * exceptions when they are accessed.
+ */
+public class DoubleFieldMetrics extends FieldMetrics<Double> {
+
+  private DoubleFieldMetrics(int id, long nanValueCount, Double lowerBound, Double upperBound) {
+    super(id, 0L, 0L, nanValueCount, lowerBound, upperBound);
+  }
+
+  @Override
+  public long valueCount() {
+    throw new IllegalStateException("Shouldn't access this method, as this metric is tracked in file statistics. ");
+  }
+
+  @Override
+  public long nullValueCount() {
+    throw new IllegalStateException("Shouldn't access this method, as this metric is tracked in file statistics. ");
+  }
+
+  public static class Builder {
+    private final int id;
+    private long nanValueCount = 0;
+    private Double lowerBound = null;
+    private Double upperBound = null;
+
+    public Builder(int id) {
+      this.id = id;
+    }
+
+    public void addValue(double value) {
+      if (Double.isNaN(value)) {
+        this.nanValueCount++;
+      } else {
+        if (lowerBound == null || Double.compare(value, lowerBound) < 0) {

Review comment:
       I mentioned this on the Avro PR, but I think it would be better to make `lowerBound` and `upperBound` primitives and get rid of the null checks here. Then you can use `valueCount > 0` to determine whether to return the bounds or null.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#discussion_r655729747



##########
File path: core/src/main/java/org/apache/iceberg/DoubleFieldMetrics.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+/**
+ * Iceberg internally tracked field level metrics, used by Parquet and ORC writers only.
+ * <p>
+ * Parquet/ORC keeps track of most metrics in file statistics, and only NaN counter is actually tracked by writers.
+ * This wrapper ensures that metrics not being updated by those writers will not be incorrectly used, by throwing
+ * exceptions when they are accessed.
+ */
+public class DoubleFieldMetrics extends FieldMetrics<Double> {
+
+  private DoubleFieldMetrics(int id, long nanValueCount, Double lowerBound, Double upperBound) {
+    super(id, 0L, 0L, nanValueCount, lowerBound, upperBound);
+  }
+
+  @Override
+  public long valueCount() {
+    throw new IllegalStateException("Shouldn't access this method, as this metric is tracked in file statistics. ");

Review comment:
       We typically prefer `UnsupportedOperationException` for cases like this rather than `IllegalStateException`, which is described in javadoc as an exception that "Signals that a method has been invoked at an illegal or inappropriate time". While you could argue that any time is an "inappropriate time", I think unsupported describes the situation better.

##########
File path: core/src/main/java/org/apache/iceberg/DoubleFieldMetrics.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+/**
+ * Iceberg internally tracked field level metrics, used by Parquet and ORC writers only.
+ * <p>
+ * Parquet/ORC keeps track of most metrics in file statistics, and only NaN counter is actually tracked by writers.
+ * This wrapper ensures that metrics not being updated by those writers will not be incorrectly used, by throwing
+ * exceptions when they are accessed.
+ */
+public class DoubleFieldMetrics extends FieldMetrics<Double> {
+
+  private DoubleFieldMetrics(int id, long nanValueCount, Double lowerBound, Double upperBound) {
+    super(id, 0L, 0L, nanValueCount, lowerBound, upperBound);
+  }
+
+  @Override
+  public long valueCount() {
+    throw new IllegalStateException("Shouldn't access this method, as this metric is tracked in file statistics. ");

Review comment:
       Thinking about this a bit more, would it be a bad thing to collect the stats so that future uses don't need to go back and modify this class? Seems like there isn't much benefit to skipping the implementation for just these two methods.

##########
File path: core/src/main/java/org/apache/iceberg/DoubleFieldMetrics.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+/**
+ * Iceberg internally tracked field level metrics, used by Parquet and ORC writers only.
+ * <p>
+ * Parquet/ORC keeps track of most metrics in file statistics, and only NaN counter is actually tracked by writers.
+ * This wrapper ensures that metrics not being updated by those writers will not be incorrectly used, by throwing
+ * exceptions when they are accessed.
+ */
+public class DoubleFieldMetrics extends FieldMetrics<Double> {
+
+  private DoubleFieldMetrics(int id, long nanValueCount, Double lowerBound, Double upperBound) {
+    super(id, 0L, 0L, nanValueCount, lowerBound, upperBound);
+  }
+
+  @Override
+  public long valueCount() {
+    throw new IllegalStateException("Shouldn't access this method, as this metric is tracked in file statistics. ");
+  }
+
+  @Override
+  public long nullValueCount() {
+    throw new IllegalStateException("Shouldn't access this method, as this metric is tracked in file statistics. ");
+  }
+
+  public static class Builder {
+    private final int id;
+    private long nanValueCount = 0;
+    private Double lowerBound = null;
+    private Double upperBound = null;
+
+    public Builder(int id) {
+      this.id = id;
+    }
+
+    public void addValue(double value) {
+      if (Double.isNaN(value)) {
+        this.nanValueCount++;
+      } else {
+        if (lowerBound == null || Double.compare(value, lowerBound) < 0) {

Review comment:
       I mentioned this on the Avro PR, but I think it would be better to make `lowerBound` and `upperBound` primitives and get rid of the null checks here. Then you can use `valueCount > 0` to determine whether to return the bounds or null.

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -99,6 +105,9 @@ public static Metrics footerMetrics(ParquetMetadata metadata, Stream<FieldMetric
     MessageType parquetTypeWithIds = getParquetTypeWithIds(metadata, nameMapping);
     Schema fileSchema = ParquetSchemaUtil.convertAndPrune(parquetTypeWithIds);
 
+    Map<Integer, FieldMetrics> fieldMetricsMap = fieldMetrics.collect(

Review comment:
       `FieldMetrics` is parameterized, right? I think this should be `FieldMetrics<?>`.

##########
File path: core/src/main/java/org/apache/iceberg/DoubleFieldMetrics.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+/**
+ * Iceberg internally tracked field level metrics, used by Parquet and ORC writers only.
+ * <p>
+ * Parquet/ORC keeps track of most metrics in file statistics, and only NaN counter is actually tracked by writers.
+ * This wrapper ensures that metrics not being updated by those writers will not be incorrectly used, by throwing
+ * exceptions when they are accessed.
+ */
+public class DoubleFieldMetrics extends FieldMetrics<Double> {
+
+  private DoubleFieldMetrics(int id, long nanValueCount, Double lowerBound, Double upperBound) {
+    super(id, 0L, 0L, nanValueCount, lowerBound, upperBound);
+  }
+
+  @Override
+  public long valueCount() {
+    throw new IllegalStateException("Shouldn't access this method, as this metric is tracked in file statistics. ");
+  }
+
+  @Override
+  public long nullValueCount() {
+    throw new IllegalStateException("Shouldn't access this method, as this metric is tracked in file statistics. ");
+  }
+
+  public static class Builder {
+    private final int id;
+    private long nanValueCount = 0;
+    private Double lowerBound = null;
+    private Double upperBound = null;
+
+    public Builder(int id) {
+      this.id = id;
+    }
+
+    public void addValue(double value) {
+      if (Double.isNaN(value)) {
+        this.nanValueCount++;
+      } else {
+        if (lowerBound == null || Double.compare(value, lowerBound) < 0) {

Review comment:
       I mentioned this on the Avro PR, but I think it would be better to make `lowerBound` and `upperBound` primitives and get rid of the null checks here. Then you can use `valueCount - nullValueCount > 0` to determine whether to return the bounds or null.

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -147,9 +158,39 @@ public static Metrics footerMetrics(ParquetMetadata metadata, Stream<FieldMetric
       upperBounds.remove(fieldId);
     }
 
+    updateFromFieldMetrics(fieldMetricsMap, metricsConfig, fileSchema, lowerBounds, upperBounds);
+
     return new Metrics(rowCount, columnSizes, valueCounts, nullValueCounts,
-        MetricsUtil.createNanValueCounts(fieldMetrics, metricsConfig, fileSchema),
-        toBufferMap(fileSchema, lowerBounds), toBufferMap(fileSchema, upperBounds));
+        MetricsUtil.createNanValueCounts(fieldMetricsMap.values().stream(), metricsConfig, fileSchema),
+        toBufferMap(fileSchema, lowerBounds),
+        toBufferMap(fileSchema, upperBounds));
+  }
+
+  private static void updateFromFieldMetrics(
+      Map<Integer, FieldMetrics> idToFieldMetricsMap, MetricsConfig metricsConfig, Schema schema,
+      Map<Integer, Literal<?>> lowerBounds, Map<Integer, Literal<?>> upperBounds) {
+    idToFieldMetricsMap.entrySet().forEach(entry -> {
+      int fieldId = entry.getKey();
+      FieldMetrics metrics = entry.getValue();
+      MetricsMode metricsMode = MetricsUtil.metricsMode(schema, metricsConfig, fieldId);
+
+      // only check for MetricsModes.None, since we don't truncate float/double values.
+      if (metricsMode != MetricsModes.None.get()) {
+        if (metrics.upperBound() == null) {
+          // upper and lower bounds will both null or neither
+          lowerBounds.remove(fieldId);
+          upperBounds.remove(fieldId);
+        } else if (metrics.upperBound() instanceof Float) {

Review comment:
       I think this should be `metrics instanceof FloatFieldMetrics` instead of checking the bound.

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -147,9 +158,39 @@ public static Metrics footerMetrics(ParquetMetadata metadata, Stream<FieldMetric
       upperBounds.remove(fieldId);
     }
 
+    updateFromFieldMetrics(fieldMetricsMap, metricsConfig, fileSchema, lowerBounds, upperBounds);
+
     return new Metrics(rowCount, columnSizes, valueCounts, nullValueCounts,
-        MetricsUtil.createNanValueCounts(fieldMetrics, metricsConfig, fileSchema),
-        toBufferMap(fileSchema, lowerBounds), toBufferMap(fileSchema, upperBounds));
+        MetricsUtil.createNanValueCounts(fieldMetricsMap.values().stream(), metricsConfig, fileSchema),
+        toBufferMap(fileSchema, lowerBounds),
+        toBufferMap(fileSchema, upperBounds));
+  }
+
+  private static void updateFromFieldMetrics(
+      Map<Integer, FieldMetrics> idToFieldMetricsMap, MetricsConfig metricsConfig, Schema schema,
+      Map<Integer, Literal<?>> lowerBounds, Map<Integer, Literal<?>> upperBounds) {
+    idToFieldMetricsMap.entrySet().forEach(entry -> {
+      int fieldId = entry.getKey();
+      FieldMetrics metrics = entry.getValue();
+      MetricsMode metricsMode = MetricsUtil.metricsMode(schema, metricsConfig, fieldId);
+
+      // only check for MetricsModes.None, since we don't truncate float/double values.
+      if (metricsMode != MetricsModes.None.get()) {
+        if (metrics.upperBound() == null) {

Review comment:
       Can `metrics` expose a better way to detect this than `null`? What about `hasBounds()`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#discussion_r657529350



##########
File path: core/src/main/java/org/apache/iceberg/DoubleFieldMetrics.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+/**
+ * Iceberg internally tracked field level metrics, used by Parquet and ORC writers only.
+ * <p>
+ * Parquet/ORC keeps track of most metrics in file statistics, and only NaN counter is actually tracked by writers.
+ * This wrapper ensures that metrics not being updated by those writers will not be incorrectly used, by throwing
+ * exceptions when they are accessed.
+ */
+public class DoubleFieldMetrics extends FieldMetrics<Double> {
+
+  private DoubleFieldMetrics(int id, long nanValueCount, Double lowerBound, Double upperBound) {
+    super(id, 0L, 0L, nanValueCount, lowerBound, upperBound);
+  }
+
+  @Override
+  public long valueCount() {
+    throw new IllegalStateException("Shouldn't access this method, as this metric is tracked in file statistics. ");

Review comment:
       We had to spend some extra effort on nullable writers to make them update null count and value count in order to get the right results for the unimplemented methods, but that's not complicated and will update. 

##########
File path: core/src/main/java/org/apache/iceberg/DoubleFieldMetrics.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+/**
+ * Iceberg internally tracked field level metrics, used by Parquet and ORC writers only.
+ * <p>
+ * Parquet/ORC keeps track of most metrics in file statistics, and only NaN counter is actually tracked by writers.
+ * This wrapper ensures that metrics not being updated by those writers will not be incorrectly used, by throwing
+ * exceptions when they are accessed.
+ */
+public class DoubleFieldMetrics extends FieldMetrics<Double> {
+
+  private DoubleFieldMetrics(int id, long nanValueCount, Double lowerBound, Double upperBound) {
+    super(id, 0L, 0L, nanValueCount, lowerBound, upperBound);
+  }
+
+  @Override
+  public long valueCount() {
+    throw new IllegalStateException("Shouldn't access this method, as this metric is tracked in file statistics. ");
+  }
+
+  @Override
+  public long nullValueCount() {
+    throw new IllegalStateException("Shouldn't access this method, as this metric is tracked in file statistics. ");
+  }
+
+  public static class Builder {
+    private final int id;
+    private long nanValueCount = 0;
+    private Double lowerBound = null;
+    private Double upperBound = null;
+
+    public Builder(int id) {
+      this.id = id;
+    }
+
+    public void addValue(double value) {
+      if (Double.isNaN(value)) {
+        this.nanValueCount++;
+      } else {
+        if (lowerBound == null || Double.compare(value, lowerBound) < 0) {

Review comment:
       Sorry haven't checked the avro PR yet so didn't see that comment... writers creating this already has null values filtered out so null value count is always 0, but the idea could still work. Will update!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#discussion_r655732240



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -147,9 +158,39 @@ public static Metrics footerMetrics(ParquetMetadata metadata, Stream<FieldMetric
       upperBounds.remove(fieldId);
     }
 
+    updateFromFieldMetrics(fieldMetricsMap, metricsConfig, fileSchema, lowerBounds, upperBounds);
+
     return new Metrics(rowCount, columnSizes, valueCounts, nullValueCounts,
-        MetricsUtil.createNanValueCounts(fieldMetrics, metricsConfig, fileSchema),
-        toBufferMap(fileSchema, lowerBounds), toBufferMap(fileSchema, upperBounds));
+        MetricsUtil.createNanValueCounts(fieldMetricsMap.values().stream(), metricsConfig, fileSchema),
+        toBufferMap(fileSchema, lowerBounds),
+        toBufferMap(fileSchema, upperBounds));
+  }
+
+  private static void updateFromFieldMetrics(
+      Map<Integer, FieldMetrics> idToFieldMetricsMap, MetricsConfig metricsConfig, Schema schema,
+      Map<Integer, Literal<?>> lowerBounds, Map<Integer, Literal<?>> upperBounds) {
+    idToFieldMetricsMap.entrySet().forEach(entry -> {
+      int fieldId = entry.getKey();
+      FieldMetrics metrics = entry.getValue();
+      MetricsMode metricsMode = MetricsUtil.metricsMode(schema, metricsConfig, fieldId);
+
+      // only check for MetricsModes.None, since we don't truncate float/double values.
+      if (metricsMode != MetricsModes.None.get()) {
+        if (metrics.upperBound() == null) {
+          // upper and lower bounds will both null or neither
+          lowerBounds.remove(fieldId);
+          upperBounds.remove(fieldId);
+        } else if (metrics.upperBound() instanceof Float) {

Review comment:
       I think this should be `metrics instanceof FloatFieldMetrics` instead of checking the bound.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#discussion_r657529613



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -147,9 +158,39 @@ public static Metrics footerMetrics(ParquetMetadata metadata, Stream<FieldMetric
       upperBounds.remove(fieldId);
     }
 
+    updateFromFieldMetrics(fieldMetricsMap, metricsConfig, fileSchema, lowerBounds, upperBounds);
+
     return new Metrics(rowCount, columnSizes, valueCounts, nullValueCounts,
-        MetricsUtil.createNanValueCounts(fieldMetrics, metricsConfig, fileSchema),
-        toBufferMap(fileSchema, lowerBounds), toBufferMap(fileSchema, upperBounds));
+        MetricsUtil.createNanValueCounts(fieldMetricsMap.values().stream(), metricsConfig, fileSchema),
+        toBufferMap(fileSchema, lowerBounds),
+        toBufferMap(fileSchema, upperBounds));
+  }
+
+  private static void updateFromFieldMetrics(
+      Map<Integer, FieldMetrics> idToFieldMetricsMap, MetricsConfig metricsConfig, Schema schema,
+      Map<Integer, Literal<?>> lowerBounds, Map<Integer, Literal<?>> upperBounds) {
+    idToFieldMetricsMap.entrySet().forEach(entry -> {
+      int fieldId = entry.getKey();
+      FieldMetrics metrics = entry.getValue();
+      MetricsMode metricsMode = MetricsUtil.metricsMode(schema, metricsConfig, fieldId);
+
+      // only check for MetricsModes.None, since we don't truncate float/double values.
+      if (metricsMode != MetricsModes.None.get()) {
+        if (metrics.upperBound() == null) {
+          // upper and lower bounds will both null or neither
+          lowerBounds.remove(fieldId);
+          upperBounds.remove(fieldId);
+        } else if (metrics.upperBound() instanceof Float) {

Review comment:
       Actually due to having to modify the option writer to return correct value count/null value count, this can go different directions:
   - we either let the option writer to explicitly check for float/double writer and create the corresponding field metric object type (also have to make float/double field metrics' constructor public)
   - or we create the generic field metric type in option writer, and continue to check instanceof on bounds here
   
   For now I decided latter for better flexibility, but don't have strong opinion either way. Please let me know your thoughts!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#discussion_r655730502



##########
File path: core/src/main/java/org/apache/iceberg/DoubleFieldMetrics.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+/**
+ * Iceberg internally tracked field level metrics, used by Parquet and ORC writers only.
+ * <p>
+ * Parquet/ORC keeps track of most metrics in file statistics, and only NaN counter is actually tracked by writers.
+ * This wrapper ensures that metrics not being updated by those writers will not be incorrectly used, by throwing
+ * exceptions when they are accessed.
+ */
+public class DoubleFieldMetrics extends FieldMetrics<Double> {
+
+  private DoubleFieldMetrics(int id, long nanValueCount, Double lowerBound, Double upperBound) {
+    super(id, 0L, 0L, nanValueCount, lowerBound, upperBound);
+  }
+
+  @Override
+  public long valueCount() {
+    throw new IllegalStateException("Shouldn't access this method, as this metric is tracked in file statistics. ");

Review comment:
       Thinking about this a bit more, would it be a bad thing to collect the stats so that future uses don't need to go back and modify this class? Seems like there isn't much benefit to skipping the implementation for just these two methods.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#discussion_r655731608



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -99,6 +105,9 @@ public static Metrics footerMetrics(ParquetMetadata metadata, Stream<FieldMetric
     MessageType parquetTypeWithIds = getParquetTypeWithIds(metadata, nameMapping);
     Schema fileSchema = ParquetSchemaUtil.convertAndPrune(parquetTypeWithIds);
 
+    Map<Integer, FieldMetrics> fieldMetricsMap = fieldMetrics.collect(

Review comment:
       `FieldMetrics` is parameterized, right? I think this should be `FieldMetrics<?>`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2464: [Core] exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#discussion_r611953265



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -85,6 +89,7 @@ public static Metrics footerMetrics(ParquetMetadata metadata, Stream<FieldMetric
     return footerMetrics(metadata, fieldMetrics, metricsConfig, null);
   }
 
+  @SuppressWarnings("checkstyle:CyclomaticComplexity")

Review comment:
       Decided to not refactor this method (at least not in this PR) since in this method we are essentially updating all column stats, and by moving logic directly out of the method we need to pass in those 6 stats map with some extra info, which could easily make the helper method exceeding 10 arguments. 

##########
File path: data/src/test/java/org/apache/iceberg/TestMergingMetrics.java
##########
@@ -105,74 +117,152 @@ public TestMergingMetrics(FileFormat fileFormat) {
   @Test
   public void verifySingleRecordMetric() throws Exception {
     Record record = GenericRecord.create(SCHEMA);
-    record.setField("id", 3);
-    record.setField("float", Float.NaN); // FLOAT_FIELD - 1
-    record.setField("double", Double.NaN); // DOUBLE_FIELD - 1
-    record.setField("floatlist", ImmutableList.of(3.3F, 2.8F, Float.NaN, -25.1F, Float.NaN)); // FLOAT_LIST - 2
-    record.setField("map1", ImmutableMap.of(Float.NaN, "a", 0F, "b")); // MAP_FIELD_1 - 1
-    record.setField("map2", ImmutableMap.of(
+    record.setField(ID_FIELD.name(), 3);
+    record.setField(FLOAT_FIELD.name(), Float.NaN); // FLOAT_FIELD - 1
+    record.setField(DOUBLE_FIELD.name(), Double.NaN); // DOUBLE_FIELD - 1
+    record.setField(FLOAT_LIST.name(), ImmutableList.of(3.3F, 2.8F, Float.NaN, -25.1F, Float.NaN)); // FLOAT_LIST - 2
+    record.setField(MAP_FIELD_1.name(), ImmutableMap.of(Float.NaN, "a", 0F, "b")); // MAP_FIELD_1 - 1
+    record.setField(MAP_FIELD_2.name(), ImmutableMap.of(
         0, 0D, 1, Double.NaN, 2, 2D, 3, Double.NaN, 4, Double.NaN)); // MAP_FIELD_2 - 3
 
     FileAppender<T> appender = writeAndGetAppender(ImmutableList.of(record));
-    Map<Integer, Long> nanValueCount = appender.metrics().nanValueCounts();
+    Metrics metrics = appender.metrics();
+    Map<Integer, Long> nanValueCount = metrics.nanValueCounts();
+    Map<Integer, ByteBuffer> upperBounds = metrics.upperBounds();
+    Map<Integer, ByteBuffer> lowerBounds = metrics.lowerBounds();
 
     assertNaNCountMatch(1L, nanValueCount, FLOAT_FIELD);
     assertNaNCountMatch(1L, nanValueCount, DOUBLE_FIELD);
     assertNaNCountMatch(2L, nanValueCount, FLOAT_LIST);
     assertNaNCountMatch(1L, nanValueCount, MAP_FIELD_1);
     assertNaNCountMatch(3L, nanValueCount, MAP_FIELD_2);
-  }
 
-  private void assertNaNCountMatch(Long expected, Map<Integer, Long> nanValueCount, Types.NestedField field) {
-    Assert.assertEquals(
-        String.format("NaN count for field %s does not match expected", field.name()),
-        expected, nanValueCount.get(FIELDS_WITH_NAN_COUNT_TO_ID.get(field)));
+    assertBoundValueMatch(null, upperBounds, FLOAT_FIELD);
+    assertBoundValueMatch(null, upperBounds, DOUBLE_FIELD);
+    assertBoundValueMatch(3.3F, upperBounds, FLOAT_LIST);
+    assertBoundValueMatch(0F, upperBounds, MAP_FIELD_1);
+    assertBoundValueMatch(2D, upperBounds, MAP_FIELD_2);
+
+    assertBoundValueMatch(null, lowerBounds, FLOAT_FIELD);
+    assertBoundValueMatch(null, lowerBounds, DOUBLE_FIELD);
+    assertBoundValueMatch(-25.1F, lowerBounds, FLOAT_LIST);
+    assertBoundValueMatch(0F, lowerBounds, MAP_FIELD_1);
+    assertBoundValueMatch(0D, lowerBounds, MAP_FIELD_2);
   }
 
   @Test
   public void verifyRandomlyGeneratedRecordsMetric() throws Exception {
-    List<Record> recordList = RandomGenericData.generate(SCHEMA, 50, 250L);
-
+    // too big of the record count will more likely to make all upper/lower bounds +/-infinity,
+    // which makes the tests easier to pass
+    List<Record> recordList = RandomGenericData.generate(SCHEMA, 5, 250L);
     FileAppender<T> appender = writeAndGetAppender(recordList);
-    Map<Integer, Long> nanValueCount = appender.metrics().nanValueCounts();
 
-    FIELDS_WITH_NAN_COUNT_TO_ID.forEach((key, value) -> Assert.assertEquals(
-        String.format("NaN count for field %s does not match expected", key.name()),
-        getExpectedNaNCount(recordList, key),
-        nanValueCount.get(value)));
+    Map<Types.NestedField, AtomicReference<Number>> expectedUpperBounds = new HashMap<>();
+    Map<Types.NestedField, AtomicReference<Number>> expectedLowerBounds = new HashMap<>();
+    Map<Types.NestedField, AtomicLong> expectedNaNCount = new HashMap<>();
+
+    populateExpectedValues(recordList, expectedUpperBounds, expectedLowerBounds, expectedNaNCount);
+
+    Metrics metrics = appender.metrics();
+    expectedUpperBounds.forEach((key, value) -> assertBoundValueMatch(value.get(), metrics.upperBounds(), key));
+    expectedLowerBounds.forEach((key, value) -> assertBoundValueMatch(value.get(), metrics.lowerBounds(), key));
+    expectedNaNCount.forEach((key, value) -> assertNaNCountMatch(value.get(), metrics.nanValueCounts(), key));
 
     SCHEMA.columns().stream()
         .filter(column -> !FIELDS_WITH_NAN_COUNT_TO_ID.containsKey(column))
         .map(Types.NestedField::fieldId)
-        .forEach(id -> Assert.assertNull("NaN count for field %s should be null", nanValueCount.get(id)));
+        .forEach(id -> Assert.assertNull("NaN count for field %s should be null",
+            metrics.nanValueCounts().get(id)));
+  }
+
+  private void assertNaNCountMatch(Long expected, Map<Integer, Long> nanValueCount, Types.NestedField field) {
+    Assert.assertEquals(
+        String.format("NaN count for field %s does not match expected", field.name()),
+        expected, nanValueCount.get(FIELDS_WITH_NAN_COUNT_TO_ID.get(field)));
+  }
+
+  private void assertBoundValueMatch(Number expected, Map<Integer, ByteBuffer> boundMap, Types.NestedField field) {
+    if (field.type().isNestedType() && fileFormat == FileFormat.ORC) {
+      // we don't update floating column bounds values within ORC nested columns

Review comment:
       We actually could make it work for Orc but I'm not sure if it's worth it given metrics in nested types are not recorded/used normally anyway. The reason for the different behavior in Parquet/Orc is the following:
   
   Parquet tracks column stats per block, and when computing metrics we need to update min/max when iterating through all blocks; to avoid doing so for floating point columns that we track min/max ourselves now, we update min/max for these columns after all blocks have been processed. 
   
   However Orc tracks stats per column so we could directly inject the new bound value when we evaluates every column stats. And since Orc doesn't track nested column stats, we never look at columns within nested fields. 
   

##########
File path: core/src/main/java/org/apache/iceberg/FieldMetrics.java
##########
@@ -20,25 +20,23 @@
 package org.apache.iceberg;
 
 
-import java.nio.ByteBuffer;
-
 /**
  * Iceberg internally tracked field level metrics.
  */
-public class FieldMetrics {
+public class FieldMetrics<T> {

Review comment:
       Change in this class overlaps with what #1963 contains. For now this class is not directly used/created by Parquet/ORC, and will be for Avro it will, thus should be safe to change. Same goes for `FloatFieldMetrics`. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#discussion_r658253026



##########
File path: core/src/main/java/org/apache/iceberg/FieldMetrics.java
##########
@@ -78,14 +76,21 @@ public long nanValueCount() {
   /**
    * Returns the lower bound value of this field.
    */
-  public ByteBuffer lowerBound() {
+  public T lowerBound() {
     return lowerBound;
   }
 
   /**
    * Returns the upper bound value of this field.
    */
-  public ByteBuffer upperBound() {
+  public T upperBound() {
     return upperBound;
   }
+
+  /**
+   * Returns if the metrics has bounds (i.e. there is at least non-null value for this field)
+   */
+  public boolean hasBounds() {
+    return upperBound != null;

Review comment:
       Oh, nevermind, I see that `build` handles that case.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#discussion_r655729747



##########
File path: core/src/main/java/org/apache/iceberg/DoubleFieldMetrics.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+/**
+ * Iceberg internally tracked field level metrics, used by Parquet and ORC writers only.
+ * <p>
+ * Parquet/ORC keeps track of most metrics in file statistics, and only NaN counter is actually tracked by writers.
+ * This wrapper ensures that metrics not being updated by those writers will not be incorrectly used, by throwing
+ * exceptions when they are accessed.
+ */
+public class DoubleFieldMetrics extends FieldMetrics<Double> {
+
+  private DoubleFieldMetrics(int id, long nanValueCount, Double lowerBound, Double upperBound) {
+    super(id, 0L, 0L, nanValueCount, lowerBound, upperBound);
+  }
+
+  @Override
+  public long valueCount() {
+    throw new IllegalStateException("Shouldn't access this method, as this metric is tracked in file statistics. ");

Review comment:
       We typically prefer `UnsupportedOperationException` for cases like this rather than `IllegalStateException`, which is described in javadoc as an exception that "Signals that a method has been invoked at an illegal or inappropriate time". While you could argue that any time is an "inappropriate time", I think unsupported describes the situation better.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#discussion_r655731203



##########
File path: core/src/main/java/org/apache/iceberg/DoubleFieldMetrics.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+/**
+ * Iceberg internally tracked field level metrics, used by Parquet and ORC writers only.
+ * <p>
+ * Parquet/ORC keeps track of most metrics in file statistics, and only NaN counter is actually tracked by writers.
+ * This wrapper ensures that metrics not being updated by those writers will not be incorrectly used, by throwing
+ * exceptions when they are accessed.
+ */
+public class DoubleFieldMetrics extends FieldMetrics<Double> {
+
+  private DoubleFieldMetrics(int id, long nanValueCount, Double lowerBound, Double upperBound) {
+    super(id, 0L, 0L, nanValueCount, lowerBound, upperBound);
+  }
+
+  @Override
+  public long valueCount() {
+    throw new IllegalStateException("Shouldn't access this method, as this metric is tracked in file statistics. ");
+  }
+
+  @Override
+  public long nullValueCount() {
+    throw new IllegalStateException("Shouldn't access this method, as this metric is tracked in file statistics. ");
+  }
+
+  public static class Builder {
+    private final int id;
+    private long nanValueCount = 0;
+    private Double lowerBound = null;
+    private Double upperBound = null;
+
+    public Builder(int id) {
+      this.id = id;
+    }
+
+    public void addValue(double value) {
+      if (Double.isNaN(value)) {
+        this.nanValueCount++;
+      } else {
+        if (lowerBound == null || Double.compare(value, lowerBound) < 0) {

Review comment:
       I mentioned this on the Avro PR, but I think it would be better to make `lowerBound` and `upperBound` primitives and get rid of the null checks here. Then you can use `valueCount - nullValueCount > 0` to determine whether to return the bounds or null.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#discussion_r658339824



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -147,9 +158,39 @@ public static Metrics footerMetrics(ParquetMetadata metadata, Stream<FieldMetric
       upperBounds.remove(fieldId);
     }
 
+    updateFromFieldMetrics(fieldMetricsMap, metricsConfig, fileSchema, lowerBounds, upperBounds);
+
     return new Metrics(rowCount, columnSizes, valueCounts, nullValueCounts,
-        MetricsUtil.createNanValueCounts(fieldMetrics, metricsConfig, fileSchema),
-        toBufferMap(fileSchema, lowerBounds), toBufferMap(fileSchema, upperBounds));
+        MetricsUtil.createNanValueCounts(fieldMetricsMap.values().stream(), metricsConfig, fileSchema),
+        toBufferMap(fileSchema, lowerBounds),
+        toBufferMap(fileSchema, upperBounds));
+  }
+
+  private static void updateFromFieldMetrics(
+      Map<Integer, FieldMetrics> idToFieldMetricsMap, MetricsConfig metricsConfig, Schema schema,
+      Map<Integer, Literal<?>> lowerBounds, Map<Integer, Literal<?>> upperBounds) {
+    idToFieldMetricsMap.entrySet().forEach(entry -> {
+      int fieldId = entry.getKey();
+      FieldMetrics metrics = entry.getValue();
+      MetricsMode metricsMode = MetricsUtil.metricsMode(schema, metricsConfig, fieldId);
+
+      // only check for MetricsModes.None, since we don't truncate float/double values.
+      if (metricsMode != MetricsModes.None.get()) {
+        if (metrics.upperBound() == null) {
+          // upper and lower bounds will both null or neither
+          lowerBounds.remove(fieldId);
+          upperBounds.remove(fieldId);
+        } else if (metrics.upperBound() instanceof Float) {

Review comment:
       Oh, I see. Good point. Let's go with this for now. In the future, we could keep the value class in the metrics, so you could do something like this:
   
   ```java
     if (metrics.javaType() instanceof Float) { ... }
   ```
   
   But that's probably not worth it at this point.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#discussion_r658252714



##########
File path: core/src/main/java/org/apache/iceberg/FieldMetrics.java
##########
@@ -78,14 +76,21 @@ public long nanValueCount() {
   /**
    * Returns the lower bound value of this field.
    */
-  public ByteBuffer lowerBound() {
+  public T lowerBound() {
     return lowerBound;
   }
 
   /**
    * Returns the upper bound value of this field.
    */
-  public ByteBuffer upperBound() {
+  public T upperBound() {
     return upperBound;
   }
+
+  /**
+   * Returns if the metrics has bounds (i.e. there is at least non-null value for this field)
+   */
+  public boolean hasBounds() {
+    return upperBound != null;

Review comment:
       I think this should be `valueCount - nullValueCount - nanValueCount > 0`. Otherwise, using `DoubleFieldMetrics.Builder` will always produce bounds, even though it could be only NaN values.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2464: Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2464:
URL: https://github.com/apache/iceberg/pull/2464#discussion_r654724611



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -99,6 +104,10 @@ public static Metrics footerMetrics(ParquetMetadata metadata, Stream<FieldMetric
     MessageType parquetTypeWithIds = getParquetTypeWithIds(metadata, nameMapping);
     Schema fileSchema = ParquetSchemaUtil.convertAndPrune(parquetTypeWithIds);
 
+    Map<Integer, FieldMetrics> fieldMetricsMap = Optional.ofNullable(fieldMetrics)
+        .map(stream -> stream.collect(Collectors.toMap(FieldMetrics::id, Function.identity())))
+        .orElseGet(HashMap::new);

Review comment:
       I was trying to be defensive in handling null stream here which is why I had this Optional wrapper and `orElse`, but you are right the current implementation wouldn't produce a null stream. I'll add a precondition check for null stream and then directly use the stream to create the map. 

##########
File path: core/src/main/java/org/apache/iceberg/FloatFieldMetrics.java
##########
@@ -19,25 +19,17 @@
 
 package org.apache.iceberg;
 
-import java.nio.ByteBuffer;
-
 /**
  * Iceberg internally tracked field level metrics, used by Parquet and ORC writers only.
  * <p>
  * Parquet/ORC keeps track of most metrics in file statistics, and only NaN counter is actually tracked by writers.
  * This wrapper ensures that metrics not being updated by those writers will not be incorrectly used, by throwing
  * exceptions when they are accessed.
  */
-public class FloatFieldMetrics extends FieldMetrics {
-
-  /**
-   * Constructor for creating a FieldMetrics with only NaN counter.
-   * @param id field id being tracked by the writer
-   * @param nanValueCount number of NaN values, will only be non-0 for double or float field.
-   */
-  public FloatFieldMetrics(int id,
-                           long nanValueCount) {
-    super(id, 0L, 0L, nanValueCount, null, null);
+public class FloatFieldMetrics extends FieldMetrics<Number> {

Review comment:
       Sounds good! I was hesitant to do that since I think there are too many duplicated code, I guess I was trying too hard to eliminate duplications... 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org