You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/20 20:36:39 UTC

[GitHub] [iceberg] sudssf opened a new pull request #1221: ISSUE-1220: add option to disable manifest reading during estimateSta…

sudssf opened a new pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221


   …tistics call
   
   https://github.com/apache/iceberg/issues/1220
   this commits add support for table property "read.spark.disable-estimate-statistics"
   which will return default value of table statistics and avoid reading manifest and data files
   to get table size and row count.
   this feature is useful for spark2 as spark2 does not perform predicate pushdown before calling
   estimateStatistics which ends up scanning entire table.
   this feature may be useful for spark3 if table manifest list is large even after predicate pushdown.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r458238324



##########
File path: site/docs/configuration.md
##########
@@ -109,14 +110,14 @@ spark.read
     .table("catalog.db.table")
 ```
 
-| Spark option    | Default               | Description                                                                               |
-| --------------- | --------------------- | ----------------------------------------------------------------------------------------- |
-| snapshot-id     | (latest)              | Snapshot ID of the table snapshot to read                                                 |
-| as-of-timestamp | (latest)              | A timestamp in milliseconds; the snapshot used will be the snapshot current at this time. |
-| split-size      | As per table property | Overrides this table's read.split.target-size and read.split.metadata-target-size         |
-| lookback        | As per table property | Overrides this table's read.split.planning-lookback                                       |
-| file-open-cost  | As per table property | Overrides this table's read.split.open-file-cost                                          |
-
+| Spark option               | Default               | Description                                                                               |
+| -------------------------- | --------------------- | ----------------------------------------------------------------------------------------- |
+| snapshot-id                | (latest)              | Snapshot ID of the table snapshot to read                                                 |
+| as-of-timestamp            | (latest)              | A timestamp in milliseconds; the snapshot used will be the snapshot current at this time. |
+| split-size                 | As per table property | Overrides this table's read.split.target-size and read.split.metadata-target-size         |
+| lookback                   | As per table property | Overrides this table's read.split.planning-lookback                                       |
+| file-open-cost             | As per table property | Overrides this table's read.split.open-file-cost                                          |
+| use-approximate-statistics | As per table property | Overrides this table's read.spark.read.spark.use-approximate-statistics                   |

Review comment:
       I don't think we need a table option for this. If we were going to return incorrect stats, then I would want a flag to enable or disable it. But because we are going to use table-level stats, we can detect when to do it based on whether or not there are filters. No filter, then use table level stats.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r458241058



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/ReaderUtils.java
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.source;
+
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * common reader code between spark2 and spark3 module
+ */
+public class ReaderUtils {
+  private ReaderUtils() {
+
+  }
+  /**
+   * guess table size based on table schema and total records.
+   *
+   * @param table iceberg table
+   * @param totalRecords total records in the table
+   * @return approxiate size based on table schema
+   */
+
+  public static long approximateTableSize(Table table, long totalRecords) {
+    if (totalRecords == Long.MAX_VALUE) {
+      return totalRecords;
+    }
+    StructType type = SparkSchemaUtil.convert(table.schema());
+    long approximateSize = 0;
+    for (StructField sparkField : type.fields()) {
+      approximateSize += sparkField.dataType().defaultSize();
+    }
+    return approximateSize * totalRecords;
+  }
+
+  /**
+   * get total records from table metadata using {@link SnapshotSummary.TOTAL_RECORDS_PROP}.
+   *
+   * @param table iceberg table
+   * @return total records from table metadata
+   */
+
+  public static long totalRecordsFromMetadata(Table table) {

Review comment:
       I think it would make more sense to pass in a `Snapshot` because the caller may be reading a specific snapshot.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r459166097



##########
File path: spark2/src/main/java/org/apache/iceberg/spark/source/Reader.java
##########
@@ -174,7 +175,7 @@ private Schema lazySchema() {
   }
 
   private Expression filterExpression() {
-    if (filterExpressions != null) {
+    if (filterExpressions != null ) {

Review comment:
       Looks like this is a typo.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r458240257



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/ReaderUtils.java
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.source;
+
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * common reader code between spark2 and spark3 module
+ */
+public class ReaderUtils {
+  private ReaderUtils() {
+
+  }
+  /**
+   * guess table size based on table schema and total records.
+   *
+   * @param table iceberg table
+   * @param totalRecords total records in the table
+   * @return approxiate size based on table schema
+   */
+
+  public static long approximateTableSize(Table table, long totalRecords) {

Review comment:
       I think this should pass in a `StructType` or `Schema` rather than the table. The table isn't really used.
   
   It would also make sense to pass the value of `lazyType()` from the reader because that will account for column projection.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] sudssf commented on pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

sudssf commented on pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#issuecomment-665818165


   thanks for opportunity and quick turn around !


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] sudssf commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

sudssf commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r458266469



##########
File path: site/docs/configuration.md
##########
@@ -109,14 +110,14 @@ spark.read
     .table("catalog.db.table")
 ```
 
-| Spark option    | Default               | Description                                                                               |
-| --------------- | --------------------- | ----------------------------------------------------------------------------------------- |
-| snapshot-id     | (latest)              | Snapshot ID of the table snapshot to read                                                 |
-| as-of-timestamp | (latest)              | A timestamp in milliseconds; the snapshot used will be the snapshot current at this time. |
-| split-size      | As per table property | Overrides this table's read.split.target-size and read.split.metadata-target-size         |
-| lookback        | As per table property | Overrides this table's read.split.planning-lookback                                       |
-| file-open-cost  | As per table property | Overrides this table's read.split.open-file-cost                                          |
-
+| Spark option               | Default               | Description                                                                               |
+| -------------------------- | --------------------- | ----------------------------------------------------------------------------------------- |
+| snapshot-id                | (latest)              | Snapshot ID of the table snapshot to read                                                 |
+| as-of-timestamp            | (latest)              | A timestamp in milliseconds; the snapshot used will be the snapshot current at this time. |
+| split-size                 | As per table property | Overrides this table's read.split.target-size and read.split.metadata-target-size         |
+| lookback                   | As per table property | Overrides this table's read.split.planning-lookback                                       |
+| file-open-cost             | As per table property | Overrides this table's read.split.open-file-cost                                          |
+| use-approximate-statistics | As per table property | Overrides this table's read.spark.read.spark.use-approximate-statistics                   |

Review comment:
       what if manifest size is still large after filters are applied?  user may want to disable scanning entire manifest if not trying to broadcast table




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r459167500



##########
File path: spark2/src/test/java/org/apache/iceberg/spark/source/TestSparkSchema24.java
##########
@@ -19,5 +19,43 @@
 
 package org.apache.iceberg.spark.source;
 
+import java.io.IOException;
+import java.util.List;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.hadoop.HadoopTables;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.iceberg.util.PropertyUtil;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.junit.Assert;
+import org.junit.Test;
+
 public class TestSparkSchema24 extends TestSparkSchema {
+
+  @Test
+  public void testReaderUtils() throws IOException {
+    String tableLocation = temp.newFolder("iceberg-table").toString();
+    HadoopTables tables = new HadoopTables(CONF);
+    PartitionSpec spec = PartitionSpec.unpartitioned();
+    tables.create(SCHEMA, spec, null, tableLocation);
+    List<SimpleRecord> expectedRecords = Lists.newArrayList(
+        new SimpleRecord(1, "a")
+    );
+    Dataset<Row> originalDf = spark.createDataFrame(expectedRecords, SimpleRecord.class);
+    originalDf.select("id", "data").write()
+        .format("iceberg")
+        .mode("append")
+        .save(tableLocation);
+
+    Table table = tables.load(tableLocation);
+    long totalRecords = PropertyUtil.propertyAsLong(table.currentSnapshot().summary(),
+              SnapshotSummary.TOTAL_RECORDS_PROP, Long.MAX_VALUE);
+    Assert.assertEquals("totalRecords match", 1, totalRecords);
+    long tableSize = ReaderUtils.approximateTableSize(SparkSchemaUtil.convert(table.schema()), totalRecords);
+    Assert.assertEquals("table size matches with expected approximation", 24, tableSize);

Review comment:
       I don't think the method being tested warrants a test that creates a table, writes data, etc. The method has 2 cases:
   
   * `numRows` is `Long.MAX_VALUE` -> return `Long.MAX_VALUE`
   * `numRows` is not -> multiply `numRows` by `sizeEstimate(dataType)`
   
   All you need is 2 test cases: one for each possibility for `numRows`. The second case should make sure the estimate is sane.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1221: ISSUE-1220: add option to disable manifest reading during estimateSta…

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r457711539



##########
File path: spark2/src/main/java/org/apache/iceberg/spark/source/Reader.java
##########
@@ -276,6 +280,9 @@ public void pruneColumns(StructType newRequestedSchema) {
 
   @Override
   public Statistics estimateStatistics() {
+    if(disableEstimateStatistics) {
+      return new Stats(Long.MAX_VALUE, Long.MAX_VALUE);
+    }

Review comment:
       I just had an idea for an alternative solution to this. What about detecting that there are no filters and instead returning a value based on the [`total-records`](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/SnapshotSummary.java#L37) value in snapshot metadata?
   
   Usually, estimating stats based on the number of rows and a guess for the size of a row is much better than using the actual size anyway. So if you can get the number of rows and come up with an estimate for the size of each row based on the table schema, then you wouldn't need to disable stats at all.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r459168106



##########
File path: spark2/src/main/java/org/apache/iceberg/spark/source/Reader.java
##########
@@ -276,6 +277,13 @@ public void pruneColumns(StructType newRequestedSchema) {
 
   @Override
   public Statistics estimateStatistics() {
+    if (filterExpressions == null || filterExpressions == Expressions.alwaysTrue()) {
+      LOG.debug("using table metadata to estimate table statistics");

Review comment:
       Minor: I don't see much value in having this debug statement.
   
   If you were trying to debug `estimateStatistics`, I think you'd want to have a message for every call, and you would want it to show the expression as well as the resulting estimate. I think Spark already has those logs, though.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] sudssf commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

sudssf commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r458455773



##########
File path: site/docs/configuration.md
##########
@@ -109,14 +110,14 @@ spark.read
     .table("catalog.db.table")
 ```
 
-| Spark option    | Default               | Description                                                                               |
-| --------------- | --------------------- | ----------------------------------------------------------------------------------------- |
-| snapshot-id     | (latest)              | Snapshot ID of the table snapshot to read                                                 |
-| as-of-timestamp | (latest)              | A timestamp in milliseconds; the snapshot used will be the snapshot current at this time. |
-| split-size      | As per table property | Overrides this table's read.split.target-size and read.split.metadata-target-size         |
-| lookback        | As per table property | Overrides this table's read.split.planning-lookback                                       |
-| file-open-cost  | As per table property | Overrides this table's read.split.open-file-cost                                          |
-
+| Spark option               | Default               | Description                                                                               |
+| -------------------------- | --------------------- | ----------------------------------------------------------------------------------------- |
+| snapshot-id                | (latest)              | Snapshot ID of the table snapshot to read                                                 |
+| as-of-timestamp            | (latest)              | A timestamp in milliseconds; the snapshot used will be the snapshot current at this time. |
+| split-size                 | As per table property | Overrides this table's read.split.target-size and read.split.metadata-target-size         |
+| lookback                   | As per table property | Overrides this table's read.split.planning-lookback                                       |
+| file-open-cost             | As per table property | Overrides this table's read.split.open-file-cost                                          |
+| use-approximate-statistics | As per table property | Overrides this table's read.spark.read.spark.use-approximate-statistics                   |

Review comment:
       we can discuss this later , I will cleanup PR to remove parameter and always look at filter to decide `estimateStatistics` behaviour. thanks 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] sudssf commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

sudssf commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r458455594



##########
File path: site/docs/configuration.md
##########
@@ -109,14 +110,14 @@ spark.read
     .table("catalog.db.table")
 ```
 
-| Spark option    | Default               | Description                                                                               |
-| --------------- | --------------------- | ----------------------------------------------------------------------------------------- |
-| snapshot-id     | (latest)              | Snapshot ID of the table snapshot to read                                                 |
-| as-of-timestamp | (latest)              | A timestamp in milliseconds; the snapshot used will be the snapshot current at this time. |
-| split-size      | As per table property | Overrides this table's read.split.target-size and read.split.metadata-target-size         |
-| lookback        | As per table property | Overrides this table's read.split.planning-lookback                                       |
-| file-open-cost  | As per table property | Overrides this table's read.split.open-file-cost                                          |
-
+| Spark option               | Default               | Description                                                                               |
+| -------------------------- | --------------------- | ----------------------------------------------------------------------------------------- |
+| snapshot-id                | (latest)              | Snapshot ID of the table snapshot to read                                                 |
+| as-of-timestamp            | (latest)              | A timestamp in milliseconds; the snapshot used will be the snapshot current at this time. |
+| split-size                 | As per table property | Overrides this table's read.split.target-size and read.split.metadata-target-size         |
+| lookback                   | As per table property | Overrides this table's read.split.planning-lookback                                       |
+| file-open-cost             | As per table property | Overrides this table's read.split.open-file-cost                                          |
+| use-approximate-statistics | As per table property | Overrides this table's read.spark.read.spark.use-approximate-statistics                   |

Review comment:
       thanks for reply. to clarify I usecase is while using spark3 we still want to disable estimating statistics by reading manifest ( assuming client code is reading large portion of the table) 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r458238571



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/ReaderUtils.java
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.source;
+
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * common reader code between spark2 and spark3 module
+ */
+public class ReaderUtils {
+  private ReaderUtils() {
+

Review comment:
       Nit: no need for a newline here.

##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/ReaderUtils.java
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.source;
+
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * common reader code between spark2 and spark3 module
+ */
+public class ReaderUtils {
+  private ReaderUtils() {
+
+  }
+  /**

Review comment:
       Nit: newlines should be added between methods.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r459169312



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/ReaderUtils.java
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.source;
+
+import org.apache.iceberg.Snapshot;
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * common reader code between spark2 and spark3 module
+ */
+public class ReaderUtils {
+  private ReaderUtils() {}
+
+  /**
+   * guess table size based on table schema and total records.
+   *
+   * @param table        iceberg table
+   * @param totalRecords total records in the table
+   * @return approxiate size based on table schema
+   */
+  public static long approximateTableSize(StructType tableSchema, long totalRecords) {

Review comment:
       Could you move this method to `SparkSchemaUtil.estimateSize`? I think it makes sense to put it in the schema utility methods, since it relates to a Spark schema. Looks like we don't have a test suite for that, so you could start `TestSparkSchemaUtil` with the tests for this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#issuecomment-665235680


   Merged. Thanks for fixing this, @sudssf!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue merged pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

rdblue merged pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r459166483



##########
File path: spark/src/test/java/org/apache/iceberg/spark/source/TestSparkSchema.java
##########
@@ -46,12 +46,12 @@
 
 public abstract class TestSparkSchema {
 
-  private static final Configuration CONF = new Configuration();
-  private static final Schema SCHEMA = new Schema(
+  protected static final Configuration CONF = new Configuration();
+  protected static final Schema SCHEMA = new Schema(
       optional(1, "id", Types.IntegerType.get()),
       optional(2, "data", Types.StringType.get())
   );
-  private static SparkSession spark = null;
+  protected static SparkSession spark = null;

Review comment:
       This doesn't need to change since the tests should be in the `spark` module instead of in a subclass of `TestSparkSchema`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r458239096



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/ReaderUtils.java
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.source;
+
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * common reader code between spark2 and spark3 module
+ */
+public class ReaderUtils {
+  private ReaderUtils() {
+
+  }
+  /**
+   * guess table size based on table schema and total records.
+   *
+   * @param table iceberg table
+   * @param totalRecords total records in the table
+   * @return approxiate size based on table schema
+   */
+
+  public static long approximateTableSize(Table table, long totalRecords) {
+    if (totalRecords == Long.MAX_VALUE) {
+      return totalRecords;
+    }
+    StructType type = SparkSchemaUtil.convert(table.schema());

Review comment:
       Please add newlines after control flow, like `if`/`else` or loops.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r458238870



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/ReaderUtils.java
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.source;
+
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * common reader code between spark2 and spark3 module
+ */
+public class ReaderUtils {
+  private ReaderUtils() {
+
+  }
+  /**
+   * guess table size based on table schema and total records.
+   *
+   * @param table iceberg table
+   * @param totalRecords total records in the table
+   * @return approxiate size based on table schema
+   */
+

Review comment:
       And no need for a newline between Javadoc and the documented method.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r458241058



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/ReaderUtils.java
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.source;
+
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * common reader code between spark2 and spark3 module
+ */
+public class ReaderUtils {
+  private ReaderUtils() {
+
+  }
+  /**
+   * guess table size based on table schema and total records.
+   *
+   * @param table iceberg table
+   * @param totalRecords total records in the table
+   * @return approxiate size based on table schema
+   */
+
+  public static long approximateTableSize(Table table, long totalRecords) {
+    if (totalRecords == Long.MAX_VALUE) {
+      return totalRecords;
+    }
+    StructType type = SparkSchemaUtil.convert(table.schema());
+    long approximateSize = 0;
+    for (StructField sparkField : type.fields()) {
+      approximateSize += sparkField.dataType().defaultSize();
+    }
+    return approximateSize * totalRecords;
+  }
+
+  /**
+   * get total records from table metadata using {@link SnapshotSummary.TOTAL_RECORDS_PROP}.
+   *
+   * @param table iceberg table
+   * @return total records from table metadata
+   */
+
+  public static long totalRecordsFromMetadata(Table table) {

Review comment:
       I think it would be easier to use `PropertyUtil.propertyAsLong` instead of adding this method.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r458240257



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/ReaderUtils.java
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.source;
+
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * common reader code between spark2 and spark3 module
+ */
+public class ReaderUtils {
+  private ReaderUtils() {
+
+  }
+  /**
+   * guess table size based on table schema and total records.
+   *
+   * @param table iceberg table
+   * @param totalRecords total records in the table
+   * @return approxiate size based on table schema
+   */
+
+  public static long approximateTableSize(Table table, long totalRecords) {

Review comment:
       I think this should pass in a `StructType` or `Schema` rather than the table. The table isn't really used.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r458276824



##########
File path: site/docs/configuration.md
##########
@@ -109,14 +110,14 @@ spark.read
     .table("catalog.db.table")
 ```
 
-| Spark option    | Default               | Description                                                                               |
-| --------------- | --------------------- | ----------------------------------------------------------------------------------------- |
-| snapshot-id     | (latest)              | Snapshot ID of the table snapshot to read                                                 |
-| as-of-timestamp | (latest)              | A timestamp in milliseconds; the snapshot used will be the snapshot current at this time. |
-| split-size      | As per table property | Overrides this table's read.split.target-size and read.split.metadata-target-size         |
-| lookback        | As per table property | Overrides this table's read.split.planning-lookback                                       |
-| file-open-cost  | As per table property | Overrides this table's read.split.open-file-cost                                          |
-
+| Spark option               | Default               | Description                                                                               |
+| -------------------------- | --------------------- | ----------------------------------------------------------------------------------------- |
+| snapshot-id                | (latest)              | Snapshot ID of the table snapshot to read                                                 |
+| as-of-timestamp            | (latest)              | A timestamp in milliseconds; the snapshot used will be the snapshot current at this time. |
+| split-size                 | As per table property | Overrides this table's read.split.target-size and read.split.metadata-target-size         |
+| lookback                   | As per table property | Overrides this table's read.split.planning-lookback                                       |
+| file-open-cost             | As per table property | Overrides this table's read.split.open-file-cost                                          |
+| use-approximate-statistics | As per table property | Overrides this table's read.spark.read.spark.use-approximate-statistics                   |

Review comment:
       I'm not sure I understand the case you're talking about. What I'm saying is that because we are using reliable stats that are maintained in the table metadata, we can always use them if there are no filters. That takes care of the bad case in Spark 2.4. Since Spark 2.4 doesn't ever push filters before calling `estimateStatistics`, it will always use metadata stats and will avoid the issue entirely.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r459166317



##########
File path: spark2/src/test/java/org/apache/iceberg/spark/source/TestSparkSchema24.java
##########
@@ -19,5 +19,43 @@
 
 package org.apache.iceberg.spark.source;
 
+import java.io.IOException;
+import java.util.List;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.hadoop.HadoopTables;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.iceberg.util.PropertyUtil;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.junit.Assert;
+import org.junit.Test;
+
 public class TestSparkSchema24 extends TestSparkSchema {
+
+  @Test
+  public void testReaderUtils() throws IOException {

Review comment:
       I don't think this test needs to be here. Tests for utility classes should be in their own files, and preferably in the same module as the utility class.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] sudssf commented on a change in pull request #1221: ISSUE-1220: add option to disable manifest reading during estimateSta…

Posted by GitBox <gi...@apache.org>.

sudssf commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r457756127



##########
File path: spark2/src/main/java/org/apache/iceberg/spark/source/Reader.java
##########
@@ -276,6 +280,9 @@ public void pruneColumns(StructType newRequestedSchema) {
 
   @Override
   public Statistics estimateStatistics() {
+    if(disableEstimateStatistics) {
+      return new Stats(Long.MAX_VALUE, Long.MAX_VALUE);
+    }

Review comment:
       thats really good idea, I thought about this too but I was under impression that we had to look data file metadata to get this information.
   I try to update PR using  total-records




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org