Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/12/21 13:19:31 UTC

[GitHub] [iceberg] hililiwei opened a new pull request #3784: ORC:ORC supports rolling writers.

hililiwei opened a new pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784


   close https://github.com/apache/iceberg/issues/3169
   
   Based on https://github.com/apache/iceberg/pull/3250 and https://github.com/apache/iceberg/pull/3497, I double-checked the code and it looks like ORC already supports rolling writes. If my understanding is wrong, please give some guidance. Thank you.
   
   




[GitHub] [iceberg] openinx commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1053852862


   I would prefer to enable the following unit tests before we do any further reviewing & porting: 
   
   ```bash
   ➜  iceberg git:(master) find . -type f -name '*.java'  | xargs grep -i 'TODO.*orc'
   ./core/src/test/java/org/apache/iceberg/TestMetrics.java:      // TODO: The special condition for ORC can be removed when ORC-342 is fixed
   ./core/src/main/java/org/apache/iceberg/io/BaseTaskWriter.java:      // TODO: ORC file now not support target file size before closed
   ./core/src/main/java/org/apache/iceberg/io/ClusteredEqualityDeleteWriter.java:    // TODO: support ORC rolling writers
   ./core/src/main/java/org/apache/iceberg/io/ClusteredDataWriter.java:    // TODO: support ORC rolling writers
   ./core/src/main/java/org/apache/iceberg/io/ClusteredPositionDeleteWriter.java:    // TODO: support ORC rolling writers
   ./core/src/main/java/org/apache/iceberg/io/FanoutDataWriter.java:    // TODO: support ORC rolling writers
   ./flink/v1.12/flink/src/test/java/org/apache/iceberg/flink/sink/TestTaskWriters.java:    // TODO ORC don't support target file size before closed.
   ./flink/v1.12/flink/src/test/java/org/apache/iceberg/flink/sink/TestIcebergStreamWriter.java:    // TODO: ORC file does not support target file size before closed.
   ./flink/v1.14/flink/src/test/java/org/apache/iceberg/flink/sink/TestTaskWriters.java:    // TODO ORC don't support target file size before closed.
   ./flink/v1.14/flink/src/test/java/org/apache/iceberg/flink/sink/TestIcebergStreamWriter.java:    // TODO: ORC file does not support target file size before closed.
   ./flink/v1.13/flink/src/test/java/org/apache/iceberg/flink/sink/TestTaskWriters.java:    // TODO ORC don't support target file size before closed.
   ./flink/v1.13/flink/src/test/java/org/apache/iceberg/flink/sink/TestIcebergStreamWriter.java:    // TODO: ORC file does not support target file size before closed.
   ./data/src/test/java/org/apache/iceberg/io/TestRollingFileWriters.java:  // TODO: add ORC once we support ORC rolling file writers
   ./data/src/test/java/org/apache/iceberg/data/TestLocalScan.java:    // TODO: Add multiple timestamp tests - there's an issue with ORC caching TZ in ThreadLocal, so it's not possible
   ./spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:  @Ignore // TODO Classpath issues prevent us from actually writing to a Spark ORC table
   ./spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:  @Ignore  // TODO Classpath issues prevent us from actually writing to a Spark ORC table
   ./spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDelete.java:  // TODO: multiple stripes for ORC
   ./spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkDataWrite.java:    // TODO: ORC file now not support target file size
   ./spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkDataWrite.java:    // TODO: ORC file now not support target file size
   ./spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkMetadataColumns.java:    // TODO: support metadata structs in vectorized ORC reads
   ./spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java:      // TODO: support ORC rolling writers
   ./spark/v2.4/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkDataWrite.java:    // TODO: ORC file now not support target file size
   ./spark/v2.4/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkDataWrite.java:    // TODO: ORC file now not support target file size
   ./spark/v3.0/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:  @Ignore // TODO Classpath issues prevent us from actually writing to a Spark ORC table
   ./spark/v3.0/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:  @Ignore  // TODO Classpath issues prevent us from actually writing to a Spark ORC table
   ./spark/v3.0/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDelete.java:  // TODO: multiple stripes for ORC
   ./spark/v3.0/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkDataWrite.java:    // TODO: ORC file now not support target file size
   ./spark/v3.0/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkDataWrite.java:    // TODO: ORC file now not support target file size
   ./spark/v3.0/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkMetadataColumns.java:    // TODO: support metadata structs in vectorized ORC reads
   ./spark/v3.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java:      // TODO: support ORC rolling writers
   ./spark/v3.1/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:  @Ignore // TODO Classpath issues prevent us from actually writing to a Spark ORC table
   ./spark/v3.1/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:  @Ignore  // TODO Classpath issues prevent us from actually writing to a Spark ORC table
   ./spark/v3.1/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDelete.java:  // TODO: multiple stripes for ORC
   ./spark/v3.1/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkDataWrite.java:    // TODO: ORC file now not support target file size
   ./spark/v3.1/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkDataWrite.java:    // TODO: ORC file now not support target file size
   ./spark/v3.1/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkMetadataColumns.java:    // TODO: support metadata structs in vectorized ORC reads
   ./spark/v3.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java:      // TODO: support ORC rolling writers
   ```




[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r833879114



##########
File path: orc/src/test/java/org/apache/iceberg/orc/TestEstimateOrcAvgWidthVisitor.java
##########
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.TypeDescription;
+import org.junit.Assert;
+import org.junit.Test;
+
+import static org.apache.iceberg.types.Types.NestedField.optional;
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+public class TestEstimateOrcAvgWidthVisitor {
+
+  // all supported fields
+  protected static final Types.NestedField ID_FIELD = required(1, "id", Types.IntegerType.get());
+  protected static final Types.NestedField DATA_FIELD = optional(2, "data", Types.StringType.get());
+  protected static final Types.NestedField FLOAT_FIELD = required(3, "float", Types.FloatType.get());
+  protected static final Types.NestedField DOUBLE_FIELD = optional(4, "double", Types.DoubleType.get());
+  protected static final Types.NestedField DECIMAL_FIELD = optional(5, "decimal", Types.DecimalType.of(5, 3));
+  protected static final Types.NestedField FIXED_FIELD = optional(7, "fixed", Types.FixedType.ofLength(4));
+  protected static final Types.NestedField BINARY_FIELD = optional(8, "binary", Types.BinaryType.get());
+  protected static final Types.NestedField FLOAT_LIST_FIELD = optional(9, "floatList",
+      Types.ListType.ofRequired(10, Types.FloatType.get()));
+  protected static final Types.NestedField LONG_FIELD = optional(11, "long", Types.LongType.get());
+  protected static final Types.NestedField BOOLEAN_FIELD = optional(12, "boolean", Types.BooleanType.get());
+  protected static final Types.NestedField TIMESTAMP_ZONE_FIELD = optional(13, "timestampZone",
+      Types.TimestampType.withZone());
+  protected static final Types.NestedField TIMESTAMP_FIELD = optional(14, "timestamp",
+      Types.TimestampType.withoutZone());
+  protected static final Types.NestedField DATE_FIELD = optional(15, "date", Types.DateType.get());
+  protected static final Types.NestedField UUID_FIELD = required(16, "uuid", Types.UUIDType.get());
+
+  protected static final Types.NestedField MAP_FIELD_1 = optional(17, "map1",
+      Types.MapType.ofOptional(18, 19, Types.FloatType.get(), Types.StringType.get())
+  );
+  protected static final Types.NestedField MAP_FIELD_2 = optional(20, "map2",
+      Types.MapType.ofOptional(21, 22, Types.IntegerType.get(), Types.DoubleType.get())
+  );
+  protected static final Types.NestedField STRUCT_FIELD = optional(23, "struct", Types.StructType.of(
+      required(24, "booleanField", Types.BooleanType.get()),
+      optional(25, "date", Types.DateType.get()),
+      optional(27, "timestamp", Types.TimestampType.withZone())
+  ));
+
+  @Test
+  public void testEstimateDataAveWidth() {

Review comment:
       Besides, I recommend separating the large test case into smaller test cases (methods), so that we can easily find the failing case if anything is incorrect.
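   
   For example, a per-type test method might look roughly like this (just a sketch reusing the constants and the `getEstimateLength` helper already in this test class):
   
   ```java
   @Test
   public void testEstimateIntegerAvgWidth() {
     Schema integerSchema = new Schema(ID_FIELD);
     TypeDescription orcIntegerSchema = ORCSchemaUtil.convert(integerSchema);
     long estimateLength = getEstimateLength(orcIntegerSchema);
     Assert.assertEquals("Estimated average length of integer must be 8.", 8, estimateLength);
   }
   ```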






[GitHub] [iceberg] hililiwei commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r833949422



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -66,6 +71,11 @@
     this.metricsConfig = metricsConfig;
 
     TypeDescription orcSchema = ORCSchemaUtil.convert(schema);
+
+    this.avgRowByteSize =
+        OrcSchemaVisitor.visitSchema(orcSchema, new EstimateOrcAvgWidthVisitor()).stream().reduce(Integer::sum)
+            .orElse(0);

Review comment:
       I initially set it to 1, but as long as the schema has a field, it won't be 0. Setting it to 1 might mask some exceptions. When the value is 0, we can log a WARN instead.
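   
   Roughly what I have in mind (the `LOG` field name and the exact message wording here are only illustrative):
   
   ```java
   this.avgRowByteSize =
       OrcSchemaVisitor.visitSchema(orcSchema, new EstimateOrcAvgWidthVisitor()).stream()
           .reduce(Integer::sum)
           .orElse(0);
   if (avgRowByteSize == 0) {
     // an estimate of 0 means rows buffered in the batch are ignored by length()
     LOG.warn("Estimated average row byte size is 0, buffered rows will not be counted in length()");
   }
   ```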






[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r830941435



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +110,27 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter == null) {
+      throw new RuntimeException("Can't get the length!");
+    }
+    long estimateMemory = this.treeWriter.estimateMemory();

Review comment:
       In Iceberg, we usually don't use `this.` to read a private member; we reserve `this.` for distinguishing the private member from a local variable when assigning a value to it.
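   
   A minimal illustration of the convention (not actual Iceberg code):
   
   ```java
   class Example {
     private final long targetFileSize;
   
     Example(long targetFileSize) {
       // assignment: `this.` distinguishes the field from the constructor argument
       this.targetFileSize = targetFileSize;
     }
   
     long targetFileSize() {
       // read access: no `this.` prefix
       return targetFileSize;
     }
   }
   ```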






[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r830956833



##########
File path: orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java
##########
@@ -103,42 +104,57 @@ public TypeDescription type() {
   private ORCSchemaUtil() {
   }
 
-  public static TypeDescription convert(Schema schema) {
+  public static TypeDescription convertSchema2TypeDesc(Schema schema) {
+    return convertSchema2TypeDescWithLength(schema).first();
+  }
+
+  public static Pair<TypeDescription, Integer> convertSchema2TypeDescWithLength(Schema schema) {

Review comment:
       Besides, we also need unit tests to verify the correctness of this estimation approach.






[GitHub] [iceberg] rdblue commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r788259559



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +104,34 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter != null) {
+      if (batch.size > 0) {
+        try {
+          writer.addRowBatch(batch);
+          batch.reset();

Review comment:
       Deviating from the actual size is okay. While the file is open, this is an estimate only.






[GitHub] [iceberg] liubo1022126 commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
liubo1022126 commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r787484665



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +104,34 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter != null) {
+      if (batch.size > 0) {
+        try {
+          writer.addRowBatch(batch);
+          batch.reset();

Review comment:
       Because we check RollToNewFile when processing every record, if you flush the batch here, does that mean each piece of data will be flushed during processing?






[GitHub] [iceberg] hililiwei commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r787388761



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +104,34 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter != null) {
+      if (batch.size > 0) {
+        try {
+          writer.addRowBatch(batch);
+          batch.reset();

Review comment:
       I want to get the byte size of the `batch` here, but I haven't found a method for it yet. Because `length` does not include the size of the batch, some test cases fail.






[GitHub] [iceberg] openinx commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1075853937


   There are 3 failure cases in the Travis CI report: 
   
   ```
   org.apache.iceberg.flink.actions.TestRewriteDataFilesAction > testRewriteAvoidRepeateCompress[catalogName=testhive, baseNamespace=, format=ORC] FAILED
       java.lang.AssertionError: Action should add 1 data file expected:<1> but was:<2>
           at org.junit.Assert.fail(Assert.java:89)
           at org.junit.Assert.failNotEquals(Assert.java:835)
           at org.junit.Assert.assertEquals(Assert.java:647)
           at org.apache.iceberg.flink.actions.TestRewriteDataFilesAction.testRewriteAvoidRepeateCompress(TestRewriteDataFilesAction.java:367)
   
   org.apache.iceberg.flink.actions.TestRewriteDataFilesAction > testRewriteAvoidRepeateCompress[catalogName=testhadoop, baseNamespace=, format=ORC] FAILED
       java.lang.AssertionError: Action should add 1 data file expected:<1> but was:<2>
           at org.junit.Assert.fail(Assert.java:89)
           at org.junit.Assert.failNotEquals(Assert.java:835)
           at org.junit.Assert.assertEquals(Assert.java:647)
           at org.apache.iceberg.flink.actions.TestRewriteDataFilesAction.testRewriteAvoidRepeateCompress(TestRewriteDataFilesAction.java:367)
   
   org.apache.iceberg.flink.actions.TestRewriteDataFilesAction > testRewriteAvoidRepeateCompress[catalogName=testhadoop_basenamespace, baseNamespace=l0.l1, format=ORC] FAILED
       java.lang.AssertionError: Action should add 1 data file expected:<1> but was:<2>
           at org.junit.Assert.fail(Assert.java:89)
           at org.junit.Assert.failNotEquals(Assert.java:835)
           at org.junit.Assert.assertEquals(Assert.java:647)
           at org.apache.iceberg.flink.actions.TestRewriteDataFilesAction.testRewriteAvoidRepeateCompress(TestRewriteDataFilesAction.java:367)
   ```




[GitHub] [iceberg] openinx commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1053853573


   Other cases: 
   
   ```
   ➜  iceberg git:(master) find . -type f -name '*.java'  | xargs grep -i 'Assume.*ORC'
   ./mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java:    assumeTrue(isVectorized && FileFormat.ORC.equals(fileFormat));
   ./flink/v1.12/flink/src/test/java/org/apache/iceberg/flink/actions/TestRewriteDataFilesAction.java:    Assume.assumeFalse("ORC does not support getting length when file is opening", format.equals(FileFormat.ORC));
   ./flink/v1.14/flink/src/test/java/org/apache/iceberg/flink/actions/TestRewriteDataFilesAction.java:    Assume.assumeFalse("ORC does not support getting length when file is opening", format.equals(FileFormat.ORC));
   ./flink/v1.13/flink/src/test/java/org/apache/iceberg/flink/actions/TestRewriteDataFilesAction.java:    Assume.assumeFalse("ORC does not support getting length when file is opening", format.equals(FileFormat.ORC));
   ./hive3/src/main/java/org/apache/iceberg/mr/hive/vector/HiveVectorizedReader.java:    // reader will assume that the ORC file ends at the task's start + length, and might fail reading the tail..
   ./data/src/test/java/org/apache/iceberg/io/TestRollingFileWriters.java:    Assume.assumeFalse("ORC delete files are not supported", fileFormat == FileFormat.ORC);
   ./data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:    Assume.assumeFalse("ORC row group filter does not support StringStartsWith", format == FileFormat.ORC);
   ./data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java:    Assume.assumeFalse("ORC row group filter does not support StringStartsWith", format == FileFormat.ORC);
   ./spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkMetadataColumns.java:    Assume.assumeFalse(fileFormat == FileFormat.ORC && vectorized);
   ./spark/v3.0/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkMetadataColumns.java:    Assume.assumeFalse(fileFormat == FileFormat.ORC && vectorized);
   ./spark/v3.1/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkMetadataColumns.java:    Assume.assumeFalse(fileFormat == FileFormat.ORC && vectorized);
   ```




[GitHub] [iceberg] kbendick commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r833647509



##########
File path: build.gradle
##########
@@ -449,6 +449,7 @@ project(':iceberg-orc') {
   dependencies {
     implementation project(path: ':iceberg-bundled-guava', configuration: 'shadow')
     api project(':iceberg-api')
+    implementation project(':iceberg-common')

Review comment:
       Are we sure we need to add `iceberg-common`? Looking at the definition of `iceberg-common`, it appears to be the same as line 450's `iceberg-bundled-guava`, using shadow configuration.
   
   https://github.com/apache/iceberg/blob/718ff6ae514d7a16077c21e5281e765ba5fa120b/build.gradle#L199-L203
   
   I do see it duplicated multiple times though in other places.
   
   EDIT: Never mind. It does seem to need to be duplicated because of the relocation. I think declaring `iceberg-common` forces it to build the shaded / relocated bundled-guava artifact first. Sorry for the noise! I'm still more used to Bazel and other somewhat less common tools. 😅 






[GitHub] [iceberg] kbendick commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r833663292



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +112,28 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+
+    Preconditions.checkNotNull(treeWriter, "Can't get three writer!");

Review comment:
       Nit: Typo — `three writer` should be `tree writer` (unneeded h).
   
   Additionally, we typically format our exception messages as `{Problem} {Cause}`. And we try to make them readable for end users. It's unlikely that casual end users know what `tree writer` is. So something more along the lines of `Cannot estimate length of file being written as the ORC writer's internal writer is not present` might be more useful.
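   
   i.e. something along these lines (the exact wording is of course up to you):
   
   ```java
   Preconditions.checkNotNull(treeWriter,
       "Cannot estimate length of file being written as the ORC writer's internal writer is not present");
   ```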

##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +112,28 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+
+    Preconditions.checkNotNull(treeWriter, "Can't get three writer!");
+
+    long estimateMemory = treeWriter.estimateMemory();
+
+    long dataLength = 0;
+    try {
+      List<StripeInformation> stripes = writer.getStripes();
+      if (!stripes.isEmpty()) {
+        StripeInformation stripeInformation = stripes.get(stripes.size() - 1);
+        dataLength = stripeInformation != null ? stripeInformation.getOffset() + stripeInformation.getLength() : 0;
+      }
+    } catch (IOException e) {
+      throw new UncheckedIOException(String.format("Can't get strip's length from the file writer with path: %s.",

Review comment:
       Nit: Typo - `Stripe's` not `strip's`.

##########
File path: orc/src/test/java/org/apache/iceberg/orc/TestEstimateOrcAvgWidthVisitor.java
##########
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.TypeDescription;
+import org.junit.Assert;
+import org.junit.Test;
+
+import static org.apache.iceberg.types.Types.NestedField.optional;
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+public class TestEstimateOrcAvgWidthVisitor {
+
+  // all supported fields
+  protected static final Types.NestedField ID_FIELD = required(1, "id", Types.IntegerType.get());
+  protected static final Types.NestedField DATA_FIELD = optional(2, "data", Types.StringType.get());
+  protected static final Types.NestedField FLOAT_FIELD = required(3, "float", Types.FloatType.get());
+  protected static final Types.NestedField DOUBLE_FIELD = optional(4, "double", Types.DoubleType.get());
+  protected static final Types.NestedField DECIMAL_FIELD = optional(5, "decimal", Types.DecimalType.of(5, 3));
+  protected static final Types.NestedField FIXED_FIELD = optional(7, "fixed", Types.FixedType.ofLength(4));
+  protected static final Types.NestedField BINARY_FIELD = optional(8, "binary", Types.BinaryType.get());
+  protected static final Types.NestedField FLOAT_LIST_FIELD = optional(9, "floatList",
+      Types.ListType.ofRequired(10, Types.FloatType.get()));
+  protected static final Types.NestedField LONG_FIELD = optional(11, "long", Types.LongType.get());
+  protected static final Types.NestedField BOOLEAN_FIELD = optional(12, "boolean", Types.BooleanType.get());
+  protected static final Types.NestedField TIMESTAMP_ZONE_FIELD = optional(13, "timestampZone",
+      Types.TimestampType.withZone());
+  protected static final Types.NestedField TIMESTAMP_FIELD = optional(14, "timestamp",
+      Types.TimestampType.withoutZone());
+  protected static final Types.NestedField DATE_FIELD = optional(15, "date", Types.DateType.get());
+  protected static final Types.NestedField UUID_FIELD = required(16, "uuid", Types.UUIDType.get());
+
+  protected static final Types.NestedField MAP_FIELD_1 = optional(17, "map1",
+      Types.MapType.ofOptional(18, 19, Types.FloatType.get(), Types.StringType.get())
+  );
+  protected static final Types.NestedField MAP_FIELD_2 = optional(20, "map2",
+      Types.MapType.ofOptional(21, 22, Types.IntegerType.get(), Types.DoubleType.get())
+  );
+  protected static final Types.NestedField STRUCT_FIELD = optional(23, "struct", Types.StructType.of(
+      required(24, "booleanField", Types.BooleanType.get()),
+      optional(25, "date", Types.DateType.get()),
+      optional(27, "timestamp", Types.TimestampType.withZone())
+  ));
+
+  @Test
+  public void testEstimateDataAveWidth() {
+    // create a schema with integer field
+    Schema schemaWithInteger = new Schema(
+        ID_FIELD
+    );
+    TypeDescription orcSchemaWithInteger = ORCSchemaUtil.convert(schemaWithInteger);
+    long estimateLength = getEstimateLength(orcSchemaWithInteger);
+    Assert.assertEquals("Estimated average length of integer must be 8.", 8, estimateLength);
+
+    // create a schema with string field
+    Schema schemaWithString = new Schema(
+        DATA_FIELD
+    );
+    TypeDescription orcSchemaWithString = ORCSchemaUtil.convert(schemaWithString);
+    estimateLength = getEstimateLength(orcSchemaWithString);
+    Assert.assertEquals("Estimated average length of string must be 128.", 128, estimateLength);
+
+    // create a schema with float field
+    Schema schemaWithFloat = new Schema(
+        FLOAT_FIELD
+    );
+    TypeDescription orcSchemaWithFloat = ORCSchemaUtil.convert(schemaWithFloat);
+    estimateLength = getEstimateLength(orcSchemaWithFloat);
+    Assert.assertEquals("Estimated average length of float must be 8.", 8, estimateLength);
+
+    // create a schema with double field
+    Schema schemaWithDouble = new Schema(
+        DOUBLE_FIELD
+    );
+    TypeDescription orcSchemaWithDouble = ORCSchemaUtil.convert(schemaWithDouble);
+    estimateLength = getEstimateLength(orcSchemaWithDouble);
+    Assert.assertEquals("Estimated average length of double must be 8.", 8, estimateLength);
+
+    // create a schema with decimal field
+    Schema schemaWithDecimal = new Schema(
+        DECIMAL_FIELD
+    );
+    TypeDescription orcSchemaWithDecimal = ORCSchemaUtil.convert(schemaWithDecimal);
+    estimateLength = getEstimateLength(orcSchemaWithDecimal);
+    Assert.assertEquals("Estimated average length of decimal must be 7.", 7, estimateLength);
+
+    // create a schema with fixed field
+    Schema schemaWithFixed = new Schema(
+        FIXED_FIELD
+    );
+    TypeDescription orcSchemaWithFixed = ORCSchemaUtil.convert(schemaWithFixed);
+    estimateLength = getEstimateLength(orcSchemaWithFixed);
+    Assert.assertEquals("Estimated average length of fixed must be 128.", 128, estimateLength);
+
+    // create a schema with binary field
+    Schema schemaWithBinary = new Schema(
+        BINARY_FIELD
+    );
+    TypeDescription orcSchemaWithBinary = ORCSchemaUtil.convert(schemaWithBinary);
+    estimateLength = getEstimateLength(orcSchemaWithBinary);
+    Assert.assertEquals("Estimated average length of binary must be 128.", 128, estimateLength);
+
+    // create a schema with list field
+    Schema schemaWithList = new Schema(
+        FLOAT_LIST_FIELD
+    );
+    TypeDescription orcSchemaWithList = ORCSchemaUtil.convert(schemaWithList);
+    estimateLength = getEstimateLength(orcSchemaWithList);
+    Assert.assertEquals("Estimated average length of list must be 8.", 8, estimateLength);
+
+    // create a schema with long field
+    Schema schemaWithLong = new Schema(
+        LONG_FIELD
+    );
+    TypeDescription orcSchemaWithLong = ORCSchemaUtil.convert(schemaWithLong);
+    estimateLength = getEstimateLength(orcSchemaWithLong);
+    Assert.assertEquals("Estimated average length of long must be 8.", 8, estimateLength);
+
+    // create a schema with boolean field
+    Schema schemaWithBoolean = new Schema(
+        BOOLEAN_FIELD
+    );

Review comment:
       Nit: These can be on one line, like `Schema booSchema = new Schema(BOOLEAN_FIELD)`. The comment above each one is also unnecessary.
   
   The same applies to all of the `Schema` in this test (and possibly elsewhere). There's no need to use multiple lines here.

##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -66,6 +71,11 @@
     this.metricsConfig = metricsConfig;
 
     TypeDescription orcSchema = ORCSchemaUtil.convert(schema);
+
+    this.avgRowByteSize =
+        OrcSchemaVisitor.visitSchema(orcSchema, new EstimateOrcAvgWidthVisitor()).stream().reduce(Integer::sum)
+            .orElse(0);

Review comment:
       The use of `orElse(0)` concerns me somewhat.
   
   Looking at its usage, it seems as though using `avgRowByteSize`  of 0 would mean that the entirety of `batch.size` would be unaccounted for in the estimate in the `length` function.
   
   ```java
   return (long) (dataLength + (estimateMemory + (long) batch.size * avgRowByteSize) * 0.2);
   ```
   
   
   Under what situations would we _expect_ this to reasonably return 0? Is that possible / expected in some edge case, or more indicative of a bug?
   
   Would it make sense to default to some non-zero value (even 1) so that the ongoing batch.size isn't entirely dropped?
   
   At the very least, it seems like we should potentially log a debug message stating that `0` is being used. If users are investigating ORC files being written at sizes they find strange, having a log would be beneficial.






[GitHub] [iceberg] liubo1022126 commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
liubo1022126 commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r787484336



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +104,34 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter != null) {
+      if (batch.size > 0) {
+        try {
+          writer.addRowBatch(batch);
+          batch.reset();

Review comment:
       Because we check RollToNewFile when processing every record, if you flush the batch here, does that mean each piece of data will be flushed during processing?






[GitHub] [iceberg] hililiwei commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1031111890


   > @hililiwei, I don't understand what #3 is. Why is this tracking the data that hasn't been submitted to the writer? It seems like all you're doing is adding a constant to the estimated size. For Parquet, we use the current file offset plus the size that is buffered in memory.
   
   #3 mainly refers to the data in the `VectorizedRowBatch`
   https://github.com/apache/iceberg/blob/2208b24617f2c64bb02f1a7aec79f3bbaabdc1cc/orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java#L81-L91
   
   The data is written to the `batch` first.
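   
   The write path is roughly like this (a simplified sketch of the linked lines, not the exact code; the field names are only illustrative):
   
   ```java
   public void add(D datum) {
     try {
       // rows are buffered into the VectorizedRowBatch first
       valueWriter.write(datum, batch);
       // only when the batch is full is it handed to the ORC writer
       if (batch.size == batchSize) {
         writer.addRowBatch(batch);
         batch.reset();
       }
     } catch (IOException e) {
       throw new UncheckedIOException(e);
     }
   }
   ```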
   




[GitHub] [iceberg] coolderli commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
coolderli commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1031369828


   Any update on this? We found that using ORC can save more storage space than Parquet, so I'd like to try the ORC file format.




[GitHub] [iceberg] hililiwei commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1064688444


   Thanks @kbendick @dongjoon-hyun, I'll try it after it's released.




[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r830942360



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +110,27 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter == null) {
+      throw new RuntimeException("Can't get the length!");
+    }
+    long estimateMemory = this.treeWriter.estimateMemory();
+
+    long dataLength = 0;
+    try {
+      List<StripeInformation> stripes = writer.getStripes();
+      if (!stripes.isEmpty()) {
+        StripeInformation stripeInformation = stripes.get(stripes.size() - 1);
+        dataLength = stripeInformation != null ? stripeInformation.getOffset() + stripeInformation.getLength() : 0;
+      }
+    } catch (IOException e) {
+      throw new UncheckedIOException(String.format("Can't get stripes from file %s", file.location()), e);

Review comment:
       Nit: Cann't get strip's length from the file writer with path: %s.






[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r830943792



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -65,7 +71,11 @@
     this.batchSize = batchSize;
     this.metricsConfig = metricsConfig;
 
-    TypeDescription orcSchema = ORCSchemaUtil.convert(schema);
+    Pair<TypeDescription, Integer> typeDescriptionIntegerPair = ORCSchemaUtil.convertSchema2TypeDescWithLength(schema);
+
+    TypeDescription orcSchema = typeDescriptionIntegerPair.first();
+    estimateLength = typeDescriptionIntegerPair.second();

Review comment:
       Let's call it `avgRowByteSize` ? 






[GitHub] [iceberg] rdblue commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r786976034



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -153,4 +183,14 @@ private static Writer newOrcWriter(OutputFile file,
                                                          createWriterFunc) {
     return (OrcRowWriter<D>) createWriterFunc.apply(schema, orcSchema);
   }
+
+  private TreeWriter getTreeWriter() {

Review comment:
       Iceberg does not use `get` in method names because it is either not helpful or should be replaced with a more specific verb.






[GitHub] [iceberg] rdblue commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r786978455



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +104,34 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter != null) {
+      if (batch.size > 0) {
+        try {
+          writer.addRowBatch(batch);
+          batch.reset();

Review comment:
       `length` should not modify the writer.






[GitHub] [iceberg] hililiwei commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r834914845



##########
File path: core/src/main/java/org/apache/iceberg/io/FanoutDataWriter.java
##########
@@ -52,13 +51,7 @@ public FanoutDataWriter(FileWriterFactory<T> writerFactory, OutputFileFactory fi
 
   @Override
   protected FileWriter<T, DataWriteResult> newWriter(PartitionSpec spec, StructLike partition) {
-    // TODO: support ORC rolling writers
-    if (fileFormat == FileFormat.ORC) {

Review comment:
       Since these methods involve multiple Spark/Flink versions, I suggest cleaning them up in a separate follow-up PR after this one.
   






[GitHub] [iceberg] hililiwei commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r832984029



##########
File path: flink/v1.13/flink/src/test/java/org/apache/iceberg/flink/actions/TestRewriteDataFilesAction.java
##########
@@ -331,7 +332,7 @@ public void testRewriteAvoidRepeateCompress() throws IOException {
     try (FileAppender<Record> fileAppender = genericAppenderFactory.newAppender(Files.localOutput(file), format)) {
       long filesize = 20000;
       for (; fileAppender.length() < filesize; count++) {
-        Record record = SimpleDataUtil.createRecord(count, "iceberg");

Review comment:
       When I use the following formula:
   `(long) (dataLength + (estimateMemory + (long) batch.size * avgRowByteSize) * Rate);`
   and use "iceberg" as the test data, this case only passes when the rate is less than or equal to 0.017, which seems too small.
   The data is highly repetitive, so the generated file is only 404 bytes, while the UUID-generated file is about 222,432 bytes.






[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r833878728



##########
File path: orc/src/test/java/org/apache/iceberg/orc/TestEstimateOrcAvgWidthVisitor.java
##########
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.TypeDescription;
+import org.junit.Assert;
+import org.junit.Test;
+
+import static org.apache.iceberg.types.Types.NestedField.optional;
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+public class TestEstimateOrcAvgWidthVisitor {
+
+  // all supported fields
+  protected static final Types.NestedField ID_FIELD = required(1, "id", Types.IntegerType.get());
+  protected static final Types.NestedField DATA_FIELD = optional(2, "data", Types.StringType.get());
+  protected static final Types.NestedField FLOAT_FIELD = required(3, "float", Types.FloatType.get());
+  protected static final Types.NestedField DOUBLE_FIELD = optional(4, "double", Types.DoubleType.get());
+  protected static final Types.NestedField DECIMAL_FIELD = optional(5, "decimal", Types.DecimalType.of(5, 3));
+  protected static final Types.NestedField FIXED_FIELD = optional(7, "fixed", Types.FixedType.ofLength(4));
+  protected static final Types.NestedField BINARY_FIELD = optional(8, "binary", Types.BinaryType.get());
+  protected static final Types.NestedField FLOAT_LIST_FIELD = optional(9, "floatList",
+      Types.ListType.ofRequired(10, Types.FloatType.get()));
+  protected static final Types.NestedField LONG_FIELD = optional(11, "long", Types.LongType.get());
+  protected static final Types.NestedField BOOLEAN_FIELD = optional(12, "boolean", Types.BooleanType.get());
+  protected static final Types.NestedField TIMESTAMP_ZONE_FIELD = optional(13, "timestampZone",
+      Types.TimestampType.withZone());
+  protected static final Types.NestedField TIMESTAMP_FIELD = optional(14, "timestamp",
+      Types.TimestampType.withoutZone());
+  protected static final Types.NestedField DATE_FIELD = optional(15, "date", Types.DateType.get());
+  protected static final Types.NestedField UUID_FIELD = required(16, "uuid", Types.UUIDType.get());
+
+  protected static final Types.NestedField MAP_FIELD_1 = optional(17, "map1",
+      Types.MapType.ofOptional(18, 19, Types.FloatType.get(), Types.StringType.get())
+  );
+  protected static final Types.NestedField MAP_FIELD_2 = optional(20, "map2",
+      Types.MapType.ofOptional(21, 22, Types.IntegerType.get(), Types.DoubleType.get())
+  );
+  protected static final Types.NestedField STRUCT_FIELD = optional(23, "struct", Types.StructType.of(
+      required(24, "booleanField", Types.BooleanType.get()),
+      optional(25, "date", Types.DateType.get()),
+      optional(27, "timestamp", Types.TimestampType.withZone())
+  ));
+
+  @Test
+  public void testEstimateDataAveWidth() {

Review comment:
       Nit: `Ave` -> `Avg` ?






[GitHub] [iceberg] rdblue commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r788259659



##########
File path: build.gradle
##########
@@ -635,4 +636,3 @@ apply from: 'jmh.gradle'
 apply from: 'baseline.gradle'
 apply from: 'deploy.gradle'
 apply from: 'tasks.gradle'
-

Review comment:
       Can you remove unnecessary whitespace changes?






[GitHub] [iceberg] hililiwei commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1022891787


   > @hililiwei, can you describe how you're estimating the size of data that is buffered in memory for ORC? I think a description to explain to reviewers would help.
   
   If a file is still being written, its size is estimated in three steps:
   1. Size of the data already written to stripes. This value is obtained by summing the `offset` and `length` of the `writer`'s last stripe.
   2. Size of the data that has been submitted to the `writer` but has not yet been written to a stripe. When the OrcFileAppender is created, the `treeWriter` is obtained through reflection, and its `estimateMemory` method is used to estimate how much memory it is currently using.
   3. Data that has not yet been submitted to the `writer`, i.e. the contents of the row batch buffer. The buffer's default maximum size is used here.
   
   These three values are added together to estimate the data size.
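   
   A minimal sketch of how these three parts could be combined in `length()` (assuming the appender's `file`, `writer`, `treeWriter` and `isClosed` fields; `DEFAULT_BATCH_BUFFER_BYTES` is an illustrative placeholder for the buffer's default maximum, not a name from this PR):
   
   ```java
   // Hedged sketch only; constant and variable names are illustrative assumptions.
   @Override
   public long length() {
     if (isClosed) {
       return file.toInputFile().getLength();        // exact size once the file is closed
     }
     try {
       // 1. bytes already persisted: offset + length of the writer's last stripe
       long persisted = 0;
       List<StripeInformation> stripes = writer.getStripes();
       if (!stripes.isEmpty()) {
         StripeInformation last = stripes.get(stripes.size() - 1);
         persisted = last.getOffset() + last.getLength();
       }
       // 2. rows submitted to the writer but not yet written to a stripe
       long inWriter = treeWriter.estimateMemory();   // treeWriter obtained via reflection
       // 3. rows still sitting in the row batch buffer
       long inBatch = DEFAULT_BATCH_BUFFER_BYTES;     // placeholder for the buffer's default maximum
       return persisted + inWriter + inBatch;
     } catch (IOException e) {
       throw new UncheckedIOException(String.format("Can't get stripes from file %s", file.location()), e);
     }
   }
   ```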
   




[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r819227212



##########
File path: data/src/test/java/org/apache/iceberg/io/TestRollingFileWriters.java
##########
@@ -39,15 +38,15 @@
 @RunWith(Parameterized.class)
 public abstract class TestRollingFileWriters<T> extends WriterTestBase<T> {
 
-  // TODO: add ORC once we support ORC rolling file writers
-
   @Parameterized.Parameters(name = "FileFormat={0}, Partitioned={1}")
   public static Object[] parameters() {
     return new Object[][] {
         new Object[]{FileFormat.AVRO, false},
         new Object[]{FileFormat.AVRO, true},
         new Object[]{FileFormat.PARQUET, false},
         new Object[]{FileFormat.PARQUET, true},
+        new Object[]{FileFormat.ORC, false},
+        new Object[]{FileFormat.ORC, true},

Review comment:
       I found that another test class, `TestBaseTaskWriter`, also needs to enable the ORC parameter.






[GitHub] [iceberg] hililiwei commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1058112588


   @openinx I've enabled the ORC rolling-writer-related test cases, please take a look, thx.




[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r819238427



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -153,4 +176,10 @@ private static Writer newOrcWriter(OutputFile file,
                                                          createWriterFunc) {
     return (OrcRowWriter<D>) createWriterFunc.apply(schema, orcSchema);
   }
+
+  private TreeWriter treeWriterHiddenInORC() {
+    DynFields.BoundField<TreeWriter> treeWriterFiled =
+            DynFields.builder().hiddenImpl(writer.getClass(), "treeWriter").build(writer);

Review comment:
       Another thing: the `treeWriter` is a private member of `writer`'s underlying implementation `WriterImpl`. Is it possible to open a pull request in the [apache orc](https://github.com/apache/orc) project to expose the `TreeWriter` or `estimateMemory()` from `org.apache.orc.Writer`, so that we can get rid of this in a future iceberg release?
   
   CC @rdblue @dongjoon-hyun @marton-bod @pvary






[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r819242211



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +104,27 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter == null) {
+      throw new RuntimeException("Can't get the length!");
+    }
+    long estimateMemory = this.treeWriter.estimateMemory();
+
+    long dataLength = 0;
+    try {
+      List<StripeInformation> stripes = writer.getStripes();
+      if (!stripes.isEmpty()) {
+        StripeInformation stripeInformation = stripes.get(stripes.size() - 1);
+        dataLength = stripeInformation != null ? stripeInformation.getOffset() + stripeInformation.getLength() : 0;
+      }
+    } catch (IOException e) {
+      throw new UncheckedIOException(String.format("Can't get stripes from file %s", file.location()), e);
+    }
+
+    // This value is estimated, not actual.
+    return dataLength + estimateMemory + batch.size;

Review comment:
       I read https://github.com/apache/iceberg/pull/3784#issuecomment-1022891787. For the file-persisted bytes, I think using the last stripe's offset plus its length should be correct. For the in-memory encoded batch vector, I think `TreeWriter#estimateMemory` should be okay.
   
   But for the batch vector whose rows have not yet been flushed to encoded memory, using batch.size alone isn't correct. The rows can be of any data type, such as Integer, Long, Timestamp, String, etc. Since their widths are not the same, I think we need to multiply batch.size (which is actually a row count) by an average row width.
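   
   In other words, a hedged sketch of the suggestion (`avgRowWidth` and `persistedStripeBytes` are assumed names for illustration, not the PR's exact fields):
   
   ```java
   // Weight the pending row count by an estimated average per-row width derived
   // from the schema, instead of adding the raw row count.
   long pendingBatchBytes = (long) batch.size * avgRowWidth;   // batch.size is a row count
   long estimatedLength = persistedStripeBytes + treeWriter.estimateMemory() + pendingBatchBytes;
   ```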






[GitHub] [iceberg] hililiwei commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r819297661



##########
File path: flink/v1.14/flink/src/test/java/org/apache/iceberg/flink/actions/TestRewriteDataFilesAction.java
##########
@@ -331,7 +330,7 @@ public void testRewriteAvoidRepeateCompress() throws IOException {
     try (FileAppender<Record> fileAppender = genericAppenderFactory.newAppender(Files.localOutput(file), format)) {
       long filesize = 20000;
       for (; fileAppender.length() < filesize; count++) {
-        Record record = SimpleDataUtil.createRecord(count, "iceberg");

Review comment:
       Because of some optimization mechanisms, there will be some deviation between the estimated and the actual size when writing a large amount of duplicate data.
   However, once the cached data is flushed (when the row count or size exceeds the
   threshold), the estimate is corrected.






[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r832974611



##########
File path: orc/src/test/java/org/apache/iceberg/orc/TestEstimateOrcAveWidthVisitor.java
##########
@@ -0,0 +1,85 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import java.util.Map;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.TypeDescription;
+import org.junit.Assert;
+import org.junit.Test;
+
+import static org.apache.iceberg.types.Types.NestedField.optional;
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+public class TestEstimateOrcAveWidthVisitor {
+
+  // all supported fields, except for UUID which is on deprecation path: see https://github.com/apache/iceberg/pull/1611
+  // as well as Types.TimeType and Types.TimestampType.withoutZone as both are not supported by Spark
+  protected static final Types.NestedField ID_FIELD = required(1, "id", Types.IntegerType.get());
+  protected static final Types.NestedField DATA_FIELD = optional(2, "data", Types.StringType.get());
+  protected static final Types.NestedField FLOAT_FIELD = required(3, "float", Types.FloatType.get());
+  protected static final Types.NestedField DOUBLE_FIELD = optional(4, "double", Types.DoubleType.get());
+  protected static final Types.NestedField DECIMAL_FIELD = optional(5, "decimal", Types.DecimalType.of(5, 3));
+  protected static final Types.NestedField FIXED_FIELD = optional(7, "fixed", Types.FixedType.ofLength(4));
+  protected static final Types.NestedField BINARY_FIELD = optional(8, "binary", Types.BinaryType.get());
+  protected static final Types.NestedField FLOAT_LIST = optional(9, "floatlist",
+      Types.ListType.ofRequired(10, Types.FloatType.get()));
+  protected static final Types.NestedField LONG_FIELD = optional(11, "long", Types.LongType.get());
+
+  protected static final Types.NestedField MAP_FIELD_1 = optional(17, "map1",
+      Types.MapType.ofOptional(18, 19, Types.FloatType.get(), Types.StringType.get())
+  );
+  protected static final Types.NestedField MAP_FIELD_2 = optional(20, "map2",
+      Types.MapType.ofOptional(21, 22, Types.IntegerType.get(), Types.DoubleType.get())
+  );
+  protected static final Types.NestedField STRUCT_FIELD = optional(23, "structField", Types.StructType.of(
+      required(24, "booleanField", Types.BooleanType.get()),
+      optional(25, "date", Types.DateType.get()),
+      optional(27, "timestamp", Types.TimestampType.withZone())
+  ));
+
+  private static final Map<Types.NestedField, Integer> FIELDS_WITH_NAN_COUNT_TO_ID = ImmutableMap.of(
+      FLOAT_FIELD, 3, DOUBLE_FIELD, 4, FLOAT_LIST, 10, MAP_FIELD_1, 18, MAP_FIELD_2, 22
+  );
+
+  // create a schema with all supported fields
+  protected static final Schema SCHEMA = new Schema(
+      ID_FIELD,
+      DATA_FIELD,
+      FLOAT_FIELD,
+      DOUBLE_FIELD,
+      DECIMAL_FIELD,
+      FIXED_FIELD,
+      BINARY_FIELD,
+      FLOAT_LIST,
+      LONG_FIELD,
+      MAP_FIELD_1,
+      MAP_FIELD_2,
+      STRUCT_FIELD
+  );
+
+  @Test
+  public void testEstimateDataAveWidth() {
+    // Original mapping (stored in ORC)
+    TypeDescription orcSchema = ORCSchemaUtil.convert(SCHEMA);
+
+    long estimateLength = OrcSchemaVisitor.visitSchema(orcSchema, new EstimateOrcAveWidthVisitor())
+        .stream().reduce(Integer::sum).orElse(1);
+
+    Assert.assertEquals(602,estimateLength);
+  }

Review comment:
       Let's just separate the tests for the different data types into separate cases, so that we can easily dig into the root cause when a case breaks.






[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r830939855



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +110,27 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter == null) {
+      throw new RuntimeException("Can't get the length!");
+    }

Review comment:
       Nit: Pls just use `Preconditions.checkNotNull`?






[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r830946360



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -153,4 +182,10 @@ private static Writer newOrcWriter(OutputFile file,
                                                          createWriterFunc) {
     return (OrcRowWriter<D>) createWriterFunc.apply(schema, orcSchema);
   }
+
+  private TreeWriter treeWriterHiddenInORC() {
+    DynFields.BoundField<TreeWriter> treeWriterFiled =
+        DynFields.builder().hiddenImpl(writer.getClass(), "treeWriter").build(writer);

Review comment:
       Pls add a TODO saying in which ORC version we can switch to accessing `estimateMemorySize` directly, without reaching into this hidden `treeWriter` to get the value?






[GitHub] [iceberg] hililiwei commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r787323325



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +104,34 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter != null) {
+      if (batch.size > 0) {
+        try {
+          writer.addRowBatch(batch);
+          batch.reset();

Review comment:
       Removed this, but without the flush here I'm a little concerned that the estimate will deviate from the actual size, and the deviation may not be small.






[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r819229326



##########
File path: flink/v1.14/flink/src/test/java/org/apache/iceberg/flink/actions/TestRewriteDataFilesAction.java
##########
@@ -322,7 +322,6 @@ public void testRewriteLargeTableHasResiduals() throws IOException {
    */
   @Test
   public void testRewriteAvoidRepeateCompress() throws IOException {
-    Assume.assumeFalse("ORC does not support getting length when file is opening", format.equals(FileFormat.ORC));

Review comment:
       Please apply this change to flink 1.12 and flink 1.13 as well?






[GitHub] [iceberg] dongjoon-hyun commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1059358727


   Thank you for pinging me, @openinx and @kbendick. The following is the ORC PR for this request:
   - https://github.com/apache/orc/pull/1057
   
   




[GitHub] [iceberg] hililiwei commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r824542242



##########
File path: orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java
##########
@@ -103,42 +104,57 @@ public TypeDescription type() {
   private ORCSchemaUtil() {
   }
 
-  public static TypeDescription convert(Schema schema) {
+  public static TypeDescription convertSchema2TypeDesc(Schema schema) {
+    return convertSchema2TypeDescWithLength(schema).first();
+  }
+
+  public static Pair<TypeDescription, Integer> convertSchema2TypeDescWithLength(Schema schema) {

Review comment:
       Based on the discussion on the mailing list, I tried to estimate the average field width here. @openinx
   






[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r832800368



##########
File path: orc/src/main/java/org/apache/iceberg/orc/EstimateOrcAveWidthVisitor.java
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import java.util.List;
+import java.util.Optional;
+import org.apache.orc.TypeDescription;
+
+public class EstimateOrcAveWidthVisitor extends OrcSchemaVisitor<Integer> {

Review comment:
       Nit: `Ave` -> `Avg` ?

##########
File path: orc/src/main/java/org/apache/iceberg/orc/EstimateOrcAveWidthVisitor.java
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import java.util.List;
+import java.util.Optional;
+import org.apache.orc.TypeDescription;
+
+public class EstimateOrcAveWidthVisitor extends OrcSchemaVisitor<Integer> {
+
+  @Override
+  public Integer record(TypeDescription record, List<String> names, List<Integer> fields) {
+    return fields.stream().reduce(Integer::sum).orElse(0);
+  }
+
+  @Override
+  public Integer list(TypeDescription array, Integer elementResult) {

Review comment:
       Nit: How about renaming it to `elementWidth`?

##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -66,6 +71,11 @@
     this.metricsConfig = metricsConfig;
 
     TypeDescription orcSchema = ORCSchemaUtil.convert(schema);
+
+    List<Integer> integers = OrcSchemaVisitor.visitSchema(orcSchema, new EstimateOrcAveWidthVisitor());

Review comment:
       `integers`? That's a weird name... I'd suggest using `fieldWidths`.

##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -66,6 +71,11 @@
     this.metricsConfig = metricsConfig;
 
     TypeDescription orcSchema = ORCSchemaUtil.convert(schema);
+
+    List<Integer> integers = OrcSchemaVisitor.visitSchema(orcSchema, new EstimateOrcAveWidthVisitor());
+    estimateLength = integers
+        .stream().reduce(Integer::sum).orElse(1);

Review comment:
       orElse(0)? If there is no field in this orc schema, its estimated length should be 0, right?

##########
File path: orc/src/main/java/org/apache/iceberg/orc/EstimateOrcAveWidthVisitor.java
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import java.util.List;
+import java.util.Optional;
+import org.apache.orc.TypeDescription;
+
+public class EstimateOrcAveWidthVisitor extends OrcSchemaVisitor<Integer> {
+
+  @Override
+  public Integer record(TypeDescription record, List<String> names, List<Integer> fields) {
+    return fields.stream().reduce(Integer::sum).orElse(0);
+  }
+
+  @Override
+  public Integer list(TypeDescription array, Integer elementResult) {
+    return elementResult;
+  }
+
+  @Override
+  public Integer map(TypeDescription map, Integer keyWidth, Integer valueWidth) {
+    return keyWidth + valueWidth;
+  }
+
+  @Override
+  public Integer primitive(TypeDescription primitive) {
+    Optional<Integer> icebergIdOpt = ORCSchemaUtil.icebergID(primitive);
+
+    if (!icebergIdOpt.isPresent()) {
+      return 0;
+    }
+
+    switch (primitive.getCategory()) {

Review comment:
       I think we need to align on the approach for estimating the avg width of each data type. The basic rule is: we need to read GenericOrcWriters to see how each data type is encoded into the ORC column vector; that is the occupied in-memory byte size before any columnar compression.
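   
   For example (based on ORC's vectorized batch layout, stated here as an assumption): a LONG value is written into a `LongColumnVector` and so occupies 8 bytes per row, while a TIMESTAMP goes into a `TimestampColumnVector`, which keeps a long plus an int per row, i.e. roughly 12 bytes.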

##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -66,6 +71,11 @@
     this.metricsConfig = metricsConfig;
 
     TypeDescription orcSchema = ORCSchemaUtil.convert(schema);
+
+    List<Integer> integers = OrcSchemaVisitor.visitSchema(orcSchema, new EstimateOrcAveWidthVisitor());
+    estimateLength = integers

Review comment:
       Pls use `this.estimateWidth` to assign a value to the private class member ? 

##########
File path: orc/src/main/java/org/apache/iceberg/orc/EstimateOrcAveWidthVisitor.java
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import java.util.List;
+import java.util.Optional;
+import org.apache.orc.TypeDescription;
+
+public class EstimateOrcAveWidthVisitor extends OrcSchemaVisitor<Integer> {
+
+  @Override
+  public Integer record(TypeDescription record, List<String> names, List<Integer> fields) {
+    return fields.stream().reduce(Integer::sum).orElse(0);
+  }
+
+  @Override
+  public Integer list(TypeDescription array, Integer elementResult) {
+    return elementResult;
+  }
+
+  @Override
+  public Integer map(TypeDescription map, Integer keyWidth, Integer valueWidth) {
+    return keyWidth + valueWidth;
+  }
+
+  @Override
+  public Integer primitive(TypeDescription primitive) {
+    Optional<Integer> icebergIdOpt = ORCSchemaUtil.icebergID(primitive);
+
+    if (!icebergIdOpt.isPresent()) {
+      return 0;
+    }
+
+    switch (primitive.getCategory()) {
+      case BOOLEAN:
+      case BYTE:
+      case CHAR:
+        return 1;
+      case SHORT:
+        return 2;
+      case INT:
+      case FLOAT:
+        return 4;
+      case LONG:
+      case DOUBLE:
+        return 8;
+      case DATE:
+      case TIMESTAMP:
+      case TIMESTAMP_INSTANT:
+        return 12;
+      case STRING:
+      case VARCHAR:
+      case BINARY:
+        return 128;
+      case DECIMAL:
+        return primitive.getPrecision() * 4 + 1;

Review comment:
       You mean each digit of the unscaled value will occupy 4 bytes? In my view, each digit ranges from 0~9, so one byte is enough.
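   
   As a concrete comparison for the `DECIMAL(5, 3)` field used in the tests: `precision * 4 + 1` gives 5 * 4 + 1 = 21 bytes per value, whereas one byte per digit plus one would give 5 + 1 = 6 bytes.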

##########
File path: orc/src/main/java/org/apache/iceberg/orc/EstimateOrcAveWidthVisitor.java
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import java.util.List;
+import java.util.Optional;
+import org.apache.orc.TypeDescription;
+
+public class EstimateOrcAveWidthVisitor extends OrcSchemaVisitor<Integer> {
+
+  @Override
+  public Integer record(TypeDescription record, List<String> names, List<Integer> fields) {
+    return fields.stream().reduce(Integer::sum).orElse(0);
+  }
+
+  @Override
+  public Integer list(TypeDescription array, Integer elementResult) {
+    return elementResult;
+  }
+
+  @Override
+  public Integer map(TypeDescription map, Integer keyWidth, Integer valueWidth) {
+    return keyWidth + valueWidth;
+  }
+
+  @Override
+  public Integer primitive(TypeDescription primitive) {
+    Optional<Integer> icebergIdOpt = ORCSchemaUtil.icebergID(primitive);
+
+    if (!icebergIdOpt.isPresent()) {
+      return 0;
+    }
+
+    switch (primitive.getCategory()) {
+      case BOOLEAN:
+      case BYTE:
+      case CHAR:
+        return 1;
+      case SHORT:
+        return 2;
+      case INT:
+      case FLOAT:
+        return 4;
+      case LONG:
+      case DOUBLE:
+        return 8;
+      case DATE:

Review comment:
       According to the above rule, `DATE` will be encoded as a long, so I think its width should be 8 bytes.
   https://github.com/apache/iceberg/blob/b16c0a4f16236474e35869e21d1a303a113ac6e1/orc/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java#L313






[GitHub] [iceberg] kbendick commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r833647509



##########
File path: build.gradle
##########
@@ -449,6 +449,7 @@ project(':iceberg-orc') {
   dependencies {
     implementation project(path: ':iceberg-bundled-guava', configuration: 'shadow')
     api project(':iceberg-api')
+    implementation project(':iceberg-common')

Review comment:
       Are we sure we need to add `iceberg-common`? Looking at the definition of `iceberg-common`, it appears to be the same as line 450's `iceberg-bundled-guava`, using shadow configuration.
   
   https://github.com/apache/iceberg/blob/718ff6ae514d7a16077c21e5281e765ba5fa120b/build.gradle#L199-L203
   
   I do see it duplicated multiple times though in other places.






[GitHub] [iceberg] hililiwei edited a comment on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei edited a comment on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1077337564


   > Thanks @hililiwei. Left some further comments.
   > 
   > Additionally, is it possible for these changes to be backported to earlier Spark versions in subsequent PRs to make reviewing easier? It's possible I missed some discussion on this, so let me know if so.
   
   Reverted the old-version changes for flink and spark.




[GitHub] [iceberg] hililiwei commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r834914845



##########
File path: core/src/main/java/org/apache/iceberg/io/FanoutDataWriter.java
##########
@@ -52,13 +51,7 @@ public FanoutDataWriter(FileWriterFactory<T> writerFactory, OutputFileFactory fi
 
   @Override
   protected FileWriter<T, DataWriteResult> newWriter(PartitionSpec spec, StructLike partition) {
-    // TODO: support ORC rolling writers
-    if (fileFormat == FileFormat.ORC) {

Review comment:
       Since these methods involve multiple spark/flink versions, I suggest cleaning them up in a separate PR after this one is merged.






[GitHub] [iceberg] rdblue commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1030929637


   @hililiwei, I don't understand what #3 is. Why is this tracking the data that hasn't been submitted to the writer? It seems like all you're doing is adding a constant to the estimated size. For Parquet, we use the current file offset plus the size that is buffered in memory.
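   
   For comparison, a hedged sketch of that Parquet-style estimate (the method names below are hypothetical placeholders used only to illustrate the idea, not real Parquet or Iceberg APIs):
   
   ```java
   // Two-part estimate: bytes already written to the file plus bytes buffered in memory.
   long estimatedLength = currentFileOffset() + bufferedRowGroupBytes();
   ```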




[GitHub] [iceberg] rdblue commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1019614710


   @hililiwei, can you describe how you're estimating the size of data that is buffered in memory for ORC? I think a description to explain to reviewers would help.




[GitHub] [iceberg] hililiwei commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r824545080



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -65,7 +71,11 @@
     this.batchSize = batchSize;
     this.metricsConfig = metricsConfig;
 
-    TypeDescription orcSchema = ORCSchemaUtil.convert(schema);
+    Pair<TypeDescription, Integer> typeDescriptionIntegerPair = ORCSchemaUtil.convertSchema2TypeDescWithLength(schema);
+
+    TypeDescription orcSchema = typeDescriptionIntegerPair.first();
+    estimateLength = typeDescriptionIntegerPair.second();

Review comment:
       Gets the estimated average row width of the schema.






[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r830954859



##########
File path: orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java
##########
@@ -103,42 +104,57 @@ public TypeDescription type() {
   private ORCSchemaUtil() {
   }
 
-  public static TypeDescription convert(Schema schema) {
+  public static TypeDescription convertSchema2TypeDesc(Schema schema) {
+    return convertSchema2TypeDescWithLength(schema).first();
+  }
+
+  public static Pair<TypeDescription, Integer> convertSchema2TypeDescWithLength(Schema schema) {

Review comment:
       Why not just customize an `OrcSchemaVisitor` to calculate the estimated avg row size, so that it is decoupled from the schema conversion logic?






[GitHub] [iceberg] hililiwei commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1077337564


   > Thanks @hililiwei. Left some further comments.
   > 
   > Additionally, is it possible for these changes to be backported to earlier Spark versions in subsequent PRs to make reviewing easier? It's possible I missed some discussion on this, so let me know if so.
   
   Reverted the old-version changes for flink and spark.




[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r819234836



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -153,4 +176,10 @@ private static Writer newOrcWriter(OutputFile file,
                                                          createWriterFunc) {
     return (OrcRowWriter<D>) createWriterFunc.apply(schema, orcSchema);
   }
+
+  private TreeWriter treeWriterHiddenInORC() {
+    DynFields.BoundField<TreeWriter> treeWriterFiled =
+            DynFields.builder().hiddenImpl(writer.getClass(), "treeWriter").build(writer);

Review comment:
       Nit:  In iceberg we usually use 4 space indent, rather than 8 spaces.






[GitHub] [iceberg] hililiwei commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1047463823


   > @coolderli yes, parquet query performance is worse than orc when select by trino.
   > 
   > @hililiwei and does this pr have any remaining unfinished work? I want merge this pr to my branch.
   
   For now, there are no major changes. However, I'm still waiting for comments from @rdblue or anyone else, so it may be revised again. 😄




[GitHub] [iceberg] hililiwei commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r787488631



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +104,34 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter != null) {
+      if (batch.size > 0) {
+        try {
+          writer.addRowBatch(batch);
+          batch.reset();

Review comment:
       Yes, it is a poor write. It would be better to get the `batch` length here and add it to the length result.






[GitHub] [iceberg] rdblue commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r786977928



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -153,4 +183,14 @@ private static Writer newOrcWriter(OutputFile file,
                                                          createWriterFunc) {
     return (OrcRowWriter<D>) createWriterFunc.apply(schema, orcSchema);
   }
+
+  private TreeWriter getTreeWriter() {
+    try {
+      Field treeWriterFiled = writer.getClass().getDeclaredField("treeWriter");

Review comment:
       Also, typo in `Field`






[GitHub] [iceberg] hililiwei commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r787388761



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +104,34 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter != null) {
+      if (batch.size > 0) {
+        try {
+          writer.addRowBatch(batch);
+          batch.reset();

Review comment:
       I want to get the byte size of the `batch` here, but I haven't found a method for it yet. Because `length` does not include the size of the batch, some test cases fail.

##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +104,34 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter != null) {
+      if (batch.size > 0) {
+        try {
+          writer.addRowBatch(batch);
+          batch.reset();

Review comment:
       I want to get the byte size of the `batch` here, but I haven't found a method for it yet. Because `length` does not include the size of `batch`, some test cases fail.






[GitHub] [iceberg] hililiwei commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r787388761



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +104,34 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter != null) {
+      if (batch.size > 0) {
+        try {
+          writer.addRowBatch(batch);
+          batch.reset();

Review comment:
       I want to get the byte size of the `batch` here, but I haven't found a method for it yet.






[GitHub] [iceberg] rdblue commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r786976442



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -153,4 +183,14 @@ private static Writer newOrcWriter(OutputFile file,
                                                          createWriterFunc) {
     return (OrcRowWriter<D>) createWriterFunc.apply(schema, orcSchema);
   }
+
+  private TreeWriter getTreeWriter() {
+    try {
+      Field treeWriterFiled = writer.getClass().getDeclaredField("treeWriter");

Review comment:
       Please use the `DynFields` helpers for reflection.






[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r830944876



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +110,27 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter == null) {
+      throw new RuntimeException("Can't get the length!");
+    }
+    long estimateMemory = this.treeWriter.estimateMemory();
+
+    long dataLength = 0;
+    try {
+      List<StripeInformation> stripes = writer.getStripes();
+      if (!stripes.isEmpty()) {
+        StripeInformation stripeInformation = stripes.get(stripes.size() - 1);
+        dataLength = stripeInformation != null ? stripeInformation.getOffset() + stripeInformation.getLength() : 0;
+      }
+    } catch (IOException e) {
+      throw new UncheckedIOException(String.format("Can't get stripes from file %s", file.location()), e);
+    }
+
+    // This value is estimated, not actual.
+    return dataLength + estimateMemory + (long) batch.size * estimateLength;

Review comment:
       Do we need to multiply the `batch.size * avgRowByteSize` term by an empirical compression ratio, per the mailing-list discussion?
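       If a ratio were added, the in-memory terms might be scaled roughly as in this sketch (the ratio value is purely an assumed placeholder, not something measured in this PR):
    ```java
    // Hypothetical sketch: scale the not-yet-compressed, in-memory portion by an assumed ratio.
    private static final double ASSUMED_COMPRESSION_RATIO = 0.3; // placeholder, would need tuning

    private long estimateLength(long dataLength, long estimateMemory, long batchRows, long avgRowWidth) {
      long inMemoryBytes = estimateMemory + batchRows * avgRowWidth;
      return dataLength + (long) (inMemoryBytes * ASSUMED_COMPRESSION_RATIO);
    }
    ```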






[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r832972626



##########
File path: orc/src/test/java/org/apache/iceberg/orc/TestEstimateOrcAveWidthVisitor.java
##########
@@ -0,0 +1,85 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import java.util.Map;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.TypeDescription;
+import org.junit.Assert;
+import org.junit.Test;
+
+import static org.apache.iceberg.types.Types.NestedField.optional;
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+public class TestEstimateOrcAveWidthVisitor {
+
+  // all supported fields, except for UUID which is on deprecation path: see https://github.com/apache/iceberg/pull/1611
+  // as well as Types.TimeType and Types.TimestampType.withoutZone as both are not supported by Spark
+  protected static final Types.NestedField ID_FIELD = required(1, "id", Types.IntegerType.get());
+  protected static final Types.NestedField DATA_FIELD = optional(2, "data", Types.StringType.get());
+  protected static final Types.NestedField FLOAT_FIELD = required(3, "float", Types.FloatType.get());
+  protected static final Types.NestedField DOUBLE_FIELD = optional(4, "double", Types.DoubleType.get());
+  protected static final Types.NestedField DECIMAL_FIELD = optional(5, "decimal", Types.DecimalType.of(5, 3));
+  protected static final Types.NestedField FIXED_FIELD = optional(7, "fixed", Types.FixedType.ofLength(4));
+  protected static final Types.NestedField BINARY_FIELD = optional(8, "binary", Types.BinaryType.get());
+  protected static final Types.NestedField FLOAT_LIST = optional(9, "floatlist",
+      Types.ListType.ofRequired(10, Types.FloatType.get()));
+  protected static final Types.NestedField LONG_FIELD = optional(11, "long", Types.LongType.get());
+
+  protected static final Types.NestedField MAP_FIELD_1 = optional(17, "map1",
+      Types.MapType.ofOptional(18, 19, Types.FloatType.get(), Types.StringType.get())
+  );
+  protected static final Types.NestedField MAP_FIELD_2 = optional(20, "map2",
+      Types.MapType.ofOptional(21, 22, Types.IntegerType.get(), Types.DoubleType.get())
+  );
+  protected static final Types.NestedField STRUCT_FIELD = optional(23, "structField", Types.StructType.of(
+      required(24, "booleanField", Types.BooleanType.get()),
+      optional(25, "date", Types.DateType.get()),
+      optional(27, "timestamp", Types.TimestampType.withZone())
+  ));
+
+  private static final Map<Types.NestedField, Integer> FIELDS_WITH_NAN_COUNT_TO_ID = ImmutableMap.of(
+      FLOAT_FIELD, 3, DOUBLE_FIELD, 4, FLOAT_LIST, 10, MAP_FIELD_1, 18, MAP_FIELD_2, 22
+  );

Review comment:
       This can be removed now because it's unused?






[GitHub] [iceberg] kbendick commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r833647509



##########
File path: build.gradle
##########
@@ -449,6 +449,7 @@ project(':iceberg-orc') {
   dependencies {
     implementation project(path: ':iceberg-bundled-guava', configuration: 'shadow')
     api project(':iceberg-api')
+    implementation project(':iceberg-common')

Review comment:
       Are we sure we need to add `iceberg-common`? Looking at the definition of `iceberg-common`, it appears to be the same as line 450's `iceberg-bundled-guava`, using shadow configuration.
   
   https://github.com/apache/iceberg/blob/718ff6ae514d7a16077c21e5281e765ba5fa120b/build.gradle#L199-L203
   
   I do see it duplicated multiple times though in other places.
   
   EDIT: Never mind. I see that it does need to be duplicated because of the relocation. I think declaring `iceberg-common` forces the shaded / relocated bundled-guava artifact to be built first. Sorry for the noise! I'm still more used to Bazel and other somewhat less common tools. 😅 






[GitHub] [iceberg] openinx commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1080153051


   Got this merged now, thanks all for reviewing, and thanks @hililiwei for the contribution!




[GitHub] [iceberg] openinx merged pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx merged pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784


   




[GitHub] [iceberg] hililiwei commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r788753044



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +104,27 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter == null) {
+      throw new RuntimeException("Can't get the length!");
+    }
+    long estimateMemory = this.treeWriter.estimateMemory();
+
+    long dataLength = 0;
+    try {
+      List<StripeInformation> stripes = writer.getStripes();
+      if (!stripes.isEmpty()) {
+        StripeInformation stripeInformation = stripes.get(stripes.size() - 1);
+        dataLength = stripeInformation != null ? stripeInformation.getOffset() + stripeInformation.getLength() : 0;
+      }
+    } catch (IOException e) {
+      throw new UncheckedIOException(String.format("Can't get stripes from file %s", file.location()), e);
+    }
+
+    // This value is estimated, including the default max bytes in batch(not actual).
+    return dataLength + estimateMemory + VectorizedRowBatch.DEFAULT_BYTES;

Review comment:
     I added the default byte size of the batch instead of trying to calculate its actual size. @rdblue Is this feasible?
   






[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r819229491



##########
File path: flink/v1.14/flink/src/test/java/org/apache/iceberg/flink/actions/TestRewriteDataFilesAction.java
##########
@@ -331,7 +330,7 @@ public void testRewriteAvoidRepeateCompress() throws IOException {
     try (FileAppender<Record> fileAppender = genericAppenderFactory.newAppender(Files.localOutput(file), format)) {
       long filesize = 20000;
       for (; fileAppender.length() < filesize; count++) {
-        Record record = SimpleDataUtil.createRecord(count, "iceberg");

Review comment:
       Why do we change this?






[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r819242211



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +104,27 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter == null) {
+      throw new RuntimeException("Can't get the length!");
+    }
+    long estimateMemory = this.treeWriter.estimateMemory();
+
+    long dataLength = 0;
+    try {
+      List<StripeInformation> stripes = writer.getStripes();
+      if (!stripes.isEmpty()) {
+        StripeInformation stripeInformation = stripes.get(stripes.size() - 1);
+        dataLength = stripeInformation != null ? stripeInformation.getOffset() + stripeInformation.getLength() : 0;
+      }
+    } catch (IOException e) {
+      throw new UncheckedIOException(String.format("Can't get stripes from file %s", file.location()), e);
+    }
+
+    // This value is estimated, not actual.
+    return dataLength + estimateMemory + batch.size;

Review comment:
       I read https://github.com/apache/iceberg/pull/3784#issuecomment-1022891787. For the file-persisted bytes, I think using the last stripe's offset plus its length should be correct. For the memory-encoded batch vector, I think `TreeWriter#estimateMemory` should be okay.
   
       But for the batch vector whose rows have not yet been flushed to encoded memory, using `batch.size` on its own isn't correct. Because the rows can be of any data type, such as Integer, Long, Timestamp, String, etc., and their widths are not the same, I think we need to multiply `batch.size` (which is actually a row count) by an average row width.






[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r832959920



##########
File path: orc/src/main/java/org/apache/iceberg/orc/EstimateOrcAveWidthVisitor.java
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import java.util.List;
+import java.util.Optional;
+import org.apache.orc.TypeDescription;
+
+public class EstimateOrcAveWidthVisitor extends OrcSchemaVisitor<Integer> {
+
+  @Override
+  public Integer record(TypeDescription record, List<String> names, List<Integer> fields) {
+    return fields.stream().reduce(Integer::sum).orElse(0);
+  }
+
+  @Override
+  public Integer list(TypeDescription array, Integer elementResult) {
+    return elementResult;
+  }
+
+  @Override
+  public Integer map(TypeDescription map, Integer keyWidth, Integer valueWidth) {
+    return keyWidth + valueWidth;
+  }
+
+  @Override
+  public Integer primitive(TypeDescription primitive) {
+    Optional<Integer> icebergIdOpt = ORCSchemaUtil.icebergID(primitive);
+
+    if (!icebergIdOpt.isPresent()) {
+      return 0;
+    }
+
+    switch (primitive.getCategory()) {

Review comment:
       The estimated byte size of a decimal is just precision + 2. Just as I said in another comment, each digit will occupy one byte; in fact, the `BigDecimal`'s unscaled value is usually a `BigInteger`, and the `BigInteger` will just encode each digit into a byte. For example, a decimal(5, 3) value would be estimated at 5 + 2 = 7 bytes.






[GitHub] [iceberg] hililiwei commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r832921020



##########
File path: orc/src/main/java/org/apache/iceberg/orc/EstimateOrcAveWidthVisitor.java
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import java.util.List;
+import java.util.Optional;
+import org.apache.orc.TypeDescription;
+
+public class EstimateOrcAveWidthVisitor extends OrcSchemaVisitor<Integer> {
+
+  @Override
+  public Integer record(TypeDescription record, List<String> names, List<Integer> fields) {
+    return fields.stream().reduce(Integer::sum).orElse(0);
+  }
+
+  @Override
+  public Integer list(TypeDescription array, Integer elementResult) {
+    return elementResult;
+  }
+
+  @Override
+  public Integer map(TypeDescription map, Integer keyWidth, Integer valueWidth) {
+    return keyWidth + valueWidth;
+  }
+
+  @Override
+  public Integer primitive(TypeDescription primitive) {
+    Optional<Integer> icebergIdOpt = ORCSchemaUtil.icebergID(primitive);
+
+    if (!icebergIdOpt.isPresent()) {
+      return 0;
+    }
+
+    switch (primitive.getCategory()) {

Review comment:
       The corresponding relationships are as follows:
   ```
    BOOLEAN	->	LongColumnVector
    BYTE	->	LongColumnVector
    SHORT	->	LongColumnVector
    INT	->	LongColumnVector
    LONG	->	LongColumnVector
    FLOAT	->	DoubleColumnVector
    DOUBLE	->	DoubleColumnVector
    DATE	->	LongColumnVector
    TIMESTAMP	->	TimestampColumnVector
    BINARY	->	BytesColumnVector
    STRING	->	BytesColumnVector
    DECIMAL	->	Decimal18Writer or Decimal38Writer
   ```
    The corresponding byte estimates are:
   ```
   LongColumnVector	->	8 byte
   DoubleColumnVector	->	8 byte
   TimestampColumnVector	->	12 byte
   Decimal18Writer/Decimal38Writer	->	(precision + 4) / 2 byte
   BytesColumnVector	->	128 byte
   ```
   How about this?
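    As a self-contained sketch, that mapping could look roughly like the following (the 128-byte figure for string/binary is an assumed average, and the category names follow the ORC version in use):
    ```java
    import org.apache.orc.TypeDescription;

    class OrcWidthEstimates {
      // Rough per-value byte estimate keyed on the backing ColumnVector type.
      static int estimateWidth(TypeDescription primitive) {
        switch (primitive.getCategory()) {
          case BOOLEAN:
          case BYTE:
          case SHORT:
          case INT:
          case LONG:
          case DATE:
            return 8;                                   // LongColumnVector slot
          case FLOAT:
          case DOUBLE:
            return 8;                                   // DoubleColumnVector slot
          case TIMESTAMP:
            return 12;                                  // TimestampColumnVector: millis + nanos
          case DECIMAL:
            return (primitive.getPrecision() + 4) / 2;  // Decimal18Writer / Decimal38Writer
          case STRING:
          case CHAR:
          case VARCHAR:
          case BINARY:
            return 128;                                 // assumed average for BytesColumnVector
          default:
            return 0;
        }
      }
    }
    ```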






[GitHub] [iceberg] hililiwei commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1041178609


   > Any update about this? We found use orc can save more storage space than using parquet. So I'd like to try the Orc file.
   
   ping @rdblue @liubo1022126 




[GitHub] [iceberg] hililiwei edited a comment on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei edited a comment on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1022891787


   > @hililiwei, can you describe how you're estimating the size of data that is buffered in memory for ORC? I think a description to explain to reviewers would help.
   
   If a file is being written, its size is estimated in three steps:
   1. The size of data that has already been written to stripes. This value is obtained by summing the `offset` and `length` of the `writer`'s last stripe.
   2. The size of data that has been submitted to the `writer` but not yet written to a stripe. When the OrcFileAppender is created, `treeWriter` is obtained through reflection, and its `estimateMemory` method estimates how much memory it is currently using.
   3. The size of data that has not yet been submitted to the `writer`, i.e. the row batch buffer. The buffer's default maximum size is used here.
   
   Adding these three values gives the estimated data size; see the sketch below.
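   A minimal sketch of how those three parts combine in `OrcFileAppender#length()`, based on the diffs quoted above (`writer`, `treeWriter`, `batch`, and `file` are the appender's fields):
   ```java
   @Override
   public long length() {
     if (isClosed) {
       return file.toInputFile().getLength();
     }

     // 1. Bytes already persisted: offset + length of the writer's last stripe.
     long dataLength = 0;
     try {
       List<StripeInformation> stripes = writer.getStripes();
       if (!stripes.isEmpty()) {
         StripeInformation last = stripes.get(stripes.size() - 1);
         dataLength = last.getOffset() + last.getLength();
       }
     } catch (IOException e) {
       throw new UncheckedIOException(String.format("Can't get stripes from file %s", file.location()), e);
     }

     // 2. Rows handed to the writer but not yet flushed to a stripe: estimated via the tree writer.
     long bufferedInWriter = treeWriter.estimateMemory();

     // 3. Rows still sitting in the VectorizedRowBatch: bounded by its default maximum byte size.
     long bufferedInBatch = VectorizedRowBatch.DEFAULT_BYTES;

     // The result is an estimate, not an exact file length.
     return dataLength + bufferedInWriter + bufferedInBatch;
   }
   ```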
   




[GitHub] [iceberg] liubo1022126 commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
liubo1022126 commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1047395323


   @coolderli yes, Parquet query performance is worse than ORC when selecting through Trino.
   
   @hililiwei does this PR have any remaining unfinished work? I want to merge this PR into my branch.




[GitHub] [iceberg] hililiwei commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r787323325



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +104,34 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter != null) {
+      if (batch.size > 0) {
+        try {
+          writer.addRowBatch(batch);
+          batch.reset();

Review comment:
       Removed this, but if we don't flush here, I'm a little concerned that the result will deviate slightly from the actual size, although the difference may be small.






[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r830932481



##########
File path: flink/v1.13/flink/src/test/java/org/apache/iceberg/flink/actions/TestRewriteDataFilesAction.java
##########
@@ -331,7 +332,7 @@ public void testRewriteAvoidRepeateCompress() throws IOException {
     try (FileAppender<Record> fileAppender = genericAppenderFactory.newAppender(Files.localOutput(file), format)) {
       long filesize = 20000;
       for (; fileAppender.length() < filesize; count++) {
-        Record record = SimpleDataUtil.createRecord(count, "iceberg");

Review comment:
       Do we still need to change this line after we add the compression ratio according to this discussion? https://lists.apache.org/thread/q9w7rwr31y4lykdkfx7dpdthk2x6bow1






[GitHub] [iceberg] dungdm93 commented on a change in pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
dungdm93 commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r834417381



##########
File path: core/src/main/java/org/apache/iceberg/io/FanoutDataWriter.java
##########
@@ -52,13 +51,7 @@ public FanoutDataWriter(FileWriterFactory<T> writerFactory, OutputFileFactory fi
 
   @Override
   protected FileWriter<T, DataWriteResult> newWriter(PartitionSpec spec, StructLike partition) {
-    // TODO: support ORC rolling writers
-    if (fileFormat == FileFormat.ORC) {

Review comment:
       Now that `ORC` supports rolling writers, `fileFormat` is no longer used anywhere.
   Should we deprecate or remove it?






[GitHub] [iceberg] hililiwei commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1080160516


   Thanks openinx and all for reviewing. 😃 
   I'm going to start porting it to the other supported Flink/Spark versions and do some cleanup.




[GitHub] [iceberg] hililiwei commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
hililiwei commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1053085656


   cc @rdblue @openinx, could you please review when you have some time?




[GitHub] [iceberg] kbendick commented on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1059382284


   > Thank you for pinging me, @openinx and @kbendick . The following is the ORC PR for this request.
   > 
   > * [ORC-1123: Add `estimationMemory` method for writer orc#1057](https://github.com/apache/orc/pull/1057)
   
   Thank you for the prompt response! 🙂 




[GitHub] [iceberg] kbendick edited a comment on pull request #3784: ORC:ORC supports rolling writers.

Posted by GitBox <gi...@apache.org>.
kbendick edited a comment on pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#issuecomment-1058822726


   cc @dongjoon-hyun 
   
   This is a proposed PR for estimating file size with ORC files, to support rolling file writers using ORC in Iceberg.
   
   Right now, the feature is disabled entirely because of the inability to estimate the file size of an open ORC file that’s still being written to. Adding this would bring ORC much closer to parity with Parquet in Iceberg.
   
   @openinx has summarized their thoughts and the current situation pretty well here: https://lists.apache.org/thread/g6yo7m46mr86ov1vkm9wnmshgw7hcl6b
   
   If you have time, could you or somebody from the ORC community provide feedback on the best approach to estimating file size, so that ORC might have support equivalent to Parquet in this regard?
   
   I was hoping you or somebody else from the ORC community might chime in, given @openinx’s summary of the situation (on the dev list here https://lists.apache.org/thread/g6yo7m46mr86ov1vkm9wnmshgw7hcl6b).
   
   Thanks in advance for any guidance you might be able to provide 🙂 
   
   Also cc @marton-bod and other ORC developers.

