You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/01 13:45:53 UTC

[GitHub] [iceberg] ConeyLiu opened a new pull request, #6335: Core: Avoid generating a large ManifestFile when committing

ConeyLiu opened a new pull request, #6335:
URL: https://github.com/apache/iceberg/pull/6335

   In our production env, we noticed the manifest files have a large random size, ranging from several KB to larger than 100 MB.  It seems the `MANIFEST_TARGET_SIZE_BYTES` has not worked during the commit phase.
   
   In this PR, we avoid generating a manifest file larger than `MANIFEST_TARGET_SIZE_BYTES` for newly added content files and will generate multiple manifest files when the size is covered.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#issuecomment-1333794619

   Hi @szehon-ho @rdblue @pvary @stevenzwu @chenjunjiedada pls help to review this when you are free. Thanks a lot.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] nastra commented on pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by GitBox <gi...@apache.org>.
nastra commented on PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#issuecomment-1337557734

   It is mentioned in the docs that `MANIFEST_TARGET_SIZE_BYTES` relates to `Target size when merging manifest files`, meaning that this setting only takes effect when merging of manifest files happens (e.g. when using `newAppend()`). Merging of manifest files would not happen when using `newFastAppend()` for example. This might explain why it seemed that this setting wouldn't take any effect in your env. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1277056534


##########
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java:
##########
@@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest files. */
+abstract class RollingManifestWriter<T extends ContentFile<T>> implements FileAppender<T> {
+  private final int formatVersion;
+  private final Long snapshotId;
+  private final PartitionSpec spec;
+  private final FileIO io;
+  private final ManifestOutputFileFactory outputFileFactory;
+  private final long targetFileSizeInBytes;
+  private final List<ManifestFile> manifestFiles;
+
+  private long currentFileRows = 0;
+  private OutputFile currentFile;
+  private ManifestWriter<T> currentWriter = null;
+
+  private boolean closed = false;
+
+  RollingManifestWriter(
+      int formatVersion,
+      Long snapshotId,
+      PartitionSpec spec,
+      FileIO io,
+      ManifestOutputFileFactory outputFileFactory,
+      long targetFileSizeInBytes) {
+    this.formatVersion = formatVersion;
+    this.snapshotId = snapshotId;
+    this.spec = spec;
+    this.io = io;
+    this.outputFileFactory = outputFileFactory;
+    this.targetFileSizeInBytes = targetFileSizeInBytes;
+    this.manifestFiles = Lists.newArrayList();
+
+    openCurrentWriter();
+  }
+
+  protected abstract ManifestWriter<T> newManifestWriter(
+      int targetFormatVersion,
+      Long targetSnapshotId,
+      PartitionSpec targetSpec,
+      OutputFile targetManifestPath);
+
+  /**
+   * Add an added entry for a file.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. The data and file sequence
+   * numbers will be assigned at commit.
+   *
+   * @param addedFile a data file
+   */
+  @Override
+  public void add(T addedFile) {
+    currentWriter.add(addedFile);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add an added entry for a file with a specific sequence number.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. The entry's data sequence
+   * number will be the provided data sequence number. The entry's file sequence number will be
+   * assigned at commit.
+   *
+   * @param addedFile a data file
+   * @param dataSequenceNumber a data sequence number for the file
+   */
+  public void add(T addedFile, long dataSequenceNumber) {
+    currentWriter.add(addedFile, dataSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add an existing entry for a file.
+   *
+   * <p>The original data and file sequence numbers, snapshot ID, which were assigned at commit,
+   * must be preserved when adding an existing entry.
+   *
+   * @param existingFile a file
+   * @param fileSnapshotId snapshot ID when the data file was added to the table
+   * @param dataSequenceNumber a data sequence number of the file (assigned when the file was added)
+   * @param fileSequenceNumber a file sequence number (assigned when the file was added)
+   */
+  public void existing(
+      T existingFile, long fileSnapshotId, long dataSequenceNumber, Long fileSequenceNumber) {
+    currentWriter.existing(existingFile, fileSnapshotId, dataSequenceNumber, fileSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add a delete entry for a file.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. However, the original data and
+   * file sequence numbers of the file must be preserved when the file is marked as deleted.
+   *
+   * @param deletedFile a file
+   * @param dataSequenceNumber a data sequence number of the file (assigned when the file was added)
+   * @param fileSequenceNumber a file sequence number (assigned when the file was added)
+   */
+  public void delete(T deletedFile, long dataSequenceNumber, Long fileSequenceNumber) {
+    currentWriter.delete(deletedFile, dataSequenceNumber, fileSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  private void tryRollingToNewFile() {
+    currentFileRows++;
+
+    if (currentWriter.length() > targetFileSizeInBytes) {

Review Comment:
   I have checked some tables. The average size has a lot to do with the schema of the table, the more columns take up more storage space. That's because of the column metrics. I noticed the biggest one (with thousand columns) needs about 5KB for one entry. The small tables (about 10 columns) need about several hundred bytes. So 250 should be an acceptable number.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#issuecomment-1356905016

   This seems like a reasonable thing to add to me. I'm actually more concerned about the use case that caused this.
   
   @ConeyLiu, what was the case where you were creating huge manifests? How many data files were committed?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1063976959


##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -49,8 +51,9 @@ class FastAppend extends SnapshotProducer<AppendFiles> implements AppendFiles {
   private final List<DataFile> newFiles = Lists.newArrayList();
   private final List<ManifestFile> appendManifests = Lists.newArrayList();
   private final List<ManifestFile> rewrittenAppendManifests = Lists.newArrayList();
-  private ManifestFile newManifest = null;
+  private List<ManifestFile> cachedNewManifests = null;

Review Comment:
   @RussellSpitzer Do you mean the `caching` part here? Here just rename the variable because it is plural now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1251154568


##########
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##########
@@ -513,6 +515,38 @@ protected long snapshotId() {
     return snapshotId;
   }
 
+  protected static <F extends ContentFile<F>> List<ManifestFile> writeFilesToManifests(

Review Comment:
   Thanks @ConeyLiu !



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#issuecomment-1656510506

   Thanks @aokolnychyi for mering this. Thanks @rdblue @RussellSpitzer @szehon-ho @amogh-jahagirdar for reviewing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1276528811


##########
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java:
##########
@@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest files. */
+abstract class RollingManifestWriter<T extends ContentFile<T>> implements FileAppender<T> {

Review Comment:
   We usually call such boundaries `F`, short from file: `F extends ContentFile<F>`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1276525397


##########
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java:
##########
@@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest files. */
+abstract class RollingManifestWriter<T extends ContentFile<T>> implements FileAppender<T> {
+  private final int formatVersion;
+  private final Long snapshotId;
+  private final PartitionSpec spec;
+  private final FileIO io;
+  private final ManifestOutputFileFactory outputFileFactory;
+  private final long targetFileSizeInBytes;
+  private final List<ManifestFile> manifestFiles;
+
+  private long currentFileRows = 0;
+  private OutputFile currentFile;
+  private ManifestWriter<T> currentWriter = null;
+
+  private boolean closed = false;
+
+  RollingManifestWriter(
+      int formatVersion,
+      Long snapshotId,
+      PartitionSpec spec,
+      FileIO io,
+      ManifestOutputFileFactory outputFileFactory,
+      long targetFileSizeInBytes) {
+    this.formatVersion = formatVersion;
+    this.snapshotId = snapshotId;
+    this.spec = spec;
+    this.io = io;
+    this.outputFileFactory = outputFileFactory;
+    this.targetFileSizeInBytes = targetFileSizeInBytes;
+    this.manifestFiles = Lists.newArrayList();
+
+    openCurrentWriter();
+  }
+
+  protected abstract ManifestWriter<T> newManifestWriter(
+      int targetFormatVersion,
+      Long targetSnapshotId,
+      PartitionSpec targetSpec,
+      OutputFile targetManifestPath);
+
+  /**
+   * Add an added entry for a file.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. The data and file sequence
+   * numbers will be assigned at commit.
+   *
+   * @param addedFile a data file
+   */
+  @Override
+  public void add(T addedFile) {
+    currentWriter.add(addedFile);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add an added entry for a file with a specific sequence number.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. The entry's data sequence
+   * number will be the provided data sequence number. The entry's file sequence number will be
+   * assigned at commit.
+   *
+   * @param addedFile a data file
+   * @param dataSequenceNumber a data sequence number for the file
+   */
+  public void add(T addedFile, long dataSequenceNumber) {
+    currentWriter.add(addedFile, dataSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add an existing entry for a file.
+   *
+   * <p>The original data and file sequence numbers, snapshot ID, which were assigned at commit,
+   * must be preserved when adding an existing entry.
+   *
+   * @param existingFile a file
+   * @param fileSnapshotId snapshot ID when the data file was added to the table
+   * @param dataSequenceNumber a data sequence number of the file (assigned when the file was added)
+   * @param fileSequenceNumber a file sequence number (assigned when the file was added)
+   */
+  public void existing(
+      T existingFile, long fileSnapshotId, long dataSequenceNumber, Long fileSequenceNumber) {
+    currentWriter.existing(existingFile, fileSnapshotId, dataSequenceNumber, fileSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add a delete entry for a file.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. However, the original data and
+   * file sequence numbers of the file must be preserved when the file is marked as deleted.
+   *
+   * @param deletedFile a file
+   * @param dataSequenceNumber a data sequence number of the file (assigned when the file was added)
+   * @param fileSequenceNumber a file sequence number (assigned when the file was added)
+   */
+  public void delete(T deletedFile, long dataSequenceNumber, Long fileSequenceNumber) {
+    currentWriter.delete(deletedFile, dataSequenceNumber, fileSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  private void tryRollingToNewFile() {
+    currentFileRows++;
+
+    if (currentWriter.length() > targetFileSizeInBytes) {
+      closeCurrentWriter();
+      openCurrentWriter();
+    }
+  }
+
+  private void openCurrentWriter() {
+    Preconditions.checkState(currentWriter == null, "Current writer has been already initialized");
+
+    this.currentFile = outputFileFactory.newManifestOutput();
+    this.currentFileRows = 0;
+    this.currentWriter = newManifestWriter(formatVersion, snapshotId, spec, currentFile);
+  }
+
+  private void closeCurrentWriter() {
+    if (currentWriter != null) {
+      try {
+        currentWriter.close();
+      } catch (IOException e) {
+        throw new UncheckedIOException("Failed to close current writer", e);
+      }
+
+      if (currentFileRows == 0L) {

Review Comment:
   I feel like this class can be simplified by not dealing with `FileIO` and manifest creation. Also, I am not sure there is a lot of value in implementing `FileAppender` given that we don't implement length and metrics (I don't think we should). What about something like this?
   
   ```
   public class RollingManifestWriter<F extends ContentFile<F>> implements Closeable {
   
     private static final int ROWS_DIVISOR = SOME_VALUE;
   
     private final Supplier<ManifestWriter<F>> manifestWriterSupplier;
     private final long targetFileSizeInBytes;
     private final List<ManifestFile> manifestFiles;
   
     private long currentFileRows = 0;
     private ManifestWriter<F> currentWriter = null;
   
     private boolean closed = false;
   
     public RollingManifestWriter(
         Supplier<ManifestWriter<F>> manifestWriterSupplier,
         long targetFileSizeInBytes) {
       this.manifestWriterSupplier = manifestWriterSupplier;
       this.targetFileSizeInBytes = targetFileSizeInBytes;
       this.manifestFiles = Lists.newArrayList();
     }
   
     public void add(F addedFile) {
       currentWriter().add(addedFile);
       currentFileRows++;
     }
   
     ...
   
     private ManifestWriter<F> currentWriter() {
       if (currentWriter == null) {
         this.currentWriter = manifestWriterSupplier.get();
       } else if (shouldRollToNewFile()) {
         closeCurrentWriter();
         this.currentWriter = manifestWriterSupplier.get();
       }
   
       return currentWriter;
     }
   
     private boolean shouldRollToNewFile() {
       return currentFileRows % ROWS_DIVISOR == 0 && currentWriter.length() >= targetFileSizeInBytes;
     }
   }
   ```
   
   By migrating to a lazy writer, we don't have to check if the current file is empty and delete it on close. It should be enough to have a manifest writer supplier and the target manifest size. That way, this class won't need to change if we change the signature of how the underlying manifest writers are created. 
   
   Calling this in `SnapshotProducer` would be trivial.
   
   ```
   protected RollingManifestWriter<DataFile> newRollingManifestWriter(PartitionSpec spec) {
     return new RollingManifestWriter<>(() -> newManifestWriter(spec), targetManifestSize);
   }
   
   protected RollingManifestWriter<DeleteFile> newRollingDeleteManifestWriter(PartitionSpec spec) {
     return new RollingManifestWriter<>(() -> newDeleteManifestWriter(spec), targetManifestSize);
   }
   ```
   
   We may skip changes in `ManifestFiles` and drop `ManifestOutputFileFactory` for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1277338905


##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +178,28 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete newManifests that have not been committed and clear them from the list.

Review Comment:
   This is used to explain the above comments about why keep the `committed` here. I just noticed the doc in the parent class. Removed it since there is already an elaborate doc.



##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +178,28 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete newManifests that have not been committed and clear them from the list.
+      // This is needed for manifests cleanup especially in transaction mode, for example:
+      //   Transaction txn = beginTransaction(...)
+      //
+      //   // Operation succeeds and calls cleanUncommitted
+      //   txn.newFastAppend().appendFile(...).commit();
+      //
+      //   // Some other operations ...
+      //
+      //   // Commit fails and needs to clean up newManifests
+      //   txn.commitTransaction()
+      List<ManifestFile> cleanedNewManifests = Lists.newArrayList();
+      for (ManifestFile manifestFile : newManifests) {
+        if (!committed.contains(manifestFile)) {

Review Comment:
   Updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1277334797


##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +178,28 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete newManifests that have not been committed and clear them from the list.
+      // This is needed for manifests cleanup especially in transaction mode, for example:
+      //   Transaction txn = beginTransaction(...)
+      //
+      //   // Operation succeeds and calls cleanUncommitted
+      //   txn.newFastAppend().appendFile(...).commit();
+      //
+      //   // Some other operations ...
+      //
+      //   // Commit fails and needs to clean up newManifests
+      //   txn.commitTransaction()
+      List<ManifestFile> cleanedNewManifests = Lists.newArrayList();

Review Comment:
   updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1039689716


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -838,9 +839,17 @@ public Object updateEvent() {
   }
 
   private void cleanUncommittedAppends(Set<ManifestFile> committed) {
-    if (cachedNewManifest != null && !committed.contains(cachedNewManifest)) {
-      deleteFile(cachedNewManifest.path());
-      this.cachedNewManifest = null;
+    if (cachedNewManifests != null) {
+      List<ManifestFile> cleanedNewManifests = Lists.newArrayList();
+      for (ManifestFile manifestFile : cachedNewManifests) {
+        if (!committed.contains(manifestFile)) {
+          deleteFile(manifestFile.path());
+        } else {
+          cleanedNewManifests.add(manifestFile);
+        }
+      }
+
+      this.cachedNewManifests = cleanedNewManifests;

Review Comment:
   From the failed UTs, it seems it could call the `cleanUncommittedAppends` multiple times when retrying the commit. And we need to make sure the previous wroten manifest file is deleted.



##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -948,13 +949,10 @@ private List<ManifestFile> newDeleteFilesAsManifests() {
           (specId, deleteFiles) -> {
             PartitionSpec spec = ops.current().spec(specId);
             try {
-              ManifestWriter<DeleteFile> writer = newDeleteManifestWriter(spec);
-              try {
-                writer.addAll(deleteFiles);
-              } finally {
-                writer.close();
-              }
-              cachedNewDeleteManifests.add(writer.toManifestFile());
+              List<ManifestFile> manifestFiles =

Review Comment:
   updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#issuecomment-1650966712

   I will take a look with fresh eyes by the end of tomorrow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1275760159


##########
core/src/main/java/org/apache/iceberg/ManifestOutputFileFactory.java:
##########
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.concurrent.atomic.AtomicInteger;
+import org.apache.iceberg.io.OutputFile;
+
+class ManifestOutputFileFactory {

Review Comment:
   Question: Do we really need this class given that `SnapshotProducer` previously kept the manifest count? Can we consider using `Supplier<OutputFile> outputFileSupplier` as an argument to the rolling writer?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1276752392


##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +178,28 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete newManifests that have not been committed and clear them from the list.
+      // This is needed for manifests cleanup especially in transaction mode, for example:
+      //   Transaction txn = beginTransaction(...)
+      //
+      //   // Operation succeeds and calls cleanUncommitted
+      //   txn.newFastAppend().appendFile(...).commit();
+      //
+      //   // Some other operations ...
+      //
+      //   // Commit fails and needs to clean up newManifests
+      //   txn.commitTransaction()
+      List<ManifestFile> cleanedNewManifests = Lists.newArrayList();

Review Comment:
   Shouldn't this be called `committedNewManifests`? The name suggests it holds a set of cleaned up manifests (which I interpret as deleted) while it is the opposite of that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1261307875


##########
core/src/main/java/org/apache/iceberg/ManifestFiles.java:
##########
@@ -167,6 +167,30 @@ public static ManifestWriter<DataFile> write(
         "Cannot write manifest for table version: " + formatVersion);
   }
 
+  /**
+   * Create a new {@link RollingManifestWriter} for the given format version.
+   *
+   * @param formatVersion a target format version
+   * @param snapshotId a snapshot ID for the manifest entries, or null for an inherited ID
+   * @param spec a {@link PartitionSpec}
+   * @param io a {@link FileIO}
+   * @param outputFileFactory a {@link ManifestOutputFileFactory} to generate the manifest output
+   *     file
+   * @param targetFileSizeInBytes the target file size for manifest file, and will create a new one

Review Comment:
   changed to `the target file size for manifest files`



##########
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java:
##########
@@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest file. */

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1276528090


##########
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java:
##########
@@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest files. */
+abstract class RollingManifestWriter<T extends ContentFile<T>> implements FileAppender<T> {
+  private final int formatVersion;
+  private final Long snapshotId;
+  private final PartitionSpec spec;
+  private final FileIO io;
+  private final ManifestOutputFileFactory outputFileFactory;
+  private final long targetFileSizeInBytes;
+  private final List<ManifestFile> manifestFiles;
+
+  private long currentFileRows = 0;
+  private OutputFile currentFile;
+  private ManifestWriter<T> currentWriter = null;
+
+  private boolean closed = false;
+
+  RollingManifestWriter(
+      int formatVersion,
+      Long snapshotId,
+      PartitionSpec spec,
+      FileIO io,
+      ManifestOutputFileFactory outputFileFactory,
+      long targetFileSizeInBytes) {
+    this.formatVersion = formatVersion;
+    this.snapshotId = snapshotId;
+    this.spec = spec;
+    this.io = io;
+    this.outputFileFactory = outputFileFactory;
+    this.targetFileSizeInBytes = targetFileSizeInBytes;
+    this.manifestFiles = Lists.newArrayList();
+
+    openCurrentWriter();
+  }
+
+  protected abstract ManifestWriter<T> newManifestWriter(
+      int targetFormatVersion,
+      Long targetSnapshotId,
+      PartitionSpec targetSpec,
+      OutputFile targetManifestPath);
+
+  /**
+   * Add an added entry for a file.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. The data and file sequence
+   * numbers will be assigned at commit.
+   *
+   * @param addedFile a data file
+   */
+  @Override
+  public void add(T addedFile) {
+    currentWriter.add(addedFile);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add an added entry for a file with a specific sequence number.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. The entry's data sequence
+   * number will be the provided data sequence number. The entry's file sequence number will be
+   * assigned at commit.
+   *
+   * @param addedFile a data file
+   * @param dataSequenceNumber a data sequence number for the file
+   */
+  public void add(T addedFile, long dataSequenceNumber) {
+    currentWriter.add(addedFile, dataSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add an existing entry for a file.
+   *
+   * <p>The original data and file sequence numbers, snapshot ID, which were assigned at commit,
+   * must be preserved when adding an existing entry.
+   *
+   * @param existingFile a file
+   * @param fileSnapshotId snapshot ID when the data file was added to the table
+   * @param dataSequenceNumber a data sequence number of the file (assigned when the file was added)
+   * @param fileSequenceNumber a file sequence number (assigned when the file was added)
+   */
+  public void existing(
+      T existingFile, long fileSnapshotId, long dataSequenceNumber, Long fileSequenceNumber) {
+    currentWriter.existing(existingFile, fileSnapshotId, dataSequenceNumber, fileSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add a delete entry for a file.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. However, the original data and
+   * file sequence numbers of the file must be preserved when the file is marked as deleted.
+   *
+   * @param deletedFile a file
+   * @param dataSequenceNumber a data sequence number of the file (assigned when the file was added)
+   * @param fileSequenceNumber a file sequence number (assigned when the file was added)
+   */
+  public void delete(T deletedFile, long dataSequenceNumber, Long fileSequenceNumber) {
+    currentWriter.delete(deletedFile, dataSequenceNumber, fileSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  private void tryRollingToNewFile() {
+    currentFileRows++;
+
+    if (currentWriter.length() > targetFileSizeInBytes) {

Review Comment:
   Could you check the average cost of an entry in a manifest in some of your real tables? My assumption was something around 250 but let's check the average size of an entry.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1277784370


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -907,9 +907,17 @@ public Object updateEvent() {
   }
 
   private void cleanUncommittedAppends(Set<ManifestFile> committed) {
-    if (cachedNewDataManifest != null && !committed.contains(cachedNewDataManifest)) {
-      deleteFile(cachedNewDataManifest.path());
-      this.cachedNewDataManifest = null;
+    if (cachedNewDataManifests != null) {
+      List<ManifestFile> committedNewManifests = Lists.newArrayList();

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1277339559


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -907,9 +907,28 @@ public Object updateEvent() {
   }
 
   private void cleanUncommittedAppends(Set<ManifestFile> committed) {
-    if (cachedNewDataManifest != null && !committed.contains(cachedNewDataManifest)) {
-      deleteFile(cachedNewDataManifest.path());
-      this.cachedNewDataManifest = null;
+    if (cachedNewDataManifests != null) {
+      // Delete newManifests that have not been committed and clear them from the list.

Review Comment:
   Updated as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1277339813


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -952,10 +971,8 @@ protected void cleanUncommitted(Set<ManifestFile> committed) {
   private Iterable<ManifestFile> prepareNewDataManifests() {
     Iterable<ManifestFile> newManifests;
     if (newDataFiles.size() > 0) {
-      ManifestFile newManifest = newDataFilesAsManifest();
-      newManifests =
-          Iterables.concat(
-              ImmutableList.of(newManifest), appendManifests, rewrittenAppendManifests);
+      List<ManifestFile> newManifest = newDataFilesAsManifests();

Review Comment:
   Yeah, updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by GitBox <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1039115360


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -948,13 +949,10 @@ private List<ManifestFile> newDeleteFilesAsManifests() {
           (specId, deleteFiles) -> {
             PartitionSpec spec = ops.current().spec(specId);
             try {
-              ManifestWriter<DeleteFile> writer = newDeleteManifestWriter(spec);
-              try {
-                writer.addAll(deleteFiles);
-              } finally {
-                writer.close();
-              }
-              cachedNewDeleteManifests.add(writer.toManifestFile());
+              List<ManifestFile> manifestFiles =

Review Comment:
   Nit: Naming, deleteManifests?



##########
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##########
@@ -499,6 +501,40 @@ protected long snapshotId() {
     return snapshotId;
   }
 
+  protected static <F extends ContentFile<F>> List<ManifestFile> writeFilesToManifests(
+      Iterable<F> files,
+      Supplier<ManifestWriter<F>> createWriter,
+      Long newFilesSequenceNumber,
+      long targetSizeBytes)
+      throws IOException {
+    List<ManifestFile> result = Lists.newArrayList();
+    Iterator<F> fileIterator = files.iterator();
+    ManifestWriter<F> writer = null;

Review Comment:
   I think the writer needs to be wrapped in a try with resources to guarantee it gets closed. Also a nit which will be solved with try with resources: Could we initialize the writer upfront? 



##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -838,9 +839,17 @@ public Object updateEvent() {
   }
 
   private void cleanUncommittedAppends(Set<ManifestFile> committed) {
-    if (cachedNewManifest != null && !committed.contains(cachedNewManifest)) {
-      deleteFile(cachedNewManifest.path());
-      this.cachedNewManifest = null;
+    if (cachedNewManifests != null) {
+      List<ManifestFile> cleanedNewManifests = Lists.newArrayList();
+      for (ManifestFile manifestFile : cachedNewManifests) {
+        if (!committed.contains(manifestFile)) {
+          deleteFile(manifestFile.path());
+        } else {
+          cleanedNewManifests.add(manifestFile);
+        }
+      }
+
+      this.cachedNewManifests = cleanedNewManifests;

Review Comment:
   Sorry not sure if I follow the logic, why does cachedNewManifests need to be initialized to cleanedNewManifests?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1039686191


##########
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##########
@@ -499,6 +501,40 @@ protected long snapshotId() {
     return snapshotId;
   }
 
+  protected static <F extends ContentFile<F>> List<ManifestFile> writeFilesToManifests(
+      Iterable<F> files,
+      Supplier<ManifestWriter<F>> createWriter,
+      Long newFilesSequenceNumber,
+      long targetSizeBytes)
+      throws IOException {
+    List<ManifestFile> result = Lists.newArrayList();
+    Iterator<F> fileIterator = files.iterator();
+    ManifestWriter<F> writer = null;

Review Comment:
   Updated with try-finally, while try-with-resources seems not easy to implement.
   
   > Could we initialize the writer upfront?
   
   There are some UTs failed if do that. I tried the following:
   ```java
       try {
         writer = createWriter.get();
         while (fileIterator.hasNext()) {
           if (writer.length() >= targetSizeBytes) { 
             // here could produce an empty file because the intialize size of the newly writer could be larger than targetSizeBytes
             writer.close();
             result.add(writer.toManifestFile());
             writer = createWriter.get();
           }
   
           F file = fileIterator.next();
           if (newFilesSequenceNumber == null) {
             writer.add(file);
           } else {
             writer.add(file, newFilesSequenceNumber);
           }
         }
       } finally {
         if (writer != null) {
           writer.close();
         }
       }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#issuecomment-1545281185

   @rdblue @RussellSpitzer @nastra @amogh-jahagirdar Would you mind taking another look at this when you are free? Thanks a lot.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#issuecomment-1654626170

   I feel like this would be super valuable for CTAS kind of use cases where a large chunk of data is being added at the same time. I had a few suggestions on how to simplify the rolling writer.
   
   Nice work, @ConeyLiu!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1275761355


##########
core/src/main/java/org/apache/iceberg/ManifestFiles.java:
##########
@@ -167,6 +167,29 @@ public static ManifestWriter<DataFile> write(
         "Cannot write manifest for table version: " + formatVersion);
   }
 
+  /**
+   * Create a new {@link RollingManifestWriter} for the given format version.
+   *
+   * @param formatVersion a target format version
+   * @param snapshotId a snapshot ID for the manifest entries, or null for an inherited ID
+   * @param spec a {@link PartitionSpec}
+   * @param io a {@link FileIO}
+   * @param outputFileFactory a {@link ManifestOutputFileFactory} to generate the manifest output
+   *     file
+   * @param targetFileSizeInBytes the target file size for manifest files
+   * @return a rolling manifest writer which could generate multiple manifest files
+   */
+  public static RollingManifestWriter<DataFile> rollingWrite(

Review Comment:
   Hm, the method is public but the exposed class is package-private. Let me think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#issuecomment-1613877213

   Yea sorry let me try to review this tomorrow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#issuecomment-1656002869

   @aokolnychyi thanks for the details reviewing. The empty file checking is not necessary. Reverted related code changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1258991416


##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +184,25 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete newManifests that have not been committed and clear them from the list.
+      // This is needed for manifests cleanup especially in transaction mode, for example:
+      //   Transaction txn = beginTransaction(...)
+      //   // Operation succeeds and calls cleanUncommitted'
+      //   txn.newFastAppend().appendFile(...).commit();
+      //   // Some other operations ...
+      //   // Commit fails and needs to clean up newManifests

Review Comment:
   Nit: should this comment be after next one?  (seems comments refer to the line above)



##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -49,8 +51,9 @@ class FastAppend extends SnapshotProducer<AppendFiles> implements AppendFiles {
   private final List<DataFile> newFiles = Lists.newArrayList();
   private final List<ManifestFile> appendManifests = Lists.newArrayList();
   private final List<ManifestFile> rewrittenAppendManifests = Lists.newArrayList();
-  private ManifestFile newManifest = null;
+  private List<ManifestFile> newManifests = null;
   private boolean hasNewFiles = false;
+  private final long targetSizeBytes;

Review Comment:
   Nit: put the final with the other final vars?



##########
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java:
##########
@@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest file. */

Review Comment:
   Nit file => files



##########
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java:
##########
@@ -0,0 +1,232 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest file. */
+abstract class RollingManifestWriter<T extends ContentFile<T>> implements FileAppender<T> {

Review Comment:
   Thanks, this definitely looks cleaner. 
   
   Im just not sure what to do with existing hierarchy (ManifestWriter), as there are many duplications.  
   
   Did we think about modifying ManifestWriter, to take extra variable 'targetFileSizeBytes', then trigger tryRollover only in this case.
   
   The problem may be the method manifestFile() in writer?  It should have been manifestFiles().  Maybe we can deprecate existing method?
   
   CC @aokolnychyi .



##########
core/src/main/java/org/apache/iceberg/ManifestFiles.java:
##########
@@ -167,6 +167,30 @@ public static ManifestWriter<DataFile> write(
         "Cannot write manifest for table version: " + formatVersion);
   }
 
+  /**
+   * Create a new {@link RollingManifestWriter} for the given format version.
+   *
+   * @param formatVersion a target format version
+   * @param snapshotId a snapshot ID for the manifest entries, or null for an inherited ID
+   * @param spec a {@link PartitionSpec}
+   * @param io a {@link FileIO}
+   * @param outputFileFactory a {@link ManifestOutputFileFactory} to generate the manifest output
+   *     file
+   * @param targetFileSizeInBytes the target file size for manifest file, and will create a new one

Review Comment:
   Nit: i would say, 'the target file size for manifest files' is good enough, it's a bit complicated to explain more.
   
   Or if we need, at least fix the file => files.  
   `the target size for manifest files.  This writer will create a new manifest file when the current one has reached this size'



##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +184,25 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete newManifests that have not been committed and clear them from the list.
+      // This is needed for manifests cleanup especially in transaction mode, for example:
+      //   Transaction txn = beginTransaction(...)
+      //   // Operation succeeds and calls cleanUncommitted'

Review Comment:
   Nit: typo (trailing ')?



##########
core/src/main/java/org/apache/iceberg/ManifestFiles.java:
##########
@@ -167,6 +167,30 @@ public static ManifestWriter<DataFile> write(
         "Cannot write manifest for table version: " + formatVersion);
   }
 
+  /**
+   * Create a new {@link RollingManifestWriter} for the given format version.
+   *
+   * @param formatVersion a target format version
+   * @param snapshotId a snapshot ID for the manifest entries, or null for an inherited ID
+   * @param spec a {@link PartitionSpec}
+   * @param io a {@link FileIO}
+   * @param outputFileFactory a {@link ManifestOutputFileFactory} to generate the manifest output
+   *     file
+   * @param targetFileSizeInBytes the target file size for manifest file, and will create a new one
+   *     when reached
+   * @return a rolling manifest writer which could generate multiple manifest file

Review Comment:
   Nit: files



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#issuecomment-1357204759

   @rdblue thanks for the reviewing. It happens for the larger table (has several PBs data) and with many columns(thousand columns which is very common for log data or feature data). And the size of the DataFile metrics' is increased with the number of columns. So the manifest file could be really large when the user adds into/overwrites a larger number of data (eg, sync hourly data).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1255990149


##########
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java:
##########
@@ -0,0 +1,232 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest file. */
+abstract class RollingManifestWriter<T extends ContentFile<T>> implements FileAppender<T> {

Review Comment:
   @szehon-ho pls take a look if this is the right direction. Here implements `FileAppender` instead of `FileWriter` that's because the `ManifestWriter` implements `FileAppender`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1261306767


##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +184,25 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete newManifests that have not been committed and clear them from the list.
+      // This is needed for manifests cleanup especially in transaction mode, for example:
+      //   Transaction txn = beginTransaction(...)
+      //   // Operation succeeds and calls cleanUncommitted'
+      //   txn.newFastAppend().appendFile(...).commit();
+      //   // Some other operations ...
+      //   // Commit fails and needs to clean up newManifests

Review Comment:
   I add the blank line between the comments to separate them.



##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +184,25 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete newManifests that have not been committed and clear them from the list.
+      // This is needed for manifests cleanup especially in transaction mode, for example:
+      //   Transaction txn = beginTransaction(...)
+      //   // Operation succeeds and calls cleanUncommitted'

Review Comment:
   updated



##########
core/src/main/java/org/apache/iceberg/ManifestFiles.java:
##########
@@ -167,6 +167,30 @@ public static ManifestWriter<DataFile> write(
         "Cannot write manifest for table version: " + formatVersion);
   }
 
+  /**
+   * Create a new {@link RollingManifestWriter} for the given format version.
+   *
+   * @param formatVersion a target format version
+   * @param snapshotId a snapshot ID for the manifest entries, or null for an inherited ID
+   * @param spec a {@link PartitionSpec}
+   * @param io a {@link FileIO}
+   * @param outputFileFactory a {@link ManifestOutputFileFactory} to generate the manifest output
+   *     file
+   * @param targetFileSizeInBytes the target file size for manifest file, and will create a new one
+   *     when reached
+   * @return a rolling manifest writer which could generate multiple manifest file

Review Comment:
   updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1276102390


##########
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java:
##########
@@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest files. */
+abstract class RollingManifestWriter<T extends ContentFile<T>> implements FileAppender<T> {
+  private final int formatVersion;
+  private final Long snapshotId;
+  private final PartitionSpec spec;
+  private final FileIO io;
+  private final ManifestOutputFileFactory outputFileFactory;
+  private final long targetFileSizeInBytes;
+  private final List<ManifestFile> manifestFiles;
+
+  private long currentFileRows = 0;
+  private OutputFile currentFile;
+  private ManifestWriter<T> currentWriter = null;
+
+  private boolean closed = false;
+
+  RollingManifestWriter(
+      int formatVersion,
+      Long snapshotId,
+      PartitionSpec spec,
+      FileIO io,
+      ManifestOutputFileFactory outputFileFactory,
+      long targetFileSizeInBytes) {
+    this.formatVersion = formatVersion;
+    this.snapshotId = snapshotId;
+    this.spec = spec;
+    this.io = io;
+    this.outputFileFactory = outputFileFactory;
+    this.targetFileSizeInBytes = targetFileSizeInBytes;
+    this.manifestFiles = Lists.newArrayList();
+
+    openCurrentWriter();
+  }
+
+  protected abstract ManifestWriter<T> newManifestWriter(
+      int targetFormatVersion,
+      Long targetSnapshotId,
+      PartitionSpec targetSpec,
+      OutputFile targetManifestPath);
+
+  /**
+   * Add an added entry for a file.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. The data and file sequence
+   * numbers will be assigned at commit.
+   *
+   * @param addedFile a data file
+   */
+  @Override
+  public void add(T addedFile) {
+    currentWriter.add(addedFile);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add an added entry for a file with a specific sequence number.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. The entry's data sequence
+   * number will be the provided data sequence number. The entry's file sequence number will be
+   * assigned at commit.
+   *
+   * @param addedFile a data file
+   * @param dataSequenceNumber a data sequence number for the file
+   */
+  public void add(T addedFile, long dataSequenceNumber) {
+    currentWriter.add(addedFile, dataSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add an existing entry for a file.
+   *
+   * <p>The original data and file sequence numbers, snapshot ID, which were assigned at commit,
+   * must be preserved when adding an existing entry.
+   *
+   * @param existingFile a file
+   * @param fileSnapshotId snapshot ID when the data file was added to the table
+   * @param dataSequenceNumber a data sequence number of the file (assigned when the file was added)
+   * @param fileSequenceNumber a file sequence number (assigned when the file was added)
+   */
+  public void existing(
+      T existingFile, long fileSnapshotId, long dataSequenceNumber, Long fileSequenceNumber) {
+    currentWriter.existing(existingFile, fileSnapshotId, dataSequenceNumber, fileSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add a delete entry for a file.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. However, the original data and
+   * file sequence numbers of the file must be preserved when the file is marked as deleted.
+   *
+   * @param deletedFile a file
+   * @param dataSequenceNumber a data sequence number of the file (assigned when the file was added)
+   * @param fileSequenceNumber a file sequence number (assigned when the file was added)
+   */
+  public void delete(T deletedFile, long dataSequenceNumber, Long fileSequenceNumber) {
+    currentWriter.delete(deletedFile, dataSequenceNumber, fileSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  private void tryRollingToNewFile() {
+    currentFileRows++;
+
+    if (currentWriter.length() > targetFileSizeInBytes) {

Review Comment:
   How about checking every 16? Just simply calculated as: `1000 / (512 / 8)`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1277739592


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -907,9 +907,17 @@ public Object updateEvent() {
   }
 
   private void cleanUncommittedAppends(Set<ManifestFile> committed) {
-    if (cachedNewDataManifest != null && !committed.contains(cachedNewDataManifest)) {
-      deleteFile(cachedNewDataManifest.path());
-      this.cachedNewDataManifest = null;
+    if (cachedNewDataManifests != null) {
+      List<ManifestFile> committedNewManifests = Lists.newArrayList();

Review Comment:
   Optional: `committedNewManifests` -> `committedNewDataManifests`?



##########
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java:
##########
@@ -0,0 +1,167 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import java.util.function.Supplier;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest files. */
+public class RollingManifestWriter<F extends ContentFile<F>> implements Closeable {
+  private static final int ROWS_DIVISOR = 250;
+
+  private final FileIO fileIO;
+  private final Supplier<ManifestWriter<F>> manifestWriterSupplier;
+  private final long targetFileSizeInBytes;
+  private final List<ManifestFile> manifestFiles;
+
+  private long currentFileRows = 0;
+  private ManifestWriter<F> currentWriter = null;
+
+  private boolean closed = false;
+
+  public RollingManifestWriter(
+      FileIO fileIO,

Review Comment:
   Do you think it is still needed? We switched to a lazy writer that is initialized only when there is a new entry to write. The same also applies to cases when we roll to a new file. It can only happen if there is a pending entry to write. If I am not missing anything, we can drop `FileIO` here and simplify `closeCurrentWriter`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#issuecomment-1656124272

   Thanks, @ConeyLiu! Thanks everyone for reviewing!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zinking commented on pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "zinking (via GitHub)" <gi...@apache.org>.
zinking commented on PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#issuecomment-1571431793

   @rdblue @RussellSpitzer @szehon-ho  can we have another look at this review ?  thanks.  large manifests have huge performance penalties for PB tables. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1251154568


##########
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##########
@@ -513,6 +515,38 @@ protected long snapshotId() {
     return snapshotId;
   }
 
+  protected static <F extends ContentFile<F>> List<ManifestFile> writeFilesToManifests(

Review Comment:
   Thanks @ConeyLiu !  really appreciate the work



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1255992620


##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +184,25 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete those manifests are not contained in the committed set and keep others as here.
+      // This is needed for manifests cleanup especially in transaction mode, for example:
+      //   Transaction txn = beginTransaction(...)
+      //   // success operation and call this cleanUncommittedAppends
+      //   txn.newFastAppend().appendFile(...).commit();
+      //   some other operations ...
+      //   // commit failed need to clean up those successes committed manifests

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1276099633


##########
core/src/main/java/org/apache/iceberg/ManifestOutputFileFactory.java:
##########
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.concurrent.atomic.AtomicInteger;
+import org.apache.iceberg.io.OutputFile;
+
+class ManifestOutputFileFactory {

Review Comment:
   I want to avoid using the lambda function. It should be OK to use a `Supplier`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1277785627


##########
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java:
##########
@@ -0,0 +1,167 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import java.util.function.Supplier;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest files. */
+public class RollingManifestWriter<F extends ContentFile<F>> implements Closeable {
+  private static final int ROWS_DIVISOR = 250;
+
+  private final FileIO fileIO;
+  private final Supplier<ManifestWriter<F>> manifestWriterSupplier;
+  private final long targetFileSizeInBytes;
+  private final List<ManifestFile> manifestFiles;
+
+  private long currentFileRows = 0;
+  private ManifestWriter<F> currentWriter = null;
+
+  private boolean closed = false;
+
+  public RollingManifestWriter(
+      FileIO fileIO,

Review Comment:
   Sorry my bad, the empty file checking is not necessary. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1277359630


##########
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java:
##########
@@ -0,0 +1,167 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import java.util.function.Supplier;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest files. */
+public class RollingManifestWriter<F extends ContentFile<F>> implements Closeable {
+  private static final int ROWS_DIVISOR = 250;
+
+  private final FileIO fileIO;
+  private final Supplier<ManifestWriter<F>> manifestWriterSupplier;
+  private final long targetFileSizeInBytes;
+  private final List<ManifestFile> manifestFiles;
+
+  private long currentFileRows = 0;
+  private ManifestWriter<F> currentWriter = null;
+
+  private boolean closed = false;
+
+  public RollingManifestWriter(
+      FileIO fileIO,

Review Comment:
   Used to delete the empty file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#issuecomment-1339410922

   @nastra `newAppend ` will create a large manifest as well. It only merges small manifests into a large one, and does not split large one into target file size.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1051687190


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -896,33 +903,27 @@ private Iterable<ManifestFile> prepareNewManifests() {
         manifest -> GenericManifestFile.copyOf(manifest).withSnapshotId(snapshotId()).build());
   }
 
-  private ManifestFile newFilesAsManifest() {
-    if (hasNewFiles && cachedNewManifest != null) {
-      deleteFile(cachedNewManifest.path());
-      cachedNewManifest = null;
+  private List<ManifestFile> newFilesAsManifest() {

Review Comment:
   This should also fix method naming. If a method returns more than one item, it should be plural.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1039686191


##########
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##########
@@ -499,6 +501,40 @@ protected long snapshotId() {
     return snapshotId;
   }
 
+  protected static <F extends ContentFile<F>> List<ManifestFile> writeFilesToManifests(
+      Iterable<F> files,
+      Supplier<ManifestWriter<F>> createWriter,
+      Long newFilesSequenceNumber,
+      long targetSizeBytes)
+      throws IOException {
+    List<ManifestFile> result = Lists.newArrayList();
+    Iterator<F> fileIterator = files.iterator();
+    ManifestWriter<F> writer = null;

Review Comment:
   ~~Updated with try-finally, while try-with-resources seems not easy to implement.
   
   > Could we initialize the writer upfront?
   
   There are some UTs failed if do that. I tried the following:
   ```java
       try {
         writer = createWriter.get();
         while (fileIterator.hasNext()) {
           if (writer.length() >= targetSizeBytes) { 
             // here could produce an empty file because the intialize size of the newly writer could be larger than targetSizeBytes
             writer.close();
             result.add(writer.toManifestFile());
             writer = createWriter.get();
           }
   
           F file = fileIterator.next();
           if (newFilesSequenceNumber == null) {
             writer.add(file);
           } else {
             writer.add(file, newFilesSequenceNumber);
           }
         }
       } finally {
         if (writer != null) {
           writer.close();
         }
       } 
   ```~~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1039686191


##########
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##########
@@ -499,6 +501,40 @@ protected long snapshotId() {
     return snapshotId;
   }
 
+  protected static <F extends ContentFile<F>> List<ManifestFile> writeFilesToManifests(
+      Iterable<F> files,
+      Supplier<ManifestWriter<F>> createWriter,
+      Long newFilesSequenceNumber,
+      long targetSizeBytes)
+      throws IOException {
+    List<ManifestFile> result = Lists.newArrayList();
+    Iterator<F> fileIterator = files.iterator();
+    ManifestWriter<F> writer = null;

Review Comment:
   ~~Updated with try-finally, while try-with-resources seems not easy to implement.~~
   



##########
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##########
@@ -499,6 +501,40 @@ protected long snapshotId() {
     return snapshotId;
   }
 
+  protected static <F extends ContentFile<F>> List<ManifestFile> writeFilesToManifests(
+      Iterable<F> files,
+      Supplier<ManifestWriter<F>> createWriter,
+      Long newFilesSequenceNumber,
+      long targetSizeBytes)
+      throws IOException {
+    List<ManifestFile> result = Lists.newArrayList();
+    Iterator<F> fileIterator = files.iterator();
+    ManifestWriter<F> writer = null;

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1191989826


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -838,9 +839,17 @@ public Object updateEvent() {
   }
 
   private void cleanUncommittedAppends(Set<ManifestFile> committed) {
-    if (cachedNewManifest != null && !committed.contains(cachedNewManifest)) {
-      deleteFile(cachedNewManifest.path());
-      this.cachedNewManifest = null;
+    if (cachedNewManifests != null) {
+      List<ManifestFile> cleanedNewManifests = Lists.newArrayList();
+      for (ManifestFile manifestFile : cachedNewManifests) {
+        if (!committed.contains(manifestFile)) {
+          deleteFile(manifestFile.path());
+        } else {
+          cleanedNewManifests.add(manifestFile);
+        }
+      }
+
+      this.cachedNewManifests = cleanedNewManifests;

Review Comment:
   Added the comments.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1275766984


##########
core/src/main/java/org/apache/iceberg/ManifestFiles.java:
##########
@@ -167,6 +167,29 @@ public static ManifestWriter<DataFile> write(
         "Cannot write manifest for table version: " + formatVersion);
   }
 
+  /**
+   * Create a new {@link RollingManifestWriter} for the given format version.
+   *
+   * @param formatVersion a target format version
+   * @param snapshotId a snapshot ID for the manifest entries, or null for an inherited ID
+   * @param spec a {@link PartitionSpec}
+   * @param io a {@link FileIO}
+   * @param outputFileFactory a {@link ManifestOutputFileFactory} to generate the manifest output
+   *     file
+   * @param targetFileSizeInBytes the target file size for manifest files
+   * @return a rolling manifest writer which could generate multiple manifest files
+   */
+  public static RollingManifestWriter<DataFile> rollingWrite(

Review Comment:
   Are we sure this is the best way to expose rolling writers in general? Have we considered a single parameterized rolling writer that accepts a factory of manifest writers?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1277339244


##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +178,28 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete newManifests that have not been committed and clear them from the list.
+      // This is needed for manifests cleanup especially in transaction mode, for example:
+      //   Transaction txn = beginTransaction(...)
+      //
+      //   // Operation succeeds and calls cleanUncommitted
+      //   txn.newFastAppend().appendFile(...).commit();
+      //
+      //   // Some other operations ...
+      //
+      //   // Commit fails and needs to clean up newManifests
+      //   txn.commitTransaction()
+      List<ManifestFile> cleanedNewManifests = Lists.newArrayList();
+      for (ManifestFile manifestFile : newManifests) {

Review Comment:
   updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1277341143


##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -143,12 +143,12 @@ private ManifestFile copyManifest(ManifestFile manifest) {
 
   @Override
   public List<ManifestFile> apply(TableMetadata base, Snapshot snapshot) {
-    List<ManifestFile> newManifests = Lists.newArrayList();

Review Comment:
   Updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1255992322


##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +184,25 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete those manifests are not contained in the committed set and keep others as here.

Review Comment:
   Updated



##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +184,25 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete those manifests are not contained in the committed set and keep others as here.
+      // This is needed for manifests cleanup especially in transaction mode, for example:
+      //   Transaction txn = beginTransaction(...)
+      //   // success operation and call this cleanUncommittedAppends
+      //   txn.newFastAppend().appendFile(...).commit();
+      //   some other operations ...

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1259125410


##########
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java:
##########
@@ -0,0 +1,232 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest file. */
+abstract class RollingManifestWriter<T extends ContentFile<T>> implements FileAppender<T> {

Review Comment:
   Actually on second read, this may be ok, as its a composition, and not duplicating that much code.  Lets see if Anton has any comments



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1248320316


##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +184,25 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete those manifests are not contained in the committed set and keep others as here.

Review Comment:
   'keep others as here' is a bit ambigious, how about?
   'Delete newManifests that have not been committed and clear them from the list'



##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +184,25 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete those manifests are not contained in the committed set and keep others as here.
+      // This is needed for manifests cleanup especially in transaction mode, for example:
+      //   Transaction txn = beginTransaction(...)
+      //   // success operation and call this cleanUncommittedAppends

Review Comment:
    'operation succeeds and calls cleanUncommitted'?



##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +184,25 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete those manifests are not contained in the committed set and keep others as here.
+      // This is needed for manifests cleanup especially in transaction mode, for example:
+      //   Transaction txn = beginTransaction(...)
+      //   // success operation and call this cleanUncommittedAppends
+      //   txn.newFastAppend().appendFile(...).commit();
+      //   some other operations ...

Review Comment:
   some other operations need an additional // before for consistency?



##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +184,25 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete those manifests are not contained in the committed set and keep others as here.
+      // This is needed for manifests cleanup especially in transaction mode, for example:
+      //   Transaction txn = beginTransaction(...)
+      //   // success operation and call this cleanUncommittedAppends
+      //   txn.newFastAppend().appendFile(...).commit();
+      //   some other operations ...
+      //   // commit failed need to clean up those successes committed manifests

Review Comment:
   'Commit fails and needs to clean up newManifests' ?  (they are not committed , right?)



##########
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##########
@@ -513,6 +515,38 @@ protected long snapshotId() {
     return snapshotId;
   }
 
+  protected static <F extends ContentFile<F>> List<ManifestFile> writeFilesToManifests(

Review Comment:
   I feel like , its easier to read and fits more Iceberg pattern if the logic is inside ManifestWriter (similar to RollingFileWriter), is that possible?  Can also eliminate the double method for data/delete file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1051870044


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -896,33 +903,27 @@ private Iterable<ManifestFile> prepareNewManifests() {
         manifest -> GenericManifestFile.copyOf(manifest).withSnapshotId(snapshotId()).build());
   }
 
-  private ManifestFile newFilesAsManifest() {
-    if (hasNewFiles && cachedNewManifest != null) {
-      deleteFile(cachedNewManifest.path());
-      cachedNewManifest = null;
+  private List<ManifestFile> newFilesAsManifest() {

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1277334232


##########
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java:
##########
@@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest files. */
+abstract class RollingManifestWriter<T extends ContentFile<T>> implements FileAppender<T> {
+  private final int formatVersion;
+  private final Long snapshotId;
+  private final PartitionSpec spec;
+  private final FileIO io;
+  private final ManifestOutputFileFactory outputFileFactory;
+  private final long targetFileSizeInBytes;
+  private final List<ManifestFile> manifestFiles;
+
+  private long currentFileRows = 0;
+  private OutputFile currentFile;
+  private ManifestWriter<T> currentWriter = null;
+
+  private boolean closed = false;
+
+  RollingManifestWriter(
+      int formatVersion,
+      Long snapshotId,
+      PartitionSpec spec,
+      FileIO io,
+      ManifestOutputFileFactory outputFileFactory,
+      long targetFileSizeInBytes) {
+    this.formatVersion = formatVersion;
+    this.snapshotId = snapshotId;
+    this.spec = spec;
+    this.io = io;
+    this.outputFileFactory = outputFileFactory;
+    this.targetFileSizeInBytes = targetFileSizeInBytes;
+    this.manifestFiles = Lists.newArrayList();
+
+    openCurrentWriter();
+  }
+
+  protected abstract ManifestWriter<T> newManifestWriter(
+      int targetFormatVersion,
+      Long targetSnapshotId,
+      PartitionSpec targetSpec,
+      OutputFile targetManifestPath);
+
+  /**
+   * Add an added entry for a file.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. The data and file sequence
+   * numbers will be assigned at commit.
+   *
+   * @param addedFile a data file
+   */
+  @Override
+  public void add(T addedFile) {
+    currentWriter.add(addedFile);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add an added entry for a file with a specific sequence number.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. The entry's data sequence
+   * number will be the provided data sequence number. The entry's file sequence number will be
+   * assigned at commit.
+   *
+   * @param addedFile a data file
+   * @param dataSequenceNumber a data sequence number for the file
+   */
+  public void add(T addedFile, long dataSequenceNumber) {
+    currentWriter.add(addedFile, dataSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add an existing entry for a file.
+   *
+   * <p>The original data and file sequence numbers, snapshot ID, which were assigned at commit,
+   * must be preserved when adding an existing entry.
+   *
+   * @param existingFile a file
+   * @param fileSnapshotId snapshot ID when the data file was added to the table
+   * @param dataSequenceNumber a data sequence number of the file (assigned when the file was added)
+   * @param fileSequenceNumber a file sequence number (assigned when the file was added)
+   */
+  public void existing(
+      T existingFile, long fileSnapshotId, long dataSequenceNumber, Long fileSequenceNumber) {
+    currentWriter.existing(existingFile, fileSnapshotId, dataSequenceNumber, fileSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add a delete entry for a file.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. However, the original data and
+   * file sequence numbers of the file must be preserved when the file is marked as deleted.
+   *
+   * @param deletedFile a file
+   * @param dataSequenceNumber a data sequence number of the file (assigned when the file was added)
+   * @param fileSequenceNumber a file sequence number (assigned when the file was added)
+   */
+  public void delete(T deletedFile, long dataSequenceNumber, Long fileSequenceNumber) {
+    currentWriter.delete(deletedFile, dataSequenceNumber, fileSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  private void tryRollingToNewFile() {
+    currentFileRows++;
+
+    if (currentWriter.length() > targetFileSizeInBytes) {
+      closeCurrentWriter();
+      openCurrentWriter();
+    }
+  }
+
+  private void openCurrentWriter() {
+    Preconditions.checkState(currentWriter == null, "Current writer has been already initialized");
+
+    this.currentFile = outputFileFactory.newManifestOutput();
+    this.currentFileRows = 0;
+    this.currentWriter = newManifestWriter(formatVersion, snapshotId, spec, currentFile);
+  }
+
+  private void closeCurrentWriter() {
+    if (currentWriter != null) {
+      try {
+        currentWriter.close();
+      } catch (IOException e) {
+        throw new UncheckedIOException("Failed to close current writer", e);
+      }
+
+      if (currentFileRows == 0L) {

Review Comment:
   Much cleaner implementation. Updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1277334568


##########
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java:
##########
@@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest files. */
+abstract class RollingManifestWriter<T extends ContentFile<T>> implements FileAppender<T> {

Review Comment:
   updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1276525397


##########
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java:
##########
@@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest files. */
+abstract class RollingManifestWriter<T extends ContentFile<T>> implements FileAppender<T> {
+  private final int formatVersion;
+  private final Long snapshotId;
+  private final PartitionSpec spec;
+  private final FileIO io;
+  private final ManifestOutputFileFactory outputFileFactory;
+  private final long targetFileSizeInBytes;
+  private final List<ManifestFile> manifestFiles;
+
+  private long currentFileRows = 0;
+  private OutputFile currentFile;
+  private ManifestWriter<T> currentWriter = null;
+
+  private boolean closed = false;
+
+  RollingManifestWriter(
+      int formatVersion,
+      Long snapshotId,
+      PartitionSpec spec,
+      FileIO io,
+      ManifestOutputFileFactory outputFileFactory,
+      long targetFileSizeInBytes) {
+    this.formatVersion = formatVersion;
+    this.snapshotId = snapshotId;
+    this.spec = spec;
+    this.io = io;
+    this.outputFileFactory = outputFileFactory;
+    this.targetFileSizeInBytes = targetFileSizeInBytes;
+    this.manifestFiles = Lists.newArrayList();
+
+    openCurrentWriter();
+  }
+
+  protected abstract ManifestWriter<T> newManifestWriter(
+      int targetFormatVersion,
+      Long targetSnapshotId,
+      PartitionSpec targetSpec,
+      OutputFile targetManifestPath);
+
+  /**
+   * Add an added entry for a file.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. The data and file sequence
+   * numbers will be assigned at commit.
+   *
+   * @param addedFile a data file
+   */
+  @Override
+  public void add(T addedFile) {
+    currentWriter.add(addedFile);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add an added entry for a file with a specific sequence number.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. The entry's data sequence
+   * number will be the provided data sequence number. The entry's file sequence number will be
+   * assigned at commit.
+   *
+   * @param addedFile a data file
+   * @param dataSequenceNumber a data sequence number for the file
+   */
+  public void add(T addedFile, long dataSequenceNumber) {
+    currentWriter.add(addedFile, dataSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add an existing entry for a file.
+   *
+   * <p>The original data and file sequence numbers, snapshot ID, which were assigned at commit,
+   * must be preserved when adding an existing entry.
+   *
+   * @param existingFile a file
+   * @param fileSnapshotId snapshot ID when the data file was added to the table
+   * @param dataSequenceNumber a data sequence number of the file (assigned when the file was added)
+   * @param fileSequenceNumber a file sequence number (assigned when the file was added)
+   */
+  public void existing(
+      T existingFile, long fileSnapshotId, long dataSequenceNumber, Long fileSequenceNumber) {
+    currentWriter.existing(existingFile, fileSnapshotId, dataSequenceNumber, fileSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add a delete entry for a file.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. However, the original data and
+   * file sequence numbers of the file must be preserved when the file is marked as deleted.
+   *
+   * @param deletedFile a file
+   * @param dataSequenceNumber a data sequence number of the file (assigned when the file was added)
+   * @param fileSequenceNumber a file sequence number (assigned when the file was added)
+   */
+  public void delete(T deletedFile, long dataSequenceNumber, Long fileSequenceNumber) {
+    currentWriter.delete(deletedFile, dataSequenceNumber, fileSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  private void tryRollingToNewFile() {
+    currentFileRows++;
+
+    if (currentWriter.length() > targetFileSizeInBytes) {
+      closeCurrentWriter();
+      openCurrentWriter();
+    }
+  }
+
+  private void openCurrentWriter() {
+    Preconditions.checkState(currentWriter == null, "Current writer has been already initialized");
+
+    this.currentFile = outputFileFactory.newManifestOutput();
+    this.currentFileRows = 0;
+    this.currentWriter = newManifestWriter(formatVersion, snapshotId, spec, currentFile);
+  }
+
+  private void closeCurrentWriter() {
+    if (currentWriter != null) {
+      try {
+        currentWriter.close();
+      } catch (IOException e) {
+        throw new UncheckedIOException("Failed to close current writer", e);
+      }
+
+      if (currentFileRows == 0L) {

Review Comment:
   I feel like this class can be simplified by not dealing with `FileIO` and manifest creation. Also, I am not sure there is a lot of value in implementing `FileAppender` given that we don't implement length and metrics (I don't think we should, just saying). What about something like this?
   
   ```
   public class RollingManifestWriter<F extends ContentFile<F>> implements Closeable {
   
     private static final int ROWS_DIVISOR = SOME_VALUE;
   
     private final Supplier<ManifestWriter<F>> manifestWriterSupplier;
     private final long targetFileSizeInBytes;
     private final List<ManifestFile> manifestFiles;
   
     private long currentFileRows = 0;
     private ManifestWriter<F> currentWriter = null;
   
     private boolean closed = false;
   
     public RollingManifestWriter(
         Supplier<ManifestWriter<F>> manifestWriterSupplier,
         long targetFileSizeInBytes) {
       this.manifestWriterSupplier = manifestWriterSupplier;
       this.targetFileSizeInBytes = targetFileSizeInBytes;
       this.manifestFiles = Lists.newArrayList();
     }
   
     public void add(F addedFile) {
       currentWriter().add(addedFile);
       currentFileRows++;
     }
   
     ...
   
     private ManifestWriter<F> currentWriter() {
       if (currentWriter == null) {
         this.currentWriter = manifestWriterSupplier.get();
       } else if (shouldRollToNewFile()) {
         closeCurrentWriter();
         this.currentWriter = manifestWriterSupplier.get();
       }
   
       return currentWriter;
     }
   
     private boolean shouldRollToNewFile() {
       return currentFileRows % ROWS_DIVISOR == 0 && currentWriter.length() >= targetFileSizeInBytes;
     }
   }
   ```
   
   By migrating to a lazy writer, we don't have to check if the current file is empty and delete it on close. It should be enough to have a manifest writer supplier and the target manifest size. That way, this class won't need to change if we change the signature of how the underlying manifest writers are created. 
   
   Calling this in `SnapshotProducer` would be trivial.
   
   ```
   protected RollingManifestWriter<DataFile> newRollingManifestWriter(PartitionSpec spec) {
     return new RollingManifestWriter<>(() -> newManifestWriter(spec), targetManifestSize);
   }
   
   protected RollingManifestWriter<DeleteFile> newRollingDeleteManifestWriter(PartitionSpec spec) {
     return new RollingManifestWriter<>(() -> newDeleteManifestWriter(spec), targetManifestSize);
   }
   ```
   
   We may skip changes in `ManifestFiles` and drop `ManifestOutputFileFactory` for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1276752392


##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +178,28 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete newManifests that have not been committed and clear them from the list.
+      // This is needed for manifests cleanup especially in transaction mode, for example:
+      //   Transaction txn = beginTransaction(...)
+      //
+      //   // Operation succeeds and calls cleanUncommitted
+      //   txn.newFastAppend().appendFile(...).commit();
+      //
+      //   // Some other operations ...
+      //
+      //   // Commit fails and needs to clean up newManifests
+      //   txn.commitTransaction()
+      List<ManifestFile> cleanedNewManifests = Lists.newArrayList();

Review Comment:
   Shouldn't this be called `committedNewManifests`? The name suggests it holds a set of cleaned up manifests while it is the opposite of that.



##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +178,28 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete newManifests that have not been committed and clear them from the list.

Review Comment:
   How useful is this comment given that `cleanUncommitted` has an elaborate doc? It does not seem specific to this case but rather a general comment when manifests may be cleaned up.
   
   I'd be open to include a few examples (like auto merging and transactions) in the parent doc in a separate PR but I am not sure it makes sense here.



##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -907,9 +907,28 @@ public Object updateEvent() {
   }
 
   private void cleanUncommittedAppends(Set<ManifestFile> committed) {
-    if (cachedNewDataManifest != null && !committed.contains(cachedNewDataManifest)) {
-      deleteFile(cachedNewDataManifest.path());
-      this.cachedNewDataManifest = null;
+    if (cachedNewDataManifests != null) {
+      // Delete newManifests that have not been committed and clear them from the list.

Review Comment:
   Same comments for this block as in `FastAppend`.



##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -952,10 +971,8 @@ protected void cleanUncommitted(Set<ManifestFile> committed) {
   private Iterable<ManifestFile> prepareNewDataManifests() {
     Iterable<ManifestFile> newManifests;
     if (newDataFiles.size() > 0) {
-      ManifestFile newManifest = newDataFilesAsManifest();
-      newManifests =
-          Iterables.concat(
-              ImmutableList.of(newManifest), appendManifests, rewrittenAppendManifests);
+      List<ManifestFile> newManifest = newDataFilesAsManifests();

Review Comment:
   Should this be called something like `dataFileManifests` given that it is a list now?



##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +178,28 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete newManifests that have not been committed and clear them from the list.
+      // This is needed for manifests cleanup especially in transaction mode, for example:
+      //   Transaction txn = beginTransaction(...)
+      //
+      //   // Operation succeeds and calls cleanUncommitted
+      //   txn.newFastAppend().appendFile(...).commit();
+      //
+      //   // Some other operations ...
+      //
+      //   // Commit fails and needs to clean up newManifests
+      //   txn.commitTransaction()
+      List<ManifestFile> cleanedNewManifests = Lists.newArrayList();
+      for (ManifestFile manifestFile : newManifests) {
+        if (!committed.contains(manifestFile)) {

Review Comment:
   What about flipping the if to avoid the negation? Negation is always harder to read.
   
   ```
   if (committed.contains(manifest)) {
     committedNewManifests.add(manifest);
   } else {
     deleteFile(manifest.path());
   }
   ```



##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -143,12 +143,12 @@ private ManifestFile copyManifest(ManifestFile manifest) {
 
   @Override
   public List<ManifestFile> apply(TableMetadata base, Snapshot snapshot) {
-    List<ManifestFile> newManifests = Lists.newArrayList();

Review Comment:
   What about renaming `writeManifests` to `writeNewManifests` to match `newManifests` variable and calling this variable simply `manifests` like we do in `MergingSnapshotProducer`?



##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -178,8 +178,28 @@ public Object updateEvent() {
 
   @Override
   protected void cleanUncommitted(Set<ManifestFile> committed) {
-    if (newManifest != null && !committed.contains(newManifest)) {
-      deleteFile(newManifest.path());
+    if (newManifests != null) {
+      // Delete newManifests that have not been committed and clear them from the list.
+      // This is needed for manifests cleanup especially in transaction mode, for example:
+      //   Transaction txn = beginTransaction(...)
+      //
+      //   // Operation succeeds and calls cleanUncommitted
+      //   txn.newFastAppend().appendFile(...).commit();
+      //
+      //   // Some other operations ...
+      //
+      //   // Commit fails and needs to clean up newManifests
+      //   txn.commitTransaction()
+      List<ManifestFile> cleanedNewManifests = Lists.newArrayList();
+      for (ManifestFile manifestFile : newManifests) {

Review Comment:
   Minor: `manifestFile` -> `manifest` to match the existing naming pattern in this class.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi merged pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi merged PR #6335:
URL: https://github.com/apache/iceberg/pull/6335


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1275770078


##########
core/src/main/java/org/apache/iceberg/RollingManifestWriter.java:
##########
@@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+/** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest files. */
+abstract class RollingManifestWriter<T extends ContentFile<T>> implements FileAppender<T> {
+  private final int formatVersion;
+  private final Long snapshotId;
+  private final PartitionSpec spec;
+  private final FileIO io;
+  private final ManifestOutputFileFactory outputFileFactory;
+  private final long targetFileSizeInBytes;
+  private final List<ManifestFile> manifestFiles;
+
+  private long currentFileRows = 0;
+  private OutputFile currentFile;
+  private ManifestWriter<T> currentWriter = null;
+
+  private boolean closed = false;
+
+  RollingManifestWriter(
+      int formatVersion,
+      Long snapshotId,
+      PartitionSpec spec,
+      FileIO io,
+      ManifestOutputFileFactory outputFileFactory,
+      long targetFileSizeInBytes) {
+    this.formatVersion = formatVersion;
+    this.snapshotId = snapshotId;
+    this.spec = spec;
+    this.io = io;
+    this.outputFileFactory = outputFileFactory;
+    this.targetFileSizeInBytes = targetFileSizeInBytes;
+    this.manifestFiles = Lists.newArrayList();
+
+    openCurrentWriter();
+  }
+
+  protected abstract ManifestWriter<T> newManifestWriter(
+      int targetFormatVersion,
+      Long targetSnapshotId,
+      PartitionSpec targetSpec,
+      OutputFile targetManifestPath);
+
+  /**
+   * Add an added entry for a file.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. The data and file sequence
+   * numbers will be assigned at commit.
+   *
+   * @param addedFile a data file
+   */
+  @Override
+  public void add(T addedFile) {
+    currentWriter.add(addedFile);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add an added entry for a file with a specific sequence number.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. The entry's data sequence
+   * number will be the provided data sequence number. The entry's file sequence number will be
+   * assigned at commit.
+   *
+   * @param addedFile a data file
+   * @param dataSequenceNumber a data sequence number for the file
+   */
+  public void add(T addedFile, long dataSequenceNumber) {
+    currentWriter.add(addedFile, dataSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add an existing entry for a file.
+   *
+   * <p>The original data and file sequence numbers, snapshot ID, which were assigned at commit,
+   * must be preserved when adding an existing entry.
+   *
+   * @param existingFile a file
+   * @param fileSnapshotId snapshot ID when the data file was added to the table
+   * @param dataSequenceNumber a data sequence number of the file (assigned when the file was added)
+   * @param fileSequenceNumber a file sequence number (assigned when the file was added)
+   */
+  public void existing(
+      T existingFile, long fileSnapshotId, long dataSequenceNumber, Long fileSequenceNumber) {
+    currentWriter.existing(existingFile, fileSnapshotId, dataSequenceNumber, fileSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  /**
+   * Add a delete entry for a file.
+   *
+   * <p>The entry's snapshot ID will be this manifest's snapshot ID. However, the original data and
+   * file sequence numbers of the file must be preserved when the file is marked as deleted.
+   *
+   * @param deletedFile a file
+   * @param dataSequenceNumber a data sequence number of the file (assigned when the file was added)
+   * @param fileSequenceNumber a file sequence number (assigned when the file was added)
+   */
+  public void delete(T deletedFile, long dataSequenceNumber, Long fileSequenceNumber) {
+    currentWriter.delete(deletedFile, dataSequenceNumber, fileSequenceNumber);
+    tryRollingToNewFile();
+  }
+
+  private void tryRollingToNewFile() {
+    currentFileRows++;
+
+    if (currentWriter.length() > targetFileSizeInBytes) {

Review Comment:
   Will it be expensive to call for every file? In rolling file writers, we check only every 1000 records. I understand that value may be too big but I still don't think it is a good idea to check the length for every entry.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1275772221


##########
core/src/main/java/org/apache/iceberg/ManifestFiles.java:
##########
@@ -167,6 +167,29 @@ public static ManifestWriter<DataFile> write(
         "Cannot write manifest for table version: " + formatVersion);
   }
 
+  /**
+   * Create a new {@link RollingManifestWriter} for the given format version.
+   *
+   * @param formatVersion a target format version
+   * @param snapshotId a snapshot ID for the manifest entries, or null for an inherited ID
+   * @param spec a {@link PartitionSpec}
+   * @param io a {@link FileIO}
+   * @param outputFileFactory a {@link ManifestOutputFileFactory} to generate the manifest output
+   *     file
+   * @param targetFileSizeInBytes the target file size for manifest files
+   * @return a rolling manifest writer which could generate multiple manifest files
+   */
+  public static RollingManifestWriter<DataFile> rollingWrite(

Review Comment:
   Let me think, no need to update anything now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1261305309


##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -49,8 +51,9 @@ class FastAppend extends SnapshotProducer<AppendFiles> implements AppendFiles {
   private final List<DataFile> newFiles = Lists.newArrayList();
   private final List<ManifestFile> appendManifests = Lists.newArrayList();
   private final List<ManifestFile> rewrittenAppendManifests = Lists.newArrayList();
-  private ManifestFile newManifest = null;
+  private List<ManifestFile> newManifests = null;
   private boolean hasNewFiles = false;
+  private final long targetSizeBytes;

Review Comment:
   updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #6335: Core: Avoid generating a large ManifestFile when committing

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#discussion_r1248856762


##########
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##########
@@ -513,6 +515,38 @@ protected long snapshotId() {
     return snapshotId;
   }
 
+  protected static <F extends ContentFile<F>> List<ManifestFile> writeFilesToManifests(

Review Comment:
   It seems possible. Let me try to implement it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org