You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/12/21 00:38:16 UTC

[GitHub] [iceberg] rdblue opened a new pull request #3780: Lazily initialize the underlying writer in ParquetWriter

rdblue opened a new pull request #3780:
URL: https://github.com/apache/iceberg/pull/3780


   This is an update to #3293 that implements the `FileAppender` contract more closely to avoid test failures.
   
   This renames `writer` to `ensureWriterInitialized` and updates previous calls to `writer()` that would initialize the writer -- even if the appender was closed -- with null checks. `ensureWriterInitialized` is called when flushing a row group from the column write store to the file.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #3780: Lazily initialize the underlying writer in ParquetWriter

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3780:
URL: https://github.com/apache/iceberg/pull/3780#issuecomment-998919853


   Looks like this is now causing rolling file writer tests to fail. We'll have to take a look at that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #3780: Lazily initialize the underlying writer in ParquetWriter

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3780:
URL: https://github.com/apache/iceberg/pull/3780#issuecomment-1008425939


   This defers creating an actual Parquet file writer until the first row group is flushed. That avoids interacting with the underlying object store (which for GCS FIleSystem will allocate lots of memory) and removes the need to delete files that are closed with 0 records. But, the Iceberg writer structure is still maintained. So the writer classes other than Parquet is unchanged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #3780: Lazily initialize the underlying writer in ParquetWriter

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #3780:
URL: https://github.com/apache/iceberg/pull/3780


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] NeQuissimus commented on pull request #3780: Lazily initialize the underlying writer in ParquetWriter

Posted by GitBox <gi...@apache.org>.
NeQuissimus commented on pull request #3780:
URL: https://github.com/apache/iceberg/pull/3780#issuecomment-998851872


   This is looking good! Thank you for taking the time to address the concerns


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #3780: Lazily initialize the underlying writer in ParquetWriter

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3780:
URL: https://github.com/apache/iceberg/pull/3780#issuecomment-1008365713


   @RussellSpitzer, could you take a look at this? I have tests passing now.
   
   This lazily creates Parquet writers to reduce memory consumption.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] NeQuissimus commented on pull request #3780: Lazily initialize the underlying writer in ParquetWriter

Posted by GitBox <gi...@apache.org>.
NeQuissimus commented on pull request #3780:
URL: https://github.com/apache/iceberg/pull/3780#issuecomment-1008879098


   Thank you for the additional work you did here. This looks like you had a bit of an adventure to work through :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3780: Lazily initialize the underlying writer in ParquetWriter

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3780:
URL: https://github.com/apache/iceberg/pull/3780#discussion_r784294230



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java
##########
@@ -138,30 +153,41 @@ public Metrics metrics() {
   @Override
   public long length() {
     try {
-      if (closed) {
-        return writer.getPos();
-      } else {
-        return writer.getPos() + (writeStore.isColumnFlushNeeded() ? writeStore.getBufferedSize() : 0);
+      long length = 0L;
+
+      if (writer != null) {
+        length += writer.getPos();
       }
+
+      if (!closed && recordCount > 0) {
+        // recordCount > 0 when there are records in the write store that have not been flushed to the Parquet file
+        length += writeStore.getBufferedSize();

Review comment:
       I'm not sure what it is exactly, but it is definitely non-zero when record count is 0 or the writer is closed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #3780: Lazily initialize the underlying writer in ParquetWriter

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3780:
URL: https://github.com/apache/iceberg/pull/3780#issuecomment-1012484187


   Thanks for reviewing, @RussellSpitzer! And thanks for the original fix @NeQuissimus!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on pull request #3780: Lazily initialize the underlying writer in ParquetWriter

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #3780:
URL: https://github.com/apache/iceberg/pull/3780#issuecomment-998507583


   > This is an update to #3293 that implements the `FileAppender` contract more closely to avoid test failures.
   >
   > This renames `writer` to `ensureWriterInitialized` and updates previous calls to `writer()` that would initialize the writer -- even if the appender was closed -- with null checks. `ensureWriterInitialized` is called when flushing a row group from the column write store to the file.
   
   I like this idea. My analysis on the former approach was the same as yours. There are also some fields that are built based off of the writer and associated factories that need to also be adjusted for lazy initialization (I believe). 
   
   I'd be happy to also put a pair of eyes on this PR as well as it's an important part of the code.
   
   A large thank you to @NeQuissimus for the GCP-HDFS library concern. 👍 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #3780: Lazily initialize the underlying writer in ParquetWriter

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #3780:
URL: https://github.com/apache/iceberg/pull/3780#issuecomment-1008367834


   So the idea here is we will have null writers instead of initialized empty writers? I'm about to get in a flight but I will check it out tomorrow


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3780: Lazily initialize the underlying writer in ParquetWriter

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3780:
URL: https://github.com/apache/iceberg/pull/3780#discussion_r784241755



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java
##########
@@ -138,30 +153,41 @@ public Metrics metrics() {
   @Override
   public long length() {
     try {
-      if (closed) {
-        return writer.getPos();
-      } else {
-        return writer.getPos() + (writeStore.isColumnFlushNeeded() ? writeStore.getBufferedSize() : 0);
+      long length = 0L;
+
+      if (writer != null) {
+        length += writer.getPos();
       }
+
+      if (!closed && recordCount > 0) {
+        // recordCount > 0 when there are records in the write store that have not been flushed to the Parquet file
+        length += writeStore.getBufferedSize();

Review comment:
       I know this isn't realated to this pr, but what is `getBufferedSize()` when record count == 0 or the writer is closed? Just seems odd that we can't always add this on. NBD just wondering




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue edited a comment on pull request #3780: Lazily initialize the underlying writer in ParquetWriter

Posted by GitBox <gi...@apache.org>.
rdblue edited a comment on pull request #3780:
URL: https://github.com/apache/iceberg/pull/3780#issuecomment-1008425939


   This defers creating an actual Parquet file writer until the first row group is flushed. That avoids interacting with the underlying object store (which for GCS FIleSystem will allocate lots of memory) and removes the need to delete files that are closed with 0 records. But, the Iceberg writer structure is still maintained. So the writer classes other than Parquet are unchanged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org