Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/12/23 18:30:43 UTC

[GitHub] [iceberg] dmgcodevil opened a new issue #1980: Spark Iceberg manifest writer writes wrong parquet file sizes.

dmgcodevil opened a new issue #1980:
URL: https://github.com/apache/iceberg/issues/1980


   We are using Spark with Iceberg, and some Iceberg manifest files report the wrong data file (Parquet) size: it is ~2x larger than the actual Parquet file size. The issue was found while investigating Presto Iceberg [issue #6369](https://github.com/prestosql/presto/issues/6369).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on issue #1980: Spark Iceberg manifest reports wrong parquet file sizes.

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1980:
URL: https://github.com/apache/iceberg/issues/1980#issuecomment-755291866


   @rdblue @dmgcodevil does this problem happen all the time or only in certain cases? Shall we create a repair action that would fix the metadata by checking the file system?




[GitHub] [iceberg] dmgcodevil commented on issue #1980: Spark Iceberg manifest reports wrong parquet file sizes.

Posted by GitBox <gi...@apache.org>.
dmgcodevil commented on issue #1980:
URL: https://github.com/apache/iceberg/issues/1980#issuecomment-755372622


   @aokolnychyi I'd suggest creating a repair action. This is what we are about to do with our data.
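   
   A repair check of the kind discussed here could start by comparing the size recorded in the metadata with the size reported by the file system. The following is only a rough sketch under stated assumptions: `FileSizeCheck`, `findMismatches`, and the `recordedSizes` map are hypothetical names, the map stands in for whatever was read from the manifests, and only the local file system is checked through `java.nio.file`. Reading manifest entries and rewriting them through Iceberg is deliberately left out.
   
   ```java
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.util.LinkedHashMap;
   import java.util.Map;

   // Hypothetical helper, not part of Iceberg.
   final class FileSizeCheck {

     // Returns path -> actual size for every file whose recorded size
     // differs from the size reported by the file system.
     static Map<Path, Long> findMismatches(Map<Path, Long> recordedSizes) throws IOException {
       Map<Path, Long> corrections = new LinkedHashMap<>();
       for (Map.Entry<Path, Long> entry : recordedSizes.entrySet()) {
         long actual = Files.size(entry.getKey());
         if (actual != entry.getValue()) {
           corrections.put(entry.getKey(), actual);
         }
       }
       return corrections;
     }

     public static void main(String[] args) throws IOException {
       // A 100-byte temp file stands in for a data file.
       Path file = Files.createTempFile("data", ".parquet");
       Files.write(file, new byte[100]);

       Map<Path, Long> recorded = new LinkedHashMap<>();
       recorded.put(file, 200L); // recorded size is 2x the real one, as in the report

       // Prints the file path mapped to its actual size (100).
       System.out.println(findMismatches(recorded));
       Files.delete(file);
     }
   }
   ```
   
   The returned map only lists files that need a correction, so an empty result means the recorded sizes already match.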




[GitHub] [iceberg] rdblue commented on issue #1980: Spark Iceberg manifest reports wrong parquet file sizes.

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1980:
URL: https://github.com/apache/iceberg/issues/1980#issuecomment-754976837


   We should have a release sometime this month.




[GitHub] [iceberg] dmgcodevil edited a comment on issue #1980: Spark Iceberg manifest reports wrong parquet file sizes.

Posted by GitBox <gi...@apache.org>.
dmgcodevil edited a comment on issue #1980:
URL: https://github.com/apache/iceberg/issues/1980#issuecomment-750765052


   OK, based on my debugging, the _problem_ is in this [line](https://github.com/apache/iceberg/blob/1b66bdfc084ac73fe999299d041aa2e5677f43c9/parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java#L130)
   
   i.e. `writer.getPos()` returns the correct size of the actual Parquet file saved on disk, but `writeStore.isColumnFlushNeeded()` returns true, so `length()` adds `writeStore.getBufferedSize()` on top of it.
   
   In my case, the actual file size is `4351656` and the row count is `470099`.
   
   At the moment `length()` is called, `ColumnWriteStoreV1.rowCountForNextSizeCheck` == `470100`, so `writeStore.isColumnFlushNeeded()` returns true:
   
   ```java
   public boolean isColumnFlushNeeded() {
     // rowCount == 470099
     // rowCountForNextSizeCheck == 470100
     return rowCount + 1 >= rowCountForNextSizeCheck;
   }
   ```
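   
   To make the arithmetic concrete, here is a minimal standalone sketch that plugs the reported values into the same comparison. `FlushCheckDemo` and its static helper are hypothetical stand-ins and not parquet-mr code; only the comparison itself is taken from the snippet above.
   
   ```java
   // Hypothetical demo class, not part of parquet-mr or Iceberg.
   final class FlushCheckDemo {

     // Same comparison as the isColumnFlushNeeded() snippet quoted above,
     // lifted into a static helper so the inputs can be passed in directly.
     static boolean isColumnFlushNeeded(long rowCount, long rowCountForNextSizeCheck) {
       return rowCount + 1 >= rowCountForNextSizeCheck;
     }

     public static void main(String[] args) {
       long rowCount = 470099L;                 // rows written, from the report
       long rowCountForNextSizeCheck = 470100L; // value observed when length() was called

       // 470099 + 1 >= 470100, so this prints "true".
       System.out.println(isColumnFlushNeeded(rowCount, rowCountForNextSizeCheck));
     }
   }
   ```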
   
   
   Maybe it can be fixed by adding an extra check:
   
   ```java
     // New flag: set once close() has flushed and finished the file.
     private boolean closed = false;
   
     @Override
     public void close() throws IOException {
       flushRowGroup(true);
       writeStore.close();
       writer.end(metadata);
       this.closed = true; // new flag indicates that the writer was closed
     }
   
     @Override
     public long length() {
       try {
         if (closed) {
           // After close(), all data has been written, so the writer position is the file size.
           return writer.getPos();
         } else {
           return writer.getPos() + (writeStore.isColumnFlushNeeded() ? writeStore.getBufferedSize() : 0);
         }
       } catch (IOException e) {
         throw new RuntimeIOException(e, "Failed to get file length");
       }
     }
   ```
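   
   The effect of the `closed` flag can be tried in isolation with a toy model. Everything in it is hypothetical: `ToyWriter`, its fields, and the buffered size of `4000000` are made up for illustration and are not the Iceberg implementation. The model assumes, as the debugging above suggests, that the flush check can still return true after the writer has been closed.
   
   ```java
   // Hypothetical toy model, not the real ParquetWriter.
   final class ToyWriter {
     private final long bytesOnDisk;    // stands in for writer.getPos()
     private final long bufferedSize;   // stands in for writeStore.getBufferedSize()
     private final boolean flushNeeded; // stands in for writeStore.isColumnFlushNeeded()
     private boolean closed = false;

     ToyWriter(long bytesOnDisk, long bufferedSize, boolean flushNeeded) {
       this.bytesOnDisk = bytesOnDisk;
       this.bufferedSize = bufferedSize;
       this.flushNeeded = flushNeeded;
     }

     public void close() {
       this.closed = true;
     }

     // length() without the flag: buffered bytes are added whenever the flush check is true.
     public long lengthWithoutFlag() {
       return bytesOnDisk + (flushNeeded ? bufferedSize : 0);
     }

     // length() with the proposed flag: once closed, only the on-disk position counts.
     public long lengthWithFlag() {
       if (closed) {
         return bytesOnDisk;
       }
       return lengthWithoutFlag();
     }

     public static void main(String[] args) {
       // 4351656 is the file size from the report; 4000000 is an invented buffered size.
       ToyWriter writer = new ToyWriter(4351656L, 4000000L, true);
       writer.close();
       System.out.println(writer.lengthWithoutFlag()); // prints 8351656
       System.out.println(writer.lengthWithFlag());    // prints 4351656
     }
   }
   ```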
   
   
   cc @rdblue 




[GitHub] [iceberg] dmgcodevil commented on issue #1980: Spark Iceberg manifest reports wrong parquet file sizes.

Posted by GitBox <gi...@apache.org>.
dmgcodevil commented on issue #1980:
URL: https://github.com/apache/iceberg/issues/1980#issuecomment-751919793


   @rdblue [PR](https://github.com/apache/iceberg/pull/2001). The fix works, but I believe there is a better way to fix it.




[GitHub] [iceberg] dmgcodevil commented on issue #1980: Spark Iceberg manifest reports wrong parquet file sizes.

Posted by GitBox <gi...@apache.org>.
dmgcodevil commented on issue #1980:
URL: https://github.com/apache/iceberg/issues/1980#issuecomment-755426532


   @rdblue if you are using prestosql >= 348 you will face this [problem](https://github.com/trinodb/trino/issues/6369)




[GitHub] [iceberg] dmgcodevil commented on issue #1980: Spark Iceberg manifest reports wrong parquet file sizes.

Posted by GitBox <gi...@apache.org>.
dmgcodevil commented on issue #1980:
URL: https://github.com/apache/iceberg/issues/1980#issuecomment-750578485


   Quick update: I've reproduced the issue. Will update the ticket as soon as I have the details.




[GitHub] [iceberg] dmgcodevil edited a comment on issue #1980: Spark Iceberg manifest reports wrong parquet file sizes.

Posted by GitBox <gi...@apache.org>.
dmgcodevil edited a comment on issue #1980:
URL: https://github.com/apache/iceberg/issues/1980#issuecomment-750578485


   Quick update: I've reproduced the issue. Will update the ticket as soon as I have more details.




[GitHub] [iceberg] dmgcodevil commented on issue #1980: Spark Iceberg manifest reports wrong parquet file sizes.

Posted by GitBox <gi...@apache.org>.
dmgcodevil commented on issue #1980:
URL: https://github.com/apache/iceberg/issues/1980#issuecomment-750950712


   UPDATE: I did some testing: 739578885 total rows, 900 Parquet files, 1953 manifest files. No errors found.
   The fix might not be ideal, but I can confirm that the manifest files now contain the correct data file sizes.




[GitHub] [iceberg] rdblue commented on issue #1980: Spark Iceberg manifest reports wrong parquet file sizes.

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1980:
URL: https://github.com/apache/iceberg/issues/1980#issuecomment-755539428


   Thanks for the context, @dmgcodevil. That is definitely a problem. I think we will want to have a Trino fix for it, with the ability to fix metadata as a workaround until that is released. If you have a utility to share that fixes the metadata, I think that would be useful for other people. Thanks!




[GitHub] [iceberg] rdblue commented on issue #1980: Spark Iceberg manifest reports wrong parquet file sizes.

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1980:
URL: https://github.com/apache/iceberg/issues/1980#issuecomment-755405035


   Why repair this? Is it causing problems with split planning?




[GitHub] [iceberg] dmgcodevil edited a comment on issue #1980: Spark Iceberg manifest reports wrong parquet file sizes.

Posted by GitBox <gi...@apache.org>.
dmgcodevil edited a comment on issue #1980:
URL: https://github.com/apache/iceberg/issues/1980#issuecomment-755426532


   @rdblue if you are using prestosql >= 348 you will face this [problem](https://github.com/trinodb/trino/issues/6369)
   
   cc @aokolnychyi 




[GitHub] [iceberg] dmgcodevil commented on issue #1980: Spark Iceberg manifest reports wrong parquet file sizes.

Posted by GitBox <gi...@apache.org>.
dmgcodevil commented on issue #1980:
URL: https://github.com/apache/iceberg/issues/1980#issuecomment-754943032


   @rdblue do you know when a new release will be cut?




[GitHub] [iceberg] dmgcodevil commented on issue #1980: Spark Iceberg manifest reports wrong parquet file sizes.

Posted by GitBox <gi...@apache.org>.
dmgcodevil commented on issue #1980:
URL: https://github.com/apache/iceberg/issues/1980#issuecomment-750765052


   OK, based on my debugging, the _problem_ is in this [line](https://github.com/apache/iceberg/blob/1b66bdfc084ac73fe999299d041aa2e5677f43c9/parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java#L130)
   
   i.e. `writer.getPos()` returns the correct size of the actual Parquet file saved on disk, but `writeStore.isColumnFlushNeeded()` returns true, so `length()` adds `writeStore.getBufferedSize()` on top of it.
   
   In my case, the actual file size is `4351656` and the row count is `470099`.
   
   At the moment `length()` is called, `ColumnWriteStoreV1.rowCountForNextSizeCheck` == `470100`, so `writeStore.isColumnFlushNeeded()` returns true.
   
   Maybe it can be fixed by adding an extra check:
   
   ```java
     // New flag: set once close() has flushed and finished the file.
     private boolean closed = false;
   
     @Override
     public void close() throws IOException {
       flushRowGroup(true);
       writeStore.close();
       writer.end(metadata);
       this.closed = true; // new flag indicates that the writer was closed
     }
   
     @Override
     public long length() {
       try {
         if (closed) {
           // After close(), all data has been written, so the writer position is the file size.
           return writer.getPos();
         } else {
           return writer.getPos() + (writeStore.isColumnFlushNeeded() ? writeStore.getBufferedSize() : 0);
         }
       } catch (IOException e) {
         throw new RuntimeIOException(e, "Failed to get file length");
       }
     }
   ```
   
   
   




[GitHub] [iceberg] dmgcodevil closed issue #1980: Spark Iceberg manifest reports wrong parquet file sizes.

Posted by GitBox <gi...@apache.org>.
dmgcodevil closed issue #1980:
URL: https://github.com/apache/iceberg/issues/1980


   




[GitHub] [iceberg] rdblue commented on issue #1980: Spark Iceberg manifest reports wrong parquet file sizes.

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1980:
URL: https://github.com/apache/iceberg/issues/1980#issuecomment-751031371


   Sounds like you have a fix. Could you open a PR with it? Thanks!

