Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/27 05:51:47 UTC

[GitHub] [iceberg] zhangjun0x01 opened a new issue #1666: the fileSizeInBytes of orc and parquet are inconsistent

zhangjun0x01 opened a new issue #1666:
URL: https://github.com/apache/iceberg/issues/1666


   I found that the value stored in the fileSizeInBytes field of DataFile is inconsistent between the ORC and Parquet formats. For ORC it holds the deserialized data size, while for Parquet it holds the file size.
   
   This causes a problem. In RewriteDataFilesAction, the default value of targetSizeInBytes is 128 MB, but with the ORC format each data file produced by the rewrite action is only 10 MB. The reason is that RewriteDataFilesAction reads the ORC data according to the deserialized data size rather than the file size, so the newly generated data files fall well short of 128 MB.
   
   With the Parquet format the behavior is normal and meets my expectations.
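
   The size mismatch described above can be illustrated with a little arithmetic. The following is only a minimal sketch under stated assumptions, not the actual planning logic of RewriteDataFilesAction: the class and method names are hypothetical, and the 1280 MB raw size and 100 MB on-disk size are made-up numbers chosen to reproduce the 128 MB versus 10 MB observation. It assumes the rewrite groups data until the recorded size reaches the target, so the on-disk output shrinks by the ratio of actual size to recorded size.

```java
public class RewriteSizeSketch {
  private static final long MB = 1024L * 1024L;

  /**
   * Hypothetical model: if input is grouped until its RECORDED size reaches
   * the target, the bytes that land on disk are scaled by actual/recorded.
   */
  static long expectedOutputBytes(long targetSizeInBytes,
                                  long recordedSizeInBytes,
                                  long actualFileBytes) {
    return targetSizeInBytes * actualFileBytes / recordedSizeInBytes;
  }

  public static void main(String[] args) {
    long target = 128 * MB;

    // ORC-like case (assumed numbers): the recorded size is the deserialized
    // size, here 1280 MB, while the file on disk is only 100 MB.
    long orcLike = expectedOutputBytes(target, 1280 * MB, 100 * MB);

    // Parquet-like case: the recorded size equals the on-disk file size.
    long parquetLike = expectedOutputBytes(target, 100 * MB, 100 * MB);

    System.out.println("ORC-like output:     " + orcLike / MB + " MB");
    System.out.println("Parquet-like output: " + parquetLike / MB + " MB");
  }
}
```

   Under these assumed numbers the ORC-like case yields 10 MB files and the Parquet-like case yields 128 MB files, which matches the behavior reported here.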


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on issue #1666: the fileSizeInBytes of orc and parquet are inconsistent

Posted by GitBox <gi...@apache.org>.

rdblue commented on issue #1666:
URL: https://github.com/apache/iceberg/issues/1666#issuecomment-718265647


   @shardulm94, can you take a look at the ORC file size metric? It looks like it may be incorrect, which would affect scan planning.


[GitHub] [iceberg] zhangjun0x01 commented on issue #1666: the fileSizeInBytes of orc and parquet are inconsistent

Posted by GitBox <gi...@apache.org>.

zhangjun0x01 commented on issue #1666:
URL: https://github.com/apache/iceberg/issues/1666#issuecomment-718301579

Hi @rdblue, @shardulm94:
I read the source code. When the DataFile is constructed in the BaseTaskWriter.RollingFileWriter#closeCurrent method, fileSizeInBytes is taken from the length() method of currentAppender, and OrcFileAppender uses the getRawDataSize() method of the ORC Writer to get that length. According to the Javadoc of this method, it returns the deserialized data size:

```
/**
 * Return the deserialized data size. Raw data size will be compute when
 * writing the file footer. Hence raw data size value will be available only
 * after closing the writer.
 *
 * @return raw data size
 */
long getRawDataSize();
```

Parquet gets the length from the writer's position. I don't know which of ORC and Parquet is correct, but the length obtained for the Parquet format is the one that meets my expectations: when I inspect the file with the hdfs fsck command, I see that HDFS splits it into blocks according to the file size, not the deserialized data size.
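
The difference between the two length sources can be reproduced without ORC or Parquet at all. The sketch below is an analogy only, under loud assumptions: it uses the JDK's GZIPOutputStream as a stand-in for a compressing columnar writer, and the class LengthSketch, the PositionStream wrapper, and the sample row are all hypothetical. It tracks the raw bytes handed to the writer (comparable to a deserialized data size) next to the number of bytes that actually reach the sink (comparable to a position-based file length).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class LengthSketch {
  /** Counts the bytes that actually reach the underlying sink (the "position"). */
  static final class PositionStream extends OutputStream {
    private final OutputStream out;
    private long pos = 0;

    PositionStream(OutputStream out) {
      this.out = out;
    }

    @Override
    public void write(int b) throws IOException {
      out.write(b);
      pos++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
      out.write(b, off, len);
      pos += len;
    }

    long getPos() {
      return pos;
    }
  }

  /** Returns {rawBytes, positionBytes} after writing the given number of rows. */
  static long[] measure(int rows) {
    PositionStream sink = new PositionStream(new ByteArrayOutputStream());
    byte[] row = "same-value,same-value,same-value\n".getBytes(StandardCharsets.US_ASCII);
    long raw = 0;
    try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
      for (int i = 0; i < rows; i++) {
        gz.write(row);
        raw += row.length; // size of the data before compression
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // The position is read after close, so the compressed trailer is included.
    return new long[] {raw, sink.getPos()};
  }

  public static void main(String[] args) {
    long[] sizes = measure(100_000);
    System.out.println("raw data size:   " + sizes[0] + " bytes");
    System.out.println("stream position: " + sizes[1] + " bytes");
  }
}
```

For highly repetitive data like this, the raw size is far larger than the position. Recording the former as the file size would overstate how much storage the file occupies, which is consistent with the HDFS block observation above.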
