Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/14 15:49:07 UTC

[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

ahshahid commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351671855

   @RussellSpitzer Right, I missed the modification of " - splitOffset".
   
   Though the bug, which I think is in the formula, still remains.
   
   My reasoning is as follows:
   The function estimatedRowCounts has to estimate the total row count of a split (or of a single file) by analyzing a fraction of that split (file).
   This means that
   total row count of a split/file >= row count of the scanned fraction (which is what we call the record count)
   
   Now, if total row count of a split/file = (scannedFileFraction * file().recordCount())
   and the scanned fraction is <= 1,
   this would result in total row count <= the fraction's record count, which contradicts the inequality above.
   
   The change I proposed is based on this proportion:
   
   When the scanned file/split size is length(), the row count is file().recordCount().
   So when the total size of the file/split is (file().fileSizeInBytes() - splitOffset), the total count X = ?
   
   X = (file().recordCount() * (file().fileSizeInBytes() - splitOffset)) / length()
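
To make the proportion concrete, here is a minimal Java sketch that evaluates both formulas on the same numbers. It is not the Iceberg source: the class name, the method names, and the sample values are hypothetical. The form of scannedFileFraction (length divided by the remaining bytes after splitOffset) is an assumption, and so is the premise of the reasoning above, namely that file().recordCount() is the row count of the scanned fraction.

```java
public class SplitRowEstimate {

    // Formula as described above: scannedFileFraction * recordCount.
    // Assumption: scannedFileFraction = length / (fileSizeInBytes - splitOffset).
    static long currentEstimate(long length, long fileSizeInBytes,
                                long splitOffset, long recordCount) {
        double scannedFileFraction = (double) length / (fileSizeInBytes - splitOffset);
        return (long) (scannedFileFraction * recordCount);
    }

    // Proposed proportion: X = recordCount * (fileSizeInBytes - splitOffset) / length.
    static long proposedEstimate(long length, long fileSizeInBytes,
                                 long splitOffset, long recordCount) {
        return (long) ((double) recordCount * (fileSizeInBytes - splitOffset) / length);
    }

    public static void main(String[] args) {
        // Hypothetical sample values, chosen only to keep the arithmetic easy.
        long length = 250L;           // bytes scanned
        long fileSizeInBytes = 1100L; // total file size
        long splitOffset = 100L;      // offset at which the split starts
        long recordCount = 100L;      // value returned by file().recordCount()

        System.out.println(currentEstimate(length, fileSizeInBytes, splitOffset, recordCount));  // prints 25
        System.out.println(proposedEstimate(length, fileSizeInBytes, splitOffset, recordCount)); // prints 400
    }
}
```

With these values the first formula gives 25, which is below the record count of 100, while the proportion gives 400. Which of the two is right depends on whether file().recordCount() describes the whole file or only the scanned part.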
   
   Do you think my understanding of the function's objective is correct?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


For additional commands, e-mail: issues-help@iceberg.apache.org