Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/14 08:37:38 UTC

[GitHub] [iceberg] ahshahid opened a new issue, #6424: The size estimation formula for spark task is incorrect

ahshahid opened a new issue, #6424:
URL: https://github.com/apache/iceberg/issues/6424

   ### Apache Iceberg version
   
   main (development)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   The size estimation formula used for non-partition columns, as seen in
   ContentScanTask, is currently:

       default long estimatedRowsCount() {
         double scannedFileFraction = ((double) length()) / file().fileSizeInBytes();
         return (long) (scannedFileFraction * file().recordCount());
       }
   
   IMO it should be:

       (file().fileSizeInBytes() * file().recordCount()) / length()
   
   We are estimating the full row count from a partial scan of the file and the rows contained in that scanned portion.
   
   The current formula is wrong, because scannedFileFraction is bound to be <= 1,
   so the full row count has to be >= file().recordCount().
   But full row count = scannedFileFraction * file().recordCount() implies that
   full row count <= file().recordCount(),
   which is incorrect.
   
   I have a bug test which shows that, because of this, inefficient broadcast hash joins are getting created.
   I will create a PR and a bug test tomorrow.
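   
   As a minimal sketch with made-up numbers, this is what the two formulas compute for a task that covers half of a file:
   
       // Made-up numbers: a 100-byte file recorded with 1,000 rows, and a task
       // covering 50 of those bytes.
       long fileSizeInBytes = 100L;
       long recordCount = 1_000L;   // record count from the file metadata
       long length = 50L;           // bytes covered by this scan task
   
       double scannedFileFraction = ((double) length) / fileSizeInBytes;   // 0.5
       long currentEstimate = (long) (scannedFileFraction * recordCount);  // 500
       long proposedEstimate = (fileSizeInBytes * recordCount) / length;   // 2,000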
   
   




[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

Posted by GitBox <gi...@apache.org>.
ahshahid commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351671855

   @RussellSpitzer Right, I missed the modification of "- splitOffset".
   
   Though the bug, which I think is in the formula, still remains.
   
   My reasoning is as follows:
   The function estimatedRowsCount() has to estimate the total row count of a split/file by analyzing a fraction of the split (file),
   which means that
   total row count of a split/file >= scanned-fraction row count (which is what we call the record count).
   
   Now if total row count of a split/file = (scannedFileFraction * file().recordCount())
   and the scanned fraction is <= 1,
   this would result in total row count <= the fraction's record count.
   
   The change I proposed is based on this ratio/proportion:
   
       when the scanned size of the file/split is length(), the row count is file().recordCount()
       so when the total size of the file/split is (file().fileSizeInBytes() - splitOffset), the total count X = ?
   
       X = (file().recordCount() * (file().fileSizeInBytes() - splitOffset)) / length()
   
   Do you think my understanding of the objective of the function is correct?
   
   




[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

Posted by GitBox <gi...@apache.org>.
ahshahid commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351924900

   Right... I was also thinking that this is where I have a misunderstanding or a bug...
   The question is:
   whether recordCount represents the scanned-fraction row count or the total row count of the split/file.
   
   I will look into the code, but as I see it, recordCount has to be the partially scanned count.
   This is for two reasons:
   1) The estimated row count function needs to return the total row count in the split/file. If that were already available, the calculation would not even be needed.
   2) For a split which covers part of a single file, or which spans multiple files, the estimated row count of the total split will need to be calculated, and for that recordCount should be something that refers to the partially scanned row count.




[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

Posted by GitBox <gi...@apache.org>.
ahshahid commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1352519479

   @RussellSpitzer Actually, part of the issue I was seeing was related to a scannedFileFraction approximately equal to 1 but a record count of 1, which was resulting in a net row count of 0 due to the double being cast to long.
   That was happening because the splitOffset value was not being subtracted from the file size; as a result the estimate was 0.98... * 1, which resulted in 0 rows instead of 1.
   With splitOffset subtracted, the ratio is now exactly equal to 1.
   Thanks for the information.
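   
   As a minimal sketch with illustrative numbers, this is the truncation effect:
   
       // Illustrative numbers: a tiny file with a single record, where the first
       // row group starts at byte offset 100.
       long fileSizeInBytes = 5_000L;
       long splitOffset = 100L;
       long length = 4_900L;   // the task covers everything after the metadata
       long recordCount = 1L;
   
       // Without subtracting splitOffset the fraction is slightly below 1, and
       // the cast to long truncates the estimate to 0 rows.
       double withoutOffset = ((double) length) / fileSizeInBytes;              // 0.98
       long badEstimate = (long) (withoutOffset * recordCount);                 // 0
   
       // Subtracting splitOffset makes the fraction exactly 1.0, so the
       // estimate is the expected 1 row.
       double withOffset = ((double) length) / (fileSizeInBytes - splitOffset); // 1.0
       long goodEstimate = (long) (withOffset * recordCount);                   // 1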




[GitHub] [iceberg] ahshahid closed issue #6424: The size estimation formula for spark task is incorrect

Posted by GitBox <gi...@apache.org>.
ahshahid closed issue #6424: The size estimation formula for spark task is incorrect
URL: https://github.com/apache/iceberg/issues/6424




[GitHub] [iceberg] RussellSpitzer commented on issue #6424: The size estimation formula for spark task is incorrect

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351621767

   The current code is slightly different from this:
   
   https://github.com/apache/iceberg/blob/33217abf7f88c6c22a8c43b320f9de48de998b94/api/src/main/java/org/apache/iceberg/ContentScanTask.java#L65-L70
   
   but I'm not sure I follow your math.
   
   Why would the full row count have to be greater than the record count for a split? If we scan only part of a file, the estimate should only be a portion of the rows in the file. When we sum over all of our partial scans we should get the full amount (if the full file is read).
   
   




[GitHub] [iceberg] RussellSpitzer commented on issue #6424: The size estimation formula for spark task is incorrect

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1352058762

   Parquet files have non-data metadata which is not scanned when we read the split. So if, for example, our first row group starts at byte 1000, we don't want to count those 1000 bytes of the total file size as part of the file size representing data.
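   
   As a minimal sketch with made-up numbers, mirroring the linked ContentScanTask code:
   
       // Made-up numbers: a 10,000-byte Parquet file whose first row group starts
       // at byte 1,000, holding 100 records, with a task covering 4,500 data bytes.
       long fileSizeInBytes = 10_000L;
       long firstSplitOffset = 1_000L;   // metadata before the first row group
       long recordCount = 100L;
       long length = 4_500L;
   
       // The fraction is computed against the data portion only, not the whole file.
       double scannedFileFraction = ((double) length) / (fileSizeInBytes - firstSplitOffset); // 0.5
       long estimatedRows = (long) (scannedFileFraction * recordCount);                       // 50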




[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

Posted by GitBox <gi...@apache.org>.
ahshahid commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1352159451

   Thank you @RussellSpitzer. I will close this; maybe the issue I am seeing is the conversion of double to long for a fractional value. I will update once I debug more.
   Sorry for the false alarm.
   




[GitHub] [iceberg] RussellSpitzer commented on issue #6424: The size estimation formula for spark task is incorrect

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351901052

   > here length() is the amount of bytes scanned ( only partially read)
   
   https://github.com/apache/iceberg/blob/33217abf7f88c6c22a8c43b320f9de48de998b94/api/src/main/java/org/apache/iceberg/ContentScanTask.java#L48-L53
   
   Your hypothetical has a contradiction:
   
   `length = splitSize`
   So
   `if (splitSize == fileSizeInBytes) then length == fileSizeInBytes`
   
   `if (length == fileSizeInBytes / 2) then splitSize == fileSizeInBytes / 2`
   Therefore there must be one other task of length
   `fileSizeInBytes / 2`, otherwise we wouldn't be reading the whole file. So again the fraction here is correct at 50% for the given task. The other task will count the other 50%.




[GitHub] [iceberg] RussellSpitzer commented on issue #6424: The size estimation formula for spark task is incorrect

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351938390

   Record count does not represent the scanned fraction. I linked you to the code; it's a representation of a row in a manifest file, which is the metadata for the entire file.




[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

Posted by GitBox <gi...@apache.org>.
ahshahid commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1352025232

   @RussellSpitzer: I see what you are saying about the record count corresponding to the total file size.
   Let me look into what is causing the problem in my join test.




[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

Posted by GitBox <gi...@apache.org>.
ahshahid commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1352040671

   @RussellSpitzer:
   Apologies for bugging you; I was hoping for one more clarification on this aspect:
   
       long splitOffset = (file().splitOffsets() != null) ? file().splitOffsets().get(0) : 0L;
       double scannedFileFraction = ((double) length()) / (file().fileSizeInBytes() - splitOffset);
       return (long) (scannedFileFraction * file().recordCount());
   
   Given that file().recordCount() corresponds to file().fileSizeInBytes(),
   and length() is the length of the split (if my understanding is correct),
   the subtraction of splitOffset should not be needed, as per my understanding...
   Why is splitOffset being subtracted from the file size?
   
   




[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

Posted by GitBox <gi...@apache.org>.
ahshahid commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1352520299

   The other issue I was looking at was comparing the performance of Parquet with Iceberg, and there it seems that Iceberg, because of its better size estimation compared to Parquet, results in joins which tend to be less performant than with Parquet. But that is an issue on the Spark side.




[GitHub] [iceberg] RussellSpitzer commented on issue #6424: The size estimation formula for spark task is incorrect

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351768402

   > now if total row count of a split/file = (scannedFileFraction * file().recordCount())
   
   This is, I think, the confusion: we are attempting to determine how many rows are in this split specifically, because we are summing over all splits.
   
   https://github.com/apache/iceberg/blob/7fd9ded0a119c050746d765bd90c59fef93506b1/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L143
   
   A file may be turned into multiple FileScanTasks, so we can only count the number of rows contributed by the tasks we are currently looking at. Let's say we do a scan which only ends up touching file A, but file A is divided into multiple FileScanTasks (or now ContentScanTasks) for read parallelism: A1, A2, and A3.
   
   If they have lengths 2/5 of A, 2/5 of A, and 1/5 of A, then we do the following math:
   sum over all tasks = (2/5 * total A rows) + (2/5 * total A rows) + (1/5 * total A rows) = total A rows.
   This should be the total number of rows in the scan.
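   
   As a minimal sketch of that summation, with hypothetical task lengths:
   
       // Hypothetical: file A has 1,000 rows and 5,000 data bytes, split into
       // three tasks covering 2/5, 2/5, and 1/5 of the data bytes.
       long totalRows = 1_000L;
       long dataBytes = 5_000L;
       long[] taskLengths = {2_000L, 2_000L, 1_000L};
   
       long estimatedScanRows = 0;
       for (long taskLength : taskLengths) {
         // Per-task estimate: the fraction of the file covered by this task
         // times the whole-file record count.
         double fraction = ((double) taskLength) / dataBytes;
         estimatedScanRows += (long) (fraction * totalRows);
       }
       // estimatedScanRows == 1,000, the full row count of file A.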
   
   




[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

Posted by GitBox <gi...@apache.org>.
ahshahid commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351939955

   I see. Let me see if I can explain what I mean with a test...




[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

Posted by GitBox <gi...@apache.org>.
ahshahid commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351885991

   Right... that I agree with. Maybe along with the split offset (which is the start of the split), we also need the end of the split.
   
   But still, please allow me to describe this simplified case, where the split is the same as the file being considered, so that the split offset is 0,
   and assume split size = file size.
   
   Now, as per the current formula:
   splitOffset = 0
   so `double scannedFileFraction = ((double) length()) / (file().fileSizeInBytes());`
   Here `length()` is the number of bytes scanned (only partially read)
   and `file().recordCount()` is the number of records read in that partial scan...
   Right?
   Now scannedFileFraction is < 1 (assume that the total file is not scanned).
   Say scannedFileFraction = 0.5
   and numRowsInScan = recordCount = 50.
   
   So ideally the total estimated row count in that file should be 100.
   
   But with the current formula we will get total row count = 0.5 * 50 = 25.
   
   But with the corrected formula it will be = 50 / 0.5 = 100,
   
   because in the corrected formula the fraction is (file().fileSizeInBytes()) / length() = 2.
   
   

