Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/01 03:12:17 UTC

[GitHub] [iceberg] chenwyi2 opened a new issue, #6326: estimateStatistics cost mush time to compute stats

chenwyi2 opened a new issue, #6326:
URL: https://github.com/apache/iceberg/issues/6326

   ### Query engine
   
   spark 3.1
   
   ### Question
   
   The Spark driver spends too much time computing stats, because it lists all the files to get them. The thread dump is below:
   ![reliao_img_1669864216597](https://user-images.githubusercontent.com/19389434/204957073-f7672b97-4ad5-418a-a5b1-2f50ec020146.png)
   The code is:
   ````java
   for (CombinedScanTask task : tasks()) {
     for (FileScanTask file : task.files()) {
       // TODO: if possible, take deletes also into consideration.
       double fractionOfFileScanned = ((double) file.length()) / file.file().fileSizeInBytes();
       numRows += (fractionOfFileScanned * file.file().recordCount());
     }
   }
   ````
   Maybe we could add an option to disable estimateStatistics?
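
To make the arithmetic in the loop above concrete, here is a minimal, self-contained sketch. The `Slice` class and its field names are hypothetical stand-ins for the values the loop reads from each `FileScanTask`; they are not Iceberg's API. Each file's record count is scaled by the fraction of the file's bytes that the scan task covers.

```java
import java.util.Arrays;
import java.util.List;

class RowEstimate {
  // Hypothetical stand-in for the three values read from a FileScanTask.
  static final class Slice {
    final long sliceLength;      // bytes of the file covered by this task
    final long fileSizeInBytes;  // total size of the data file
    final long recordCount;      // total records in the data file

    Slice(long sliceLength, long fileSizeInBytes, long recordCount) {
      this.sliceLength = sliceLength;
      this.fileSizeInBytes = fileSizeInBytes;
      this.recordCount = recordCount;
    }
  }

  // Sum of (fraction of file scanned) * (records in file) over all slices.
  static double estimateRows(List<Slice> slices) {
    double numRows = 0;
    for (Slice s : slices) {
      double fractionOfFileScanned = ((double) s.sliceLength) / s.fileSizeInBytes;
      numRows += fractionOfFileScanned * s.recordCount;
    }
    return numRows;
  }

  public static void main(String[] args) {
    List<Slice> slices = Arrays.asList(
        new Slice(64, 128, 1000),   // half of the file: 500 rows
        new Slice(128, 128, 200));  // the whole file: 200 rows
    System.out.println(estimateRows(slices)); // prints 700.0
  }
}
```

The loop body itself is cheap: a division, a multiplication and an addition per file. Any large cost therefore has to come from producing the tasks that the loop iterates over.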


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] github-actions[bot] commented on issue #6326: estimateStatistics cost mush time to compute stats

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #6326:
URL: https://github.com/apache/iceberg/issues/6326#issuecomment-1592149961

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'




[GitHub] [iceberg] github-actions[bot] commented on issue #6326: estimateStatistics cost mush time to compute stats

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #6326:
URL: https://github.com/apache/iceberg/issues/6326#issuecomment-1569307733

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.




[GitHub] [iceberg] github-actions[bot] closed issue #6326: estimateStatistics cost mush time to compute stats

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed issue #6326: estimateStatistics cost mush time to compute stats
URL: https://github.com/apache/iceberg/issues/6326




[GitHub] [iceberg] RussellSpitzer commented on issue #6326: estimateStatistics cost mush time to compute stats

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #6326:
URL: https://github.com/apache/iceberg/issues/6326#issuecomment-1333946715

   While I don't have a problem with disabling statistics reporting, I am pretty dubious that this takes that long. What I believe you are actually seeing is the task list being created for the first time and stored in a list. We use a lazy iterator, which needs to be turned into a list before the job begins (even if statistics are not reported). This means that even if we don't spend the time iterating the list when estimating stats, we will spend that same amount of time later when planning tasks. The only difference is that in the current case the second access to "tasks()" is cached, so it's very fast.
   
   In this case the speed could probably be improved if the parallelism of the Manifest Reads was increased. 
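
The caching behavior described above can be illustrated with a small sketch. This is not Iceberg's implementation: `CachedTasks`, `planner` and `planCount` are hypothetical names, and the class is deliberately not thread-safe. It only shows the pattern of draining a lazy iterator into a list on first access and reusing that list afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

class CachedTasks<T> {
  private final Supplier<Iterator<T>> planner; // lazy and potentially expensive source
  private List<T> tasks;                       // null until the first call to tasks()
  int planCount = 0;                           // how many times planning actually ran

  CachedTasks(Supplier<Iterator<T>> planner) {
    this.planner = planner;
  }

  // The first call drains the lazy iterator into a list; later calls return the cached list.
  List<T> tasks() {
    if (tasks == null) {
      planCount++;
      List<T> materialized = new ArrayList<>();
      planner.get().forEachRemaining(materialized::add);
      tasks = materialized;
    }
    return tasks;
  }

  public static void main(String[] args) {
    CachedTasks<String> scan =
        new CachedTasks<>(() -> Arrays.asList("task-1", "task-2", "task-3").iterator());
    scan.tasks(); // e.g. called while estimating statistics: pays the planning cost
    scan.tasks(); // e.g. called while planning input partitions: served from the cache
    System.out.println(scan.planCount); // prints 1
  }
}
```

Under this pattern, skipping the first caller does not remove the planning cost; it moves it to whichever caller touches `tasks()` next.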




[GitHub] [iceberg] rbalamohan commented on issue #6326: estimateStatistics cost mush time to compute stats

Posted by GitBox <gi...@apache.org>.
rbalamohan commented on issue #6326:
URL: https://github.com/apache/iceberg/issues/6326#issuecomment-1334526348

   Check whether increasing "iceberg.worker.num-threads" helps in this case. The default should be the number of processors available in the system. It can be increased by setting it as a system property (try passing it via the Spark driver/executor JVM options).
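
As a rough illustration of how a JVM system property like this one can size a worker pool, here is a minimal sketch. It is an assumption-laden stand-in, not Iceberg's own code: `WorkerPoolSize` and `poolSize()` are hypothetical names, and the fallback simply follows the default described above. With Spark, the property would typically be passed as a `-Diceberg.worker.num-threads=32` JVM flag (32 is an arbitrary example value) through `spark.driver.extraJavaOptions`, and it likely has to be present when the JVM starts so that it is visible before the pool is created.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class WorkerPoolSize {
  static final String PROP = "iceberg.worker.num-threads";

  // Use the system property when it is set to a positive integer;
  // otherwise fall back to the number of available processors.
  static int poolSize() {
    int fallback = Runtime.getRuntime().availableProcessors();
    Integer configured = Integer.getInteger(PROP); // null if unset or not a number
    return (configured != null && configured > 0) ? configured : fallback;
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize());
    System.out.println("worker threads: " + poolSize());
    pool.shutdown();
  }
}
```

Running this with `java -Diceberg.worker.num-threads=32 WorkerPoolSize` would size the pool at 32 threads; without the flag it falls back to the processor count.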

