You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "aokolnychyi (via GitHub)" <gi...@apache.org> on 2023/04/28 19:38:58 UTC

[GitHub] [iceberg] aokolnychyi opened a new issue, #7465: Pick split size automatically

aokolnychyi opened a new issue, #7465:
URL: https://github.com/apache/iceberg/issues/7465

   ### Feature Request / Improvement
   
   I wonder whether we can pick the split size automatically instead of relying on users to supply a correct value. For instance, if the scheduler is FIFO, can we use the default cluster parallelism and the size of the data to be processed to come up with an optimal split size? We first find matching files and then plan splits so the split size can be dynamic, we just need a good way to estimate it correctly. It would be great if someone could investigate this a bit more.
   
   ### Query engine
   
   Spark


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [I] Pick split size automatically [iceberg]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #7465:
URL: https://github.com/apache/iceberg/issues/7465#issuecomment-1817275291

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [I] Pick split size automatically [iceberg]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #7465:
URL: https://github.com/apache/iceberg/issues/7465#issuecomment-1783948766

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on issue #7465: Pick split size automatically

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on issue #7465:
URL: https://github.com/apache/iceberg/issues/7465#issuecomment-1528352395

   Oh, I did not know there was a previous discussion! Thanks for linking it here. Would you be interested to look into this, @singhpk234? I believe we can come up with something for Spark given that can access the default parallelism. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on issue #7465: Pick split size automatically

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.
singhpk234 commented on issue #7465:
URL: https://github.com/apache/iceberg/issues/7465#issuecomment-1528316419

   linking previous discussions: 
   - https://github.com/apache/iceberg/issues/7071


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on issue #7465: Pick split size automatically

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on issue #7465:
URL: https://github.com/apache/iceberg/issues/7465#issuecomment-1529077839

   Yeah this would also be very helpful for Trino because it can consume a large number of splits, most of the time the scan node does not fully utilize the cluster.
   
   If the cluster size can be accessed, what would be a good default behavior? For example for a 30 node cluster, are we saying we should generate at least 30 splits? Or would it be even more optimal to generate a multiple of that like 60 or 90 splits? What would be a good criteria?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on issue #7465: Pick split size automatically

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.
singhpk234 commented on issue #7465:
URL: https://github.com/apache/iceberg/issues/7465#issuecomment-1528361551

   > Would you be interested to look into this, @singhpk234? 
   
   +1, @aokolnychyi, Happy to work on it !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on issue #7465: Pick split size automatically

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on issue #7465:
URL: https://github.com/apache/iceberg/issues/7465#issuecomment-1529318415

   Historically, Spark advised users to have multiple either 1 or 2 or 4 tasks per slot. I am not sure about other engines, though. I think 2x slots is a good target but we should take the split size into account as well. If it gets too large, pick a bigger number of tasks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on issue #7465: Pick split size automatically

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on issue #7465:
URL: https://github.com/apache/iceberg/issues/7465#issuecomment-1528471963

   @rdblue @szehon-ho @RussellSpitzer @flyrain @jackye1995 @amogh-jahagirdar @nastra, any thoughts on when this may not work or should not be done?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [I] Pick split size automatically [iceberg]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed issue #7465: Pick split size automatically
URL: https://github.com/apache/iceberg/issues/7465


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org