Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/27 06:15:57 UTC

[GitHub] [iceberg] zhangjun0x01 opened a new issue #1667: Optimize generation of CombinedScanTask for RewriteDataFilesAction

zhangjun0x01 opened a new issue #1667:
URL: https://github.com/apache/iceberg/issues/1667


   In RewriteDataFilesAction, the default value of targetSizeInBytes is 128M. Suppose there are the following data files: 20M, 20M, 20M, 70M, 100M. The current logic scans these data files in order and keeps adding files to the current task while the sum of their sizes is <= targetSizeInBytes,
   so three CombinedScanTask tasks are generated: (20M, 20M, 20M), (70M), (100M).
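
The in-order behaviour described here can be modelled with a few lines of Python. This is only a simplified sketch of the logic as the issue describes it, not Iceberg's actual implementation; the function name `pack_in_order` and the use of plain integers (sizes in M) are assumptions made for illustration.

```python
def pack_in_order(sizes, target):
    """Group file sizes in scan order; start a new group when the next
    file would push the running total above the target.
    Hypothetical helper, not part of Iceberg."""
    groups = []
    current = []
    current_total = 0
    for size in sizes:
        # Close the current group if adding this file would exceed the target.
        if current and current_total + size > target:
            groups.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        groups.append(current)
    return groups


print(pack_in_order([20, 20, 20, 70, 100], 128))
# → [[20, 20, 20], [70], [100]]
```

Because a group is closed as soon as the next file does not fit, the 70M and 100M files each end up alone even though earlier groups still had room.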
   
   It would be better to generate two CombinedScanTask tasks: (20M, 20M, 70M) and (20M, 100M).
   
   We should optimize this algorithm to generate as few target data files as possible and to make each file's size as close to targetSizeInBytes as possible.
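
One well-known heuristic that produces the two-task layout for this example is first-fit decreasing bin packing. The sketch below is an illustration under assumptions: the issue does not name this technique, the function name `pack_first_fit_decreasing` is hypothetical, and sizes are plain integers in M. The heuristic does not guarantee the minimum number of groups in general.

```python
def pack_first_fit_decreasing(sizes, target):
    """First-fit decreasing: sort sizes from largest to smallest and put
    each one into the first group that still has room for it.
    Hypothetical helper, not part of Iceberg."""
    groups = []   # each group is a list of file sizes
    totals = []   # running total for each group
    for size in sorted(sizes, reverse=True):
        for i, total in enumerate(totals):
            if total + size <= target:
                groups[i].append(size)
                totals[i] += size
                break
        else:
            # No existing group has room, so open a new one.
            groups.append([size])
            totals.append(size)
    return groups


print(pack_first_fit_decreasing([20, 20, 20, 70, 100], 128))
# → [[100, 20], [70, 20, 20]]
```

For the five files in the example this yields two groups of 120M and 110M, which matches the layout suggested in the issue up to ordering.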


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #1667: Optimize generation of CombinedScanTask for RewriteDataFilesAction

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1667:
URL: https://github.com/apache/iceberg/issues/1667#issuecomment-959180907


   We should really get my Medium Files PR in first, which would fix the
   suggested layout above (if the blocks were the right sizes):
   https://github.com/apache/iceberg/pull/3292
   
   On Wed, Nov 3, 2021 at 4:11 AM kingeasternsun ***@***.***>
   wrote:
   
   > It seems like a Knapsack problem



[GitHub] [iceberg] kingeasternsun commented on issue #1667: Optimize generation of CombinedScanTask for RewriteDataFilesAction

Posted by GitBox <gi...@apache.org>.
kingeasternsun commented on issue #1667:
URL: https://github.com/apache/iceberg/issues/1667#issuecomment-958765565


   It seems like a Knapsack problem 



[GitHub] [iceberg] kingeasternsun removed a comment on issue #1667: Optimize generation of CombinedScanTask for RewriteDataFilesAction

Posted by GitBox <gi...@apache.org>.
kingeasternsun removed a comment on issue #1667:
URL: https://github.com/apache/iceberg/issues/1667#issuecomment-958765565


   It seems like a Knapsack problem 




[GitHub] [iceberg] zhangjun0x01 commented on issue #1667: Optimize generation of CombinedScanTask for RewriteDataFilesAction

Posted by GitBox <gi...@apache.org>.
zhangjun0x01 commented on issue #1667:
URL: https://github.com/apache/iceberg/issues/1667#issuecomment-718293764


   My idea is to use a dynamic programming algorithm to get an optimal result, but I haven't implemented it yet. I will think about how to do it and run a test later.
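
The comment does not say which dynamic programming formulation is meant. One possible reading, sketched here purely as an assumption, is a subset-sum DP that fills one group as close to the target as possible and then repeats on the remaining files. The names `best_subset` and `pack_by_subset_sum` are hypothetical, sizes are plain integers in M, and repeating the DP this way is still a heuristic that is not guaranteed to minimize the number of groups.

```python
def best_subset(sizes, target):
    """Subset-sum DP: return indices of a subset whose total is as large
    as possible without exceeding the target. Hypothetical helper."""
    reachable = {0: []}  # total -> indices of one subset reaching it
    for i, size in enumerate(sizes):
        # Iterate over a snapshot so each file is used at most once.
        for total, idxs in list(reachable.items()):
            new_total = total + size
            if new_total <= target and new_total not in reachable:
                reachable[new_total] = idxs + [i]
    return reachable[max(reachable)]


def pack_by_subset_sum(sizes, target):
    """Repeatedly carve out the fullest possible group from what is left."""
    remaining = list(sizes)
    groups = []
    while remaining:
        idxs = best_subset(remaining, target)
        if not idxs:
            # Nothing fits (e.g. a file larger than the target):
            # give the first remaining file a group of its own.
            idxs = [0]
        chosen = set(idxs)
        groups.append([remaining[i] for i in idxs])
        remaining = [s for i, s in enumerate(remaining) if i not in chosen]
    return groups


print(pack_by_subset_sum([20, 20, 20, 70, 100], 128))
# → [[20, 100], [20, 20, 70]]
```

The table of reachable totals has at most target + 1 entries, so the cost of each pass grows with the number of files times the target size measured in whatever unit the sizes use.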



[GitHub] [iceberg] rdblue commented on issue #1667: Optimize generation of CombinedScanTask for RewriteDataFilesAction

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1667:
URL: https://github.com/apache/iceberg/issues/1667#issuecomment-718266022


   What alternative algorithm would you suggest? The current algorithm is simple and I'm sure could be improved.

