Posted to dev@uniffle.apache.org by GitBox <gi...@apache.org> on 2022/08/06 03:17:10 UTC

[GitHub] [incubator-uniffle] jerqi opened a new issue, #137: [Improvement][Aqe] Sort MapId before the data are flushed

jerqi opened a new issue, #137:
URL: https://github.com/apache/incubator-uniffle/issues/137

   When we use AQE, we need to use the mapId to filter out the data that we don't need. We split the data into segments; if a segment doesn't contain the data we want to read, we drop it. If we sort by MapId before the data are flushed, we can filter out more data and improve performance.
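
   A minimal sketch of the segment filtering described above. The `Block` type, the fixed segment size, and the helper names are assumptions made for illustration; they are not Uniffle's actual API.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    map_id: int   # id of the map task that produced the block
    length: int   # payload size; unused here, kept for realism


def split_into_segments(blocks: List[Block], blocks_per_segment: int) -> List[List[Block]]:
    """Cut the flushed blocks into fixed-size segments (hypothetical layout)."""
    return [blocks[i:i + blocks_per_segment]
            for i in range(0, len(blocks), blocks_per_segment)]


def segments_to_read(segments: List[List[Block]], start_map_id: int, end_map_id: int) -> List[List[Block]]:
    """Keep only segments holding at least one block with start <= map_id < end."""
    return [seg for seg in segments
            if any(start_map_id <= b.map_id < end_map_id for b in seg)]


unsorted_blocks = [Block(m, 1) for m in [0, 5, 1, 4, 2, 3]]
sorted_blocks = sorted(unsorted_blocks, key=lambda b: b.map_id)

# A reader that only wants mapId 0 and 1.
unsorted_hits = segments_to_read(split_into_segments(unsorted_blocks, 2), 0, 2)
sorted_hits = segments_to_read(split_into_segments(sorted_blocks, 2), 0, 2)
print(len(unsorted_hits), len(sorted_hits))
```

   In this toy layout the reader touches two segments when the blocks are unsorted and one segment when they are sorted by mapId.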


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] zuston commented on issue #137: [Improvement][AQE] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

zuston commented on issue #137:
URL: https://github.com/apache/incubator-uniffle/issues/137#issuecomment-1294610968

   > > > It's better to sort by MapId before the data are flushed. It won't add much cost for non-AQE-optimized stages.
   > > 
   > > 
   > > Does the data need to be sorted by mapId?
   > 
   > Yes, we only need local order. With local order, we can filter out much of the data effectively.
   
   Emm... I remember you preferred sorting only the index file rather than the data file, as mentioned in the offline meeting. Did I misunderstand you?



[GitHub] [incubator-uniffle] zuston commented on issue #137: [Improvement][Aqe] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

zuston commented on issue #137:
URL: https://github.com/apache/incubator-uniffle/issues/137#issuecomment-1229444698

   Do we need to sort one partition's data by MapId before flushing for all jobs? I don't think so. This would add unnecessary cost for stages that are not AQE-optimized. Maybe we could sort the partition data by MapId the first time an AQE-specified `ShufflePartitionSpec` is applied.



[GitHub] [incubator-uniffle] zuston commented on issue #137: [Improvement][AQE] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

zuston commented on issue #137:
URL: https://github.com/apache/incubator-uniffle/issues/137#issuecomment-1294530345

   I have written a design proposal for this issue: https://docs.google.com/document/d/1G0cOFVJbYLf2oX1fiadh7zi2M6DlEcjTQTh4kSkb0LA/edit?usp=sharing
   
   PTAL @jerqi 



[GitHub] [incubator-uniffle] jerqi commented on issue #137: [Improvement][Aqe] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #137:
URL: https://github.com/apache/incubator-uniffle/issues/137#issuecomment-1229465934

   You are right.



[GitHub] [incubator-uniffle] jerqi commented on issue #137: [Improvement][AQE] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #137:
URL: https://github.com/apache/incubator-uniffle/issues/137#issuecomment-1294551812

   It's better to sort by MapId before the data are flushed. It won't add much cost for non-AQE-optimized stages.



[GitHub] [incubator-uniffle] zuston commented on issue #137: [Improvement][Aqe] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

zuston commented on issue #137:
URL: https://github.com/apache/incubator-uniffle/issues/137#issuecomment-1272445076

   Have you implemented this in your internal version? If not, I'm interested in working on this.



[GitHub] [incubator-uniffle] jerqi commented on issue #137: [Improvement][AQE] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #137:
URL: https://github.com/apache/incubator-uniffle/issues/137#issuecomment-1294556594

   > > It's better to sort by MapId before the data are flushed. It won't add much cost for non-AQE-optimized stages.
   > 
   > Does the data need to be sorted by mapId?
   
   Yes, we only need local order. With local order, we can filter out much of the data effectively.



[GitHub] [incubator-uniffle] jerqi commented on issue #137: [Improvement][AQE] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #137:
URL: https://github.com/apache/incubator-uniffle/issues/137#issuecomment-1294701430

   > > taskId-1 block, taskId-2 block, taskId-3 block, taskId-1 block, taskId-2 block, taskId-6 block.
   > 
   > If a reader wants the data from taskId=1, it still has to read the data segment `taskId-1 block, taskId-2 block, taskId-3 block, taskId-1 block`. The data of `taskId-2 block, taskId-3 block` is unnecessary for this reader. Right?
   
   Yes. 



[GitHub] [incubator-uniffle] zuston commented on issue #137: [Improvement][AQE] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

zuston commented on issue #137:
URL: https://github.com/apache/incubator-uniffle/issues/137#issuecomment-1294715154

   > > This looks ineffective, and it's the same as the original block filter.
   > 
   > Actually, considering random I/O, it costs the same time to read 3 records as 2 records.
   
   Yes. According to the problems described in the motivation section of the design proposal, the key point is that a lot of data is read multiple times, depending on the number of splits chosen by AQE. From this perspective, we should sort the data file.
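
   A rough sketch of the repeated-read concern, under assumed numbers: 4 map tasks, 16 blocks, segments of 4 blocks, and AQE splitting the partition into 4 readers with one mapId each. The function name and the segment model are hypothetical and only illustrate the counting.

```python
def total_segment_reads(layout, segment_size, num_splits, num_maps):
    """Count segments touched across all readers, one reader per mapId range."""
    segments = [layout[i:i + segment_size]
                for i in range(0, len(layout), segment_size)]
    maps_per_split = num_maps // num_splits
    total = 0
    for split in range(num_splits):
        lo, hi = split * maps_per_split, (split + 1) * maps_per_split
        # A reader must fetch every segment that holds at least one wanted block.
        total += sum(1 for seg in segments if any(lo <= m < hi for m in seg))
    return total


# Blocks arriving round-robin from map tasks 0..3 (assumed arrival order).
interleaved = [i % 4 for i in range(16)]
# The same blocks with the data file ordered by mapId.
ordered = sorted(interleaved)

interleaved_reads = total_segment_reads(interleaved, 4, 4, 4)
ordered_reads = total_segment_reads(ordered, 4, 4, 4)
print(interleaved_reads, ordered_reads)
```

   With the interleaved layout every reader touches every segment, so the total grows with the number of splits; with the ordered layout each reader touches only its own segment.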



[GitHub] [incubator-uniffle] zuston commented on issue #137: [Improvement][AQE] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

zuston commented on issue #137:
URL: https://github.com/apache/incubator-uniffle/issues/137#issuecomment-1301958156

   #293 



[GitHub] [incubator-uniffle] jerqi commented on issue #137: [Improvement][AQE] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #137:
URL: https://github.com/apache/incubator-uniffle/issues/137#issuecomment-1294629015

   > > > > It's better to sort by MapId before the data are flushed. It won't add much cost for non-AQE-optimized stages.
   > > > 
   > > > 
   > > > Does the data need to be sorted by mapId?
   > > 
   > > 
   > > Yes, we only need local order. With local order, we can filter out much of the data effectively.
   > 
   > Emm... I remember you preferred sorting only the index file rather than the data file, as mentioned in the offline meeting. Did I misunderstand you?
   
   Here is an example:
   We have three buffers to flush: taskId 1 block, taskId 2 block, taskId 3 block. We sort them into taskId 1 block, taskId 2 block, taskId 3 block, and then flush them to disk. Then we receive taskId 2 block, taskId 6 block, taskId 1 block; we sort and flush them as well, so the data on disk is now
   taskId 1 block, taskId 2 block, taskId 3 block, taskId 1 block, taskId 2 block, taskId 6 block.
   The data only has local order.
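
   The example above can be sketched in a few lines, assuming blocks are represented only by their taskId and the "disk" is a plain list; `flush_sorted` is a hypothetical helper, not a Uniffle function.

```python
def flush_sorted(disk, buffered_task_ids):
    """Sort one batch of buffered blocks by taskId, then append it to the disk."""
    disk.extend(sorted(buffered_task_ids))


disk = []
flush_sorted(disk, [1, 2, 3])   # first flush
flush_sorted(disk, [2, 6, 1])   # second flush, received out of order
print(disk)  # [1, 2, 3, 1, 2, 6]
```

   Each flushed batch is ordered on its own, but the file as a whole is not, which is what local order means here.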



[GitHub] [incubator-uniffle] jerqi commented on issue #137: [Improvement][AQE] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #137:
URL: https://github.com/apache/incubator-uniffle/issues/137#issuecomment-1294709541

   > This looks ineffective, and it's the same as the original block filter.
   
   Actually, considering random I/O, it costs the same time to read 3 records as 2 records.



[GitHub] [incubator-uniffle] zuston commented on issue #137: [Improvement][AQE] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

zuston commented on issue #137:
URL: https://github.com/apache/incubator-uniffle/issues/137#issuecomment-1294682609

   > taskId-1 block, taskId-2 block, taskId-3 block, taskId-1 block, taskId-2 block, taskId-6 block.
   
   If a reader wants the data from taskId=1, it still has to read the data segment `taskId-1 block, taskId-2 block, taskId-3 block, taskId-1 block`. The data of `taskId-2 block, taskId-3 block` is unnecessary for this reader. Right?
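
   To make the question concrete, here is a small sketch that assumes the reader fetches one contiguous span from the first to the last block of the wanted taskId. `read_span` is a hypothetical helper used only for this illustration.

```python
def read_span(layout, wanted_task_id):
    """Return the contiguous run of blocks from the first to the last wanted block."""
    positions = [i for i, task_id in enumerate(layout) if task_id == wanted_task_id]
    if not positions:
        return []
    return layout[positions[0]:positions[-1] + 1]


layout = [1, 2, 3, 1, 2, 6]          # on-disk order from the quoted example
span = read_span(layout, 1)           # what a reader for taskId=1 fetches
wasted = [t for t in span if t != 1]  # blocks read but not needed
print(span, wasted)
```

   Under that assumption the reader fetches four blocks and discards the two that belong to taskId 2 and taskId 3.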
   



[GitHub] [incubator-uniffle] jerqi closed issue #137: [Improvement][AQE] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

jerqi closed issue #137: [Improvement][AQE] Sort MapId before the data are flushed
URL: https://github.com/apache/incubator-uniffle/issues/137



[GitHub] [incubator-uniffle] jerqi commented on issue #137: [Improvement][Aqe] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #137:
URL: https://github.com/apache/incubator-uniffle/issues/137#issuecomment-1272445272

   No. You can go ahead.



[GitHub] [incubator-uniffle] zuston commented on issue #137: [Improvement][AQE] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

zuston commented on issue #137:
URL: https://github.com/apache/incubator-uniffle/issues/137#issuecomment-1294554545

   > It's better to sort by MapId before the data are flushed. It won't add much cost for non-AQE-optimized stages.
   
   Does the data need to be sorted by mapId?



[GitHub] [incubator-uniffle] zuston commented on issue #137: [Improvement][AQE] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

zuston commented on issue #137:
URL: https://github.com/apache/incubator-uniffle/issues/137#issuecomment-1294707361

   This looks ineffective, and it's the same as the original block filter.



[GitHub] [incubator-uniffle] jerqi commented on issue #137: [Improvement][AQE] Sort MapId before the data are flushed

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #137:
URL: https://github.com/apache/incubator-uniffle/issues/137#issuecomment-1294727847

   > > > This looks ineffective, and it's the same as the original block filter.
   > > 
   > > 
   > > Actually, considering random I/O, it costs the same time to read 3 records as 2 records.
   > 
   > Yes. According to the problems described in the motivation section of the design proposal, the key point is that a lot of data is read multiple times, depending on the number of splits chosen by AQE. From this view, we should sort the data file.
   
   We don't need global order; local order should be enough.

