Posted to github@arrow.apache.org by "baharberna (via GitHub)" <gi...@apache.org> on 2023/05/12 20:21:27 UTC

[GitHub] [arrow-datafusion] baharberna opened a new pull request, #6346: Sort Preserving Repartition exec

baharberna opened a new pull request, #6346:
URL: https://github.com/apache/arrow-datafusion/pull/6346

   
   # Rationale for this change
   
   When RepartitionExec handles multiple input partitions, it creates N channels for each input partition, where N is the output partition count, for a total of input_partitions * output_partitions channels. Each output partition then pulls from its channels in whatever order batches become available, which depends on processing time. When the input partition count is greater than 1, this interleaving makes the order of records within each output partition unpredictable. Addressing this requires an operator that keeps the existing hash and round-robin partitioning functionality while preserving the original order of records within partitions, even when the input partition count is greater than 1.
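   
   The ordering problem can be illustrated with a small standalone sketch. It does not use DataFusion's API: `channel_count`, `drain`, and `is_non_decreasing` are hypothetical helpers, and the arrival schedule is fixed by hand to stand in for timing-dependent behavior.

```rust
/// Number of channels, per the description above:
/// one channel per (input partition, output partition) pair.
fn channel_count(input_partitions: usize, output_partitions: usize) -> usize {
    input_partitions * output_partitions
}

/// Simulates one output partition draining its input channels.
/// `arrival` lists which input channel delivers its next row at each step.
/// In a real run this depends on timing; here it is fixed for illustration.
fn drain(inputs: &[Vec<i32>], arrival: &[usize]) -> Vec<i32> {
    let mut cursors = vec![0usize; inputs.len()];
    let mut out = Vec::new();
    for &ch in arrival {
        if let Some(&v) = inputs[ch].get(cursors[ch]) {
            out.push(v);
            cursors[ch] += 1;
        }
    }
    out
}

/// Returns true if the values never decrease from one row to the next.
fn is_non_decreasing(v: &[i32]) -> bool {
    v.windows(2).all(|w| w[0] <= w[1])
}

fn main() {
    // Two input partitions, each individually sorted.
    let inputs = vec![vec![1, 4, 7], vec![2, 3, 9]];
    println!("channels for 2 inputs x 3 outputs: {}", channel_count(2, 3));

    // Single input: the output can only follow that input's own order.
    let single = drain(&inputs[..1], &[0, 0, 0]);
    println!("{:?} sorted = {}", single, is_non_decreasing(&single));

    // Two inputs: the arrival schedule interleaves the channels.
    let mixed = drain(&inputs, &[0, 0, 1, 1, 0, 1]);
    println!("{:?} sorted = {}", mixed, is_non_decreasing(&mixed));
}
```

   With this schedule the two-input case yields [1, 4, 2, 3, 7, 9], which is no longer sorted even though both inputs were.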
   
   # What changes are included in this PR?
   
   This PR adds SortPreservingRepartitionExec, which implements the ExecutionPlan trait and its associated APIs.
   The sort-preserving repartition operator maps N input partitions to M output partitions according to a partitioning scheme while preserving their order. To achieve this, we use SortPreservingMergeStream: we first merge the input partitions into a single output stream that preserves their order, then feed that stream into RepartitionExec. Since RepartitionExec preserves order when the number of input partitions is one, we reach our goal, hopefully :)
   In short, SortPreservingRepartitionExec applies SortPreservingMergeStream first and RepartitionExec second.
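   
   The merge-then-repartition idea can be sketched on plain vectors. This is only an illustration under stated assumptions, not the operator's code: `merge_sorted` and `repartition` are hypothetical names, rows are bare u64 sort keys, and `v % m` stands in for a real hash function.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merges individually sorted partitions into one sorted sequence
/// using a min-heap keyed on (value, partition index, row index).
fn merge_sorted(partitions: &[Vec<u64>]) -> Vec<u64> {
    let mut heap = BinaryHeap::new();
    for (p, part) in partitions.iter().enumerate() {
        if let Some(&v) = part.first() {
            heap.push(Reverse((v, p, 0usize)));
        }
    }
    let mut merged = Vec::new();
    while let Some(Reverse((v, p, i))) = heap.pop() {
        merged.push(v);
        // Refill the heap with the next row from the same partition.
        if let Some(&next) = partitions[p].get(i + 1) {
            heap.push(Reverse((next, p, i + 1)));
        }
    }
    merged
}

/// Routes rows of a single ordered stream to `m` outputs (m must be > 0).
/// `v % m` is a stand-in for hashing, an assumption of this sketch.
fn repartition(stream: &[u64], m: usize) -> Vec<Vec<u64>> {
    let mut outputs = vec![Vec::new(); m];
    for &v in stream {
        outputs[(v % m as u64) as usize].push(v);
    }
    outputs
}

fn main() {
    // N = 3 sorted input partitions, M = 2 output partitions.
    let inputs = vec![vec![1, 4, 7, 10], vec![2, 3, 9], vec![5, 6, 8]];
    let merged = merge_sorted(&inputs);
    let outputs = repartition(&merged, 2);
    for (i, out) in outputs.iter().enumerate() {
        println!("output {}: {:?}", i, out);
    }
}
```

   Because the second step reads a single ordered stream, each output receives its rows in sorted order: here [2, 4, 6, 8, 10] and [1, 3, 5, 7, 9].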
   
   # Are these changes tested?
   
   Tests are not included in this PR. There are some tests in sort_enforcement.rs, but they are failing :(
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6346: Sort Preserving Repartition exec

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6346:
URL: https://github.com/apache/arrow-datafusion/pull/6346#issuecomment-1613936010

   See https://github.com/apache/arrow-datafusion/pull/6742 for reference.




[GitHub] [arrow-datafusion] ozankabak commented on pull request #6346: Sort Preserving Repartition exec

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on PR #6346:
URL: https://github.com/apache/arrow-datafusion/pull/6346#issuecomment-1613905920

   The successor of this work has been merged, so I am closing this PR for housekeeping.




[GitHub] [arrow-datafusion] alamb commented on pull request #6346: Sort Preserving Repartition exec

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6346:
URL: https://github.com/apache/arrow-datafusion/pull/6346#issuecomment-1568372741

   @baharberna -- I have not reviewed this test in detail, but @tustvold mentioned to me this morning that one subtlety about the current merging algorithms in DataFusion is that they are stable:
   
   https://github.com/apache/arrow-datafusion/blob/0d9c542c84f68dad42eaa0d26a55810cdd5cff2b/datafusion/core/src/physical_plan/sorts/sort_preserving_merge.rs#L58-L60
   
   I wonder if this operator implementation is stable 🤔 
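   
   For readers unfamiliar with the term, here is a minimal sketch of what a stable merge looks like, assuming stability means that rows with equal sort keys are emitted in input partition order. `stable_merge` is a hypothetical helper written for this illustration and is not taken from DataFusion.

```rust
/// Merges two runs sorted by key. On equal keys the row from `left`
/// (the lower partition index) is emitted first, which makes the merge stable.
fn stable_merge<'a>(
    left: &[(i32, &'a str)],
    right: &[(i32, &'a str)],
) -> Vec<(i32, &'a str)> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::with_capacity(left.len() + right.len());
    while i < left.len() && j < right.len() {
        // Using `<=` rather than `<` is what keeps ties in input order.
        if left[i].0 <= right[j].0 {
            out.push(left[i]);
            i += 1;
        } else {
            out.push(right[j]);
            j += 1;
        }
    }
    // One side is exhausted; append whatever remains of the other.
    out.extend_from_slice(&left[i..]);
    out.extend_from_slice(&right[j..]);
    out
}

fn main() {
    // Both partitions contain the same keys; the labels show the origin.
    let p0 = [(1, "p0-a"), (2, "p0-b")];
    let p1 = [(1, "p1-a"), (2, "p1-b")];
    println!("{:?}", stable_merge(&p0, &p1));
}
```

   For each key, the row labelled p0 appears before the row labelled p1. An unstable merge would be free to emit them in either order.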




[GitHub] [arrow-datafusion] alamb commented on pull request #6346: Sort Preserving Repartition exec

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6346:
URL: https://github.com/apache/arrow-datafusion/pull/6346#issuecomment-1568375045

   FWIW I filed https://github.com/apache/arrow-datafusion/issues/6486 to track this feature.




[GitHub] [arrow-datafusion] mingmwang commented on pull request #6346: Sort Preserving Repartition exec

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #6346:
URL: https://github.com/apache/arrow-datafusion/pull/6346#issuecomment-1570481384

   I can take a look late this week.




[GitHub] [arrow-datafusion] ozankabak closed pull request #6346: Sort Preserving Repartition exec

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak closed pull request #6346: Sort Preserving Repartition exec
URL: https://github.com/apache/arrow-datafusion/pull/6346

