You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jaylmiller (via GitHub)" <gi...@apache.org> on 2023/02/10 17:02:21 UTC

[GitHub] [arrow-datafusion] jaylmiller opened a new pull request, #5242: Draft: Arrow Row Format in SortExec #5230

jaylmiller opened a new pull request, #5242:
URL: https://github.com/apache/arrow-datafusion/pull/5242

   # Which issue does this PR close?
   
   
   Closes #5230 .
   
   # Rationale for this change
   
   Modifying SortExec to use arrow row format should increase performance when sorting by multiple columns.
   
   # What changes are included in this PR?
   
   This is a draft PR and currently only the first iteration is implemented:
   > A first iteration could simply modify sort_batch to use the row format for multi-column sorts, as demonstrated [here](https://docs.rs/arrow-row/latest/arrow_row/#lexsort), falling back to sort_to_indices if only a single column.
   
   # Are these changes tested?
   
   Existing unit tests for SortExec cover this.
   
   # Are there any user-facing changes?
   
   no


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jaylmiller commented on pull request #5242: v1 Arrow Row Format in SortExec #5230

Posted by "jaylmiller (via GitHub)" <gi...@apache.org>.
jaylmiller commented on PR #5242:
URL: https://github.com/apache/arrow-datafusion/pull/5242#issuecomment-1428789786

   @alamb Does this SortExec benchmark seem good?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #5242: Arrow Row Format in SortExec #5230

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5242:
URL: https://github.com/apache/arrow-datafusion/pull/5242#issuecomment-1430328799

   >  @alamb Does this SortExec benchmark seem good?
   
   
   Yes, I think it looks like a good start. Thank you @jaylmiller . Could you please create a new PR with just the sort benchmark (so we can merge that and use it as a way to compare the code on master and this PR)?
   
   Thanks again@


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jaylmiller commented on pull request #5242: v1 Arrow Row Format in SortExec #5230

Posted by "jaylmiller (via GitHub)" <gi...@apache.org>.
jaylmiller commented on PR #5242:
URL: https://github.com/apache/arrow-datafusion/pull/5242#issuecomment-1427047729

   > Thanks @jaylmiller -- this is looking good. Very nice first contribution
   > 
   > I think this code could almost go in as it -- the only thing that I think we need to see is some sort of benchmark improvement.
   > 
   > I looked around in our existing benchmarks to find a benchmark for our sort operator, but was not able to. Maybe @tustvold or @Dandandan or @andygrove know of a preexisting benchmark
   > 
   > If not, perhaps we can adapt some of the code in https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/benches/merge.rs to make a SortExec benchmark
   
   Thanks! Should I add said benchmark to this PR then?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jaylmiller closed pull request #5242: Arrow Row Format in SortExec #5230

Posted by "jaylmiller (via GitHub)" <gi...@apache.org>.
jaylmiller closed pull request #5242: Arrow Row Format in SortExec #5230
URL: https://github.com/apache/arrow-datafusion/pull/5242


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #5242: Arrow Row Format in SortExec #5230

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5242:
URL: https://github.com/apache/arrow-datafusion/pull/5242#discussion_r1103772779


##########
datafusion/core/src/physical_plan/sorts/sort.rs:
##########
@@ -38,13 +38,14 @@ use crate::physical_plan::{
     RecordBatchStream, SendableRecordBatchStream, Statistics,
 };
 use crate::prelude::SessionConfig;
-use arrow::array::{make_array, Array, ArrayRef, MutableArrayData};
+use arrow::array::{make_array, Array, ArrayRef, MutableArrayData, UInt32Array};
 pub use arrow::compute::SortOptions;
 use arrow::compute::{concat, lexsort_to_indices, take, SortColumn, TakeOptions};
 use arrow::datatypes::SchemaRef;
 use arrow::error::ArrowError;
 use arrow::ipc::reader::FileReader;
 use arrow::record_batch::RecordBatch;
+use arrow::row::{OwnedRow, Row, RowConverter, SortField};

Review Comment:
   Using OwnedRow is likely to seriously hurt performance, as it will allocate per row.
   
   What do you think about getting the v1 in that doesn't preserve the row encoding, and then we can iterate on v2 in a separate PR. I suspect we will want to preserve rows, and a list of sorted indices.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jaylmiller commented on a diff in pull request #5242: Arrow Row Format in SortExec #5230

Posted by "jaylmiller (via GitHub)" <gi...@apache.org>.
jaylmiller commented on code in PR #5242:
URL: https://github.com/apache/arrow-datafusion/pull/5242#discussion_r1103812300


##########
datafusion/core/src/physical_plan/sorts/sort.rs:
##########
@@ -38,13 +38,14 @@ use crate::physical_plan::{
     RecordBatchStream, SendableRecordBatchStream, Statistics,
 };
 use crate::prelude::SessionConfig;
-use arrow::array::{make_array, Array, ArrayRef, MutableArrayData};
+use arrow::array::{make_array, Array, ArrayRef, MutableArrayData, UInt32Array};
 pub use arrow::compute::SortOptions;
 use arrow::compute::{concat, lexsort_to_indices, take, SortColumn, TakeOptions};
 use arrow::datatypes::SchemaRef;
 use arrow::error::ArrowError;
 use arrow::ipc::reader::FileReader;
 use arrow::record_batch::RecordBatch;
+use arrow::row::{OwnedRow, Row, RowConverter, SortField};

Review Comment:
   That sound's good. I will just revert this 2nd commit then. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org