You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/06 17:13:46 UTC

[GitHub] [arrow] waynexia opened a new pull request #8856: Arrow-10355: [Rust] Add support for list_sort

waynexia opened a new pull request #8856:
URL: https://github.com/apache/arrow/pull/8856


   This supports sort on `ListArray` (primitive array inside Variable Sized Array only for now).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8856: ARROW-10355: [Rust] Add support for list_sort

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-739675217


   Hi @waynexia . Thanks a lot for this. I think that the idea of a list_sort is a bit different: in the test you wrote, it would correspond to:
   
   ```
               vec![
                   Some(vec![1, 0]),
                   Some(vec![4, 3, 2, 1]),
                   Some(vec![2, 3, 4]),
                   Some(vec![3, 3, 3, 3]),
                   Some(vec![1, 1]),
               ],
               Some(SortOptions {
                   descending: false,
                   nulls_first: false,
               }),
               vec![
                   Some(vec![0, 1]),
                   Some(vec![1, 2, 3, 4]),
                   Some(vec![2, 3, 4]),
                   Some(vec![3, 3, 3, 3]),
                   Some(vec![1, 1]),
               ],
   ```
   
   i.e. it is a operator that operates on the individual entries of the array, and sorts them individually. This would correspond to the `array_sort` operator in Spark: https://spark.apache.org/docs/latest/api/sql/index.html#array_sort
   
   I would place this operator on a separate location, i.e. not in `sort`, to avoid confusion with the operator that sorts the array.
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] waynexia commented on pull request #8856: ARROW-10940: [Rust] Extend sort kernel to ListArray

Posted by GitBox <gi...@apache.org>.
waynexia commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-748602732


   Appreciate you too @jorgecarleitao, this cannot be done without your help.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8856: ARROW-10940: [Rust] Extend sort kernel to ListArray

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-746412925


   https://issues.apache.org/jira/browse/ARROW-10940


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] waynexia commented on pull request #8856: ARROW-10355: [Rust] Add support for list_sort

Posted by GitBox <gi...@apache.org>.
waynexia commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-745110619


   I found `SortExec` in `datafusion/physical_plan/sort.rs`, is it requested by ARROW-10355?
   
   One thing that may matter is `SortExec` requires every individual entry to be of the same length, unlike this [example](https://github.com/apache/arrow/pull/8856#issuecomment-739675217)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8856: Arrow-10355: [Rust] Add support for list_sort

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-739675217


   Hi @waynexia . Thanks a lot for this. I think that the idea of a list_sort is a bit different: in the test you wrote, it would correspond to:
   
   ```
               vec![
                   Some(vec![1, 0]),
                   Some(vec![4, 3, 2, 1]),
                   Some(vec![2, 3, 4]),
                   Some(vec![3, 3, 3, 3]),
                   Some(vec![1, 1]),
               ],
               Some(SortOptions {
                   descending: false,
                   nulls_first: false,
               }),
               vec![
                   Some(vec![0, 1]),
                   Some(vec![1, 2, 3, 4]),
                   Some(vec![2, 3, 4]),
                   Some(vec![3, 3, 3, 3]),
                   Some(vec![1, 1]),
               ],
   ```
   
   i.e. it is a operator that operates on the individual entries of the array, and sorts them individually. This would correspond to the `array_sort` operator in Spark: https://spark.apache.org/docs/latest/api/sql/index.html#array_sort
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] waynexia commented on pull request #8856: Arrow-10355: [Rust] Add support for list_sort

Posted by GitBox <gi...@apache.org>.
waynexia commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-739533505


   There are some points I'm not very clear about:
   1. Ordering. I didn't find a definition of `List` in the documentation or CPP implementation. So I use something like lexicographical order (refer to `sort_list()` in sort.rs) to compare two lists. Please let me know if I missed anything about the ordering's definition.
   2. Nulls. It seems that nulls element in list array is different than [documentation](http://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout). As `From<ArrayData>` requires buffers number should be 1 (0 in `FixedSizeListArray`) which is not the same as the spec. It says that variable sized list has two buffers for "validate bitmap" and "offsets". In the current implementation validate bitmap is placed in `ArrayData`'s `null_bitmap`. I think nullable array in `List` cannot be represented in this way?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] waynexia commented on pull request #8856: ARROW-10940: [Rust] Extend sort kernel to ListArray

Posted by GitBox <gi...@apache.org>.
waynexia commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-746365537


   Okay, thanks for your help. I filed ARROW-10940 for this PR.
   
   There are some works to do like support fixed-size list and i64 indices type. I'll finish it in a few days :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] waynexia commented on pull request #8856: ARROW-10355: [Rust] Add support for list_sort

Posted by GitBox <gi...@apache.org>.
waynexia commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-744386615


   Hi @jorgecarleitao,very appreciate your clear explanation! But I am sorry that I don't seem to describe it very well, please allow me to explain my question again.
   
   The code in this PR does not implement `list_sort`, but another function that sorts (so-called rearrange) the subarray of the List Array level (maybe this is part of the "sort" kernel?). 
   
   And the point is while implementing the functionality required by `list_array`, does this current code need to be preserved and moved forward? Thanks for your time ☺


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8856: ARROW-10355: [Rust] Add support for list_sort

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-744270647


   @waynexia , good question. In general, yes: arrow arrays are immutable and thus we can't sort them in place. Therefore, we use two steps:
   
   1. get the index array that would sort the array
   2. `take` items from the array according to said indexes
   
   The second point is where we "re-arrange" the array.
   
   Sorry for the delay and keep the questions coming if you have them. :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8856: ARROW-10355: [Rust] Add support for list_sort

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-743701938


   FWIW, I find it useful to go through the array equality code whenever I need to remember how an array is described, because, in the end, the equality is the operator that will dictate how the expected and observed results match.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] waynexia commented on pull request #8856: ARROW-10355: [Rust] Add support for list_sort

Posted by GitBox <gi...@apache.org>.
waynexia commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-744036391


   > FWIW, I find it useful to go through the array equality code whenever I need to remember how an array is described, because, in the end, the equality is the operator that will dictate how the expected and observed results match.
   
   Thanks, this is really helpful for me!
   
   And one more thing, what about this: 
   > BTW, is the current implementation that rearranges the input array needed to be part of Sort?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] waynexia commented on pull request #8856: ARROW-10355: [Rust] Add support for list_sort

Posted by GitBox <gi...@apache.org>.
waynexia commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-739730551


   Thanks for your reply @jorgecarleitao. Sorry for the incorrect understand I've made. 
   
   So the required `list_sort` is not part of `Sort( ArrayRef )` in `sort.rs` which will rearrange the given array. But a method of `ListArray` that performs `Sort` on each underlying sub-array, right?
   
   And then where do you think is a proper place to put `list_sort`, a separate file under `compute/kernel` or as a member function of `ListArray`, or someplace else. Please let me know what do you think :raised_hands:
   
   BTW, is the current implementation that rearranges the input array needed to be part of `Sort`?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8856: ARROW-10940: [Rust] Extend sort kernel to ListArray

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-748498108


   Hey @waynexia, I understand. It's effectively `array_sort` at https://github.com/nevi-me/rust-dataframe/blob/d4ae4ae6a2541b6625e702a350bb08efe42d6975/src/functions/array.rs#L328 , right?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] waynexia commented on pull request #8856: ARROW-10940: [Rust] Extend sort kernel to ListArray

Posted by GitBox <gi...@apache.org>.
waynexia commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-748562551


   Yes, I think ARROW-10355 requests something like that. But I didn't understand it clearly at the beginning and made this "wrong" PR :stuck_out_tongue:


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] waynexia commented on pull request #8856: ARROW-10940: [Rust] Extend sort kernel to ListArray

Posted by GitBox <gi...@apache.org>.
waynexia commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-748188311


   Hi @jorgecarleitao @nevi-me, I think this is ready for review. Could you please take a look and let me know what do you think about this? Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8856: ARROW-10355: [Rust] Add support for list_sort

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-739682064


   https://issues.apache.org/jira/browse/ARROW-10355


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8856: ARROW-10355: [Rust] Add support for list_sort

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-745782164


   Hi @waynexia , ahh, got it.
   
   It could be interpreted as such, yes: this PR is the implementation of a `sort` of a `ListArray`, based on the lexicographic order of the individual items on the list (a bit like how a `string` also has a natural order based on the individual items).
   
   If you want please file a new issue on Jira for that, and let's review this PR based on that (in which case `sort` is the right place to put it).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8856: ARROW-10940: [Rust] Extend sort kernel to ListArray

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-748262641


   Hi @waynexia, thank you for taking this on, and I apologise for the delay, I haven't had enough time to review PRs.
   
   I suggest that we keep the `list_sort` in the sort kernel for now. We already have other ListArray-specific kernels like `ListArray::contains(PrimitiveArray)` in the other kernel modules, so we can keep to that practise for now. @jorgecarleitao do you concur?
   
   We can later move all ListArray-specific functions to a separate module.
   
   I'll review this over the weekend.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] waynexia commented on pull request #8856: ARROW-10355: [Rust] Add support for list_sort

Posted by GitBox <gi...@apache.org>.
waynexia commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-739872518


   > There are some points I'm not very clear about ...
   
   For the first one, thanks to @jorgecarleitao. This is not a problem that should be considered when implementing `list_sort`.
   
   For the second point, I find some examples and think I can make it clear. `List Array` is an (at least) two-level ArrayData with `valids_bitmap` on each level to represent nulls in `List` itself and its child data.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] waynexia commented on pull request #8856: ARROW-10940: [Rust] Extend sort kernel to ListArray

Posted by GitBox <gi...@apache.org>.
waynexia commented on pull request #8856:
URL: https://github.com/apache/arrow/pull/8856#issuecomment-748494258


   > I suggest that we keep the list_sort in the sort kernel for now.
   
   Hi @nevi-me, thanks for your reply. From my understanding, `list_sort` is not implemented in this pr. To avoid ambiguity, let me say that `list_sort` stands for an operator in `DataFusion`, and this pr implements the `ListArray` part of the `sort` kernel. Am I right @jorgecarleitao?
   
   > We can later move all ListArray-specific functions to a separate module.
   
   I agree with this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove closed pull request #8856: ARROW-10940: [Rust] Extend sort kernel to ListArray

Posted by GitBox <gi...@apache.org>.
andygrove closed pull request #8856:
URL: https://github.com/apache/arrow/pull/8856


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org