You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/12 13:05:36 UTC

[GitHub] [arrow-rs] alamb opened a new issue, #1542: Add `take_from_iter()`compute kernel

alamb opened a new issue, #1542:
URL: https://github.com/apache/arrow-rs/issues/1542

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   Often rust code calculates some indices and then calls the take kernel: https://docs.rs/arrow/11.1.0/arrow/compute/kernels/take/fn.take.html
   
   However, it is necessary to create an Array (aka buffer all the indexes) which is both not necessary time wise as well as memory inefficient
   
   here are some examples of creating an Array from an iterator simply to call the `take` kernel in DataFusion: 
   https://github.com/apache/arrow-datafusion/blob/fa9e0164127c233a16b53e1d12f162f3b00171f5/ballista/rust/core/src/execution_plans/shuffle_writer.rs#L233-L244
   
   https://github.com/apache/arrow-datafusion/blob/9e3bec811549b61bcb7957554879b77fd9c249ef/datafusion/physical-expr/src/physical_expr.rs#L57-L63
   
   
   **Describe the solution you'd like**
   I would like a `take` kernel that takes an iterator of optional indices, perhaps something like:
   
   ```rust
   fn take_iter(
     values: &dyn Array,
     iter: impl IntoIterator<Item=Option<usize>>,
     options: Option<TakeOptions>,
   ) -> Result<ArrayRef> {
   ...
   }
   ```
   
   Which would act the same as the following, but would not instantiate the intermediate `indicies` array
   
   ```rust
   let indices: UInt64Array = iter.into_iter().collect();
   take(values, &indices, options)
   ```
   
   Bonus points for an API that can take iterators of `Option<usize>`, `usize`, `Option<i32>`, `i32`, etc (aka anything that can be turned into an Option<usize>).
   
   **Describe alternatives you've considered**
   N/A
   
   **Additional context**
   Add any other context or screenshots about the feature request here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold closed issue #1542: Add `take_from_iter()`compute kernel

Posted by GitBox <gi...@apache.org>.

tustvold closed issue #1542: Add `take_from_iter()`compute kernel
URL: https://github.com/apache/arrow-rs/issues/1542


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on issue #1542: Add `take_from_iter()`compute kernel

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #1542:
URL: https://github.com/apache/arrow-rs/issues/1542#issuecomment-1097049462

   One thing to perhaps think about also is there are separate code paths based on the null count of the indices arrays, and so it might be worth also providing something that accepts a non-optional iterator


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on issue #1542: Add `take_from_iter()`compute kernel

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #1542:
URL: https://github.com/apache/arrow-rs/issues/1542#issuecomment-1270140912

   Coming back to this, I'm not sure this is such a good idea anymore. The take kernel results in a substantial amount of codegen, as specialized implementations are generated for every possible array. Just compiling the current take kernels for all the different types of dictionary key represents a substantial amount of codegen and compilation time, multiplying this by all the various different types of iterator, is likely to make this orders of magnitude worse.
   
   I also found when implementing the specialized filter kernels that it was oftentimes faster to materialize the indices than to use an iterator, as it was easier for the compiler to reason about correctly.
   
   TLDR I think this would likely explode compilation times, and bloat the final executables, without necessarily yielding any performance wins


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] Dandandan commented on issue #1542: Add `take_from_iter()`compute kernel

Posted by GitBox <gi...@apache.org>.

Dandandan commented on issue #1542:
URL: https://github.com/apache/arrow-rs/issues/1542#issuecomment-1097054946

   I think the same thing could be done is to do this as well for `filter`. There are a couple of places in DataFusion that could benefit from not materializing the intermediate `BooleanArray`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org