You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/04 17:40:16 UTC

[GitHub] [arrow] drusso edited a comment on pull request #8222: ARROW-10043: [Rust][DataFusion] Implement COUNT(DISTINCT col)

drusso edited a comment on pull request #8222:
URL: https://github.com/apache/arrow/pull/8222#issuecomment-703289564

I've rebased against the latest changes on the master branch. Some notes and questions about the changes:

* Most of the changes are scoped to DataFusion, but an update to the `concat` kernel was required. Can the Arrow changes be bundled with the DataFusion changes, or should this be a separate issue/pull request?
* I kept the changes to the `concat` kernel as small as possible, scoped strictly to support the count distinct feature in this pull request. However, there is room for improvement here, because we can support lists of any data type.
* Is there an implementation to go from `Vec<Vec<_>>` to `ListArray`? I wasn't aware of one, but something along those lines I think would improve the readability of the `test_concat_primitive_list_arrays` test added in the `concat` kernel.
* Made the switch from `LargeListArray` to `ListArray`. This was motivated by the `take()` kernel already supporting `ListArray`, but not `LargeListArray`. Supporting `LargeListArray` would require handling of `i64` offsets – not for addressing the array's lists (`i32` is fine for that), but for addressing the array's lists' values (there is currently a recursive call to take values [here](https://github.com/apache/arrow/blob/57f548c743acfb4d8311fae034abf790ecb2d374/rust/arrow/src/compute/kernels/take.rs#L328)). Nothing that can't be solved, but I thought it best to leave that out of scope for this particular pull request.
* Following the #8172 changes, and since aggretations now have their own state, there was no longer a need for a new `AggregateMode` variant. This change was removed.
* As discussed, I've replaced `DistinctScalarValue` with `GroupByScalar`. To do so, I promoted `GroupByScalar` to its own module.
* As described in the issue, the goal is to implement `COUNT(DISTINCT col)` for a _single_ argument. Much of this implementation should work as-is for multiple arguments, but I haven't yet tested or tried that.
* The state of `DistinctCountAccumulator` uses one field per input input argument. Each field is emitted as a `ListArray`, i.e., a vector of _N_ arrays of lists of primitives. An alternate implementation, which may be worth exploring, would be to use a _single_ state field, and it emits a vector on _1_ array of lists of structs of primitives.

Let me know what you think and if any changes are needed. Thanks!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org