You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/26 04:00:47 UTC

[GitHub] [arrow-datafusion] matthewmturner opened a new issue #1486: Add median, std, and corr functions

matthewmturner opened a new issue #1486:
URL: https://github.com/apache/arrow-datafusion/issues/1486


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 
   (This section helps Arrow developers understand the context and *why* for this feature, in addition to  the *what*)
   In the context of adding datafusion to db-benchmark (#147) there are some advanced group by queries that are benchmarked which require median, standard deviation, and correlation functions which datafusion does not currently provide out of the box.
   
   **Describe the solution you'd like**
   A clear and concise description of what you want to happen.
   Builtin support for median, standard deviation, and correlation functions.
   
   **Describe alternatives you've considered**
   A clear and concise description of any alternative solutions or features you've considered.
   
   **Additional context**
   Add any other context or screenshots about the feature request here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] realno commented on issue #1486: Add median, std, and corr functions

Posted by GitBox <gi...@apache.org>.

realno commented on issue #1486:
URL: https://github.com/apache/arrow-datafusion/issues/1486#issuecomment-1007000331


   PR for standard deviation is up for review: https://github.com/apache/arrow-datafusion/pull/1525
   
   Please let me know who I should add as reviewer. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] matthewmturner commented on issue #1486: Add median, std, and corr functions

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1486:
URL: https://github.com/apache/arrow-datafusion/issues/1486#issuecomment-1016684233


   @realno you can take this with a grain of salt as I am new to this.
   
   My thinking is that I would prefer to see the exact median implementation before having an approximate (i.e the approximate would be an add-on feature).  I could be wrong but I believe datafusion had `DISTINCT` before `approx_distinct`.
   
   Regarding the implementation - I thought that we would be able to use existing arrow compute kernels for this and not have to re-implement existing functionality:
   
   - sort: https://docs.rs/arrow/latest/arrow/compute/kernels/sort/fn.sort.html
   - length: https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.len
   - value: https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.value
   
   I suppose this would be somewhere between your Option 1 and Option 2.
   
   i definitely defer to @alamb though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb closed issue #1486: Add median, std, and corr functions

Posted by GitBox <gi...@apache.org>.

alamb closed issue #1486:
URL: https://github.com/apache/arrow-datafusion/issues/1486


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb edited a comment on issue #1486: Add median, std, and corr functions

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on issue #1486:
URL: https://github.com/apache/arrow-datafusion/issues/1486#issuecomment-1016877360


   (we should probably bring this discussion into a new ticket, FWIW)
   
   TLDR is I think `median` is quite complicated to get both correct and high performance. 
   
   The key problem is that you basically need to have buffered the entire input stream before you know what the output is
   
   Starting with an approximation of median would be fine and I suspect other users of DataFusion would find it valuable. 
   
   To implement an exact median, you could implement an `Aggregator` like we have for other aggregates that buffers up all the input, sorts it,  and produces the median value. It would have terrible memory consumption but is probably reasonably straightforward to implement using existing compute kernels as @matthewmturner suggests
   
   If you wanted to try and reuse existing operators (e.g. `LogicalPlan` and `ExecutionPlan`), I think that would be tough. 
   
   The mapping of `SQL` --> LogicalPlan is formulaic but non trivial. All aggregate functions are treated the same (create the same `LogicalPlan` nodes, with different expr lists, so if we special case `median` with a special  operator we'll have to sort out how to handle queries like `SELECT sum(foo), median(foo) from bar GROUP BY baz` which I think will be complicated
   
   Another  challenge with median is that there is no way to partially aggregate intermediate results. DataFusion currently makes this pattern (which is useful when calculating sums, for example, because you can calculate partial sums and then sum them together)
   
   ```
                           ┌──────────┐                     
                           │          │                     
                           │Aggregator│                     
                           │ (Final)  │                     
                           │          │                     
                           │          │                     
                           └──────────┘                     
                                 ▲                          
                                 │                          
         ┌────────────────┬──────┴────────────────────┐     
         │                │                           │     
         │                │                           │     
         │                │                           │     
   ┌──────────┐     ┌──────────┐                ┌──────────┐
   │          │     │          │                │          │
   │Aggregator│     │Aggregator│                │Aggregator│
   │(Partial) │     │(Partial) │         ...    │(Partial) │
   │          │     │          │                │          │
   │          │     │          │                │          │
   └──────────┘     └──────────┘                └──────────┘
         ▲                ▲                           ▲     
         │                │                           │     
         └────────────────┴──────┬────────────────────┘     
                                 │                          
                                 │                          
                           ┌───────────┐                    
                           │           │                    
                           │Repartition│                    
                           │           │                    
                           │           │                    
                           └───────────┘                    
                                 ▲                          
                                 │                          
                                 │                          
                           ┌───────────┐                    
                           │Input      │                    
                           │           │                    
                           └───────────┘                    
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #1486: Add median, std, and corr functions

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on issue #1486:
URL: https://github.com/apache/arrow-datafusion/issues/1486#issuecomment-1016885003


   In principle we could aggregate the partials via a `ListArray<i32, PrimitiveArray<T>>`, and merge them by concatenating the results on a single `PrimitiveArray<T>` per group (and then compute the aggregate that requires all results in the `finish`).
   
   But I agree that the majority of cases we want an online estimate.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] realno commented on issue #1486: Add median, std, and corr functions

Posted by GitBox <gi...@apache.org>.

realno commented on issue #1486:
URL: https://github.com/apache/arrow-datafusion/issues/1486#issuecomment-1016037596

I have a update/question for this issue: `stddev` and `corr` have merged few days ago. I took a look at `median` today and had an initial idea how to implement it. Now the problem is the implementation might be somewhat controversial.

There are some operations needed to calculate median: 1. sort 2. count 3. get nth value. I thought about few options:
1. Implement a new function/operator - the function will look like `sort`, we may need to add a new type of expression for it. It feels to be a lot of overhead just for median.

- Pros: No mess around plan builder and query parsing logic
- Cons: Need to add a specific expression just for median; there may be some duplicate logic regarding sort and count

2. Reuse some of the existing code through rewriting the logic plan. If we can have `sort` to keep the total number of record, we can add a new function to rewrite the logical plan to something like ` FIND(n) | SORT(col)`. But currently there is no such behavior in the code base.

- Pros: Better code reuse; May potentially provide a way to handle compound (high-order) functions, e.g. `Function1` -> `FunctionA(FunctionB(col))`
- Cons: Need to modify logic for building plans;

3. Use approximate algorithms like KLL or t-digest. This way it can fit in existing aggregator API, but the result will be an approximation. There is already a PR https://github.com/apache/arrow-datafusion/pull/1539, we can help merge that then implement `median` as `quantile(0.5)`.

- Pros: Easy to implement; Using existing aggregator API, Much better performance
- Cons: The result is an approximation

My preference is in the order of 3 > 2 >1. I'd like to see more opinions before moving forward.

@matthewmturner @alamb

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] realno commented on issue #1486: Add median, std, and corr functions

Posted by GitBox <gi...@apache.org>.

realno commented on issue #1486:
URL: https://github.com/apache/arrow-datafusion/issues/1486#issuecomment-1016800036


   > @realno you can take this with a grain of salt as I am new to this.
   > 
   > My thinking is that I would prefer to see the exact median implementation before having an approximate (i.e the approximate would be an add-on feature). I could be wrong but I believe datafusion had `DISTINCT` before `approx_distinct`.
   > 
   > Regarding the implementation - I thought that we would be able to use existing arrow compute kernels for this and not have to re-implement existing functionality:
   > 
   > * sort: https://docs.rs/arrow/latest/arrow/compute/kernels/sort/fn.sort.html
   > * length: https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.len
   > * value: https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.value
   > 
   > I suppose this would be somewhere between your Option 1 and Option 2.
   > 
   > i definitely defer to @alamb though.
   
   Thanks for the comments @matthewmturner . I am also new and wouldn't call myself database internal expert :) Yes we have all the functionality ready, the complication is what's the best/most efficient way to implement this. I definitely want to hear more opinions on this. 
   
   Do you think it worth having a approximation to unblock the perf benchmark work? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] realno commented on issue #1486: Add median, std, and corr functions

Posted by GitBox <gi...@apache.org>.

realno commented on issue #1486:
URL: https://github.com/apache/arrow-datafusion/issues/1486#issuecomment-1016977537


   Let me move the conversation to a new issue. I agree with @jorgecarleitao that we could introduce an API in addition to current Aggregator to provide the functionality - the current `sort` actually does exactly that. And to @alamb 's point performance and memory consumption would be an issue,  we may need to handle spilling to disk. As said, I'll create a new issue for this.
   
   For the sake of this task, I suggest we close with the approximation version, what do you think? @matthewmturner are you comfortable using the approximate version for your benchmark work? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] matthewmturner commented on issue #1486: Add median, std, and corr functions

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1486:
URL: https://github.com/apache/arrow-datafusion/issues/1486#issuecomment-1016986361


   @realno practically, yes definitely okay with approximate version.
   
   Within the context of performance benchmark my only concern is that we are able to produce the expected results.  For example, i see clickhouse uses `medianExact` (https://h2oai.github.io/db-benchmark/).  It would be nice if there was a way to test and see if this would give the expected result in the benchmarking context. But, at the end of the day approximate would still be an improvement so even if it didnt match that shouldnt be a blocker.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] realno edited a comment on issue #1486: Add median, std, and corr functions

Posted by GitBox <gi...@apache.org>.

realno edited a comment on issue #1486:
URL: https://github.com/apache/arrow-datafusion/issues/1486#issuecomment-1016037596

I have a update/question for this issue: `stddev` and `corr` have been merged few days ago. I took a look at `median` today and had an initial idea how to implement it. Now the problem is the implementation might be somewhat controversial.

- Pros: No mess around plan builder and query parsing logic
- Cons: Need to add a specific expression just for median; there may be some duplicate logic regarding sort and count

- Pros: Better code reuse; May potentially provide a way to handle compound (high-order) functions, e.g. `Function1` -> `FunctionA(FunctionB(col))`
- Cons: Need to modify logic for building plans;

- Pros: Easy to implement; Using existing aggregator API, Much better performance
- Cons: The result is an approximation

My preference is in the order of 3 > 2 >1. I'd like to see more opinions before moving forward.

@matthewmturner @alamb

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1486: Add median, std, and corr functions

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1486:
URL: https://github.com/apache/arrow-datafusion/issues/1486#issuecomment-1016877360


   (we should probably bring this discussion into a new ticket, FWIW)
   
   TLDR is I think `median` is quite complicated to get both correct and high performance. 
   
   The key problem is that you basically need to have buffered the entire input stream before you know what the output is
   
   Starting with an approximation of median would be fine and I suspect other users of DataFusion would find it valuable. 
   
   To implement an exact median, you could implement an `Aggregator` like we have for other aggregates that buffers up all the input, sorts it,  and produces the median value. It would have terrible memory consumption but is probably reasonably straightforward to implement. 
   
   If you wanted to try and reuse existing operators (e.g. `LogicalPlan` and `ExecutionPlan`), I think that would be tough. 
   
   The mapping of `SQL` --> LogicalPlan is formulaic but non trivial. All aggregate functions are treated the same (create the same `LogicalPlan` nodes, with different expr lists, so if we special case `median` with a special  operator we'll have to sort out how to handle queries like `SELECT sum(foo), median(foo) from bar GROUP BY baz` which I think will be complicated
   
   Another  challenge with median is that there is no way to partially aggregate intermediate results. DataFusion currently makes this pattern (which is useful when calculating sums, for example, because you can calculate partial sums and then sum them together)
   
   ```
                           ┌──────────┐                     
                           │          │                     
                           │Aggregator│                     
                           │ (Final)  │                     
                           │          │                     
                           │          │                     
                           └──────────┘                     
                                 ▲                          
                                 │                          
         ┌────────────────┬──────┴────────────────────┐     
         │                │                           │     
         │                │                           │     
         │                │                           │     
   ┌──────────┐     ┌──────────┐                ┌──────────┐
   │          │     │          │                │          │
   │Aggregator│     │Aggregator│                │Aggregator│
   │(Partial) │     │(Partial) │         ...    │(Partial) │
   │          │     │          │                │          │
   │          │     │          │                │          │
   └──────────┘     └──────────┘                └──────────┘
         ▲                ▲                           ▲     
         │                │                           │     
         └────────────────┴──────┬────────────────────┘     
                                 │                          
                                 │                          
                           ┌───────────┐                    
                           │           │                    
                           │Repartition│                    
                           │           │                    
                           │           │                    
                           └───────────┘                    
                                 ▲                          
                                 │                          
                                 │                          
                           ┌───────────┐                    
                           │Input      │                    
                           │           │                    
                           └───────────┘                    
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] matthewmturner commented on issue #1486: Add median, std, and corr functions

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1486:
URL: https://github.com/apache/arrow-datafusion/issues/1486#issuecomment-1005128470


   @realno happy to have the help :) i actually just updated my PR to get datafusion included in db-benchmark with the current features.  hopefully getting close.
   
   separate issues per function sounds good. we can just use this as a tracker issue - ill add a task list to it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] realno commented on issue #1486: Add median, std, and corr functions

Posted by GitBox <gi...@apache.org>.

realno commented on issue #1486:
URL: https://github.com/apache/arrow-datafusion/issues/1486#issuecomment-1005125626


   I am interested in helping with this - I am very curious about the result of the db-benchmark as well and want to help. Does it make sense to create separate issues for the functions? I started looking into stddev and maybe somebody else want to work on the others. @matthewmturner @houqp 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org