You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/09 09:44:59 UTC

[GitHub] [arrow-datafusion] Dandandan opened a new issue #299: Support window functions with empty PARTITION BY clause

Dandandan opened a new issue #299:
URL: https://github.com/apache/arrow-datafusion/issues/299


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   Window functions have a `PARTITION BY` clause to split the data in partitions and calculate window functions over those partitions individually.
   
   **Describe the solution you'd like**
   We can use `Repartition::Hash` to parallelize the execution.
   
   **Describe alternatives you've considered**
   n/a
   
   **Additional context**
   http://www.vldb.org/pvldb/vol8/p1058-leis.pdf&ved=2ahUKEwj80-3OjrfwAhUJPOwKHfdRAssQFjAMegQIEhAC&usg=AOvVaw2KKUPeYhyc-pEFTmlqyboj
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist edited a comment on issue #299: Support window functions with PARTITION BY clause

Posted by GitBox <gi...@apache.org>.

Jimexist edited a comment on issue #299:
URL: https://github.com/apache/arrow-datafusion/issues/299#issuecomment-859529899


   > > > @Jimexist
   > > > No, but AFAIK you can pre-partition based on the partition expression, like for example we do for hash joins.
   > > > You have to execute the partition too in the implementation of the window functions, but each partition has all of the equal partition values after doing a hash repartition.
   > > > So a `HashPartition(partition_by_expr) -> Window(partition_by_expr, order_by)` (per partition), should be the same as `Window(partition_by_expr, order_by)` (on 1 partition)
   > > 
   > > 
   > > i wonder why [postgres decided to use sort instead of hash partition regardless](https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/backend/executor/nodeWindowAgg.c#L10-L12)
   > > ```
   > > # explain select max(c2) over (partition by c3) from test;
   > >                             QUERY PLAN
   > > -------------------------------------------------------------------
   > >  WindowAgg  (cost=44.96..55.81 rows=620 width=6)
   > >    ->  Sort  (cost=44.96..46.51 rows=620 width=6)
   > >          Sort Key: c3
   > >          ->  Seq Scan on test  (cost=0.00..16.20 rows=620 width=6)
   > > (4 rows)
   > > ```
   > > 
   > > 
   > >     
   > >       
   > >     
   > > 
   > >       
   > >     
   > > 
   > >     
   > >   
   > > maybe most likely due to code reuse?
   > 
   > PostgreSQL uses a minimal amount of multithreading, as it is designed mostly for transactional processing (OLTP) on smaller datasets. For execution on one thread, doing extra work would slow it down a bit, so it would be better to not use that at all. For hash join we do the same too, the partitioning is only applied when `concurrency>1`.
   > Only for really big tables / costly queries PostgreSQL will opt to use multiple workers (which will be visible in the query plan), but not sure whether it will even use hash repartitioning in that case.
   > 
   > I believe e.g. Spark always does a partitioning based on partition by, which makes it execute much faster / scalable in the presence of a partition by clause as each worker/thread can execute each part individually.
   
   that's a valid point. i guess here my plan is / continues to be:
   1. implement a correct version using global sort that covers partition and order by and window frame
   2. setup integration tests that compare results
   3. [come up with a more realistic benchmark dataset that's much larger than 100 rows](https://github.com/apache/arrow-datafusion/issues/565)
   4. [migrate to use repartition for inter-partition parallism](https://github.com/apache/arrow-datafusion/pull/569)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan edited a comment on issue #299: Support window functions with PARTITION BY clause

Posted by GitBox <gi...@apache.org>.

Dandandan edited a comment on issue #299:
URL: https://github.com/apache/arrow-datafusion/issues/299#issuecomment-859516691






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist commented on issue #299: Support window functions with PARTITION BY clause

Posted by GitBox <gi...@apache.org>.

Jimexist commented on issue #299:
URL: https://github.com/apache/arrow-datafusion/issues/299#issuecomment-859462498






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb closed issue #299: Support window functions with PARTITION BY clause

Posted by GitBox <gi...@apache.org>.

alamb closed issue #299:
URL: https://github.com/apache/arrow-datafusion/issues/299


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on issue #299: Support window functions with PARTITION BY clause

Posted by GitBox <gi...@apache.org>.

Dandandan commented on issue #299:
URL: https://github.com/apache/arrow-datafusion/issues/299#issuecomment-859481171






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist commented on issue #299: Support window functions with PARTITION BY clause

Posted by GitBox <gi...@apache.org>.

Jimexist commented on issue #299:
URL: https://github.com/apache/arrow-datafusion/issues/299#issuecomment-844230370


   We'd like to support window function in three or more steps:
   1. #359 basic structure 
   2. #298 empty over clause
   3. #299 with partition clause
   4. #360 with order by
   5. #361 with window frame


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist edited a comment on issue #299: Support window functions with PARTITION BY clause

Posted by GitBox <gi...@apache.org>.

Jimexist edited a comment on issue #299:
URL: https://github.com/apache/arrow-datafusion/issues/299#issuecomment-859462498






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist edited a comment on issue #299: Support window functions with PARTITION BY clause

Posted by GitBox <gi...@apache.org>.

Jimexist edited a comment on issue #299:
URL: https://github.com/apache/arrow-datafusion/issues/299#issuecomment-859529899


   > > > @Jimexist
   > > > No, but AFAIK you can pre-partition based on the partition expression, like for example we do for hash joins.
   > > > You have to execute the partition too in the implementation of the window functions, but each partition has all of the equal partition values after doing a hash repartition.
   > > > So a `HashPartition(partition_by_expr) -> Window(partition_by_expr, order_by)` (per partition), should be the same as `Window(partition_by_expr, order_by)` (on 1 partition)
   > > 
   > > 
   > > i wonder why [postgres decided to use sort instead of hash partition regardless](https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/backend/executor/nodeWindowAgg.c#L10-L12)
   > > ```
   > > # explain select max(c2) over (partition by c3) from test;
   > >                             QUERY PLAN
   > > -------------------------------------------------------------------
   > >  WindowAgg  (cost=44.96..55.81 rows=620 width=6)
   > >    ->  Sort  (cost=44.96..46.51 rows=620 width=6)
   > >          Sort Key: c3
   > >          ->  Seq Scan on test  (cost=0.00..16.20 rows=620 width=6)
   > > (4 rows)
   > > ```
   > > 
   > > 
   > >     
   > >       
   > >     
   > > 
   > >       
   > >     
   > > 
   > >     
   > >   
   > > maybe most likely due to code reuse?
   > 
   > PostgreSQL uses a minimal amount of multithreading, as it is designed mostly for transactional processing (OLTP) on smaller datasets. For execution on one thread, doing extra work would slow it down a bit, so it would be better to not use that at all. For hash join we do the same too, the partitioning is only applied when `concurrency>1`.
   > Only for really big tables / costly queries PostgreSQL will opt to use multiple workers (which will be visible in the query plan), but not sure whether it will even use hash repartitioning in that case.
   > 
   > I believe e.g. Spark always does a partitioning based on partition by, which makes it execute much faster / scalable in the presence of a partition by clause as each worker/thread can execute each part individually.
   
   that's a valid point. i guess here my plan is / continues to be:
   1. implement a correct version using global sort that covers partition and order by and window frame
   2. setup integration tests that compare results
   3. [come up with a more realistic benchmark dataset that's much larger than 100 rows](https://github.com/apache/arrow-datafusion/issues/565)
   4. [migrate to use repartition for inter-partition parallism](https://github.com/apache/arrow-datafusion/pull/569)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist edited a comment on issue #299: Support window functions with PARTITION BY clause

Posted by GitBox <gi...@apache.org>.

Jimexist edited a comment on issue #299:
URL: https://github.com/apache/arrow-datafusion/issues/299#issuecomment-844230370


   We'd like to support window function in three or more steps:
   1. #359 basic structure 
   2. #298 empty over clause
   3. #299 with partition clause (this one)
   4. #360 with order by
   5. #361 with window frame


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb closed issue #299: Support window functions with PARTITION BY clause

Posted by GitBox <gi...@apache.org>.

alamb closed issue #299:
URL: https://github.com/apache/arrow-datafusion/issues/299


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org