Posted to github@arrow.apache.org by "jychen7 (via GitHub)" <gi...@apache.org> on 2023/04/11 22:45:35 UTC

[GitHub] [arrow-datafusion] jychen7 opened a new issue, #5969: Clickbench q32 not working

jychen7 opened a new issue, #5969:
URL: https://github.com/apache/arrow-datafusion/issues/5969

   ### Describe the bug
   
   q32 does not work; the process gets killed (OOM).
   ```
   SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10
   ```
   
   ### To Reproduce
   
   https://github.com/ClickHouse/ClickBench/tree/main/datafusion
   
   ### Expected behavior
   
   q32 should work
   
   ps: for comparison, DuckDB takes 5.28s on the same query
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on issue #5969: Clickbench q32 not working

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on issue #5969:
URL: https://github.com/apache/arrow-datafusion/issues/5969#issuecomment-1504602856

   A notable share of CPU time is spent in `datafusion::physical_plan::aggregates::row_hash::slice_and_maybe_filter`
   


[GitHub] [arrow-datafusion] comphead commented on issue #5969: Clickbench q32 not working in 32GB RAM

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on issue #5969:
URL: https://github.com/apache/arrow-datafusion/issues/5969#issuecomment-1513959517

   With an optimized `datafusion::physical_plan::aggregates::row_hash::slice_and_maybe_filter` there is a ~19% gain:
   it was 68 sec, now 55 sec.


[GitHub] [arrow-datafusion] alamb commented on issue #5969: Clickbench q32 not working in 32GB RAM

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5969:
URL: https://github.com/apache/arrow-datafusion/issues/5969#issuecomment-1604520039

   > Also, do you think https://github.com/apache/arrow-datafusion/pull/6657 will help with this query?
   
   I do think so.


[GitHub] [arrow-datafusion] jychen7 commented on issue #5969: Clickbench q32 not working in 32GB RAM

Posted by "jychen7 (via GitHub)" <gi...@apache.org>.
jychen7 commented on issue #5969:
URL: https://github.com/apache/arrow-datafusion/issues/5969#issuecomment-1507915671

   > Regarding the flamegraph, the `datafusion::physical_plan::aggregates::row_hash::slice_and_maybe_filter` is expensive.
   > It is expensive because of excessive vector allocations.
   
   That makes sense. From my debugging above, I think `SortExec` is not the slow part; it looks like one of its child plans, such as `RepartitionExec` or `AggregateExec`, is slow.


[GitHub] [arrow-datafusion] jychen7 commented on issue #5969: Clickbench q32 not working in 32GB RAM

Posted by "jychen7 (via GitHub)" <gi...@apache.org>.
jychen7 commented on issue #5969:
URL: https://github.com/apache/arrow-datafusion/issues/5969#issuecomment-1506036788

   > I haven't tried a larger instance, e.g. c6a.8xlarge (64GB RAM) or up
   
   Confirmed: it works on `c6a.8xlarge` (32 vCPU, 64 GB RAM)
   ```
   Query took 19.378 seconds
   ```
   
   From observing `htop`, it uses all 32 cores and up to 34 GB of RAM


[GitHub] [arrow-datafusion] jychen7 commented on issue #5969: Clickbench q32 not working in 32GB RAM

Posted by "jychen7 (via GitHub)" <gi...@apache.org>.
jychen7 commented on issue #5969:
URL: https://github.com/apache/arrow-datafusion/issues/5969#issuecomment-1506166745

   Before investigating deeper into reducing memory usage, we could improve `datafusion-cli` (or add a new option in the DataFusion core) to size the memory pool from the machine's memory limit; a hypothetical sketch follows below.
   (ps: it already detects the number of CPU cores for parallel reading/sorting, but does not account for memory)
   
   https://github.com/apache/arrow-datafusion/blob/4c7833ebfdb2d022830bb97862e0ce36b0b3d6b1/datafusion/execution/src/runtime_env.rs#L152-L161
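
   A sketch of that proposal, assuming the third-party `sysinfo` crate for RAM detection (its `total_memory()` returns bytes in recent versions); this is not existing datafusion-cli behavior:
   
   ```
   use std::sync::Arc;
   
   use datafusion::error::Result;
   use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
   use sysinfo::{System, SystemExt};
   
   // Size the memory pool from physical RAM, analogous to how the number
   // of CPU cores is already detected for parallelism.
   fn machine_sized_runtime_env() -> Result<Arc<RuntimeEnv>> {
       let mut sys = System::new();
       sys.refresh_memory();
       let total_bytes = sys.total_memory() as usize;
       // Keep headroom for the OS and allocations outside the pool.
       let config = RuntimeConfig::new().with_memory_limit(total_bytes, 0.8);
       Ok(Arc::new(RuntimeEnv::new(config)?))
   }
   ```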
   
   ---
   
   But unfortunately, my local run with a memory limit configured returns the following error.
   
   ### before
   Unlimited memory; query took 115.120 seconds
   https://github.com/apache/arrow-datafusion/blob/4c7833ebfdb2d022830bb97862e0ce36b0b3d6b1/datafusion-cli/src/main.rs#L149-L152
   
   ### after
   Limit of 16*10^9 bytes (~15 GB) with a 0.8 memory fraction:
   `Resources exhausted: Failed to allocate additional 59290364 bytes for GroupedHashAggregateStream[1] with 149089206 bytes already allocated - maximum available is 16563438`
   ```
   use datafusion::error::Result;
   use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
   
   fn create_runtime_env() -> Result<RuntimeEnv> {
       // Pool size = 16_000_000_000 bytes * 0.8 memory fraction
       let rn_config = RuntimeConfig::new().with_memory_limit(16_000_000_000, 0.8);
       RuntimeEnv::new(rn_config)
   }
   ```


[GitHub] [arrow-datafusion] jychen7 commented on issue #5969: Clickbench q32 not working in 32GB RAM

Posted by "jychen7 (via GitHub)" <gi...@apache.org>.
jychen7 commented on issue #5969:
URL: https://github.com/apache/arrow-datafusion/issues/5969#issuecomment-1506065309

   I thought `sort` knew the memory limit and would spill to disk when memory is insufficient. Not sure yet why it OOMs.
   https://github.com/apache/arrow-datafusion/blob/4c7833ebfdb2d022830bb97862e0ce36b0b3d6b1/datafusion/core/src/physical_plan/sorts/sort.rs#L63-L72
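
   For context, a minimal sketch of how spilling gets enabled, assuming the public `RuntimeEnv` APIs: the external sort can only spill once the memory pool is bounded, and with the default unbounded pool nothing ever triggers a spill, so the OS kills the process instead. This is illustrative wiring, not datafusion-cli's actual code:
   
   ```
   use std::sync::Arc;
   
   use datafusion::error::Result;
   use datafusion::execution::disk_manager::DiskManagerConfig;
   use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
   use datafusion::prelude::{SessionConfig, SessionContext};
   
   // Sketch: a context whose sorts may spill; the 16 GB / 0.8 numbers
   // mirror the experiment above.
   fn context_with_spilling() -> Result<SessionContext> {
       let runtime = RuntimeEnv::new(
           RuntimeConfig::new()
               // bounded pool: operators must reserve memory and can be refused
               .with_memory_limit(16_000_000_000, 0.8)
               // spill files for sort runs go to the OS temp directory
               .with_disk_manager(DiskManagerConfig::NewOs),
       )?;
       Ok(SessionContext::with_config_rt(
           SessionConfig::new(),
           Arc::new(runtime),
       ))
   }
   ```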
   
   
   For reference, the plan:
   ```
   explain SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;

   logical_plan
     Limit: skip=0, fetch=10
       Sort: c DESC NULLS FIRST, fetch=10
         Projection: hits.WatchID, hits.ClientIP, COUNT(UInt8(1)) AS c, SUM(hits.IsRefresh), AVG(hits.ResolutionWidth)
           Aggregate: groupBy=[[hits.WatchID, hits.ClientIP]], aggr=[[COUNT(UInt8(1)), SUM(hits.IsRefresh), AVG(hits.ResolutionWidth)]]
             TableScan: hits projection=[WatchID, ClientIP, IsRefresh, ResolutionWidth]

   physical_plan
     GlobalLimitExec: skip=0, fetch=10
       SortPreservingMergeExec: [c@2 DESC]
         SortExec: fetch=10, expr=[c@2 DESC]
           ProjectionExec: expr=[WatchID@0 as WatchID, ClientIP@1 as ClientIP, COUNT(UInt8(1))@2 as c, SUM(hits.IsRefresh)@3 as SUM(hits.IsRefresh), AVG(hits.ResolutionWidth)@4 as AVG(hits.ResolutionWidth)]
             AggregateExec: mode=FinalPartitioned, gby=[WatchID@0 as WatchID, ClientIP@1 as ClientIP], aggr=[COUNT(UInt8(1)), SUM(hits.IsRefresh), AVG(hits.ResolutionWidth)]
               CoalesceBatchesExec: target_batch_size=8192
                 RepartitionExec: partitioning=Hash([Column { name: "WatchID", index: 0 }, Column { name: "ClientIP", index: 1 }], 12), input_partitions=12
                   AggregateExec: mode=Partial, gby=[WatchID@0 as WatchID, ClientIP@1 as ClientIP], aggr=[COUNT(UInt8(1)), SUM(hits.IsRefresh), AVG(hits.ResolutionWidth)]
                     ParquetExec: limit=None, partitions={12 groups: [[hits.parquet:0..1231664704], [hits.parquet:1231664704..2463329408], [hits.parquet:2463329408..3694994112], [hits.parquet:3694994112..4926658816], [hits.parquet:4926658816..6158323520], [hits.parquet:6158323520..7389988224], [hits.parquet:7389988224..8621652928], [hits.parquet:8621652928..9853317632], [hits.parquet:9853317632..11084982336], [hits.parquet:11084982336..12316647040], [hits.parquet:12316647040..13548311744], [hits.parquet:13548311744..14779976446]]}, projection=[WatchID, ClientIP, IsRefresh, ResolutionWidth]
   ```


[GitHub] [arrow-datafusion] comphead commented on issue #5969: Clickbench q32 not working

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on issue #5969:
URL: https://github.com/apache/arrow-datafusion/issues/5969#issuecomment-1504585917

   ![df-distinct-graph](https://user-images.githubusercontent.com/10247224/231349286-063c4c5f-61da-4cb1-8894-8504cc9b846f.svg)
   


[GitHub] [arrow-datafusion] jychen7 commented on issue #5969: Clickbench q32 not working in 32GB RAM

Posted by "jychen7 (via GitHub)" <gi...@apache.org>.
jychen7 commented on issue #5969:
URL: https://github.com/apache/arrow-datafusion/issues/5969#issuecomment-1507903769

   When running with `RUST_LOG=debug datafusion-cli`, I found it is slow during `do_sort`, but it was not clear which part is slow: `insert_batch` or the final `sort`.
   
   So I added debug logs in https://github.com/apache/arrow-datafusion/compare/main...jychen7:arrow-datafusion:debug-clickbench-q32?expand=1. Surprisingly, about 12 minutes elapse between `Start do_sort` and `Start insert_batch`:
   ```
   [2023-04-14T02:26:51Z DEBUG datafusion::physical_plan::sorts::sort] Start do_sort for partition 11 of context session_id 039882c2-6b74-4d6a-9739-a50c4fbe1ac7 and task_id None
   [2023-04-14T02:39:01Z DEBUG datafusion::physical_plan::sorts::sort] Start insert_batch for partition 11 with input size 295992
   ```
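
   For reference, the linked branch just brackets the phases with timing logs. A minimal hypothetical sketch of that kind of instrumentation (function names and messages are illustrative, not the actual DataFusion code):
   
   ```
   use std::time::Instant;
   
   // Hypothetical phase-timing instrumentation in the style of the debug branch
   fn do_sort_instrumented(partition: usize) {
       let started = Instant::now();
       log::debug!("Start do_sort for partition {partition}");
       buffer_input(partition); // the mystery minutes elapse in the child plan here
       log::debug!(
           "Start insert_batch for partition {partition} ({:?} since do_sort)",
           started.elapsed()
       );
   }
   
   fn buffer_input(_partition: usize) {
       // placeholder for pulling input batches from the child plan
   }
   ```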


[GitHub] [arrow-datafusion] comphead commented on issue #5969: Clickbench q32 not working

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on issue #5969:
URL: https://github.com/apache/arrow-datafusion/issues/5969#issuecomment-1504551303

   Looking into this


[GitHub] [arrow-datafusion] jychen7 commented on issue #5969: Clickbench q32 not working

Posted by "jychen7 (via GitHub)" <gi...@apache.org>.
jychen7 commented on issue #5969:
URL: https://github.com/apache/arrow-datafusion/issues/5969#issuecomment-1504589481

   I am wondering whether a parallel TopK operator could reduce memory usage enough for q32 to run; a sketch of the idea is below.
   https://github.com/apache/arrow-datafusion/issues/3516#issuecomment-1250258614
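
   The memory argument: with `LIMIT 10`, each partition only ever needs to retain its ten best candidate rows instead of the full aggregated output. A minimal sketch of that bounded-memory idea in plain Rust (not DataFusion's operator):
   
   ```
   use std::cmp::Reverse;
   use std::collections::BinaryHeap;
   
   /// Keep only the k largest values from a stream; memory stays O(k)
   /// no matter how many rows flow through.
   fn top_k(values: impl Iterator<Item = i64>, k: usize) -> Vec<i64> {
       let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::with_capacity(k + 1);
       for v in values {
           heap.push(Reverse(v));
           if heap.len() > k {
               heap.pop(); // evict the smallest of the current candidates
           }
       }
       let mut out: Vec<i64> = heap.into_iter().map(|Reverse(v)| v).collect();
       out.sort_unstable_by(|a, b| b.cmp(a)); // descending, like ORDER BY c DESC
       out
   }
   ```
   
   Running one such accumulator per partition and then taking the top 10 of the merged candidates is what would make the operator parallel.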


[GitHub] [arrow-datafusion] ozankabak commented on issue #5969: Clickbench q32 not working in 32GB RAM

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on issue #5969:
URL: https://github.com/apache/arrow-datafusion/issues/5969#issuecomment-1603361175

   I wonder if this query still fails with OOM after the recent improvements. Also, do you think #6657 will help with this query?


[GitHub] [arrow-datafusion] alamb commented on issue #5969: Clickbench q32 not working in 32GB RAM

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5969:
URL: https://github.com/apache/arrow-datafusion/issues/5969#issuecomment-1638066217

   After https://github.com/apache/arrow-datafusion/pull/6904 was merged, I have been able to run all ClickBench queries successfully with 32 GB of RAM, whereas DataFusion 27.0.0 hit an OOM on q32 for me. I think this is done.


[GitHub] [arrow-datafusion] alamb closed issue #5969: Clickbench q32 not working in 32GB RAM

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #5969: Clickbench q32 not working in 32GB RAM
URL: https://github.com/apache/arrow-datafusion/issues/5969


[GitHub] [arrow-datafusion] comphead commented on issue #5969: Clickbench q32 not working in 32GB RAM

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on issue #5969:
URL: https://github.com/apache/arrow-datafusion/issues/5969#issuecomment-1507912263

   Regarding the flamegraph, the `datafusion::physical_plan::aggregates::row_hash::slice_and_maybe_filter` is expensive.
   It is expensive because of excessive vector allocations. I will try to rewrite this method tomorrow and share the results.
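
   To illustrate the pattern being described: a hedged sketch of per-group slicing that allocates a fresh `Vec` for every group, versus a variant that borrows instead (not the actual `slice_and_maybe_filter` code):
   
   ```
   /// Allocation-heavy: one Vec per group per batch shows up as
   /// allocator traffic on a flamegraph.
   fn slice_groups_allocating(batch: &[u64], groups: &[(usize, usize)]) -> Vec<Vec<u64>> {
       groups
           .iter()
           .map(|&(start, len)| batch[start..start + len].to_vec()) // copy + alloc per group
           .collect()
   }
   
   /// Cheaper: hand out borrowed slices and reuse one output buffer
   /// across batches, so the per-group allocations disappear.
   fn slice_groups_borrowing<'a>(
       batch: &'a [u64],
       groups: &[(usize, usize)],
       out: &mut Vec<&'a [u64]>,
   ) {
       out.clear();
       out.extend(groups.iter().map(|&(start, len)| &batch[start..start + len]));
   }
   ```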


[GitHub] [arrow-datafusion] comphead commented on issue #5969: Clickbench q32 not working in 32GB RAM

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on issue #5969:
URL: https://github.com/apache/arrow-datafusion/issues/5969#issuecomment-1515188902

   Filed https://github.com/apache/arrow-datafusion/issues/6064

