You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/03 13:55:09 UTC

[GitHub] [arrow-datafusion] alamb edited a comment on issue #1708: Introduce a `Vec` based row-wise representation for DataFusion

alamb edited a comment on issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1029014217


   > However, as the group-by key cardinality grows, the bottleneck of hash aggregation or hash join row concatenation becomes more memory access pattern related. 
   
   One thing we might consider is not storing the group key values directly in the hash table, but separately. Something like:
   
   ```text
    ┌─────────────┐               ┌─────────┬─────────┐   
    │             │               │Group Key│Group Key│   
    │  HashTable  │               │  Col 1  │  Col 2  │   
    │             │               ├─────────┼─────────┤   
    ├──────┬──────┤               │         │         │   
    │      │      │               │         │         │   
    │      │      │               │         │         │   
    │      │      │               │         │         │   
    │      │      │               ├ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ┤   
    │      │      │       ┌───────▶        idx        │   
    │ key  │value │       │       ├ ─ ─ ─ ─ ┬ ─ ─ ─ ─ ┤   
    │      ├──────┤       │       │         │         │   
    │      │ idx  │───────┘       │         │         │   
    │      ├──────┤               │         │         │   
    │      │      │               │         │         │   
    │      │      │               │         │         │   
    └──────┴──────┘               │         │         │   
                                  │         │         │   
   HashTable holds indexes        │         │         │   
   to mutable arary               │         │         │   
                                  │   ...   │   ...   │   
                                  └─────────┴─────────┘   
                                                          
                                                          
                                Mutable (Appendable) Array
                                New group keys are        
                                appended at the end       
   ```
   
   The current hash aggregate code takes this approach -- but instead of using Mutable/Appendable arrays it uses a Vec of `GroupState`: https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/hash_aggregate.rs#L615
   
   > I'm not proposing to change all these data structures at once to row-wised but to point out existing theory and practice on using the row-wise organization for these pipeline breakers' states. We should always implement and benchmark before deciding on these performance-critical codes, as we did in the past.
   
   I agree with this 💯  and among other reasons is why I enjoy working with you (and the rest of the people on this chain!)
   
   BTW the DuckDB sorting blog https://duckdb.org/2021/08/27/external-sorting.html (and the paper it references by Goetz Grafe) have a good treatment of sorting (specifically the calculation of sort keys and then a secondary memory shuffle to sort move the original data around correctly)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org