You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/23 13:49:06 UTC

[GitHub] [arrow-datafusion] alamb commented on issue #2723: Consolidate GroupByHash implementations `row_hash.rs` and `hash.rs` (remove duplication)

alamb commented on issue #2723:
URL: https://github.com/apache/arrow-datafusion/issues/2723#issuecomment-1325093786

   I general, other than the fact that this proposal sounds like a lot of work, I think it sounds wonderful 🏆 
   
   I did have a question about the proposed trait implementation:
   
   ```rust
   trait Aggregator: MemoryConsumer {
       /// Update aggregator state for given groups.
       fn update_batch(&mut self, keys: Rows, batch: &RecordBatch)) -> Result<()>;
       ...
   }
   ```
   
   Is the idea that each aggregator would contain a HashMap or some other way to map keys --> intermediates (rather than the hash map being in the aggregator? This seems like it would result in a fair amount of duplication
   
   I would have expected something like the following (basically push the `take` into the aggregator)
   
   ```rust
   trait Aggregator: MemoryConsumer {
       /// Update aggregator state for given rows
       fn update_batch(&mut self, indices: Vec<usize>, batch: &RecordBatch)) -> Result<()>;
       ...
   }
   ```
   
   
   The big danger of the plan initially seems like  that it wouldn't be finished and then we are in an even worse state (3 implementations!) but I think your idea to incrementally rewrite V2 to V3 and then remove V1 sounds like a good mitigation strategy
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org