You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/04/27 15:50:29 UTC

[GitHub] [arrow-datafusion] alamb commented on issue #1570: Memory Limited GroupBy (Externalized / Spill)

alamb commented on issue #1570:
URL: https://github.com/apache/arrow-datafusion/issues/1570#issuecomment-1525937016

   > Can we make the GroupState and the Accumulator states serializable ?
   > With this approach, we do not need to do any sort when spiiling data to disks. And when we read the data back, we reconstruct our raw hash table quickly from the hash values and indexes, because our hashmap is very lightweight, the hash value can be re-calculated from grouping rows, or we can cache the hash value inside the GroupState to avoid the re-calculating.
   
   @mingmwang  when we serialize the groups to disk when  the hash table is full, we then need to read them back in again somehow and do the final combination. <y assumption is that we don't have the memory to read them all back in at once as we had to spill in the first place
   
   If we sort the data that is spilled on the group keys, we can  stream data from the different spill files in parallel, merge, and then do a streaming group by, which we will have sufficient memory to accomplished. 
   
   > IMHO, aggregation should start with hash map, we can assume that there is not going to be spill, if we're wrong we would pay penalty of being wrong as we will have to sort it before spill.
   
   Yes I agree with @milenkovicm - this approach will work well. It does have a disadvantage of a performance "cliff" when the query goes from in memory --> spilling (it doesn't degrade gracefully) but it is likely the fastest for queries that have sufficient memory 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org