Posted to github@arrow.apache.org by "pablodcar (via GitHub)" <gi...@apache.org> on 2023/10/03 07:51:46 UTC

Re: [I] Real Time Analytics: concat_tables memory and performance degradation [arrow]

pablodcar commented on issue #37801:
URL: https://github.com/apache/arrow/issues/37801#issuecomment-1744394914

   Thanks for the response.
   
   > In a process where you are continuously appending data, the better approach would be a two-step logic: first gather the batches of data in a separate Table, and only when this reaches a certain size threshold (e.g. 65k rows), actually combine the chunks of this table (involving a copy) into a single chunk, and then append this combined chunk to the overall `rates_table`.
   > 
   > It would be interesting to see whether, when using such an approach, you still notice memory issues.
   > 
   > > and if I use `combine_chunks` from time to time, things get worse.
   > 
   > Given my explanation above, I have to admit that this is a bit strange...
   
   Calling `combine_chunks` every 65k items, instead of a single call at the end, helps in terms of both CPU and memory. In fact, memory usage remains almost constant: if I run the cycle twice, memory consumption is not doubled, it only increases slightly.
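
   For reference, a minimal sketch of this two-step logic in PyArrow (the schema, the `append_batch` helper, and the exact `THRESHOLD` value are illustrative assumptions, not taken from the original report):

   ```python
   import pyarrow as pa

   # Hypothetical schema standing in for the real rates data.
   schema = pa.schema([("symbol", pa.string()), ("rate", pa.float64())])

   rates_table = schema.empty_table()
   buffer_tables = []    # small incoming batches, still chunked
   buffer_rows = 0
   THRESHOLD = 65_000    # combine once the buffer reaches ~65k rows

   def append_batch(batch: pa.RecordBatch) -> None:
       """Buffer an incoming batch; fold the buffer into rates_table in bulk."""
       global rates_table, buffer_tables, buffer_rows
       buffer_tables.append(pa.Table.from_batches([batch]))
       buffer_rows += batch.num_rows
       if buffer_rows >= THRESHOLD:
           # The only copy happens here: collapse the buffered chunks into a
           # single contiguous chunk, then append that one chunk to the
           # long-lived table.
           combined = pa.concat_tables(buffer_tables).combine_chunks()
           rates_table = pa.concat_tables([rates_table, combined])
           buffer_tables = []
           buffer_rows = 0
   ```

   Note that `pa.concat_tables` itself is zero-copy (it only collects chunk references), so memory is actually copied only in `combine_chunks`, once per ~65k rows rather than on every append.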
   
   Will test with a long-running process, thanks.
   