Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/27 15:03:32 UTC

[GitHub] [arrow-rs] Dandandan commented on issue #623: Confusing memory usage with CSV reader

Dandandan commented on issue #623:
URL: https://github.com/apache/arrow-rs/issues/623#issuecomment-887589188


   The CSV reader reuses some allocations across batches to reduce the number of allocations and save time. Generally, this can increase memory usage a bit, because allocations from previous batches are kept around.
   
   However, with a very small batch size of 10, that is not what causes the high memory usage; the data and metadata around each individual `RecordBatch` is: every batch carries a schema with field names, pointers to the data, etc., and with a small batch size this per-batch overhead makes up most of the memory. If you store the batches in a `Vec` instead of iterating over them (in which case each batch is dropped after use), all of them stay in memory at once, and I expect that is what consumes the most memory.
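   A rough back-of-the-envelope sketch of why small batches inflate retained memory (the per-batch overhead constant is an illustrative assumption, not a measured arrow-rs value):

   ```rust
   // Rough model: each RecordBatch carries a roughly fixed amount of metadata
   // (schema reference, field names, array pointers) regardless of row count,
   // so retained overhead scales with the number of batches kept alive.
   fn retained_overhead(total_rows: usize, batch_size: usize, per_batch_overhead: usize) -> usize {
       let num_batches = (total_rows + batch_size - 1) / batch_size; // ceiling division
       num_batches * per_batch_overhead
   }

   fn main() {
       let rows = 1_000_000;
       let overhead = 512; // hypothetical bytes of metadata per batch

       // batch_size = 10 -> 100_000 batches kept alive when stored in a Vec
       println!("batch_size 10:   {} bytes", retained_overhead(rows, 10, overhead));
       // batch_size = 4096 -> only 245 batches
       println!("batch_size 4096: {} bytes", retained_overhead(rows, 4096, overhead));
   }
   ```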
   
   So generally:
   
   * Use a batch size in the thousands, so there is less metadata overhead per row and the columnar Arrow format is used effectively.
   * If you don't have to store the batches in a `Vec`, don't keep them in one; iterate over them instead, as in your first example.
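   A minimal sketch of the second point, using a plain iterator as a stand-in for the real arrow CSV reader (the `Batch` type and `make_batches` helper are illustrative, not the arrow-rs API):

   ```rust
   // Stand-in for arrow's RecordBatch: holds data plus implied per-batch metadata.
   struct Batch {
       rows: Vec<u64>,
   }

   // Produce `n` rows split into batches of `batch_size`, lazily.
   fn make_batches(n: usize, batch_size: usize) -> impl Iterator<Item = Batch> {
       (0..n).step_by(batch_size).map(move |start| Batch {
           rows: (start as u64..((start + batch_size).min(n)) as u64).collect(),
       })
   }

   fn main() {
       // Streaming: each Batch is dropped at the end of the loop iteration,
       // so only one batch is resident at a time.
       let mut total = 0u64;
       for batch in make_batches(100, 10) {
           total += batch.rows.iter().sum::<u64>();
       } // batch freed here
       assert_eq!(total, 4950);

       // Collecting: every batch (and its metadata) stays alive in the Vec
       // until the Vec itself is dropped.
       let all: Vec<Batch> = make_batches(100, 10).collect();
       assert_eq!(all.len(), 10); // 10 batches' worth of metadata retained at once
   }
   ```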


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org