Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/07/11 23:26:09 UTC

[GitHub] [arrow] westonpace commented on issue #36303: PyArrow Write Dataset Memory Consumption - How does it work?

westonpace commented on issue #36303:
URL: https://github.com/apache/arrow/issues/36303#issuecomment-1631631311

   > Is this in reference specifically to each RecordBatch that arrives to the write_dataset function?
   
   Hmm...this is probably close enough to say yes :)  Technically, the `write_dataset` function gets a generator.  This generator is passed to the C++ layer (Acero) which then pulls from the generator one item at a time and (I think but would have to verify) slices the batches that come from the generator into smaller chunks (32Ki rows), partitions those smaller chunks, and then sends those into "the dataset writer".
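   
   For concreteness, here is a minimal sketch of feeding `write_dataset` a generator of `RecordBatch`es (the schema, generator contents, and output path are all hypothetical; only the shape of the call matters):
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   schema = pa.schema([("part", pa.string()), ("value", pa.int64())])
   
   def batch_generator():
       # Hypothetical source: yields one RecordBatch at a time so nothing has
       # to be fully materialized on the Python side.
       for i in range(100):
           yield pa.record_batch(
               [
                   pa.array(["a" if i % 2 == 0 else "b"] * 10_000),
                   pa.array(range(i * 10_000, (i + 1) * 10_000), type=pa.int64()),
               ],
               schema=schema,
           )
   
   ds.write_dataset(
       batch_generator(),              # Acero pulls from this one item at a time
       "/tmp/example_dataset",         # hypothetical output directory
       schema=schema,                  # required when passing a bare generator
       format="parquet",
       partitioning=ds.partitioning(
           pa.schema([("part", pa.string())]), flavor="hive"
       ),
   )
   ```
   
   Since the data arrives as a bare generator rather than a table or dataset, the `schema` argument has to be passed explicitly.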
   
   > My assumption was that pyarrow is keeping the files open until it's arrived at 500k rows per file, but this never happens, because we never get through enough rows to write a full file before we are OOM'd.
   
   Pyarrow will keep the files "open" until they reach 500k rows.  This shouldn't prevent data from being flushed to disk and evicted from RAM.
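   
   Those row-count thresholds are controlled by parameters on `write_dataset`.  A hedged sketch, assuming the 500k figure corresponds to `max_rows_per_file` and using a small in-memory table as a stand-in for the real data:
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   table = pa.table({"part": ["a", "b"] * 5_000, "value": list(range(10_000))})
   
   ds.write_dataset(
       table,
       "/tmp/example_dataset",          # hypothetical output directory
       format="parquet",
       max_rows_per_file=500_000,       # a file is closed once it holds 500k rows
       min_rows_per_group=100_000,      # rows buffer in memory until a row group fills
       max_rows_per_group=100_000,
       max_open_files=100,              # cap on files held open at the same time
       existing_data_behavior="overwrite_or_ignore",
   )
   ```
   
   Of these, `min_rows_per_group` is the one that buffers rows in memory before a row group is written, so it has the most direct effect on how quickly data can be flushed.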
   
   > Right, so that makes sense, is it possible that the flush is not happening fast enough? I.e. the part that is reading rows into memory is happening "too fast" for us to flush to disk and alleviate some of said memory pressure?
   
   Yes, sort of.  There are a few threads here.  Writes happen asynchronously.  One thread (it may be more than one depending on a few factors, but let's assume one and call it the scanning thread) will ask Python for a batch of data, partition the batch, and then create a new thread task to write the partitioned batch.  These write tasks are quite short (they should mostly just be a memcpy from user space into kernel space) and they are submitted to the I/O thread pool (which has 8 threads; call them I/O threads or write threads).  So it's possible that these tasks are piling up.
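   
   Both pools are visible (and adjustable) from Python.  A sketch of things one might experiment with when diagnosing the pile-up; these are guesses under the assumptions above, not a guaranteed fix:
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   table = pa.table({"part": ["a", "b"] * 5_000, "value": list(range(10_000))})
   
   # Scanning/partitioning work runs on the CPU pool; the short write tasks
   # are submitted to the I/O pool (8 threads by default).
   print(pa.cpu_count(), pa.io_thread_count())
   
   # Experiment: shrink the I/O pool and disable threaded scanning so the
   # producer side cannot get far ahead of the writes.
   pa.set_io_thread_count(2)
   
   ds.write_dataset(
       table,
       "/tmp/example_dataset",          # hypothetical output directory
       format="parquet",
       use_threads=False,               # single-threaded scanning
       existing_data_behavior="overwrite_or_ignore",
   )
   ```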
   
   > Max row counts above ~75k will end up with a container of 2GB memory being OOM killed and my write process doesn't complete :(
   
   This part I don't really understand.  It's possible that Docker is complicating things here.  I don't know much about the Docker filesystem (and it appears there are several).  What is doing the OOM killing?  Is the host OS killing the Docker process because it is using too much host memory?  Or is some internal Kubernetes monitor killing things?

