Posted to issues@arrow.apache.org by "cwang9208 (via GitHub)" <gi...@apache.org> on 2023/03/03 02:48:45 UTC

[GitHub] [arrow] cwang9208 opened a new issue, #34423: pyarrow MemoryMappedFile close does not release memory

cwang9208 opened a new issue, #34423:
URL: https://github.com/apache/arrow/issues/34423

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   `path = ""
   files = os.listdir(path)
   expr = pc.field("l_shipdate") <= datetime.date(1998, 12, 1)
   
   for file in files:
       source = pa.memory_map(path + file)
       table = pa.ipc.RecordBatchFileReader(source).read_all().filter(expr)
       source.close()
   `
   
   I have a simple program to test the memory usage of MemoryMappedFile. In the test above, I loop over hundreds of 7 GB files and find that system memory usage keeps increasing until it is exhausted, even though I call `source.close()`.
   
   Is there anything wrong with my code, or is this a bug?
   
   Thanks in advance for your help.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] cwang9208 commented on issue #34423: [Python] pyarrow MemoryMappedFile.close() does not release memory

Posted by "cwang9208 (via GitHub)" <gi...@apache.org>.
cwang9208 commented on issue #34423:
URL: https://github.com/apache/arrow/issues/34423#issuecomment-1457324390

   @westonpace I already tried that (code below), but memory still ends up exhausted (monitored with `watch -n 1 free -mh`).
   
   ```
   import os
   import datetime
   import gc
   import pyarrow.compute as pc
   import pyarrow as pa
   
   path = ""
   files = os.listdir(path)
   expr = pc.field("l_shipdate") <= datetime.date(1998, 12, 1)
   
   for file in files:
       source = pa.memory_map(path + file)
       reader = pa.ipc.RecordBatchFileReader(source)
       table = reader.read_all().filter(expr)
       del table
       del reader
       source.close()
       gc.collect()
   ```
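   
   A variant of the same loop that also tracks this process's resident set size might help separate the process's own usage from what `free` reports for the whole system. This is only a sketch, and it assumes the third-party `psutil` package is installed (it is not part of the original report):
   
   ```
   import os
   import datetime
   import gc

   import psutil                  # assumption: third-party package, not used in the original report
   import pyarrow as pa
   import pyarrow.compute as pc

   proc = psutil.Process()
   path = ""                      # path elided in the original report
   files = os.listdir(path)
   expr = pc.field("l_shipdate") <= datetime.date(1998, 12, 1)

   for file in files:
       source = pa.memory_map(path + file)
       reader = pa.ipc.RecordBatchFileReader(source)
       table = reader.read_all().filter(expr)
       del table
       del reader
       source.close()
       gc.collect()
       # resident set size of this process only, independent of the system page cache
       print(f"{file}: rss={proc.memory_info().rss / 2**20:.0f} MiB")
   ```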


[GitHub] [arrow] westonpace commented on issue #34423: [Python] pyarrow MemoryMappedFile.close() does not release memory

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34423:
URL: https://github.com/apache/arrow/issues/34423#issuecomment-1454259289

   Assuming you aren't saving the `Table` or references to its data somewhere, then yes, I would expect it to release the memory. A memory-mapped file should call `munmap` once the file is closed and all references to the mapped memory are released (e.g. the table is destroyed).
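   
   To make that lifetime rule concrete, here is a minimal sketch (the file name `example.arrow` is hypothetical): the mapping stays mapped for as long as any zero-copy buffer or table built on it is alive, not just while the file object is open.
   
   ```
   import pyarrow as pa

   # hypothetical file name, purely for illustration
   source = pa.memory_map("example.arrow")
   buf = source.read_buffer()   # zero-copy view into the mapping
   source.close()               # closes the handle; `buf` still pins the mapping
   del buf                      # only now can the underlying mmap be released (munmap)
   ```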


[GitHub] [arrow] cwang9208 commented on issue #34423: [Python] pyarrow MemoryMappedFile.close() does not release memory

Posted by "cwang9208 (via GitHub)" <gi...@apache.org>.
cwang9208 commented on issue #34423:
URL: https://github.com/apache/arrow/issues/34423#issuecomment-1455529427

   @westonpace Hi, thanks for your reply. So you mean my code is correct and this should be a bug, right?


[GitHub] [arrow] westonpace commented on issue #34423: [Python] pyarrow MemoryMappedFile.close() does not release memory

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34423:
URL: https://github.com/apache/arrow/issues/34423#issuecomment-1456938403

   Probably. What happens if you put a manual `gc.collect()` in the loop?
   
   ```
   import gc
   ...
   for file in files:
       source = pa.memory_map(path + file)
       table = pa.ipc.RecordBatchFileReader(source).read_all().filter(expr)
       source.close()
       gc.collect()
   ```
   
   I'm wondering if some dangling Python object (e.g. maybe the `pa.ipc.RecordBatchFileReader`) is hanging around and keeping references to the buffers.
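   
   One hedged way to check for that (not from the original thread): after dropping everything, look at the interpreter-level reference count on the mapped file with `sys.getrefcount`. Note that this only sees Python references; anything held on the C++ side would not show up here.
   
   ```
   import gc
   import sys

   for file in files:
       source = pa.memory_map(path + file)
       reader = pa.ipc.RecordBatchFileReader(source)
       table = reader.read_all().filter(expr)
       del table
       del reader
       source.close()
       gc.collect()
       # 2 == the local name `source` plus getrefcount's own argument;
       # a larger number means some other Python object still holds the mapped file
       print(sys.getrefcount(source))
   ```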

