Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/06/20 22:38:22 UTC

[GitHub] [arrow] westonpace commented on issue #36100: [Python] Pyarrow Table.pylist doesn't release memory untill the program terminates.

westonpace commented on issue #36100:
URL: https://github.com/apache/arrow/issues/36100#issuecomment-1599665149

   I'm not entirely sure what you are expecting and I don't think `from_pylist` is to blame.  Debugging memory usage like this is pretty complex.  First, just importing the code and running it through the first time is going to load various shared objects, etc. into RSS.  To see this effect we can seed the program by creating a single row first.  I'm also going to run `construct_data` three times in a row.
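   
   To watch RSS from inside the process, something like the following can be sampled before and after each step.  This is only a sketch: it assumes Linux (it reads `/proc/self/statm`), and `current_rss_bytes` is a hypothetical helper name, not part of pyarrow.

```python
import os


def current_rss_bytes():
    """Return the resident set size of this process in bytes (Linux only)."""
    # /proc/self/statm reports sizes in pages; the second field is resident pages.
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


# Sample once; in practice, call this before and after the code being measured.
print(f"RSS: {current_rss_bytes() / 2**20:.1f} MiB")
```

   Sampling once right after the imports and once after a small warm-up call gives a baseline that already includes the shared objects mentioned above.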
   
   ![image](https://github.com/apache/arrow/assets/1696093/b57a5fa7-52b5-4d13-8ee6-1ab4adc589af)
   
   Now we can see that `from_pylist` (the zig-zag parts) doesn't use much additional RAM, but it also doesn't appear to release it immediately.  In fact, it almost looks like it is leaking a bit of memory each time.  However, if we run `construct_data` more times we can see that it isn't actually leaking.
   
   What is happening is that pyarrow is returning the memory to the allocator (in these graphs I was using the system allocator, so we are returning the memory to `malloc`).  However, the allocator is not releasing this memory to the OS, because obtaining memory from the OS is expensive and the allocator tries to avoid doing so if it can.
   
   ![image](https://github.com/apache/arrow/assets/1696093/f71ccba0-4f1f-4aea-b9f0-34ed2637f4cc)
   
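   We can verify this by printing `pa.total_allocated_bytes()`.  This tells us how many bytes are currently allocated from pyarrow's default memory pool, i.e. memory that pyarrow has not yet returned to the allocator.  We can see that this is always 0 once `construct_data` has finished, so pyarrow itself is not holding on to the memory.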
   
   Finally, there is a method we can call, mainly for debugging purposes, to ask the allocator to return the memory to the OS.  What this actually does under the hood depends on the allocator.  For malloc, this triggers a call to [`malloc_trim`](https://man7.org/linux/man-pages/man3/malloc_trim.3.html).
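   
   To illustrate what `malloc_trim` does on its own, here is a sketch that calls it directly through `ctypes`.  It assumes a POSIX system with glibc; `trim_glibc_heap` is a hypothetical helper name, and this is not how pyarrow itself is implemented, just the same libc call made from Python.

```python
import ctypes


def trim_glibc_heap():
    """Ask glibc to hand free heap memory back to the OS.

    Returns True if some memory was released, False if none was,
    and None if malloc_trim is not available on this platform.
    """
    try:
        # CDLL(None) exposes symbols already loaded into the process (POSIX only).
        libc = ctypes.CDLL(None)
    except (OSError, TypeError):
        return None
    if not hasattr(libc, "malloc_trim"):
        return None  # e.g. macOS or musl-based systems
    libc.malloc_trim.argtypes = [ctypes.c_size_t]
    libc.malloc_trim.restype = ctypes.c_int
    # pad=0: do not ask glibc to keep extra free space at the top of the heap.
    return bool(libc.malloc_trim(0))


print(f"malloc_trim released memory: {trim_glibc_heap()}")
```

   This only affects memory managed by glibc's `malloc`; other allocators keep their own caches and need their own mechanism.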
   
   If we use `release_unused` then we see the RAM is returned to the OS after each call to `construct_data`.
   
   ![image](https://github.com/apache/arrow/assets/1696093/f61a9cdb-47c7-4a81-817b-af08dbd6e435)
   
   Updated example demonstrating some of these things.
   
   ```
   import pyarrow as pa
   import time
   import random
   import string
   
   def get_sample_data():
       record1 = {}
       for col_id in range(15):
           record1[f"column_{col_id}"] = string.ascii_letters[10 : random.randint(17, 49)]
   
       return [record1]
   
   def construct_data(data, size):
       count = 1
       while count < 10:
           # The table is discarded right away, so its buffers are freed back to the memory pool.
           pa.Table.from_pylist(data * size)
           count += 1
       return True
   
   def main():
       data = get_sample_data()
       construct_data(data, 1)
       print(f"initial seeding complete! total_allocated_bytes={pa.total_allocated_bytes()}")
       time.sleep(10)
       construct_data(data, 100000)
       pa.default_memory_pool().release_unused()
       print(f"construct data completed! total_allocated_bytes={pa.total_allocated_bytes()}")
       time.sleep(10)
       construct_data(data, 100000)
       pa.default_memory_pool().release_unused()
       print(f"construct data completed! total_allocated_bytes={pa.total_allocated_bytes()}")
       time.sleep(10)
       construct_data(data, 100000)
       pa.default_memory_pool().release_unused()
       print(f"construct data completed! total_allocated_bytes={pa.total_allocated_bytes()}")
   
   if __name__ == "__main__":
       main()
   ```

