Posted to issues@arrow.apache.org by "vminfant (via GitHub)" <gi...@apache.org> on 2023/06/15 15:29:39 UTC

[GitHub] [arrow] vminfant opened a new issue, #36100: Pyarrow Table.pylist doesn't release memory untill the program terminates.

vminfant opened a new issue, #36100:
URL: https://github.com/apache/arrow/issues/36100

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hi,
   
   I have been using pyarrow `Table.from_pylist` to convert a list to a pyarrow table, and I have observed that the memory allocated by `pyarrow` doesn't get released until the program terminates (especially on **Linux**). My use case often involves long-running processes, so this is causing a problem.
   
   Sample script to reproduce the problem. I have used [memray](https://bloomberg.github.io/memray/getting_started.html) to profile and check the memory usage. 
   
   ```python
   #file_name: test_exec.py
   
   import pyarrow as pa
   import time
   import random
   import string
   
   def get_sample_data():
       record1 = {}
       for col_id in range(15):
           record1[f"column_{col_id}"] = string.ascii_letters[10 : random.randint(17, 49)]
   
       return [record1]
   
   def construct_data(data):
       count = 1
       while count < 10:
           pa.Table.from_pylist(data * 100000)
           count += 1
       return True
   
   def main():
       data = get_sample_data()
       construct_data(data)
       print("construct data completed!")
   
   if __name__ == "__main__":
       main()
       time.sleep(180)
   ```
   
   I have also checked the documentation and found that a different memory pool can be selected by setting the environment variable `ARROW_DEFAULT_MEMORY_POOL` to `jemalloc`, `mimalloc` or `system`. I tried all three on both `macOS` and `linux`: on `macOS` I saw minor improvements, but on `linux` I couldn't see any difference regardless of the memory pool configuration.
   
   `ARROW_DEFAULT_MEMORY_POOL=jemalloc` / OS: `macOS`
   
   ![image](https://github.com/apache/arrow/assets/27796304/dc8b56f9-2ed5-4106-b0c6-f6401ed69798)
   
   `ARROW_DEFAULT_MEMORY_POOL=jemalloc` / OS: `linux`
   
   ![image](https://github.com/apache/arrow/assets/27796304/4e7857dc-18b9-42af-ae32-a5399043692b)
   
   `ARROW_DEFAULT_MEMORY_POOL=mimalloc` / OS: `macOS`
   
   ![image](https://github.com/apache/arrow/assets/27796304/52a55f78-3778-4a17-8004-40bce1477aab)
   
   `ARROW_DEFAULT_MEMORY_POOL=mimalloc` / OS: `linux`
   
   ![image](https://github.com/apache/arrow/assets/27796304/78a3cd32-7bf4-4b48-afae-d03c1d746982)
   
   `ARROW_DEFAULT_MEMORY_POOL=system` / OS: `macOS`
   
   ![image](https://github.com/apache/arrow/assets/27796304/d6319b7c-aa94-4bf6-96ce-11cab9c13487)
   
   `ARROW_DEFAULT_MEMORY_POOL=system` / OS: `linux`
   
   ![image](https://github.com/apache/arrow/assets/27796304/0469b33b-7665-4a36-a14e-e696ba28b0ef)
   
   It would be really helpful if someone could explain what's happening or suggest an approach to eliminate this problem on Linux. I highly appreciate your efforts on this issue.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] vminfant closed issue #36100: [Python] Pyarrow Table.pylist doesn't release memory untill the program terminates.

Posted by "vminfant (via GitHub)" <gi...@apache.org>.
vminfant closed issue #36100: [Python] Pyarrow Table.pylist doesn't release memory untill the program terminates. 
URL: https://github.com/apache/arrow/issues/36100


[GitHub] [arrow] westonpace commented on issue #36100: [Python] Pyarrow Table.pylist doesn't release memory untill the program terminates.

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #36100:
URL: https://github.com/apache/arrow/issues/36100#issuecomment-1599665149

   I'm not entirely sure what you are expecting, but I don't think `from_pylist` is to blame.  Debugging memory usage like this is pretty complex.  First, just importing the code and running it for the first time loads various shared objects, etc. into RSS.  To isolate this effect we can seed the program by creating a single-row table first.  I'm also going to run `construct_data` three times in a row.
   
   ![image](https://github.com/apache/arrow/assets/1696093/b57a5fa7-52b5-4d13-8ee6-1ab4adc589af)
   
   Now we see that `from_pylist` (the zig-zag parts) doesn't use much additional RAM, but it also doesn't appear to release it immediately.  In fact, it could almost look like it is leaking a bit of memory each time.  However, if we run `construct_data` more times we can see that it isn't actually leaking.
   
   What is happening is that pyarrow is returning the memory back to the allocator (in these graphs I was using the system allocator so we are returning the memory to `malloc`).  However, the allocator is not releasing this memory to the OS.  This is because obtaining memory from the OS is expensive and so the allocator tries to avoid it if it can.
   
   ![image](https://github.com/apache/arrow/assets/1696093/f71ccba0-4f1f-4aea-b9f0-34ed2637f4cc)
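
The gap between memory freed by the program and memory returned to the OS can also be watched without a profiler. The following is a rough sketch using only the standard library; `current_rss_bytes` is a hypothetical helper name, it relies on `/proc/self/statm` and so only reports a value on Linux, and it exercises Python's own allocations rather than Arrow's.

```python
import os


def current_rss_bytes():
    """Return this process's resident set size in bytes, or None if unavailable."""
    try:
        with open("/proc/self/statm") as statm:
            # The second field of statm is the number of resident pages.
            resident_pages = int(statm.read().split()[1])
    except (OSError, IndexError, ValueError):
        return None  # e.g. macOS or Windows, where /proc is not available
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


rss_before = current_rss_bytes()
payload = [str(i) * 10 for i in range(200_000)]  # allocate many small objects
rss_peak = current_rss_bytes()
del payload  # freed back to the allocator, not necessarily to the OS
rss_after = current_rss_bytes()
print(f"before={rss_before} peak={rss_peak} after={rss_after}")
```

How far the last figure drops back depends on the allocator and platform, which is the behaviour described above.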
   
   We can verify this by printing `pa.total_allocated_bytes()`.  This tells us how much memory pyarrow has currently allocated from its memory pool and not yet freed back to the allocator.  We can see that this is always 0.
   
   Finally, there is a method we can call, mainly for debugging purposes, to ask the allocator to return unused memory to the OS: `pa.default_memory_pool().release_unused()`.  What this actually does under the hood depends on the allocator.  For malloc, this triggers a call to [`malloc_trim`](https://man7.org/linux/man-pages/man3/malloc_trim.3.html).
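
For reference, `malloc_trim` can also be called directly from Python through `ctypes`. This is only a sketch of the glibc call itself, not of how Arrow invokes it; `trim_glibc_heap` is a hypothetical helper name, and it returns `None` on platforms whose C library has no `malloc_trim` (macOS or musl, for example).

```python
import ctypes
import ctypes.util


def trim_glibc_heap(pad=0):
    """Ask glibc to return free heap memory to the OS.

    Returns the result of malloc_trim (1 if memory was released, 0 if not),
    or None when malloc_trim is not available on this platform.
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return None
    try:
        libc = ctypes.CDLL(libc_name)
        malloc_trim = libc.malloc_trim  # AttributeError if the symbol is missing
    except (OSError, AttributeError):
        return None
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    return malloc_trim(pad)


print(f"malloc_trim result: {trim_glibc_heap()}")
```

This only affects the system allocator, so it would not be expected to help when the jemalloc or mimalloc pool is in use.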
   
   If we call `release_unused` after each call to `construct_data`, we see that the RAM is returned to the OS each time.
   
   ![image](https://github.com/apache/arrow/assets/1696093/f61a9cdb-47c7-4a81-817b-af08dbd6e435)
   
   Updated example demonstrating some of these things.
   
   ```python
   import pyarrow as pa
   import time
   import random
   import string
   
   def get_sample_data():
       # Build one record with 15 string columns of random length.
       record1 = {}
       for col_id in range(15):
           record1[f"column_{col_id}"] = string.ascii_letters[10 : random.randint(17, 49)]
   
       return [record1]
   
   def construct_data(data, size):
       # Build and immediately discard a table of `size` rows, nine times over.
       count = 1
       while count < 10:
           pa.Table.from_pylist(data * size)
           count += 1
       return True
   
   def main():
       data = get_sample_data()
       # Seed with a single row so one-time loading costs show up separately.
       construct_data(data, 1)
       print(f"initial seeding complete! total_allocated_bytes={pa.total_allocated_bytes()}")
       time.sleep(10)
       construct_data(data, 100000)
       # Ask the allocator to hand unused memory back to the OS.
       pa.default_memory_pool().release_unused()
       print(f"construct data completed! total_allocated_bytes={pa.total_allocated_bytes()}")
       time.sleep(10)
       construct_data(data, 100000)
       pa.default_memory_pool().release_unused()
       print(f"construct data completed! total_allocated_bytes={pa.total_allocated_bytes()}")
       time.sleep(10)
       construct_data(data, 100000)
       pa.default_memory_pool().release_unused()
       print(f"construct data completed! total_allocated_bytes={pa.total_allocated_bytes()}")
   
   if __name__ == "__main__":
       main()
   ```


[GitHub] [arrow] vminfant commented on issue #36100: [Python] Pyarrow Table.pylist doesn't release memory untill the program terminates.

Posted by "vminfant (via GitHub)" <gi...@apache.org>.
vminfant commented on issue #36100:
URL: https://github.com/apache/arrow/issues/36100#issuecomment-1607212858

   Hi @westonpace, thank you so much for your input and time. `total_allocated_bytes()` reports `0` in all my tests as well. We can close this issue.

