Posted to issues@arrow.apache.org by "marchostau (via GitHub)" <gi...@apache.org> on 2023/06/16 15:45:11 UTC

[GitHub] [arrow] marchostau opened a new issue, #36126: Memory leak of table (table=(pyarrow.Table.from_pandas(df)).combine_chunks() and record_batch

marchostau opened a new issue, #36126:
URL: https://github.com/apache/arrow/issues/36126

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I'm trying to put dataframes of different dtypes into multiprocessing shared memory. With dataframes that contain plain Python types I have no problem, but with pyarrow-backed types I have trouble closing the shared memory segments. I have one function that writes a dataframe to shared memory and another that reads it back. When the dataframe uses Arrow types such as string[pyarrow], I run into two problems:
   
   1. If I call close() on sm_put and then, after receiving the dataframe, call close() on sm_get, I immediately get an error that stops execution: sm_get.close() -> self.close() -> self._buf.release() -> BufferError: memoryview has 1 exported buffer.
   2. If I call close() on sm_put but never call close() on sm_get, nothing fails immediately, but after a while I get a warning (which surprises me, because I never call self.close() myself): self.close() -> self._buf.release() -> BufferError: memoryview has 1 exported buffer.
   
   My functions to write the dataframe to shared memory and to read it back are these (a minimal sketch of just the BufferError mechanism follows after them):
   ```python
   import sys
   
   import pandas as pd
   import pyarrow as pa
   from multiprocessing.shared_memory import SharedMemory
   
   
   def put_df(data, logTime=None, logSize=None, usingArrow=False):
       if usingArrow:
           table = pa.Table.from_pandas(data).combine_chunks()
           record_batch = table.to_batches(max_chunksize=sys.maxsize)[0]
       else:
           record_batch = pa.RecordBatch.from_pandas(data)
   
       # Determine the size of buffer to request from the shared memory
       mock_sink = pa.MockOutputStream()
       stream_writer = pa.RecordBatchStreamWriter(mock_sink, record_batch.schema)
       stream_writer.write_batch(record_batch)
       stream_writer.close()
       data_size = mock_sink.size()
   
       sm_put = SharedMemory(create=True, size=data_size)
       buffer = pa.py_buffer(sm_put.buf)
   
       # Write the PyArrow RecordBatch to SharedMemory
       stream = pa.FixedSizeBufferWriter(buffer)
       stream_writer = pa.RecordBatchStreamWriter(stream, record_batch.schema)
       stream_writer.write_batch(record_batch)
       stream_writer.close()
   
       del stream_writer
       del stream
       del buffer
       if usingArrow:
           del table
       del record_batch
   
       sm_put.close()
       return sm_put.name
   
   
   def get_df(sm_get_name, logTime=None, logSize=None, usingArrow=False):
       sm_get = SharedMemory(name=sm_get_name, create=False)
       buffer = pa.BufferReader(sm_get.buf)
   
       # Convert the stream back into an Arrow RecordBatch
       reader = pa.RecordBatchStreamReader(buffer)
       record_batch = reader.read_next_batch()
   
       # Convert back into pandas
       if usingArrow:
           data = record_batch.to_pandas(types_mapper=pd.ArrowDtype)
       else:
           data = record_batch.to_pandas()
   
       del buffer
       del reader
       del record_batch
   
       sm_get.close()
       sm_get.unlink()
   
       return data
   
   
   if __name__ == '__main__':
       file = "data.csv"  # placeholder path to a two-column CSV file
       names = ["0", "1"]
       types = {"0": "string[pyarrow]", "1": "string[pyarrow]"}
       df = pd.read_csv(open(file, "r"), index_col=False, header=None,
                        names=names, dtype=types, engine="c")
       sm_get_name = put_df(df, usingArrow=True)
       data = get_df(sm_get_name=sm_get_name, usingArrow=True)
   ```
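   
   If I understand the traceback correctly, the BufferError comes from the buffer-protocol export that pa.py_buffer() takes on the SharedMemory's memoryview: SharedMemory.close() calls self._buf.release(), and memoryview.release() refuses to run while any export is still alive. Here is a minimal sketch of just that mechanism, with no dataframe involved, assuming nothing else holds a reference to the Arrow buffer:
   
   ```python
   from multiprocessing.shared_memory import SharedMemory
   import pyarrow as pa
   
   shm = SharedMemory(create=True, size=1024)
   arrow_buf = pa.py_buffer(shm.buf)  # takes a buffer-protocol export on shm.buf
   
   try:
       shm.close()                    # memoryview.release() fails while the export is alive
   except BufferError as e:
       print(e)                       # "memoryview has 1 exported buffer"
   
   del arrow_buf                      # drop the last reference so the export is released
   shm.close()                        # now the memoryview can be released
   shm.unlink()
   ```
   
   In get_df, converting with types_mapper=pd.ArrowDtype is, as far as I know, zero-copy, so the returned dataframe may still reference the shared-memory buffer even after del record_batch, which would keep such an export alive when sm_get.close() runs.
   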
   As you can see in the image below, the del table in put_df has no effect. I would also expect the del on record_batch in put_df to decrease the memory in use.
   
   ![2](https://github.com/apache/arrow/assets/82412218/af600268-d67f-45e3-b264-d67c9a005b04)
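   
   To check whether Arrow itself is still holding that memory after the del calls, one thing I've been looking at is the default memory pool via pa.total_allocated_bytes(). This is only a rough sketch of what I mean (plain object-dtype strings here so that Table.from_pandas definitely allocates in the pool; with string[pyarrow] input more of the conversion may be zero-copy):
   
   ```python
   import pandas as pd
   import pyarrow as pa
   
   # Plain object-dtype strings, so Table.from_pandas allocates in Arrow's default pool
   df = pd.DataFrame({"0": ["a"] * 1000, "1": ["b"] * 1000})
   
   table = pa.Table.from_pandas(df).combine_chunks()
   record_batch = table.to_batches()[0]  # zero-copy: the batch shares the table's buffers
   
   print(pa.total_allocated_bytes())  # bytes currently held by the default memory pool
   del table                          # little effect on its own: record_batch still references the buffers
   print(pa.total_allocated_bytes())
   del record_batch                   # last reference gone, so the allocation can now drop
   print(pa.total_allocated_bytes())
   ```
   
   If that is right, then in put_df both del table and del record_batch (and anything else still referencing those buffers) have to go before the memory can actually come back.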
   
   
   ### Component(s)
   
   Python


[GitHub] [arrow] westonpace commented on issue #36126: Memory leak of table (table=(pyarrow.Table.from_pandas(df)).combine_chunks() and record_batch

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #36126:
URL: https://github.com/apache/arrow/issues/36126#issuecomment-1601748737

   This looks very similar to #36101. Do you want to close one of these as a duplicate of the other? Or can you perhaps explain how the two issues are different?


[GitHub] [arrow] marchostau commented on issue #36126: Memory leak of table (table=(pyarrow.Table.from_pandas(df)).combine_chunks() and record_batch

Posted by "marchostau (via GitHub)" <gi...@apache.org>.
marchostau commented on issue #36126:
URL: https://github.com/apache/arrow/issues/36126#issuecomment-1602120764

   I would like to close this one, since it's pretty similar. I would appreciate it if you could answer the question I left in https://github.com/apache/arrow/issues/36101 about why the memory is not released immediately when I call del table in put_df, and how I can solve that. Thank you!


[GitHub] [arrow] westonpace closed issue #36126: Memory leak of table (table=(pyarrow.Table.from_pandas(df)).combine_chunks() and record_batch

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace closed issue #36126: Memory leak of table (table=(pyarrow.Table.from_pandas(df)).combine_chunks() and record_batch
URL: https://github.com/apache/arrow/issues/36126

