Posted to issues@arrow.apache.org by "guozhans (via GitHub)" <gi...@apache.org> on 2024/03/22 09:35:59 UTC

[I] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]

guozhans opened a new issue, #40738:
URL: https://github.com/apache/arrow/issues/40738

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hi,
   
   I tried to save a Pandas DataFrame to Parquet files and encountered a memory leak issue. The leak still occurs even though I installed the nightly build pyarrow 16.0.0.dev356, in which this issue should have been fixed according to https://github.com/apache/arrow/issues/37989
   
   Any idea? 
   
   Here is the memory usage as reported by memory_profiler:
   
   ```
   Line #    Mem usage    Increment  Occurrences   Line Contents
   =============================================================
       33    425.8 MiB    425.8 MiB           1           @profile
       34                                                 def to_parquet(self, df: pd.DataFrame, filename: str):
       35    537.6 MiB    111.9 MiB           1               table = Table.from_pandas(df)
       36    559.1 MiB     21.4 MiB           1               parquet.write_table(table, filename, compression="snappy")
       37    559.1 MiB      0.0 MiB           1               del table
       38                                                     #df.to_parquet(filename, compression="snappy")
   ```
   
   My method:
   ```python
   import pandas as pd
   from memory_profiler import profile
   from pyarrow import Table, parquet


   @profile
   def to_parquet(self, df: pd.DataFrame, filename: str):
       table = Table.from_pandas(df)
       parquet.write_table(table, filename, compression="snappy")
       del table
       #df.to_parquet(filename, compression="snappy")
   ```
   
   My related installed packages:
   numpy                     1.22.4
   pandas                    2.1.4
   pyarrow                   16.0.0.dev356
   pyarrow-hotfix            0.6   (pulled in by dask)
   dask                      2024.2.1
   
   ### Component(s)
   
   Parquet, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][Parquet] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]

Posted by "guozhans (via GitHub)" <gi...@apache.org>.
guozhans commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2024457991

   I closed this issue; see the comment above.


Re: [I] [Python][Parquet] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]

Posted by "kyle-ip (via GitHub)" <gi...@apache.org>.
kyle-ip commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2021748987

   @guozhans Great. Which version are you using now?


Re: [I] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]

Posted by "kyle-ip (via GitHub)" <gi...@apache.org>.
kyle-ip commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2016399240

   Hi @guozhans .
   Have you tried releasing the memory pool? See https://arrow.apache.org/docs/python/generated/pyarrow.MemoryPool.html#pyarrow.MemoryPool.release_unused
   
   I encountered a similar issue:
   Every time I used **pyarrow.parquet.ParquetDataset** to load parquet from S3, the memory usage kept increasing and could not be released, so I called **release_unused** after the I/O operations:
   ```python
   import pyarrow as pa
   
   pool = pa.default_memory_pool()
   # ...
   pool.release_unused()
   ```
   However, the occupied memory was not released immediately, only when I ran the workload the next time. It is also unclear how much memory can actually be released.
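
   For reference, the pool statistics make the effect visible. A minimal sketch using the pyarrow.MemoryPool API (backend_name, bytes_allocated, max_memory, and release_unused are the documented members; the workload in between is whatever triggers the growth):
   ```python
   import pyarrow as pa

   pool = pa.default_memory_pool()
   print(pool.backend_name)       # which allocator backs the pool, e.g. "jemalloc"

   # ... run the ParquetDataset reads here ...

   print(pool.bytes_allocated())  # bytes currently allocated through this pool
   print(pool.max_memory())       # peak bytes ever allocated through this pool
   pool.release_unused()          # best-effort hint; the allocator may keep caches
   print(pool.bytes_allocated())  # live buffers are unaffected, only unused cache is freed
   ```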
   


Re: [I] [Python][Parquet] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]

Posted by "kyle-ip (via GitHub)" <gi...@apache.org>.
kyle-ip commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2024484407

   @guozhans Thank you very much. Your information helps me a lot!


Re: [I] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]

Posted by "guozhans (via GitHub)" <gi...@apache.org>.
guozhans commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2016835221

   Hi @kyle-ip,
   Thanks for the info; I hadn't tried that. But the issue still occurred after I did. The issue shows up in a multi-thread or multi-process environment; I don't know whether it only behaves correctly under a single thread. The issue occurred only when I set **n_workers** to more than one.
   
   Result 
   ```
   Line #    Mem usage    Increment  Occurrences   Line Contents
   =============================================================
       39    394.6 MiB    394.6 MiB           1   @profile
       40                                         def to_parquet(df: pd.DataFrame, filename: str):
       41    386.2 MiB     -8.4 MiB           1       table = Table.from_pandas(df)
       42    386.2 MiB      0.0 MiB           1       pool = pa.default_memory_pool()
       43    401.2 MiB     15.0 MiB           1       parquet.write_table(table, filename, compression="snappy")
       44    401.2 MiB      0.0 MiB           1       del table
       45    401.2 MiB      0.0 MiB           1       pool.release_unused()
   ```
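
   One way to tell allocator caching apart from a real leak (a sketch, not something I have verified here): force Arrow's system allocator before pyarrow is imported and re-run the profile. If the growth disappears, the memory was being cached by jemalloc/mimalloc rather than leaked.
   ```python
   import os

   # Must be set before pyarrow is first imported; "system" bypasses
   # the jemalloc/mimalloc caches used by the default pool.
   os.environ["ARROW_DEFAULT_MEMORY_POOL"] = "system"

   import pyarrow as pa
   assert pa.default_memory_pool().backend_name == "system"
   ```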
   
   The script to reproduce this issue:
   ```python
   import logging
   import os
   from concurrent.futures import ThreadPoolExecutor

   import dask
   import dask.dataframe as dd
   import pandas as pd
   import pyarrow as pa
   from dask import delayed
   from distributed import WorkerPlugin, Worker, LocalCluster, Client
   from memory_profiler import profile
   from pyarrow import Table, parquet


   class TaskExecutorPool(WorkerPlugin):
       def __init__(self, logger, name):
           self.logger = logger
           self.worker = None
           self.name = name

       def setup(self, worker: Worker):
           # Give each worker a dedicated thread pool under this plugin's name
           executor = ThreadPoolExecutor(max_workers=worker.state.nthreads)
           worker.executors[self.name] = executor
           self.worker = worker


   @profile
   def to_parquet(df: pd.DataFrame, filename: str):
       table = Table.from_pandas(df)
       pool = pa.default_memory_pool()
       parquet.write_table(table, filename, compression="snappy")
       del table
       pool.release_unused()


   def from_parquet(filename: str):
       return pd.read_parquet(filename)


   def main():
       cluster = LocalCluster(n_workers=2, processes=False, silence_logs=logging.DEBUG)
       with Client(cluster) as client:
           client.register_plugin(TaskExecutorPool(logging, "process"), name="process")
           with dask.annotate(executor="process", retries=10):
               # "a parquet file" is a placeholder for the input dataset path
               nodes = dd.read_parquet("a parquet file", columns=["id", "tags"])
               os.makedirs("/opt/project/parquet", exist_ok=True)
               for _ in range(1, 10):  # rewrite the same partitions repeatedly
                   dfs = nodes.to_delayed()
                   filenames = [os.path.join("/opt/project/parquet", f"nodes-{j}.parquet") for j, df in enumerate(dfs)]
                   writes = [delayed(to_parquet)(df, fn) for df, fn in zip(dfs, filenames)]
                   dd.compute(*writes)  # blocks until all writes finish


   if __name__ == "__main__":
       main()
   ```
   

Re: [I] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]

Posted by "kyle-ip (via GitHub)" <gi...@apache.org>.
kyle-ip commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2016866737

   OK. I think it may also be related to the OS environment. For example, my environment is Ubuntu, where the default memory pool is based on jemalloc. To adjust the memory pool's behavior, this documentation is a useful reference: https://github.com/jemalloc/jemalloc/blob/dev/TUNING.md
   
   Tuning it seems to have only a small effect, though.
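
   pyarrow also exposes one jemalloc knob directly. A minimal sketch (only meaningful when the default pool's backend is jemalloc):
   ```python
   import pyarrow as pa

   # Ask jemalloc to return dirty pages to the OS immediately instead of
   # decaying them over its default interval; lowers RSS at some CPU cost.
   pa.jemalloc_set_decay_ms(0)
   print(pa.default_memory_pool().backend_name)  # relevant only if "jemalloc"
   ```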


Re: [I] [Python][Parquet] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]

Posted by "guozhans (via GitHub)" <gi...@apache.org>.
guozhans commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2024456764

   Hi @kyle-ip,
   
   I previously had Arrow 14.0.0 and the 16.0.0 dev version installed in different folders, and I wasn't aware of the old version until that day. I removed Arrow 14.0.0 completely from my Ubuntu Docker image, rebuilt the Arrow C++ source from the main branch with the commands below, and then reinstalled the PyArrow 16.0.0 dev wheel (I know I could build it from source as well, but I am a bit lazy). Everything looks fine now.
   ```shell
   cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
                  -DCMAKE_INSTALL_LIBDIR=lib \
                  -DCMAKE_BUILD_TYPE=Release \
                  -DARROW_BUILD_TESTS=ON \
                  -DARROW_COMPUTE=ON \
                  -DARROW_CSV=ON \
                  -DARROW_DATASET=ON \
                  -DARROW_FILESYSTEM=ON \
                  -DARROW_HDFS=ON \
                  -DARROW_JSON=ON \
                  -DARROW_PARQUET=ON \
                  -DARROW_WITH_BROTLI=ON \
                  -DARROW_WITH_BZ2=ON \
                  -DARROW_WITH_LZ4=ON \
                  -DARROW_WITH_SNAPPY=ON \
                  -DARROW_WITH_ZLIB=ON \
                  -DARROW_WITH_ZSTD=ON \
                  -DPARQUET_REQUIRE_ENCRYPTION=ON \
                  .. \
       && make -j4 \
       && make install
   ```
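
   To double-check that no stale libarrow gets picked up anymore, a quick sketch using pyarrow's build-info attributes (available in recent pyarrow releases):
   ```python
   import pyarrow as pa

   print(pa.__version__)      # version of the Python bindings
   print(pa.cpp_version)      # version of the Arrow C++ library actually loaded
   print(pa.cpp_build_info)   # build details of that C++ library
   ```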
   
   
   Result:
   ```shell
   Line #    Mem usage    Increment  Occurrences   Line Contents
   =============================================================
       29    395.4 MiB    395.4 MiB           1   @profile
       30                                         def to_parquet(df: pd.DataFrame, filename: str):
       31    372.2 MiB    -23.2 MiB           1       table = Table.from_pandas(df)
       32    372.2 MiB      0.0 MiB           1       pool = pa.default_memory_pool()
       33    396.4 MiB     24.2 MiB           1       parquet.write_table(table, filename, compression="snappy")
       34    396.4 MiB      0.0 MiB           1       del table
       35    396.4 MiB      0.0 MiB           1       pool.release_unused()
   ```


Re: [I] [Python][Parquet] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]

Posted by "guozhans (via GitHub)" <gi...@apache.org>.
guozhans closed issue #40738: [Python][Parquet] Memory leak still showed on parquet.write_table and Table.from_pandas
URL: https://github.com/apache/arrow/issues/40738


Re: [I] [Python][Parquet] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]

Posted by "guozhans (via GitHub)" <gi...@apache.org>.
guozhans commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2019899553

   Hi @kyle-ip,
   After you mentioned it, I checked again and found that I had an old Arrow library version installed; I am now fixing the environment issue. That is probably what caused this problem.