Posted to issues@arrow.apache.org by "guozhans (via GitHub)" <gi...@apache.org> on 2024/03/22 09:35:59 UTC
[I] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]
guozhans opened a new issue, #40738:
URL: https://github.com/apache/arrow/issues/40738
### Describe the bug, including details regarding any error messages, version, and platform.
Hi,
I tried to save a pandas DataFrame to Parquet files and encountered a memory leak. It persists even after installing the nightly build pyarrow 16.0.0.dev356 from the server, although a comment on https://github.com/apache/arrow/issues/37989 says this issue has been fixed.
Any idea?
Here is the memory usage reported by memory_profiler:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    33    425.8 MiB    425.8 MiB            1   @profile
    34                                          def to_parquet(self, df: pd.DataFrame, filename: str):
    35    537.6 MiB    111.9 MiB            1       table = Table.from_pandas(df)
    36    559.1 MiB     21.4 MiB            1       parquet.write_table(table, filename, compression="snappy")
    37    559.1 MiB      0.0 MiB            1       del table
    38                                              #df.to_parquet(filename, compression="snappy")
My method
`
import pandas as pd
from memory_profiler import profile
from pyarrow import Table, parquet

# `self` indicates this is a method; the enclosing class is not shown.
@profile
def to_parquet(self, df: pd.DataFrame, filename: str):
    table = Table.from_pandas(df)
    parquet.write_table(table, filename, compression="snappy")
    del table
    # df.to_parquet(filename, compression="snappy")
`
My related installed packages:
numpy 1.22.4
pandas 2.1.4
pyarrow 16.0.0.dev356
pyarrow-hotfix 0.6 --> from dask
dask 2024.2.1
### Component(s)
Parquet, Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [I] [Python][Parquet] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]
Posted by "guozhans (via GitHub)" <gi...@apache.org>.
guozhans commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2024457991
I have closed this issue; see the comment above.
Re: [I] [Python][Parquet] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]
Posted by "kyle-ip (via GitHub)" <gi...@apache.org>.
kyle-ip commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2021748987
@guozhans Great. Which version are you using now?
Re: [I] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]
Posted by "kyle-ip (via GitHub)" <gi...@apache.org>.
kyle-ip commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2016399240
Hi @guozhans,
Have you tried releasing unused memory from the memory pool? See https://arrow.apache.org/docs/python/generated/pyarrow.MemoryPool.html#pyarrow.MemoryPool.release_unused
I encountered a similar issue:
Every time I used **pyarrow.parquet.ParquetDataset** to load Parquet data from S3, memory usage kept increasing and was not released, so I called **release_unused** after the I/O operations:
```python
import pyarrow as pa
pool = pa.default_memory_pool()
# ...
pool.release_unused()
```
However, the occupied memory was not released immediately; it was only freed the next time I ran the operation. It is also unclear how much memory can be released.
Re: [I] [Python][Parquet] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]
Posted by "kyle-ip (via GitHub)" <gi...@apache.org>.
kyle-ip commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2024484407
@guozhans Thank you very much. Your information helps me a lot!
Re: [I] [Python][Parquet] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]
Posted by "kyle-ip-dev (via GitHub)" <gi...@apache.org>.
kyle-ip-dev commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2022650901
@guozhans Great! What version are you using now?
Re: [I] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]
Posted by "guozhans (via GitHub)" <gi...@apache.org>.
guozhans commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2016835221
Hi @kyle-ip,
Thanks for the info. I hadn't tried that, but the issue still occurred after I did. The issue appears in a multi-threaded or multi-process environment. I don't know whether it only works correctly under a single thread. The issue occurred only when I set **n_workers** to more than one.
Result
```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    39    394.6 MiB    394.6 MiB            1   @profile
    40                                          def to_parquet(df: pd.DataFrame, filename: str):
    41    386.2 MiB     -8.4 MiB            1       table = Table.from_pandas(df)
    42    386.2 MiB      0.0 MiB            1       pool = pa.default_memory_pool()
    43    401.2 MiB     15.0 MiB            1       parquet.write_table(table, filename, compression="snappy")
    44    401.2 MiB      0.0 MiB            1       del table
    45    401.2 MiB      0.0 MiB            1       pool.release_unused()
```
The script to reproduce this issue
```python
import logging
import os
from concurrent.futures import ThreadPoolExecutor

import dask
import dask.dataframe as dd
import pandas as pd
import pyarrow as pa
from dask import delayed
from distributed import WorkerPlugin, Worker, LocalCluster, Client, wait
from loky import ProcessPoolExecutor
from memory_profiler import profile
from pyarrow import Table, parquet


class TaskExecutorPool(WorkerPlugin):
    def __init__(self, logger, name):
        self.logger = logger
        self.worker = None
        self.name = name

    def setup(self, worker: Worker):
        executor = ThreadPoolExecutor(max_workers=worker.state.nthreads)
        worker.executors[self.name] = executor
        self.worker = worker


@profile
def to_parquet(df: pd.DataFrame, filename: str):
    table = Table.from_pandas(df)
    pool = pa.default_memory_pool()
    parquet.write_table(table, filename, compression="snappy")
    del table
    pool.release_unused()


def from_parquet(filename: str):
    return pd.read_parquet(filename)


def main():
    cluster = LocalCluster(n_workers=2, processes=False, silence_logs=logging.DEBUG)
    with Client(cluster) as client:
        client.register_plugin(TaskExecutorPool(logging, "process"), name="process")
        with dask.annotate(executor="process", retries=10):
            nodes = dd.read_parquet("a parquet file", columns=["id", "tags"])
            os.makedirs("/opt/project/parquet", exist_ok=True)
            for i in range(1, 10):
                dfs = nodes.to_delayed()
                filenames = [os.path.join("/opt/project/parquet", f"nodes-{i}.parquet") for i, df in enumerate(dfs)]
                writes = [delayed(to_parquet)(df, fn) for df, fn in zip(dfs, filenames)]
                dd.compute(*writes)
                wait(writes)


if __name__ == "__main__":
    main()
```
Re: [I] [Python][Parquet] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]
Posted by "kyle-ip (via GitHub)" <gi...@apache.org>.
kyle-ip commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2022653560
@guozhans
Great! What version are you using now?
Re: [I] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]
Posted by "kyle-ip (via GitHub)" <gi...@apache.org>.
kyle-ip commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2016866737
OK. I think it may also be related to the OS environment. For example, my environment is Ubuntu, where the default memory pool is based on jemalloc. For adjusting the memory pool's behavior, this documentation may be a useful reference: https://github.com/jemalloc/jemalloc/blob/dev/TUNING.md
It seems to have a little effect.
Re: [I] [Python][Parquet] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]
Posted by "guozhans (via GitHub)" <gi...@apache.org>.
guozhans commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2024456764
Hi @kyle-ip,
I previously had Arrow 14.0.0 and the 16.0.0 dev version installed in different folders, and I was not aware of the old version until that day. I removed Arrow 14.0.0 completely from my Ubuntu Docker environment, rebuilt the source from the main branch with these commands, and then reinstalled PyArrow 16.0.0 dev (I know I could build it as well, but I am a bit lazy). Everything looks fine now.
```shell
cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DCMAKE_INSTALL_LIBDIR=lib \
-DCMAKE_BUILD_TYPE=Release \
-DARROW_BUILD_TESTS=ON \
-DARROW_COMPUTE=ON \
-DARROW_CSV=ON \
-DARROW_DATASET=ON \
-DARROW_FILESYSTEM=ON \
-DARROW_HDFS=ON \
-DARROW_JSON=ON \
-DARROW_PARQUET=ON \
-DARROW_WITH_BROTLI=ON \
-DARROW_WITH_BZ2=ON \
-DARROW_WITH_LZ4=ON \
-DARROW_WITH_SNAPPY=ON \
-DARROW_WITH_ZLIB=ON \
-DARROW_WITH_ZSTD=ON \
-DPARQUET_REQUIRE_ENCRYPTION=ON \
.. \
&& make -j4 \
&& make install
```
Result:
```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    29    395.4 MiB    395.4 MiB            1   @profile
    30                                          def to_parquet(df: pd.DataFrame, filename: str):
    31    372.2 MiB    -23.2 MiB            1       table = Table.from_pandas(df)
    32    372.2 MiB      0.0 MiB            1       pool = pa.default_memory_pool()
    33    396.4 MiB     24.2 MiB            1       parquet.write_table(table, filename, compression="snappy")
    34    396.4 MiB      0.0 MiB            1       del table
    35    396.4 MiB      0.0 MiB            1       pool.release_unused()
```
Re: [I] [Python][Parquet] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]
Posted by "guozhans (via GitHub)" <gi...@apache.org>.
guozhans closed issue #40738: [Python][Parquet] Memory leak still showed on parquet.write_table and Table.from_pandas
URL: https://github.com/apache/arrow/issues/40738
Re: [I] [Python][Parquet] Memory leak still showed on parquet.write_table and Table.from_pandas [arrow]
Posted by "guozhans (via GitHub)" <gi...@apache.org>.
guozhans commented on issue #40738:
URL: https://github.com/apache/arrow/issues/40738#issuecomment-2019899553
Hi @kyle-ip ,
After you mentioned it, I checked again and found that I have an old Arrow library version installed. I am now fixing the environment; that might be the cause of this issue.
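A stale Arrow library like this can be spotted by scanning library directories for `libarrow.so.*` files with different ABI numbers. The following is a rough sketch using only the standard library; the function names and the example directories are assumptions for illustration, and it relies on the Linux naming convention where the number after `.so.` reflects the release (for example `libarrow.so.1400` for Arrow 14.0).

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches e.g. "libarrow.so.1400" or "libarrow.so.1400.0.0" and captures "1400".
_LIB_PATTERN = re.compile(r"^libarrow\.so\.(\d+)")


def find_arrow_libs(search_dirs):
    """Map each libarrow ABI number to the directories that contain it."""
    found = defaultdict(set)
    for directory in search_dirs:
        root = Path(directory)
        if not root.is_dir():
            continue  # skip directories that do not exist on this machine
        for path in root.glob("libarrow.so*"):
            match = _LIB_PATTERN.match(path.name)
            if match:
                found[match.group(1)].add(str(root))
    return {version: sorted(dirs) for version, dirs in found.items()}


def has_conflicting_versions(search_dirs):
    """True when more than one libarrow ABI number is present."""
    return len(find_arrow_libs(search_dirs)) > 1


if __name__ == "__main__":
    # Example directories only; adjust to the actual install prefixes.
    candidates = ["/usr/lib", "/usr/local/lib", "/usr/lib/x86_64-linux-gnu"]
    print(find_arrow_libs(candidates))
    print("conflict:", has_conflicting_versions(candidates))
```

The directories bundled with the Python package can be added to the scan via `pyarrow.get_library_dirs()`, which helps compare a system-wide install against the one PyArrow actually ships with.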