Posted to jira@arrow.apache.org by "Ashish Gupta (Jira)" <ji...@apache.org> on 2020/09/11 10:25:00 UTC
[jira] [Created] (ARROW-9974) pyarrow version 1.0.1 throws Out Of Memory exception while reading large number of files using ParquetDataset (works fine with version 0.13)
Ashish Gupta created ARROW-9974:
-----------------------------------
Summary: pyarrow version 1.0.1 throws Out Of Memory exception while reading large number of files using ParquetDataset (works fine with version 0.13)
Key: ARROW-9974
URL: https://issues.apache.org/jira/browse/ARROW-9974
Project: Apache Arrow
Issue Type: Bug
Reporter: Ashish Gupta
[https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
I have a dataframe split and stored in more than 5000 files. I use ParquetDataset(fnames).read() to load all files. After updating pyarrow from 0.13.0 to the latest version, 1.0.1, it has started throwing "OSError: Out of memory: malloc of size 131072 failed". The same code on the same machine still works with the older version. My machine has 256 GB of memory, far more than enough to load the data, which requires < 10 GB. You can use the code below to reproduce the issue on your side.
# create a big dataframe
import numpy as np
import pandas as pd

n = 50000000
df = pd.DataFrame({'A': np.arange(n)})
for col in ['F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8']:
    df[col] = np.random.randn(n) * 100
for col in ['F9', 'F10', 'F11']:
    df[col] = 'ABCDEFGH'
for col in ['F12', 'F13', 'F14']:
    df[col] = 'ABCDEFGH01234'
for col in ['F15', 'F16', 'F17']:
    df[col] = 'ABCDEFGH01234567'
# split and save data to 5000 files
for i in range(5000):
    df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)
# use a fresh session to read data
# below code works to read
import pandas as pd
df = []
for i in range(5000):
    df.append(pd.read_parquet(f'{i}.parquet'))
df = pd.concat(df)
# below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine with version 0.13.0)
# tried use_legacy_dataset=False, same issue
import pyarrow.parquet as pq
fnames = []
for i in range(5000):
    fnames.append(f'{i}.parquet')
len(fnames)
df = pq.ParquetDataset(fnames).read(use_threads=False)
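A possible workaround (an untested sketch, not something I have verified against this issue) is to read each file individually with pyarrow and concatenate the resulting tables, bypassing ParquetDataset entirely:

import pyarrow as pa
import pyarrow.parquet as pq

# read every file on its own and combine the tables in memory
tables = [pq.read_table(f'{i}.parquet', use_threads=False) for i in range(5000)]
df = pa.concat_tables(tables).to_pandas()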