Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2020/12/18 00:51:01 UTC

[jira] [Commented] (ARROW-9974) [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while reading large number of files using ParquetDataset

    [ https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251419#comment-17251419 ] 

Weston Pace commented on ARROW-9974:
------------------------------------

I attempted to reproduce this on CentOS 8 and was not successful.  I used an Amazon EC2 m4.large instance with the following AMI image (https://aws.amazon.com/marketplace/pp/B08KYLN2CG).

Since the instance only has 8 GB of RAM, I used the smaller dataset example you posted.

Ironically, the `read_works` method failed at `pd.concat` due to an out-of-memory error (a legitimate one, as pandas needed an additional 2.1 GB that was not available).

The `read_errors` method succeeded with `use_legacy_dataset` set to either True or False.
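
For reference, this is roughly how I exercised both code paths; treat it as a sketch of my test run rather than an exact transcript (the file names follow your generate() script, and `use_legacy_dataset` is the keyword already exposed on pq.ParquetDataset in 1.0.1):

{code}
import pyarrow.parquet as pq

fnames = [f'{i}.parquet' for i in range(5000)]

# New (datasets-based) implementation
table_new = pq.ParquetDataset(fnames, use_legacy_dataset=False).read(use_threads=False)

# Legacy implementation
table_old = pq.ParquetDataset(fnames, use_legacy_dataset=True).read(use_threads=False)
{code}

Both calls completed without error on this instance.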

It appears the core file you generated ran into some kind of 2 GB limit.  Since you have 256 GB on the machine, your core file could be quite large.  Try following the advice here (https://stackoverflow.com/questions/43341954/is-2g-the-limit-size-of-coredump-file-on-linux) to see if you are able to make any further progress.
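
If it is easier, you can also inspect and raise the core-dump limit from inside the reproducing Python process itself; here is a minimal sketch using only the standard library `resource` module (the limit values on your machine will of course differ):

{code}
import resource

# Current core-dump size limits in bytes: (soft, hard); -1 means unlimited.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print(f"core dump limit: soft={soft}, hard={hard}")

# Raise the soft limit up to the hard limit before triggering the crash
# (equivalent to `ulimit -c unlimited` in the shell when the hard limit
# is already unlimited).
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
{code}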

 

==Details of the test machine==

 

[centos@ip-172-30-0-34 ~]$ cat /etc/redhat-release 
CentOS Linux release 8.2.2004 (Core) 
[centos@ip-172-30-0-34 ~]$ uname -a
Linux ip-172-30-0-34.ec2.internal 4.18.0-193.19.1.el8_2.x86_64 #1 SMP Mon Sep 14 14:37:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[centos@ip-172-30-0-34 ~]$ cat /etc/os-release 
NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="8"

[centos@ip-172-30-0-34 ~]$ python3 -mpip freeze
asn1crypto==0.24.0
Babel==2.5.1
cffi==1.11.5
chardet==3.0.4
cloud-init==19.4
configobj==5.0.6
cryptography==2.3
dbus-python==1.2.4
decorator==4.2.1
gpg==1.10.0
idna==2.5
Jinja2==2.10.1
jsonpatch==1.21
jsonpointer==1.10
jsonschema==2.6.0
MarkupSafe==0.23
netifaces==0.10.6
numpy==1.19.4
oauthlib==2.1.0
pandas==1.1.5
pciutils==2.3.6
perf==0.1
ply==3.9
prettytable==0.7.2
pyarrow==1.0.1
pycairo==1.16.3
pycparser==2.14
pygobject==3.28.3
PyJWT==1.6.1
pyOpenSSL==18.0.0
pyserial==3.1.1
PySocks==1.6.8
python-dateutil==2.8.1
python-dmidecode==3.12.2
python-linux-procfs==0.6
pytz==2017.2
pyudev==0.21.0
PyYAML==3.12
requests==2.20.0
rhnlib==2.8.6
rpm==4.14.2
schedutils==0.6
selinux==2.9
sepolicy==1.1
setools==4.2.2
setroubleshoot==1.1
six==1.11.0
slip==0.6.4
slip.dbus==0.6.4
syspurpose==1.26.20
systemd-python==234
urllib3==1.24.2

> [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while reading large number of files using ParquetDataset
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9974
>                 URL: https://issues.apache.org/jira/browse/ARROW-9974
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Ashish Gupta
>            Assignee: Weston Pace
>            Priority: Critical
>              Labels: dataset
>             Fix For: 3.0.0
>
>         Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use ParquetDataset(fnames).read() to load all the files. After updating pyarrow from 0.13.0 to the latest version, 1.0.1, it has started throwing "OSError: Out of memory: malloc of size 131072 failed". The same code on the same machine still works with the older version. My machine has 256 GB of memory, far more than enough to load the data, which requires < 10 GB. You can use the code below to reproduce the issue on your side.
> {code}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> def generate():
>     # create a big dataframe
>     df = pd.DataFrame({'A': np.arange(50000000)})
>     df['F1'] = np.random.randn(50000000) * 100
>     df['F2'] = np.random.randn(50000000) * 100
>     df['F3'] = np.random.randn(50000000) * 100
>     df['F4'] = np.random.randn(50000000) * 100
>     df['F5'] = np.random.randn(50000000) * 100
>     df['F6'] = np.random.randn(50000000) * 100
>     df['F7'] = np.random.randn(50000000) * 100
>     df['F8'] = np.random.randn(50000000) * 100
>     df['F9'] = 'ABCDEFGH'
>     df['F10'] = 'ABCDEFGH'
>     df['F11'] = 'ABCDEFGH'
>     df['F12'] = 'ABCDEFGH01234'
>     df['F13'] = 'ABCDEFGH01234'
>     df['F14'] = 'ABCDEFGH01234'
>     df['F15'] = 'ABCDEFGH01234567'
>     df['F16'] = 'ABCDEFGH01234567'
>     df['F17'] = 'ABCDEFGH01234567'
>     # split and save data to 5000 files
>     for i in range(5000):
>         df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)
> def read_works():
>     # below code works to read
>     df = []
>     for i in range(5000):
>         df.append(pd.read_parquet(f'{i}.parquet'))
>     df = pd.concat(df)
> def read_errors():
>     # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine with version 0.13.0)
>     # tried use_legacy_dataset=False, same issue
>     fnames = []
>     for i in range(5000):
>         fnames.append(f'{i}.parquet')
>     len(fnames)
>     df = pq.ParquetDataset(fnames).read(use_threads=False)
>  
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)