Posted to jira@arrow.apache.org by "Ashish Gupta (Jira)" <ji...@apache.org> on 2020/09/12 15:59:00 UTC

[jira] [Commented] (ARROW-9974) t

    [ https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194764#comment-17194764 ] 

Ashish Gupta commented on ARROW-9974:
-------------------------------------

1) Please find attached the full traceback for both cases: [^legacy_false.txt]

 

2) I started removing columns one by one, and below is the smallest dataframe where adding the last column, F11, causes both scenarios (use_legacy_dataset set to True and to False) to throw the error. Interestingly, use_legacy_dataset=True starts crashing after the first 5 columns. If you have more than 8 GB of RAM you should be able to test it.
{code:python}
    # create a big dataframe
    import pandas as pd
    import numpy as np
    df = pd.DataFrame({'A': np.arange(50000000)})
    df['F1'] = np.random.randn(50000000)
    df['F2'] = np.random.randn(50000000)
    df['F3'] = np.random.randn(50000000)
    df['F4'] = 'ABCDEFGH'
    df['F5'] = 'ABCDEFGH'
    df['F6'] = 'ABCDEFGH'
    df['F7'] = 'ABCDEFGH'
    df['F8'] = 'ABCDEFGH'
    df['F9'] = 'ABCDEFGH'
    df['F10'] = 'ABCDEFGH'
    # df['F11'] = 'ABCDEFGH'

{code}
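To make the reduced case self-contained, here is a sketch of the write/read steps, reusing the 5000-file split and the ParquetDataset call from the original report quoted below. The use_legacy_dataset keyword is an assumption based on the pyarrow 1.0 API discussed in this issue, and df is the dataframe built in the block above.
{code:python}
    # split the reduced dataframe into 5000 parquet files of 10000 rows each,
    # as in the original report quoted below
    for i in range(5000):
        df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)

    # read the files back through both code paths; with the F11 column
    # included, both are reported to fail with the out-of-memory error
    import pyarrow.parquet as pq
    fnames = [f'{i}.parquet' for i in range(5000)]
    pq.ParquetDataset(fnames, use_legacy_dataset=True).read(use_threads=False)
    pq.ParquetDataset(fnames, use_legacy_dataset=False).read(use_threads=False)
{code}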

> t
> -
>
>                 Key: ARROW-9974
>                 URL: https://issues.apache.org/jira/browse/ARROW-9974
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Ashish Gupta
>            Priority: Major
>              Labels: dataset
>         Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use ParquetDataset(fnames).read() to load all the files. I updated pyarrow from 0.13.0 to the latest version, 1.0.1, and it has started throwing "OSError: Out of memory: malloc of size 131072 failed". The same code on the same machine still works with the older version. My machine has 256 GB of memory, way more than enough to load the data, which requires < 10 GB. You can use the code below to reproduce the issue on your side.
> {code}
>     # create a big dataframe
>     import pandas as pd
>     import numpy as np
>     df = pd.DataFrame({'A': np.arange(50000000)})
>     df['F1'] = np.random.randn(50000000) * 100
>     df['F2'] = np.random.randn(50000000) * 100
>     df['F3'] = np.random.randn(50000000) * 100
>     df['F4'] = np.random.randn(50000000) * 100
>     df['F5'] = np.random.randn(50000000) * 100
>     df['F6'] = np.random.randn(50000000) * 100
>     df['F7'] = np.random.randn(50000000) * 100
>     df['F8'] = np.random.randn(50000000) * 100
>     df['F9'] = 'ABCDEFGH'
>     df['F10'] = 'ABCDEFGH'
>     df['F11'] = 'ABCDEFGH'
>     df['F12'] = 'ABCDEFGH01234'
>     df['F13'] = 'ABCDEFGH01234'
>     df['F14'] = 'ABCDEFGH01234'
>     df['F15'] = 'ABCDEFGH01234567'
>     df['F16'] = 'ABCDEFGH01234567'
>     df['F17'] = 'ABCDEFGH01234567'
>     # split and save data to 5000 files
>     for i in range(5000):
>         df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)
>     # use a fresh session to read data
>     # below code works to read
>     import pandas as pd
>     df = []
>     for i in range(5000):
>         df.append(pd.read_parquet(f'{i}.parquet'))
>     df = pd.concat(df)
>     # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine with version 0.13.0)
>     # tried use_legacy_dataset=False, same issue
>     import pyarrow.parquet as pq
>     fnames = []
>     for i in range(5000):
>         fnames.append(f'{i}.parquet')
>     len(fnames)
>     df = pq.ParquetDataset(fnames).read(use_threads=False)
>  
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)