Posted to jira@arrow.apache.org by "Luis E Pastrana (Jira)" <ji...@apache.org> on 2022/04/28 12:14:00 UTC

[jira] [Updated] (ARROW-16391) pd.read_parquet using filters consumes too much memory

     [ https://issues.apache.org/jira/browse/ARROW-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luis E Pastrana updated ARROW-16391:
------------------------------------
    Summary: pd.read_parquet using filters consumes too much memory  (was: pd.read_parquet using filters consuming too much memory)

> pd.read_parquet using filters consumes too much memory
> ------------------------------------------------------
>
>                 Key: ARROW-16391
>                 URL: https://issues.apache.org/jira/browse/ARROW-16391
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 4.0.1, 7.0.0
>         Environment:  Hardware Overview:
>       Model Name: MacBook Pro
>       Model Identifier: MacBookPro12,1
>       Processor Name: Dual-Core Intel Core i5
>       Processor Speed: 2.7 GHz
>       Number of Processors: 1
>       Total Number of Cores: 2
>       L2 Cache (per Core): 256 KB
>       L3 Cache: 3 MB
>       Hyper-Threading Technology: Enabled
>       Memory: 8 GB
>  System Software Overview:
>       System Version: macOS 10.15.7 (19H1217)
>       Kernel Version: Darwin 19.6.0
>       Boot Volume: Macintosh HD
>       Boot Mode: Normal
>            Reporter: Luis E Pastrana
>            Priority: Major
>
> Hello!
>  
> I have found that pyarrow versions *>= 4.0.1* use more than *2x* the memory (RSS) of *1.0.1* when reading a Parquet file with filters. Using the following dataset:
> {code:python}
> import pandas as pd
> import numpy as np
> a = np.random.randint(1,50,(4_000_000,4))
> pd.DataFrame(a, columns=['A','B','C','D']).to_parquet("test.pq", index=False) {code}
> and the reader script (*read_with_filters.py*)
> {code:python}
> import pyarrow as pa
> import pandas as pd
> print(f"pyarrow version: {pa.__version__}")
> print(f"pandas version: {pd.__version__}")
> tmp = pd.read_parquet("test.pq", engine='pyarrow', use_legacy_dataset=False, filters=[("B","=",10)])
> print(tmp.shape) {code}
> I get:
>  
> *Python 3.8.13 (conda), pyarrow 1.0.1 (pip), pandas 1.4.2 (pip)*
> {code:bash}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
> pyarrow version: 1.0.1
> pandas version: 1.4.2
> (81833, 4)
> RSS (Kb): 84876 | user (sec): 0.87 | system (sec): 0.32 | real (sec) : 1.32{code}
> *Python 3.8.13 (conda), pyarrow 4.0.1 (pip), pandas 1.4.2 (pip)*
> {code:bash}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
> pyarrow version: 4.0.1
> pandas version: 1.4.2
> (81833, 4)
> RSS (Kb): 172816 | user (sec): 0.77 | system (sec): 0.24 | real (sec) : 0.72 {code}
> *Python 3.8.13 (conda), pyarrow 7.0.0 (pip), pandas 1.4.2 (pip)*
> {code:bash}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
> pyarrow version: 7.0.0
> pandas version: 1.4.2
> (81833, 4)
> RSS (Kb): 240112 | user (sec): 0.71 | system (sec): 0.22 | real (sec) : 0.82 {code}
>  
> The gap is even more evident with larger datasets; however, my personal computer hangs when trying to read them with *4.0.1* and *7.0.0*. (That should be a separate issue.)
>  
> It is worth mentioning that memory usage is roughly the same across versions when the *filters* keyword is removed:
> *Python 3.8.13 (conda), pyarrow 1.0.1 (pip), pandas 1.4.2 (pip)*
> {code:bash}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
> pyarrow version: 1.0.1
> pandas version: 1.4.2
> (4000000, 4)
> RSS (Kb): 331424 | user (sec): 0.89 | system (sec): 0.39 | real (sec) : 1.07 {code}
> *Python 3.8.13 (conda), pyarrow 4.0.1 (pip), pandas 1.4.2 (pip)*
> {code:bash}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
> pyarrow version: 4.0.1
> pandas version: 1.4.2
> (4000000, 4)
> RSS (Kb): 405916 | user (sec): 0.81 | system (sec): 0.42 | real (sec) : 0.81 {code}
> *Python 3.8.13 (conda), pyarrow 7.0.0 (pip), pandas 1.4.2 (pip)*
> {code:bash}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
> pyarrow version: 7.0.0
> pandas version: 1.4.2
> (4000000, 4)
> RSS (Kb): 364152 | user (sec): 0.78 | system (sec): 0.45 | real (sec) : 1.27 {code}
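> (Aside: for anyone reproducing this without gtime, peak RSS can also be read from Python's stdlib resource module. Note the unit differs by platform: ru_maxrss is bytes on macOS but kilobytes on Linux.)

```python
import resource
import sys

# Peak resident set size of the current process so far.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
unit = "bytes" if sys.platform == "darwin" else "KB"
print(f"peak RSS: {peak} {unit}")
```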
> Thank you,
> Luis



--
This message was sent by Atlassian Jira
(v8.20.7#820007)