Posted to issues@arrow.apache.org by "taotao li (Jira)" <ji...@apache.org> on 2019/11/01 09:50:00 UTC

[jira] [Updated] (ARROW-7043) pyarrow.csv.read_csv, memory consumed much larger than raw pandas.read_csv

     [ https://issues.apache.org/jira/browse/ARROW-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

taotao li updated ARROW-7043:
-----------------------------
    Description: 
Hi, first of all, thanks a lot for building Arrow. I found this project through Wes's post: [https://wesmckinney.com/blog/apache-arrow-pandas-internals/]

His ambition to build Arrow to fix the problems in pandas really caught my eye.

Below is my problem:

background:
 * Our team's analytic work relies heavily on pandas; we often read large CSV files into memory and do various kinds of analysis.
 * We have run into the problems mentioned in Wes's post, especially the `pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset`.
 * We are looking for techniques that let us load our CSV (or another format, like msgpack, Parquet, or something else) using as little memory as possible (a small sketch of what we have in mind follows this list).
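
A minimal sketch of the kind of workflow we have in mind: convert the CSV once to Parquet, then read back only the columns a given analysis needs. The file and column names below are just placeholders.

{code:python}
import pyarrow.csv
import pyarrow.parquet as pq

# One-off conversion: read the CSV into an Arrow table and store it as Parquet.
table = pyarrow.csv.read_csv("data.csv")
pq.write_table(table, "data.parquet")

# Later analyses read only the columns they need instead of the whole file.
subset = pq.read_table("data.parquet", columns=["col_a", "col_b"])
df = subset.to_pandas()
{code}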

 

experiment:
 * Luckily I found Arrow, and I did a simple test.
 * Input file: a 1.5 GB CSV file, around 6 million records, 15 columns;
 * Using pandas as below consumes about *1 GB of memory*:

{code:python}
import pandas as pd

df = pd.read_csv(filename)
{code}

 * Using pyarrow as below consumes about *3.6 GB of memory*, which really confuses me (a rough measurement sketch follows after this code):

{code:python}
import pyarrow
import pyarrow.csv

table = pyarrow.csv.read_csv(filename)
{code}
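
For reference, a rough sketch of how the two cases can be measured side by side in one process (assuming psutil is installed; process RSS is only a rough proxy for real usage):

{code:python}
import gc

import pandas as pd
import psutil
import pyarrow
import pyarrow.csv

filename = "data.csv"  # placeholder for the 1.5 GB file
proc = psutil.Process()

def rss_mb():
    # Resident set size of the current process, in MB.
    return proc.memory_info().rss / 1024 ** 2

before = rss_mb()
df = pd.read_csv(filename)
print("pandas RSS delta:  %.0f MB" % (rss_mb() - before))

del df
gc.collect()

before = rss_mb()
table = pyarrow.csv.read_csv(filename)
print("pyarrow RSS delta: %.0f MB" % (rss_mb() - before))
print("Arrow allocated:   %.0f MB" % (pyarrow.total_allocated_bytes() / 1024 ** 2))
{code}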
 

problems:
 * Why does pyarrow need so much memory to read just 1.5 GB of CSV data? It really disappoints me.
 * While pyarrow is reading the file, all 8 cores of my CPU are fully used (a sketch of the reader options that look relevant follows this list).
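
A minimal sketch of the reader options that look relevant from the docs (assuming ReadOptions(use_threads=...) and ConvertOptions(auto_dict_encode=...) are available in this pyarrow version; I have not measured how much they actually help):

{code:python}
import pyarrow
import pyarrow.csv

filename = "data.csv"  # placeholder for the 1.5 GB file

# Read on a single thread so only one core is used.
read_opts = pyarrow.csv.ReadOptions(use_threads=False)

# Dictionary-encode repetitive string columns, which can shrink them considerably.
convert_opts = pyarrow.csv.ConvertOptions(auto_dict_encode=True)

table = pyarrow.csv.read_csv(filename,
                             read_options=read_opts,
                             convert_options=convert_opts)
print(table.num_rows, table.num_columns)
print("Arrow allocated: %.0f MB" % (pyarrow.total_allocated_bytes() / 1024 ** 2))
{code}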

 

environment:
 * Ubuntu 16
 * Python 3.5, IPython 6.5
 * pandas 0.20
 * pyarrow 0.15
 * server: 8 cores, 16 GB RAM

 

Many thanks again.

If needed, I can upload my 1.5 GB file later.


> pyarrow.csv.read_csv, memory consumed much larger than raw pandas.read_csv
> --------------------------------------------------------------------------
>
>                 Key: ARROW-7043
>                 URL: https://issues.apache.org/jira/browse/ARROW-7043
>             Project: Apache Arrow
>          Issue Type: Test
>          Components: Python
>    Affects Versions: 0.15.0
>            Reporter: taotao li
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)