Posted to issues@arrow.apache.org by "taotao li (Jira)" <ji...@apache.org> on 2019/11/01 09:50:00 UTC
[jira] [Updated] (ARROW-7043) pyarrow.csv.read_csv, memory consumed much larger than raw pandas.read_csv
[ https://issues.apache.org/jira/browse/ARROW-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
taotao li updated ARROW-7043:
-----------------------------
Description:
Hi, first of all, thanks a lot for building Arrow. I found this project through Wes's post: [https://wesmckinney.com/blog/apache-arrow-pandas-internals/]
His ambition to build Arrow to fix the problems in pandas really caught my eye.
Below is my problem:
background:
* Our team's analytic work relies deeply on pandas; we often read large csv files into memory and do various kinds of analytic work.
* We have run into the problems mentioned in Wes's post, especially the `pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset`.
* We are looking for techniques that let us load our csv (or another format, like msgpack, parquet, or something else) using as little memory as possible; one idea we are considering is sketched right after this list.
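For example, one direction we are considering (just a sketch; the file and column names are made up) is converting the csv to parquet once, then having each analysis read only the columns it needs:
{code:python}
import pyarrow.csv
import pyarrow.parquet as pq

# one-time conversion: csv -> parquet (file names here are hypothetical)
table = pyarrow.csv.read_csv("data.csv")
pq.write_table(table, "data.parquet")

# later runs read only the columns they need, which should keep
# memory usage well below the size of the full table
subset = pq.read_table("data.parquet", columns=["col_a", "col_b"])
df = subset.to_pandas(){code}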
experiment:
* Luckily I found Arrow, and I ran a simple test.
* input file: a 1.5GB csv file, around 6 million records, 15 columns;
* using pandas as below, which consumes about *1GB memory* (how I measured is sketched after this list):
{code:python}
import pandas as pd
df = pd.read_csv(filename){code}
* using pyarrow as below, which consumes about *3.6GB memory*, which really confuses me:
{code:python}
import pyarrow
import pyarrow.csv
table = pyarrow.csv.read_csv(filename){code}
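For reference, this is roughly how I measured the numbers above (a sketch, assuming psutil is installed; I simply watch the process RSS before and after the read):
{code:python}
import os

import psutil
import pyarrow.csv

proc = psutil.Process(os.getpid())
rss_before = proc.memory_info().rss

# filename points at the 1.5GB csv described above
table = pyarrow.csv.read_csv(filename)

rss_after = proc.memory_info().rss
print("RSS grew by %.2f GB" % ((rss_after - rss_before) / 1024 ** 3)){code}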
problems:
* Why does pyarrow need so much memory to read just 1.5GB of csv data? It really disappoints me.
* Also, while pyarrow is reading the file, all 8 cores of my CPU are fully used; a sketch of what I plan to try next follows this list.
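In case it is useful, here is what I plan to try next, based on the pyarrow docs (a sketch, untested on my side): reading single-threaded via ReadOptions, and asking Arrow's memory pool how much it has actually allocated:
{code:python}
import pyarrow as pa
import pyarrow.csv

# single-threaded read, to check whether the parallel reader
# is what drives the CPU (and perhaps the memory) usage
opts = pyarrow.csv.ReadOptions(use_threads=False)
table = pyarrow.csv.read_csv(filename, read_options=opts)

# bytes currently held by Arrow's default memory pool, as opposed
# to the RSS the OS reports for the whole process
print(pa.total_allocated_bytes()){code}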
environments:
* Ubuntu 16
* Python 3.5, IPython 6.5
* pandas 0.20
* pyarrow 0.15
* server: 8 cores, 16 GB RAM
Thanks again.
If needed, I can upload my 1.5GB file later.
> pyarrow.csv.read_csv, memory consumed much larger than raw pandas.read_csv
> --------------------------------------------------------------------------
>
> Key: ARROW-7043
> URL: https://issues.apache.org/jira/browse/ARROW-7043
> Project: Apache Arrow
> Issue Type: Test
> Components: Python
> Affects Versions: 0.15.0
> Reporter: taotao li
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)