Posted to dev@arrow.apache.org by "taotao li (Jira)" <ji...@apache.org> on 2019/11/01 09:44:00 UTC

[jira] [Created] (ARROW-7043) pyarrow.csv.read_csv, memory consumed much larger than raw pandas.read_csv

taotao li created ARROW-7043:
--------------------------------

             Summary: pyarrow.csv.read_csv, memory consumed much larger than raw pandas.read_csv
                 Key: ARROW-7043
                 URL: https://issues.apache.org/jira/browse/ARROW-7043
             Project: Apache Arrow
          Issue Type: Test
          Components: Python
    Affects Versions: 0.15.0
            Reporter: taotao li


Hi, first of all, thanks for building Arrow. I found this project through Wes's post: [https://wesmckinney.com/blog/apache-arrow-pandas-internals/]

His ambition to fix pandas' problems by building Arrow really caught my eye.

Below is my problem:

background:
 * Our team's analytic work relies heavily on pandas; we often read large CSV files into memory and run all kinds of analyses.
 * We have run into the problems mentioned in Wes's post, especially the `pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset`.
 * We are looking for techniques that let us load our CSV (or another format, such as msgpack or parquet) using as little memory as possible.

 

experiment:
 * Luckily I found Arrow, and I ran a simple test.
 * input file: a 1.5 GB CSV file, around 6 million records, 15 columns;
 * using pandas, as below:
{code:python}
import pandas as pd
{code}


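For completeness, here is a minimal sketch of the kind of comparison I mean (psutil is assumed to be available for reading the process RSS, and "input.csv" is a placeholder path for the actual file):

{code:python}
import os

import psutil  # assumed available; used only to read the process RSS
import pandas as pd
from pyarrow import csv

def rss_mb():
    # Resident set size of the current process, in megabytes.
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

base = rss_mb()
df = pd.read_csv("input.csv")  # placeholder for the ~1.5 GB, ~6 million row, 15 column file
print("pandas.read_csv:      +%.0f MB" % (rss_mb() - base))

base = rss_mb()
table = csv.read_csv("input.csv")  # returns a pyarrow.Table
print("pyarrow.csv.read_csv: +%.0f MB" % (rss_mb() - base))
{code}

Note that pyarrow.csv.read_csv reads with multiple threads by default, and pyarrow.total_allocated_bytes() reports Arrow's own allocations, which can differ from the RSS delta the OS reports.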


--
This message was sent by Atlassian Jira
(v8.3.4#803005)