Posted to dev@arrow.apache.org by Bipin Mathew <bi...@gmail.com> on 2018/09/28 21:28:54 UTC

Help with zero-copy conversion of pyarrow table to pandas dataframe.

Hello Everyone,

     I am just getting my feet wet with Apache Arrow and I am running into
a bug or, more likely, simply misunderstanding the pyarrow API. I wrote out
a four-column, million-row Arrow table to shared memory and I am attempting
to read it into a pandas dataframe. It is advertised that this is possible
in a zero-copy manner; however, when I run the to_pandas() method on the
table I imported into pyarrow, my memory usage increases, indicating that
it did not actually do a zero-copy conversion. Here is my code:

import pyarrow as pa
import pandas as pd
import numpy as np
import time

start = time.time()
mm = pa.memory_map('/dev/shm/arrow_table')
b = mm.read_buffer()
reader = pa.RecordBatchStreamReader(b)
z = reader.read_all()
print("reading time: " + str(time.time() - start))

start = time.time()
df = z.to_pandas(zero_copy_only=True, use_threads=True)
print("conversion time: " + str(time.time() - start))
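For anyone trying to reproduce this: the file was produced by a separate
writer process. A minimal sketch along these lines generates a comparable
file (the column names, dtypes, and row count here are placeholders, not my
exact data):

import numpy as np
import pyarrow as pa

n = 1000000
# Placeholder table: four float64 columns of random data.
table = pa.Table.from_arrays(
    [pa.array(np.random.randn(n)) for _ in range(4)],
    names=['c0', 'c1', 'c2', 'c3'])

# Serialize the table as an IPC stream into a file under /dev/shm.
sink = pa.OSFile('/dev/shm/arrow_table', 'wb')
writer = pa.RecordBatchStreamWriter(sink, table.schema)
writer.write_table(table)
writer.close()
sink.close()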


What am I doing wrong here? Or am I simply misunderstanding what is meant
by zero-copy in this context? My frantic Google efforts turned up only this
possibly relevant issue, and it was unclear to me how it was resolved:

https://github.com/apache/arrow/issues/1649

I am using pyarrow 0.10.0.

Regards,

Bipin

Re: Help with zero-copy conversion of pyarrow table to pandas dataframe.

Posted by Bipin Mathew <bi...@gmail.com>.
Good Morning Everyone,

      I have not yet had an opportunity to write a reproducible test case
for this issue, but I am hoping the generous people here can help me with a
more general question: how, fundamentally, are we expected to copy, or
indeed directly write, an Arrow table to shared memory using the C++ SDK?
Currently, I have an implementation like this:

 77   std::shared_ptr<arrow::Buffer> B;
 78   std::shared_ptr<arrow::io::BufferOutputStream> buffer;
 79   std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
 80   arrow::MemoryPool* pool = arrow::default_memory_pool();
 81   arrow::io::BufferOutputStream::Create(4096, pool, &buffer);
 82   std::shared_ptr<arrow::Table> table;
 83   karrow::ArrowHandle *h;
 84   h = (karrow::ArrowHandle *)Kj(khandle);
 85   table = h->table;
 86
 87   arrow::ipc::RecordBatchStreamWriter::Open(buffer.get(), table->schema(), &writer);
 88   writer->WriteTable(*table);
 89   writer->Close();
 90   buffer->Finish(&B);
 91
 92   // printf("Investigate Memory usage.");
 93   // getchar();
 94
 95
 96   std::shared_ptr<arrow::io::MemoryMappedFile> mm;
 97   arrow::io::MemoryMappedFile::Create("/dev/shm/arrow_table", B->size(), &mm);
 98   mm->Write(B->data(), B->size());
 99   mm->Close();


"table" on line 85 is a shared_ptr to a arrow::Table object. As you can see
there, I write to an arrow:Buffer then write that to a memory mapped file.
Is there a more direct approach? I watched this video of a talk @Wes
McKinney gave here:

https://www.dremio.com/webinars/arrow-c++-roadmap-and-pandas2/

in which a method, arrow::MemoryMappedBuffer, was referenced, but I have
not seen any documentation for it. Has it been deprecated?

Also, as I mentioned, "table" up there is an arrow::Table object that I
build column-wise using the various arrow::[type]Builder classes. Is there
any way to write the original table directly into shared memory, skipping
the intermediate buffer entirely? Any guidance on the proper way to do
these things would be greatly appreciated; a sketch of the pattern I have
in mind follows.
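
Sketching in Python for concreteness (the reading side is Python anyway),
the more direct pattern I have in mind is to open the stream writer on the
memory-mapped file itself instead of on an intermediate buffer. This
assumes an upper bound on the serialized size is known when the mapping is
created, and that a memory-mapped file is accepted as a writer sink; both
are assumptions I have not verified:

import numpy as np
import pyarrow as pa

# Illustrative table; the real one comes from the C++ builders.
table = pa.Table.from_arrays([pa.array(np.arange(10))], names=['x'])

size_upper_bound = 4096  # assumed bound on the serialized stream size
# Create a fixed-size mapping and stream the table straight into it.
mm = pa.create_memory_map('/dev/shm/arrow_table', size_upper_bound)
writer = pa.RecordBatchStreamWriter(mm, table.schema)
writer.write_table(table)
writer.close()
mm.close()

If that holds, the C++ analogue would presumably be opening the
RecordBatchStreamWriter on the arrow::io::MemoryMappedFile directly, since
it is also an output stream.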

Regards,

Bipin

Re: Help with zero-copy conversion of pyarrow table to pandas dataframe.

Posted by Bipin Mathew <bi...@gmail.com>.
Good Evening Abdul, Wes,

     @Abdul, I agree I could probably use Plasma, but I just wanted to get
something up and running quickly for prototyping purposes, and, as @Wes
mentioned, I would probably run into the same thing with Plasma. I managed
to get a little more debugging output. Here is the script I am running:

(karrow) ➜  test git:(master) ✗ cat test.py


import pyarrow as pa
import pandas as pd
import numpy as np
import time
import os
import psutil

process = psutil.Process(os.getpid())
print("Memory before memory_map: " + str(process.memory_info().rss))
mm = pa.memory_map('/dev/shm/arrow_table', 'r')
print("Memory after memory_map: " + str(process.memory_info().rss))

print("Memory before reading_buffer: " + str(process.memory_info().rss))
b = mm.read_buffer()
print("Memory after reading_buffer: " + str(process.memory_info().rss))
print("buffer size: " + str(b.size))

print("Memory before RecordBatchStreamReader: " + str(process.memory_info().rss))
reader = pa.RecordBatchStreamReader(b)
print("Memory after RecordBatchStreamReader: " + str(process.memory_info().rss))
print("Memory before read_all: " + str(process.memory_info().rss))
z = reader.read_all()
print("Memory after read_all: " + str(process.memory_info().rss))
startm = process.memory_info().rss
print("Memory before to_pandas: " + str(startm))
start = time.time()
df = z.to_pandas(zero_copy_only=True)
dt = time.time() - start
df_size = df.memory_usage().sum()
endm = process.memory_info().rss
print("Memory after to_pandas: " + str(endm))
print("Difference in memory usage after call to to_pandas: " + str(endm - startm))
print("Converted %d byte arrow table to %d byte dataframe at a rate of %f bytes/sec"
      % (b.size, df_size, (b.size / dt)))



Here is the output:

(karrow) ➜  test git:(master) ✗ python3 -i ./test.py
Memory before memory_map: 72962048
Memory after memory_map: 72962048
Memory before reading_buffer: 72962048
Memory after reading_buffer: 72962048
buffer size: 2000000476
Memory before RecordBatchStreamReader: 72962048
Memory after RecordBatchStreamReader: 72962048
Memory before read_all: 72962048
Memory after read_all: 72962048
Memory before to_pandas: 72962048
Memory after to_pandas: 4074319872
Difference in memory usage after call to to_pandas: 4001357824
Converted 2000000476 byte arrow table to 2000000080 byte dataframe at a rate of 2615333819.434341 bytes/sec


The Arrow table comprises three columns:

>>> z
pyarrow.Table
int: int32
float: double
bigint: int64
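
As a sanity check on those numbers, the three dtypes above imply 20 bytes
per row, so the measured dataframe size corresponds to roughly 100 million
rows (assuming no nulls and no per-column overhead):

# Back-of-envelope row count from the measured dataframe footprint,
# assuming the three dtypes above and no null bitmaps:
bytes_per_row = 4 + 8 + 8            # int32 + float64 + int64
rows = 2000000080 // bytes_per_row   # -> 100000004, i.e. ~100 million rows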


The output dataframe has these types, which look reasonable:

>>> df.ftypes
int         int32:dense
float     float64:dense
bigint      int64:dense
dtype: object
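
To narrow down where the allocation happens, one thing I may try is
converting a column at a time, continuing from the script above (so z and
process are already defined); whether a single-chunk numeric column
converts without copying is exactly the assumption being tested:

# Convert column by column and watch RSS; a column that converts
# zero-copy should leave RSS essentially unchanged.
for i, name in enumerate(z.schema.names):
    before = process.memory_info().rss
    series = z.column(i).to_pandas()
    after = process.memory_info().rss
    print("%s: RSS grew by %d bytes" % (name, after - before))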


Over the weekend I will try to write a self-contained, reproducible test
case, but I thought this might start to give some insight into whether
there is an issue and, if so, what it may be.

Regards,

Bipin

Re: Help with zero-copy conversion of pyarrow table to pandas dataframe.

Posted by Wes McKinney <we...@gmail.com>.
hi Abdul -- Plasma vs. a memory map on /dev/shm should have the same
semantics re: memory copying, so I don't believe using Plasma will
change the outcome.

- Wes

Re: Help with zero-copy conversion of pyarrow table to pandas dataframe.

Posted by Abdul Rahman <ab...@outlook.com>.
Have you tried using Plasma, which is effectively what you are trying to do?

https://arrow.apache.org/docs/python/plasma.html#using-arrow-and-pandas-with-plasma
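
The flow on that page looks roughly like this (a sketch from memory of the
docs; check the exact API, including the connect() signature, against your
pyarrow version):

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.plasma as plasma

# Connect to a running plasma_store; older releases needed extra
# arguments, e.g. plasma.connect("/tmp/plasma", "", 0).
client = plasma.connect("/tmp/plasma")

df = pd.DataFrame({"x": np.arange(10)})
record_batch = pa.RecordBatch.from_pandas(df)

# First pass: measure the serialized size with a mock sink.
mock_sink = pa.MockOutputStream()
writer = pa.RecordBatchStreamWriter(mock_sink, record_batch.schema)
writer.write_batch(record_batch)
writer.close()
data_size = mock_sink.size()

# Second pass: stream the batch into a buffer owned by the store.
object_id = plasma.ObjectID(np.random.bytes(20))
buf = client.create(object_id, data_size)
stream = pa.FixedSizeBufferWriter(buf)
writer = pa.RecordBatchStreamWriter(stream, record_batch.schema)
writer.write_batch(record_batch)
writer.close()
client.seal(object_id)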



Re: Help with zero-copy conversion of pyarrow table to pandas dataframe.

Posted by Wes McKinney <we...@gmail.com>.
hi Bipin,

There are narrow circumstances where zero-copy pandas deserialization
is possible. Firstly, I noted that we are short on documentation for
Table.to_pandas, so I opened

https://issues.apache.org/jira/browse/ARROW-3356

It's possible there's a bug when zero_copy_only=True -- it is supposed
to raise if any memory allocations are required. Can you give more
information about what you mean by "my memory usage increases"? Did it
increase by the footprint of the underlying memory? A minimal
reproducible example would help us investigate further.
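
To illustrate the narrow circumstances, here is a sketch of my
understanding (the availability of Array.to_numpy and the exact exception
types have varied across versions, so treat this as an assumption to
verify): a primitive-typed array with no nulls can hand its buffer to
numpy as a view, while a conversion that needs an allocation should raise
rather than silently copy.

import numpy as np
import pyarrow as pa

# No nulls, primitive type: to_numpy() can return a view over the
# existing Arrow buffer without allocating.
arr = pa.array(np.arange(1000000, dtype=np.int64))
view = arr.to_numpy()

# Nulls force a copy (masked values need a fresh allocation), so the
# zero-copy-only path should raise instead of silently copying.
arr_with_nulls = pa.array([1, 2, None])
try:
    arr_with_nulls.to_numpy()
except (pa.ArrowInvalid, NotImplementedError) as e:
    # The exact exception type has differed between releases.
    print("zero-copy not possible:", e)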

Thanks,
Wes