Posted to dev@arrow.apache.org by Jayjeet Chakraborty <ja...@gmail.com> on 2021/06/10 18:42:50 UTC

Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

Hello Arrow Community,

I am a student working on a project where I need to serialize an in-memory Arrow Table of around 700 MB to a uint8_t* buffer. I am currently using the arrow::ipc::RecordBatchStreamWriter API to serialize the table to an arrow::Buffer, but it takes nearly 1000 ms to serialize the whole table, which hurts my performance-critical application. I basically want to get hold of the underlying memory of the table as bytes and send it over the network. How do you suggest I tackle this problem? I was thinking of using the C Data interface, converting my arrow::Table to ArrowArray and ArrowSchema and serializing those structs to send over the network, but it seems like serializing the structs is a complex problem of its own. It would be great to have some suggestions on this. Thanks a lot.
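
For reference, my serialization path is roughly the following (a simplified sketch, not my exact code; the function name is just for illustration):

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

// Stream-write the whole Table into an in-memory sink, then take the
// resulting contiguous arrow::Buffer (buffer->data() is the uint8_t*).
// MakeStreamWriter is the factory for the RecordBatchStreamWriter path.
arrow::Result<std::shared_ptr<arrow::Buffer>> SerializeTable(
    const std::shared_ptr<arrow::Table>& table) {
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::BufferOutputStream::Create());
  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::MakeStreamWriter(sink, table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(writer->Close());
  return sink->Finish();
}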

Best,
Jayjeet


Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

Posted by Jayjeet Chakraborty <ja...@gmail.com>.
Thanks for the perf experiments, Weston!

On 2021/06/14 20:24:07, Weston Pace <we...@gmail.com> wrote: 
> Returning to the main thread...
> 
> From: jayjeetchakraborty25@gmail.com
> 
> > Hi Wes, Gosh, Weston,
> >
> > Sorry if you are receiving this message redundantly, but I tried sending it via ponymail twice and it didn't go through for some reason. Anyway, thanks a lot for the valuable discussion. I experimented a little with the pre-allocation strategy on my end; although it helped, I was not able to reproduce the results that Weston got. I used a 1.3+ GB Table (containing NYC taxi data). I serialized it to an arrow::Buffer once dynamically and once using a 1.4 GB preallocated arrow::BufferOutputStream. I used the CMake flags here [2] to build Arrow from source.
> >
> > My results are as follows:
> >
> > pre-allocated (1400 * 1024 * 1024 bytes): ~860ms
> >
> > dynamically allocated (starting from 4096 bytes): ~2300ms
> >
> > I saw a >12x improvement in Weston's experiments, and since I am only getting a ~3x improvement, I am wondering what I am doing wrong on my end. I am sharing my benchmark code here [1]. It would be great if someone could take a look at it (mainly the Serialize function). Looking forward to hearing back from you. Thanks again.
> >
> > Best regards,
> > Jayjeet  Chakraborty
> 
> From: gosharz@gmail.com
> 
> > Hi Jayjeet,
> >
> > Abstracting from the real data shape, the code looks reasonable to me at first glance from the Arrow point of view. However, I'd assume that kind of measurement should be done in a loop (as in Google Benchmark, etc.). Also, as mentioned on the main mail thread (at least I heard no concerns about that :) ), you might be observing some lazy page warm-up effects.
> >
> > At this point I'd suggest profiling that code as the stack trace will probably tell you much better where the problem is.
> > Also probably we should try again to return to the main thread for public record and expanded discussion.
> >
> >  Cheers,
> > Gosh
> 
> Lazy page warm-up effects are exactly what I suspect the culprit to be.
> Keep in mind that these can be hard to measure as they don't happen at
> allocation time but actually happen when you first write to the memory
> (only at that point does Linux allocate some real RAM).  As a fun
> experiment, running your program in a loop, I get this perf output
> (measuring cycles) from the first iteration...
> 
>   48.98%  serialize  libc-2.31.so         [.] __memmove_sse2_unaligned_erms
>    9.14%  serialize  [kernel.kallsyms]    [k] clear_page_erms
>    5.60%  serialize  [kernel.kallsyms]    [k] native_irq_return_iret
>    3.91%  serialize  libc-2.31.so         [.] __memset_avx2_erms
>    3.90%  serialize  [kernel.kallsyms]    [k] sync_regs
>    1.98%  serialize  [kernel.kallsyms]    [k] rmqueue
>    1.80%  serialize  [kernel.kallsyms]    [k] __pagevec_lru_add_fn
>    1.41%  serialize  [kernel.kallsyms]    [k] handle_mm_fault
>    1.34%  serialize  [kernel.kallsyms]    [k] __handle_mm_fault
>    1.32%  serialize  [kernel.kallsyms]    [k] get_mem_cgroup_from_mm
>    1.26%  serialize  [kernel.kallsyms]    [k] try_charge
>    0.99%  serialize  [kernel.kallsyms]    [k] do_anonymous_page
> 
> __memmove_sse2_unaligned_erms is inside of memcpy and it is the actual
> workhorse in this example. Notice that it is only taking about 50% of
> the available cycles.  The remaining time is spent in various
> functions that appear to be page allocation (page fault, assign page,
> clear page).  Also, this first iteration has ~3 billion total cycles.
> 
> On the third iteration...
> 
>   98.40%  serialize  libc-2.31.so         [.] __memmove_sse2_unaligned_erms
>    0.44%  :79205     [kernel.kallsyms]    [k] __mod_zone_page_state
>    0.44%  :79205     [kernel.kallsyms]    [k] zap_pte_range.isra.0
>    0.40%  serialize  libc-2.31.so         [.] __memset_avx2_erms
>    0.17%  serialize  libarrow.so.500.0.0  [.] arrow::StringArray::~StringArray
>    0.16%  serialize  [kernel.kallsyms]    [k] cpuacct_account_field
> 
> I only get about 700 million total cycles and the time is dominated by memcpy.
> 
> 
> 
> On Thu, Jun 10, 2021 at 1:21 PM Gosh Arzumanyan <go...@gmail.com> wrote:
> >
> > This might help to get the size of the output buffer upfront:
> > https://github.com/apache/arrow/blob/1830d1558be8741e7412f6af30582ff457f0f34f/cpp/src/arrow/io/memory.h#L96
> >
> > Though with "standard" allocators there is a risk of running into
> > KiPageFaults when going for buffers over 1mb. This might be especially
> > painful in multithreaded environment.
> >
> > A custom outputstream with configurable buffering parameter might help to
> > overcome that problem without dealing too much with the allocators.
> > Curious to hear community thoughts on this.
> >
> > Cheers,
> > Gosh
> >
> > On Fri., 11 Jun. 2021, 00:45 Wes McKinney, <we...@gmail.com> wrote:
> >
> > > From this, it seems like seeding the RecordBatchStreamWriter's output
> > > stream with a much larger preallocated buffer would improve
> > > performance (depends on the allocator used of course).
> > >
> > > On Thu, Jun 10, 2021 at 5:40 PM Weston Pace <we...@gmail.com> wrote:
> > > >
> > > > Just for some reference times from my system I created a quick test to
> > > > dump a ~1.7GB table to buffer(s).
> > > >
> > > > Going to many buffers (just collecting the buffers): ~11,000ns
> > > > Going to one preallocated buffer: ~160,000,000ns
> > > > Going to one dynamically allocated buffer (using a grow factor of 2x):
> > > > ~2,000,000,000ns
> > > >
> > > > On Thu, Jun 10, 2021 at 11:46 AM Wes McKinney <we...@gmail.com>
> > > wrote:
> > > > >
> > > > > To be clear, we would like to help make this faster. I don't recall
> > > > > much effort being invested in optimizing this code path in the last
> > > > > couple of years, so there may be some low hanging fruit to improve the
> > > > > performance. Changing the in-memory data layout (the chunking) is one
> > > > > of the most likely things to help.
> > > > >
> > > > > On Thu, Jun 10, 2021 at 2:14 PM Gosh Arzumanyan <go...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > Hi Jayjeet,
> > > > > >
> > > > > > I wonder if you really need to serialize the whole table into a
> > > single
> > > > > > buffer as you will end up with twice the memory while you could be
> > > sending
> > > > > > chunks as they are generated by the  RecordBatchStreamWriter. Also
> > > is the
> > > > > > buffer resized beforehand? I'd suspect there might be relocations
> > > happening
> > > > > > under the hood.
> > > > > >
> > > > > >
> > > > > > Cheers,
> > > > > > Gosh
> > > > > >
> > > > > > On Thu., 10 Jun. 2021, 21:01 Wes McKinney, <we...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > hi Jayjeet — have you run prof to see where those 1000ms are being
> > > > > > > spent? How many arrays (the sum of the number of chunks across all
> > > > > > > columns) in total are there? I would guess that the problem is all
> > > the
> > > > > > > little Buffer memcopies. I don't think that the C Interface is
> > > going
> > > > > > > to help you.
> > > > > > >
> > > > > > > - Wes
> > > > > > >
> > > > > > > On Thu, Jun 10, 2021 at 1:48 PM Jayjeet Chakraborty
> > > > > > > <ja...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hello Arrow Community,
> > > > > > > >
> > > > > > > > I am a student working on a project where I need to serialize an
> > > > > > > in-memory Arrow Table of size around 700MB to a uint8_t* buffer. I
> > > am
> > > > > > > currently using the arrow::ipc::RecordBatchStreamWriter API to
> > > serialize
> > > > > > > the table to a arrow::Buffer, but it is taking nearly 1000ms to
> > > serialize
> > > > > > > the whole table, and that is harming the performance of my
> > > > > > > performance-critical application. I basically want to get hold of
> > > the
> > > > > > > underlying memory of the table as bytes and send it over the
> > > network. How
> > > > > > > do you suggest I tackle this problem? I was thinking of using the
> > > C Data
> > > > > > > interface for this, so that I convert my arrow::Table to
> > > ArrowArray and
> > > > > > > ArrowSchema and serialize the structs to send them over the
> > > network, but
> > > > > > > seems like serializing structs is another complex problem on its
> > > own.  It
> > > > > > > will be great to have some suggestions on this. Thanks a lot.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Jayjeet
> > > > > > > >
> > > > > > >
> > >
> 

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

Posted by Weston Pace <we...@gmail.com>.
Returning to the main thread...

From: jayjeetchakraborty25@gmail.com

> Hi Wes, Gosh, Weston,
>
> Sorry if you are receiving this message redundantly, but I tried sending it via ponymail twice and it didn't go through for some reason. Anyway, thanks a lot for the valuable discussion. I experimented a little with the pre-allocation strategy on my end; although it helped, I was not able to reproduce the results that Weston got. I used a 1.3+ GB Table (containing NYC taxi data). I serialized it to an arrow::Buffer once dynamically and once using a 1.4 GB preallocated arrow::BufferOutputStream. I used the CMake flags here [2] to build Arrow from source.
>
> My results are as follows:
>
> pre-allocated (1400 * 1024 * 1024 bytes): ~860ms
>
> dynamically allocated (starting from 4096 bytes): ~2300ms
>
> I saw a >12x improvement in Weston's experiments, and since I am only getting a ~3x improvement, I am wondering what I am doing wrong on my end. I am sharing my benchmark code here [1]. It would be great if someone could take a look at it (mainly the Serialize function). Looking forward to hearing back from you. Thanks again.
>
> Best regards,
> Jayjeet  Chakraborty

From: gosharz@gmail.com

> Hi Jayjeet,
>
> Abstracting from the real data shape, the code looks reasonable to me at first glance from the Arrow point of view. However, I'd assume that kind of measurement should be done in a loop (as in Google Benchmark, etc.). Also, as mentioned on the main mail thread (at least I heard no concerns about that :) ), you might be observing some lazy page warm-up effects.
>
> At this point I'd suggest profiling that code as the stack trace will probably tell you much better where the problem is.
> Also probably we should try again to return to the main thread for public record and expanded discussion.
>
>  Cheers,
> Gosh

Lazy page warm-up effects are exactly what I suspect the culprit to be.
Keep in mind that these can be hard to measure as they don't happen at
allocation time but actually happen when you first write to the memory
(only at that point does Linux allocate some real RAM).  As a fun
experiment, running your program in a loop, I get this perf output
(measuring cycles) from the first iteration...

  48.98%  serialize  libc-2.31.so         [.] __memmove_sse2_unaligned_erms
   9.14%  serialize  [kernel.kallsyms]    [k] clear_page_erms
   5.60%  serialize  [kernel.kallsyms]    [k] native_irq_return_iret
   3.91%  serialize  libc-2.31.so         [.] __memset_avx2_erms
   3.90%  serialize  [kernel.kallsyms]    [k] sync_regs
   1.98%  serialize  [kernel.kallsyms]    [k] rmqueue
   1.80%  serialize  [kernel.kallsyms]    [k] __pagevec_lru_add_fn
   1.41%  serialize  [kernel.kallsyms]    [k] handle_mm_fault
   1.34%  serialize  [kernel.kallsyms]    [k] __handle_mm_fault
   1.32%  serialize  [kernel.kallsyms]    [k] get_mem_cgroup_from_mm
   1.26%  serialize  [kernel.kallsyms]    [k] try_charge
   0.99%  serialize  [kernel.kallsyms]    [k] do_anonymous_page

__memmove_sse2_unaligned_erms is inside of memcpy and it is the actual
workhorse in this example. Notice that it is only taking about 50% of
the available cycles.  The remaining time is spent in various
functions that appear to be page allocation (page fault, assign page,
clear page).  Also, this first iteration has ~3 billion total cycles.

On the third iteration...

  98.40%  serialize  libc-2.31.so         [.] __memmove_sse2_unaligned_erms
   0.44%  :79205     [kernel.kallsyms]    [k] __mod_zone_page_state
   0.44%  :79205     [kernel.kallsyms]    [k] zap_pte_range.isra.0
   0.40%  serialize  libc-2.31.so         [.] __memset_avx2_erms
   0.17%  serialize  libarrow.so.500.0.0  [.] arrow::StringArray::~StringArray
   0.16%  serialize  [kernel.kallsyms]    [k] cpuacct_account_field

I only get about 700 million total cycles and the time is dominated by memcpy.
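
(For completeness, one way to take that first-touch cost out of the measurement is to touch the output pages before the timed region. This is only a sketch, assuming the BufferOutputStream constructor that wraps an existing ResizableBuffer; whether the pool really hands back untouched pages depends on the allocator in use.)

#include <cstring>
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

// Sketch: allocate the output up front and memset it so every page is
// faulted in before timing; the timed serialization then sees mostly memcpy.
arrow::Result<std::shared_ptr<arrow::Buffer>> SerializeWarm(
    const std::shared_ptr<arrow::Table>& table, int64_t capacity) {
  ARROW_ASSIGN_OR_RAISE(auto buf, arrow::AllocateResizableBuffer(capacity));
  std::memset(buf->mutable_data(), 0, static_cast<size_t>(buf->size()));
  std::shared_ptr<arrow::ResizableBuffer> shared = std::move(buf);
  auto sink = std::make_shared<arrow::io::BufferOutputStream>(shared);
  // ---- start timing here ----
  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::MakeStreamWriter(sink, table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(writer->Close());
  return sink->Finish();
}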



On Thu, Jun 10, 2021 at 1:21 PM Gosh Arzumanyan <go...@gmail.com> wrote:
>
> This might help to get the size of the output buffer upfront:
> https://github.com/apache/arrow/blob/1830d1558be8741e7412f6af30582ff457f0f34f/cpp/src/arrow/io/memory.h#L96
>
> Though with "standard" allocators there is a risk of running into
> KiPageFaults when going for buffers over 1mb. This might be especially
> painful in multithreaded environment.
>
> A custom outputstream with configurable buffering parameter might help to
> overcome that problem without dealing too much with the allocators.
> Curious to hear community thoughts on this.
>
> Cheers,
> Gosh
>
> On Fri., 11 Jun. 2021, 00:45 Wes McKinney, <we...@gmail.com> wrote:
>
> > From this, it seems like seeding the RecordBatchStreamWriter's output
> > stream with a much larger preallocated buffer would improve
> > performance (depends on the allocator used of course).
> >
> > On Thu, Jun 10, 2021 at 5:40 PM Weston Pace <we...@gmail.com> wrote:
> > >
> > > Just for some reference times from my system I created a quick test to
> > > dump a ~1.7GB table to buffer(s).
> > >
> > > Going to many buffers (just collecting the buffers): ~11,000ns
> > > Going to one preallocated buffer: ~160,000,000ns
> > > Going to one dynamically allocated buffer (using a grow factor of 2x):
> > > ~2,000,000,000ns
> > >
> > > On Thu, Jun 10, 2021 at 11:46 AM Wes McKinney <we...@gmail.com>
> > wrote:
> > > >
> > > > To be clear, we would like to help make this faster. I don't recall
> > > > much effort being invested in optimizing this code path in the last
> > > > couple of years, so there may be some low hanging fruit to improve the
> > > > performance. Changing the in-memory data layout (the chunking) is one
> > > > of the most likely things to help.
> > > >
> > > > On Thu, Jun 10, 2021 at 2:14 PM Gosh Arzumanyan <go...@gmail.com>
> > wrote:
> > > > >
> > > > > Hi Jayjeet,
> > > > >
> > > > > I wonder if you really need to serialize the whole table into a
> > single
> > > > > buffer as you will end up with twice the memory while you could be
> > sending
> > > > > chunks as they are generated by the  RecordBatchStreamWriter. Also
> > is the
> > > > > buffer resized beforehand? I'd suspect there might be relocations
> > happening
> > > > > under the hood.
> > > > >
> > > > >
> > > > > Cheers,
> > > > > Gosh
> > > > >
> > > > > On Thu., 10 Jun. 2021, 21:01 Wes McKinney, <we...@gmail.com>
> > wrote:
> > > > >
> > > > > > hi Jayjeet — have you run prof to see where those 1000ms are being
> > > > > > spent? How many arrays (the sum of the number of chunks across all
> > > > > > columns) in total are there? I would guess that the problem is all
> > the
> > > > > > little Buffer memcopies. I don't think that the C Interface is
> > going
> > > > > > to help you.
> > > > > >
> > > > > > - Wes
> > > > > >
> > > > > > On Thu, Jun 10, 2021 at 1:48 PM Jayjeet Chakraborty
> > > > > > <ja...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hello Arrow Community,
> > > > > > >
> > > > > > > I am a student working on a project where I need to serialize an
> > > > > > in-memory Arrow Table of size around 700MB to a uint8_t* buffer. I
> > am
> > > > > > currently using the arrow::ipc::RecordBatchStreamWriter API to
> > serialize
> > > > > > the table to a arrow::Buffer, but it is taking nearly 1000ms to
> > serialize
> > > > > > the whole table, and that is harming the performance of my
> > > > > > performance-critical application. I basically want to get hold of
> > the
> > > > > > underlying memory of the table as bytes and send it over the
> > network. How
> > > > > > do you suggest I tackle this problem? I was thinking of using the
> > C Data
> > > > > > interface for this, so that I convert my arrow::Table to
> > ArrowArray and
> > > > > > ArrowSchema and serialize the structs to send them over the
> > network, but
> > > > > > seems like serializing structs is another complex problem on its
> > own.  It
> > > > > > will be great to have some suggestions on this. Thanks a lot.
> > > > > > >
> > > > > > > Best,
> > > > > > > Jayjeet
> > > > > > >
> > > > > >
> >

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

Posted by Gosh Arzumanyan <go...@gmail.com>.
This might help to get the size of the output buffer upfront:
https://github.com/apache/arrow/blob/1830d1558be8741e7412f6af30582ff457f0f34f/cpp/src/arrow/io/memory.h#L96
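
A rough, untested sketch of one way to do that with arrow::io::MockOutputStream, a sink that counts bytes without storing them (I'm assuming that's roughly what the linked line points at); the extra pass still walks the table, but copies no data:

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

// Sketch: do a counting pass with MockOutputStream, then use the result to
// preallocate the real output at exactly the serialized size.
arrow::Result<int64_t> SerializedSize(const std::shared_ptr<arrow::Table>& table) {
  auto counter = std::make_shared<arrow::io::MockOutputStream>();
  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::MakeStreamWriter(counter, table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(writer->Close());
  return counter->GetExtentBytesWritten();
}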

Though with "standard" allocators there is a risk of running into
KiPageFaults when going for buffers over 1 MB. This might be especially
painful in a multithreaded environment.

A custom output stream with a configurable buffering parameter might help to
overcome that problem without dealing too much with the allocators.
Curious to hear community thoughts on this.

Cheers,
Gosh

On Fri., 11 Jun. 2021, 00:45 Wes McKinney, <we...@gmail.com> wrote:

> From this, it seems like seeding the RecordBatchStreamWriter's output
> stream with a much larger preallocated buffer would improve
> performance (depends on the allocator used of course).
>
> On Thu, Jun 10, 2021 at 5:40 PM Weston Pace <we...@gmail.com> wrote:
> >
> > Just for some reference times from my system I created a quick test to
> > dump a ~1.7GB table to buffer(s).
> >
> > Going to many buffers (just collecting the buffers): ~11,000ns
> > Going to one preallocated buffer: ~160,000,000ns
> > Going to one dynamically allocated buffer (using a grow factor of 2x):
> > ~2,000,000,000ns
> >
> > On Thu, Jun 10, 2021 at 11:46 AM Wes McKinney <we...@gmail.com>
> wrote:
> > >
> > > To be clear, we would like to help make this faster. I don't recall
> > > much effort being invested in optimizing this code path in the last
> > > couple of years, so there may be some low hanging fruit to improve the
> > > performance. Changing the in-memory data layout (the chunking) is one
> > > of the most likely things to help.
> > >
> > > On Thu, Jun 10, 2021 at 2:14 PM Gosh Arzumanyan <go...@gmail.com>
> wrote:
> > > >
> > > > Hi Jayjeet,
> > > >
> > > > I wonder if you really need to serialize the whole table into a
> single
> > > > buffer as you will end up with twice the memory while you could be
> sending
> > > > chunks as they are generated by the  RecordBatchStreamWriter. Also
> is the
> > > > buffer resized beforehand? I'd suspect there might be relocations
> happening
> > > > under the hood.
> > > >
> > > >
> > > > Cheers,
> > > > Gosh
> > > >
> > > > On Thu., 10 Jun. 2021, 21:01 Wes McKinney, <we...@gmail.com>
> wrote:
> > > >
> > > > > hi Jayjeet — have you run prof to see where those 1000ms are being
> > > > > spent? How many arrays (the sum of the number of chunks across all
> > > > > columns) in total are there? I would guess that the problem is all
> the
> > > > > little Buffer memcopies. I don't think that the C Interface is
> going
> > > > > to help you.
> > > > >
> > > > > - Wes
> > > > >
> > > > > On Thu, Jun 10, 2021 at 1:48 PM Jayjeet Chakraborty
> > > > > <ja...@gmail.com> wrote:
> > > > > >
> > > > > > Hello Arrow Community,
> > > > > >
> > > > > > I am a student working on a project where I need to serialize an
> > > > > in-memory Arrow Table of size around 700MB to a uint8_t* buffer. I
> am
> > > > > currently using the arrow::ipc::RecordBatchStreamWriter API to
> serialize
> > > > > the table to a arrow::Buffer, but it is taking nearly 1000ms to
> serialize
> > > > > the whole table, and that is harming the performance of my
> > > > > performance-critical application. I basically want to get hold of
> the
> > > > > underlying memory of the table as bytes and send it over the
> network. How
> > > > > do you suggest I tackle this problem? I was thinking of using the
> C Data
> > > > > interface for this, so that I convert my arrow::Table to
> ArrowArray and
> > > > > ArrowSchema and serialize the structs to send them over the
> network, but
> > > > > seems like serializing structs is another complex problem on its
> own.  It
> > > > > will be great to have some suggestions on this. Thanks a lot.
> > > > > >
> > > > > > Best,
> > > > > > Jayjeet
> > > > > >
> > > > >
>

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

Posted by Wes McKinney <we...@gmail.com>.
From this, it seems like seeding the RecordBatchStreamWriter's output
stream with a much larger preallocated buffer would improve
performance (depends on the allocator used of course).
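
Something along these lines, for concreteness (an illustrative sketch; the helper name is made up, and the rest of the write path — MakeStreamWriter, WriteTable, Close, Finish — is unchanged from the earlier sketches):

#include <arrow/api.h>
#include <arrow/io/api.h>

// Sketch: seed the sink with a large initial capacity so the writer does
// not repeatedly grow (and relocate) the backing buffer while writing.
// capacity_hint would in practice be sized from the table.
arrow::Result<std::shared_ptr<arrow::io::BufferOutputStream>> MakePreallocatedSink(
    int64_t capacity_hint) {
  return arrow::io::BufferOutputStream::Create(capacity_hint);
}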

On Thu, Jun 10, 2021 at 5:40 PM Weston Pace <we...@gmail.com> wrote:
>
> Just for some reference times from my system I created a quick test to
> dump a ~1.7GB table to buffer(s).
>
> Going to many buffers (just collecting the buffers): ~11,000ns
> Going to one preallocated buffer: ~160,000,000ns
> Going to one dynamically allocated buffer (using a grow factor of 2x):
> ~2,000,000,000ns
>
> On Thu, Jun 10, 2021 at 11:46 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > To be clear, we would like to help make this faster. I don't recall
> > much effort being invested in optimizing this code path in the last
> > couple of years, so there may be some low hanging fruit to improve the
> > performance. Changing the in-memory data layout (the chunking) is one
> > of the most likely things to help.
> >
> > On Thu, Jun 10, 2021 at 2:14 PM Gosh Arzumanyan <go...@gmail.com> wrote:
> > >
> > > Hi Jayjeet,
> > >
> > > I wonder if you really need to serialize the whole table into a single
> > > buffer as you will end up with twice the memory while you could be sending
> > > chunks as they are generated by the  RecordBatchStreamWriter. Also is the
> > > buffer resized beforehand? I'd suspect there might be relocations happening
> > > under the hood.
> > >
> > >
> > > Cheers,
> > > Gosh
> > >
> > > On Thu., 10 Jun. 2021, 21:01 Wes McKinney, <we...@gmail.com> wrote:
> > >
> > > > hi Jayjeet — have you run prof to see where those 1000ms are being
> > > > spent? How many arrays (the sum of the number of chunks across all
> > > > columns) in total are there? I would guess that the problem is all the
> > > > little Buffer memcopies. I don't think that the C Interface is going
> > > > to help you.
> > > >
> > > > - Wes
> > > >
> > > > On Thu, Jun 10, 2021 at 1:48 PM Jayjeet Chakraborty
> > > > <ja...@gmail.com> wrote:
> > > > >
> > > > > Hello Arrow Community,
> > > > >
> > > > > I am a student working on a project where I need to serialize an
> > > > in-memory Arrow Table of size around 700MB to a uint8_t* buffer. I am
> > > > currently using the arrow::ipc::RecordBatchStreamWriter API to serialize
> > > > the table to a arrow::Buffer, but it is taking nearly 1000ms to serialize
> > > > the whole table, and that is harming the performance of my
> > > > performance-critical application. I basically want to get hold of the
> > > > underlying memory of the table as bytes and send it over the network. How
> > > > do you suggest I tackle this problem? I was thinking of using the C Data
> > > > interface for this, so that I convert my arrow::Table to ArrowArray and
> > > > ArrowSchema and serialize the structs to send them over the network, but
> > > > seems like serializing structs is another complex problem on its own.  It
> > > > will be great to have some suggestions on this. Thanks a lot.
> > > > >
> > > > > Best,
> > > > > Jayjeet
> > > > >
> > > >

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

Posted by Weston Pace <we...@gmail.com>.
Just for some reference times from my system, I created a quick test to
dump a ~1.7GB table to buffer(s).

Going to many buffers (just collecting the buffers): ~11,000ns
Going to one preallocated buffer: ~160,000,000ns
Going to one dynamically allocated buffer (using a grow factor of 2x):
~2,000,000,000ns
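
(In case it helps interpret the first number: a sketch of what collecting the buffers without any copy can look like — not the exact test code — simplified to flat, non-nested, non-dictionary columns; some slots may be nullptr, e.g. an absent validity bitmap.)

#include <memory>
#include <vector>
#include <arrow/api.h>

// Sketch: collect shared_ptrs to the underlying buffers of every chunk of
// every column. No bytes are copied, which is why this is orders of
// magnitude faster than producing one contiguous buffer.
std::vector<std::shared_ptr<arrow::Buffer>> CollectBuffers(const arrow::Table& table) {
  std::vector<std::shared_ptr<arrow::Buffer>> out;
  for (int i = 0; i < table.num_columns(); ++i) {
    auto column = table.column(i);
    for (int j = 0; j < column->num_chunks(); ++j) {
      auto data = column->chunk(j)->data();
      out.insert(out.end(), data->buffers.begin(), data->buffers.end());
    }
  }
  return out;
}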

On Thu, Jun 10, 2021 at 11:46 AM Wes McKinney <we...@gmail.com> wrote:
>
> To be clear, we would like to help make this faster. I don't recall
> much effort being invested in optimizing this code path in the last
> couple of years, so there may be some low hanging fruit to improve the
> performance. Changing the in-memory data layout (the chunking) is one
> of the most likely things to help.
>
> On Thu, Jun 10, 2021 at 2:14 PM Gosh Arzumanyan <go...@gmail.com> wrote:
> >
> > Hi Jayjeet,
> >
> > I wonder if you really need to serialize the whole table into a single
> > buffer as you will end up with twice the memory while you could be sending
> > chunks as they are generated by the  RecordBatchStreamWriter. Also is the
> > buffer resized beforehand? I'd suspect there might be relocations happening
> > under the hood.
> >
> >
> > Cheers,
> > Gosh
> >
> > On Thu., 10 Jun. 2021, 21:01 Wes McKinney, <we...@gmail.com> wrote:
> >
> > > hi Jayjeet — have you run prof to see where those 1000ms are being
> > > spent? How many arrays (the sum of the number of chunks across all
> > > columns) in total are there? I would guess that the problem is all the
> > > little Buffer memcopies. I don't think that the C Interface is going
> > > to help you.
> > >
> > > - Wes
> > >
> > > On Thu, Jun 10, 2021 at 1:48 PM Jayjeet Chakraborty
> > > <ja...@gmail.com> wrote:
> > > >
> > > > Hello Arrow Community,
> > > >
> > > > I am a student working on a project where I need to serialize an
> > > in-memory Arrow Table of size around 700MB to a uint8_t* buffer. I am
> > > currently using the arrow::ipc::RecordBatchStreamWriter API to serialize
> > > the table to a arrow::Buffer, but it is taking nearly 1000ms to serialize
> > > the whole table, and that is harming the performance of my
> > > performance-critical application. I basically want to get hold of the
> > > underlying memory of the table as bytes and send it over the network. How
> > > do you suggest I tackle this problem? I was thinking of using the C Data
> > > interface for this, so that I convert my arrow::Table to ArrowArray and
> > > ArrowSchema and serialize the structs to send them over the network, but
> > > seems like serializing structs is another complex problem on its own.  It
> > > will be great to have some suggestions on this. Thanks a lot.
> > > >
> > > > Best,
> > > > Jayjeet
> > > >
> > >

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

Posted by Wes McKinney <we...@gmail.com>.
To be clear, we would like to help make this faster. I don't recall
much effort being invested in optimizing this code path in the last
couple of years, so there may be some low hanging fruit to improve the
performance. Changing the in-memory data layout (the chunking) is one
of the most likely things to help.
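
(As one concrete, hedged option on the chunking point: Table::CombineChunks consolidates each column into at most one chunk before writing, at the cost of one extra copy and roughly double peak memory while it runs.)

#include <arrow/api.h>

// Sketch: consolidate chunks so the IPC writer copies a few large buffers
// instead of many small ones. Note this itself copies the data once.
arrow::Result<std::shared_ptr<arrow::Table>> Consolidate(
    const std::shared_ptr<arrow::Table>& table) {
  return table->CombineChunks(arrow::default_memory_pool());
}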

On Thu, Jun 10, 2021 at 2:14 PM Gosh Arzumanyan <go...@gmail.com> wrote:
>
> Hi Jayjeet,
>
> I wonder if you really need to serialize the whole table into a single
> buffer as you will end up with twice the memory while you could be sending
> chunks as they are generated by the  RecordBatchStreamWriter. Also is the
> buffer resized beforehand? I'd suspect there might be relocations happening
> under the hood.
>
>
> Cheers,
> Gosh
>
> On Thu., 10 Jun. 2021, 21:01 Wes McKinney, <we...@gmail.com> wrote:
>
> > hi Jayjeet — have you run prof to see where those 1000ms are being
> > spent? How many arrays (the sum of the number of chunks across all
> > columns) in total are there? I would guess that the problem is all the
> > little Buffer memcopies. I don't think that the C Interface is going
> > to help you.
> >
> > - Wes
> >
> > On Thu, Jun 10, 2021 at 1:48 PM Jayjeet Chakraborty
> > <ja...@gmail.com> wrote:
> > >
> > > Hello Arrow Community,
> > >
> > > I am a student working on a project where I need to serialize an
> > in-memory Arrow Table of size around 700MB to a uint8_t* buffer. I am
> > currently using the arrow::ipc::RecordBatchStreamWriter API to serialize
> > the table to a arrow::Buffer, but it is taking nearly 1000ms to serialize
> > the whole table, and that is harming the performance of my
> > performance-critical application. I basically want to get hold of the
> > underlying memory of the table as bytes and send it over the network. How
> > do you suggest I tackle this problem? I was thinking of using the C Data
> > interface for this, so that I convert my arrow::Table to ArrowArray and
> > ArrowSchema and serialize the structs to send them over the network, but
> > seems like serializing structs is another complex problem on its own.  It
> > will be great to have some suggestions on this. Thanks a lot.
> > >
> > > Best,
> > > Jayjeet
> > >
> >

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

Posted by Gosh Arzumanyan <go...@gmail.com>.
Hi Jayjeet,

I wonder if you really need to serialize the whole table into a single
buffer, as you will end up with twice the memory, while you could be sending
chunks as they are generated by the RecordBatchStreamWriter. Also, is the
buffer resized beforehand? I'd suspect there might be relocations happening
under the hood.
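
For the chunk-at-a-time idea, an untested sketch (here `sink` stands in for whatever OutputStream fronts the network connection, and the batch size is a placeholder to tune):

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

// Sketch: write record batches to the sink as they are produced instead of
// materializing a second, contiguous copy of the whole table in memory.
arrow::Status StreamTable(const std::shared_ptr<arrow::Table>& table,
                          std::shared_ptr<arrow::io::OutputStream> sink) {
  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::MakeStreamWriter(sink, table->schema()));
  arrow::TableBatchReader reader(*table);
  reader.set_chunksize(64 * 1024);  // rows per emitted batch; tune as needed
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader.ReadNext(&batch));
    if (batch == nullptr) break;  // end of table
    ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
  }
  return writer->Close();
}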


Cheers,
Gosh

On Thu., 10 Jun. 2021, 21:01 Wes McKinney, <we...@gmail.com> wrote:

> hi Jayjeet — have you run prof to see where those 1000ms are being
> spent? How many arrays (the sum of the number of chunks across all
> columns) in total are there? I would guess that the problem is all the
> little Buffer memcopies. I don't think that the C Interface is going
> to help you.
>
> - Wes
>
> On Thu, Jun 10, 2021 at 1:48 PM Jayjeet Chakraborty
> <ja...@gmail.com> wrote:
> >
> > Hello Arrow Community,
> >
> > I am a student working on a project where I need to serialize an
> in-memory Arrow Table of size around 700MB to a uint8_t* buffer. I am
> currently using the arrow::ipc::RecordBatchStreamWriter API to serialize
> the table to a arrow::Buffer, but it is taking nearly 1000ms to serialize
> the whole table, and that is harming the performance of my
> performance-critical application. I basically want to get hold of the
> underlying memory of the table as bytes and send it over the network. How
> do you suggest I tackle this problem? I was thinking of using the C Data
> interface for this, so that I convert my arrow::Table to ArrowArray and
> ArrowSchema and serialize the structs to send them over the network, but
> seems like serializing structs is another complex problem on its own.  It
> will be great to have some suggestions on this. Thanks a lot.
> >
> > Best,
> > Jayjeet
> >
>

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

Posted by Wes McKinney <we...@gmail.com>.
hi Jayjeet — have you run prof to see where those 1000ms are being
spent? How many arrays (the sum of the number of chunks across all
columns) in total are there? I would guess that the problem is all the
little Buffer memcopies. I don't think that the C Interface is going
to help you.
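
(For reference, that count is just the sum of per-column chunk counts, e.g.:)

#include <arrow/api.h>

// Total number of array chunks across all columns of a Table.
int64_t TotalChunks(const arrow::Table& table) {
  int64_t n = 0;
  for (int i = 0; i < table.num_columns(); ++i) {
    n += table.column(i)->num_chunks();
  }
  return n;
}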

- Wes

On Thu, Jun 10, 2021 at 1:48 PM Jayjeet Chakraborty
<ja...@gmail.com> wrote:
>
> Hello Arrow Community,
>
> I am a student working on a project where I need to serialize an in-memory Arrow Table of around 700 MB to a uint8_t* buffer. I am currently using the arrow::ipc::RecordBatchStreamWriter API to serialize the table to an arrow::Buffer, but it takes nearly 1000 ms to serialize the whole table, which hurts my performance-critical application. I basically want to get hold of the underlying memory of the table as bytes and send it over the network. How do you suggest I tackle this problem? I was thinking of using the C Data interface, converting my arrow::Table to ArrowArray and ArrowSchema and serializing those structs to send over the network, but it seems like serializing the structs is a complex problem of its own. It would be great to have some suggestions on this. Thanks a lot.
>
> Best,
> Jayjeet
>