Posted to dev@arrow.apache.org by Jorge Cardoso Leitão <jo...@gmail.com> on 2022/04/04 14:39:26 UTC

[Question] Is it possible to write to IPC without an intermediary buffer?

Hi,

Motivated by [1], I wonder if it is possible to write to IPC without
writing the data to an intermediary buffer.

The challenge is that the header of an IPC message [header][data] requires:

* the positions of the buffers
* the total length of the body

For uncompressed data, we could compute these beforehand in `O(C)`, where C
is the number of columns. However, I am unable to find a way of computing
these ahead of writing for compressed buffers: we need to compress the data
to know its compressed size (and thus the size of each buffer).
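
For the uncompressed case, the computation described above can be sketched
roughly as follows. This is only an illustration: `align8` and `plan_body`
are hypothetical names, and the sketch assumes that every buffer in the body
is padded to a multiple of 8 bytes. It only looks at buffer lengths, never at
the data itself.

```python
def align8(n: int) -> int:
    """Round n up to the next multiple of 8 (assumed padding rule)."""
    return (n + 7) & ~7


def plan_body(buffer_lengths):
    """Return ([(offset, length), ...], body_length) from lengths alone."""
    locations = []
    offset = 0
    for length in buffer_lengths:
        locations.append((offset, length))
        # The next buffer starts after this one plus its padding.
        offset += align8(length)
    return locations, offset


locations, body_length = plan_body([10, 16, 3])
print(locations)    # [(0, 10), (16, 16), (32, 3)]
print(body_length)  # 40
```

With compression, the lengths passed to `plan_body` are not known until each
buffer has actually been compressed, which is the problem raised here.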

Is this understanding correct?

Best,
Jorge

[1] https://github.com/pola-rs/polars/issues/2639

Re: [Question] Is it possible to write to IPC without an intermediary buffer?

Posted by Micah Kornfield <em...@gmail.com>.

> As far as I understand, this would make writing to IPC require O(N) of
> intermediary memory, where N is the average size of a buffer, as opposed to
> O(N*B), where B is the number of buffers. I.e. there is still quite a
> multiplicative factor involved.


Small nit, but this could theoretically have O(1) size requirements,
depending on the compression library, since the same seeking behavior could
be used to go back and store the necessary byte lengths after compressing
the data.

Unfortunately, this solution doesn't work if the data is actually being
consumed as a stream, unless there is more coordination between producer and
consumer.
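
A minimal sketch of this point, assuming a compression library with a
streaming interface (Python's `zlib.compressobj` is used here purely as an
example). The layout `[int64 length][compressed bytes]` and the function
`write_compressed` are hypothetical and are not the Arrow IPC format. Only
one input chunk and its compressed output are held in memory at a time, and
the length is patched in afterwards by seeking back.

```python
import io
import struct
import zlib


def write_compressed(writer, chunks):
    """Write [int64 length][compressed bytes]; return the compressed length."""
    if not writer.seekable():
        # A pure stream cannot be patched after the fact.
        raise io.UnsupportedOperation("writer must support seeking")
    start = writer.tell()
    writer.write(struct.pack("<q", 0))  # zeroed placeholder for the length
    compressor = zlib.compressobj()
    written = 0
    for chunk in chunks:
        data = compressor.compress(chunk)
        writer.write(data)
        written += len(data)
    tail = compressor.flush()
    writer.write(tail)
    written += len(tail)
    end = writer.tell()
    writer.seek(start)
    writer.write(struct.pack("<q", written))  # store the now-known length
    writer.seek(end)
    return written


out = io.BytesIO()
chunks = [b"abc" * 100, b"def" * 100]
compressed_length = write_compressed(out, iter(chunks))
```

The `seekable()` check reflects the caveat above: if the consumer reads the
bytes as they are produced, there is nothing left to seek back to.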

On Tue, Apr 5, 2022 at 2:50 AM Jorge Cardoso Leitão <
jorgecarleitao@gmail.com> wrote:

> Hi Micah,
>
> Thank you for your reply. That is also my understanding - not possible in
> streaming IPC, possible in file IPC with random access. The pseudo-code
> could be something like:
>
> start = writer.seek_current()                  # where the header begins
> empty_locations = create_empty_header(schema)  # zeroed placeholder locations
> write_header(writer, empty_locations)          # reserve the header's bytes
> locations = write_buffers(writer, batch)       # compress and write the buffers
> end_buffers_position = writer.seek_current()
> writer.seek(start)                             # go back to the header
> write_header(writer, locations)                # overwrite the placeholders
> writer.seek(end_buffers_position)              # resume after the body
>
> As far as I understand, this would make writing to IPC require O(N) of
> intermediary memory, where N is the average size of a buffer, as opposed to
> O(N*B), where B is the number of buffers. I.e. there is still quite a
> multiplicative factor involved.
>
> I filed https://issues.apache.org/jira/browse/ARROW-16118 with the idea.
>
> Best,
> Jorge

Re: [Question] Is it possible to write to IPC without an intermediary buffer?

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.

Hi Micah,

Thank you for your reply. That is also my understanding - not possible in
streaming IPC, possible in file IPC with random access. The pseudo-code
could be something like:

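start = writer.seek_current()                  # where the header begins
empty_locations = create_empty_header(schema)  # zeroed placeholder locations
write_header(writer, empty_locations)          # reserve the header's bytes
locations = write_buffers(writer, batch)       # compress and write the buffers
end_buffers_position = writer.seek_current()
writer.seek(start)                             # go back to the header
write_header(writer, locations)                # overwrite the placeholders
writer.seek(end_buffers_position)              # resume after the body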
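
The pseudo-code above could be made concrete along these lines. This is a
sketch under loud assumptions: the header here is a toy layout of
fixed-width little-endian int64 values (`body_length`, then an
`(offset, length)` pair per buffer), not the real Arrow IPC Flatbuffers
header, `zlib` stands in for whichever codec is used, and `write_batch` /
`read_batch` are hypothetical names.

```python
import io
import struct
import zlib


def header_format(n_buffers):
    # body_length followed by (offset, length) per buffer, all int64.
    return "<" + "q" * (1 + 2 * n_buffers)


def write_batch(writer, buffers):
    fmt = header_format(len(buffers))
    start = writer.tell()
    # 1. Reserve the header with zeroed placeholders.
    writer.write(struct.pack(fmt, *([0] * (1 + 2 * len(buffers)))))
    body_start = writer.tell()
    # 2. Compress and write one buffer at a time, recording its location.
    locations = []
    for raw in buffers:
        compressed = zlib.compress(raw)
        locations.append((writer.tell() - body_start, len(compressed)))
        writer.write(compressed)
    end = writer.tell()
    # 3. Seek back and overwrite the placeholders with the real values.
    flat = [value for location in locations for value in location]
    writer.seek(start)
    writer.write(struct.pack(fmt, end - body_start, *flat))
    # 4. Return to the end of the body so the next message can follow.
    writer.seek(end)
    return locations


def read_batch(reader, n_buffers):
    fmt = header_format(n_buffers)
    header = struct.unpack(fmt, reader.read(struct.calcsize(fmt)))
    body = reader.read(header[0])
    pairs = zip(header[1::2], header[2::2])
    return [zlib.decompress(body[off:off + length]) for off, length in pairs]


out = io.BytesIO()
original = [b"a" * 1000, b"hello world"]
write_batch(out, original)
out.seek(0)
roundtrip = read_batch(out, len(original))
```

Only one compressed buffer is alive at any time in `write_batch`, which is
where the reduction from all buffers to a single buffer comes from.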

As far as I understand, this would make writing to IPC require O(N) of
intermediary memory, where N is the average size of a buffer, as opposed to
O(N*B), where B is the number of buffers. I.e. there is still quite a
multiplicative factor involved.

I filed https://issues.apache.org/jira/browse/ARROW-16118 with the idea.

Best,
Jorge



On Mon, Apr 4, 2022 at 6:09 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Jorge,
> I don't think any implementation does this but I think it is technically
> possible, although it might be complicated to actually do.  It also
> requires random access files (the output can't be purely streaming).
>
> I think the approach you would need to take is to pre-write the header
> information with the values zeroed out at first. After you've compressed
> and written the physical bytes, you would need to update the values in
> place, once you know them. Since Flatbuffers doesn't do any variable length
> encoding, you don't need to worry about possibly corrupting the data. The
> challenging part is determining the exact locations that need to be
> overwritten.
>
> -Micah

Re: [Question] Is it possible to write to IPC without an intermediary buffer?

Posted by Micah Kornfield <em...@gmail.com>.

Hi Jorge,
I don't think any implementation does this but I think it is technically
possible, although it might be complicated to actually do.  It also
requires random access files (the output can't be purely streaming).

I think the approach you would need to take is to pre-write the header
information with the values zeroed out at first. After you've compressed
and written the physical bytes, you would need to update the values in
place, once you know them. Since Flatbuffers doesn't do any variable length
encoding, you don't need to worry about possibly corrupting the data. The
challenging part is determining the exact locations that need to be
overwritten.
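
To illustrate why fixed-width encoding makes the in-place update safe, here
is a small sketch using Python's `struct` module rather than Flatbuffers
itself. The message layout and `encode_varint` are hypothetical; the varint
(unsigned LEB128) is shown only as a contrast, since it is the kind of
encoding that would make patching unsafe.

```python
import struct


def encode_varint(value: int) -> bytes:
    """Unsigned LEB128: the encoded size depends on the value."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


# Fixed width: a zeroed int64 placeholder and the final value both occupy
# exactly 8 bytes, so overwriting it cannot shift the bytes that follow.
message = bytearray(struct.pack("<qq", 0, 0) + b"payload")
size_before = len(message)
struct.pack_into("<q", message, 8, 123_456)  # patch the second field in place
size_after = len(message)

# Variable width: the placeholder and the final value have different sizes,
# so a later overwrite would not fit in the reserved space.
placeholder_size = len(encode_varint(0))
final_size = len(encode_varint(123_456))
```

The remaining difficulty is the one noted above: knowing that the field to
patch sits at byte offset 8 is trivial in this toy layout, but has to be
worked out from the serialized table layout in a real Flatbuffers message.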

-Micah
