Posted to dev@arrow.apache.org by Tewfik Zeghmi <ze...@gmail.com> on 2020/02/14 21:32:43 UTC

Schemaless serialization

Hi,

I have a use case of building a feature store to serve low-latency traffic.
Given a key, we need the ability to save and read a feature vector in a
low-latency key-value store. Serializing an Arrow table with one row takes
1344 bytes, while the same single row serialized with Avro without the
schema uses 236 bytes.

Is it possible to serialize an Arrow table/RecordBatch independently
of the schema? Ideally, we'd like to serialize the schema once rather than
along with every feature key, and then be able to read each RecordBatch back
with that schema.
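
Later replies confirm that pyarrow exposes this; a minimal sketch of the
pattern (the field names, values, and exact byte counts here are
hypothetical, and APIs may vary by pyarrow version):

import pyarrow as pa

# Hypothetical one-row feature vector; names and values are illustrative only.
batch = pa.RecordBatch.from_arrays(
    [pa.array([42], type=pa.int64()),
     pa.array([[0.1, 0.2, 0.3]], type=pa.list_(pa.float32()))],
    ["user_id", "embedding"])

# Serialize the schema once, separately from the data.
schema_buf = batch.schema.serialize()

# Serialize only the record batch message (no schema), one per feature key.
batch_buf = batch.serialize()
print("schema bytes:", schema_buf.size, "batch bytes:", batch_buf.size)

# Read side: recover the schema once, then reuse it for every stored batch.
schema = pa.ipc.read_schema(schema_buf)
restored = pa.ipc.read_record_batch(batch_buf, schema)
assert restored.equals(batch)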

thank you!

Re: Schemaless serialization

Posted by Antoine Pitrou <so...@pitrou.net>.
Hi Tewfik,

It would be good to step back a bit and explain what your data is, and
what the consumer is going to do with it.

Regards

Antoine.


On Fri, 14 Feb 2020 15:08:57 -0800
Tewfik Zeghmi <ze...@gmail.com> wrote:
> Hi Micah,
> 
> The primary language is Python.  I'm hoping that the overhead of the
> metadata is small compared to the schema information.
> 
> thank you!
> 
> On Fri, Feb 14, 2020 at 3:07 PM Micah Kornfield <em...@gmail.com>
> wrote:
> 
> > Hi Tewfik,
> > What language?  It is possible to serialize them separately, but the right
> > hooks might not be exposed in all languages.
> >
> > There is still going to be a higher overhead for single row values in Arrow
> > compared to Avro due to metadata requirements.
> >
> > Thanks,
> > Micah
> >
> > On Fri, Feb 14, 2020 at 1:33 PM Tewfik Zeghmi <ze...@gmail.com> wrote:
> >  
> > > Hi,
> > >
> > > I have a use case of building a feature store to serve low-latency
> > > traffic. Given a key, we need the ability to save and read a feature
> > > vector in a low-latency key-value store. Serializing an Arrow table
> > > with one row takes 1344 bytes, while the same single row serialized
> > > with Avro without the schema uses 236 bytes.
> > >
> > > Is it possible to serialize an Arrow table/RecordBatch independently
> > > of the schema? Ideally, we'd like to serialize the schema once rather
> > > than along with every feature key, and then be able to read each
> > > RecordBatch back with that schema.
> > >
> > > thank you!
> > >  
> >  
> 
> 




Re: Schemaless serialization

Posted by Wes McKinney <we...@gmail.com>.
hi Micah and Tewfik,

The functionality is exposed in Python; see e.g.

https://github.com/apache/arrow/blob/apache-arrow-0.16.0/python/pyarrow/tests/test_ipc.py#L685

As Micah said, very small batches aren't necessarily optimized for
compactness (for example, buffers are padded to multiples of 8 bytes). Give
this a try, though, and see how it works.
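
One way this might be wrapped around a key-value store (the store, key
names, and helper functions below are hypothetical, and exact APIs may vary
by pyarrow version):

import pyarrow as pa

kv = {}  # stand-in for the real low-latency key-value store

def put_features(key, batch, schema_key="feature_schema"):
    # Store the schema once under its own key; store only the batch per feature key.
    if schema_key not in kv:
        kv[schema_key] = batch.schema.serialize().to_pybytes()
    kv[key] = batch.serialize().to_pybytes()

def get_features(key, schema_key="feature_schema"):
    # Reuse the single stored schema to decode any per-key batch message.
    schema = pa.ipc.read_schema(pa.py_buffer(kv[schema_key]))
    return pa.ipc.read_record_batch(pa.py_buffer(kv[key]), schema)

batch = pa.RecordBatch.from_arrays([pa.array([0.25, 0.75])], ["feature"])
put_features("user:42", batch)
assert get_features("user:42").equals(batch)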

Thanks
Wes

On Sun, Feb 16, 2020 at 9:26 AM Micah Kornfield <em...@gmail.com> wrote:
>
> I should note, it isn't necessarily just the extra metadata.  For single
> row values, there is also overhead from padding requirements.  You should
> be able to measure this by looking at the size of the buffer you are using
> before writing any batches to the stream (I believe the schema is written
> eagerly), and subtracting that from the final size.
>
> Looking at the Python documentation I don't see it exposed, but the
> underlying function does exist in C++ [1]. People more familiar with the
> Python bindings may be able to offer more details.
>
> I think for this type of use case it probably makes sense to expose it.
> Want to try to create a patch for it?
>
> Thanks,
> Micah
>
>
> [1]
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.h#L215
>
> On Fri, Feb 14, 2020 at 3:09 PM Tewfik Zeghmi <ze...@gmail.com> wrote:
>
> > Hi Micah,
> >
> > The primary language is Python.  I'm hoping that the overhead of the
> > metadata is small compared to the schema information.
> >
> > thank you!
> >
> > On Fri, Feb 14, 2020 at 3:07 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> >> Hi Tewfik,
> >> What language?  It is possible to serialize them separately, but the right
> >> hooks might not be exposed in all languages.
> >>
> >> There is still going to be a higher overhead for single row values in
> >> Arrow
> >> compared to Avro due to metadata requirements.
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Fri, Feb 14, 2020 at 1:33 PM Tewfik Zeghmi <ze...@gmail.com> wrote:
> >>
> >> > Hi,
> >> >
> >> > I have a use case of building a feature store to serve low-latency
> >> > traffic. Given a key, we need the ability to save and read a feature
> >> > vector in a low-latency key-value store. Serializing an Arrow table
> >> > with one row takes 1344 bytes, while the same single row serialized
> >> > with Avro without the schema uses 236 bytes.
> >> >
> >> > Is it possible to serialize an Arrow table/RecordBatch independently
> >> > of the schema? Ideally, we'd like to serialize the schema once rather
> >> > than along with every feature key, and then be able to read each
> >> > RecordBatch back with that schema.
> >> >
> >> > thank you!
> >> >
> >>
> >
> >
> > --
> > Taleb Tewfik Zeghmi
> >

Re: Schemaless serialization

Posted by Micah Kornfield <em...@gmail.com>.
I should note, it isn't necessarily just the extra metadata.  For single
row values, there is also overhead from padding requirements.  You should
be able to measure this by looking at the size of the buffer you are using
before writing any batches to the stream (I believe the schema is written
eagerly), and subtracting that from the final size.
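
A rough sketch of that measurement in Python, assuming the stream writer
emits the schema message as soon as it is created (the batch below is a
hypothetical single-row example; exact byte counts will differ):

import pyarrow as pa

# Hypothetical one-row batch, just to attribute bytes to schema vs. batch.
batch = pa.RecordBatch.from_arrays([pa.array([1.5])], ["feature"])

sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, batch.schema)
schema_bytes = sink.tell()              # bytes written so far: the schema message
writer.write_batch(batch)
batch_bytes = sink.tell() - schema_bytes
writer.close()                          # also appends an end-of-stream marker

print("schema:", schema_bytes, "batch:", batch_bytes,
      "total:", sink.getvalue().size)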

Looking at the Python documentation I don't see it exposed, but the
underlying function does exist in C++ [1]. People more familiar with the
Python bindings may be able to offer more details.

I think for this type of use case it probably makes sense to expose it.
Want to try to create a patch for it?

Thanks,
Micah


[1]
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.h#L215

On Fri, Feb 14, 2020 at 3:09 PM Tewfik Zeghmi <ze...@gmail.com> wrote:

> Hi Micah,
>
> The primary language is Python.  I'm hoping that the overhead of the
> metadata is small compared to the schema information.
>
> thank you!
>
> On Fri, Feb 14, 2020 at 3:07 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> Hi Tewfik,
>> What language?  It is possible to serialize them separately, but the right
>> hooks might not be exposed in all languages.
>>
>> There is still going to be a higher overhead for single row values in
>> Arrow
>> compared to Avro due to metadata requirements.
>>
>> Thanks,
>> Micah
>>
>> On Fri, Feb 14, 2020 at 1:33 PM Tewfik Zeghmi <ze...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I have a use case of building a feature store to serve low-latency
>> > traffic. Given a key, we need the ability to save and read a feature
>> > vector in a low-latency key-value store. Serializing an Arrow table
>> > with one row takes 1344 bytes, while the same single row serialized
>> > with Avro without the schema uses 236 bytes.
>> >
>> > Is it possible to serialize an Arrow table/RecordBatch independently
>> > of the schema? Ideally, we'd like to serialize the schema once rather
>> > than along with every feature key, and then be able to read each
>> > RecordBatch back with that schema.
>> >
>> > thank you!
>> >
>>
>
>
> --
> Taleb Tewfik Zeghmi
>

Re: Schemaless serialization

Posted by Tewfik Zeghmi <ze...@gmail.com>.
Hi Micah,

The primary language is Python.  I'm hoping that the overhead of the
metadata is small compared to the schema information.

thank you!

On Fri, Feb 14, 2020 at 3:07 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Tewfik,
> What language?  It is possible to serialize them separately, but the right
> hooks might not be exposed in all languages.
>
> There is still going to be a higher overhead for single row values in Arrow
> compared to Avro due to metadata requirements.
>
> Thanks,
> Micah
>
> On Fri, Feb 14, 2020 at 1:33 PM Tewfik Zeghmi <ze...@gmail.com> wrote:
>
> > Hi,
> >
> > I have a use case of building a feature store to serve low-latency
> > traffic. Given a key, we need the ability to save and read a feature
> > vector in a low-latency key-value store. Serializing an Arrow table
> > with one row takes 1344 bytes, while the same single row serialized
> > with Avro without the schema uses 236 bytes.
> >
> > Is it possible to serialize an Arrow table/RecordBatch independently
> > of the schema? Ideally, we'd like to serialize the schema once rather
> > than along with every feature key, and then be able to read each
> > RecordBatch back with that schema.
> >
> > thank you!
> >
>


-- 
Taleb Tewfik Zeghmi

Re: Schemaless serialization

Posted by Micah Kornfield <em...@gmail.com>.
Hi Tewfik,
What language?  It is possible to serialize them separately, but the right
hooks might not be exposed in all languages.

There is still going to be a higher overhead for single row values in Arrow
compared to Avro due to metadata requirements.

Thanks,
Micah

On Fri, Feb 14, 2020 at 1:33 PM Tewfik Zeghmi <ze...@gmail.com> wrote:

> Hi,
>
> I have a use case of building a feature store to serve low-latency traffic.
> Given a key, we need the ability to save and read a feature vector in a
> low-latency key-value store. Serializing an Arrow table with one row takes
> 1344 bytes, while the same single row serialized with Avro without the
> schema uses 236 bytes.
>
> Is it possible to serialize an Arrow table/RecordBatch independently
> of the schema? Ideally, we'd like to serialize the schema once rather than
> along with every feature key, and then be able to read each RecordBatch back
> with that schema.
>
> thank you!
>