Posted to dev@arrow.apache.org by Jacob Huffman <ja...@gmail.com> on 2022/01/03 20:57:56 UTC

[C++] Interest Arrow <=> Protobuf conversion capability

Hey all,

Is there much interest in adding the capability to do Arrow <=> Protobuf
conversion in C++?

I'm working on this for a side project, but I was wondering if there is
much interest from the broader Arrow community. If so, I might be able to
find time to contribute it.

To get the point across, here is a strawman API. In reality, we would
likely need some sort of builder API which allows incrementally adding
protos and a generator-like API for returning the protos from a table.

"""
// Functions of functions using templates to work with any message type
template <class T>
Result<std::shared_prt<Table>> ProtosToTable(const std::vector<T>& protos);

template <class T>
Result<std::vector<T>> TableToProtos(const std::shared_prt<Table> table);

// Pair of functions using google::protobuf::Message and polymorphism to
work with any message type
Result<std::shared_prt<Table>> ProtosToTable( const
std::vector<google::protobuf::Message *>& protos);

// I don't like that this returns a vector of unique pointers. Is there a
better way to return a vector of base classes while retaining polymorphic
behavior?
Result<std::vector<std::unique_ptr<google::protobuf::Message>>>
TableToProtos (const std::shared_prt<Table> table, const
google::protobuf:Descriptor* descriptor);
"""

My particular use case for these functions is that I would like to use
protobufs for the in-memory data representation, as they provide strongly
typed, very expressive classes (which can have nested/repeated fields) and a
well-established path for schema evolution. However, I would like to use
Parquet as the data storage layer (and Arrow as the glue between the two) so
that I can take advantage of technologies like Presto for querying the data.
I'm hoping that backwards-compatible changes to the proto schema turn into
backwards-compatible changes in the Parquet files. I'm also a bit curious to
see whether Arrow allows faster deserialization when compared to a list of
serialized protos on disk.

Re: [C++] Interest Arrow <=> Protobuf conversion capability

Posted by Micah Kornfield <em...@gmail.com>.
Hi Jacob,
I think translation between Arrow and Protobuf could be useful.  Given your
use-case I'd suggest considering a few things:

1.  If the end goal is to work with Parquet, then you might consider
building the layer directly on top of the Parquet low-level API instead of
involving Arrow.  For sparsely populated nested messages, you will avoid
unneeded construction and destruction (I think parquet-mr also has some
optimizations to avoid cache churn in these situations).  Keeping data in
row form also adds the option of buffering and sorting, which can greatly
improve storage and metadata pruning efficiency.   Working directly with
the Parquet API would also reduce the metadata you would have to plumb
through if you wanted to do schema resolution using protobuf tag numbers
(instead of field names).

2.  If the data you are working with isn't mutable (i.e. append-only), it
might be worth experimenting with code-gen that wraps Arrow Builder classes
with an idiomatic expressive API instead of adding an extra level of
indirection.

3.  If we do pursue the APIs above, I think APIs like this are likely more
naturally written in terms of RecordBatch or RecordBatchReader instead of
Tables.

4.  Defining the scope of protobuf handling will be important.   Protobuf
extensions [1] have some sharp edges to incorporate. How "oneof" fields are
handled would also be something to consider.

> I'm also a bit curious to see if arrow allows faster deserialization when
> compared to a list of serialized protos on disk

It depends on what you mean by this.  If you want to deserialize from Parquet
back to proto, I would be pretty surprised (but not completely) if going
through Arrow is more efficient, especially for deeply nested
"sparse" messages. The protobuf reflection implementation has a high
overhead of setting individual fields, and as mentioned above nested sparse
messages probably incur some level of tax that is avoided with serialized
protobufs.

[1] https://developers.google.com/protocol-buffers/docs/proto#extensions

Cheers,
Micah

On Mon, Jan 3, 2022 at 1:27 PM Jacob Huffman <ja...@gmail.com>
wrote:

> [original message quoted in full; trimmed]