Posted to user@arrow.apache.org by Cindy McMullen <cm...@twitter.com> on 2020/06/29 19:45:29 UTC

Streaming use cases

Can I use Arrow to stream data from a Parquet file source and consume it
via Avro?

Re: Streaming use cases

Posted by Micah Kornfield <em...@gmail.com>.
Hi Cindy,
Naming is hard :(.  The Consumer classes consume Avro data and write it to
Arrow.  For example, the AvroArraysConsumer [1] has the following
description: "Consumer which consume array type values from avro decoder.
Write the data to ListVector."  ListVector is the Arrow structure analogous
to Avro arrays.
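
To make the direction of the data flow concrete, here is a minimal sketch in
plain Python. It is not the Arrow Java API: ToyListVector and
ToyArraysConsumer are hypothetical names invented for illustration. The only
thing it takes from Arrow is the list layout itself, where each row is a
slice of one flat child array delimited by an offsets array.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class ToyListVector:
    """Toy stand-in for a list vector: offsets plus one flat child array."""
    offsets: List[int] = field(default_factory=lambda: [0])
    values: List[int] = field(default_factory=list)

    def append(self, items: Iterable[int]) -> None:
        # Child values are stored contiguously; the new offset marks
        # where this row ends in the flat values array.
        self.values.extend(items)
        self.offsets.append(len(self.values))

    def get(self, index: int) -> List[int]:
        # Row i is the slice values[offsets[i]:offsets[i + 1]].
        return self.values[self.offsets[index]:self.offsets[index + 1]]


class ToyArraysConsumer:
    """Hypothetical consumer: takes decoded Avro-style arrays and writes
    them into the vector (Avro in, Arrow-style layout out)."""

    def __init__(self, vector: ToyListVector) -> None:
        self.vector = vector

    def consume(self, decoded_array: Iterable[int]) -> None:
        self.vector.append(decoded_array)


# Three "Avro arrays" as they might come out of a decoder (made-up data).
vector = ToyListVector()
consumer = ToyArraysConsumer(vector)
for avro_array in ([1, 2, 3], [], [4, 5]):
    consumer.consume(avro_array)
```

The consumer only ever reads from the Avro side and writes to the vector,
which is why the name describes Avro to Arrow and not the reverse.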

Thanks,
Micah


[1]
https://arrow.apache.org/docs/java/org/apache/arrow/consumers/AvroArraysConsumer.html

On Tue, Jun 30, 2020 at 8:02 AM Cindy McMullen <cm...@twitter.com>
wrote:

> Hi, Micah -
>
> I see the Avro*Consumer classes in the javadocs
> <https://arrow.apache.org/docs/java/>, which would lead me to believe we
> have Arrow to Avro capability.  What am I missing?

Re: Streaming use cases

Posted by Cindy McMullen <cm...@twitter.com>.
Hi, Micah -

I see the Avro*Consumer classes in the javadocs
<https://arrow.apache.org/docs/java/>, which would lead me to believe we
have Arrow to Avro capability.  What am I missing?

On Mon, Jun 29, 2020 at 9:33 PM Micah Kornfield <em...@gmail.com>
wrote:

> Just a clarification: the functionality in Java is from Avro to Arrow (not
> Arrow to Avro).

Re: Streaming use cases

Posted by Micah Kornfield <em...@gmail.com>.
Just a clarification: the functionality in Java is from Avro to Arrow (not
Arrow to Avro).



On Mon, Jun 29, 2020 at 2:25 PM Wes McKinney <we...@gmail.com> wrote:

> On Mon, Jun 29, 2020 at 4:15 PM Cindy McMullen <cm...@twitter.com>
> wrote:
> >
> > Hi, Wes -
> >
> > Yes, we're using Java/Scala, but also have a good Python code base for
> > our data scientists.  Our goal is to replace storage/representation of
> > Thrift for ML features with some more OSS-friendly format, such as Parquet
> > or Avro, and avoid writing multiple adapters.
> >
> > Ideally, we could stream data from Parquet disk in batches into
> > Arrow-compatible consumers.  Is this a reasonable fit for something like
> > Arrow Flight?
>
> Yes, Flight is definitely designed for that -- fast / efficient
> delivery of Arrow record batches over TCP.

Re: Streaming use cases

Posted by Wes McKinney <we...@gmail.com>.
On Mon, Jun 29, 2020 at 4:15 PM Cindy McMullen <cm...@twitter.com> wrote:
>
> Hi, Wes -
>
> Yes, we're using Java/Scala, but also have a good Python code base for our data scientists.  Our goal is to replace storage/representation of Thrift for ML features with some more OSS-friendly format, such as Parquet or Avro, and avoid writing multiple adapters.
>
> Ideally, we could stream data from Parquet disk in batches into Arrow-compatible consumers.  Is this a reasonable fit for something like Arrow Flight?

Yes, Flight is definitely designed for that -- fast / efficient
delivery of Arrow record batches over TCP.
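
As a rough picture of what "delivering record batches over a stream" means,
here is a toy sketch in plain Python. It is not Flight and does not use any
Arrow API: Flight is built on gRPC and the Arrow IPC format, whereas this
uses JSON payloads with a 4-byte length prefix, and socket.socketpair()
stands in for a real TCP connection. All function names are made up for
illustration.

```python
import json
import socket
import struct
from typing import Iterator, List


def make_batches(column: List[int], batch_size: int) -> List[dict]:
    # Slice one column into fixed-size "record batches" (toy representation).
    return [{"feature": column[i:i + batch_size]}
            for i in range(0, len(column), batch_size)]


def send_batches(sock: socket.socket, batches: List[dict]) -> None:
    for batch in batches:
        payload = json.dumps(batch).encode("utf-8")
        # Length-prefixed framing: 4-byte big-endian size, then the payload.
        sock.sendall(struct.pack("!I", len(payload)) + payload)
    sock.shutdown(socket.SHUT_WR)  # signal end of stream to the reader


def read_exact(sock: socket.socket, n: int) -> bytes:
    # recv() may return fewer bytes than requested, so loop until done.
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            break  # peer closed the stream
        buf.extend(chunk)
    return bytes(buf)


def receive_batches(sock: socket.socket) -> Iterator[dict]:
    while True:
        header = read_exact(sock, 4)
        if len(header) < 4:
            return  # clean end of stream
        (length,) = struct.unpack("!I", header)
        yield json.loads(read_exact(sock, length))


# A connected pair of stream sockets stands in for client and server.
producer, consumer = socket.socketpair()
send_batches(producer, make_batches(list(range(10)), batch_size=4))
received = list(receive_batches(consumer))
producer.close()
consumer.close()
```

The consumer sees one batch at a time and never needs the whole dataset in
memory, which is the property the Parquet-in-batches use case relies on.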

Re: Streaming use cases

Posted by Cindy McMullen <cm...@twitter.com>.
Hi, Wes -

Yes, we're using Java/Scala, but also have a good Python code base for our
data scientists.  Our goal is to replace storage/representation of Thrift
for ML features with some more OSS-friendly format, such as Parquet or
Avro, and avoid writing multiple adapters.

Ideally, we could stream data from Parquet disk in batches into
Arrow-compatible consumers.  Is this a reasonable fit for something like
Arrow Flight?


On Mon, Jun 29, 2020 at 2:37 PM Wes McKinney <we...@gmail.com> wrote:

> hi Cindy,
>
> Could you clarify which programming language you are working in (I'm
> assuming Scala / Java, judging by your e-mail address)?
>
> In C++ we have reasonably mature Parquet->Arrow reading but not yet
> conversion from Arrow to Avro. In Java, I am not sure what is the
> state of the art for getting Parquet into Arrow but this code does not
> live in Apache Arrow -- I know that Apache Iceberg has done some work
> around this but I'm not sure how consumable it is as a library.
> Java-Arrow does have some preliminary support for converting Arrow to
> Avro, I believe. So there's some engineering here to do in any case.
>
> best,
> Wes

Re: Streaming use cases

Posted by Wes McKinney <we...@gmail.com>.
hi Cindy,

Could you clarify which programming language you are working in (I'm
assuming Scala / Java, judging by your e-mail address)?

In C++ we have reasonably mature Parquet->Arrow reading but not yet
conversion from Arrow to Avro. In Java, I am not sure what is the
state of the art for getting Parquet into Arrow but this code does not
live in Apache Arrow -- I know that Apache Iceberg has done some work
around this but I'm not sure how consumable it is as a library.
Java-Arrow does have some preliminary support for converting Arrow to
Avro, I believe. So there's some engineering here to do in any case.

best,
Wes

On Mon, Jun 29, 2020 at 2:45 PM Cindy McMullen <cm...@twitter.com> wrote:
>
> Can I use Arrow to stream data from a Parquet file source and consume it via Avro?