Posted to user@arrow.apache.org by Cindy McMullen <cm...@twitter.com> on 2020/07/28 16:04:46 UTC

Avro -> TensorFlow

Hi -

I've got a byte[] of serialized Avro, along w/ the Avro Schema (*.avsc file
or SpecificRecord Java class) that I'd like to send to TensorFlow as input
tensors, preferably via Arrow.  Can you suggest some existing adapters or
code patterns (Java or Scala) that I can use?

Thanks -

-- Cindy

Re: Avro -> TensorFlow

Posted by Micah Kornfield <em...@gmail.com>.
Thanks Cindy,
Feedback would be appreciated.  I also filed
https://issues.apache.org/jira/browse/ARROW-9613 so that the conversion can
potentially be more efficient.

On Wed, Jul 29, 2020 at 4:16 AM Cindy McMullen <cm...@twitter.com>
wrote:

> Thanks, Micah, for your thoughtful response.  We'll give it a try and let
> you know how it goes.
>
> -- Cindy
>
> On Tue, Jul 28, 2020 at 10:20 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> Hi Cindy,
>> I haven't tried this but the best guidance I can give is the following:
>> 1.  Create an appropriate decoder using Avro's DecoderFactory [1].
>> 2.  Construct an Arrow adapter with a schema and the decoder.  There are
>> some examples in the unit tests [2].
>> 3.  Adapt the method Uwe describes in his blog post about
>> JDBC [3] to use the adapter.  From there I think you can use the
>> TensorFlow APIs (sorry, I've not used them, but my understanding is TF only
>> has Python APIs?)
>>
>> If number 3 doesn't work for you due to environment constraints, you
>> could write out an Arrow file using the file writer [4] and see if the
>> examples listed in [5] help.
>>
>> One thing to note is that I believe the Avro adapter library currently has an
>> impedance mismatch with the ArrowFileWriter.  The adapter returns a new
>> VectorSchemaRoot per batch, and the writer libraries are designed around
>> loading/unloading a single VectorSchemaRoot.  I think the method with the
>> least overhead for transferring the data is to create a VectorUnloader
>> [6] per VectorSchemaRoot, convert it to a record batch, and then load it
>> into the writer's VectorSchemaRoot.  This will unfortunately cause some
>> amount of memory churn due to extra allocations.
>>
>> There is a short general overview of working with Arrow available at [7].
>>
>> Hope this helps,
>> Micah
>>
>> [1]
>> https://avro.apache.org/docs/1.10.0/api/java/org/apache/avro/io/DecoderFactory.html
>> [2]
>> https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/java/org/apache/arrow/AvroToArrowIteratorTest.java#L77
>> [3]
>> https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html
>> [4]
>> https://github.com/apache/arrow/blob/fe541e8fad2e6d7d5532e715f5287292c515d93b/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java
>> [5]
>> https://blog.tensorflow.org/2019/08/tensorflow-with-apache-arrow-datasets.html
>> [6]
>> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java
>> [7] https://arrow.apache.org/docs/java/
>>
>> On Tue, Jul 28, 2020 at 9:06 AM Cindy McMullen <cm...@twitter.com>
>> wrote:
>>
>>> Hi -
>>>
>>> I've got a byte[] of serialized Avro, along w/ the Avro Schema (*.avsc
>>> file or SpecificRecord Java class) that I'd like to send to TensorFlow as
>>> input tensors, preferably via Arrow.  Can you suggest some existing
>>> adapters or code patterns (Java or Scala) that I can use?
>>>
>>> Thanks -
>>>
>>> -- Cindy
>>>
>>

Re: Avro -> TensorFlow

Posted by Cindy McMullen <cm...@twitter.com>.
Thanks, Micah, for your thoughtful response.  We'll give it a try and let
you know how it goes.

-- Cindy

On Tue, Jul 28, 2020 at 10:20 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Cindy,
> I haven't tried this but the best guidance I can give is the following:
> 1.  Create an appropriate decoder using Avro's DecoderFactory [1].
> 2.  Construct an Arrow adapter with a schema and the decoder.  There are
> some examples in the unit tests [2].
> 3.  Adapt the method Uwe describes in his blog post about
> JDBC [3] to use the adapter.  From there I think you can use the
> TensorFlow APIs (sorry, I've not used them, but my understanding is TF only
> has Python APIs?)
>
> If number 3 doesn't work for you due to environment constraints, you could
> write out an Arrow file using the file writer [4] and see if the
> examples listed in [5] help.
>
> One thing to note is that I believe the Avro adapter library currently has an
> impedance mismatch with the ArrowFileWriter.  The adapter returns a new
> VectorSchemaRoot per batch, and the writer libraries are designed around
> loading/unloading a single VectorSchemaRoot.  I think the method with the
> least overhead for transferring the data is to create a VectorUnloader
> [6] per VectorSchemaRoot, convert it to a record batch, and then load it
> into the writer's VectorSchemaRoot.  This will unfortunately cause some
> amount of memory churn due to extra allocations.
>
> There is a short general overview of working with Arrow available at [7].
>
> Hope this helps,
> Micah
>
> [1]
> https://avro.apache.org/docs/1.10.0/api/java/org/apache/avro/io/DecoderFactory.html
> [2]
> https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/java/org/apache/arrow/AvroToArrowIteratorTest.java#L77
> [3]
> https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html
> [4]
> https://github.com/apache/arrow/blob/fe541e8fad2e6d7d5532e715f5287292c515d93b/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java
> [5]
> https://blog.tensorflow.org/2019/08/tensorflow-with-apache-arrow-datasets.html
> [6]
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java
> [7] https://arrow.apache.org/docs/java/
>
> On Tue, Jul 28, 2020 at 9:06 AM Cindy McMullen <cm...@twitter.com>
> wrote:
>
>> Hi -
>>
>> I've got a byte[] of serialized Avro, along w/ the Avro Schema (*.avsc
>> file or SpecificRecord Java class) that I'd like to send to TensorFlow as
>> input tensors, preferably via Arrow.  Can you suggest some existing
>> adapters or code patterns (Java or Scala) that I can use?
>>
>> Thanks -
>>
>> -- Cindy
>>
>

Re: Avro -> TensorFlow

Posted by Micah Kornfield <em...@gmail.com>.
Hi Cindy,
I haven't tried this but the best guidance I can give is the following:
1.  Create an appropriate decoder using Avro's DecoderFactory [1].
2.  Construct an Arrow adapter with a schema and the decoder.  There are
some examples in the unit tests [2]; a rough sketch of steps 1 and 2 follows
after this list.
3.  Adapt the method Uwe describes in his blog post about JDBC
[3] to use the adapter.  From there I think you can use the TensorFlow
APIs (sorry, I've not used them, but my understanding is TF only has Python
APIs?)
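
For steps 1 and 2, here is an untested sketch.  It assumes the adapter classes
from the java/adapter/avro module (AvroToArrow, AvroToArrowConfigBuilder,
AvroToArrowVectorIterator) roughly as they exist on master, so adjust names to
the Arrow version you are actually on:

import org.apache.arrow.AvroToArrow;
import org.apache.arrow.AvroToArrowConfig;
import org.apache.arrow.AvroToArrowConfigBuilder;
import org.apache.arrow.AvroToArrowVectorIterator;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class AvroBytesToArrow {

  // Decodes one serialized Avro payload into Arrow batches and prints row counts.
  public static void convert(byte[] avroBytes, Schema avroSchema) throws Exception {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
      // Step 1: wrap the raw bytes in an Avro BinaryDecoder.
      BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(avroBytes, null);

      // Step 2: run the Avro -> Arrow adapter; it yields one VectorSchemaRoot per batch.
      AvroToArrowConfig config = new AvroToArrowConfigBuilder(allocator).build();
      AvroToArrowVectorIterator batches =
          AvroToArrow.avroToArrowIterator(avroSchema, decoder, config);
      while (batches.hasNext()) {
        try (VectorSchemaRoot root = batches.next()) {
          System.out.println("batch rows: " + root.getRowCount());
        }
      }
    }
  }
}

The Avro schema can come either from your .avsc file (new
Schema.Parser().parse(...)) or, for a generated SpecificRecord class, from its
static getClassSchema() method.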

If number 3 doesn't work for you due to environment constraints, you could
write out an Arrow file using the file writer [4] and see if the
examples listed in [5] help.

One thing to note is that I believe the Avro adapter library currently has an
impedance mismatch with the ArrowFileWriter.  The adapter returns a new
VectorSchemaRoot per batch, and the writer libraries are designed around
loading/unloading a single VectorSchemaRoot.  I think the method with the
least overhead for transferring the data is to create a VectorUnloader
[6] per VectorSchemaRoot, convert it to a record batch, and then load it
into the writer's VectorSchemaRoot.  This will unfortunately cause some
amount of memory churn due to extra allocations.
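
Concretely, that unload/load pattern could look something like the sketch
below (again untested; the output path and the iterator wiring are
placeholders, and it assumes the VectorUnloader/VectorLoader/ArrowFileWriter
APIs referenced in [4] and [6]):

import java.io.FileOutputStream;
import java.util.Iterator;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;

public class ArrowBatchesToFile {

  // Writes the batches produced by the Avro adapter into a single Arrow file,
  // funnelling each batch through the writer's one VectorSchemaRoot.
  public static void write(Iterator<VectorSchemaRoot> batches,
                           BufferAllocator allocator,
                           String path) throws Exception {
    try (FileOutputStream out = new FileOutputStream(path)) {
      ArrowFileWriter writer = null;
      VectorSchemaRoot writerRoot = null;
      VectorLoader loader = null;
      try {
        while (batches.hasNext()) {
          try (VectorSchemaRoot batchRoot = batches.next()) {
            if (writer == null) {
              // Create the writer's single root from the first batch's schema.
              writerRoot = VectorSchemaRoot.create(batchRoot.getSchema(), allocator);
              loader = new VectorLoader(writerRoot);
              writer = new ArrowFileWriter(writerRoot, /* dictionaries */ null,
                  out.getChannel());
              writer.start();
            }
            // Unload the adapter's root into a record batch, load it into the
            // writer's root, then write it; this is the extra copy mentioned above.
            VectorUnloader unloader = new VectorUnloader(batchRoot);
            try (ArrowRecordBatch recordBatch = unloader.getRecordBatch()) {
              loader.load(recordBatch);
              writer.writeBatch();
            }
          }
        }
        if (writer != null) {
          writer.end();
        }
      } finally {
        if (writer != null) {
          writer.close();
        }
        if (writerRoot != null) {
          writerRoot.close();
        }
      }
    }
  }
}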

There is a short general overview of working with Arrow available at [7].

Hope this helps,
Micah

[1]
https://avro.apache.org/docs/1.10.0/api/java/org/apache/avro/io/DecoderFactory.html
[2]
https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/java/org/apache/arrow/AvroToArrowIteratorTest.java#L77
[3]
https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html
[4]
https://github.com/apache/arrow/blob/fe541e8fad2e6d7d5532e715f5287292c515d93b/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java
[5]
https://blog.tensorflow.org/2019/08/tensorflow-with-apache-arrow-datasets.html
[6]
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java
[7] https://arrow.apache.org/docs/java/

On Tue, Jul 28, 2020 at 9:06 AM Cindy McMullen <cm...@twitter.com>
wrote:

> Hi -
>
> I've got a byte[] of serialized Avro, along w/ the Avro Schema (*.avsc
> file or SpecificRecord Java class) that I'd like to send to TensorFlow as
> input tensors, preferably via Arrow.  Can you suggest some existing
> adapters or code patterns (Java or Scala) that I can use?
>
> Thanks -
>
> -- Cindy
>