Posted to dev@arrow.apache.org by Yevgeni Litvin <se...@gmail.com> on 2018/09/26 18:59:42 UTC

Petastorm: PyArrow based library for Tensorflow, PyTorch and others...

Hi,

My name is Yevgeni Litvin. I work on ML infrastructure with a small team
within Uber ATG. Our team recently open sourced the Petastorm library. It
relies heavily on Apache Arrow, so I wanted to share it with the community.

The goal of the project is to provide a convenient way for the deep learning
community to use Apache Parquet stores with sensor data from Tensorflow,
PyTorch or other Python based ML frameworks.

I believe our use of Parquet differs from mainstream applications: our
field sizes are asymmetric (some, such as images, are huge and others are
small) and rowgroup sizes are relatively small (<100). That required
some adaptations.

We use PyArrow mostly for loading the data. We see great potential for
further optimizations and speedups by relying more heavily on Arrow as
an in-memory store.

You can find more information about our project here:

http://eng.uber.com/petastorm/
https://github.com/uber/petastorm

Would be more than happy to hear comments, feedback and suggestions!

Thank you,

- Yevgeni Litvin

Re: Petastorm: PyArrow based library for Tensorflow, PyTorch and others...

Posted by "Uwe L. Korn" <uw...@xhochy.com>.

Hello Yevgeni,

This looks interesting. Can you make a PR to https://github.com/apache/arrow so that Petastorm is listed on https://arrow.apache.org/powered_by/ ?

I browsed a bit through your code. As far as I can see, your approach is to store a set of Parquet files in a directory with a schema that can be translated for Spark, Tensorflow, Torch, … Is this schema persisted in the Parquet file metadata or as a separate file alongside the dataset? Could we extend Arrow's type system a bit to better suit all the frameworks you are targeting? As you had to build a more general schema class, I guess there are things that could not be expressed in Arrow's schema definition. I am not sure whether we could extend pyarrow's schema classes to fully support your use case, but I would like to understand how to better support it.

Uwe
