Posted to dev@arrow.apache.org by Joaquin Vanschoren <jo...@gmail.com> on 2019/06/12 10:38:11 UTC

Arrow as a common open standard for machine learning data

Dear all,

Thanks for creating Arrow! I'm part of OpenML.org, an open source
initiative/platform for sharing machine learning datasets and models. We
are currently storing data in either ARFF or Parquet, but are looking into
whether e.g. Feather or a mix of Feather and Parquet could be the new
standard for all(?) our datasets (currently about 20000 of them). We had a
few questions though, and would definitely like to hear your opinion.
Apologies in advance if there were recent announcements about these that I
missed.

* Is Feather a good choice for long-term storage (is the binary format
stable)?
* What meta-data is stored? Are the column names and data types always
stored? For categorical columns, can I store the named categories/levels?
Is there a way to append additional meta-data, or is it best to store that
in a separate file (e.g. json)?
* What is the status of support for sparse data? Can I store large sparse
datasets efficiently? I noticed sparse_tensor in Arrow. Is it available in
Feather/Parquet?
* What is the status of compression for Feather? We have some datasets that
are quite large (several GB), and we'd like to onboard datasets like
ImageNet, which is 130GB in TFRecord format (but I read that Parquet can
store it in about 40GB).
* Would it make more sense to use both Parquet and Feather, depending on
the dataset size or dimensionality? If so, what would be a good
trade-off/threshold in that case?
* Most of our datasets are standard dataframes, but some are also
collections of images or texts. I guess we have to 'manually' convert those
to dataframes first, right? Or do you know of existing tools to facilitate
this?
*  Can Feather files already be read in Java/Go/C#/...?

Thanks!
Joaquin

Re: Arrow as a common open standard for machine learning data

Posted by Wes McKinney <we...@gmail.com>.
On Tue, Jun 30, 2020 at 8:09 AM Nicholas Poorman <ni...@gmail.com> wrote:
>
> Joaquin,
>
> After reading your proposal I think there may be some things you may want
> to consider.
>
> It sounds like you are trying to come up with a one size fits all solution
> but it may be better to define your requirements based on your needs and
> environment.
>
> For starters, where do you plan to store these files? Do you plan on
> putting them in a cloud object storage like S3 or do you plan on having
> disk volumes attached to servers you are managing? A format like Parquet is
> going to be useful for object storage such as S3 because you bundle up
> everything in memory and then write it out at once. Any append-able format
> is going to require disk volumes where you have append functionality. There
> are a few active projects that build commit logs and tombstones on top of
> Parquet to add this functionality. For example Hudi and Databricks
> DeltaLake. Also, if you plan on doing anything at scale you might run into
> issues with lock contention if you choose something backed by b-trees such
> as SQLite.
>
> There are two ways you could handle the issues with Parquet implementations
> that are currently unable to read partial files. One, you could contribute
> back to the Parquet implementation so that it is capable of doing so. Or
> two, you could partition your Parquet files and write them in smaller
> chunks so they could be selectively read. I’m currently in the process of
> implementing the Parquet implementation in Go so partial read functionality
> is something I will consider.
>
> If you plan on having a service that can read in one format and return
> another format to a user, you are either going to need a format capable of
> stream decoding/encoding or a whole lot of memory on the instances running
> the service. Something like csv would allow for stream decoding/encoding.
> Compression algorithms are for the most part going to be streaming. A
> b-tree is going to be streaming as you can do something like a
> breadth-first iteration over it. Parquet on the other hand is either going
> to require you to write files in small partitions (this is generally bad
> and Spark users refer to this as the “small files problem”), or you will
> need to utilize an implementation that supports partial reads. There are
> Parquet implementations in Java and Go that support partial reads. The
> issue you will face is doing the streaming writes back to the user. If for
> example a user wanted their data returned as Parquet you would have to do
> the transformation in memory all at once and then stream it to the user. If
> doing the transform from/to various file formats is a feature you feel
> strongly about, I would suggest doing the transforms via out-of-band ETL
> jobs where the user can then request the files asynchronously later. Doing
> the transform in-band of the request / response lifecycle doesn’t seem
> scalable given the constraints of some file formats and instance memory.
>
> To your point of storing images with meta data such as tags. I haven’t
> actually tried it but I suppose you could in theory write the images in one
> Parquet binary type column and the tags in another.
>
> Versioning is difficult and I believe there are many attempts at this right
> now. DeltaLake for example has the ability to query a dataset at a point
> in time. They basically have Parquet files with some extra json files on
> the side describing the changes. You first read the json files to
> understand the changes and then read the Parquet files they reference.
> Straightforward file versioning could be achieved with your underlying file
> system. S3 has file versioning, Docker has its own internal delta changes
> file system layer, etc..
>
> I would not recommend storing the files in Feather for long-term storage as
> your file size and costs are going to explode compared to a column-oriented
> format that supports compression.

Note: Feather files support compression now.
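
As a rough sketch of what that looks like with a recent pyarrow (the file
name and toy DataFrame here are made up for illustration):

    import pandas as pd
    import pyarrow.feather as feather

    # Toy stand-in for an OpenML dataframe
    df = pd.DataFrame({"feature": [1.0, 2.0, 3.0], "label": ["a", "b", "a"]})

    # Feather V2 wraps the Arrow IPC file format and supports per-buffer
    # compression ("zstd" or "lz4")
    feather.write_feather(df, "dataset.feather", compression="zstd")

    # Reads back as a pyarrow.Table; call .to_pandas() for a DataFrame
    table = feather.read_table("dataset.feather")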

> Best,
> Nick
>
>
> On Tue, Jun 30, 2020 at 6:46 AM Joaquin Vanschoren <j....@tue.nl>
> wrote:
>
> > Hi all,
> >
> > Sorry for restarting an old thread, but we've had a _lot_ of discussions
> > over the past 9 months or so on how to store machine learning datasets
> > internally. We've written a blog post about it and would love to hear your
> > thoughts:
> >
> > https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html
> >
> > To be clear: what we need is a data format for archival storage on the
> > server, and preferably one that supports versioning/diff, multi-table
> > storage, and sparse data.
> > Hence, this is for *internal* storage. When OpenML users want to download a
> > dataset in parquet or arrow we can always convert it on the fly (or from a
> > cache). We already use Arrow/Feather to cache the datasets after it is
> > downloaded (when possible).
> >
> > One specific concern about parquet is that we are not entirely sure
> > whether a parquet file created by one parser (e.g. in R) can always be read
> > by another parser (e.g. in Python). We saw some github issues related to
> > this but we don't know whether this is still an issue. Do you know? Also,
> > it seems that none of the current python parsers support partial
> > read/writes, is that correct?
> >
> > Because of these issues, we are still considering a text-based format (e.g.
> > CSV) for our main dataset storage, mainly because of its broad native
> > support in all languages and easy versioning/diffs (we could use git-lfs),
> > and use parquet/arrow for later usage where possible. We're still doubting
> > between CSV and Parquet, though.
> >
> > Do you have any thoughts or comments?
> >
> > Thanks!
> > Joaquin
> >
> > On Thu, 20 Jun 2019 at 23:47, Wes McKinney <we...@gmail.com> wrote:
> >
> > > hi Joaquin -- there would be no practical difference, primarily it
> > > would be for the preservation of APIs in Python and R related to the
> > > Feather format. Internally "read_feather" will invoke the same code
> > > paths as the Arrow protocol file reader
> > >
> > > - Wes
> > >
> > > On Thu, Jun 20, 2019 at 4:12 PM Joaquin Vanschoren
> > > <jo...@gmail.com> wrote:
> > > >
> > > > Thank you all for your very detailed answers! I also read in other
> > > threads
> > > > that the 1.0.0 release might be coming somewhere this fall? I'm really
> > > > looking forward to that.
> > > > @Wes: will there be any practical difference between Feather and Arrow
> > > > after the 1.0.0 release? It is just an alias? What would be the
> > benefits
> > > of
> > > > using Feather rather than Arrow at that point?
> > > >
> > > > Thanks!
> > > > Joaquin
> > > >
> > > >
> > > >
> > > > On Sun, 16 Jun 2019 at 18:25, Sebastien Binet <bi...@cern.ch> wrote:
> > > >
> > > > > hi there,
> > > > >
> > > > > On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield <
> > emkornfield@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > > *  Can Feather files already be read in Java/Go/C#/...?
> > > > > >
> > > > > > I don't know the status of feather.   The arrow file format should
> > be
> > > > > > readable by Java and C++ (I believe all the languages that bind C++
> > > also
> > > > > > support the format, these include python, ruby and R) .  A quick
> > code
> > > > > > search of the repo makes me think that there is also support for
> > C#,
> > > Rust
> > > > > > and Javascript. It doesn't look like the file format isn't
> > supported
> > > in
> > > > > Go
> > > > > > yet but it probably wouldn't be too hard to do.
> > > > > >
> > > > > Go doesn't handle Feather files.
> > > > > But there is support (not yet feature complete, see [1]) for Arrow
> > > files
> > > > > (r/w):
> > > > > -  https://godoc.org/github.com/apache/arrow/go/arrow/ipc
> > > > >
> > > > > hth,
> > > > > -s
> > > > >
> > > > > [1]: https://issues.apache.org/jira/browse/ARROW-3679
> > > > >
> > >
> >

Re: Arrow as a common open standard for machine learning data

Posted by Joaquin Vanschoren <j....@tue.nl>.
Thanks!


> You should be able to store different length vectors in Parquet. Think of
> strings simply as an array of bytes, and those are variable length. You
> would want to make sure you don’t use DICTIONARY_ENCODING in that case.
>

Interesting. We'll look at that.


> No, I'm not aware of any tools that do diffs between Parquet files. I'm
> not sure how you could perform a byte for byte diff without reading one
> into memory and decoding it. My question here would be who is trying to
> consume the diff you want to generate? Is the diff something you want to
> display to a user? i.e. column A, row 132 was "foo" but has now changed to
> "bar"
>

Yes. A typical scenario is that there is a public dataset, and
different people have made incremental improvements. This could be, for
instance, removing constant columns, fixing typos, formatting dates, or
removing data from a broken sensor. It would be interesting if users could
see how two datasets differ.
Another scenario is a reviewing process where the author of a dataset wants
to review changes made by a contributor before accepting them.


> Or are you looking to apply an update to a dataset? i.e. I recently
> trained and stored embeddings and now I need to update them but I don't
> want to override the data because I would like to be able to retrieve what
> they were in the last training iteration so I can roll back, run parallel
> tests, etc..
>

Possibly, although updating an embedding will likely change every value in
the dataset. That seems to call for file versioning and meta-data about the
process that generated it.


> Thanks, you may mention me as a contributor to the blog post if you'd like!
>

Done ;).

Thanks again,
Joaquin





> On Thu, Jul 2, 2020 at 9:40 AM Joaquin Vanschoren <j....@tue.nl>
> wrote:
>
>> Hi Nick, all,
>>
>> Thanks! I updated the blog post to specify the requirements better.
>>
>> First, we plan to store the datasets in S3 (on min.io). I agree this
>> works
>> nicely with Parquet.
>>
>> Do you know whether there any activity on supporting partial read/writes
>> in
>> arrow or fastparquet? That would change things a lot.
>>
>>
>> > If doing the transform from/to various file formats is a feature you
>> feel
>> > strongly about, I would suggest doing the transforms via out-of-band ETL
>> > jobs where the user can then request the files asynchronously later.
>>
>>
>> That's what we were thinking about, yes. We need a 'core' format to store
>> the data and write ETL jobs for, but secondary formats could be stored in
>> S3 and returned on demand.
>>
>>
>> > To your point of storing images with meta data such as tags. I haven’t
>> > actually tried it but I suppose you could in theory write the images in
>> one
>> > Parquet binary type column and the tags in another.
>> >
>>
>> Even then, there are different numbers of bounding boxes / tags per image.
>> Can you store different-length vectors in Parquet?
>>
>>
>> > Versioning is difficult and I believe there are many attempts at this
>> right
>> > now. DeltaLake for example has the ability to query at dataset at a
>> point
>> > in time. They basically have Parquet files with some extra json files on
>> > the side describing the changes.
>>
>>
>> I've looked at DeltaLake, but as far as I understand, its commit log
>> depends on spark operations done on the dataframe? Hence, any change to
>> the
>> dataset has to be performed via spark? Is that correct?
>>
>>
>> > Straight up versions of file could be achieved with your underlying file
>> > system. S3 has file versioning.
>> >
>>
>> Do you know of any tools to compute diffs between Parquet file? What I
>> could find was basically: export both files to CSV and run git diff.
>> DeltaLake would help here, but again, is seems that it only 'tracks' Spark
>> operations done directly on the file?
>>
>> Thanks!
>> Joaquin
>>
>> PS. Nick, would you like to be mentioned as a contributor in the blog
>> post?
>> Your comments helped a lot to improve it ;).
>>
>>
>>
>>
>> On Tue,  Jun 30, 2020 at 6:46 AM Joaquin Vanschoren <j....@tue.nl>
>> > wrote:
>> >
>> > > Hi all,
>> > >
>> > > Sorry for restarting an old thread, but we've had a _lot_ of
>> discussions
>> > > over the past 9 months or so on how to store machine learning datasets
>> > > internally. We've written a blog post about it and would love to hear
>> > your
>> > > thoughts:
>> > >
>> > >
>> >
>> https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html
>> > >
>> > > To be clear: what we need is a data format for archival storage on the
>> > > server, and preferably one that supports versioning/diff, multi-table
>> > > storage, and sparse data.
>> > > Hence, this is for *internal* storage. When OpenML users want to
>> > download a
>> > > dataset in parquet or arrow we can always convert it on the fly (or
>> from
>> > a
>> > > cache). We already use Arrow/Feather to cache the datasets after it is
>> > > downloaded (when possible).
>> > >
>> > > One specific concern about parquet is that we are not entirely sure
>> > > whether a parquet file created by one parser (e.g. in R) can always be
>> > read
>> > > by another parser (e.g. in Python). We saw some github issues related
>> to
>> > > this but we don't know whether this is still an issue. Do you know?
>> Also,
>> > > it seems that none of the current python parsers support partial
>> > > read/writes, is that correct?
>> > >
>> > > Because of these issues, we are still considering a text-based format
>> > (e.g.
>> > > CSV) for our main dataset storage, mainly because of its broad native
>> > > support in all languages and easy versioning/diffs (we could use
>> > git-lfs),
>> > > and use parquet/arrow for later usage where possible. We're still
>> > doubting
>> > > between CSV and Parquet, though.
>> > >
>> > > Do you have any thoughts or comments?
>> > >
>> > > Thanks!
>> > > Joaquin
>> > >
>> > > On Thu, 20 Jun 2019 at 23:47, Wes McKinney <we...@gmail.com>
>> wrote:
>> > >
>> > > > hi Joaquin -- there would be no practical difference, primarily it
>> > > > would be for the preservation of APIs in Python and R related to the
>> > > > Feather format. Internally "read_feather" will invoke the same code
>> > > > paths as the Arrow protocol file reader
>> > > >
>> > > > - Wes
>> > > >
>> > > > On Thu, Jun 20, 2019 at 4:12 PM Joaquin Vanschoren
>> > > > <jo...@gmail.com> wrote:
>> > > > >
>> > > > > Thank you all for your very detailed answers! I also read in other
>> > > > threads
>> > > > > that the 1.0.0 release might be coming somewhere this fall? I'm
>> > really
>> > > > > looking forward to that.
>> > > > > @Wes: will there be any practical difference between Feather and
>> > Arrow
>> > > > > after the 1.0.0 release? It is just an alias? What would be the
>> > > benefits
>> > > > of
>> > > > > using Feather rather than Arrow at that point?
>> > > > >
>> > > > > Thanks!
>> > > > > Joaquin
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Sun, 16 Jun 2019 at 18:25, Sebastien Binet <bi...@cern.ch>
>> wrote:
>> > > > >
>> > > > > > hi there,
>> > > > > >
>> > > > > > On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield <
>> > > emkornfield@gmail.com
>> > > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > > > > *  Can Feather files already be read in Java/Go/C#/...?
>> > > > > > >
>> > > > > > > I don't know the status of feather.   The arrow file format
>> > should
>> > > be
>> > > > > > > readable by Java and C++ (I believe all the languages that
>> bind
>> > C++
>> > > > also
>> > > > > > > support the format, these include python, ruby and R) .  A
>> quick
>> > > code
>> > > > > > > search of the repo makes me think that there is also support
>> for
>> > > C#,
>> > > > Rust
>> > > > > > > and Javascript. It doesn't look like the file format isn't
>> > > supported
>> > > > in
>> > > > > > Go
>> > > > > > > yet but it probably wouldn't be too hard to do.
>> > > > > > >
>> > > > > > Go doesn't handle Feather files.
>> > > > > > But there is support (not yet feature complete, see [1]) for
>> Arrow
>> > > > files
>> > > > > > (r/w):
>> > > > > > -  https://godoc.org/github.com/apache/arrow/go/arrow/ipc
>> > > > > >
>> > > > > > hth,
>> > > > > > -s
>> > > > > >
>> > > > > > [1]: https://issues.apache.org/jira/browse/ARROW-3679
>> > > > > >
>> > > >
>> > >
>> >
>>
>

Re: Arrow as a common open standard for machine learning data

Posted by Nicholas Poorman <ni...@gmail.com>.
Joaquin,

> Do you know whether there is any activity on supporting partial read/writes
in
arrow or fastparquet?

I’m not entirely sure about the status of partial read/writes in Arrow’s
Parquet implementation but
https://github.com/xitongsys/parquet-go for example has this capability.
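
For what it's worth, pyarrow already exposes at least column- and
row-group-level partial reads; a rough sketch (file and column names are
made up):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("dataset.parquet")
    print(pf.metadata.num_row_groups)

    # Read only two columns from a single row group instead of the whole file;
    # returns a pyarrow.Table
    subset = pf.read_row_group(0, columns=["feature_1", "label"])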

> Even then, there are different numbers of bounding boxes / tags per image.
> Can you store different-length vectors in Parquet?

You should be able to store different-length vectors in Parquet. Think of
strings simply as an array of bytes, and those are variable length. You
would want to make sure you don’t use DICTIONARY_ENCODING in that case.
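
A rough sketch of that with pyarrow, using list-typed columns for the
variable-length parts (the contents are made up):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # One row per image; a variable number of tags and bounding boxes per row
    table = pa.table({
        "image_id": pa.array([1, 2, 3]),
        "tags": pa.array([["cat"], ["dog", "outdoor"], []],
                         type=pa.list_(pa.string())),
        "boxes": pa.array([[[0.1, 0.2, 0.5, 0.6]], [],
                           [[0.0, 0.0, 1.0, 1.0], [0.3, 0.3, 0.4, 0.4]]],
                          type=pa.list_(pa.list_(pa.float32()))),
    })

    # Disable dictionary encoding, as suggested above
    pq.write_table(table, "images.parquet", use_dictionary=False)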

> I've looked at DeltaLake, but as far as I understand, its commit log
depends on spark operations done on the dataframe? Hence, any change to the
dataset has to be performed via spark? Is that correct?

Until someone replicates the functionality outside of Spark, yes that is
the drawback and why I have been hesitant to adopt DeltaLake.

> Do you know of any tools to compute diffs between Parquet files? What I
could find was basically: export both files to CSV and run git diff.
> DeltaLake would help here, but again, it seems that it only 'tracks' Spark
operations done directly on the file?

No, I'm not aware of any tools that do diffs between Parquet files. I'm not
sure how you could perform a byte-for-byte diff without reading one into
memory and decoding it. My question here would be who is trying to consume
the diff you want to generate? Is the diff something you want to display to
a user? i.e. column A, row 132 was "foo" but has now changed to "bar"? Or
are you looking to apply an update to a dataset? i.e. I recently trained
and stored embeddings and now I need to update them but I don't want to
override the data because I would like to be able to retrieve what they
were in the last training iteration so I can roll back, run parallel tests,
etc..
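
One crude approach, assuming the two versions share the same schema, shape,
and row order (file names are made up), is to load both into pandas and use
its cell-level compare:

    import pyarrow.parquet as pq

    old = pq.read_table("dataset_v1.parquet").to_pandas()
    new = pq.read_table("dataset_v2.parquet").to_pandas()

    # pandas >= 1.1: shows only the cells that differ between the two frames
    diff = old.compare(new)
    print(diff)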

I believe DeltaLake has a commit log. However, it probably doesn't provide
a diff. The commit log does give them the ability to ask "What did the data
look like at this point in time?".

Thanks, you may mention me as a contributor to the blog post if you'd like!

Best,
Nick Poorman



On Thu, Jul 2, 2020 at 9:40 AM Joaquin Vanschoren <j....@tue.nl>
wrote:

> Hi Nick, all,
>
> Thanks! I updated the blog post to specify the requirements better.
>
> First, we plan to store the datasets in S3 (on min.io). I agree this works
> nicely with Parquet.
>
> Do you know whether there any activity on supporting partial read/writes in
> arrow or fastparquet? That would change things a lot.
>
>
> > If doing the transform from/to various file formats is a feature you feel
> > strongly about, I would suggest doing the transforms via out-of-band ETL
> > jobs where the user can then request the files asynchronously later.
>
>
> That's what we were thinking about, yes. We need a 'core' format to store
> the data and write ETL jobs for, but secondary formats could be stored in
> S3 and returned on demand.
>
>
> > To your point of storing images with meta data such as tags. I haven’t
> > actually tried it but I suppose you could in theory write the images in
> one
> > Parquet binary type column and the tags in another.
> >
>
> Even then, there are different numbers of bounding boxes / tags per image.
> Can you store different-length vectors in Parquet?
>
>
> > Versioning is difficult and I believe there are many attempts at this
> right
> > now. DeltaLake for example has the ability to query at dataset at a point
> > in time. They basically have Parquet files with some extra json files on
> > the side describing the changes.
>
>
> I've looked at DeltaLake, but as far as I understand, its commit log
> depends on spark operations done on the dataframe? Hence, any change to the
> dataset has to be performed via spark? Is that correct?
>
>
> > Straight up versions of file could be achieved with your underlying file
> > system. S3 has file versioning.
> >
>
> Do you know of any tools to compute diffs between Parquet file? What I
> could find was basically: export both files to CSV and run git diff.
> DeltaLake would help here, but again, is seems that it only 'tracks' Spark
> operations done directly on the file?
>
> Thanks!
> Joaquin
>
> PS. Nick, would you like to be mentioned as a contributor in the blog post?
> Your comments helped a lot to improve it ;).
>
>
>
>
> On Tue,  Jun 30, 2020 at 6:46 AM Joaquin Vanschoren <j....@tue.nl>
> > wrote:
> >
> > > Hi all,
> > >
> > > Sorry for restarting an old thread, but we've had a _lot_ of
> discussions
> > > over the past 9 months or so on how to store machine learning datasets
> > > internally. We've written a blog post about it and would love to hear
> > your
> > > thoughts:
> > >
> > >
> >
> https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html
> > >
> > > To be clear: what we need is a data format for archival storage on the
> > > server, and preferably one that supports versioning/diff, multi-table
> > > storage, and sparse data.
> > > Hence, this is for *internal* storage. When OpenML users want to
> > download a
> > > dataset in parquet or arrow we can always convert it on the fly (or
> from
> > a
> > > cache). We already use Arrow/Feather to cache the datasets after it is
> > > downloaded (when possible).
> > >
> > > One specific concern about parquet is that we are not entirely sure
> > > whether a parquet file created by one parser (e.g. in R) can always be
> > read
> > > by another parser (e.g. in Python). We saw some github issues related
> to
> > > this but we don't know whether this is still an issue. Do you know?
> Also,
> > > it seems that none of the current python parsers support partial
> > > read/writes, is that correct?
> > >
> > > Because of these issues, we are still considering a text-based format
> > (e.g.
> > > CSV) for our main dataset storage, mainly because of its broad native
> > > support in all languages and easy versioning/diffs (we could use
> > git-lfs),
> > > and use parquet/arrow for later usage where possible. We're still
> > doubting
> > > between CSV and Parquet, though.
> > >
> > > Do you have any thoughts or comments?
> > >
> > > Thanks!
> > > Joaquin
> > >
> > > On Thu, 20 Jun 2019 at 23:47, Wes McKinney <we...@gmail.com>
> wrote:
> > >
> > > > hi Joaquin -- there would be no practical difference, primarily it
> > > > would be for the preservation of APIs in Python and R related to the
> > > > Feather format. Internally "read_feather" will invoke the same code
> > > > paths as the Arrow protocol file reader
> > > >
> > > > - Wes
> > > >
> > > > On Thu, Jun 20, 2019 at 4:12 PM Joaquin Vanschoren
> > > > <jo...@gmail.com> wrote:
> > > > >
> > > > > Thank you all for your very detailed answers! I also read in other
> > > > threads
> > > > > that the 1.0.0 release might be coming somewhere this fall? I'm
> > really
> > > > > looking forward to that.
> > > > > @Wes: will there be any practical difference between Feather and
> > Arrow
> > > > > after the 1.0.0 release? It is just an alias? What would be the
> > > benefits
> > > > of
> > > > > using Feather rather than Arrow at that point?
> > > > >
> > > > > Thanks!
> > > > > Joaquin
> > > > >
> > > > >
> > > > >
> > > > > On Sun, 16 Jun 2019 at 18:25, Sebastien Binet <bi...@cern.ch>
> wrote:
> > > > >
> > > > > > hi there,
> > > > > >
> > > > > > On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield <
> > > emkornfield@gmail.com
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > > *  Can Feather files already be read in Java/Go/C#/...?
> > > > > > >
> > > > > > > I don't know the status of feather.   The arrow file format
> > should
> > > be
> > > > > > > readable by Java and C++ (I believe all the languages that bind
> > C++
> > > > also
> > > > > > > support the format, these include python, ruby and R) .  A
> quick
> > > code
> > > > > > > search of the repo makes me think that there is also support
> for
> > > C#,
> > > > Rust
> > > > > > > and Javascript. It doesn't look like the file format isn't
> > > supported
> > > > in
> > > > > > Go
> > > > > > > yet but it probably wouldn't be too hard to do.
> > > > > > >
> > > > > > Go doesn't handle Feather files.
> > > > > > But there is support (not yet feature complete, see [1]) for
> Arrow
> > > > files
> > > > > > (r/w):
> > > > > > -  https://godoc.org/github.com/apache/arrow/go/arrow/ipc
> > > > > >
> > > > > > hth,
> > > > > > -s
> > > > > >
> > > > > > [1]: https://issues.apache.org/jira/browse/ARROW-3679
> > > > > >
> > > >
> > >
> >
>

Re: Arrow as a common open standard for machine learning data

Posted by Joaquin Vanschoren <j....@tue.nl>.
Hi Nick, all,

Thanks! I updated the blog post to specify the requirements better.

First, we plan to store the datasets in S3 (on min.io). I agree this works
nicely with Parquet.

Do you know whether there is any activity on supporting partial read/writes in
arrow or fastparquet? That would change things a lot.


> If doing the transform from/to various file formats is a feature you feel
> strongly about, I would suggest doing the transforms via out-of-band ETL
> jobs where the user can then request the files asynchronously later.


That's what we were thinking about, yes. We need a 'core' format to store
the data and write ETL jobs for, but secondary formats could be stored in
S3 and returned on demand.


> To your point of storing images with meta data such as tags. I haven’t
> actually tried it but I suppose you could in theory write the images in one
> Parquet binary type column and the tags in another.
>

Even then, there are different numbers of bounding boxes / tags per image.
Can you store different-length vectors in Parquet?


> Versioning is difficult and I believe there are many attempts at this right
> now. DeltaLake for example has the ability to query at dataset at a point
> in time. They basically have Parquet files with some extra json files on
> the side describing the changes.


I've looked at DeltaLake, but as far as I understand, its commit log
depends on Spark operations done on the dataframe? Hence, any change to the
dataset has to be performed via Spark? Is that correct?


> Straight up versions of file could be achieved with your underlying file
> system. S3 has file versioning.
>

Do you know of any tools to compute diffs between Parquet files? What I
could find was basically: export both files to CSV and run git diff.
DeltaLake would help here, but again, it seems that it only 'tracks' Spark
operations done directly on the file?

Thanks!
Joaquin

PS. Nick, would you like to be mentioned as a contributor in the blog post?
Your comments helped a lot to improve it ;).




On Tue,  Jun 30, 2020 at 6:46 AM Joaquin Vanschoren <j....@tue.nl>
> wrote:
>
> > Hi all,
> >
> > Sorry for restarting an old thread, but we've had a _lot_ of discussions
> > over the past 9 months or so on how to store machine learning datasets
> > internally. We've written a blog post about it and would love to hear
> your
> > thoughts:
> >
> >
> https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html
> >
> > To be clear: what we need is a data format for archival storage on the
> > server, and preferably one that supports versioning/diff, multi-table
> > storage, and sparse data.
> > Hence, this is for *internal* storage. When OpenML users want to
> download a
> > dataset in parquet or arrow we can always convert it on the fly (or from
> a
> > cache). We already use Arrow/Feather to cache the datasets after it is
> > downloaded (when possible).
> >
> > One specific concern about parquet is that we are not entirely sure
> > whether a parquet file created by one parser (e.g. in R) can always be
> read
> > by another parser (e.g. in Python). We saw some github issues related to
> > this but we don't know whether this is still an issue. Do you know? Also,
> > it seems that none of the current python parsers support partial
> > read/writes, is that correct?
> >
> > Because of these issues, we are still considering a text-based format
> (e.g.
> > CSV) for our main dataset storage, mainly because of its broad native
> > support in all languages and easy versioning/diffs (we could use
> git-lfs),
> > and use parquet/arrow for later usage where possible. We're still
> doubting
> > between CSV and Parquet, though.
> >
> > Do you have any thoughts or comments?
> >
> > Thanks!
> > Joaquin
> >
> > On Thu, 20 Jun 2019 at 23:47, Wes McKinney <we...@gmail.com> wrote:
> >
> > > hi Joaquin -- there would be no practical difference, primarily it
> > > would be for the preservation of APIs in Python and R related to the
> > > Feather format. Internally "read_feather" will invoke the same code
> > > paths as the Arrow protocol file reader
> > >
> > > - Wes
> > >
> > > On Thu, Jun 20, 2019 at 4:12 PM Joaquin Vanschoren
> > > <jo...@gmail.com> wrote:
> > > >
> > > > Thank you all for your very detailed answers! I also read in other
> > > threads
> > > > that the 1.0.0 release might be coming somewhere this fall? I'm
> really
> > > > looking forward to that.
> > > > @Wes: will there be any practical difference between Feather and
> Arrow
> > > > after the 1.0.0 release? It is just an alias? What would be the
> > benefits
> > > of
> > > > using Feather rather than Arrow at that point?
> > > >
> > > > Thanks!
> > > > Joaquin
> > > >
> > > >
> > > >
> > > > On Sun, 16 Jun 2019 at 18:25, Sebastien Binet <bi...@cern.ch> wrote:
> > > >
> > > > > hi there,
> > > > >
> > > > > On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield <
> > emkornfield@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > > *  Can Feather files already be read in Java/Go/C#/...?
> > > > > >
> > > > > > I don't know the status of feather.   The arrow file format
> should
> > be
> > > > > > readable by Java and C++ (I believe all the languages that bind
> C++
> > > also
> > > > > > support the format, these include python, ruby and R) .  A quick
> > code
> > > > > > search of the repo makes me think that there is also support for
> > C#,
> > > Rust
> > > > > > and Javascript. It doesn't look like the file format isn't
> > supported
> > > in
> > > > > Go
> > > > > > yet but it probably wouldn't be too hard to do.
> > > > > >
> > > > > Go doesn't handle Feather files.
> > > > > But there is support (not yet feature complete, see [1]) for Arrow
> > > files
> > > > > (r/w):
> > > > > -  https://godoc.org/github.com/apache/arrow/go/arrow/ipc
> > > > >
> > > > > hth,
> > > > > -s
> > > > >
> > > > > [1]: https://issues.apache.org/jira/browse/ARROW-3679
> > > > >
> > >
> >
>

Re: Arrow as a common open standard for machine learning data

Posted by Nicholas Poorman <ni...@gmail.com>.
Joaquin,

After reading your proposal I think there may be some things you may want
to consider.

It sounds like you are trying to come up with a one size fits all solution
but it may be better to define your requirements based on your needs and
environment.

For starters, where do you plan to store these files? Do you plan on
putting them in a cloud object storage like S3 or do you plan on having
disk volumes attached to servers you are managing? A format like Parquet is
going to be useful for object storage such as S3 because you bundle up
everything in memory and then write it out at once. Any append-able format
is going to require disk volumes where you have append functionality. There
are a few active projects that build commit logs and tombstones on top of
Parquet to add this functionality. For example Hudi and Databricks
DeltaLake. Also, if you plan on doing anything at scale you might run into
issues with lock contention if you choose something backed by b-trees such
as SQLite.

There are two ways you could handle the issues with Parquet implementations
that are currently unable to read partial files. One, you could contribute
back to the Parquet implementation so that it is capable of doing so. Or
two, you could partition your Parquet files and write them in smaller
chunks so they could be selectively read. I’m currently in the process of
implementing the Parquet implementation in Go so partial read functionality
is something I will consider.
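
As a rough sketch of the second option with pyarrow (the partition column
and paths are made up), writing one directory per partition value lets a
reader pull back only the chunks it needs:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "year": [2018, 2018, 2019, 2019],
        "value": [1.0, 2.0, 3.0, 4.0],
    })

    # One subdirectory per year, e.g. dataset_dir/year=2019/...
    pq.write_to_dataset(table, root_path="dataset_dir", partition_cols=["year"])

    # Selectively read a single partition
    subset = pq.ParquetDataset("dataset_dir/year=2019").read()

Keep the partitions reasonably large, though, or you run into the "small
files problem" mentioned below.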

If you plan on having a service that can read in one format and return
another format to a user, you are either going to need a format capable of
stream decoding/encoding or a whole lot of memory on the instances running
the service. Something like csv would allow for stream decoding/encoding.
Compression algorithms are for the most part going to be streaming. A
b-tree is going to be streaming as you can do something like a
breadth-first iteration over it. Parquet on the other hand is either going
to require you to write files in small partitions (this is generally bad
and Spark users refer to this as the “small files problem”), or you will
need to utilize an implementation that supports partial reads. There are
Parquet implementations in Java and Go that support partial reads. The
issue you will face is doing the streaming writes back to the user. If for
example a user wanted their data returned as Parquet you would have to do
the transformation in memory all at once and then stream it to the user. If
doing the transform from/to various file formats is a feature you feel
strongly about, I would suggest doing the transforms via out-of-band ETL
jobs where the user can then request the files asynchronously later. Doing
the transform in-band of the request / response lifecycle doesn’t seem
scalable given the constraints of some file formats and instance memory.

To your point of storing images with meta data such as tags. I haven’t
actually tried it but I suppose you could in theory write the images in one
Parquet binary type column and the tags in another.
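
A rough sketch of that idea (the byte strings and tags are placeholders):

    import pyarrow as pa
    import pyarrow.parquet as pq

    images = [b"\x89PNG...", b"\xff\xd8\xff..."]   # raw PNG / JPEG payloads
    tags = [["cat", "indoor"], ["dog"]]

    table = pa.table({
        "image": pa.array(images, type=pa.binary()),
        "tags": pa.array(tags, type=pa.list_(pa.string())),
    })
    pq.write_table(table, "images_with_tags.parquet")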

Versioning is difficult and I believe there are many attempts at this right
now. DeltaLake for example has the ability to query a dataset at a point
in time. They basically have Parquet files with some extra json files on
the side describing the changes. You first read the json files to
understand the changes and then read the Parquet files they reference.
Straightforward file versioning could be achieved with your underlying file
system. S3 has file versioning, Docker has its own internal delta changes
file system layer, etc..

I would not recommend storing the files in Feather for long-term storage as
your file size and costs are going to explode compared to a column-oriented
format that supports compression.

Best,
Nick


On Tue, Jun 30, 2020 at 6:46 AM Joaquin Vanschoren <j....@tue.nl>
wrote:

> Hi all,
>
> Sorry for restarting an old thread, but we've had a _lot_ of discussions
> over the past 9 months or so on how to store machine learning datasets
> internally. We've written a blog post about it and would love to hear your
> thoughts:
>
> https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html
>
> To be clear: what we need is a data format for archival storage on the
> server, and preferably one that supports versioning/diff, multi-table
> storage, and sparse data.
> Hence, this is for *internal* storage. When OpenML users want to download a
> dataset in parquet or arrow we can always convert it on the fly (or from a
> cache). We already use Arrow/Feather to cache the datasets after it is
> downloaded (when possible).
>
> One specific concern about parquet is that we are not entirely sure
> whether a parquet file created by one parser (e.g. in R) can always be read
> by another parser (e.g. in Python). We saw some github issues related to
> this but we don't know whether this is still an issue. Do you know? Also,
> it seems that none of the current python parsers support partial
> read/writes, is that correct?
>
> Because of these issues, we are still considering a text-based format (e.g.
> CSV) for our main dataset storage, mainly because of its broad native
> support in all languages and easy versioning/diffs (we could use git-lfs),
> and use parquet/arrow for later usage where possible. We're still doubting
> between CSV and Parquet, though.
>
> Do you have any thoughts or comments?
>
> Thanks!
> Joaquin
>
> On Thu, 20 Jun 2019 at 23:47, Wes McKinney <we...@gmail.com> wrote:
>
> > hi Joaquin -- there would be no practical difference, primarily it
> > would be for the preservation of APIs in Python and R related to the
> > Feather format. Internally "read_feather" will invoke the same code
> > paths as the Arrow protocol file reader
> >
> > - Wes
> >
> > On Thu, Jun 20, 2019 at 4:12 PM Joaquin Vanschoren
> > <jo...@gmail.com> wrote:
> > >
> > > Thank you all for your very detailed answers! I also read in other
> > threads
> > > that the 1.0.0 release might be coming somewhere this fall? I'm really
> > > looking forward to that.
> > > @Wes: will there be any practical difference between Feather and Arrow
> > > after the 1.0.0 release? It is just an alias? What would be the
> benefits
> > of
> > > using Feather rather than Arrow at that point?
> > >
> > > Thanks!
> > > Joaquin
> > >
> > >
> > >
> > > On Sun, 16 Jun 2019 at 18:25, Sebastien Binet <bi...@cern.ch> wrote:
> > >
> > > > hi there,
> > > >
> > > > On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield <
> emkornfield@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > > *  Can Feather files already be read in Java/Go/C#/...?
> > > > >
> > > > > I don't know the status of feather.   The arrow file format should
> be
> > > > > readable by Java and C++ (I believe all the languages that bind C++
> > also
> > > > > support the format, these include python, ruby and R) .  A quick
> code
> > > > > search of the repo makes me think that there is also support for
> C#,
> > Rust
> > > > > and Javascript. It doesn't look like the file format isn't
> supported
> > in
> > > > Go
> > > > > yet but it probably wouldn't be too hard to do.
> > > > >
> > > > Go doesn't handle Feather files.
> > > > But there is support (not yet feature complete, see [1]) for Arrow
> > files
> > > > (r/w):
> > > > -  https://godoc.org/github.com/apache/arrow/go/arrow/ipc
> > > >
> > > > hth,
> > > > -s
> > > >
> > > > [1]: https://issues.apache.org/jira/browse/ARROW-3679
> > > >
> >
>

Re: Arrow as a common open standard for machine learning data

Posted by Joaquin Vanschoren <j....@tue.nl>.
Hi all,

Sorry for restarting an old thread, but we've had a _lot_ of discussions
over the past 9 months or so on how to store machine learning datasets
internally. We've written a blog post about it and would love to hear your
thoughts:
https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html

To be clear: what we need is a data format for archival storage on the
server, and preferably one that supports versioning/diff, multi-table
storage, and sparse data.
Hence, this is for *internal* storage. When OpenML users want to download a
dataset in parquet or arrow we can always convert it on the fly (or from a
cache). We already use Arrow/Feather to cache the datasets after they are
downloaded (when possible).

One specific concern about parquet is that we are not entirely sure
whether a parquet file created by one parser (e.g. in R) can always be read
by another parser (e.g. in Python). We saw some github issues related to
this but we don't know whether this is still an issue. Do you know? Also,
it seems that none of the current python parsers support partial
read/writes, is that correct?

Because of these issues, we are still considering a text-based format (e.g.
CSV) for our main dataset storage, mainly because of its broad native
support in all languages and easy versioning/diffs (we could use git-lfs),
and use parquet/arrow for later usage where possible. We're still doubting
between CSV and Parquet, though.

Do you have any thoughts or comments?

Thanks!
Joaquin

On Thu, 20 Jun 2019 at 23:47, Wes McKinney <we...@gmail.com> wrote:

> hi Joaquin -- there would be no practical difference, primarily it
> would be for the preservation of APIs in Python and R related to the
> Feather format. Internally "read_feather" will invoke the same code
> paths as the Arrow protocol file reader
>
> - Wes
>
> On Thu, Jun 20, 2019 at 4:12 PM Joaquin Vanschoren
> <jo...@gmail.com> wrote:
> >
> > Thank you all for your very detailed answers! I also read in other
> threads
> > that the 1.0.0 release might be coming somewhere this fall? I'm really
> > looking forward to that.
> > @Wes: will there be any practical difference between Feather and Arrow
> > after the 1.0.0 release? It is just an alias? What would be the benefits
> of
> > using Feather rather than Arrow at that point?
> >
> > Thanks!
> > Joaquin
> >
> >
> >
> > On Sun, 16 Jun 2019 at 18:25, Sebastien Binet <bi...@cern.ch> wrote:
> >
> > > hi there,
> > >
> > > On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield <emkornfield@gmail.com
> >
> > > wrote:
> > >
> > > > > *  Can Feather files already be read in Java/Go/C#/...?
> > > >
> > > > I don't know the status of feather.   The arrow file format should be
> > > > readable by Java and C++ (I believe all the languages that bind C++
> also
> > > > support the format, these include python, ruby and R) .  A quick code
> > > > search of the repo makes me think that there is also support for C#,
> Rust
> > > > and Javascript. It doesn't look like the file format isn't supported
> in
> > > Go
> > > > yet but it probably wouldn't be too hard to do.
> > > >
> > > Go doesn't handle Feather files.
> > > But there is support (not yet feature complete, see [1]) for Arrow
> files
> > > (r/w):
> > > -  https://godoc.org/github.com/apache/arrow/go/arrow/ipc
> > >
> > > hth,
> > > -s
> > >
> > > [1]: https://issues.apache.org/jira/browse/ARROW-3679
> > >
>

Re: Arrow as a common open standard for machine learning data

Posted by Wes McKinney <we...@gmail.com>.
hi Joaquin -- there would be no practical difference, primarily it
would be for the preservation of APIs in Python and R related to the
Feather format. Internally "read_feather" will invoke the same code
paths as the Arrow protocol file reader

- Wes

On Thu, Jun 20, 2019 at 4:12 PM Joaquin Vanschoren
<jo...@gmail.com> wrote:
>
> Thank you all for your very detailed answers! I also read in other threads
> that the 1.0.0 release might be coming somewhere this fall? I'm really
> looking forward to that.
> @Wes: will there be any practical difference between Feather and Arrow
> after the 1.0.0 release? It is just an alias? What would be the benefits of
> using Feather rather than Arrow at that point?
>
> Thanks!
> Joaquin
>
>
>
> On Sun, 16 Jun 2019 at 18:25, Sebastien Binet <bi...@cern.ch> wrote:
>
> > hi there,
> >
> > On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > > *  Can Feather files already be read in Java/Go/C#/...?
> > >
> > > I don't know the status of feather.   The arrow file format should be
> > > readable by Java and C++ (I believe all the languages that bind C++ also
> > > support the format, these include python, ruby and R) .  A quick code
> > > search of the repo makes me think that there is also support for C#, Rust
> > > and Javascript. It doesn't look like the file format isn't supported in
> > Go
> > > yet but it probably wouldn't be too hard to do.
> > >
> > Go doesn't handle Feather files.
> > But there is support (not yet feature complete, see [1]) for Arrow files
> > (r/w):
> > -  https://godoc.org/github.com/apache/arrow/go/arrow/ipc
> >
> > hth,
> > -s
> >
> > [1]: https://issues.apache.org/jira/browse/ARROW-3679
> >

Re: Arrow as a common open standard for machine learning data

Posted by Joaquin Vanschoren <jo...@gmail.com>.
Thank you all for your very detailed answers! I also read in other threads
that the 1.0.0 release might be coming somewhere this fall? I'm really
looking forward to that.
@Wes: will there be any practical difference between Feather and Arrow
after the 1.0.0 release? It is just an alias? What would be the benefits of
using Feather rather than Arrow at that point?

Thanks!
Joaquin



On Sun, 16 Jun 2019 at 18:25, Sebastien Binet <bi...@cern.ch> wrote:

> hi there,
>
> On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > > *  Can Feather files already be read in Java/Go/C#/...?
> >
> > I don't know the status of feather.   The arrow file format should be
> > readable by Java and C++ (I believe all the languages that bind C++ also
> > support the format, these include python, ruby and R) .  A quick code
> > search of the repo makes me think that there is also support for C#, Rust
> > and Javascript. It doesn't look like the file format isn't supported in
> Go
> > yet but it probably wouldn't be too hard to do.
> >
> Go doesn't handle Feather files.
> But there is support (not yet feature complete, see [1]) for Arrow files
> (r/w):
> -  https://godoc.org/github.com/apache/arrow/go/arrow/ipc
>
> hth,
> -s
>
> [1]: https://issues.apache.org/jira/browse/ARROW-3679
>

Re: Arrow as a common open standard for machine learning data

Posted by Sebastien Binet <bi...@cern.ch>.
hi there,

On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield <em...@gmail.com>
wrote:

> > *  Can Feather files already be read in Java/Go/C#/...?
>
> I don't know the status of feather.   The arrow file format should be
> readable by Java and C++ (I believe all the languages that bind C++ also
> support the format, these include python, ruby and R) .  A quick code
> search of the repo makes me think that there is also support for C#, Rust
> and Javascript. It doesn't look like the file format isn't supported in Go
> yet but it probably wouldn't be too hard to do.
>
Go doesn't handle Feather files.
But there is support (not yet feature complete, see [1]) for Arrow files
(r/w):
-  https://godoc.org/github.com/apache/arrow/go/arrow/ipc

hth,
-s

[1]: https://issues.apache.org/jira/browse/ARROW-3679

Re: Arrow as a common open standard for machine learning data

Posted by Wes McKinney <we...@gmail.com>.
hi Micah and Joaquin,

With regards to the Feather format, I have been waiting a _long_ time
for the R community to "catch up" with Apache Arrow development and
get a release of an Arrow R project out that can be installed by most
R users. We are finally approaching that point, and so Feather
development has been in a holding pattern for more than 3 years as a
result of this.

Since Feather is popular in practice, my idea has been to preserve the
file format name and have it be a simple container around the Arrow
IPC file format. So Feather would become a stable binary format once
we release a 1.0.0 protocol version. If the goal is to have a stable
memory-mappable binary format, then at that point Feather is something
I'd recommend.  If the Arrow protocol acquires compression, then
Feather files will get compression. My plan is to conduct this Feather
evolution after the 0.14.0 release
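
For reference, a rough sketch of the memory-mapping this enables with a
recent pyarrow (the file name and table contents are illustrative):

    import pyarrow as pa
    import pyarrow.ipc as ipc

    table = pa.table({"x": [1, 2, 3]})

    # Write the Arrow IPC file format that Feather becomes a container around
    with pa.OSFile("data.arrow", "wb") as sink:
        with ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Memory-map it back; record batches reference the mapped region directly
    with pa.memory_map("data.arrow", "r") as source:
        loaded = ipc.open_file(source).read_all()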

I would recommend using Parquet files in general without hesitation,
though they trade deserialization cost for compactness. Arrow and
Parquet are designed to be used together.

- Wes

On Sat, Jun 15, 2019 at 11:07 PM Micah Kornfield <em...@gmail.com> wrote:
>
> Hi Joaquin,
> Answers inline:
>
> Thanks, that explains the arrow-parquet relationship very nicely.
> > So, at the moment you would recommend Parquet for any form of archival
> > storage, right?
>
> Yes Parquet should be used as an archival format.
>
> * Is Feather a good choice for long-term storage (is the binary format
> > stable)?
>
> It is worth mentioning that the Arrow file format and Feather format are
> not the same thing.  My understanding is feather is not being actively
> developed and the idea is it will be deprecated once there is wider support
> for the Arrow file format.
>
> * What meta-data is stored? Are the column names and data types always
> > stored? For categorical columns, can I store the named categories/levels?
> > Is there a way to append additional meta-data, or is it best to store that
> > in a separate file (e.g. json)?
>
> The Arrow file format (
> https://arrow.apache.org/docs/format/IPC.html#file-format) always has a
> schema as the first message which denotes column names and data types.
> Categorical columns can be supported via dictionary encoding.  Custom
> metadata is support at the Schema (
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L323), Column
> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L291) and
> batch level (
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L98).  There
> was also a recent proposal to add it to the Footer of the file as well.
>
>
> > * What is the status of support for sparse data? Can I store large sparse
> > datasets efficiently? I noticed sparse_tensor in Arrow. Is it available in
> > Feather/Parquet?
>
> I don't know the answer to this for Feather or parquet.  Currently Arrow
> doesn't support Sparse Data other then the sparse tensor your mentioned
> (there has been some discussion on adding more support for it it the past
> but not enough developer bandwidth to follow-up on it.).
>
> * What is the status of compression for Feather? We have some datasets that
> > are quite large (several GB), and we'd like to onboard datasets like
> > Imagenet which are 130GB in TFRecord format (but I read that parquet can
> > store it in about 40GB).
>
> I don't know about feather, but compression is not directly supported
> within Arrow format at the moment (it has the same status as sparseness,
> i.e. it has been discussed but nobody has worked on it).  You can always
> compress the entire file externally though.
>
> * Would it make more sense to use both Parquet and Feather, depending on
> > the dataset size or dimensionality? If so, what would be a good
> > trade-off/threshold in that case?
>
> See above, for archival purposes Parquet is still probably preferred.
>
>
> > * Most of our datasets are standard dataframes, but some are also
> > collections of images or texts. I guess we have to 'manually' convert those
> > to dataframes first, right? Or do you know of existing tools to facilitate
> > this?
>
> Without more details I would guess you would need to manually convert these
> to dataframes frist.
>
>
> > *  Can Feather files already be read in Java/Go/C#/...?
>
> I don't know the status of feather.   The arrow file format should be
> readable by Java and C++ (I believe all the languages that bind C++ also
> support the format, these include python, ruby and R) .  A quick code
> search of the repo makes me think that there is also support for C#, Rust
> and Javascript. It doesn't look like the file format isn't supported in Go
> yet but it probably wouldn't be too hard to do.
>
> Thanks,
> Micah
>
> On Wed, Jun 12, 2019 at 12:02 PM Joaquin Vanschoren <
> joaquin.vanschoren@gmail.com> wrote:
>
> > Hi Neal,
> >
> > Thanks, that explains the arrow-parquet relationship very nicely.
> > So, at the moment you would recommend Parquet for any form of archival
> > storage, right?
> > We could also experiment with storing data as both Parquet and Arrow for
> > now.
> >
> > Still curious about the other questions, like meta-data, sparse data,
> > Feather support, etc.
> >
> > Cheers,
> > Joaquin
> >
> >
> >
> >
> > On Wed, 12 Jun 2019 at 20:25, Neal Richardson <neal.p.richardson@gmail.com
> > >
> > wrote:
> >
> > > Hi Joaquin,
> > > I recognize that this doesn't answer all of your questions, but we are in
> > > the process of adding a FAQ to the arrow.apache.org website that speaks
> > to
> > > some of them: https://github.com/apache/arrow/blob/master/site/faq.md
> > >
> > > Neal
> > >
> > > On Wed, Jun 12, 2019 at 3:39 AM Joaquin Vanschoren <
> > > joaquin.vanschoren@gmail.com> wrote:
> > >
> > > > Dear all,
> > > >
> > > > Thanks for creating Arrow! I'm part of OpenML.org, an open source
> > > > initiative/platform for sharing machine learning datasets and models.
> > We
> > > > are currently storing data in either ARFF or Parquet, but are looking
> > > into
> > > > whether e.g. Feather or a mix of Feather and Parquet could be the new
> > > > standard for all(?) our datasets (currently about 20000 of them). We
> > had
> > > a
> > > > few questions though, and would definitely like to hear your opinion.
> > > > Apologies in advance if there were recent announcements about these
> > that
> > > I
> > > > missed.
> > > >
> > > > * Is Feather a good choice for long-term storage (is the binary format
> > > > stable)?
> > > > * What meta-data is stored? Are the column names and data types always
> > > > stored? For categorical columns, can I store the named
> > categories/levels?
> > > > Is there a way to append additional meta-data, or is it best to store
> > > that
> > > > in a separate file (e.g. json)?
> > > > * What is the status of support for sparse data? Can I store large
> > sparse
> > > > datasets efficiently? I noticed sparse_tensor in Arrow. Is it available
> > > in
> > > > Feather/Parquet?
> > > > * What is the status of compression for Feather? We have some datasets
> > > that
> > > > are quite large (several GB), and we'd like to onboard datasets like
> > > > Imagenet which are 130GB in TFRecord format (but I read that parquet
> > can
> > > > store it in about 40GB).
> > > > * Would it make more sense to use both Parquet and Feather, depending
> > on
> > > > the dataset size or dimensionality? If so, what would be a good
> > > > trade-off/threshold in that case?
> > > > * Most of our datasets are standard dataframes, but some are also
> > > > collections of images or texts. I guess we have to 'manually' convert
> > > those
> > > > to dataframes first, right? Or do you know of existing tools to
> > > facilitate
> > > > this?
> > > > *  Can Feather files already be read in Java/Go/C#/...?
> > > >
> > > > Thanks!
> > > > Joaquin
> > > >
> > >
> >

Re: Arrow as a common open standard for machine learning data

Posted by Micah Kornfield <em...@gmail.com>.
Hi Joaquin,
Answers inline:

Thanks, that explains the arrow-parquet relationship very nicely.
> So, at the moment you would recommend Parquet for any form of archival
> storage, right?

Yes Parquet should be used as an archival format.

* Is Feather a good choice for long-term storage (is the binary format
> stable)?

It is worth mentioning that the Arrow file format and the Feather format are
not the same thing.  My understanding is that Feather is not being actively
developed and the idea is that it will be deprecated once there is wider
support for the Arrow file format.
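
To make the distinction concrete, both can be written from Python today but
through different entry points. A rough sketch (assuming pyarrow; the exact
APIs may differ slightly between versions):

import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather

df = pd.DataFrame({"x": [1.0, 2.0], "y": ["a", "b"]})
table = pa.Table.from_pandas(df)

# Feather (the original, pandas-oriented format).
feather.write_feather(df, "example.feather")

# Arrow IPC file format (the random-access "file" variant of the IPC protocol).
with pa.OSFile("example.arrow", "wb") as sink:
    writer = pa.RecordBatchFileWriter(sink, table.schema)
    writer.write_table(table)
    writer.close()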

* What meta-data is stored? Are the column names and data types always
> stored? For categorical columns, can I store the named categories/levels?
> Is there a way to append additional meta-data, or is it best to store that
> in a separate file (e.g. json)?

The Arrow file format (
https://arrow.apache.org/docs/format/IPC.html#file-format) always has a
schema as the first message which denotes column names and data types.
Categorical columns can be supported via dictionary encoding.  Custom
metadata is supported at the Schema (
https://github.com/apache/arrow/blob/master/format/Schema.fbs#L323), Column
(https://github.com/apache/arrow/blob/master/format/Schema.fbs#L291) and
batch level (
https://github.com/apache/arrow/blob/master/format/Message.fbs#L98).  There
was also a recent proposal to add it to the Footer of the file as well.
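
As a rough illustration with pyarrow (the metadata keys below are invented
for the example, and the metadata API has changed a bit between versions):

import pyarrow as pa

# Dictionary-encode a categorical column: the named categories/levels are
# stored once and values reference them by integer index.
colors = pa.array(["red", "green", "red", "blue"]).dictionary_encode()
table = pa.table({"color": colors})

# Attach custom key/value metadata at the schema level.
table = table.replace_schema_metadata({"openml_dataset_id": "42",
                                       "license": "CC-BY"})
print(table.schema.metadata)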


> * What is the status of support for sparse data? Can I store large sparse
> datasets efficiently? I noticed sparse_tensor in Arrow. Is it available in
> Feather/Parquet?

I don't know the answer to this for Feather or Parquet.  Currently Arrow
doesn't support sparse data other than the sparse tensor you mentioned
(there has been some discussion on adding more support for it in the past
but not enough developer bandwidth to follow up on it).

* What is the status of compression for Feather? We have some datasets that
> are quite large (several GB), and we'd like to onboard datasets like
> Imagenet which are 130GB in TFRecord format (but I read that parquet can
> store it in about 40GB).

I don't know about Feather, but compression is not directly supported
within the Arrow format at the moment (it has the same status as sparseness,
i.e. it has been discussed but nobody has worked on it).  You can always
compress the entire file externally though.
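
For instance, a small sketch of gzip-compressing an already-written Arrow
file from Python (file names are placeholders); the trade-off is that you
lose the ability to memory-map the file until you decompress it again:

import gzip
import shutil

# Compress an existing Arrow IPC file with gzip.
with open("dataset.arrow", "rb") as src, gzip.open("dataset.arrow.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

# Decompress before handing it back to an Arrow reader.
with gzip.open("dataset.arrow.gz", "rb") as src, open("dataset.arrow", "wb") as dst:
    shutil.copyfileobj(src, dst)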

* Would it make more sense to use both Parquet and Feather, depending on
> the dataset size or dimensionality? If so, what would be a good
> trade-off/threshold in that case?

See above, for archival purposes Parquet is still probably preferred.


> * Most of our datasets are standard dataframes, but some are also
> collections of images or texts. I guess we have to 'manually' convert those
> to dataframes first, right? Or do you know of existing tools to facilitate
> this?

Without more details, I would guess you would need to manually convert
these to dataframes first.
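
One rough way to do that by hand is to pack each file into a binary column
alongside an identifier, roughly like this (paths and column names are made
up for illustration):

import os
import pyarrow as pa
import pyarrow.parquet as pq

# Pack a directory of image files into a two-column table: file name + raw bytes.
image_dir = "images/"
names, blobs = [], []
for fname in sorted(os.listdir(image_dir)):
    with open(os.path.join(image_dir, fname), "rb") as f:
        names.append(fname)
        blobs.append(f.read())

table = pa.table({"file_name": pa.array(names, pa.string()),
                  "image": pa.array(blobs, pa.binary())})
pq.write_table(table, "images.parquet")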


> *  Can Feather files already be read in Java/Go/C#/...?

I don't know the status of Feather.  The Arrow file format should be
readable by Java and C++ (I believe all the languages that bind C++ also
support the format; these include Python, Ruby and R).  A quick code
search of the repo makes me think that there is also support for C#, Rust
and JavaScript. It doesn't look like the file format is supported in Go
yet, but it probably wouldn't be too hard to add.

Thanks,
Micah

On Wed, Jun 12, 2019 at 12:02 PM Joaquin Vanschoren <
joaquin.vanschoren@gmail.com> wrote:

> Hi Neal,
>
> Thanks, that explains the arrow-parquet relationship very nicely.
> So, at the moment you would recommend Parquet for any form of archival
> storage, right?
> We could also experiment with storing data as both Parquet and Arrow for
> now.
>
> Still curious about the other questions, like meta-data, sparse data,
> Feather support, etc.
>
> Cheers,
> Joaquin
>
>
>
>
> On Wed, 12 Jun 2019 at 20:25, Neal Richardson <neal.p.richardson@gmail.com
> >
> wrote:
>
> > Hi Joaquin,
> > I recognize that this doesn't answer all of your questions, but we are in
> > the process of adding a FAQ to the arrow.apache.org website that speaks
> to
> > some of them: https://github.com/apache/arrow/blob/master/site/faq.md
> >
> > Neal
> >
> > On Wed, Jun 12, 2019 at 3:39 AM Joaquin Vanschoren <
> > joaquin.vanschoren@gmail.com> wrote:
> >
> > > Dear all,
> > >
> > > Thanks for creating Arrow! I'm part of OpenML.org, an open source
> > > initiative/platform for sharing machine learning datasets and models.
> We
> > > are currently storing data in either ARFF or Parquet, but are looking
> > into
> > > whether e.g. Feather or a mix of Feather and Parquet could be the new
> > > standard for all(?) our datasets (currently about 20000 of them). We
> had
> > a
> > > few questions though, and would definitely like to hear your opinion.
> > > Apologies in advance if there were recent announcements about these
> that
> > I
> > > missed.
> > >
> > > * Is Feather a good choice for long-term storage (is the binary format
> > > stable)?
> > > * What meta-data is stored? Are the column names and data types always
> > > stored? For categorical columns, can I store the named
> categories/levels?
> > > Is there a way to append additional meta-data, or is it best to store
> > that
> > > in a separate file (e.g. json)?
> > > * What is the status of support for sparse data? Can I store large
> sparse
> > > datasets efficiently? I noticed sparse_tensor in Arrow. Is it available
> > in
> > > Feather/Parquet?
> > > * What is the status of compression for Feather? We have some datasets
> > that
> > > are quite large (several GB), and we'd like to onboard datasets like
> > > Imagenet which are 130GB in TFRecord format (but I read that parquet
> can
> > > store it in about 40GB).
> > > * Would it make more sense to use both Parquet and Feather, depending
> on
> > > the dataset size or dimensionality? If so, what would be a good
> > > trade-off/threshold in that case?
> > > * Most of our datasets are standard dataframes, but some are also
> > > collections of images or texts. I guess we have to 'manually' convert
> > those
> > > to dataframes first, right? Or do you know of existing tools to
> > facilitate
> > > this?
> > > *  Can Feather files already be read in Java/Go/C#/...?
> > >
> > > Thanks!
> > > Joaquin
> > >
> >
>

Re: Arrow as a common open standard for machine learning data

Posted by Joaquin Vanschoren <jo...@gmail.com>.
Hi Neal,

Thanks, that explains the arrow-parquet relationship very nicely.
So, at the moment you would recommend Parquet for any form of archival
storage, right?
We could also experiment with storing data as both Parquet and Arrow for
now.

Still curious about the other questions, like meta-data, sparse data,
Feather support, etc.

Cheers,
Joaquin




On Wed, 12 Jun 2019 at 20:25, Neal Richardson <ne...@gmail.com>
wrote:

> Hi Joaquin,
> I recognize that this doesn't answer all of your questions, but we are in
> the process of adding a FAQ to the arrow.apache.org website that speaks to
> some of them: https://github.com/apache/arrow/blob/master/site/faq.md
>
> Neal
>
> On Wed, Jun 12, 2019 at 3:39 AM Joaquin Vanschoren <
> joaquin.vanschoren@gmail.com> wrote:
>
> > Dear all,
> >
> > Thanks for creating Arrow! I'm part of OpenML.org, an open source
> > initiative/platform for sharing machine learning datasets and models. We
> > are currently storing data in either ARFF or Parquet, but are looking
> into
> > whether e.g. Feather or a mix of Feather and Parquet could be the new
> > standard for all(?) our datasets (currently about 20000 of them). We had
> a
> > few questions though, and would definitely like to hear your opinion.
> > Apologies in advance if there were recent announcements about these that
> I
> > missed.
> >
> > * Is Feather a good choice for long-term storage (is the binary format
> > stable)?
> > * What meta-data is stored? Are the column names and data types always
> > stored? For categorical columns, can I store the named categories/levels?
> > Is there a way to append additional meta-data, or is it best to store
> that
> > in a separate file (e.g. json)?
> > * What is the status of support for sparse data? Can I store large sparse
> > datasets efficiently? I noticed sparse_tensor in Arrow. Is it available
> in
> > Feather/Parquet?
> > * What is the status of compression for Feather? We have some datasets
> that
> > are quite large (several GB), and we'd like to onboard datasets like
> > Imagenet which are 130GB in TFRecord format (but I read that parquet can
> > store it in about 40GB).
> > * Would it make more sense to use both Parquet and Feather, depending on
> > the dataset size or dimensionality? If so, what would be a good
> > trade-off/threshold in that case?
> > * Most of our datasets are standard dataframes, but some are also
> > collections of images or texts. I guess we have to 'manually' convert
> those
> > to dataframes first, right? Or do you know of existing tools to
> facilitate
> > this?
> > *  Can Feather files already be read in Java/Go/C#/...?
> >
> > Thanks!
> > Joaquin
> >
>

Re: Arrow as a common open standard for machine learning data

Posted by Neal Richardson <ne...@gmail.com>.
Hi Joaquin,
I recognize that this doesn't answer all of your questions, but we are in
the process of adding a FAQ to the arrow.apache.org website that speaks to
some of them: https://github.com/apache/arrow/blob/master/site/faq.md

Neal

On Wed, Jun 12, 2019 at 3:39 AM Joaquin Vanschoren <
joaquin.vanschoren@gmail.com> wrote:

> Dear all,
>
> Thanks for creating Arrow! I'm part of OpenML.org, an open source
> initiative/platform for sharing machine learning datasets and models. We
> are currently storing data in either ARFF or Parquet, but are looking into
> whether e.g. Feather or a mix of Feather and Parquet could be the new
> standard for all(?) our datasets (currently about 20000 of them). We had a
> few questions though, and would definitely like to hear your opinion.
> Apologies in advance if there were recent announcements about these that I
> missed.
>
> * Is Feather a good choice for long-term storage (is the binary format
> stable)?
> * What meta-data is stored? Are the column names and data types always
> stored? For categorical columns, can I store the named categories/levels?
> Is there a way to append additional meta-data, or is it best to store that
> in a separate file (e.g. json)?
> * What is the status of support for sparse data? Can I store large sparse
> datasets efficiently? I noticed sparse_tensor in Arrow. Is it available in
> Feather/Parquet?
> * What is the status of compression for Feather? We have some datasets that
> are quite large (several GB), and we'd like to onboard datasets like
> Imagenet which are 130GB in TFRecord format (but I read that parquet can
> store it in about 40GB).
> * Would it make more sense to use both Parquet and Feather, depending on
> the dataset size or dimensionality? If so, what would be a good
> trade-off/threshold in that case?
> * Most of our datasets are standard dataframes, but some are also
> collections of images or texts. I guess we have to 'manually' convert those
> to dataframes first, right? Or do you know of existing tools to facilitate
> this?
> *  Can Feather files already be read in Java/Go/C#/...?
>
> Thanks!
> Joaquin
>