Posted to dev@arrow.apache.org by Ivo Jimenez <iv...@gmail.com> on 2020/08/27 05:02:07 UTC

Arrow Dataset API on Ceph

Dear Arrow community,

We are writing to share our thoughts on designing an Apache Arrow-native
storage system that leverages Ceph’s extensibility mechanism, as part of the
SkyhookDM <http://skyhookdm.com> project. We aim for a design that
leverages Arrow as much as possible, both in the client API and within the
storage system (as an embedded library). We would really appreciate your
feedback and guidance.

We are planning to write a Dataset API implementation on top of the Ceph
RADOS API (see high-level design here
<https://docs.google.com/document/d/1bd7FW2TNaIy-nE5OfI1rUYEcUM5F9h3t4excfpuJr3g>).
In case you are not familiar with RADOS, librados is a low-level object
storage interface that provides primitives to store single, non-striped
objects in Ceph. It is the underlying layer for Ceph’s S3 implementation
(RGW), as well as the block device (RBD) and POSIX file system interface
(CephFS).

One salient feature of librados is its “CLS” plugin mechanism, which allows
developers to extend the base functionality of objects, in the true
object-oriented programming sense of the word “object”. A base RADOS object
class can be extended (see here
<https://makedist.com/files/cephalocon19-objclass-dev.pdf> for more
details) so that object instances are augmented with storage-side
compute capabilities.

Our plan is to implement Arrow-native functionality on librados by creating
a class, arrow::dataset::RadosFormat, derived from the
arrow::dataset::FileFormat base class (see the high-level diagram in the doc
linked previously). Similar to how the CSV, Parquet and IPC formats are
currently implemented, the new implementation (arrow/dataset/file_rados.*
files) will consist of ScanTask, ScanTaskIterator and FileFormat
implementations that send operations to (and receive results from) Ceph
storage nodes using librados. On the storage side, the cls_arrow extension
will embed the C++ Arrow library in order to provide all the supported
features of the Dataset API (filtering, projecting, etc.), using arrow::ipc
to operate on record batches.

Our main concern is that this new arrow::dataset::RadosFormat class will be
deriving from the arrow::dataset::FileFormat class, which seems to raise a
conceptual mismatch as there isn’t really a RADOS format but rather a
formatting/serialization deferral that will be taking place, effectively
introducing a new client-server layer in the Dataset API.

We realize that this might be forcing the Dataset API to accommodate
something it wasn’t originally designed to do. An alternative would be to
extend arrow::fs; but, to the best of our knowledge, expressions, filters,
etc. are not visible at that level, which deals only with bytestreams. We
looked at arrow::flight as well, but this also seems inappropriate given
that Ceph already provides the messaging capabilities for communicating
with its own storage backend.

We look forward to your comments.

Many thanks!

Re: Arrow Dataset API on Ceph

Posted by Ivo Jimenez <iv...@ucsc.edu.INVALID>.
Thanks a lot for your replies Micah and Antoine, really appreciate it!

On 2020/09/15 18:06:56, Micah Kornfield <em...@gmail.com> wrote: 
> gmock is already a dependency.  We haven't upgraded gmock/gtest in a while,
> we might want to consider doing that (but this is orthogonal).
> [...]

Re: Arrow Dataset API on Ceph

Posted by Micah Kornfield <em...@gmail.com>.
gmock is already a dependency.  We haven't upgraded gmock/gtest in a while,
we might want to consider doing that (but this is orthogonal).

On Tue, Sep 15, 2020 at 10:16 AM Antoine Pitrou <an...@python.org> wrote:

>
> Hi Ivo,
>
> You can open a JIRA once you've got a PR ready.  No need to do it before
> you think you're ready for submission.
>
> AFAIK, gmock is already a dependency.
>
> Regards
>
> Antoine.
>
> [...]

Re: Arrow Dataset API on Ceph

Posted by Antoine Pitrou <an...@python.org>.
Hi Ivo,

You can open a JIRA once you've got a PR ready.  No need to do it before
you think you're ready for submission.

AFAIK, gmock is already a dependency.

Regards

Antoine.



Le 15/09/2020 à 18:49, Ivo Jimenez a écrit :
> Hi again,
> 
> We noticed in the contribution guidelines that there needs to be an issue for every PR in JIRA. Should we open one for the eventual PR for the work we're doing on implementing the dataset on Ceph's RADOS?
> 
> Also, on a related note, we would like to mock the RADOS client so that we can integrate it in CI tests. Would it be OK to include gmock as a dependency?
> 
> thanks!
> 
> On 2020/09/02 22:05:51, Ivo Jimenez <iv...@gmail.com> wrote: 
>> [...]

Re: Arrow Dataset API on Ceph

Posted by Ivo Jimenez <iv...@gmail.com>.
Hi again,

We noticed in the contribution guidelines that there needs to be an issue for every PR in JIRA. Should we open one for the eventual PR for the work we're doing on implementing the dataset on Ceph's RADOS?

Also, on a related note, we would like to mock the RADOS client so that we can integrate it in CI tests. Would it be OK to include gmock as a dependency?

thanks!

On 2020/09/02 22:05:51, Ivo Jimenez <iv...@gmail.com> wrote: 
> [...]

Re: Arrow Dataset API on Ceph

Posted by Ivo Jimenez <iv...@gmail.com>.
Hi Ben,


> > Our main concern is that this new arrow::dataset::RadosFormat class will
> > be deriving from the arrow::dataset::FileFormat class, which seems to
> > raise a conceptual mismatch as there isn’t really a RADOS format
>
> IIUC RADOS doesn't interact with a filesystem directly, so RadosFileFormat
> would indeed be a conceptually problematic point of extension. If a RADOS
> file system is not viable then I think the ideal approach would be to
> directly implement the Fragment [1] and Dataset [2] interfaces, forgoing a
> FileFormat implementation altogether. Unfortunately the only example we
> have of this approach is InMemoryFragment, which simply wraps a vector of
> record batches.
>

This is what we will go with, as this seems to be the quickest way for us
to have a PoC and start experimenting with this.

Thanks a lot for the invaluable feedback! 🙏

Re: Arrow Dataset API on Ceph

Posted by Ben Kietzman <be...@rstudio.com>.
> as far as we can tell, this filesystem layer
> is unaware of expressions, record batches, etc

You're correct that the filesystem layer doesn't directly support
Expressions. However the datasets API includes the Partitioning classes,
which embed expressions in paths. Depending on what expressions etc you
need to embed, you could implement a RadosFileSystem class which wraps an
IO context and treats object names as paths. If the RADOS objects contain
arrow-formatted data, then a FileSystemDataset (using IpcFileFormat) can be
constructed which views the IO context and exploits the partitioning
information embedded in object names to accelerate filtering. Does that
accommodate your use case?

> Our main concern is that this new arrow::dataset::RadosFormat class will
> be deriving from the arrow::dataset::FileFormat class, which seems to
> raise a conceptual mismatch as there isn’t really a RADOS format

IIUC RADOS doesn't interact with a filesystem directly, so RadosFileFormat
would indeed be a conceptually problematic point of extension. If a RADOS
file system is not viable then I think the ideal approach would be to
directly implement the Fragment [1] and Dataset [2] interfaces, forgoing a
FileFormat implementation altogether. Unfortunately the only example we
have of this approach is InMemoryFragment, which simply wraps a vector of
record batches.

[1]
https://github.com/apache/arrow/blob/975f4eb/cpp/src/arrow/dataset/dataset.h#L45-L90
[2]
https://github.com/apache/arrow/blob/975f4eb/cpp/src/arrow/dataset/dataset.h#L119-L158


On Fri, Aug 28, 2020 at 1:27 PM Ivo Jimenez <iv...@gmail.com> wrote:

> [...]

Re: Arrow Dataset API on Ceph

Posted by Ivo Jimenez <iv...@gmail.com>.
Hi Antoine

> > Yes, that is our plan. Since this is going to be done on the storage-,
> > server-side, this would be transparent to the client. So our main concern
> > is whether this would be OK from the design perspective, and could this
> > eventually be merged upstream?
>
> Arrow datasets have no notion of client and server, so I'm not sure what
> you mean here.


Sorry for the confusion. This is where we see a mismatch between the
current design and what we are trying to achieve.

Our goal is to push down computations into a cloud storage system. By
pushing we mean actually sending computation tasks to storage nodes (e.g.
filtering executed on the storage nodes themselves). Ideally this would be
done by implementing a new plugin for arrow::fs but, as far as we can tell,
this filesystem layer is unaware of expressions, record batches, etc., so
this information cannot be communicated down to storage.

So what we thought would work is to implement this at the Dataset API
level, with a scanner (and writer) that defers these operations to storage
nodes. For example, the RadosScanTask class will ask a storage node to
actually do a scan and fetch the result, as opposed to doing the scan
locally.

We would immensely appreciate it if you could let us know if the above is
OK, or if you think there is a better alternative for accomplishing this,
as we would rather implement this functionality in a way that is
compatible with your overall vision.


> Do you simply mean contributing RadosFormat to the Arrow
> codebase?


Yes, so that others wanting to achieve this on a Ceph cluster could
leverage this as well.


> I would say that depends on the required dependencies, and
> ease of testing (and/or CI) for other developers.


OK, yes we will pay attention to these aspects as part of an eventual PR.
We will include tests and ensure that CI covers the changes we introduce.

thanks!

Re: Arrow Dataset API on Ceph

Posted by Antoine Pitrou <an...@python.org>.
Le 27/08/2020 à 21:55, Ivo Jimenez a écrit :
> Hi Antoine,
> 
>>> Our main concern is that this new arrow::dataset::RadosFormat class will
>>> be deriving from the arrow::dataset::FileFormat class, which seems to
>>> raise a conceptual mismatch as there isn’t really a RADOS format but
>>> rather a formatting/serialization deferral that will be taking place,
>>> effectively introducing a new client-server layer in the Dataset API.
>>
>> So, RadosFormat would ultimately redirect to another dataset format
>> (e.g. ParquetFormat) when it comes to actually understanding the data?
>
> Yes, that is our plan. Since this is going to be done on the storage-,
> server-side, this would be transparent to the client. So our main concern
> is whether this would be OK from the design perspective, and could this
> eventually be merged upstream?

Arrow datasets have no notion of client and server, so I'm not sure what
you mean here.  Do you simply mean contributing RadosFormat to the Arrow
codebase?  I would say that depends on the required dependencies, and
ease of testing (and/or CI) for other developers.

Regards

Antoine.

Re: Arrow Dataset API on Ceph

Posted by Ivo Jimenez <iv...@gmail.com>.
Hi Antoine,

> > Our main concern is that this new arrow::dataset::RadosFormat class will
> > be deriving from the arrow::dataset::FileFormat class, which seems to
> > raise a conceptual mismatch as there isn’t really a RADOS format but
> > rather a formatting/serialization deferral that will be taking place,
> > effectively introducing a new client-server layer in the Dataset API.
>
> So, RadosFormat would ultimately redirect to another dataset format
> (e.g. ParquetFormat) when it comes to actually understanding the data?
>

Yes, that is our plan. Since this is going to be done on the storage-,
server-side, this would be transparent to the client. So our main concern
is whether this would be OK from the design perspective, and could this
eventually be merged upstream?

thanks!

Re: Arrow Dataset API on Ceph

Posted by Antoine Pitrou <an...@python.org>.
Hello Ivo,

Le 27/08/2020 à 07:02, Ivo Jimenez a écrit :
> 
> Our main concern is that this new arrow::dataset::RadosFormat class will be
> deriving from the arrow::dataset::FileFormat class, which seems to raise a
> conceptual mismatch as there isn’t really a RADOS format but rather a
> formatting/serialization deferral that will be taking place, effectively
> introducing a new client-server layer in the Dataset API.

So, RadosFormat would ultimately redirect to another dataset format
(e.g. ParquetFormat) when it comes to actually understanding the data?

Regards

Antoine.