Posted to dev@arrow.apache.org by Benjamin Kietzman <be...@gmail.com> on 2024/03/13 15:54:10 UTC

[DISCUSS][C++] Help needed to refactor Skyhook

Skyhook [1] enables efficient predicate and projection pushdown from
Arrow Dataset to a Ceph storage cluster. This is very cool
functionality, but it's tightly coupled to the Arrow C++ Dataset
implementation in a way which blocks refactoring. In the Arrow C++
codebase today, Acero is designed specifically to handle projection
and filtration in a more modular fashion, and to accept configuration
from standardized plan/expression formats like Substrait. In light of
improvements to Dataset which are not possible while maintaining
Skyhook in its current form, we need volunteers to update Skyhook.
Please reply to let us know if you are actively using Skyhook or if
you are interested in helping to refactor Skyhook.

Sincerely,
Ben Kietzman

[1]
https://arrow.apache.org/blog/2022/01/31/skyhook-bringing-computation-to-storage-with-apache-arrow/

Re: [DISCUSS][C++] Help needed to refactor Skyhook

Posted by Benjamin Kietzman <be...@gmail.com>.
I'll put recommendations for the design on the issue. Thanks!

On Fri, Mar 15, 2024 at 2:03 PM Aldrin <oc...@pm.me.invalid> wrote:

> I created a new issue [1] to track the refactoring. Could you clarify the
> request (here or in the issue)?
>
> My understanding is that the Skyhook file format code [2] should be
> refactored to use a higher-level interface rather than using
> dataset::FileFormat and dataset::FragmentScanOptions directly [3].
>
> I am assuming the reference to Acero and Substrait to be only for context
> and not necessarily a preferred direction. If that is the preferred
> direction, there is something much more general in progress that we can
> perhaps specialize as a replacement for the Skyhook file format, but I'm
> not sure that's what's actually being requested.
>
> Thank you!
>
>
> [1]: https://github.com/apache/arrow/issues/40583
> [2]: https://github.com/apache/arrow/tree/main/cpp/src/skyhook
> [3]:
> https://github.com/apache/arrow/blob/main/cpp/src/skyhook/cls/cls_skyhook.cc#L153-L156
>
>
>
> # ------------------------------
>
> # Aldrin
>
>
> https://github.com/drin/
>
> https://gitlab.com/octalene
>
> https://keybase.io/octalene
>
>

Re: [DISCUSS][C++] Help needed to refactor Skyhook

Posted by Aldrin <oc...@pm.me.INVALID>.
I created a new issue [1] to track the refactoring. Could you clarify the request (here or in the issue)?

My understanding is that the Skyhook file format code [2] should be refactored to use a higher-level interface rather than using dataset::FileFormat and dataset::FragmentScanOptions directly [3].

I am assuming the reference to Acero and Substrait to be only for context and not necessarily a preferred direction. If that is the preferred direction, there is something much more general in progress that we can perhaps specialize as a replacement for the Skyhook file format, but I'm not sure that's what's actually being requested.

Thank you!


[1]: https://github.com/apache/arrow/issues/40583
[2]: https://github.com/apache/arrow/tree/main/cpp/src/skyhook
[3]: https://github.com/apache/arrow/blob/main/cpp/src/skyhook/cls/cls_skyhook.cc#L153-L156



# ------------------------------

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene

https://keybase.io/octalene


On Thursday, March 14th, 2024 at 09:10, Jayjeet Chakraborty <ja...@gmail.com> wrote:

> Hi Ben, I am willing to help out with the refactor too!
>
> On Wed, Mar 13, 2024 at 9:25 PM Aldrin <octalene.dev@pm.me.invalid> wrote:
>
> > I am interested in helping to refactor!
> >
> > -Aldrin
> >
> > On Wed, Mar 13, 2024 at 08:54, Benjamin Kietzman <bengilgit@gmail.com> wrote:
> >
> > Skyhook [1] enables efficient predicate and projection pushdown from
> > Arrow Dataset to a Ceph storage cluster. This is very cool
> > functionality, but it's tightly coupled to the Arrow C++ Dataset
> > implementation in a way which blocks refactoring. In the Arrow C++
> > codebase today, Acero is designed specifically to handle projection
> > and filtration in a more modular fashion, and to accept configuration
> > from standardized plan/expression formats like Substrait. In light of
> > improvements to Dataset which are not possible while maintaining
> > Skyhook in its current form, we need volunteers to update Skyhook.
> > Please reply to let us know if you are actively using Skyhook or if
> > you are interested in helping to refactor Skyhook.
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1]
> > https://arrow.apache.org/blog/2022/01/31/skyhook-bringing-computation-to-storage-with-apache-arrow/
>
> --
> Jayjeet Chakraborty
> CS PhD student
> UC Santa Cruz
> California, USA

Re: [DISCUSS][C++] Help needed to refactor Skyhook

Posted by Jayjeet Chakraborty <ja...@gmail.com>.
Hi Ben, I am willing to help out with the refactor too!

On Wed, Mar 13, 2024 at 9:25 PM Aldrin <oc...@pm.me.invalid> wrote:

> I am interested in helping to refactor!
>
> -Aldrin
>
>
> On Wed, Mar 13, 2024 at 08:54, Benjamin Kietzman <bengilgit@gmail.com> wrote:
>
> Skyhook [1] enables efficient predicate and projection pushdown from
> Arrow Dataset to a Ceph storage cluster. This is very cool
> functionality, but it's tightly coupled to the Arrow C++ Dataset
> implementation in a way which blocks refactoring. In the Arrow C++
> codebase today, Acero is designed specifically to handle projection
> and filtration in a more modular fashion, and to accept configuration
> from standardized plan/expression formats like Substrait. In light of
> improvements to Dataset which are not possible while maintaining
> Skyhook in its current form, we need volunteers to update Skyhook.
> Please reply to let us know if you are actively using Skyhook or if
> you are interested in helping to refactor Skyhook.
>
> Sincerely,
> Ben Kietzman
>
> [1]
>
> https://arrow.apache.org/blog/2022/01/31/skyhook-bringing-computation-to-storage-with-apache-arrow/
>
>

-- 
*Jayjeet Chakraborty*
CS PhD student
UC Santa Cruz
California, USA

Re: [DISCUSS][C++] Help needed to refactor Skyhook

Posted by Aldrin <oc...@pm.me.INVALID>.
I am interested in helping to refactor!
-Aldrin

On Wed, Mar 13, 2024 at 08:54, Benjamin Kietzman <bengilgit@gmail.com> wrote:

> Skyhook [1] enables efficient predicate and projection pushdown from
> Arrow Dataset to a Ceph storage cluster. This is very cool
> functionality, but it's tightly coupled to the Arrow C++ Dataset
> implementation in a way which blocks refactoring. In the Arrow C++
> codebase today, Acero is designed specifically to handle projection
> and filtration in a more modular fashion, and to accept configuration
> from standardized plan/expression formats like Substrait. In light of
> improvements to Dataset which are not possible while maintaining
> Skyhook in its current form, we need volunteers to update Skyhook.
> Please reply to let us know if you are actively using Skyhook or if
> you are interested in helping to refactor Skyhook.
>
> Sincerely,
> Ben Kietzman
>
> [1]
> https://arrow.apache.org/blog/2022/01/31/skyhook-bringing-computation-to-storage-with-apache-arrow/