You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by Jayjeet Chakraborty <ja...@gmail.com> on 2022/08/24 13:59:45 UTC

Using Acero in a distributed environment

Hi Arrow Community,

With the release of Acero, we were wondering if Acero can be used in a
distributed environment as for now it looks like Acero is only intended for
a local context. For example, if we have a query plan with a hash join node
at the root and multiple filter project nodes on each sides of the tree,
each side having a data source, how can we distribute the query plan
between 3 nodes: 2 nodes containing data sources and executing the filter
and project parts of the query plan in parallel while 1 node serving as the
compute node, performing only the join operation on the results from the
other 2 nodes. As per my understanding, we need some form of RPC mechanism
between the ExecNodes of an ExecPlan and would probably be integrated
within the Flight framework. Is that the right way to think about it ? Do
you think that is something the Arrow community would be interested in if
not already planning for it ? Thanks.

Jayjeet Chakraborty


-- 
*Jayjeet Chakraborty*
CS PhD student
UC Santa Cruz
California, USA

Re: Using Acero in a distributed environment

Posted by Aldrin <ak...@ucsc.edu.INVALID>.

I am slowly but surely building up to something like that.

[1] is my progress before, using computational storage drives but I have
only ran it on a single drive so far. [2] is where I will be trying to do
something more generic, but using flight RPC (instead of kinetic protocol)
and substrait + acero (instead of bare minimum SQL on top of my own
functions). I'm not sure what direction you may want to go, but I will be
using substrait to communicate plans between services and each service can
decide if it will: (A) execute the plan then send data, or (B) propagate
the plan, receive data, execute remaining work, then send data. This type
of approach can be seen by the slide in [3], where computational storage
devices would do (A) and storage servers would do (B) before returning data
back to the client (compute node or application node).

My general perspective is that Acero does query execution at a single site
and distributed query execution simply requires communicating query plans
between each node that can do query execution. Query plans that are sent
between many nodes can either be:
* static -- made at a head node and executed as-is at a worker
* dynamic -- initially or partially made by some node and the plan is
possibly altered or possibly partially executed at a worker

With mohair [2], I will be trying to take the dynamic approach.

[1]: https://gitlab.com/skyhookdm/skytether-singlecell
[2]: https://github.com/drin/mohair/tree/develop
[3]:
https://docs.google.com/presentation/d/1Nollf087CRhMmEAWcwfudIizIhF-ttPRGgaqmuXtSBQ/edit#slide=id.g12c2952ca0d_0_67

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Wed, Aug 31, 2022 at 10:29 AM Jayjeet Chakraborty <
jayjeetchakraborty25@gmail.com> wrote:

> Thanks a lot for your reply, Niranda and Weston.
>
> On Thu, Aug 25, 2022 at 1:31 AM Weston Pace <we...@gmail.com> wrote:
>
> > I don't know of any work being done to turn Acero into a distributed
> > query engine.
> >
> > However, I would hope that Acero can be used in a distributed query
> > engine, and would be a useful component.
> >
> > If there are features that Acero would need in this environment (e.g.
> > some kind of exec node for specialized transmission or partitioned
> > transmission) then please feel free to create JIRA tickets describing
> > what you would like to see.
> >
> > On Wed, Aug 24, 2022 at 7:18 AM Niranda Perera <niranda.perera@gmail.com
> >
> > wrote:
> > >
> > > Hi Jayeet,
> > >
> > > AFAIU, Acero work mainly focuses on single node multithreaded execution
> > > based on morsel driven parallelism [1].
> > > In your case, there are multiple options IMO. Ex. just use 2 nodes
> which
> > do
> > > filtering parallely, and then node0 does the join (this reduces
> > > communication).  Better yet, if you could use distributed memory
> > > computation for the hash join which uses both nodes (which is not
> > supported
> > > yet in arrow). There are several other compute engines that support
> these
> > > types of execution on top of arrow dataformat (eg: Cylon which I'm
> > working
> > > on ATM)
> > >
> > > [1] https://dl.acm.org/doi/abs/10.1145/2588555.2610507
> > >
> > > On Wed, Aug 24, 2022 at 10:00 AM Jayjeet Chakraborty <
> > > jayjeetchakraborty25@gmail.com> wrote:
> > >
> > > > Hi Arrow Community,
> > > >
> > > > With the release of Acero, we were wondering if Acero can be used in
> a
> > > > distributed environment as for now it looks like Acero is only
> > intended for
> > > > a local context. For example, if we have a query plan with a hash
> join
> > node
> > > > at the root and multiple filter project nodes on each sides of the
> > tree,
> > > > each side having a data source, how can we distribute the query plan
> > > > between 3 nodes: 2 nodes containing data sources and executing the
> > filter
> > > > and project parts of the query plan in parallel while 1 node serving
> > as the
> > > > compute node, performing only the join operation on the results from
> > the
> > > > other 2 nodes. As per my understanding, we need some form of RPC
> > mechanism
> > > > between the ExecNodes of an ExecPlan and would probably be integrated
> > > > within the Flight framework. Is that the right way to think about it
> ?
> > Do
> > > > you think that is something the Arrow community would be interested
> in
> > if
> > > > not already planning for it ? Thanks.
> > > >
> > > > Jayjeet Chakraborty
> > > >
> > > >
> > > > --
> > > > *Jayjeet Chakraborty*
> > > > CS PhD student
> > > > UC Santa Cruz
> > > > California, USA
> > > >
> > >
> > >
> > > --
> > > Niranda Perera
> > > https://niranda.dev/
> > > @n1r44 <https://twitter.com/N1R44>
> >
>
>
> --
> *Jayjeet Chakraborty*
> CS PhD student
> UC Santa Cruz
> California, USA
>

Re: Using Acero in a distributed environment

Posted by Jayjeet Chakraborty <ja...@gmail.com>.

Thanks a lot for your reply, Niranda and Weston.

On Thu, Aug 25, 2022 at 1:31 AM Weston Pace <we...@gmail.com> wrote:

> I don't know of any work being done to turn Acero into a distributed
> query engine.
>
> However, I would hope that Acero can be used in a distributed query
> engine, and would be a useful component.
>
> If there are features that Acero would need in this environment (e.g.
> some kind of exec node for specialized transmission or partitioned
> transmission) then please feel free to create JIRA tickets describing
> what you would like to see.
>
> On Wed, Aug 24, 2022 at 7:18 AM Niranda Perera <ni...@gmail.com>
> wrote:
> >
> > Hi Jayeet,
> >
> > AFAIU, Acero work mainly focuses on single node multithreaded execution
> > based on morsel driven parallelism [1].
> > In your case, there are multiple options IMO. Ex. just use 2 nodes which
> do
> > filtering parallely, and then node0 does the join (this reduces
> > communication).  Better yet, if you could use distributed memory
> > computation for the hash join which uses both nodes (which is not
> supported
> > yet in arrow). There are several other compute engines that support these
> > types of execution on top of arrow dataformat (eg: Cylon which I'm
> working
> > on ATM)
> >
> > [1] https://dl.acm.org/doi/abs/10.1145/2588555.2610507
> >
> > On Wed, Aug 24, 2022 at 10:00 AM Jayjeet Chakraborty <
> > jayjeetchakraborty25@gmail.com> wrote:
> >
> > > Hi Arrow Community,
> > >
> > > With the release of Acero, we were wondering if Acero can be used in a
> > > distributed environment as for now it looks like Acero is only
> intended for
> > > a local context. For example, if we have a query plan with a hash join
> node
> > > at the root and multiple filter project nodes on each sides of the
> tree,
> > > each side having a data source, how can we distribute the query plan
> > > between 3 nodes: 2 nodes containing data sources and executing the
> filter
> > > and project parts of the query plan in parallel while 1 node serving
> as the
> > > compute node, performing only the join operation on the results from
> the
> > > other 2 nodes. As per my understanding, we need some form of RPC
> mechanism
> > > between the ExecNodes of an ExecPlan and would probably be integrated
> > > within the Flight framework. Is that the right way to think about it ?
> Do
> > > you think that is something the Arrow community would be interested in
> if
> > > not already planning for it ? Thanks.
> > >
> > > Jayjeet Chakraborty
> > >
> > >
> > > --
> > > *Jayjeet Chakraborty*
> > > CS PhD student
> > > UC Santa Cruz
> > > California, USA
> > >
> >
> >
> > --
> > Niranda Perera
> > https://niranda.dev/
> > @n1r44 <https://twitter.com/N1R44>
>


-- 
*Jayjeet Chakraborty*
CS PhD student
UC Santa Cruz
California, USA

Re: Using Acero in a distributed environment

Posted by Weston Pace <we...@gmail.com>.

I don't know of any work being done to turn Acero into a distributed
query engine.

However, I would hope that Acero can be used in a distributed query
engine, and would be a useful component.

If there are features that Acero would need in this environment (e.g.
some kind of exec node for specialized transmission or partitioned
transmission) then please feel free to create JIRA tickets describing
what you would like to see.

On Wed, Aug 24, 2022 at 7:18 AM Niranda Perera <ni...@gmail.com> wrote:
>
> Hi Jayeet,
>
> AFAIU, Acero work mainly focuses on single node multithreaded execution
> based on morsel driven parallelism [1].
> In your case, there are multiple options IMO. Ex. just use 2 nodes which do
> filtering parallely, and then node0 does the join (this reduces
> communication).  Better yet, if you could use distributed memory
> computation for the hash join which uses both nodes (which is not supported
> yet in arrow). There are several other compute engines that support these
> types of execution on top of arrow dataformat (eg: Cylon which I'm working
> on ATM)
>
> [1] https://dl.acm.org/doi/abs/10.1145/2588555.2610507
>
> On Wed, Aug 24, 2022 at 10:00 AM Jayjeet Chakraborty <
> jayjeetchakraborty25@gmail.com> wrote:
>
> > Hi Arrow Community,
> >
> > With the release of Acero, we were wondering if Acero can be used in a
> > distributed environment as for now it looks like Acero is only intended for
> > a local context. For example, if we have a query plan with a hash join node
> > at the root and multiple filter project nodes on each sides of the tree,
> > each side having a data source, how can we distribute the query plan
> > between 3 nodes: 2 nodes containing data sources and executing the filter
> > and project parts of the query plan in parallel while 1 node serving as the
> > compute node, performing only the join operation on the results from the
> > other 2 nodes. As per my understanding, we need some form of RPC mechanism
> > between the ExecNodes of an ExecPlan and would probably be integrated
> > within the Flight framework. Is that the right way to think about it ? Do
> > you think that is something the Arrow community would be interested in if
> > not already planning for it ? Thanks.
> >
> > Jayjeet Chakraborty
> >
> >
> > --
> > *Jayjeet Chakraborty*
> > CS PhD student
> > UC Santa Cruz
> > California, USA
> >
>
>
> --
> Niranda Perera
> https://niranda.dev/
> @n1r44 <https://twitter.com/N1R44>

Re: Using Acero in a distributed environment

Posted by Niranda Perera <ni...@gmail.com>.

Hi Jayeet,

AFAIU, Acero work mainly focuses on single node multithreaded execution
based on morsel driven parallelism [1].
In your case, there are multiple options IMO. Ex. just use 2 nodes which do
filtering parallely, and then node0 does the join (this reduces
communication).  Better yet, if you could use distributed memory
computation for the hash join which uses both nodes (which is not supported
yet in arrow). There are several other compute engines that support these
types of execution on top of arrow dataformat (eg: Cylon which I'm working
on ATM)

[1] https://dl.acm.org/doi/abs/10.1145/2588555.2610507

On Wed, Aug 24, 2022 at 10:00 AM Jayjeet Chakraborty <
jayjeetchakraborty25@gmail.com> wrote:

> Hi Arrow Community,
>
> With the release of Acero, we were wondering if Acero can be used in a
> distributed environment as for now it looks like Acero is only intended for
> a local context. For example, if we have a query plan with a hash join node
> at the root and multiple filter project nodes on each sides of the tree,
> each side having a data source, how can we distribute the query plan
> between 3 nodes: 2 nodes containing data sources and executing the filter
> and project parts of the query plan in parallel while 1 node serving as the
> compute node, performing only the join operation on the results from the
> other 2 nodes. As per my understanding, we need some form of RPC mechanism
> between the ExecNodes of an ExecPlan and would probably be integrated
> within the Flight framework. Is that the right way to think about it ? Do
> you think that is something the Arrow community would be interested in if
> not already planning for it ? Thanks.
>
> Jayjeet Chakraborty
>
>
> --
> *Jayjeet Chakraborty*
> CS PhD student
> UC Santa Cruz
> California, USA
>


-- 
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>