Posted to dev@airflow.apache.org by Ruslan Dautkhanov <da...@gmail.com> on 2018/04/24 20:13:24 UTC

Airflow - YARN as an executor?

With Hadoop 3's Docker on YARN support, I think YARN becomes
somewhat of a competitor to Kubernetes.

Great job on adding k8s support to Airflow.

Similarly, I think Airflow could integrate with YARN and use
its infrastructure as an "executor". Has anyone explored the feasibility of
this approach?


Thanks!
Ruslan Dautkhanov

Re: Airflow - YARN as an executor?

Posted by Ruslan Dautkhanov <da...@gmail.com>.

As long as that code is serializable (through pickle, cloudpickle, or any
other Python code serializer), the answer should be yes.
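
As a rough illustration of the serializability condition, here is a minimal sketch using only the standard pickle module. The helper names `ship_and_run` and `can_pickle` are made up for this example and are not part of Airflow or Spark; the round trip through bytes stands in for sending a task to a remote worker.

```python
import functools
import operator
import pickle


def ship_and_run(task):
    """Serialize a callable, then deserialize and call it, as a remote worker would."""
    payload = pickle.dumps(task)      # bytes that could be sent to another process
    restored = pickle.loads(payload)  # what the worker would reconstruct
    return restored()


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


# A partial over an importable function is pickled by reference to operator.add.
task = functools.partial(operator.add, 2, 3)
result = ship_and_run(task)

# A lambda has no importable name, so the standard pickle module rejects it.
# cloudpickle serializes such functions by value instead.
lambda_ok = can_pickle(lambda: 42)
```

The lambda case is where the choice of serializer matters: code that plain pickle cannot handle may still be shippable with cloudpickle.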

Thanks.



-- 
Ruslan Dautkhanov

On Wed, Apr 25, 2018 at 9:54 AM, Taylor Edmiston <te...@gmail.com>
wrote:

> Is it possible for the (hypothetical) Airflow SparkExecutor to handle
> general execution of any operator (i.e., run non-Spark code)?
>
> *Taylor Edmiston*
> Blog <http://blog.tedmiston.com> | Stack Overflow CV
> <https://stackoverflow.com/story/taylor> | LinkedIn
> <https://www.linkedin.com/in/tedmiston/> | AngelList
> <https://angel.co/taylor>

Re: Airflow - YARN as an executor?

Posted by Taylor Edmiston <te...@gmail.com>.

Is it possible for the (hypothetical) Airflow SparkExecutor to handle
general execution of any operator (i.e., run non-Spark code)?

*Taylor Edmiston*
Blog <http://blog.tedmiston.com> | Stack Overflow CV
<https://stackoverflow.com/story/taylor> | LinkedIn
<https://www.linkedin.com/in/tedmiston/> | AngelList
<https://angel.co/taylor>


On Wed, Apr 25, 2018 at 11:22 AM, Ruslan Dautkhanov <da...@gmail.com>
wrote:

> I used "Executor" as an Airflow term; I did not mean a Spark executor.
> Spark would be one of the executors listed here:
> https://github.com/apache/incubator-airflow/tree/master/airflow/executors
> or here:
> https://github.com/apache/incubator-airflow/tree/master/airflow/contrib/executors
>
> Thanks.
>
>
>
> --
> Ruslan Dautkhanov

Re: Airflow - YARN as an executor?

Posted by Ruslan Dautkhanov <da...@gmail.com>.

I used "Executor" as an Airflow term; I did not mean a Spark executor.
Spark would be one of the executors listed here:
https://github.com/apache/incubator-airflow/tree/master/airflow/executors
or here:
https://github.com/apache/incubator-airflow/tree/master/airflow/contrib/executors
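
As a rough illustration of "executor" in the Airflow sense: an executor receives a task key and a command and is responsible for running that command somewhere. The sketch below is a hypothetical, self-contained stand-in that does not import Airflow. `ToyExecutor` and its state strings are made up, and the method names `execute_async`, `sync`, and `end` are only meant to echo those on Airflow's `BaseExecutor`.

```python
import subprocess
import sys


class ToyExecutor:
    """Hypothetical stand-in for an Airflow-style executor (not Airflow code)."""

    def __init__(self):
        self.running = {}  # task key -> Popen handle
        self.states = {}   # task key -> "success" or "failed"

    def execute_async(self, key, command):
        # A YARN- or Spark-backed executor would submit a job here instead;
        # this sketch just starts a local subprocess.
        self.running[key] = subprocess.Popen(command)

    def sync(self):
        # Poll running tasks and record the final state of finished ones.
        for key, proc in list(self.running.items()):
            code = proc.poll()
            if code is not None:
                self.states[key] = "success" if code == 0 else "failed"
                del self.running[key]

    def end(self):
        # Block until every task has finished, then collect the states.
        for proc in self.running.values():
            proc.wait()
        self.sync()


executor = ToyExecutor()
executor.execute_async("ok_task", [sys.executable, "-c", "pass"])
executor.execute_async("bad_task", [sys.executable, "-c", "raise SystemExit(1)"])
executor.end()
```

Under this reading, a YARN or Spark executor would differ mainly in what `execute_async` submits to and how `sync` learns that a task has finished.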

Thanks.



-- 
Ruslan Dautkhanov

On Wed, Apr 25, 2018 at 9:17 AM, Bolke de Bruin <bd...@gmail.com> wrote:

> I'm a bit lost on the Spark executor, to be honest. To my knowledge, the
> Spark driver creates Spark executors, which run Spark code. In other words,
> it can’t arbitrarily run generic code. Or can it?
>
> B.
>
> Sent from my iPad

Re: Airflow - YARN as an executor?

Posted by Bolke de Bruin <bd...@gmail.com>.

I'm a bit lost on the Spark executor, to be honest. To my knowledge, the Spark driver creates Spark executors, which run Spark code. In other words, it can’t arbitrarily run generic code. Or can it?

B.

Sent from my iPad

> On 25 Apr 2018, at 17:11, Ruslan Dautkhanov <da...@gmail.com> wrote:
> 
> Now I think an Airflow PySpark Executor would be an easier target.
> Spark runs on YARN, Mesos, and now Kubernetes,
> so a PySpark Executor would let Airflow run on all of these schedulers.
> It's my understanding that we currently have only a Spark Operator, not an Executor.
> 
> Thanks!
> 
> 
> 
> -- 
> Ruslan Dautkhanov

Re: Airflow - YARN as an executor?

Posted by Ruslan Dautkhanov <da...@gmail.com>.

Now I think an Airflow PySpark Executor would be an easier target.
Spark runs on YARN, Mesos, and now Kubernetes,
so a PySpark Executor would let Airflow run on all of these schedulers.
It's my understanding that we currently have only a Spark Operator, not an Executor.

Thanks!



-- 
Ruslan Dautkhanov

On Tue, Apr 24, 2018 at 3:20 PM, Ace Haidrey <ac...@gmail.com> wrote:

> Hey, I didn’t know this, Bolke. I was under the same impression as
> Ruslan.
> Thanks for sharing.
>
> Sent from my iPhone

Re: Airflow - YARN as an executor?

Posted by Ace Haidrey <ac...@gmail.com>.

Hey, I didn’t know this, Bolke. I was under the same impression as Ruslan.
Thanks for sharing.

Sent from my iPhone

> On Apr 24, 2018, at 2:12 PM, Bolke de Bruin <bd...@gmail.com> wrote:
> 
> It actually can nowadays: https://cdn.oreillystatic.com/en/assets/1/event/269/HDFS%20on%20Kubernetes_%20Tech%20deep%20dive%20on%20locality%20and%20security%20Presentation.pptx
> 
> We also have an on-premises setup with Ceph (s3a) and HDFS for when we need the speed, and Kubernetes for our workloads. We are kicking out YARN (and Hive etc., for that matter).
> 
> Bolke
> 
> 
> 
> Sent from my iPad
> 
>> On 24 Apr 2018, at 22:50, Ruslan Dautkhanov <da...@gmail.com> wrote:
>> 
>> Kubernetes is a "monolithic" one-level scheduler that can't handle what YARN
>> can, for example scheduling tasks local to the data.
>> Hadoop has multiple levels of data locality (node-local, rack-local), so
>> computation happens close to the data to minimize network
>> data transfer, which is expensive.
>> K8s wasn't designed to handle these scheduling scenarios, as far as I know.
>> 
>> For cloud deployments where we don't have a data locality problem (because
>> S3 is being used instead of storage local
>> to servers), k8s might be okay.
>> 
>> Nice comparison [1] of k8s vs. two-level schedulers like YARN and Mesos,
>> although I think it's off-topic.
>> 
>> We're mostly on-prem and we don't see Kubernetes taking over from YARN any time
>> soon.
>> 
>> Thanks.
>> 
>> 
>> 
>> [1]
>> 
>> https://aaltodoc.aalto.fi/bitstream/handle/123456789/27061/master_Ravula_Shashi_2017.pdf?sequence=1
>> 
>> *2.3.2 Monolithic Schedulers *
>> 
>> 
>> 
>> Monolithic schedulers use a single, centralized scheduling algorithm for
>> all jobs. All workload is run through the same scheduler and same
>> scheduling logic. Swarm,
>> Fleet, Borg and Kubernetes adopt monolithic schedulers. Kubernetes
>> improvised on basic monolithic version of Borg and Swarm schedulers. This
>> type of schedulers are not suitable for running heterogeneous modern
>> workloads which include Spark jobs, containers, and other long running jobs,
>> etc.
>> 
>> 
>> 
>> *2.3.3 Two Level Schedulers *
>> 
>> 
>> 
>> Two-level schedulers address the drawbacks of a monolithic scheduler by
>> separating concerns of resource allocation and task placement. An active
>> resource manager offers compute resources to multiple parallel, independent
>> “scheduler frameworks”. The Mesos cluster manager pioneered this approach,
>> and YARN supports a limited version of it. In Mesos, resources are offered
>> to application-level schedulers. This allows for custom, workload-specific
>> scheduling policies. The drawback with this type of scheduling architecture
>> is that the application level frameworks cannot see all the possible
>> placement options anymore. Instead, they only see those options that
>> correspond to resources offered (Mesos) or allocated (YARN) by the resource
>> manager component. This makes priority preemption (higher priority tasks
>> kick out lower priority ones) difficult.
>> 
>> 
>> 
>> 
>> 
>> -- 
>> Ruslan Dautkhanov
>> 
>>> On Tue, Apr 24, 2018 at 2:22 PM, Bolke de Bruin <bd...@gmail.com> wrote:
>>> 
>>> Happy to have it as a contrib executor. However, I personally think yarn
>>> is a dead end. It has a lot of catching up to do and all the momentum is
>>> with kubernetes.
>>> 
>>> B.
>>> 
>>> Sent from my iPad
>>> 
>>>> On 24 Apr 2018, at 22:13, Ruslan Dautkhanov <da...@gmail.com> wrote:
>>>> 
>>>> With Hadoop 3's Docker on YARN support, I think YARN becomes
>>>> somewhat a competitor for Kubernetes.
>>>> 
>>>> Great job on adding k8s support to Airflow.
>>>> 
>>>> Very similarly I see Airflow could integrate with YARN and use
>>>> its infrastructure as an "executor" .. has anyone explored the feasibility of
>>>> this approach?
>>>> 
>>>> 
>>>> Thanks!
>>>> Ruslan Dautkhanov
>>>

Re: Airflow - YARN as an executor?

Posted by Bolke de Bruin <bd...@gmail.com>.

It actually can nowadays: https://cdn.oreillystatic.com/en/assets/1/event/269/HDFS%20on%20Kubernetes_%20Tech%20deep%20dive%20on%20locality%20and%20security%20Presentation.pptx

We also have an on-premises setup with Ceph (s3a) and HDFS for when we need the speed, and Kubernetes for our workloads. We are kicking out YARN (and Hive etc., for that matter).

Bolke



Sent from my iPad

> On 24 Apr 2018, at 22:50, Ruslan Dautkhanov <da...@gmail.com> wrote:
> 
> Kubernetes is a "monolithic", single-level scheduler that can't handle
> everything YARN can, for example scheduling tasks local to the data.
> Hadoop has multiple levels of data locality (node-local, rack-local), so
> computation happens close to the data to minimize network
> data transfer, which is expensive.
> K8s wasn't designed to handle these scheduling scenarios, as far as I know.
> 
> For cloud deployments where we don't have a data locality problem (because
> S3 is being used instead of storage local
> to the servers), k8s might be okay.
> 
> Nice comparison [1] of k8s vs. two-level schedulers like YARN and Mesos,
> although I think it's off-topic.
> 
> We're mostly on-prem and we don't see Kubernetes taking over from YARN any time
> soon.
> 
> Thanks.
> 
> 
> 
> [1]
> 
> https://aaltodoc.aalto.fi/bitstream/handle/123456789/27061/master_Ravula_Shashi_2017.pdf?sequence=1
> 
> *2.3.2 Monolithic Schedulers *
> 
> 
> 
> Monolithic schedulers use a single, centralized scheduling algorithm for
> all jobs. All workload is run through the same scheduler and same
> scheduling logic. Swarm,
> Fleet, Borg and Kubernetes adopt monolithic schedulers. Kubernetes
> improvised on basic monolithic version of Borg and Swarm schedulers. This
> type of schedulers are not suitable for running heterogeneous modern
> workloads which include Spark jobs, containers, and other long running jobs,
> etc.
> 
> 
> 
> *2.3.3 Two Level Schedulers *
> 
> 
> 
> Two-level schedulers address the drawbacks of a monolithic scheduler by
> separating concerns of resource allocation and task placement. An active
> resource manager offers compute resources to multiple parallel, independent
> “scheduler frameworks”. The Mesos cluster manager pioneered this approach,
> and YARN supports a limited version of it. In Mesos, resources are offered
> to application-level schedulers. This allows for custom, workload-specific
> scheduling policies. The drawback with this type of scheduling architecture
> is that the application level frameworks cannot see all the possible
> placement options anymore. Instead, they only see those options that
> correspond to resources offered (Mesos) or allocated (YARN) by the resource
> manager component. This makes priority preemption (higher priority tasks
> kick out lower priority ones) difficult.
> 
> 
> 
> 
> 
> -- 
> Ruslan Dautkhanov
> 
>> On Tue, Apr 24, 2018 at 2:22 PM, Bolke de Bruin <bd...@gmail.com> wrote:
>> 
>> Happy to have it as a contrib executor. However, I personally think yarn
>> is a dead end. It has a lot of catching up to do and all the momentum is
>> with kubernetes.
>> 
>> B.
>> 
>> Sent from my iPad
>> 
>>> On 24 Apr 2018, at 22:13, Ruslan Dautkhanov <da...@gmail.com> wrote:
>>> 
>>> With Hadoop 3's Docker on YARN support, I think YARN becomes
>>> somewhat a competitor for Kubernetes.
>>> 
>>> Great job on adding k8s support to Airflow.
>>> 
>>> Very similarly I see Airflow could integrate with YARN and use
>>> its infrastructure as an "executor" .. has anyone explored the feasibility of
>>> this approach?
>>> 
>>> 
>>> Thanks!
>>> Ruslan Dautkhanov
>>

Re: Airflow - YARN as an executor?

Posted by Ruslan Dautkhanov <da...@gmail.com>.

Kubernetes is a "monolithic", single-level scheduler that can't handle
everything YARN can, for example scheduling tasks local to the data.
Hadoop has multiple levels of data locality (node-local, rack-local), so
computation happens close to the data to minimize network
data transfer, which is expensive.
K8s wasn't designed to handle these scheduling scenarios, as far as I know.
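
To make the locality levels concrete, here is a minimal sketch of
locality-aware placement. It is not YARN's actual API: the Node class,
the level numbers and the function names are all made up for
illustration, and a real scheduler also weighs queues, capacity and
delay scheduling.

```python
from dataclasses import dataclass


# Hypothetical cluster model; node and rack names are illustrative only.
@dataclass(frozen=True)
class Node:
    name: str
    rack: str


def locality_level(node, replica_nodes):
    """Return 0 for node-local, 1 for rack-local, 2 for off-rack."""
    if any(node.name == r.name for r in replica_nodes):
        return 0
    if any(node.rack == r.rack for r in replica_nodes):
        return 1
    return 2


def pick_node(free_nodes, replica_nodes):
    """Pick the free node closest to the data; ties keep input order."""
    return min(free_nodes, key=lambda n: locality_level(n, replica_nodes))
```

With one replica on a node in rack r1, a free node in r1 is preferred
over a free node in r2, and the replica's own node wins if it is free.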

For cloud deployments where we don't have a data locality problem (because
S3 is being used instead of storage local
to the servers), k8s might be okay.

Nice comparison [1] of k8s vs. two-level schedulers like YARN and Mesos,
although I think it's off-topic.

We're mostly on-prem and we don't see Kubernetes taking over from YARN any time
soon.

Thanks.

[1]

https://aaltodoc.aalto.fi/bitstream/handle/123456789/27061/master_Ravula_Shashi_2017.pdf?sequence=1

*2.3.2 Monolithic Schedulers *

Monolithic schedulers use a single, centralized scheduling algorithm for
all jobs. All workload is run through the same scheduler and same
scheduling logic. Swarm,
Fleet, Borg and Kubernetes adopt monolithic schedulers. Kubernetes
improvised on basic monolithic version of Borg and Swarm schedulers. This
type of schedulers are not suitable for running heterogeneous modern
workloads which include Spark jobs, containers, and other long running jobs,
etc.

*2.3.3 Two Level Schedulers *

Two-level schedulers address the drawbacks of a monolithic scheduler by
separating concerns of resource allocation and task placement. An active
resource manager offers compute resources to multiple parallel, independent
“scheduler frameworks”. The Mesos cluster manager pioneered this approach,
and YARN supports a limited version of it. In Mesos, resources are offered
to application-level schedulers. This allows for custom, workload-specific
scheduling policies. The drawback with this type of scheduling architecture
is that the application level frameworks cannot see all the possible
placement options anymore. Instead, they only see those options that
correspond to resources offered (Mesos) or allocated (YARN) by the resource
manager component. This makes priority preemption (higher priority tasks
kick out lower priority ones) difficult.
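
The two-level idea quoted above can be sketched roughly as follows.
This is a toy model under stated assumptions, not Mesos or YARN code:
ResourceManager, Framework, offer_round and place are hypothetical
names, and resources are reduced to a CPU count per node. The point it
shows is the drawback from the excerpt: each framework only sees what
it is offered, not the whole cluster.

```python
class ResourceManager:
    """First level: owns the nodes and decides what each framework is offered."""

    def __init__(self, free_cpus):
        self.free_cpus = dict(free_cpus)  # node name -> free CPU count

    def offer_round(self, frameworks):
        placements = []
        for fw in frameworks:
            # Offer only what is still free; later frameworks see less.
            offer = {n: c for n, c in self.free_cpus.items() if c > 0}
            for node, cpus in fw.place(offer):
                self.free_cpus[node] -= cpus
                placements.append((fw.name, node, cpus))
        return placements


class Framework:
    """Second level: applies its own placement policy to the offer it receives."""

    def __init__(self, name, tasks, preferred=()):
        self.name = name
        self.tasks = list(tasks)          # CPU demand per pending task
        self.preferred = set(preferred)   # nodes this framework would rather use

    def place(self, offer):
        remaining = dict(offer)
        accepted, pending = [], []
        for cpus in self.tasks:
            # Try preferred nodes first, then the rest in name order.
            candidates = sorted(remaining, key=lambda n: (n not in self.preferred, n))
            node = next((n for n in candidates if remaining[n] >= cpus), None)
            if node is None:
                pending.append(cpus)      # nothing in the offer fits this task
            else:
                remaining[node] -= cpus
                accepted.append((node, cpus))
        self.tasks = pending
        return accepted
```

If the first framework fills its preferred node, the second framework is
never offered that node at all, so it cannot reason about evicting
lower-priority work there.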

-- 
Ruslan Dautkhanov

On Tue, Apr 24, 2018 at 2:22 PM, Bolke de Bruin <bd...@gmail.com> wrote:

> Happy to have it as a contrib executor. However, I personally think yarn
> is a dead end. It has a lot of catching up to do and all the momentum is
> with kubernetes.
>
> B.
>
> Sent from my iPad
>
> > On 24 Apr 2018, at 22:13, Ruslan Dautkhanov <da...@gmail.com> wrote:
> >
> > With Hadoop 3's Docker on YARN support, I think YARN becomes
> > somewhat a competitor for Kubernetes.
> >
> > Great job on adding k8s support to Airflow.
> >
> > Very similarly I see Airflow could integrate with YARN and use
> > its infrastructure as an "executor" .. has anyone explored the feasibility of
> > this approach?
> >
> >
> > Thanks!
> > Ruslan Dautkhanov
>

Re: Airflow - YARN as an executor?

Posted by Bolke de Bruin <bd...@gmail.com>.

Happy to have it as a contrib executor. However, I personally think YARN is a dead end. It has a lot of catching up to do and all the momentum is with Kubernetes.

B.

Sent from my iPad

> On 24 Apr 2018, at 22:13, Ruslan Dautkhanov <da...@gmail.com> wrote:
> 
> With Hadoop 3's Docker on YARN support, I think YARN becomes
> somewhat a competitor for Kubernetes.
> 
> Great job on adding k8s support to Airflow.
> 
> Very similarly I see Airflow could integrate with YARN and use
> its infrastructure as an "executor" .. has anyone explored the feasibility of
> this approach?
> 
> 
> Thanks!
> Ruslan Dautkhanov