Posted to user@spark.apache.org by Vikram Kone <vi...@gmail.com> on 2015/08/07 17:43:48 UTC

Spark job workflow engine recommendations

Hi,
I'm looking for open source workflow tools/engines that allow us to
schedule spark jobs on a datastax cassandra cluster. Since there are tonnes
of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc, I
wanted to check with people here to see what they are using today.

Some of the requirements of the workflow engine that I'm looking for are

1. First class support for submitting Spark jobs on Cassandra. Not some
wrapper Java code to submit tasks.
2. Active open source community support and well tested at production scale.
3. Should be dead easy to write job dependencies using XML or web
interface. Ex: job A depends on Job B and Job C, so run Job A after B and
C are finished. Don't need to write full blown java applications to specify
job parameters and dependencies. Should be very simple to use.
4. Time based recurrent scheduling. Run the spark jobs at a given time
every hour or day or week or month.
5. Job monitoring, alerting on failures and email notifications on daily
basis.

I have looked at Ooyala's spark job server which seems to be geared towards
making spark jobs run faster by sharing contexts between the jobs but isn't
a full blown workflow engine per se. A combination of spark job server and
workflow engine would be ideal.

Thanks for the inputs

Re: Spark job workflow engine recommendations

Posted by Nick Pentreath <ni...@gmail.com>.
I also tend to agree that Azkaban is somewhat easier to get set up. Though I haven't used the new UI for Oozie that is part of CDH, so perhaps that is another good option.

It's a pity Azkaban is a little rough in terms of documenting its API, and the scalability is an issue. However, it would perhaps be possible to have a few different instances running for different use cases / groups within the org.

—
Sent from Mailbox

On Wed, Aug 12, 2015 at 12:14 AM, Vikram Kone <vi...@gmail.com>
wrote:

> Hi Lars,
> Thanks for the brain dump. All the points you made about target audience,
> degree of high availability and time based scheduling instead of event
> based scheduling are all valid and make sense. In our case, most of our
> Devs are .net based and so xml or web based scheduling is preferred over
> something written in Java/Scala/Python. Based on my research so far on the
> available workflow managers today, azkaban is the easiest to adopt since it
> doesn't have any hard dependence on Hadoop and is easy to onboard and
> schedule jobs. I was able to install and execute some spark workflows in a
> day. Though the fact that it's being phased out at linkedin is troubling, I
> think it's the best suited for our use case today.
> Sent from Outlook
> On Sun, Aug 9, 2015 at 4:51 PM -0700, "Lars Albertsson" <la...@gmail.com> wrote:
> I used to maintain Luigi at Spotify, and got some insight in workflow
> manager characteristics and production behaviour in the process.
> I am evaluating options for my current employer, and the short list is
> basically: Luigi, Azkaban, Pinball, Airflow, and rolling our own. The
> latter is not necessarily more work than adapting an existing tool,
> since existing managers are typically more or less tied to the
> technology used by the company that created them.
> Are your users primarily developers building pipelines that drive
> data-intensive products, or are they analysts, producing business
> intelligence? These groups tend to have preferences for different
> types of tools and interfaces.
> I have a love/hate relationship with Luigi, but given your
> requirements, it is probably the best fit:
> * It has support for Spark, and it seems to be used and maintained.
> * It has no builtin support for Cassandra, but Cassandra is heavily
> used at Spotify. IIRC, the code required to support Cassandra targets
> is more or less trivial. There is no obvious single definition of a
> dataset in C*, so you'll have to come up with a convention and encode
> it as a Target subclass. I guess that is why it never made it outside
> Spotify.
> * The open source community is active and it is well tested in
> production at multiple sites.
> * It is easy to write dependencies, but in a Python DSL. If your users
> are developers, this is preferable over XML or a web interface. There
> are always quirks and odd constraints somewhere that require the
> expressive power of a programming language. It also allows you to
> create extensions without changing Luigi itself.
> * It does not have recurring scheduling builtin. Luigi needs a motor
> to get going, typically cron, installed on a few machines for
> redundancy. In a typical pipeline scenario, you give output datasets a
> time parameter, which arranges for a dataset to be produced each
> hour/day/week/month.
> * It supports failure notifications.
> Pinball and Airflow have similar architecture to Luigi, with a single
> central scheduler and workers that submit and execute jobs. They seem
> to be more solidly engineered at a glance, but less battle tested
> outside Pinterest/Airbnb, and they have fewer integrations to the data
> ecosystem.
> Azkaban has a different architecture and user interface, and seems
> more geared towards data scientists than developers; it has a good UI
> for controlling jobs, but writing extensions and controlling it
> programmatically seems more difficult than for Luigi.
> All of the tools above are centralised, and the central component can
> become a bottleneck and a single point of problem. I am not aware of
> any decentralised open source workflow managers, but you can run
> multiple instances and shard manually.
> Regarding recurring jobs, it is typically undesirable to blindly run
> jobs at a certain time. If you run jobs, e.g. with cron, and process
> whatever data is available in your input sources, your jobs become
> indeterministic and unreliable. If incoming data is late or missing,
> your jobs will fail or create artificial skews in output data, leading
> to confusing results. Moreover, if jobs fail or have bugs, it will be
> difficult to rerun them and get predictable results. This is why I
> don't think Chronos is a meaningful alternative for scheduling data
> processing.
> There are different strategies on this topic, but IMHO, it is easiest to
> create predictable and reliable pipelines by bucketing incoming data
> into datasets that you seal off, and mark ready for processing, and
> then use the workflow manager's DAG logic to process data when input
> datasets are available, rather than at a certain time. If you use
> Kafka for data collection, Secor can handle this logic for you.
> In addition to your requirements, there are IMHO a few more topics one
> needs to consider:
> * How are pipelines tested? I.e. if I change job B below, how can I be
> sure that the new output does not break A? You need to involve the
> workflow DAG in testing such scenarios.
> * How do you debug jobs and DAG problems? In case of trouble, can you
> figure out where the job logs are, or why a particular job does not
> start?
> * Do you need high availability for job scheduling? That will require
> additional components.
> This became a bit of a brain dump on the topic. I hope that it is
> useful. Don't hesitate to get back if I can help.
> Regards,
> Lars Albertsson
> On Fri, Aug 7, 2015 at 5:43 PM, Vikram Kone  wrote:
>> Hi,
>> I'm looking for open source workflow tools/engines that allow us to schedule
>> spark jobs on a datastax cassandra cluster. Since there are tonnes of
>> alternatives out there like Oozie, Azkaban, Luigi, Chronos etc, I wanted to
>> check with people here to see what they are using today.
>>
>> Some of the requirements of the workflow engine that I'm looking for are
>>
>> 1. First class support for submitting Spark jobs on Cassandra. Not some
>> wrapper Java code to submit tasks.
>> 2. Active open source community support and well tested at production scale.
>> 3. Should be dead easy to write job dependencies using XML or web interface
>> . Ex; job A depends on Job B and Job C, so run Job A after B and C are
>> finished. Don't need to write full blown java applications to specify job
>> parameters and dependencies. Should be very simple to use.
>> 4. Time based  recurrent scheduling. Run the spark jobs at a given time
>> every hour or day or week or month.
>> 5. Job monitoring, alerting on failures and email notifications on daily
>> basis.
>>
>> I have looked at Ooyala's spark job server which seems to be geared towards
>> making spark jobs run faster by sharing contexts between the jobs but isn't
>> a full blown workflow engine per se. A combination of spark job server and
>> workflow engine would be ideal.
>>
>> Thanks for the inputs

Re: Spark job workflow engine recommendations

Posted by Vikram Kone <vi...@gmail.com>.
Hi Lars,
Thanks for the brain dump. All the points you made about target audience,
degree of high availability and time based scheduling instead of event
based scheduling are all valid and make sense. In our case, most of our
Devs are .net based and so xml or web based scheduling is preferred over
something written in Java/Scala/Python. Based on my research so far on the
available workflow managers today, azkaban is the easiest to adopt since it
doesn't have any hard dependence on Hadoop and is easy to onboard and
schedule jobs. I was able to install and execute some spark workflows in a
day. Though the fact that it's being phased out at linkedin is troubling, I
think it's the best suited for our use case today.

Sent from Outlook

On Sun, Aug 9, 2015 at 4:51 PM -0700, "Lars Albertsson" <la...@gmail.com> wrote:

I used to maintain Luigi at Spotify, and got some insight in workflow
manager characteristics and production behaviour in the process.

I am evaluating options for my current employer, and the short list is
basically: Luigi, Azkaban, Pinball, Airflow, and rolling our own. The
latter is not necessarily more work than adapting an existing tool,
since existing managers are typically more or less tied to the
technology used by the company that created them.

Are your users primarily developers building pipelines that drive
data-intensive products, or are they analysts, producing business
intelligence? These groups tend to have preferences for different
types of tools and interfaces.

I have a love/hate relationship with Luigi, but given your
requirements, it is probably the best fit:

* It has support for Spark, and it seems to be used and maintained.

* It has no builtin support for Cassandra, but Cassandra is heavily
used at Spotify. IIRC, the code required to support Cassandra targets
is more or less trivial. There is no obvious single definition of a
dataset in C*, so you'll have to come up with a convention and encode
it as a Target subclass. I guess that is why it never made it outside
Spotify.

* The open source community is active and it is well tested in
production at multiple sites.

* It is easy to write dependencies, but in a Python DSL. If your users
are developers, this is preferable over XML or a web interface. There
are always quirks and odd constraints somewhere that require the
expressive power of a programming language. It also allows you to
create extensions without changing Luigi itself.

* It does not have recurring scheduling builtin. Luigi needs a motor
to get going, typically cron, installed on a few machines for
redundancy. In a typical pipeline scenario, you give output datasets a
time parameter, which arranges for a dataset to be produced each
hour/day/week/month.

* It supports failure notifications.


Pinball and Airflow have similar architecture to Luigi, with a single
central scheduler and workers that submit and execute jobs. They seem
to be more solidly engineered at a glance, but less battle tested
outside Pinterest/Airbnb, and they have fewer integrations to the data
ecosystem.

Azkaban has a different architecture and user interface, and seems
more geared towards data scientists than developers; it has a good UI
for controlling jobs, but writing extensions and controlling it
programmatically seems more difficult than for Luigi.

All of the tools above are centralised, and the central component can
become a bottleneck and a single point of problem. I am not aware of
any decentralised open source workflow managers, but you can run
multiple instances and shard manually.

Regarding recurring jobs, it is typically undesirable to blindly run
jobs at a certain time. If you run jobs, e.g. with cron, and process
whatever data is available in your input sources, your jobs become
indeterministic and unreliable. If incoming data is late or missing,
your jobs will fail or create artificial skews in output data, leading
to confusing results. Moreover, if jobs fail or have bugs, it will be
difficult to rerun them and get predictable results. This is why I
don't think Chronos is a meaningful alternative for scheduling data
processing.

There are different strategies on this topic, but IMHO, it is easiest to
create predictable and reliable pipelines by bucketing incoming data
into datasets that you seal off, and mark ready for processing, and
then use the workflow manager's DAG logic to process data when input
datasets are available, rather than at a certain time. If you use
Kafka for data collection, Secor can handle this logic for you.


In addition to your requirements, there are IMHO a few more topics one
needs to consider:
* How are pipelines tested? I.e. if I change job B below, how can I be
sure that the new output does not break A? You need to involve the
workflow DAG in testing such scenarios.
* How do you debug jobs and DAG problems? In case of trouble, can you
figure out where the job logs are, or why a particular job does not
start?
* Do you need high availability for job scheduling? That will require
additional components.


This became a bit of a brain dump on the topic. I hope that it is
useful. Don't hesitate to get back if I can help.

Regards,

Lars Albertsson



On Fri, Aug 7, 2015 at 5:43 PM, Vikram Kone  wrote:
> Hi,
> I'm looking for open source workflow tools/engines that allow us to schedule
> spark jobs on a datastax cassandra cluster. Since there are tonnes of
>> alternatives out there like Oozie, Azkaban, Luigi, Chronos etc, I wanted to
> check with people here to see what they are using today.
>
> Some of the requirements of the workflow engine that I'm looking for are
>
> 1. First class support for submitting Spark jobs on Cassandra. Not some
> wrapper Java code to submit tasks.
> 2. Active open source community support and well tested at production scale.
>> 3. Should be dead easy to write job dependencies using XML or web interface
> . Ex; job A depends on Job B and Job C, so run Job A after B and C are
> finished. Don't need to write full blown java applications to specify job
> parameters and dependencies. Should be very simple to use.
> 4. Time based  recurrent scheduling. Run the spark jobs at a given time
> every hour or day or week or month.
> 5. Job monitoring, alerting on failures and email notifications on daily
> basis.
>
>> I have looked at Ooyala's spark job server which seems to be geared towards
>> making spark jobs run faster by sharing contexts between the jobs but isn't
>> a full blown workflow engine per se. A combination of spark job server and
>> workflow engine would be ideal.
>
> Thanks for the inputs

Re: Spark job workflow engine recommendations

Posted by Lars Albertsson <la...@gmail.com>.
I used to maintain Luigi at Spotify, and got some insight in workflow
manager characteristics and production behaviour in the process.

I am evaluating options for my current employer, and the short list is
basically: Luigi, Azkaban, Pinball, Airflow, and rolling our own. The
latter is not necessarily more work than adapting an existing tool,
since existing managers are typically more or less tied to the
technology used by the company that created them.

Are your users primarily developers building pipelines that drive
data-intensive products, or are they analysts, producing business
intelligence? These groups tend to have preferences for different
types of tools and interfaces.

I have a love/hate relationship with Luigi, but given your
requirements, it is probably the best fit:

* It has support for Spark, and it seems to be used and maintained.

* It has no builtin support for Cassandra, but Cassandra is heavily
used at Spotify. IIRC, the code required to support Cassandra targets
is more or less trivial. There is no obvious single definition of a
dataset in C*, so you'll have to come up with a convention and encode
it as a Target subclass. I guess that is why it never made it outside
Spotify.

* The open source community is active and it is well tested in
production at multiple sites.

* It is easy to write dependencies, but in a Python DSL (see the sketch
after this list). If your users are developers, this is preferable over
XML or a web interface. There are always quirks and odd constraints
somewhere that require the expressive power of a programming language. It
also allows you to create extensions without changing Luigi itself.

* It does not have recurring scheduling builtin. Luigi needs a motor
to get going, typically cron, installed on a few machines for
redundancy. In a typical pipeline scenario, you give output datasets a
time parameter, which arranges for a dataset to be produced each
hour/day/week/month.

* It supports failure notifications.
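
For concreteness, a minimal sketch of the points above: a hypothetical
Target subclass encoding one possible "marker row" convention for Cassandra
datasets, plus tasks whose dependencies and time parameter are plain Python.
This is not Spotify's code; the keyspace, table and marker-table names are
invented.

import luigi
from cassandra.cluster import Cluster  # DataStax Python driver


class CassandraPartitionTarget(luigi.Target):
    """One dataset = one time bucket of a table, marked done by a marker row."""

    def __init__(self, keyspace, table, bucket):
        self.keyspace, self.table, self.bucket = keyspace, table, bucket

    def _session(self):
        return Cluster(["cassandra-host"]).connect(self.keyspace)

    def exists(self):
        rows = self._session().execute(
            "SELECT bucket FROM dataset_markers"
            " WHERE table_name=%s AND bucket=%s",
            (self.table, self.bucket))
        return len(list(rows)) > 0

    def touch(self):
        # Called by a task once its output has been completely written.
        self._session().execute(
            "INSERT INTO dataset_markers (table_name, bucket)"
            " VALUES (%s, %s)",
            (self.table, self.bucket))


class JobB(luigi.Task):
    # The time parameter: cron (the "motor") starts one instance per hour.
    hour = luigi.DateHourParameter()

    def output(self):
        return CassandraPartitionTarget("prod", "job_b_out", self.hour.isoformat())

    def run(self):
        # e.g. shell out to spark-submit here, then seal the dataset.
        self.output().touch()


class JobC(JobB):
    def output(self):
        return CassandraPartitionTarget("prod", "job_c_out", self.hour.isoformat())


class JobA(luigi.Task):
    hour = luigi.DateHourParameter()

    def requires(self):
        # "run Job A after B and C are finished" is just Python.
        return [JobB(hour=self.hour), JobC(hour=self.hour)]

    def output(self):
        return CassandraPartitionTarget("prod", "job_a_out", self.hour.isoformat())

    def run(self):
        self.output().touch()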


Pinball and Airflow have similar architecture to Luigi, with a single
central scheduler and workers that submit and execute jobs. They seem
to be more solidly engineered at a glance, but less battle tested
outside Pinterest/Airbnb, and they have fewer integrations to the data
ecosystem.

Azkaban has a different architecture and user interface, and seems
more geared towards data scientists than developers; it has a good UI
for controlling jobs, but writing extensions and controlling it
programmatically seems more difficult than for Luigi.

All of the tools above are centralised, and the central component can
become a bottleneck and a single point of problem. I am not aware of
any decentralised open source workflow managers, but you can run
multiple instances and shard manually.

Regarding recurring jobs, it is typically undesirable to blindly run
jobs at a certain time. If you run jobs, e.g. with cron, and process
whatever data is available in your input sources, your jobs become
indeterministic and unreliable. If incoming data is late or missing,
your jobs will fail or create artificial skews in output data, leading
to confusing results. Moreover, if jobs fail or have bugs, it will be
difficult to rerun them and get predictable results. This is why I
don't think Chronos is a meaningful alternative for scheduling data
processing.

There are different strategies on this topic, but IMHO, it is easiest to
create predictable and reliable pipelines by bucketing incoming data
into datasets that you seal off, and mark ready for processing, and
then use the workflow manager's DAG logic to process data when input
datasets are available, rather than at a certain time. If you use
Kafka for data collection, Secor can handle this logic for you.
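
A sketch of that pattern, reusing the hypothetical CassandraPartitionTarget
above: the sealed input bucket is modeled as an ExternalTask, so downstream
work is triggered by data availability rather than by the clock.

import luigi


class SealedEventBucket(luigi.ExternalTask):
    # Produced and sealed by the ingestion side (e.g. Secor), not here.
    hour = luigi.DateHourParameter()

    def output(self):
        return CassandraPartitionTarget("prod", "raw_events", self.hour.isoformat())


class ProcessEvents(luigi.Task):
    hour = luigi.DateHourParameter()

    def requires(self):
        # Gated on the sealed dataset, not on wall-clock time.
        return SealedEventBucket(hour=self.hour)

    def output(self):
        return CassandraPartitionTarget("prod", "events_clean", self.hour.isoformat())

    def run(self):
        self.output().touch()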


In addition to your requirements, there are IMHO a few more topics one
needs to consider:
* How are pipelines tested? I.e. if I change job B below, how can I be
sure that the new output does not break A? You need to involve the
workflow DAG in testing such scenarios (see the sketch after this list).
* How do you debug jobs and DAG problems? In case of trouble, can you
figure out where the job logs are, or why a particular job does not
start?
* Do you need high availability for job scheduling? That will require
additional components.
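
On the first question, one way to involve the DAG itself is to run the
whole flow against Luigi's local scheduler in a test, with the hypothetical
targets above pointed at a throwaway keyspace. A sketch:

import datetime

import luigi


def test_job_a_runs_after_b_and_c():
    hour = datetime.datetime(2015, 8, 1, 0)
    # Builds JobB and JobC first, then JobA, exactly as in production.
    luigi.build([JobA(hour=hour)], local_scheduler=True)
    assert JobA(hour=hour).output().exists()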


This became a bit of a brain dump on the topic. I hope that it is
useful. Don't hesitate to get back if I can help.

Regards,

Lars Albertsson



On Fri, Aug 7, 2015 at 5:43 PM, Vikram Kone <vi...@gmail.com> wrote:
> Hi,
> I'm looking for open source workflow tools/engines that allow us to schedule
> spark jobs on a datastax cassandra cluster. Since there are tonnes of
> alternatives out there like Oozie, Azkaban, Luigi, Chronos etc, I wanted to
> check with people here to see what they are using today.
>
> Some of the requirements of the workflow engine that I'm looking for are
>
> 1. First class support for submitting Spark jobs on Cassandra. Not some
> wrapper Java code to submit tasks.
> 2. Active open source community support and well tested at production scale.
> 3. Should be dead easy to write job dependencies using XML or web interface
> . Ex; job A depends on Job B and Job C, so run Job A after B and C are
> finished. Don't need to write full blown java applications to specify job
> parameters and dependencies. Should be very simple to use.
> 4. Time based  recurrent scheduling. Run the spark jobs at a given time
> every hour or day or week or month.
> 5. Job monitoring, alerting on failures and email notifications on daily
> basis.
>
> I have looked at Ooyala's spark job server which seems to be geared towards
> making spark jobs run faster by sharing contexts between the jobs but isn't
> a full blown workflow engine per se. A combination of spark job server and
> workflow engine would be ideal.
>
> Thanks for the inputs



Re: Spark job workflow engine recommendations

Posted by Vikram Kone <vi...@gmail.com>.
Thanks for the suggestion Hien. I'm curious why not azkaban from linkedin.
From what I read online, Oozie was very cumbersome to set up and use compared
to azkaban. Since you are from linkedin, I wanted to get some perspective on
what it lacks compared to Oozie. Ease of use is more important to us than a
full feature set.

On Friday, August 7, 2015, Hien Luu <hl...@linkedin.com> wrote:

> Looks like Oozie can satisfy most of your requirements.
>
>
>
> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone <vikramkone@gmail.com> wrote:
>
>> Hi,
>> I'm looking for open source workflow tools/engines that allow us to
>> schedule spark jobs on a datastax cassandra cluster. Since there are tonnes
>> of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc, I
>> wanted to check with people here to see what they are using today.
>>
>> Some of the requirements of the workflow engine that I'm looking for are
>>
>> 1. First class support for submitting Spark jobs on Cassandra. Not some
>> wrapper Java code to submit tasks.
>> 2. Active open source community support and well tested at production
>> scale.
>> 3. Should be dead easy to write job dependencies using XML or web
>> interface . Ex; job A depends on Job B and Job C, so run Job A after B and
>> C are finished. Don't need to write full blown java applications to specify
>> job parameters and dependencies. Should be very simple to use.
>> 4. Time based  recurrent scheduling. Run the spark jobs at a given time
>> every hour or day or week or month.
>> 5. Job monitoring, alerting on failures and email notifications on daily
>> basis.
>>
>> I have looked at Ooyala's spark job server which seems to be geared
>> towards making spark jobs run faster by sharing contexts between the jobs
>> but isn't a full blown workflow engine per se. A combination of spark job
>> server and workflow engine would be ideal.
>>
>> Thanks for the inputs
>>
>
>

Re: Spark job workflow engine recommendations

Posted by Fengdong Yu <fe...@everstring.com>.
Yes, you can submit job remotely.
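
For readers wondering what that looks like, a minimal Airflow DAG of that
era can shell out to spark-submit against a remote master from a
BashOperator. The operator import path, master URL, class names and jar
path below are assumptions for illustration, not necessarily the setup
described here.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("spark_pipeline", start_date=datetime(2015, 11, 1),
          schedule_interval="@hourly")

job_b = BashOperator(
    task_id="job_b",
    bash_command="spark-submit --master spark://remote-master:7077 "
                 "--class com.example.JobB /jobs/pipeline.jar ",
    dag=dag)

job_a = BashOperator(
    task_id="job_a",
    bash_command="spark-submit --master spark://remote-master:7077 "
                 "--class com.example.JobA /jobs/pipeline.jar ",
    dag=dag)

job_a.set_upstream(job_b)  # run job_a only after job_b succeeds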



> On Nov 19, 2015, at 10:10 AM, Vikram Kone <vi...@gmail.com> wrote:
> 
> Hi Feng,
> Does airflow allow remote submissions of spark jobs via spark-submit?
> 
> On Wed, Nov 18, 2015 at 6:01 PM, Fengdong Yu <fengdongy@everstring.com> wrote:
> Hi,
> 
> we use 'Airflow' as our job workflow scheduler.
> 
> 
> 
> 
>> On Nov 19, 2015, at 9:47 AM, Vikram Kone <vikramkone@gmail.com> wrote:
>> 
>> Hi Nick,
>> Quick question about spark-submit command executed from azkaban with command job type.
>> I see that when I press kill in azkaban portal on a spark-submit job, it doesn't actually kill the application on spark master and it continues to run even though azkaban thinks that it's killed.
>> How do you get around this? Is there a way to kill the spark-submit jobs from azkaban portal?
>> 
>> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
>> Hi Vikram,
>> 
>> We use Azkaban (2.5.0) in our production workflow scheduling. We just use local mode deployment and it is fairly easy to set up. It is pretty easy to use and has a nice scheduling and logging interface, as well as SLAs (like kill job and notify if it doesn't complete in 3 hours or whatever). 
>> 
>> However Spark support is not present directly - we run everything with shell scripts and spark-submit. There is a plugin interface where one could create a Spark plugin, but I found it very cumbersome when I did investigate and didn't have the time to work through it to develop that.
>> 
>> It has some quirks and while there is actually a REST API for adding jobs and dynamically scheduling jobs, it is not documented anywhere so you kinda have to figure it out for yourself. But in terms of ease of use I found it way better than Oozie. I haven't tried Chronos, and it seemed quite involved to set up. Haven't tried Luigi either.
>> 
>> Spark job server is good but as you say lacks some stuff like scheduling and DAG type workflows (independent of spark-defined job flows).
>> 
>> 
>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke <jornfranke@gmail.com> wrote:
>> Check also falcon in combination with oozie
>> 
>> On Fri, Aug 7, 2015 at 17:51, Hien Luu <hluu@linkedin.com.invalid> wrote:
>> Looks like Oozie can satisfy most of your requirements. 
>> 
>> 
>> 
>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone <vikramkone@gmail.com> wrote:
>> Hi,
>> I'm looking for open source workflow tools/engines that allow us to schedule spark jobs on a datastax cassandra cluster. Since there are tonnes of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc, I wanted to check with people here to see what they are using today.
>> 
>> Some of the requirements of the workflow engine that I'm looking for are
>> 
>> 1. First class support for submitting Spark jobs on Cassandra. Not some wrapper Java code to submit tasks.
>> 2. Active open source community support and well tested at production scale.
>> 3. Should be dead easy to write job dependencies using XML or web interface. Ex: job A depends on Job B and Job C, so run Job A after B and C are finished. Don't need to write full blown java applications to specify job parameters and dependencies. Should be very simple to use.
>> 4. Time based  recurrent scheduling. Run the spark jobs at a given time every hour or day or week or month.
>> 5. Job monitoring, alerting on failures and email notifications on daily basis.
>> 
>> I have looked at Ooyala's spark job server which seems to be geared towards making spark jobs run faster by sharing contexts between the jobs but isn't a full blown workflow engine per se. A combination of spark job server and workflow engine would be ideal.
>> 
>> Thanks for the inputs
>> 
>> 
>> 
> 
> 


Re: Spark job workflow engine recommendations

Posted by Vikram Kone <vi...@gmail.com>.
Hi Feng,
Does airflow allow remote submissions of spark jobs via spark-submit?

On Wed, Nov 18, 2015 at 6:01 PM, Fengdong Yu <fe...@everstring.com>
wrote:

> Hi,
>
> we use 'Airflow' as our job workflow scheduler.
>
>
>
>
> On Nov 19, 2015, at 9:47 AM, Vikram Kone <vi...@gmail.com> wrote:
>
> Hi Nick,
> Quick question about spark-submit command executed from azkaban with
> command job type.
> I see that when I press kill in azkaban portal on a spark-submit job, it
> doesn't actually kill the application on spark master and it continues to
> run even though azkaban thinks that it's killed.
> How do you get around this? Is there a way to kill the spark-submit jobs
> from azkaban portal?
>
> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath <ni...@gmail.com>
> wrote:
>
>> Hi Vikram,
>>
>> We use Azkaban (2.5.0) in our production workflow scheduling. We just use
>> local mode deployment and it is fairly easy to set up. It is pretty easy to
>> use and has a nice scheduling and logging interface, as well as SLAs (like
>> kill job and notify if it doesn't complete in 3 hours or whatever).
>>
>> However Spark support is not present directly - we run everything with
>> shell scripts and spark-submit. There is a plugin interface where one could
>> create a Spark plugin, but I found it very cumbersome when I did
>> investigate and didn't have the time to work through it to develop that.
>>
>> It has some quirks and while there is actually a REST API for adding jobs
>> and dynamically scheduling jobs, it is not documented anywhere so you kinda
>> have to figure it out for yourself. But in terms of ease of use I found it
>> way better than Oozie. I haven't tried Chronos, and it seemed quite
>> involved to set up. Haven't tried Luigi either.
>>
>> Spark job server is good but as you say lacks some stuff like scheduling
>> and DAG type workflows (independent of spark-defined job flows).
>>
>>
>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke <jo...@gmail.com> wrote:
>>
>>> Check also falcon in combination with oozie
>>>
>>> On Fri, Aug 7, 2015 at 17:51, Hien Luu <hl...@linkedin.com.invalid> wrote:
>>>
>>>> Looks like Oozie can satisfy most of your requirements.
>>>>
>>>>
>>>>
>>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone <vi...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> I'm looking for open source workflow tools/engines that allow us to
>>>>> schedule spark jobs on a datastax cassandra cluster. Since there are tonnes
>>>>> of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc, I
>>>>> wanted to check with people here to see what they are using today.
>>>>>
>>>>> Some of the requirements of the workflow engine that I'm looking for
>>>>> are
>>>>>
>>>>> 1. First class support for submitting Spark jobs on Cassandra. Not
>>>>> some wrapper Java code to submit tasks.
>>>>> 2. Active open source community support and well tested at production
>>>>> scale.
>>>>> 3. Should be dead easy to write job dependencies using XML or web
>>>>> interface . Ex; job A depends on Job B and Job C, so run Job A after B and
>>>>> C are finished. Don't need to write full blown java applications to specify
>>>>> job parameters and dependencies. Should be very simple to use.
>>>>> 4. Time based  recurrent scheduling. Run the spark jobs at a given
>>>>> time every hour or day or week or month.
>>>>> 5. Job monitoring, alerting on failures and email notifications on
>>>>> daily basis.
>>>>>
>>>>> I have looked at Ooyala's spark job server which seems to be geared
>>>>> towards making spark jobs run faster by sharing contexts between the jobs
>>>>> but isn't a full blown workflow engine per se. A combination of spark job
>>>>> server and workflow engine would be ideal.
>>>>>
>>>>> Thanks for the inputs
>>>>>
>>>>
>>>>
>>
>
>

Re: Spark job workflow engine recommendations

Posted by Fengdong Yu <fe...@everstring.com>.
Hi,

we use 'Airflow' as our job workflow scheduler.




> On Nov 19, 2015, at 9:47 AM, Vikram Kone <vi...@gmail.com> wrote:
> 
> Hi Nick,
> Quick question about spark-submit command executed from azkaban with command job type.
> I see that when I press kill in azkaban portal on a spark-submit job, it doesn't actually kill the application on spark master and it continues to run even though azkaban thinks that it's killed.
> How do you get around this? Is there a way to kill the spark-submit jobs from azkaban portal?
> 
> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
> Hi Vikram,
> 
> We use Azkaban (2.5.0) in our production workflow scheduling. We just use local mode deployment and it is fairly easy to set up. It is pretty easy to use and has a nice scheduling and logging interface, as well as SLAs (like kill job and notify if it doesn't complete in 3 hours or whatever). 
> 
> However Spark support is not present directly - we run everything with shell scripts and spark-submit. There is a plugin interface where one could create a Spark plugin, but I found it very cumbersome when I did investigate and didn't have the time to work through it to develop that.
> 
> It has some quirks and while there is actually a REST API for adding jobs and dynamically scheduling jobs, it is not documented anywhere so you kinda have to figure it out for yourself. But in terms of ease of use I found it way better than Oozie. I haven't tried Chronos, and it seemed quite involved to set up. Haven't tried Luigi either.
> 
> Spark job server is good but as you say lacks some stuff like scheduling and DAG type workflows (independent of spark-defined job flows).
> 
> 
> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke <jornfranke@gmail.com> wrote:
> Check also falcon in combination with oozie
> 
> On Fri, Aug 7, 2015 at 17:51, Hien Luu <hl...@linkedin.com.invalid> wrote:
> Looks like Oozie can satisfy most of your requirements. 
> 
> 
> 
> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone <vikramkone@gmail.com> wrote:
> Hi,
> I'm looking for open source workflow tools/engines that allow us to schedule spark jobs on a datastax cassandra cluster. Since there are tonnes of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc, I wanted to check with people here to see what they are using today.
> 
> Some of the requirements of the workflow engine that I'm looking for are
> 
> 1. First class support for submitting Spark jobs on Cassandra. Not some wrapper Java code to submit tasks.
> 2. Active open source community support and well tested at production scale.
> 3. Should be dead easy to write job dependencies using XML or web interface. Ex: job A depends on Job B and Job C, so run Job A after B and C are finished. Don't need to write full blown java applications to specify job parameters and dependencies. Should be very simple to use.
> 4. Time based  recurrent scheduling. Run the spark jobs at a given time every hour or day or week or month.
> 5. Job monitoring, alerting on failures and email notifications on daily basis.
> 
> I have looked at Ooyala's spark job server which seems to be geared towards making spark jobs run faster by sharing contexts between the jobs but isn't a full blown workflow engine per se. A combination of spark job server and workflow engine would be ideal.
> 
> Thanks for the inputs
> 
> 
> 


Re: Spark job workflow engine recommendations

Posted by Vikram Kone <vi...@gmail.com>.
Hi Nick,
Quick question about spark-submit command executed from azkaban with
command job type.
I see that when I press kill in azkaban portal on a spark-submit job, it
doesn't actually kill the application on spark master and it continues to
run even though azkaban thinks that it's killed.
How do you get around this? Is there a way to kill the spark-submit jobs
from azkaban portal?
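
One possible workaround, sketched here under several assumptions (standalone
cluster deploy mode with its REST gateway, a submission id printed in the
spark-submit output, and Azkaban delivering SIGTERM to the command process):
wrap spark-submit so that a kill is forwarded to the Spark master via
spark-submit --kill.

import re
import signal
import subprocess
import sys

MASTER = "spark://master:6066"  # standalone REST submission port (assumed)

proc = subprocess.Popen(
    ["spark-submit", "--master", MASTER, "--deploy-mode", "cluster",
     "--class", "com.org.classname", "./test.jar"],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

submission_id = None

def forward_kill(signum, frame):
    # Azkaban kills this wrapper; pass the kill on to the Spark master.
    if submission_id:
        subprocess.call(["spark-submit", "--master", MASTER,
                         "--kill", submission_id])
    proc.terminate()
    sys.exit(1)

signal.signal(signal.SIGTERM, forward_kill)

for raw in proc.stdout:
    line = raw.decode(errors="replace")
    sys.stdout.write(line)
    # Submission ids look like driver-20151119101010-0001 (assumed format).
    match = re.search(r"driver-\d{14}-\d{4}", line)
    if match:
        submission_id = match.group(0)

sys.exit(proc.wait())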

On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath <ni...@gmail.com>
wrote:

> Hi Vikram,
>
> We use Azkaban (2.5.0) in our production workflow scheduling. We just use
> local mode deployment and it is fairly easy to set up. It is pretty easy to
> use and has a nice scheduling and logging interface, as well as SLAs (like
> kill job and notify if it doesn't complete in 3 hours or whatever).
>
> However Spark support is not present directly - we run everything with
> shell scripts and spark-submit. There is a plugin interface where one could
> create a Spark plugin, but I found it very cumbersome when I did
> investigate and didn't have the time to work through it to develop that.
>
> It has some quirks and while there is actually a REST API for adding jobs
> and dynamically scheduling jobs, it is not documented anywhere so you kinda
> have to figure it out for yourself. But in terms of ease of use I found it
> way better than Oozie. I haven't tried Chronos, and it seemed quite
> involved to set up. Haven't tried Luigi either.
>
> Spark job server is good but as you say lacks some stuff like scheduling
> and DAG type workflows (independent of spark-defined job flows).
>
>
> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke <jo...@gmail.com> wrote:
>
>> Check also falcon in combination with oozie
>>
>> On Fri, Aug 7, 2015 at 17:51, Hien Luu <hl...@linkedin.com.invalid> wrote:
>>
>>> Looks like Oozie can satisfy most of your requirements.
>>>
>>>
>>>
>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone <vi...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>> I'm looking for open source workflow tools/engines that allow us to
>>>> schedule spark jobs on a datastax cassandra cluster. Since there are tonnes
>>>> of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc, I
>>>> wanted to check with people here to see what they are using today.
>>>>
>>>> Some of the requirements of the workflow engine that I'm looking for are
>>>>
>>>> 1. First class support for submitting Spark jobs on Cassandra. Not some
>>>> wrapper Java code to submit tasks.
>>>> 2. Active open source community support and well tested at production
>>>> scale.
>>>> 3. Should be dead easy to write job dependencies using XML or web
>>>> interface . Ex; job A depends on Job B and Job C, so run Job A after B and
>>>> C are finished. Don't need to write full blown java applications to specify
>>>> job parameters and dependencies. Should be very simple to use.
>>>> 4. Time based  recurrent scheduling. Run the spark jobs at a given time
>>>> every hour or day or week or month.
>>>> 5. Job monitoring, alerting on failures and email notifications on
>>>> daily basis.
>>>>
>>>> I have looked at Ooyala's spark job server which seems to be geared
>>>> towards making spark jobs run faster by sharing contexts between the jobs
>>>> but isn't a full blown workflow engine per se. A combination of spark job
>>>> server and workflow engine would be ideal.
>>>>
>>>> Thanks for the inputs
>>>>
>>>
>>>
>

Re: Spark job workflow engine recommendations

Posted by Vikram Kone <vi...@gmail.com>.
Oh ok. That's a good enough reason against azkaban then. So looks like
Oozie is the best choice here.

On Friday, August 7, 2015, Ted Yu <yu...@gmail.com> wrote:

> From what I heard (an ex-coworker who is Oozie committer), Azkaban is
> being phased out at LinkedIn because of scalability issues (though UI-wise,
> Azkaban seems better).
>
> Vikram:
> I suggest you do more research in related projects (maybe using their
> mailing lists).
>
> Disclaimer: I don't work for LinkedIn.
>
> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
>
>> Hi Vikram,
>>
>> We use Azkaban (2.5.0) in our production workflow scheduling. We just use
>> local mode deployment and it is fairly easy to set up. It is pretty easy to
>> use and has a nice scheduling and logging interface, as well as SLAs (like
>> kill job and notify if it doesn't complete in 3 hours or whatever).
>>
>> However Spark support is not present directly - we run everything with
>> shell scripts and spark-submit. There is a plugin interface where one could
>> create a Spark plugin, but I found it very cumbersome when I did
>> investigate and didn't have the time to work through it to develop that.
>>
>> It has some quirks and while there is actually a REST API for adding jobs
>> and dynamically scheduling jobs, it is not documented anywhere so you kinda
>> have to figure it out for yourself. But in terms of ease of use I found it
>> way better than Oozie. I haven't tried Chronos, and it seemed quite
>> involved to set up. Haven't tried Luigi either.
>>
>> Spark job server is good but as you say lacks some stuff like scheduling
>> and DAG type workflows (independent of spark-defined job flows).
>>
>>
>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke <jornfranke@gmail.com> wrote:
>>
>>> Check also falcon in combination with oozie
>>>
>>> On Fri, Aug 7, 2015 at 17:51, Hien Luu <hl...@linkedin.com.invalid> wrote:
>>>
>>>> Looks like Oozie can satisfy most of your requirements.
>>>>
>>>>
>>>>
>>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone <vikramkone@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> I'm looking for open source workflow tools/engines that allow us to
>>>>> schedule spark jobs on a datastax cassandra cluster. Since there are tonnes
>>>>> of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc, I
>>>>> wanted to check with people here to see what they are using today.
>>>>>
>>>>> Some of the requirements of the workflow engine that I'm looking for
>>>>> are
>>>>>
>>>>> 1. First class support for submitting Spark jobs on Cassandra. Not
>>>>> some wrapper Java code to submit tasks.
>>>>> 2. Active open source community support and well tested at production
>>>>> scale.
>>>>> 3. Should be dead easy to write job dependencies using XML or web
>>>>> interface . Ex; job A depends on Job B and Job C, so run Job A after B and
>>>>> C are finished. Don't need to write full blown java applications to specify
>>>>> job parameters and dependencies. Should be very simple to use.
>>>>> 4. Time based  recurrent scheduling. Run the spark jobs at a given
>>>>> time every hour or day or week or month.
>>>>> 5. Job monitoring, alerting on failures and email notifications on
>>>>> daily basis.
>>>>>
>>>>> I have looked at Ooyala's spark job server which seems to be geared
>>>>> towards making spark jobs run faster by sharing contexts between the jobs
>>>>> but isn't a full blown workflow engine per se. A combination of spark job
>>>>> server and workflow engine would be ideal.
>>>>>
>>>>> Thanks for the inputs
>>>>>
>>>>
>>>>
>>
>

Re: Spark job workflow engine recommendations

Posted by Nick Pentreath <ni...@gmail.com>.
We're also using Azkaban for scheduling, and we simply use spark-submit via shell scripts. It works fine.




The auto retry feature with a large number of retries (like 100 or 1000 perhaps) should take care of long-running jobs with restarts on failure. We haven't used it for streaming yet though we have long-running jobs and Azkaban won't kill them unless an SLA is in place.
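
In .job terms those retry knobs are plain properties; a sketch, assuming
the retries and retry.backoff properties are available in this era of
Azkaban (script name invented):

# streaming-job.job -- command-type job that restarts on failure
type=command
command=sh run_streaming_job.sh
retries=1000
retry.backoff=60000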









—
Sent from Mailbox

On Wed, Oct 7, 2015 at 7:18 PM, Vikram Kone <vi...@gmail.com> wrote:

> Hien,
> I saw this pull request and from what I understand this is geared towards
> running spark jobs over hadoop. We are using spark over cassandra and not
> sure if this new jobtype supports that. I haven't seen any documentation in
> regards to how to use this spark job plugin, so that I can test it out on
> our cluster.
> We are currently submitting our spark jobs with the command job type, using the
> following command: "dse spark-submit --class com.org.classname ./test.jar"
> etc. What would be the advantage of using the native spark job type over
> command job type?
> I didn't understand from your reply if azkaban already supports long
> running jobs like spark streaming... does it? Streaming jobs generally need
> to be running indefinitely or forever, and need to be restarted if for some
> reason they fail (lack of resources, maybe). I can probably use the auto
> retry feature for this, but I'm not sure.
> I'm looking forward to the multiple executor support which will greatly
> enhance the scalability issue.
> On Wed, Oct 7, 2015 at 9:56 AM, Hien Luu <hl...@linkedin.com> wrote:
>> The spark job type was added recently - see this pull request
>> https://github.com/azkaban/azkaban-plugins/pull/195.  You can leverage
>> the SLA feature to kill a job if it ran longer than expected.
>>
>> BTW, we just solved the scalability issue by supporting multiple
>> executors.  Within a week or two, the code for that should be merged in the
>> main trunk.
>>
>> Hien
>>
>> On Tue, Oct 6, 2015 at 9:40 PM, Vikram Kone <vi...@gmail.com> wrote:
>>
>>> Does Azkaban support scheduling long running jobs like spark streaming
>>> jobs? Will Azkaban kill a job if it's running for a long time.
>>>
>>>
>>> On Friday, August 7, 2015, Vikram Kone <vi...@gmail.com> wrote:
>>>
>>>> Hien,
>>>> Is Azkaban being phased out at linkedin as rumored? If so, what's
>>>> linkedin going to use for workflow scheduling? Is there something else
>>>> that's going to replace Azkaban?
>>>>
>>>> On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu <yu...@gmail.com> wrote:
>>>>
>>>>> In my opinion, choosing some particular project among its peers should
>>>>> leave enough room for future growth (which may come faster than you
>>>>> initially think).
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Fri, Aug 7, 2015 at 11:23 AM, Hien Luu <hl...@linkedin.com> wrote:
>>>>>
>>>>>> Scalability is a known issue due to the current architecture.
>>>>>> However this will be applicable if you run more than 20K jobs per day.
>>>>>>
>>>>>> On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu <yu...@gmail.com> wrote:
>>>>>>
>>>>>>> From what I heard (an ex-coworker who is Oozie committer), Azkaban
>>>>>>> is being phased out at LinkedIn because of scalability issues (though
>>>>>>> UI-wise, Azkaban seems better).
>>>>>>>
>>>>>>> Vikram:
>>>>>>> I suggest you do more research in related projects (maybe using their
>>>>>>> mailing lists).
>>>>>>>
>>>>>>> Disclaimer: I don't work for LinkedIn.
>>>>>>>
>>>>>>> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath <
>>>>>>> nick.pentreath@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Vikram,
>>>>>>>>
>>>>>>>> We use Azkaban (2.5.0) in our production workflow scheduling. We
>>>>>>>> just use local mode deployment and it is fairly easy to set up. It is
>>>>>>>> pretty easy to use and has a nice scheduling and logging interface, as well
>>>>>>>> as SLAs (like kill job and notify if it doesn't complete in 3 hours or
>>>>>>>> whatever).
>>>>>>>>
>>>>>>>> However Spark support is not present directly - we run everything
>>>>>>>> with shell scripts and spark-submit. There is a plugin interface where one
>>>>>>>> could create a Spark plugin, but I found it very cumbersome when I did
>>>>>>>> investigate and didn't have the time to work through it to develop that.
>>>>>>>>
>>>>>>>> It has some quirks and while there is actually a REST API for adding
>>>>>>>> jobs and dynamically scheduling jobs, it is not documented anywhere so you
>>>>>>>> kinda have to figure it out for yourself. But in terms of ease of use I
>>>>>>>> found it way better than Oozie. I haven't tried Chronos, and it seemed
>>>>>>>> quite involved to set up. Haven't tried Luigi either.
>>>>>>>>
>>>>>>>> Spark job server is good but as you say lacks some stuff like
>>>>>>>> scheduling and DAG type workflows (independent of spark-defined job flows).
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke <jo...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Check also falcon in combination with oozie
>>>>>>>>>
>>>>>>>>> On Fri, Aug 7, 2015 at 17:51, Hien Luu <hl...@linkedin.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Looks like Oozie can satisfy most of your requirements.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone <vi...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>> I'm looking for open source workflow tools/engines that allow us
>>>>>>>>>>> to schedule spark jobs on a datastax cassandra cluster. Since there are
>>>>>>>>>>> tonnes of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc,
>>>>>>>>>>> I wanted to check with people here to see what they are using today.
>>>>>>>>>>>
>>>>>>>>>>> Some of the requirements of the workflow engine that I'm looking
>>>>>>>>>>> for are
>>>>>>>>>>>
>>>>>>>>>>> 1. First class support for submitting Spark jobs on Cassandra.
>>>>>>>>>>> Not some wrapper Java code to submit tasks.
>>>>>>>>>>> 2. Active open source community support and well tested at
>>>>>>>>>>> production scale.
>>>>>>>>>>> 3. Should be dead easy to write job dependencies using XML or
>>>>>>>>>>> web interface . Ex; job A depends on Job B and Job C, so run Job A after B
>>>>>>>>>>> and C are finished. Don't need to write full blown java applications to
>>>>>>>>>>> specify job parameters and dependencies. Should be very simple to use.
>>>>>>>>>>> 4. Time based  recurrent scheduling. Run the spark jobs at a
>>>>>>>>>>> given time every hour or day or week or month.
>>>>>>>>>>> 5. Job monitoring, alerting on failures and email notifications
>>>>>>>>>>> on daily basis.
>>>>>>>>>>>
>>>>>>>>>>> I have looked at Ooyala's spark job server which seems to be
>>>>>>>>>>> geared towards making spark jobs run faster by sharing contexts between the
>>>>>>>>>>> jobs but isn't a full blown workflow engine per se. A combination of spark
>>>>>>>>>>> job server and workflow engine would be ideal.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the inputs
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>

Re: Spark job workflow engine recommendations

Posted by Vikram Kone <vi...@gmail.com>.
Hien,
I saw this pull request and from what I understand this is geared towards
running spark jobs over hadoop. We are using spark over cassandra and not
sure if this new jobtype supports that. I haven't seen any documentation in
regards to how to use this spark job plugin, so that I can test it out on
our cluster.
We are currently submitting our spark jobs with the command job type, using the
following command: "dse spark-submit --class com.org.classname ./test.jar"
etc. What would be the advantage of using the native spark job type over
command job type?
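
For reference, the command-job-type setup described above is just a couple
of .job property files, with dependencies declared by job name (file and
class names invented for the second job):

# jobB.job
type=command
command=dse spark-submit --class com.org.JobB ./test.jar

# jobA.job -- runs only after jobB and jobC succeed
type=command
command=dse spark-submit --class com.org.classname ./test.jar
dependencies=jobB,jobC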

I didn't understand from your reply if azkaban already supports long
running jobs like spark streaming... does it? Streaming jobs generally need
to be running indefinitely or forever, and need to be restarted if for some
reason they fail (lack of resources, maybe). I can probably use the auto
retry feature for this, but I'm not sure.

I'm looking forward to the multiple executor support which will greatly
enhance the scalability issue.

On Wed, Oct 7, 2015 at 9:56 AM, Hien Luu <hl...@linkedin.com> wrote:

> The spark job type was added recently - see this pull request
> https://github.com/azkaban/azkaban-plugins/pull/195.  You can leverage
> the SLA feature to kill a job if it ran longer than expected.
>
> BTW, we just solved the scalability issue by supporting multiple
> executors.  Within a week or two, the code for that should be merged in the
> main trunk.
>
> Hien
>
> On Tue, Oct 6, 2015 at 9:40 PM, Vikram Kone <vi...@gmail.com> wrote:
>
>> Does Azkaban support scheduling long running jobs like spark streaming
>> jobs? Will Azkaban kill a job if it's running for a long time.
>>
>>
>> On Friday, August 7, 2015, Vikram Kone <vi...@gmail.com> wrote:
>>
>>> Hien,
>>> Is Azkaban being phased out at linkedin as rumored? If so, what's
>>> linkedin going to use for workflow scheduling? Is there something else
>>> that's going to replace Azkaban?
>>>
>>> On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu <yu...@gmail.com> wrote:
>>>
>>>> In my opinion, choosing some particular project among its peers should
>>>> leave enough room for future growth (which may come faster than you
>>>> initially think).
>>>>
>>>> Cheers
>>>>
>>>> On Fri, Aug 7, 2015 at 11:23 AM, Hien Luu <hl...@linkedin.com> wrote:
>>>>
>>>>> Scalability is a known issue due to the current architecture.
>>>>> However this will be applicable if you run more than 20K jobs per day.
>>>>>
>>>>> On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu <yu...@gmail.com> wrote:
>>>>>
>>>>>> From what I heard (an ex-coworker who is Oozie committer), Azkaban
>>>>>> is being phased out at LinkedIn because of scalability issues (though
>>>>>> UI-wise, Azkaban seems better).
>>>>>>
>>>>>> Vikram:
>>>>>> I suggest you do more research in related projects (maybe using their
>>>>>> mailing lists).
>>>>>>
>>>>>> Disclaimer: I don't work for LinkedIn.
>>>>>>
>>>>>> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath <
>>>>>> nick.pentreath@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Vikram,
>>>>>>>
>>>>>>> We use Azkaban (2.5.0) in our production workflow scheduling. We
>>>>>>> just use local mode deployment and it is fairly easy to set up. It is
>>>>>>> pretty easy to use and has a nice scheduling and logging interface, as well
>>>>>>> as SLAs (like kill job and notify if it doesn't complete in 3 hours or
>>>>>>> whatever).
>>>>>>>
>>>>>>> However Spark support is not present directly - we run everything
>>>>>>> with shell scripts and spark-submit. There is a plugin interface where one
>>>>>>> could create a Spark plugin, but I found it very cumbersome when I did
>>>>>>> investigate and didn't have the time to work through it to develop that.
>>>>>>>
>>>>>>> It has some quirks and while there is actually a REST API for adding
>>>>>>> jobs and dynamically scheduling jobs, it is not documented anywhere so you
>>>>>>> kinda have to figure it out for yourself. But in terms of ease of use I
>>>>>>> found it way better than Oozie. I haven't tried Chronos, and it seemed
>>>>>>> quite involved to set up. Haven't tried Luigi either.
>>>>>>>
>>>>>>> Spark job server is good but as you say lacks some stuff like
>>>>>>> scheduling and DAG type workflows (independent of spark-defined job flows).
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke <jo...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Check also falcon in combination with oozie
>>>>>>>>
>>>>>>>> On Fri, Aug 7, 2015 at 17:51, Hien Luu <hl...@linkedin.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Looks like Oozie can satisfy most of your requirements.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone <vi...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> I'm looking for open source workflow tools/engines that allow us
>>>>>>>>>> to schedule spark jobs on a datastax cassandra cluster. Since there are
>>>>>>>>>> tonnes of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc,
>>>>>>>>>> I wanted to check with people here to see what they are using today.
>>>>>>>>>>
>>>>>>>>>> Some of the requirements of the workflow engine that I'm looking
>>>>>>>>>> for are
>>>>>>>>>>
>>>>>>>>>> 1. First class support for submitting Spark jobs on Cassandra.
>>>>>>>>>> Not some wrapper Java code to submit tasks.
>>>>>>>>>> 2. Active open source community support and well tested at
>>>>>>>>>> production scale.
>>>>>>>>>> 3. Should be dead easy to write job dependencies using XML or
>>>>>>>>>> web interface . Ex; job A depends on Job B and Job C, so run Job A after B
>>>>>>>>>> and C are finished. Don't need to write full blown java applications to
>>>>>>>>>> specify job parameters and dependencies. Should be very simple to use.
>>>>>>>>>> 4. Time based  recurrent scheduling. Run the spark jobs at a
>>>>>>>>>> given time every hour or day or week or month.
>>>>>>>>>> 5. Job monitoring, alerting on failures and email notifications
>>>>>>>>>> on daily basis.
>>>>>>>>>>
>>>>>>>>>> I have looked at Ooyala's spark job server which seems to be
>>>>>>>>>> geared towards making spark jobs run faster by sharing contexts between the
>>>>>>>>>> jobs but isn't a full blown workflow engine per se. A combination of spark
>>>>>>>>>> job server and workflow engine would be ideal.
>>>>>>>>>>
>>>>>>>>>> Thanks for the inputs
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>

Re: Spark job workflow engine recommendations

Posted by Hien Luu <hl...@linkedin.com.INVALID>.
The spark job type was added recently - see this pull request
https://github.com/azkaban/azkaban-plugins/pull/195.  You can leverage the
SLA feature to kill a job if it ran longer than expected.

BTW, we just solved the scalability issue by supporting multiple
executors.  Within a week or two, the code for that should be merged in the
main trunk.

Hien

On Tue, Oct 6, 2015 at 9:40 PM, Vikram Kone <vi...@gmail.com> wrote:

> Does Azkaban support scheduling long running jobs like spark streaming
> jobs? Will Azkaban kill a job if it's running for a long time.
>
>
> On Friday, August 7, 2015, Vikram Kone <vi...@gmail.com> wrote:
>
>> Hien,
>> Is Azkaban being phased out at linkedin as rumored? If so, what's
>> linkedin going to use for workflow scheduling? Is there something else
>> that's going to replace Azkaban?
>>
>> On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu <yu...@gmail.com> wrote:
>>
>>> In my opinion, choosing some particular project among its peers should
>>> leave enough room for future growth (which may come faster than you
>>> initially think).
>>>
>>> Cheers
>>>
>>> On Fri, Aug 7, 2015 at 11:23 AM, Hien Luu <hl...@linkedin.com> wrote:
>>>
>>>> Scalability is a known issue due the the current architecture.  However
>>>> this will be applicable if you run more 20K jobs per day.
>>>>
>>>> On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu <yu...@gmail.com> wrote:
>>>>
>>>>> From what I heard (an ex-coworker who is Oozie committer), Azkaban is
>>>>> being phased out at LinkedIn because of scalability issues (though UI-wise,
>>>>> Azkaban seems better).
>>>>>
>>>>> Vikram:
>>>>> I suggest you do more research in related projects (maybe using their
>>>>> mailing lists).
>>>>>
>>>>> Disclaimer: I don't work for LinkedIn.
>>>>>
>>>>> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath <
>>>>> nick.pentreath@gmail.com> wrote:
>>>>>
>>>>>> Hi Vikram,
>>>>>>
>>>>>> We use Azkaban (2.5.0) in our production workflow scheduling. We just
>>>>>> use local mode deployment and it is fairly easy to set up. It is pretty
>>>>>> easy to use and has a nice scheduling and logging interface, as well as
>>>>>> SLAs (like kill job and notify if it doesn't complete in 3 hours or
>>>>>> whatever).
>>>>>>
>>>>>> However Spark support is not present directly - we run everything
>>>>>> with shell scripts and spark-submit. There is a plugin interface where one
>>>>>> could create a Spark plugin, but I found it very cumbersome when I did
>>>>>> investigate and didn't have the time to work through it to develop that.
>>>>>>
>>>>>> It has some quirks and while there is actually a REST API for adding
>>>>>> jos and dynamically scheduling jobs, it is not documented anywhere so you
>>>>>> kinda have to figure it out for yourself. But in terms of ease of use I
>>>>>> found it way better than Oozie. I haven't tried Chronos, and it seemed
>>>>>> quite involved to set up. Haven't tried Luigi either.
>>>>>>
>>>>>> Spark job server is good but as you say lacks some stuff like
>>>>>> scheduling and DAG type workflows (independent of spark-defined job flows).
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke <jo...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Check also falcon in combination with oozie
>>>>>>>
>>>>>>> Le ven. 7 août 2015 à 17:51, Hien Luu <hl...@linkedin.com.invalid> a
>>>>>>> écrit :
>>>>>>>
>>>>>>>> Looks like Oozie can satisfy most of your requirements.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone <vi...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> I'm looking for open source workflow tools/engines that allow us
>>>>>>>>> to schedule spark jobs on a datastax cassandra cluster. Since there are
>>>>>>>>> tonnes of alternatives out there like Ozzie, Azkaban, Luigi , Chronos etc,
>>>>>>>>> I wanted to check with people here to see what they are using today.
>>>>>>>>>
>>>>>>>>> Some of the requirements of the workflow engine that I'm looking
>>>>>>>>> for are
>>>>>>>>>
>>>>>>>>> 1. First class support for submitting Spark jobs on Cassandra. Not
>>>>>>>>> some wrapper Java code to submit tasks.
>>>>>>>>> 2. Active open source community support and well tested at
>>>>>>>>> production scale.
>>>>>>>>> 3. Should be dead easy to write job dependencices using XML or web
>>>>>>>>> interface . Ex; job A depends on Job B and Job C, so run Job A after B and
>>>>>>>>> C are finished. Don't need to write full blown java applications to specify
>>>>>>>>> job parameters and dependencies. Should be very simple to use.
>>>>>>>>> 4. Time based  recurrent scheduling. Run the spark jobs at a given
>>>>>>>>> time every hour or day or week or month.
>>>>>>>>> 5. Job monitoring, alerting on failures and email notifications on
>>>>>>>>> daily basis.
>>>>>>>>>
>>>>>>>>> I have looked at Ooyala's spark job server which seems to be hated
>>>>>>>>> towards making spark jobs run faster by sharing contexts between the jobs
>>>>>>>>> but isn't a full blown workflow engine per se. A combination of spark job
>>>>>>>>> server and workflow engine would be ideal
>>>>>>>>>
>>>>>>>>> Thanks for the inputs
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>

Re: Spark job workflow engine recommendations

Posted by Ruslan Dautkhanov <da...@gmail.com>.
We use Talend, but not for Spark workflows, although it does have Spark
components.

https://www.talend.com/download/talend-open-studio
It is free (commercial support is available) and makes it easy to design
and deploy workflows. Talend for Big Data 6.0 was released a month ago.

Is anybody using Talend for Spark?



-- 
Ruslan Dautkhanov

Re: Spark job workflow engine recommendations

Posted by Hien Luu <hl...@linkedin.com.INVALID>.
We are in the middle of figuring that out. At a high level, we want to
combine the best parts of existing workflow solutions.

Re: Spark job workflow engine recommendations

Posted by Vikram Kone <vi...@gmail.com>.
Does Azkaban support scheduling long-running jobs like Spark Streaming jobs?
Will Azkaban kill a job if it's running for a long time?

Re: Spark job workflow engine recommendations

Posted by Vikram Kone <vi...@gmail.com>.
Hien,
Is Azkaban being phased out at LinkedIn as rumored? If so, what's LinkedIn
going to use for workflow scheduling? Is there something else that's going
to replace Azkaban?

Re: Spark job workflow engine recommendations

Posted by Ted Yu <yu...@gmail.com>.
In my opinion, choosing some particular project among its peers should
leave enough room for future growth (which may come faster than you
initially think).

Cheers

Re: Spark job workflow engine recommendations

Posted by Hien Luu <hl...@linkedin.com.INVALID>.
Scalability is a known issue due to the current architecture. However,
this will only be applicable if you run more than 20K jobs per day.

Re: Spark job workflow engine recommendations

Posted by Ted Yu <yu...@gmail.com>.
From what I heard (from an ex-coworker who is an Oozie committer), Azkaban
is being phased out at LinkedIn because of scalability issues (though
UI-wise, Azkaban seems better).

Vikram:
I suggest you do more research on related projects (maybe using their
mailing lists).

Disclaimer: I don't work for LinkedIn.

Re: Spark job workflow engine recommendations

Posted by Nick Pentreath <ni...@gmail.com>.
Hi Vikram,

We use Azkaban (2.5.0) for our production workflow scheduling. We just use
the local mode deployment, which is fairly easy to set up. It is pretty
easy to use and has a nice scheduling and logging interface, as well as
SLAs (e.g., kill the job and notify if it doesn't complete within 3 hours).

However, Spark support is not present directly - we run everything with
shell scripts and spark-submit. There is a plugin interface where one could
create a Spark plugin, but I found it very cumbersome when I investigated,
and I didn't have the time to work through it to develop that.
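
For concreteness, a minimal sketch of that pattern: two Azkaban 'command'
job files that just shell out to spark-submit, with the second declaring a
dependency on the first. All paths, class names and the master URL are
placeholders:

# Sketch only - placeholder paths and classes. jobB runs only after
# jobA succeeds, which gives the dependency behaviour described above.
cat > jobA.job <<'EOF'
type=command
command=/opt/spark/bin/spark-submit --class com.example.IngestJob --master spark://master:7077 /opt/jobs/pipeline.jar
EOF

cat > jobB.job <<'EOF'
type=command
dependencies=jobA
command=/opt/spark/bin/spark-submit --class com.example.AggregateJob --master spark://master:7077 /opt/jobs/pipeline.jar
EOF

# Zip the job files and upload them as an Azkaban project (via the web
# UI, or the AJAX API sketched below).
zip pipeline.zip jobA.job jobB.job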

It has some quirks, and while there is actually a REST API for adding jobs
and dynamically scheduling them, it is not documented anywhere, so you have
to figure it out for yourself. But in terms of ease of use I found it way
better than Oozie. I haven't tried Chronos, which seemed quite involved to
set up, and I haven't tried Luigi either.
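
For what it's worth, here is a hedged sketch of that API as I pieced it
together. The endpoints below (action=login, ajax=scheduleFlow) match what
was eventually documented, but parameter names may vary by Azkaban version:

# Log in and capture the session id from the JSON response.
curl -k -X POST 'https://azkaban-host:8443' \
  --data 'action=login&username=azkaban&password=azkaban'
# -> {"session.id" : "c7a23c34-...", "status" : "success"}

# Schedule a flow to recur daily, substituting the real session id.
curl -k 'https://azkaban-host:8443/schedule' \
  --data 'ajax=scheduleFlow&session.id=SESSION_ID' \
  --data 'projectName=pipeline&flow=jobB' \
  --data 'scheduleTime=12,00,am,UTC&scheduleDate=08/10/2015' \
  --data 'is_recurring=on&period=1d'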

Spark job server is good, but as you say it lacks some features, like
scheduling and DAG-type workflows (independent of Spark-defined job flows).
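
To make the comparison concrete, this is roughly how submission looks
against spark-jobserver's REST API (taken from its README; host, port and
names are placeholders) - one-shot runs and shared contexts, but no
cron-style schedules or inter-job dependencies:

# Upload the application jar under the app name 'test'.
curl --data-binary @target/job-server-tests.jar 'http://localhost:8090/jars/test'

# Create a long-lived context so successive jobs can share it.
curl -d '' 'http://localhost:8090/contexts/shared-ctx?num-cpu-cores=2&memory-per-node=512m'

# Run a job synchronously in the shared context.
curl -d 'input.string = a b c' \
  'http://localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample&context=shared-ctx&sync=true'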


Re: Spark job workflow engine recommendations

Posted by Jörn Franke <jo...@gmail.com>.
Also check out Falcon in combination with Oozie.

Re: Spark job workflow engine recommendations

Posted by Hien Luu <hl...@linkedin.com.INVALID>.
Looks like Oozie can satisfy most of your requirements.
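
For the time-based scheduling requirement in particular, the usual Oozie
approach is a coordinator that triggers a workflow on a fixed frequency. A
minimal sketch follows (the workflow.xml it points at would contain the
actual Spark or shell action; the HDFS path is a placeholder):

# Sketch of a daily Oozie coordinator, written out from the shell for
# consistency with the examples above.
cat > coordinator.xml <<'EOF'
<coordinator-app name="daily-spark-jobs" frequency="${coord:days(1)}"
                 start="2015-08-10T00:00Z" end="2016-08-10T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>hdfs://namenode/apps/spark-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
EOF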


