Posted to user@spark.apache.org by Bruno Faria <br...@hotmail.com> on 2016/11/29 22:00:46 UTC

Best approach to schedule Spark jobs

I have a standalone Spark cluster and have some jobs scheduled using crontab.

It works, but I don't have any real-time monitoring, for example to get email alerts or to control a workflow.

I thought about using Spark's "hidden" API to get better control, but it seems the API is not officially documented and I don't see much discussion about it on the web.
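The "hidden" API here is presumably the standalone master's REST submission endpoint, which serves submission, status, and kill requests on port 6066 by default (depending on your Spark version it may need to be enabled explicitly). A sketch of polling a driver's state, where the master host and driver id are placeholders:

```shell
# Hedged sketch: ask the standalone master for a submitted driver's state.
MASTER_REST=http://spark-master:6066
DRIVER_ID=driver-20161129220046-0001

# GET /v1/submissions/status/<driver-id> returns JSON containing
# a "driverState" field (e.g. RUNNING, FINISHED, FAILED)
curl -s "$MASTER_REST/v1/submissions/status/$DRIVER_ID" \
  | sed -n 's/.*"driverState" *: *"\([A-Z]*\)".*/\1/p'
```

Since this protocol is undocumented and internal, it can change between releases, which matches the concern above.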

Another option would be Oozie, but it looks like Oozie only works with Hadoop, so I'd need to install it and change my architecture.

Is there any other option you would suggest?

I'm using only open source versions (no distribution).

Thanks

Get Outlook for iOS <https://aka.ms/o0ukef>


Re: Best approach to schedule Spark jobs

Posted by Sandeep Samudrala <sa...@gmail.com>.
Here at InMobi, we use Apache Falcon <https://falcon.apache.org/> (with
Oozie). The pipelines are fully functional in production. You can look at the
Apache Falcon site for more details.


Re: Best approach to schedule Spark jobs

Posted by Tiago Albineli Motta <ti...@gmail.com>.
Here at Globo.com we use Airflow to schedule and manage our Spark pipeline.
We use the YARN API in our Airflow DAGs to control things like guaranteeing
that a job is not still running before starting another batch.
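The guard described above can be sketched against the YARN ResourceManager REST API (`/ws/v1/cluster/apps`, filtered by state); the ResourceManager address and job name below are assumptions, and the `payload` parameter exists only so the logic can be exercised without a live cluster:

```python
import json
from urllib.request import urlopen

def running_apps(rm_url, payload=None):
    """Return the names of currently running YARN applications.

    `rm_url` is the ResourceManager address (a placeholder here);
    `payload` lets callers inject an already-parsed response.
    """
    if payload is None:
        # YARN ResourceManager REST API: list apps filtered by state
        with urlopen(f"{rm_url}/ws/v1/cluster/apps?states=RUNNING") as resp:
            payload = json.load(resp)
    # YARN returns {"apps": null} when nothing is running
    apps = (payload.get("apps") or {}).get("app") or []
    return [a["name"] for a in apps]

def safe_to_start(job_name, rm_url="http://resourcemanager:8088", payload=None):
    """True if no running application has this name -- the check an
    Airflow task could run before submitting the next batch."""
    return job_name not in running_apps(rm_url, payload)
```

An Airflow DAG would call something like `safe_to_start("nightly-batch")` in a sensor or a short-circuit task ahead of the spark-submit step.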

Tiago Albineli Motta
Desenvolvedor de Software - Globo.com
ICQ: 32107100
http://programandosemcafeina.blogspot.com
