Posted to dev@airflow.apache.org by Ashish Rawat <as...@myntra.com> on 2017/08/03 15:43:42 UTC

Task partitioning using Airflow

Hi,

We have a use case where we run some R/Python-based data science models that execute on a single node. The execution time of these models is constantly increasing, and we now plan to split the model training by a partition key and distribute the workload over multiple machines.

Does Airflow provide a simple way to split a task into multiple tasks, each of which works on a specific value of the key?

--
Regards,
Ashish




Re: Task partitioning using Airflow

Posted by Ashish Rawat <as...@myntra.com>.
I will give it a try, thanks Brian!!

--
Regards,
Ashish





Re: Task partitioning using Airflow

Posted by "Van Klaveren, Brian N." <bv...@slac.stanford.edu>.
Hi Ashish,

Partitioned tasks might be modeled as n-many triggered, parameterized DAGs or subdags, where the parameter is the partition key. I've used this pattern a lot in other systems, though not with Airflow specifically, so I'm not sure exactly how you'd implement it there, but I hope this gives you some ideas.

Brian
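
A rough sketch of this triggered, parameterized-DAG idea, assuming the Airflow 1.x-era TriggerDagRunOperator, whose python_callable receives the run context and a DagRunOrder object; the target DAG id, partition keys, and payload shape are hypothetical stand-ins:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.dagrun_operator import TriggerDagRunOperator

    PARTITION_KEYS = ["north", "south", "east", "west"]  # hypothetical keys

    dag = DAG("trigger_partitioned_training",
              start_date=datetime(2017, 8, 1),
              schedule_interval="@daily")

    def make_payload_setter(key):
        # Stash the partition key in the triggered run's payload so the
        # target DAG can read it back from dag_run.conf.
        def set_payload(context, dag_run_obj):
            dag_run_obj.payload = {"partition_key": key}
            return dag_run_obj
        return set_payload

    for key in PARTITION_KEYS:
        TriggerDagRunOperator(
            task_id="trigger_train_%s" % key,
            trigger_dag_id="train_one_partition",  # hypothetical target DAG
            python_callable=make_payload_setter(key),
            dag=dag,
        )

Each triggered run of the hypothetical train_one_partition DAG would then read its key from dag_run.conf and train on that slice of the data.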




Re: Task partitioning using Airflow

Posted by Ashish Rawat <as...@myntra.com>.
Yes, I believe subdags are meant for splitting a bigger DAG into smaller DAGs, for clarity and reusability. In our use case, we need to split/replicate a specific task into multiple tasks based on the different values of a key: essentially, data partitioning and processing.

--
Regards,
Ashish
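
One way to get this fan-out without subdags is to generate one task per key when the DAG file is parsed. A minimal sketch, assuming the key list is known in advance; PARTITION_KEYS and train_model are hypothetical stand-ins:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    PARTITION_KEYS = ["north", "south", "east", "west"]  # hypothetical keys

    def train_model(partition_key):
        # Train the model on the slice of data for this partition key.
        pass

    dag = DAG("partitioned_training",
              start_date=datetime(2017, 8, 1),
              schedule_interval="@daily")

    # One task per partition key; a distributed executor (e.g. Celery)
    # can then run these concurrently on different worker machines.
    for key in PARTITION_KEYS:
        PythonOperator(
            task_id="train_%s" % key,
            python_callable=train_model,
            op_kwargs={"partition_key": key},
            dag=dag,
        )

The main caveat is that the key list must be resolvable at DAG-parse time; keys that only become known at run time do not fit this pattern cleanly.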




Re: Task partitioning using Airflow

Posted by "Van Klaveren, Brian N." <bv...@slac.stanford.edu>.
Have you looked into subdags?

Brian
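
For reference, a minimal sketch of the subdag approach, assuming the Airflow 1.x SubDagOperator and the same hypothetical key list and train_model callable as in the sketch above:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from airflow.operators.subdag_operator import SubDagOperator

    PARTITION_KEYS = ["north", "south", "east", "west"]  # hypothetical keys

    def train_model(partition_key):
        # Train the model on the slice of data for this partition key.
        pass

    def make_training_subdag(parent_id, child_id, start_date):
        # By Airflow convention the subdag's id must be "<parent>.<child>".
        subdag = DAG("%s.%s" % (parent_id, child_id),
                     start_date=start_date,
                     schedule_interval="@daily")
        for key in PARTITION_KEYS:
            PythonOperator(
                task_id="train_%s" % key,
                python_callable=train_model,
                op_kwargs={"partition_key": key},
                dag=subdag,
            )
        return subdag

    main_dag = DAG("training_pipeline",
                   start_date=datetime(2017, 8, 1),
                   schedule_interval="@daily")

    SubDagOperator(
        task_id="train_all_partitions",
        subdag=make_training_subdag("training_pipeline",
                                    "train_all_partitions",
                                    datetime(2017, 8, 1)),
        dag=main_dag,
    )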




Re: Task partitioning using Airflow

Posted by Ashish Rawat <as...@myntra.com>.
Thanks George. Our use case also requires periodic scheduling (daily), as well as task dependencies, which is why we chose Airflow. However, some of the tasks in a DAG have now become too big to execute on one node, and we want to split them into multiple tasks to reduce execution time. Would you recommend firing off parts of an Airflow DAG in another framework?

--
Regards,
Ashish




Re: Task partitioning using Airflow

Posted by George Leslie-Waksman <ge...@cloverhealth.com.INVALID>.
Airflow is best for situations where you want to run different tasks that
depend on each other or process data that arrives over time. If your goal
is to take a large dataset, split it up, and process chunks of it, there
are probably other tools better suited to your purpose.

Off the top of my head, you might consider Dask:
https://dask.pydata.org/en/latest/ or directly using Celery:
http://www.celeryproject.org/

--George
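
For a sense of the Dask option, a minimal sketch using dask.delayed to fan per-partition training out across a cluster; the keys and train function are hypothetical stand-ins:

    from dask import compute, delayed

    PARTITION_KEYS = ["north", "south", "east", "west"]  # hypothetical keys

    def train(partition_key):
        # Train the model on the slice of data for this partition key.
        return partition_key

    # Build one lazy task per partition, then run them all in parallel;
    # connecting to a dask.distributed scheduler spreads the work across
    # multiple machines.
    tasks = [delayed(train)(key) for key in PARTITION_KEYS]
    results = compute(*tasks)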


Re: Task partitioning using Airflow

Posted by Ashish Rawat <as...@myntra.com>.
Hi, can anyone please provide some pointers for this use case with Airflow?

--
Regards,
Ashish


