Posted to dev@airflow.apache.org by Nadeem Ahmed Nazeer <na...@neon-lab.com> on 2016/08/11 22:09:28 UTC

handling backfill over time

Hello,

My Airflow DAG consists of two tasks (sketched below):
1) map-reduce jobs (write output to S3)
2) Hive loads (using the files from 1)
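
For reference, here is a minimal sketch of such a two-task DAG on the 3-hour
schedule described further below; the operator choices, task ids, and shell
commands are illustrative placeholders, not the real pipeline:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "retries": 1,
    "retry_delay": timedelta(minutes=15),
}

# Runs every 3 hours, i.e. 8 runs per day.
dag = DAG(
    dag_id="emr_to_hive",                  # hypothetical name
    default_args=default_args,
    start_date=datetime(2016, 8, 1),       # placeholder
    schedule_interval="0 */3 * * *",
)

# Task 1: map-reduce job on EMR that writes its output to S3.
mapreduce = BashOperator(
    task_id="mapreduce_to_s3",
    bash_command="run_emr_job.sh {{ ds }}",                          # placeholder
    dag=dag,
)

# Task 2: Hive load that reads the files task 1 left in S3.
hive_load = BashOperator(
    task_id="hive_load",
    bash_command="hive -f load_partition.hql -hivevar dt={{ ds }}",  # placeholder
    dag=dag,
)

mapreduce.set_downstream(hive_load)  # the Hive load depends on the map-reduce output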

My EMR Hadoop cluster runs on AWS spot instances, so when spot instance
pricing goes up, my cluster dies and a new one comes up.

In the event of a cluster death, I clear all the Hive load tasks in
Airflow. That way the tables get rebuilt in the new cluster from the files
in S3.
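
As a hedged sketch, clearing all the Hive load tasks after a cluster death
could be done programmatically along these lines (or with the airflow clear
CLI); the DAG id, task id, and date window here are placeholders:

from datetime import datetime

from airflow.models import DagBag

# Load the DAG and restrict it to just the Hive-load task, so clearing does
# not also re-run the upstream map-reduce jobs (their S3 output still exists).
dag = DagBag().get_dag("emr_to_hive")                      # placeholder id
hive_only = dag.sub_dag(task_regex="hive_load", include_upstream=False)

# Clear the task instances for the window that needs rebuilding; the
# scheduler then re-runs them against the new cluster.
hive_only.clear(
    start_date=datetime(2016, 7, 1),
    end_date=datetime(2016, 8, 1),
)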

But over time, as the backfill grows, this approach becomes inefficient.
My DAG runs every 3 hours (8 runs a day), so, for example, if the cluster
goes down after a month, Airflow now has to backfill 240 (8 * 30) cleared
tasks. The backfill only gets bigger with time.

What could be a better way to handle this? Currently I'm planning to
re-base Airflow manually once a month, wherein I bring everything down and
restart Airflow with a new start date of the current day (sketched below).
This keeps the backfill bounded to at most a month. But there has to be a
better way of doing this.

Please provide any suggestions.

Thanks,
Nadeem

Re: handling backfill over time

Posted by Lance Norskog <la...@gmail.com>.
Is the problem that you lose the data, or the database?
If it's that you lose the DB, try using a permanent MySQL instance (or even
RDS) for your DB.
If it's that you lose your "digested" Hive data, you can do snapshots of
the disk set.
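
To make the metadata-DB part concrete: a hedged example of pointing Airflow
at a long-lived MySQL/RDS instance (host, credentials, and database name
below are placeholders) is the sql_alchemy_conn setting in airflow.cfg:

[core]
# Keep Airflow's metadata (task instance state, DAG runs) in a permanent
# MySQL/RDS database so it survives the loss of any EMR cluster.
sql_alchemy_conn = mysql://airflow:PASSWORD@my-rds-host.us-east-1.rds.amazonaws.com:3306/airflow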



On Thu, Aug 25, 2016 at 1:37 PM, Nadeem Ahmed Nazeer <na...@neon-lab.com>
wrote:

> Hello Airflowers,
>
> Does someone see a better way to do this? It would really help my Airflow
> setup.
>
> Thanks,
> Nadeem
>
> On Thu, Aug 11, 2016 at 3:09 PM, Nadeem Ahmed Nazeer <na...@neon-lab.com>
> wrote:
>
> > Hello,
> >
> > My Airflow DAG consists of two tasks:
> > 1) map-reduce jobs (write output to S3)
> > 2) Hive loads (using the files from 1)
> >
> > My EMR Hadoop cluster runs on AWS spot instances, so when spot instance
> > pricing goes up, my cluster dies and a new one comes up.
> >
> > In the event of a cluster death, I clear all the Hive load tasks in
> > Airflow. That way the tables get rebuilt in the new cluster from the
> > files in S3.
> >
> > But over time, as the backfill grows, this approach becomes inefficient.
> > My DAG runs every 3 hours (8 runs a day), so, for example, if the cluster
> > goes down after a month, Airflow now has to backfill 240 (8 * 30) cleared
> > tasks. The backfill only gets bigger with time.
> >
> > What could be a better way to handle this? Currently I'm planning to
> > re-base Airflow manually once a month, wherein I bring everything down
> > and restart Airflow with a new start date of the current day. This keeps
> > the backfill bounded to at most a month. But there has to be a better
> > way of doing this.
> >
> > Please provide any suggestions.
> >
> > Thanks,
> > Nadeem
> >
>



-- 
Lance Norskog
lance.norskog@gmail.com
Redwood City, CA

Re: handling backfill over time

Posted by Nadeem Ahmed Nazeer <na...@neon-lab.com>.
Hello Airflowers,

Does someone see a better way to do this? It would really help my Airflow
setup.

Thanks,
Nadeem

On Thu, Aug 11, 2016 at 3:09 PM, Nadeem Ahmed Nazeer <na...@neon-lab.com>
wrote:

> Hello,
>
> My Airflow DAG consists of two tasks:
> 1) map-reduce jobs (write output to S3)
> 2) Hive loads (using the files from 1)
>
> My EMR Hadoop cluster runs on AWS spot instances, so when spot instance
> pricing goes up, my cluster dies and a new one comes up.
>
> In the event of a cluster death, I clear all the Hive load tasks in
> Airflow. That way the tables get rebuilt in the new cluster from the files
> in S3.
>
> But over time, as the backfill grows, this approach becomes inefficient.
> My DAG runs every 3 hours (8 runs a day), so, for example, if the cluster
> goes down after a month, Airflow now has to backfill 240 (8 * 30) cleared
> tasks. The backfill only gets bigger with time.
>
> What could be a better way to handle this? Currently I'm planning to
> re-base Airflow manually once a month, wherein I bring everything down and
> restart Airflow with a new start date of the current day. This keeps the
> backfill bounded to at most a month. But there has to be a better way of
> doing this.
>
> Please provide any suggestions.
>
> Thanks,
> Nadeem
>