Posted to user@hive.apache.org by Terje Marthinussen <tm...@gmail.com> on 2010/10/28 06:15:43 UTC

Scheduling jobs in hive

Hi,

Are there any good scheduling tools out there suitable for the dependencies
you may get in Hive?

Specific example I have right now:
- 2 tables with event logs from different sources
- 1 table with some additional data from a different source, but this data
is a daily summary

None of this data is streamed in real time; it is copied in, and it can be
highly asynchronous and even out of order (I may get a summary for Tuesday
before the one for Monday).

I need to join data from these 3 tables to generate daily statistics, but
obviously I do not want to reprocess everything every day, and it would be
good to not run queries unless all the data is actually there.
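
A rough sketch of that readiness check in Python, for illustration only. The
function name, the source names, and the idea of tracking arrivals as sets of
dates are all assumptions made for this example; how the arrival of data is
actually detected (for instance by looking at which date partitions exist) is
left out.

```python
from datetime import date


def days_ready_to_process(available_by_source, already_processed):
    """Return days that every source has delivered and that are not yet
    processed, in chronological order.

    available_by_source: dict mapping a (hypothetical) source name to the
    set of dates for which that source has delivered data.
    already_processed: set of dates whose daily statistics already exist.
    """
    if not available_by_source:
        return []
    # A day is complete only if it appears in every source's set.
    complete = set.intersection(
        *(set(days) for days in available_by_source.values())
    )
    # Skip days that were already handled, so nothing is reprocessed.
    return sorted(complete - set(already_processed))


# Hypothetical state: the summary for Tuesday (2010-10-26) arrived before
# the one for Monday (2010-10-25).
available = {
    "event_log_a": {date(2010, 10, 25), date(2010, 10, 26)},
    "event_log_b": {date(2010, 10, 25), date(2010, 10, 26)},
    "daily_summary": {date(2010, 10, 26)},
}
processed = set()

ready = days_ready_to_process(available, processed)
print(ready)
```

With this state only Tuesday is returned. Monday is picked up on a later run
once its summary arrives, and days already in `processed` are never returned
again.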

This is not that hard to solve with code written for this specific
case, but I have a hunch that I should be able to generalize this into a
more generic job dependency scheduler. However, I feel a bit like I am
staring at the forest and cannot see a single tree at the moment :)

Just cannot see a solution that I like and I have a clear feeling there
should be a better way to do it than I can think of.

Good ideas?

Terje

Re: Scheduling jobs in hive

Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
You might want to look at Oozie: http://yahoo.github.com/oozie/ . The
trunk version doesn't support Hive actions (yet, I think), but Cloudera
packages a version that has Hive support.

> I need to join data from these 3 tables to generate daily statistics, but
> obviously I do not want to reprocess everything every day, and it would be
> good to not run queries unless all the data is actually there.

The coordinator app in Oozie *might* prove useful for this:
http://archive.cloudera.com/cdh/3/oozie/CoordinatorFunctionalSpec.html#a1._Coordinator_Overview

Hope it helps.
