Posted to user@hive.apache.org by Jonathan Warden <jo...@gmail.com> on 2009/04/19 10:47:43 UTC

Hive Orchestration

I'm looking for a framework that manages automatic initiation of our  
daily data loading and processing, with knowledge of dependencies  
between tables and "data ready" status flags.

I think some people call this "Orchestration" (though there's no  
settled definition of this word).

I get the impression there are a lot of home grown solutions for  
this.  But I'd like a generalized solution that would allow me to just  
create a config file containing:
   - A list of my tables
   - What tables each table depends on
   - Queries for loading one day of data into each table (for external  
"raw data" tables, say a program to fetch this from wherever we fetch  
it from)

Then there'd be a driver process that would automatically run  
everything every day based on my config, and would maintain a status  
(that I could query in report generation and monitoring processes) on  
what data was loaded successfully into a given table for a given day.
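
A minimal sketch of the config-plus-driver idea described above, in Python. Everything in it is an assumption for illustration: the table names, the `depends_on` and `load` keys, the `run_day` function, and the in-memory status map are hypothetical, and the `run_load` callable stands in for whatever actually executes a query or a fetch program.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical config: each table lists the tables it depends on and a
# template for the command or query that loads one day of data.
CONFIG = {
    "raw_clicks": {"depends_on": [], "load": "fetch raw_clicks {day}"},
    "raw_users": {"depends_on": [], "load": "fetch raw_users {day}"},
    "sessions": {"depends_on": ["raw_clicks"], "load": "load sessions {day}"},
    "daily_report": {
        "depends_on": ["sessions", "raw_users"],
        "load": "load daily_report {day}",
    },
}


def run_day(config, day, run_load, status=None):
    """Load every table for `day` in dependency order and return the status map.

    `status` maps (table, day) to "ok", "failed", or "skipped". Passing in a
    previous status map lets a re-run retry only what has not succeeded yet.
    """
    status = {} if status is None else status
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {table: spec["depends_on"] for table, spec in config.items()}
    for table in TopologicalSorter(graph).static_order():
        if status.get((table, day)) == "ok":
            continue  # already loaded on an earlier run
        deps = config[table]["depends_on"]
        if any(status.get((dep, day)) != "ok" for dep in deps):
            status[(table, day)] = "skipped"  # an upstream table is not ready
            continue
        try:
            run_load(config[table]["load"].format(day=day))
            status[(table, day)] = "ok"
        except Exception:
            status[(table, day)] = "failed"
    return status
```

Calling `run_day(CONFIG, "2009-04-19", run_load)` returns a map keyed by `(table, day)`, which is the kind of status a report or monitoring process could query. A real driver would need to persist that map somewhere durable rather than keep it in memory.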

There's Apache HOD (Hadoop on Demand), but it's just integration with  
batch schedulers.  Then there's Apache ODE (Orchestration Director  
Engine), but this seems to be Web Services Orchestration and I don't  
see it as solving my problem (though I'm not sure).

Any ideas?

Re: Hive Orchestration

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Hey John,

You could augment the metastore with your dependency information and use a
workflow scheduling tool to speak to the metastore's Thrift interface. I've
seen folks do similar things with home-grown Python scripts as well as
packages like Quartz and OpsWise. I haven't played with it, but Vadim
Zaliva's hamake seems like it might do the trick:
http://code.google.com/p/hamake/. You might also check out Hadoop's
XML-driven workload scheduler
(http://issues.apache.org/jira/browse/HADOOP-5303), though I haven't used that
tool either.

Later,
Jeff

On Sun, Apr 19, 2009 at 6:51 PM, John Warden <jo...@gmail.com> wrote:

> Edward, actually I was thinking of Zookeeper for something like this.
> Zookeeper seems like it could serve the role of the state repository for
> things like this -- basically storing what load processes for what dates
> have finished or failed.  It has some advantages over a relational
> database.  But, you could also do this with files on HDFS, perhaps simply
> the presence of a file indicating when a process was finished.
>
> Anyway, I don't have any plans at all -- this is just in the "I wish I had
> this" stage.  I feel like someone out there must have created something
> similar already.

Re: Hive Orchestration

Posted by John Warden <jo...@gmail.com>.
Edward, actually I was thinking of Zookeeper for something like this.
Zookeeper seems like it could serve the role of the state repository for
things like this -- basically storing what load processes for what dates
have finished or failed.  It has some advantages over a relational
database.  But, you could also do this with files on HDFS, perhaps simply
the presence of a file indicating when a process was finished.

Anyway, I don't have any plans at all -- this is just in the "I wish I had
this" stage.  I feel like someone out there must have created something
similar already.
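
The marker-file idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the local filesystem stands in for HDFS, and the directory layout, the `_DONE` file name, and the function names are all hypothetical.

```python
from pathlib import Path


def marker_path(root, table, day):
    """Return the path of the (hypothetical) completion marker for one load."""
    return Path(root) / table / day / "_DONE"


def mark_done(root, table, day):
    """Record that the load of `table` for `day` finished by creating an empty file."""
    path = marker_path(root, table, day)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()


def is_done(root, table, day):
    """A load counts as finished only if its marker file exists."""
    return marker_path(root, table, day).exists()
```

The presence of a file can only say that a process finished. Recording failures as well would need a second marker or some content in the file, which is one place where a store like Zookeeper or a database might be more convenient.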

On Sun, Apr 19, 2009 at 10:32 PM, Edward Capriolo <ed...@gmail.com> wrote:

> It sounds good. What do you think the overlap with zookeeper is?
> http://hadoop.apache.org/zookeeper/
>
> The "entry" points for Hive seem to be the 'HiveServer' and 'hive -e'.
> I have also used the Hive API directly. Do you have plans for supporting
> those three things?
>

Re: Hive Orchestration

Posted by Edward Capriolo <ed...@gmail.com>.
On Sun, Apr 19, 2009 at 4:47 AM, Jonathan Warden <jo...@gmail.com> wrote:
> I'm looking for a framework that manages automatic initiation of our daily
> data loading and processing, with knowledge of dependencies between tables
> and "data ready" status flags.
>
> I think some people call this "Orchestration" (though there's no settled
> definition of this word).
>
> I get the impression there are a lot of home grown solutions for this.  But
> I'd like a generalized solution that would allow me to just create a config
> file containing:
>  - A list of my tables
>  - What tables each table depends on
>  - Queries for loading one day of data into each table (for external "raw
> data" tables, say a program to fetch this from wherever we fetch it from)
>
> Then there'd be a driver process that would automatically run everything
> every day based on my config, and would maintain a status (that I could
> query in report generation and monitoring processes) on what data was loaded
> successfully into a given table for a given day.
>
> There's Apache HOD (Hadoop on Demand), but it's just integration with batch
> schedulers.  Then there's Apache ODE (Orchestration Director Engine), but
> this seems to be Web Services Orchestration and I don't see it as solving my
> problem (though I'm not sure).
>
> Any ideas?
>

It sounds good. What do you think the overlap with zookeeper is?
http://hadoop.apache.org/zookeeper/

The "entry" points for Hive seem to be the 'HiveServer' and 'hive -e'.
I have also used the Hive API directly. Do you have plans for supporting
those three things?
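
For the 'hive -e' entry point, a driver could shell out to the command-line client. A minimal Python sketch, assuming a `hive` executable is on the PATH; the function names here are hypothetical, and only the `-e` option (run a query string) comes from the Hive CLI itself.

```python
import subprocess


def hive_command(query, hive_bin="hive"):
    """Build the argument list for running one query through `hive -e`."""
    return [hive_bin, "-e", query]


def run_hive(query, hive_bin="hive"):
    """Run the query and raise CalledProcessError if the client exits non-zero."""
    return subprocess.run(hive_command(query, hive_bin), check=True)
```

Passing the arguments as a list avoids shell quoting problems in the query text. The non-zero exit check is what would let a driver tell a failed load from a successful one.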

Re: Hive Orchestration

Posted by John Zimmerman <jo...@gmail.com>.
Nice!

On Apr 19, 2009, at 1:47 AM, Jonathan Warden wrote:

> I'm looking for a framework that manages automatic initiation of our  
> daily data loading and processing, with knowledge of dependencies  
> between tables and "data ready" status flags.
>
> I think some people call this "Orchestration" (though there's no  
> settled definition of this word).
>
> I get the impression there are a lot of home grown solutions for  
> this.  But I'd like a generalized solution that would allow me to  
> just create a config file containing:
>  - A list of my tables
>  - What tables each table depends on
>  - Queries for loading one day of data into each table (for external  
> "raw data" tables, say a program to fetch this from wherever we  
> fetch it from)
>
> Then there'd be a driver process that would automatically run  
> everything every day based on my config, and would maintain a status  
> (that I could query in report generation and monitoring processes)  
> on what data was loaded successfully into a given table for a given  
> day.
>
> There's Apache HOD (Hadoop on Demand), but it's just integration  
> with batch schedulers.  Then there's Apache ODE (Orchestration  
> Director Engine), but this seems to be Web Services Orchestration  
> and I don't see it as solving my problem (though I'm not sure).
>
> Any ideas?