Posted to user@hive.apache.org by Matt Pestritto <ma...@pestritto.com> on 2009/04/27 16:11:42 UTC

Hive Application

Hi All -

Has anyone put any thought into how to build an application using Hive?
I have a certain algorithm that I implemented in Hive, but it currently
lives in a 600+ line text file from which I copy and paste pieces into the
CLI.  There are obvious problems with this - it's not repeatable, it can't
be scheduled, input parameters are hard-coded, etc.  I have been thinking
about how to break these pieces into reusable modules without writing a
complex application in Java or another language to act as a job
controller / manager.  The first obvious choice is just using the shell to
split the work into different modules and having a set of scripts call
each other and execute Hive SQL queries.  Another option that I am
throwing around is using Python to create a very basic UI where I can
create different modules, set dependencies, control job execution, etc.
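
For illustration, a minimal sketch of the first (shell) option might look
like the following - the script and .q file names are hypothetical, and
the sed substitution just stands in for whatever parameter-passing you
settle on:

#!/bin/bash
# run_pipeline.sh -- hypothetical driver script: runs each module in
# order and stops at the first failure.  LOG_DAY stands in for a
# parameter that would otherwise be hard-coded in the queries.
LOG_DAY=${1:?usage: run_pipeline.sh YYYY-MM-DD}

for module in 01_load.q 02_transform.q 03_report.q; do
  # Substitute the parameter into the query file, then run it with the CLI.
  sed "s/@LOG_DAY@/${LOG_DAY}/g" "$module" > "/tmp/$module.run"
  hive -f "/tmp/$module.run" || exit 1
done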


I'm wondering if anyone else has run into these issues and what kinds of
solutions were implemented?

Thanks
-Matt

RE: Hive Application

Posted by Ashish Thusoo <at...@facebook.com>.
We implemented something in-house called databee, which basically does what John mentioned. Kettle is an open-source alternative that does the same thing - I think Pentaho acquired them.

Ashish


Re: Hive Application

Posted by Edward Capriolo <ed...@gmail.com>.
You may want to consider using the HWISessionManager. It was designed
for a different purpose, but it can work as a native API. It has some
upside in that it can be multi-threaded, you can use blocking or
non-blocking calls, and it is easier to trap errors than in a bash
script.

I would not say it is great for orchestration, in that it is still
top-down code, but you can run Hive queries and MapReduce from the same
program.

// These imports assume the HWI classes shipped in Hive's hwi component.
import org.apache.hadoop.hive.hwi.HWIAuth;
import org.apache.hadoop.hive.hwi.HWIException;
import org.apache.hadoop.hive.hwi.HWISessionItem;
import org.apache.hadoop.hive.hwi.HWISessionManager;

// Start the session manager in its own thread.
HWISessionManager hwi = new HWISessionManager();
Thread sessionThread = new Thread(hwi);
sessionThread.start();

// The user and groups the session should run as.
HWIAuth auth = new HWIAuth();
auth.setUser("ecapriolo");
auth.setGroups(new String[] {"ecapriolo"});

// Create a named session, set job properties, and define the query.
HWISessionItem item = hwi.createSession(auth, "EC_SELECT");
item.runSetProcessorQuery("mapred.map.tasks=40");
item.runSetProcessorQuery("mapred.reduce.tasks=11");
item.setQuery(
    "FROM logger_data INSERT OVERWRITE TABLE logger_data_akami " +
    "PARTITION (log_day='2009-03-22') " +
    "SELECT clientip, hostip, count(1) GROUP BY clientip, hostip");

try {
  // Start the query and poll until it completes.
  item.clientStart();
  while (item.getStatus() != HWISessionItem.WebSessionItemStatus.QUERY_COMPLETE) {
    Thread.sleep(1);
  }
} catch (HWIException ex) {
  ex.printStackTrace();
  System.exit(1);
} catch (InterruptedException ex) {
  System.exit(1);
}

Re: Hive Application

Posted by Matt Pestritto <ma...@pestritto.com>.
Thanks for your comments, all.  In the short term, I will probably use simple
bash scripting because we have an immediate need.  I'll work on a long-term
solution.

Thanks
-Matt


Re: Hive Application

Posted by Prasad Chakka <pc...@facebook.com>.
> I copy and paste pieces into the CLI.
FYI, there are hive -e "hive ql command" and hive -f "file name" options. I suppose you want more, but you can use these to write a simple framework that stitches together all the different steps in your program.
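
For example (the .q file name below is just an illustration):

# run a single HiveQL statement non-interactively
hive -e "SELECT clientip, count(1) FROM logger_data GROUP BY clientip"

# run every statement in a script file
hive -f daily_rollup.q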


Re: Hive Application

Posted by John Warden <jo...@gmail.com>.
Hi Matt -- I've run into the same issues.  I would call this Orchestration:
the general problem of defining dependencies, controlling/scheduling
execution, and keeping track of the state of what has been done.

I haven't found anything great yet, but someone has recommended xactions in
the Pentaho open-source suite, and I'll look more into this.  Also, Zookeeper
might be a good way to keep track of state, but you still need to
build the orchestration application around it.
