Posted to user@oozie.apache.org by Alejandro Abdelnur <tu...@cloudera.com> on 2012/03/19 22:55:31 UTC

Re: Hive CLI and Standalone Server : Need Suggestion

Ed,

Moving this thread to oozie alias.

I was not being defensive; I'm trying to understand how we can improve
Oozie, since you indicate this is a pain point in Oozie.

Would you say that an action that handles Hadoop Tools would be useful?

Thanks.

Alejandro

On Mon, Mar 19, 2012 at 2:41 PM, Edward Capriolo <ed...@gmail.com> wrote:

> I am not trying to knock oozie but....
> MapReduce Action: Would be great but hadoop docs taught me the proper
> way to write hadoop programs was Tool and Configured. 90% of our
> legacy jobs are tools. MapReduce action can not launch Tools. So
> JavaMain...
>
> SSH action is something I would never allow on our network. Super
> bootleg and insecure.
>
> HiveAction requires the entire hive fat client which is not easy since
> our RDBMS needs to be configured to allow every possible tasktracker
> to access its metastore. Would be better if HiveAction was
> HiveThriftAction then it would only need minimal jars and a host port
> pair. Again back to JavaMain...
>
> Not sure about the shell action.  May not have been around when I put
> this framework together.
>
> My main point is that oozie in its current form is not very flexible,
> what if I want to add an RDBMS action? Beg developers to patch it in?
> Just having to patch in actions is detracting. (I know there is a jira
> open on this)
>
> The reason I wrote the library was:
>
> https://github.com/edwardcapriolo/m6d_oozie/blob/master/src/main/java/com/m6d/oozie/RunShellProps.java
>
> The problem I was facing with the Shell and Java Main actions is that
> if you want to extract any output to be used in the next phase of the
> job it is not easy to get at. I wrote a JavaMain that was
> <capture-output /> friendly.
>
>
> On Mon, Mar 19, 2012 at 5:23 PM, Alejandro Abdelnur <tu...@cloudera.com>
> wrote:
> > Eduardo,
> >
> > Besides the mapreduce/streaming/hive/pig/sqoop/distcp actions, Oozie has a
> > JAVA action (to execute a Java Main class in the cluster), an SSH action
> > (to execute a script via SSH on a remote host), and a SHELL action (to
> > execute a script in the cluster).
> >
> > Would you mind explaining what your m6d extension does that JAVA, SSH or
> > SHELL cannot do in a similar way?
> >
> > Thanks.
> >
> > Alejandro
> >
> > On Mon, Mar 19, 2012 at 12:46 PM, Edward Capriolo <edlinuxguru@gmail.com>
> > wrote:
> >>
> >> This is a bit of a problem. oozie is great for workflow scheduling but
> >> oozie does not have "actions" for everything and adding actions is
> >> non-trivial in current versions.
> >>
> >> I have created some "bootleg/generic" oozie actions that make it easy
> >> to exec pretty much anything and treat it as an action.
> >>
> >> https://github.com/edwardcapriolo/m6d_oozie
> >>
> >> On Mon, Mar 19, 2012 at 3:38 PM,  <ca...@nokia.com> wrote:
> >> > Great topic, as I was wondering about a similar thing this morning… I
> >> > want to use oozie to execute my hive job, but I have to pass the job
> >> > parameters that I generate with a shell script.  Some of the literature
> >> > that I’ve seen says that oozie may or may not allow for calling shell
> >> > scripts.  Is that true?
> >> >
> >> >
> >> >
> >> > Thanks
> >> >
> >> > Carla
> >> >
> >> >
> >> >
> >> > From: ext Bejoy Ks [mailto:bejoy_ks@yahoo.com]
> >> > Sent: Monday, March 19, 2012 15:34
> >> > To: user@hive.apache.org
> >> > Subject: Re: Hive CLI and Standalone Server : Need Suggestion
> >> >
> >> >
> >> >
> >> > Hi LakshmiKanth
> >> >
> >> >         In production systems, if you have a sequence of commands to be
> >> > executed, pack them in order in a file. Then execute the file as
> >> >
> >> > hive -f <filename> ;
> >> >
> >> >
> >> >
> >> > For simplicity, you can use a cron job to run it on a schedule. Just
> >> > put this command in a .sh file and call the file from cron. In fact,
> >> > you can use any scheduler that would trigger a .sh file.
> >> >
> >> >
> >> >
> >> > But for hadoop based work flows the preferred workflow manager is
> >> > oozie, and I recommend oozie for hadoop jobs.
> >> >
> >> >
> >> >
> >> > Regards
> >> >
> >> > Bejoy KS
> >> >
> >> >
> >> >
> >> > ________________________________
> >> >
> >> > From: LakshmiKanth P <lk...@gmail.com>
> >> > To: user@hive.apache.org
> >> > Sent: Tuesday, March 20, 2012 12:19 AM
> >> > Subject: Hive CLI and Standalone Server : Need Suggestion
> >> >
> >> >
> >> >
> >> > Hi
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > I need to schedule my hive scripts, which need to process incoming
> >> > weblogs on an hourly basis.
> >> >
> >> >
> >> >
> >> > Currently, I can process my weblog files by executing my scripts from
> >> > the hive command line interface.  Now I want to keep my scripts in a
> >> > file and invoke them at regular intervals.  I came to know that the
> >> > hive command line options provide a facility to pass a .sql file as
> >> > input for execution.  Is that the right approach for a production
> >> > environment?
> >> >
> >> >
> >> >
> >> > OR
> >> >
> >> >
> >> >
> >> > Should I use my hive server in standalone mode and invoke my hive
> >> > scripts using the JDBC API?
> >> >
> >> >
> >> >
> >> > Request you to suggest me the best approach.
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > Regards,
> >> >
> >> > LK
> >> >
> >> >
> >
> >
>

Re: Hive CLI and Standalone Server : Need Suggestion

Posted by Mohammad Islam <mi...@yahoo.com>.
Hi Ed,
It is a good discussion!

I think some of your concerns might already be implemented in recent releases.

For example, the Shell action was added in the latest version.  It allows users to execute any shell command and to pass variables on for use in later phases of the workflow.
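
As a rough illustration of how output can be handed to later phases from a Java action, here is a minimal sketch. It assumes the action declares <capture-output/> and relies on Oozie passing the output file location in the oozie.action.output.properties system property. The class name CaptureOutputMain and the key=value argument convention are hypothetical, chosen only for this example; this is not the code from the m6d_oozie library.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

/** Hypothetical main class for an Oozie java action that declares <capture-output/>. */
public class CaptureOutputMain {

    // System property through which Oozie tells a java action where to write its output.
    static final String OUTPUT_PROP = "oozie.action.output.properties";

    /** Parses key=value arguments into Properties; arguments without a key are ignored. */
    static Properties parse(String[] args) {
        Properties props = new Properties();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq > 0) {
                props.setProperty(arg.substring(0, eq), arg.substring(eq + 1));
            }
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        String path = System.getProperty(OUTPUT_PROP);
        if (path == null) {
            // Outside Oozie (or without <capture-output/>) the property is not set.
            throw new IllegalStateException(OUTPUT_PROP + " is not set");
        }
        // Write the values in java.util.Properties format to the file Oozie named.
        Properties props = parse(args);
        try (OutputStream out = new FileOutputStream(new File(path))) {
            props.store(out, null);
        }
    }
}
```

A later node in the workflow could then read a value with an EL expression such as ${wf:actionData('my-java-node')['rowCount']}, where my-java-node and rowCount are placeholder names.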


Supporting a new action can be done in two ways:
1. For a totally new action: add a new ActionExecutor, an ActionMain, and the related xsd, and update site.xml. (More flexible, but quite a few steps!)
2. Override an existing action: only override the ActionMain. For example, if you don't like Oozie's current MR job submission, you could write your own job submission class and configure it through wf.xml to override the default MapReduceMain.java. (Easier and user-configurable.)

Thanks for your suggestions.

Regards,
Mohammad 




