Posted to user@oozie.apache.org by Alejandro Abdelnur <tu...@cloudera.com> on 2012/05/01 20:18:21 UTC

Re: a hive thrift alternative

Edward,

I agree that hive thrift server would be the ideal approach. However,
the thrift server is not multi-user/multi-job friendly:

http://mail-archives.apache.org/mod_mbox/hive-dev/201204.mbox/%3CCAJqeMKTDOmDZfNUUW8kSgkivZPkC%2BkH9H5D_RL2YhJGhh4rqNQ%40mail.gmail.com%3E

Until Hive addresses this, I think we are better off with the CLI approach.

Thx

On Mon, Apr 30, 2012 at 10:03 AM, Edward Capriolo <ed...@gmail.com> wrote:
> HaHa. I never rejoined the list after it moved from Yahoo.
>
> I would not describe hive-thrift as horrible but there is some unpleasantness.
>
> Near future:
> https://issues.apache.org/jira/browse/HIVE-2935
>
> In any case I am willing to accept the issues. I run multiple
> hive-thrift servers behind ha-proxy
>
> http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/running_a_hive_thrift_cluster
>
> This cuts down on concurrency-type problems. It's Hive, so I'm not sure
> how much concurrency is needed there.
>
> Our group just decided to part ways with programming over the CLI. Too
> much stuff like this:
>
> hive -S -e "select x,y from $TABLE WHERE $STUFF" | awk whatever
> or:
> list=`hadoop dfs -ls /bla`
>
> That was not unit testable and just really ugly. Even if a call fails
> 1 time in 1,000, we now have try/catch, and we have built things that
> can bring up the entire stack end to end in an IDE.
>
> Layering on top of the CLI is a bad idea in the long run; it's like
> expect-scripting an ssh session. Not that it was a bad design choice
> for Oozie at the time, but it is certainly not the ideal way to handle
> it.



-- 
Alejandro

Re: a hive thrift alternative

Posted by Matthew Rathbone <ma...@foursquare.com>.
We're a long way from using a trunk release :-). Like a lot of people, we're
using the version bundled in CDH3.

On Wed, May 2, 2012 at 8:12 AM, Edward Capriolo <ed...@gmail.com> wrote:

> https://issues.apache.org/jira/browse/HIVE-2503
>
> I believe what you are describing is fixed in trunk.



-- 
Matthew Rathbone
Foursquare | Software Engineer | Server Engineering Team
matthew@foursquare.com | @rathboma <http://twitter.com/rathboma> |
4sq<http://foursquare.com/rathboma>

Re: a hive thrift alternative

Posted by Edward Capriolo <ed...@gmail.com>.
The JDBC client in embedded thick-client mode (I have never used it) sounds
like the CLI on steroids :) That puts even more on the client path.

As I stated originally:

"I would not describe hive-thrift as horrible but there is some unpleasantness."

Two people can declare an X and step on each other. OMG, no way around
that, right?

application1-x=5
application2-x=6

Wait! What if I need to run two copies of application1 at once? Deal
breaker right? Nope.

application1-$pid-x=5
application1-$pid-x=7

Furthermore, most people do not need to touch the conf to run queries,
and those using hive-thrift are more likely to just do any variable
replacement in the Java code on the client side.
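
A minimal sketch of the namespacing idea above, assuming a POSIX shell. The
helper name scoped_key and the pids 101 and 102 are hypothetical, chosen only
for illustration; it builds the key names and nothing more, so passing them to
Hive is left to whatever the client already does.

```shell
#!/bin/sh
# Build a per-application, per-process variable name so that two
# clients sharing one server-side conf do not overwrite each other.
# scoped_key is a hypothetical helper, not part of Hive or Oozie.
scoped_key() {
    # $1 = application name, $2 = process id, $3 = variable name
    printf '%s-%s-%s\n' "$1" "$2" "$3"
}

# Two copies of the same application get distinct keys.
key_a=$(scoped_key application1 101 x)
key_b=$(scoped_key application1 102 x)
echo "$key_a=5"
echo "$key_b=7"

# A real client would pass its own pid instead of a fixed number.
echo "$(scoped_key application1 $$ x)=5"
```

The pid is only one possible discriminator; any value unique to the running
copy would serve the same purpose.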

Again, forking hive takes about 3-5 seconds per invocation.

$ time hive -S -e  "show tables"
real	0m3.547s
user	0m5.339s
sys	0m0.351s

We have many processes that have hundreds or thousands of steps. Each
fork really adds to the runtime. With hive-thrift, which has a pooled DB
connection and a pooled FS connection, we get tasks done in
milliseconds, not seconds. I'm not here to try to convince anyone to
switch to my action or anything, but it works fine for me and there is
a big upside.
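
As a rough back-of-the-envelope check on how the forks add up, here is a small
sketch that takes the 3.547 s "real" time measured above as the cost of one
invocation. That is an assumption: the figure includes the query itself, and
startup time will vary by cluster. The function name fork_overhead_seconds is
made up for this example.

```shell
#!/bin/sh
# Estimate wall-clock time spent on repeated hive CLI invocations.
# Integer arithmetic in milliseconds keeps this portable (no bc/awk).
fork_overhead_seconds() {
    steps=$1         # number of CLI invocations in the workflow
    ms_per_fork=$2   # assumed cost of one invocation, in milliseconds
    echo $(( steps * ms_per_fork / 1000 ))
}

# 3547 ms is the "real" time from the `show tables` run above.
fork_overhead_seconds 1000 3547   # prints 3547 (seconds, about 59 minutes)
fork_overhead_seconds 100 3547    # prints 354 (integer division of 354.7)
```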


On Wed, May 2, 2012 at 5:01 PM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
> Hi Ed,
>
> I've checked this with Carl and got the following:
>
> ----
> HIVE-2503 doesn't really fix the underlying problems. The example that
> I gave in that earlier email of HiveServer reusing the same HiveConf
> between disconnects is still valid on trunk (i.e. even with
> HIVE-2503). If Ed wants to access Hive from Oozie via an API instead
> of through the CLI, then I think his best bet is to run the JDBC
> driver in embedded (thick-client) mode.
> ----
>
> Hope this clarifies the current state of things regarding the Thrift server.
>
> Thx

Re: a hive thrift alternative

Posted by Alejandro Abdelnur <tu...@cloudera.com>.
Hi Ed,

I've checked this with Carl and got the following:

----
HIVE-2503 doesn't really fix the underlying problems. The example that
I gave in that earlier email of HiveServer reusing the same HiveConf
between disconnects is still valid on trunk (i.e. even with
HIVE-2503). If Ed wants to access Hive from Oozie via an API instead
of through the CLI, then I think his best bet is to run the JDBC
driver in embedded (thick-client) mode.
----

Hope this clarifies the current state of things regarding the Thrift server.
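
To make the conf-reuse problem described above concrete, here is a toy model
in plain shell. It is not Hive code: SERVER_CONF, session_set and session_get
are invented names, and the only point is to show how a setting left behind by
one client stays visible to the next when a long-lived server keeps a single
configuration object.

```shell
#!/bin/sh
# Toy model only: one long-lived "server" whose settings outlive a
# client session. None of these names are Hive APIs.
SERVER_CONF=""   # shared by every session, like a reused conf object

session_set() {  # session_set KEY VALUE: store a setting on the server
    SERVER_CONF="$SERVER_CONF $1=$2"
}

session_get() {  # session_get KEY: print the last value set for KEY, if any
    value=""
    for pair in $SERVER_CONF; do
        case $pair in
            "$1"=*) value=${pair#*=} ;;
        esac
    done
    echo "$value"
}

# Client 1 connects, changes a setting, and disconnects.
session_set x 5

# Client 2 connects later and never set x, yet it sees client 1's value.
session_get x   # prints 5
```

Keys and values containing spaces are not handled; the sketch is meant only to
illustrate the leak, not to be reused.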

Thx


On Wed, May 2, 2012 at 6:12 AM, Edward Capriolo <ed...@gmail.com> wrote:
> https://issues.apache.org/jira/browse/HIVE-2503
>
> I believe what you are describing is fixed in trunk.



-- 
Alejandro

Re: a hive thrift alternative

Posted by Edward Capriolo <ed...@gmail.com>.
https://issues.apache.org/jira/browse/HIVE-2503

I believe what you are describing is fixed in trunk.

On Tuesday, May 1, 2012, Alejandro Abdelnur <tu...@cloudera.com> wrote:
> Edward,
>
> I agree that hive thrift server would be the ideal approach. However,
> the thrift server is not multi-user/multi-job friendly:
>
> http://mail-archives.apache.org/mod_mbox/hive-dev/201204.mbox/%3CCAJqeMKTDOmDZfNUUW8kSgkivZPkC%2BkH9H5D_RL2YhJGhh4rqNQ%40mail.gmail.com%3E
>
> Until Hive addresses this, I think we are better off with the CLI approach.
>
> Thx