Posted to dev@hive.apache.org by Benyi Wang <be...@gmail.com> on 2012/05/01 23:36:37 UTC

Re: Hive server concurrent connection

Thanks Carl. This is clear.

When will HiveServer2 be implemented?

On Mon, Apr 30, 2012 at 12:15 PM, Carl Steinbach <ca...@cloudera.com> wrote:

> Hi Benyi,
>
> The quote from the HiveServer2 proposal reads in full:
>
> "In fact, it's impossible for HiveServer to support concurrent connections
> using the current Thrift API, *a result of the fact that Thrift doesn't
> provide server-side access to connection handles*"
>
> The point I'm trying to make with this statement is that HiveServer
> maintains session state using thread-local variables and implicitly relies
> on Thrift consistently mapping the same connection to the same Thrift
> worker thread, but this isn't a valid assumption to make. For example, if a
> client executes "set mapred.reduce.tasks=1" followed by "select .....", you
> can't assume that both of these statements will be executed by the same
> worker thread. Furthermore, the Thrift API doesn't provide any mechanism
> for detecting client disconnects (see THRIFT-1195), which results in
> incorrect behavior like this:
>
> % hive -h localhost -p 10000
> [localhost:10000] hive> set x=1;
> set x=1;
> [localhost:10000] hive> set x;
> set x;
> x=1
> [localhost:10000] hive> quit;
> quit;
> % hive -h localhost -p 10000
> [localhost:10000] hive> set x;
> set x;
> x=1
> [localhost:10000] hive> quit;
> quit;
>
> In this example I opened a connection to HiveServer and modified my
> session state on the server by setting x=1. I then killed the connection,
> reconnected, and printed the value of x again. Since I'm creating a new
> connection/session I expect x to be undefined, but I actually see the
> value of x that I set in the previous connection. This happens because
> Thrift assigns the same worker thread to service the second connection, and
> since there's no way of detecting client disconnects, HiveServer was unable
> to clear the thread-local session state associated with that worker thread
> before Thrift reassigned it to the second connection.
>
> While it's tempting to try to solve these problems by modifying Thrift to
> provide direct access to the connection handle (which would allow us to map
> connections to session state on the server-side), this approach makes it
> really hard to support HA since it depends on the physical connection
> lasting as long as the user session, which isn't a fair assumption to make
> in the context of queries that can take many hours to complete.
>
> Instead, the approach we're taking with HiveServer2 is to provide explicit
> support for sessions in the client API, e.g. every RPC call references a
> session ID which the server then maps to persistent session state. This
> makes it possible for any worker thread to service any request from any
> client connection.
>
> I hope this clarifies the limitations of the current HiveServer
> implementation as well as the motivations for implementing HiveServer2.
> Please let me know if you have any more questions.
>
> Thanks.
>
> Carl
>
> On Thu, Apr 26, 2012 at 11:55 AM, Benyi Wang <be...@gmail.com>
> wrote:
>
> > I'm a little confused by the statement "In fact, it's impossible for
> > HiveServer to support concurrent connections using the current Thrift
> > API" on the Hive wiki page
> > https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Thrift+API.
> >
> > I started a hive server on hostA using cdh3u3
> >
> > hadoop-hive.noarch                  0.7.1+42.36-2                  installed
> >
> > Then I logged on to two nodes, hostB and hostC, and started a hive client on each:
> >
> > $ hive -h hostA -p 10000
> >
> > It seems that both hive clients work normally.
> >
> > Am I wrong, or has the issue described on the wiki page been resolved?
> >
>
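
The two designs Carl describes can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not HiveServer or Thrift code: the class SessionStateSketch and every member in it are hypothetical names, and a single-thread executor stands in for a Thrift worker thread that is reused across connections.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; none of these names exist in Hive or Thrift.
class SessionStateSketch {

    // HiveServer-style: session variables live in a thread-local on the worker thread.
    static final ThreadLocal<Map<String, String>> THREAD_STATE =
            ThreadLocal.withInitial(ConcurrentHashMap::new);

    // HiveServer2-style: session variables are looked up by an explicit session ID.
    static final Map<String, Map<String, String>> SESSIONS = new ConcurrentHashMap<>();

    static String openSession() {
        String id = UUID.randomUUID().toString();
        SESSIONS.put(id, new ConcurrentHashMap<>());
        return id;
    }

    static void closeSession(String id) {
        SESSIONS.remove(id);
    }

    // Two "connections" served by the same worker thread.
    // Returns the value of x that the second connection observes.
    static String threadLocalLeak() throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            // First connection: set x=1, then disconnect. Nothing clears the state,
            // because the server is never told that the client went away.
            Callable<String> setX = () -> THREAD_STATE.get().put("x", "1");
            worker.submit(setX).get();

            // Second connection: handled by the same thread, so it sees the old value.
            Callable<String> readX = () -> THREAD_STATE.get().get("x");
            return worker.submit(readX).get();
        } finally {
            worker.shutdown();
        }
    }

    // Same scenario, but every call carries a session ID.
    // Returns the value of x that a freshly opened session observes.
    static String explicitSessions() throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            String first = openSession();
            Callable<String> setX = () -> SESSIONS.get(first).put("x", "1");
            worker.submit(setX).get();
            closeSession(first);

            // A new session gets its own state, regardless of which thread serves it.
            String second = openSession();
            Callable<String> readX = () -> SESSIONS.get(second).get("x");
            return worker.submit(readX).get();
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("thread-local state: x=" + threadLocalLeak());
        System.out.println("session-ID state:   x=" + explicitSessions());
    }
}
```

In the thread-local variant the second connection inherits x from the first, which mirrors the transcript in Carl's message. In the session-ID variant the new session starts empty, because state is tied to the ID the client presents rather than to the thread or the physical connection.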

Re: Hive server concurrent connection

Posted by Carl Steinbach <ca...@cloudera.com>.

Hi Benyi,

Implementing HiveServer2 is my current priority at work. We expect it to be
ready in time for Hive 0.10.0, which should release sometime in the next
four months.

Thanks.

Carl

On Tue, May 1, 2012 at 2:36 PM, Benyi Wang <be...@gmail.com> wrote:

> Thanks Carl. This is clear.
>
> When will HiveServer2 be implemented?