Posted to user@tez.apache.org by VJ Anand <vj...@sankia.com> on 2014/11/04 04:26:57 UTC

Question Tez under the hood

I have a follow-up question. Bikas mentioned that the Tez App Master
submits one DAG at a time. For a query engine like Hive, where there would
be multiple requests, how is this handled? Are we creating multiple App
Masters and round-robining between them? Even then, when a large number of
requests are submitted to the Hive server, if the App Master can submit
only one DAG at a time, we would have situations where there are many
outstanding requests. Is there a way we can make the App Master
multi-threaded?

-- 
*VJ Anand*

Re: Question Tez under the hood

Posted by Siddharth Seth <ss...@apache.org>.
That's right - there would be one AppMaster per query. Hive server, afaik,
does have some kind of round robin logic - as well as the capability to
launch new AMs if required. The hive user group will likely be able to
answer this better.

The number of AMs (applications) that can run on a YARN cluster is
controlled via queue configuration, and we typically rely on that to
throttle concurrent execution requests. Without Hive server, the number of
concurrent Hive queries would be limited by this factor. Additional queries
would just queue up as pending applications on YARN. Hive Server likely has
its own logic to limit this.
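
As a rough illustration of the throttling described above, the sketch below
models a queue that admits at most a fixed number of applications (AMs) at
once, while the rest wait as pending. It is a toy model, not YARN code: the
class `AmQueue`, the `max_running` cap, and the query names are all
hypothetical.

```python
from collections import deque


class AmQueue:
    """Toy model of a queue that caps concurrently running AMs."""

    def __init__(self, max_running):
        self.max_running = max_running  # hypothetical cap from queue config
        self.running = set()
        self.pending = deque()

    def submit(self, app_id):
        # Admit the application if there is room, otherwise leave it pending.
        if len(self.running) < self.max_running:
            self.running.add(app_id)
        else:
            self.pending.append(app_id)

    def finish(self, app_id):
        # A finished application frees a slot for the oldest pending one.
        self.running.remove(app_id)
        if self.pending:
            self.running.add(self.pending.popleft())


am_queue = AmQueue(max_running=2)
for query in ["q1", "q2", "q3", "q4"]:
    am_queue.submit(query)
print(sorted(am_queue.running), list(am_queue.pending))

am_queue.finish("q1")
print(sorted(am_queue.running), list(am_queue.pending))
```

With a cap of 2, the third and fourth queries stay pending until an earlier
one finishes, which is the queueing behaviour the paragraph describes.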

As for making the AppMaster multi-threaded: this has been discussed, but I
don't believe there are any immediate plans for it. It would run into
scheduling decisions such as how many queries to allow through
simultaneously, priorities, which queue to run in, etc. Currently, we rely
on YARN scheduling for all of this.

RE: Question Tez under the hood

Posted by Bikas Saha <bi...@hortonworks.com>.
Full cluster utilization is not limited to shared workloads. Take the
example of HiveServer2 and a cluster dedicated to Hive queries. Let's say
the typical query takes 30s to run and uses 25% of the cluster. So we can
run 4 queries in parallel (using 4 Tez sessions/AMs running concurrently)
and complete 8 queries per minute. That's the qpm that can be supported
based on the expected per-query latency and the cluster size. A larger
cluster can support higher qpm. Shorter queries can support higher qpm.
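
The arithmetic above can be written out as a small sketch. The function name
`queries_per_minute` and its parameters are made up for illustration, and the
model assumes every query holds a fixed share of the cluster for its whole
run.

```python
def queries_per_minute(latency_s, cluster_fraction):
    """Steady-state throughput when each query holds a fixed cluster share."""
    # Number of queries that fit side by side (one Tez session/AM each).
    # The small epsilon guards against floating-point rounding in 1 / fraction.
    concurrent = int(1 / cluster_fraction + 1e-9)
    # Each concurrent slot completes 60 / latency_s queries per minute.
    return concurrent, concurrent * 60 / latency_s


concurrent, qpm = queries_per_minute(latency_s=30, cluster_fraction=0.25)
print(concurrent, qpm)  # 4 concurrent sessions, 8.0 queries per minute
```

Halving the latency or halving the per-query share of the cluster doubles
the supported qpm in this model.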



In your system, with 200 queries per min, does it actually run 200 queries
simultaneously, or fewer? How much of the cluster is occupied per query, and
how long does each query take? With these numbers you can do the math of
whether your cluster has enough capacity to reach this 200 qpm with multiple
concurrent Tez sessions or not.
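
One way to do that math is Little's law (queries in flight = arrival rate x
time per query). The sketch below is only an estimate under that assumption;
the function names are hypothetical, and the 30s latency and 25% share are
borrowed from the earlier example, not measured values for the 200 qpm
system.

```python
def required_concurrency(target_qpm, latency_s):
    """Little's law: queries in flight = arrival rate * time per query."""
    return target_qpm * latency_s / 60


def cluster_can_sustain(target_qpm, latency_s, cluster_fraction):
    # All in-flight queries must fit in the cluster at the same time.
    in_flight = required_concurrency(target_qpm, latency_s)
    return in_flight * cluster_fraction <= 1.0


# Illustrative inputs only: 200 qpm at 30s per query.
print(required_concurrency(200, 30))        # 100.0 queries in flight
print(cluster_can_sustain(200, 30, 0.25))   # False: 100 * 25% exceeds the cluster
print(cluster_can_sustain(200, 3, 0.05))    # True: 10 in flight at 5% each
```

If the answer is False, no amount of concurrency inside the AM would help;
the cluster itself is the limit.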



Bikas




Re: Question Tez under the hood

Posted by VJ Anand <vj...@sankia.com>.
I agree, if the cluster resources are fully utilized then it is a moot
question. But consider the case where I could support a dedicated cluster
without sharing it with any other workload types. There, I was concerned
about the lack of multi-thread support within the Tez AM. The current
requirements are in the range of ~200 queries per min (these are concurrent
requests). Do you still suggest/advise that I can build on top of this
framework?

-VJ



RE: Question Tez under the hood

Posted by Bikas Saha <bi...@hortonworks.com>.
What kind of concurrency load are we talking about here?



Note that HiveServer2 and similar systems are currently built on Tez and
support concurrency using multiple Tez sessions. If the system is being
fully used, then the fact that a single Tez AM cannot support concurrent
DAGs is orthogonal, because the system capacity is already fully utilized.
The service can accept high concurrency but can execute only as much as the
cluster capacity allows. Sharing the cluster capacity between the queries
depends on that service's policy, e.g. FIFO, fair-share, etc.
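
To make the two policies concrete, here is a minimal sketch of how a service
might split a fixed capacity between queries under FIFO and under max-min
fair sharing. The function names and the container counts are made up; this
is not how HiveServer2 or YARN implements either policy.

```python
def fifo_shares(demands, capacity):
    """Grant each query its full demand in arrival order until capacity runs out."""
    shares = []
    for demand in demands:
        grant = min(demand, capacity)
        shares.append(grant)
        capacity -= grant
    return shares


def fair_shares(demands, capacity):
    """Max-min fair split: small demands are met, the rest share what is left."""
    shares = [0] * len(demands)
    remaining = capacity
    # Visit queries from smallest to largest demand.
    order = sorted(range(len(demands)), key=lambda i: demands[i])
    for pos, i in enumerate(order):
        equal_split = remaining / (len(order) - pos)
        shares[i] = min(demands[i], equal_split)
        remaining -= shares[i]
    return shares


demands = [60, 60, 20]  # containers wanted by three queries (made-up numbers)
print(fifo_shares(demands, capacity=100))  # [60, 40, 0]
print(fair_shares(demands, capacity=100))  # [40.0, 40.0, 20]
```

Under FIFO the last query waits for capacity; under fair sharing every query
makes progress, but the larger ones run with fewer containers.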



Bikas




Re: Question Tez under the hood

Posted by VJ Anand <vj...@sankia.com>.
Thanks for the info and response. The purpose of asking about Hive was to
see whether a query engine that I have been working on could be moved to
use Tez as its execution layer. Currently, this query engine supports a
large number of concurrent user requests, and needs to do so going forward.
Given the current Tez AM limitations, would it be possible for me to
purpose-build a YARN AM that can leverage the Tez DAG execution framework?
In other words, the AM would support the concurrent use cases, etc., needed,
but at the same time leverage the DAG APIs and the framework? If so, any
pointers?

-VJ


RE: Question Tez under the hood

Posted by Bikas Saha <bi...@hortonworks.com>.
To be clear, HiveServer2 and Hive CLI use TezSessions and try to reuse the
session across queries. Hive CLI will typically end up using only 1
session since the CLI blocks until the current query completes.
HiveServer2 has concurrent query support and has its own logic about when
a Tez session can be re-used.
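
As a rough picture of session reuse, the sketch below keeps a small pool of
long-lived sessions and lets each one run a single query at a time. It does
not use the Tez or Hive APIs: `SessionPool`, the session names, and the
string returned by `run` are hypothetical stand-ins, and HiveServer2's real
reuse logic is not shown in this thread.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class SessionPool:
    """Toy pool of long-lived sessions; each runs one query at a time."""

    def __init__(self, size):
        self._idle = queue.Queue()
        for n in range(size):
            self._idle.put(f"session-{n}")  # stand-in for a running AM

    def run(self, query):
        session = self._idle.get()  # blocks until some session is free
        try:
            return f"{query} ran on {session}"  # stand-in for DAG submission
        finally:
            self._idle.put(session)  # hand the session back for reuse


pool = SessionPool(size=2)
with ThreadPoolExecutor(max_workers=4) as executor:
    for line in executor.map(pool.run, ["q1", "q2", "q3", "q4"]):
        print(line)
```

Four queries share two sessions here, so no new AM has to be launched per
query; which session serves which query depends on thread timing.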

About running multiple queries in the AM: I believe there was a JIRA for
that. If not, we should track that. Queries in Hive typically follow a V
shape, with large parallelism in the beginning that tapers off. It may be
possible to get significant gains by pipelining queries sequentially, where
the next query fills up the unused space left behind by the current query
as it winds down.

Bikas


Re: Question Tez under the hood

Posted by Hitesh Shah <hi...@apache.org>.
Hi 

For the most part, each Hive CLI session or JDBC/ODBC connection to HiveServer2 would map to a single Application Master. HiveServer does have some optimizations though (to avoid the overhead cost of launching a new AM) where it tries to keep a pool of ApplicationMasters around and does some scheduling around them. In cases where the number of queries is high, I am not sure whether it starts spawning new AMs or queues up queries. That is probably best asked on the Hive mailing lists.

As for making the AM able to handle multiple DAGs concurrently, the problem does not lie in fixing that but more in whether a cluster has enough capacity to handle that many queries/DAGs concurrently. The savings from running multiple queries in a single AM amount to the resources utilized per AM. In the end, the level of throughput may not increase by much if there are not enough resources to run the containers needed by all the tasks of each of these queries.

On the other hand, there have been some discussions around supporting concurrent DAGs within a single AM. This has interesting problems similar to those of the JobTracker in Hadoop 1.x, i.e., the Tez AM now has to decide priorities across different DAGs and decide how to allocate containers to complete the tasks for each DAG. From a YARN point of view, the Tez AM is a single application, and therefore all resource management/prioritization/preemption for the multiple queries now falls onto the Tez AM, unlike in the case where each query has its own AM.
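
To give a feel for the kind of decision a multi-DAG AM would have to make,
here is a minimal sketch of a priority-ordered container allocator. It is an
illustration only and not part of Tez: `allocate_containers`, the DAG ids,
and the numbers are hypothetical, and real scheduling would also need
preemption and fairness.

```python
import heapq


def allocate_containers(requests, free_containers):
    """Hand out containers to DAGs, highest priority first.

    `requests` holds (priority, dag_id, containers_wanted) tuples; a lower
    number means a higher priority. All names and numbers are hypothetical.
    """
    heap = list(requests)  # copy so the caller's list is not reordered
    heapq.heapify(heap)
    granted = {}
    while heap and free_containers > 0:
        _priority, dag_id, wanted = heapq.heappop(heap)
        grant = min(wanted, free_containers)
        granted[dag_id] = grant
        free_containers -= grant
    return granted


requests = [(2, "dag-b", 30), (1, "dag-a", 50), (3, "dag-c", 40)]
print(allocate_containers(requests, free_containers=70))
# {'dag-a': 50, 'dag-b': 20}
```

The lowest-priority DAG gets nothing in this round, which is the sort of
policy question that YARN answers today when each query has its own AM.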

— Hitesh
