Posted to user@spark.apache.org by Prabhu Joseph <pr...@gmail.com> on 2016/06/23 12:21:52 UTC

Spark Thrift Server Concurrency

Hi All,

   When I submit the same SQL query 20 times in parallel to Spark Thrift
Server, the execution time is less than a second for some queries and more
than 2 seconds for others. The Spark Thrift Server logs show that all 20
queries are submitted at the same time, 16/06/23 12:12:01, but the result
schemas are logged at different times.

16/06/23 12:12:01 INFO SparkExecuteStatementOperation: Running query
'select distinct val2 from philips1 where key>=1000 and key<=1500

16/06/23 12:12:*02* INFO SparkExecuteStatementOperation: Result Schema:
ArrayBuffer(val2#2110)
16/06/23 12:12:*03* INFO SparkExecuteStatementOperation: Result Schema:
ArrayBuffer(val2#2182)
16/06/23 12:12:*04* INFO SparkExecuteStatementOperation: Result Schema:
ArrayBuffer(val2#2344)
16/06/23 12:12:*05* INFO SparkExecuteStatementOperation: Result Schema:
ArrayBuffer(val2#2362)
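
To put numbers on the gap seen in these log lines, a small script can compute how long after submission each "Result Schema" line appears. This is only a minimal sketch: it assumes the log format shown above, and because the excerpt carries no per-query identifier it measures every result against the first "Running query" timestamp. The function name schema_delays and the SAMPLE_LOG constant are made up for illustration.

```python
import re
from datetime import datetime

# Timestamp layout used in the log excerpt: yy/MM/dd HH:mm:ss
TS_FORMAT = "%y/%m/%d %H:%M:%S"

# Matches only the two kinds of lines shown in the excerpt.
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO "
    r"SparkExecuteStatementOperation: (Running query|Result Schema)"
)

# Sample taken from the log lines quoted in this thread.
SAMPLE_LOG = """\
16/06/23 12:12:01 INFO SparkExecuteStatementOperation: Running query 'select distinct val2 from philips1 where key>=1000 and key<=1500
16/06/23 12:12:02 INFO SparkExecuteStatementOperation: Result Schema: ArrayBuffer(val2#2110)
16/06/23 12:12:03 INFO SparkExecuteStatementOperation: Result Schema: ArrayBuffer(val2#2182)
16/06/23 12:12:04 INFO SparkExecuteStatementOperation: Result Schema: ArrayBuffer(val2#2344)
16/06/23 12:12:05 INFO SparkExecuteStatementOperation: Result Schema: ArrayBuffer(val2#2362)
"""


def schema_delays(log_text):
    """Return seconds between the first 'Running query' line and each
    'Result Schema' line (assumption: all queries were submitted together)."""
    submitted = None
    delays = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        ts = datetime.strptime(match.group(1), TS_FORMAT)
        if match.group(2) == "Running query":
            if submitted is None:
                submitted = ts
        elif submitted is not None:
            delays.append((ts - submitted).total_seconds())
    return delays


if __name__ == "__main__":
    print(schema_delays(SAMPLE_LOG))  # prints [1.0, 2.0, 3.0, 4.0]
```

A steadily growing delay like this would be consistent with the queries being handled one after another somewhere before they reach the executors, though the log timestamps alone do not show where.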

There are sufficient executors running on YARN. The concurrency is affected
by the single driver. How can I improve the concurrency, and what are the
best practices?

Thanks,
Prabhu Joseph

Re: Spark Thrift Server Concurrency

Posted by Prabhu Joseph <pr...@gmail.com>.
Spark Thrift Server is started with

./sbin/start-thriftserver.sh --master yarn-client --hiveconf
hive.server2.thrift.port=10001 --num-executors 4 --executor-cores 2
--executor-memory 4G --conf spark.scheduler.mode=FAIR

The query below is executed 20 times in parallel:

select distinct val2 from philips1 where key>=1000 and key<=1500
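
For anyone who wants to reproduce the client-side measurement, the test can be sketched in Python with a thread pool that fires the same statement 20 times and records each latency. This is a rough sketch, not the client used in this thread: run_query is a hypothetical stand-in that only sleeps, and it would need to be replaced with a real call through whatever JDBC/ODBC or Thrift client is in use.

```python
import time
from concurrent.futures import ThreadPoolExecutor

QUERY = "select distinct val2 from philips1 where key>=1000 and key<=1500"


def run_query(sql):
    """Hypothetical stand-in for a real client call against the Thrift
    Server; here it only sleeps so the sketch runs on its own."""
    time.sleep(0.01)
    return []


def timed(sql, run):
    """Run one query and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    run(sql)
    return time.perf_counter() - start


def measure(sql, parallelism=20, run=run_query):
    """Submit the same query `parallelism` times at once and return the
    per-query latencies, sorted from fastest to slowest."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(timed, sql, run) for _ in range(parallelism)]
        return sorted(f.result() for f in futures)


if __name__ == "__main__":
    latencies = measure(QUERY)
    print("fastest: %.3fs, slowest: %.3fs" % (latencies[0], latencies[-1]))
```

A wide spread between the fastest and slowest latency, while the Spark jobs UI shows similar job durations, would point at time spent before the jobs reach the executors.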

There is no issue at the backend Spark executors: the Spark jobs UI shows
that all 20 queries are launched and complete with the same duration, and all
20 queries are received by Spark Thrift Server at the same time. But the Spark
driver inside Spark Thrift Server looks overloaded, so the queries are not
parsed and submitted to the executors at the same time, which shows up as a
delay in query execution time at the client.





On Thu, Jun 23, 2016 at 11:12 PM, Michael Segel <ms...@hotmail.com>
wrote:

> Hi,
> There are  a lot of moving parts and a lot of unknowns from your
> description.
> Besides the version stuff.
>
> How many executors, how many cores? How much memory?
> Are you persisting (memory and disk) or just caching (memory)
>
> During the execution… same tables… are  you seeing a lot of shuffling of
> data for some queries and not others?
>
> It sounds like an interesting problem…
>
> On Jun 23, 2016, at 5:21 AM, Prabhu Joseph <pr...@gmail.com>
> wrote:
>
> Hi All,
>
>    On submitting 20 parallel same SQL query to Spark Thrift Server, the
> query execution time for some queries are less than a second and some are
> more than 2seconds. The Spark Thrift Server logs shows all 20 queries are
> submitted at same time 16/06/23 12:12:01 but the result schema are at
> different times.
>
> 16/06/23 12:12:01 INFO SparkExecuteStatementOperation: Running query
> 'select distinct val2 from philips1 where key>=1000 and key<=1500
>
> 16/06/23 12:12:*02* INFO SparkExecuteStatementOperation: Result Schema:
> ArrayBuffer(val2#2110)
> 16/06/23 12:12:*03* INFO SparkExecuteStatementOperation: Result Schema:
> ArrayBuffer(val2#2182)
> 16/06/23 12:12:*04* INFO SparkExecuteStatementOperation: Result Schema:
> ArrayBuffer(val2#2344)
> 16/06/23 12:12:*05* INFO SparkExecuteStatementOperation: Result Schema:
> ArrayBuffer(val2#2362)
>
> There are sufficient executors running on YARN. The concurrency is
> affected by Single Driver. How to improve the concurrency and what are the
> best practices.
>
> Thanks,
> Prabhu Joseph
>
>
>

Re: Spark Thrift Server Concurrency

Posted by Michael Segel <ms...@hotmail.com>.
Hi,
There are a lot of moving parts and a lot of unknowns in your description,
besides the version stuff.

How many executors, how many cores? How much memory?
Are you persisting (memory and disk) or just caching (memory)?

During the execution… same tables… are you seeing a lot of shuffling of data for some queries and not others?

It sounds like an interesting problem…

> On Jun 23, 2016, at 5:21 AM, Prabhu Joseph <pr...@gmail.com> wrote:
> 
> Hi All,
> 
>    On submitting 20 parallel same SQL query to Spark Thrift Server, the query execution time for some queries are less than a second and some are more than 2seconds. The Spark Thrift Server logs shows all 20 queries are submitted at same time 16/06/23 12:12:01 but the result schema are at different times.
> 
> 16/06/23 12:12:01 INFO SparkExecuteStatementOperation: Running query 'select distinct val2 from philips1 where key>=1000 and key<=1500
> 
> 16/06/23 12:12:02 INFO SparkExecuteStatementOperation: Result Schema: ArrayBuffer(val2#2110)
> 16/06/23 12:12:03 INFO SparkExecuteStatementOperation: Result Schema: ArrayBuffer(val2#2182)
> 16/06/23 12:12:04 INFO SparkExecuteStatementOperation: Result Schema: ArrayBuffer(val2#2344)
> 16/06/23 12:12:05 INFO SparkExecuteStatementOperation: Result Schema: ArrayBuffer(val2#2362)
> 
> There are sufficient executors running on YARN. The concurrency is affected by Single Driver. How to improve the concurrency and what are the best practices.
> 
> Thanks,
> Prabhu Joseph


Re: Spark Thrift Server Concurrency

Posted by Mich Talebzadeh <mi...@gmail.com>.
Which version of Spark, and are you using YARN in client mode or cluster
mode?

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 23 June 2016 at 13:21, Prabhu Joseph <pr...@gmail.com> wrote:

> Hi All,
>
>    On submitting 20 parallel same SQL query to Spark Thrift Server, the
> query execution time for some queries are less than a second and some are
> more than 2seconds. The Spark Thrift Server logs shows all 20 queries are
> submitted at same time 16/06/23 12:12:01 but the result schema are at
> different times.
>
> 16/06/23 12:12:01 INFO SparkExecuteStatementOperation: Running query
> 'select distinct val2 from philips1 where key>=1000 and key<=1500
>
> 16/06/23 12:12:*02* INFO SparkExecuteStatementOperation: Result Schema:
> ArrayBuffer(val2#2110)
> 16/06/23 12:12:*03* INFO SparkExecuteStatementOperation: Result Schema:
> ArrayBuffer(val2#2182)
> 16/06/23 12:12:*04* INFO SparkExecuteStatementOperation: Result Schema:
> ArrayBuffer(val2#2344)
> 16/06/23 12:12:*05* INFO SparkExecuteStatementOperation: Result Schema:
> ArrayBuffer(val2#2362)
>
> There are sufficient executors running on YARN. The concurrency is
> affected by Single Driver. How to improve the concurrency and what are the
> best practices.
>
> Thanks,
> Prabhu Joseph
>
