Posted to users@zeppelin.apache.org by Tobias Bockrath <tb...@web-computing.de> on 2015/05/15 15:48:23 UTC

Performance Comparison of SparkSQL Shell and Zeppelin

Hello,

we deployed Apache Spark 1.3.0 and Apache Zeppelin built with Spark 1.3.0 in
a Hadoop cluster with one NameNode and two DataNodes. Both are running in
yarn-client mode, so the setup and the preconditions are identical.

We executed several SQL queries via the Zeppelin frontend and via the
SparkSQL shell. For example, we tried queries with 5 join conditions, as
well as queries on a pre-joined dataset with more than 1,000,000 records.

We found that the SparkSQL shell executes much faster than Zeppelin: in
fact, equivalent queries ran 4x to 40x faster in the SparkSQL shell than
in Zeppelin.

Does anyone have similar experiences? Why does Zeppelin have such overhead
although the same engine is "under the hood"? How does Zeppelin handle
queries? Are they passed to Spark directly, or are there any optimizations?

Kind regards
Tobias

Re: Performance Comparison of SparkSQL Shell and Zeppelin

Posted by Tobias Bockrath <tb...@web-computing.de>.
Today I executed a simple select query on a dataset with 1,219,210 records
via the Zeppelin UI (with Spark already started).
It took Zeppelin about 118 seconds to execute the query, although Spark
needed only 7.3 seconds (I checked in the Spark UI). So I am wondering:
what is Zeppelin doing all that time? Where is the performance killer?
Could you give me some details about the Zeppelin architecture?
I imagine the following: the SparkContext is already started, so the query
can be passed directly to it. I am using Spark 1.3.0, so I think Spark now
does some query optimization. Next, Spark executes the query on our test
cluster. The result is a DataFrame or an RDD containing the result of the
query. At this point Spark is done (I think it took Spark 7.3 seconds to
get here). Next, Zeppelin converts the DataFrame/RDD into a Java object,
maybe an array, in order to push it to the frontend. In the frontend view,
Zeppelin needs to read all the data back out of that Java object to build
the tables and diagrams. I think converting the data from the DataFrame/RDD
and pushing it to the frontend is the bottleneck. It took Zeppelin about
110.7 (118 - 7.3) seconds to display the results, although the results are
limited to 10,000 rows.
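The timing gap described above comes down to simple arithmetic; here is a minimal sketch (plain Python, using the numbers measured above) of how the suspected overhead is derived:

```python
# Numbers reported above: total wall-clock time in the Zeppelin UI vs. the
# execution time shown in the Spark UI for the same query.
total_seconds = 118.0   # Zeppelin UI, end to end
spark_seconds = 7.3     # Spark UI, query execution only

# Everything beyond Spark's own execution is spent converting and
# transferring the result, i.e. the suspected Zeppelin overhead.
overhead_seconds = total_seconds - spark_seconds
print(f"suspected Zeppelin overhead: {overhead_seconds:.1f} s")
```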
Have you observed similar behavior?

kind regards
Tobias


Re: Performance Comparison of SparkSQL Shell and Zeppelin

Posted by Cheolsoo Park <pi...@gmail.com>.
>> Zeppelin internally uses sqlContext.sql to execute queries. And uses
take() to get results.

I cannot speak for all queries, but I have also seen some queries run much
slower in the SQL context compared to the SQL shell in Spark 1.3. For
instance, if I run the following query in the SQL shell against a Hive table:

"select * from <my_table> where <partition_key_1>=20150501 and
<partition_key_2>=0 limit 10;"

Spark launches a single-task job that loads 10 rows from a single input
split in the partition, and it takes about 1 minute.

On the other hand, if I run the same query in the Scala shell via
hiveContext, i.e.:

hc.sql("select * from <my_table> where <partition_key_1>=20150501 and
<partition_key_2>=0 limit 10")

Spark launches a two-stage job where, in the 1st stage, thousands of tasks
scan all the input splits in the partition, and in the 2nd stage, a single
reducer task collects 10 rows from them. This job takes more than 10
minutes to finish.

I don't know yet why the execution plan differs depending on how the query
is executed, but it seems to me that the SQL shell does more optimization
than the SQL context. This is actually quite confusing/frustrating because
since DataFrame was introduced, all queries are supposed to be optimized
by the same code path, and thus queries should result in the same
execution plan no matter how they're executed (e.g. Scala, Python, SQL,
etc.). Apparently, that's not the case in 1.3.
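The plan difference described above can be illustrated with a language-agnostic sketch (ordinary Python, hypothetical names, no Spark involved): one path pushes the limit into the scan and stops early, the other materializes the full scan before limiting.

```python
# Illustrative only: contrast a limit pushed into the scan (stops after n
# rows) with a limit applied after a full scan. The counters record how
# many rows each strategy actually touches.
scanned = {"pushed": 0, "late": 0}

def scan(total, key):
    """Simulates reading rows from a partition, counting each row touched."""
    for i in range(total):
        scanned[key] += 1
        yield i

def limit_pushed_down(total, n):
    # SQL-shell-like behavior observed above: stop as soon as n rows exist.
    out = []
    for row in scan(total, "pushed"):
        out.append(row)
        if len(out) == n:
            break
    return out

def limit_applied_late(total, n):
    # hiveContext-like behavior observed above: scan everything, then limit.
    return list(scan(total, "late"))[:n]

limit_pushed_down(100_000, 10)   # touches only 10 rows
limit_applied_late(100_000, 10)  # touches all 100,000 rows
print(scanned)  # {'pushed': 10, 'late': 100000}
```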

Thanks!
Cheolsoo




Re: Performance Comparison of SparkSQL Shell and Zeppelin

Posted by "Wood, Dean Jr (GE Oil & Gas)" <De...@ge.com>.
I’m going to be running some benchmarks against it on Monday on an AWS cluster. I’ll let you know what I come up with.

Dean


Re: Performance Comparison of SparkSQL Shell and Zeppelin

Posted by moon soo Lee <mo...@apache.org>.
Hi,

Zeppelin internally uses sqlContext.sql to execute queries, and uses take()
to get the results.
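As a rough, language-agnostic sketch of that result path (plain Python, illustrative names only, not Zeppelin's actual code): take the first N rows, then convert them into a text table for the frontend; the conversion and transfer step is where extra time can accumulate.

```python
# Illustrative sketch, not Zeppelin internals: fetch at most MAX_RESULT rows,
# then flatten them into a tab-separated table for display.
MAX_RESULT = 10  # Zeppelin caps displayed rows (the thread mentions 10,000)

def take(rows, n):
    """Pull at most n rows from an iterable, stopping early."""
    out = []
    for r in rows:
        out.append(r)
        if len(out) == n:
            break
    return out

rows = ((i, f"name-{i}") for i in range(1_000_000))  # fake result set
table = "\n".join("\t".join(str(c) for c in r) for r in take(rows, MAX_RESULT))
print(len(table.splitlines()))  # 10 rows reach the frontend, not 1,000,000
```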

There might be overhead in transferring the result to the web GUI and
rendering it, but I guess the rest of the process is the same.

I am also curious whether other people experience a similar problem.

Best,
moon
