Posted to user@spark.apache.org by Louis Hust <lo...@gmail.com> on 2015/07/26 09:47:37 UTC

Spark is much slower than direct access MySQL

Hi all,

I am using a Spark DataFrame to fetch a small table from MySQL,
and I found it takes much longer than accessing MySQL directly over JDBC.

The Spark query takes about 2033 ms, while direct access takes about 16 ms.

Code can be found at:

https://github.com/louishust/sparkDemo/blob/master/src/main/java/DirectQueryTest.java

Is my Spark configuration wrong? How can I optimise Spark to achieve
performance similar to direct access?

Any ideas would be appreciated!

Re: Spark is much slower than direct access MySQL

Posted by Louis Hust <lo...@gmail.com>.

I got it, thanks for that

2015-07-26 17:21 GMT+08:00 Paolo Platter <pa...@agilelab.it>:

> If you want a performance boost, you need to load the full table into
> memory using caching and then execute your query directly on the cached
> DataFrame. Otherwise you use Spark only as a bridge and you don't leverage
> the distributed in-memory engine of Spark.
>
> Paolo

Re: Spark is much slower than direct access MySQL

Posted by Paolo Platter <pa...@agilelab.it>.

If you want a performance boost, you need to load the full table into memory using caching and then execute your query directly on the cached DataFrame. Otherwise you use Spark only as a bridge and you don't leverage the distributed in-memory engine of Spark.

Paolo
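
The load-once-then-query idea above can be illustrated without Spark. The following is only a plain-Java analogy under stated assumptions: `CachedTableSketch`, its `loader`, and `loadCount` are hypothetical names, and the loader stands in for whatever fetches the whole table (for example a JDBC read). In Spark itself the corresponding step would be calling `cache()` on the DataFrame before running queries against it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical illustration: fetch the table once, then answer
// every later query from the in-memory copy. Not thread-safe.
public class CachedTableSketch<R> {
    private final Supplier<List<R>> loader; // stand-in for the expensive fetch
    private List<R> cached;                 // null until the first query
    private int loadCount = 0;              // how many times the loader ran

    public CachedTableSketch(Supplier<List<R>> loader) {
        this.loader = loader;
    }

    // Runs the loader on first use only; later calls reuse the cached rows.
    private List<R> rows() {
        if (cached == null) {
            cached = new ArrayList<>(loader.get());
            loadCount++;
        }
        return cached;
    }

    // Filters the cached rows in memory without touching the source again.
    public List<R> query(Predicate<R> filter) {
        List<R> out = new ArrayList<>();
        for (R row : rows()) {
            if (filter.test(row)) {
                out.add(row);
            }
        }
        return out;
    }

    public int loadCount() {
        return loadCount;
    }

    public static void main(String[] args) {
        CachedTableSketch<Integer> table =
                new CachedTableSketch<>(() -> Arrays.asList(1, 2, 3, 4, 5, 6));
        System.out.println(table.query(x -> x % 2 == 0)); // [2, 4, 6]
        System.out.println(table.query(x -> x > 4));      // [5, 6]
        System.out.println("loads: " + table.loadCount()); // loads: 1
    }
}
```

The point of the sketch is that the cost of the fetch is paid once, so it only helps when several queries run against the same data; a single query on a small table still pays the full load cost.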

Sent from my Windows Phone
________________________________
From: Louis Hust<ma...@gmail.com>
Sent: 26/07/2015 10:28
To: Shixiong Zhu<ma...@gmail.com>
Cc: Jerrick Hoang<ma...@gmail.com>; user@spark.apache.org<ma...@spark.apache.org>
Subject: Re: Spark is much slower than direct access MySQL

Thanks for your explanation.

Re: Spark is much slower than direct access MySQL

Posted by Louis Hust <lo...@gmail.com>.

Thanks for your explanation.

2015-07-26 16:22 GMT+08:00 Shixiong Zhu <zs...@gmail.com>:

> Oh, I see. That's the total time of executing a query in Spark. Then the
> difference is reasonable, considering Spark has much more work to do, e.g.,
> launching tasks in executors.
>
> Best Regards,
> Shixiong Zhu

Re: Spark is much slower than direct access MySQL

Posted by Shixiong Zhu <zs...@gmail.com>.

Oh, I see. That's the total time of executing a query in Spark. Then the
difference is reasonable, considering Spark has much more work to do, e.g.,
launching tasks in executors.

Best Regards,
Shixiong Zhu

2015-07-26 16:16 GMT+08:00 Louis Hust <lo...@gmail.com>:

> Look at the given URL:
>
> Code can be found at:
>
>
> https://github.com/louishust/sparkDemo/blob/master/src/main/java/DirectQueryTest.java

Re: Spark is much slower than direct access MySQL

Posted by Louis Hust <lo...@gmail.com>.

Look at the given URL:

Code can be found at:

https://github.com/louishust/sparkDemo/blob/master/src/main/java/DirectQueryTest.java

2015-07-26 16:14 GMT+08:00 Shixiong Zhu <zs...@gmail.com>:

> Could you clarify how you measure the Spark time cost? Is it the total
> time of running the query? If so, the result is plausible, because Spark's
> fixed overhead dominates for small queries.
>
> Best Regards,
> Shixiong Zhu

Re: Spark is much slower than direct access MySQL

Posted by Shixiong Zhu <zs...@gmail.com>.

Could you clarify how you measure the Spark time cost? Is it the total time
of running the query? If so, the result is plausible, because Spark's
fixed overhead dominates for small queries.

Best Regards,
Shixiong Zhu
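
One way to answer the measurement question is to time each run separately, so that one-time startup cost (visible in the first run) can be told apart from the steady-state cost of later runs. The sketch below is a minimal, Spark-free illustration in Java. `QueryTimingSketch`, `timeRuns`, and `steadyStateMean` are hypothetical names, and the `query` body is only a placeholder for the real Spark or JDBC call.

```java
// Hypothetical timing harness: times every run on its own so the
// first (cold) run can be compared with the later (warm) runs.
public class QueryTimingSketch {

    // Runs the query `runs` times and returns elapsed nanoseconds per run.
    public static long[] timeRuns(Runnable query, int runs) {
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            query.run();
            elapsed[i] = System.nanoTime() - start;
        }
        return elapsed;
    }

    // Mean of all runs except the first, i.e. the steady-state cost.
    public static double steadyStateMean(long[] elapsed) {
        if (elapsed.length < 2) {
            throw new IllegalArgumentException("need at least two runs");
        }
        long sum = 0;
        for (int i = 1; i < elapsed.length; i++) {
            sum += elapsed[i];
        }
        return (double) sum / (elapsed.length - 1);
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the Spark or JDBC query under test.
        Runnable query = () -> {
            long acc = 0;
            for (int i = 0; i < 1_000_000; i++) {
                acc += i;
            }
            if (acc < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        };

        long[] elapsed = timeRuns(query, 10);
        System.out.printf("first run:    %.3f ms%n", elapsed[0] / 1e6);
        System.out.printf("steady state: %.3f ms%n", steadyStateMean(elapsed) / 1e6);
    }
}
```

If the first run is far slower than the rest, most of the reported time is startup overhead rather than query work. If every run stays slow, the overhead is paid per query, which matches the task-launching cost described elsewhere in this thread.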

2015-07-26 15:56 GMT+08:00 Jerrick Hoang <je...@gmail.com>:

> How big is the dataset? How complicated is the query?

Re: Spark is much slower than direct access MySQL

Posted by Jerrick Hoang <je...@gmail.com>.

How big is the dataset? How complicated is the query?

On Sun, Jul 26, 2015 at 12:47 AM Louis Hust <lo...@gmail.com> wrote:

> Hi all,
>
> I am using a Spark DataFrame to fetch a small table from MySQL,
> and I found it takes much longer than accessing MySQL directly over JDBC.
>
> The Spark query takes about 2033 ms, while direct access takes about 16 ms.
>
> Code can be found at:
>
> https://github.com/louishust/sparkDemo/blob/master/src/main/java/DirectQueryTest.java
>
> Is my Spark configuration wrong? How can I optimise Spark to achieve
> performance similar to direct access?
>
> Any ideas would be appreciated!