Posted to user@spark.apache.org by Rabin Banerjee <de...@gmail.com> on 2016/09/10 05:21:47 UTC

SparkSQL DAG generation, DAG optimization, DAG execution

Hi All,

 I am writing and executing a Spark batch program that uses only Spark SQL,
but it takes a long time and eventually fails with a "GC overhead limit
exceeded" error.

Here is the program:

1. Read 3 files (one medium-sized and two small) and register them as DataFrames.

2. Fire a SQL query with complex aggregation and windowing, and register
   the result as a DataFrame.

3. Repeat step 2 almost 50 times, so about 50 SQL queries in total.

4. All SQLs are sequential, i.e. each step requires the previous step's result.

5. Finally, save the final DataFrame (this is the only action called).
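
The steps above might look like the following against Spark 1.5's Scala
DataFrame API. This is a hypothetical sketch, not the actual job: the file
names, temp-table names, and SQL text are placeholders.

```scala
// Hypothetical sketch of the batch described above (Spark 1.5 API).
// File names, temp-table names, and the SQL text are placeholders.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("batch-sketch"))
val sqlContext = new SQLContext(sc)

// 1. Read the three files and register them as temp tables.
val medium = sqlContext.read.parquet("medium-file") // plus the two small files
medium.registerTempTable("step_0")

// 2-4. Fire ~50 sequential SQL statements; each reads the previous result.
var step = sqlContext.table("step_0")
for (i <- 1 to 50) {
  step = sqlContext.sql(s"SELECT /* aggregation + windowing */ ... FROM step_${i - 1}")
  step.registerTempTable(s"step_$i")
}

// 5. The single action: save the final DataFrame.
step.write.parquet("final-output")
```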

Notes:

1. I haven't persisted the intermediate DataFrames, as I think Spark will
optimize the multiple SQL queries into one physical execution plan.
2. Executor memory and driver memory are both set to 4 GB, which should be
more than enough since the data size is in MB.
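
For reference, those memory settings would typically be supplied on the
command line like this (a generic spark-submit invocation; the master, jar,
and class names here are placeholders, not from this job):

```shell
# Generic invocation with the 4 GB settings mentioned above.
spark-submit \
  --master yarn \
  --driver-memory 4g \
  --executor-memory 4g \
  --class com.example.BatchJob \
  batch-job.jar
```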

Questions:

1. Will Spark optimize multiple SQL queries into one single physical plan?
2. In the DAG I can see a lot of file reads and a lot of stages. Why, when
I only called an action once?
3. Will every SQL query execute, with its intermediate result stored in
memory?
4. What is causing the OOM and GC overhead here?
5. What optimizations could be applied?

Spark Version 1.5.x


Thanks in advance.
Rabin

Re: SparkSQL DAG generation, DAG optimization, DAG execution

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi

   1. You are doing some analytics, I guess?
   2. It is almost impossible to guess what is happening, except that you
   are looping 50 times over the same kind of SQL.
   3. Your SQL at step n depends on step n-1, so Spark cannot get rid of
   steps 1 to n-1.
   4. You are not storing anything in memory (no cache, no persist), so all
   memory is used for execution.
   5. What happens when you run it only once? How much memory is used?
   (Look at the UI page, port 4040 by default.)
   6. What Spark mode is being used (local, standalone, YARN)?
   7. OOM could be anything, depending on how much memory you are
   allocating to the driver in spark-submit.
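
For point 4, one thing worth trying: persist and materialize an intermediate
DataFrame every few steps, so Spark does not have to re-read the source files
and re-plan the entire 50-query lineage for each later query. A hypothetical
sketch (table names and SQL text are placeholders, not from the thread):

```scala
// Hypothetical mitigation sketch: persist every 10th step so later queries
// read the cached intermediate result instead of recomputing the full chain.
import org.apache.spark.storage.StorageLevel

var step = sqlContext.table("step_0")
for (i <- 1 to 50) {
  step = sqlContext.sql(s"SELECT ... FROM step_${i - 1}")
  step.registerTempTable(s"step_$i")
  if (i % 10 == 0) {
    step.persist(StorageLevel.MEMORY_AND_DISK)
    step.count() // an action, so the cached result is materialized here
  }
}
```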

HTH





Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


