Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2016/06/30 21:10:57 UTC

Re: Logical Plan

Which version are you using here? If the underlying files change,
technically we should go through optimization again.

Perhaps the real "fix" is to figure out why logical plan creation is so
slow for 700 columns.


On Thu, Jun 30, 2016 at 1:58 PM, Darshan Singh <da...@gmail.com>
wrote:

> Is there a way I can reuse the same logical plan for a query? Everything
> will be the same except the underlying file will be different.
>
> The issue is that my query has around 700 columns, and generating the
> logical plan takes 20 seconds. This happens every 2 minutes, but each time
> the underlying file is different.
>
> I do not know these files in advance, so I can't create the table at the
> directory level. These files are created and then used in the final query.
>
> Thanks
>

Re: Logical Plan

Posted by Reynold Xin <rx...@databricks.com>.
drop user@spark and keep only dev@

This is something great to figure out, if you have time. Two things that
would be great to try:

1. See how this works on Spark 2.0.

2. If it is slow, try the following:

org.apache.spark.sql.catalyst.rules.RuleExecutor.resetTime()

// run your query

org.apache.spark.sql.catalyst.rules.RuleExecutor.dumpTimeSpent()
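
Put together, a minimal sketch of that sequence (assuming a Spark 2.0
spark-shell session, so spark is the SparkSession; the query string here is
just a stand-in for the real 700-column one):

import org.apache.spark.sql.catalyst.rules.RuleExecutor

RuleExecutor.resetTime()
val df = spark.sql("select col1, col2 from df1 group by col1, col2")
df.queryExecution.optimizedPlan   // forces analysis and optimization; no job runs
println(RuleExecutor.dumpTimeSpent())  // per-rule time accumulated by the optimizer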


And report back where the time is spent, if possible. Thanks!



On Thu, Jun 30, 2016 at 2:53 PM, Darshan Singh <da...@gmail.com>
wrote:

> I am using 1.5.2.
>
> I have a data frame with 10 columns, and then I pivot 1 column to generate
> the 700 columns.
>
> It is like:
>
> val df1 = sqlContext.read.parquet("file1")
> df1.registerTempTable("df1")
> val df2 = sqlContext.sql("select col1, col2, sum(case when col3 = 1 then
> col4 else 0.0 end) as col4_1, ..., sum(case when col3 = 700 then col4 else
> 0.0 end) as col4_700 from df1 group by col1, col2")
>
> Now this last statement takes around 20-30 seconds. I run this a number of
> times; the only difference is that the file for df1 is different.
> Everything else is the same.
>
> The actual statement takes 2-3 seconds, so it is a bit frustrating that
> just generating the plan for df2 takes so much time. Worse, this runs on
> the driver, so it is not parallelized.
>
> I have a similar issue in another query where, from these 700 columns, we
> generate more columns by adding or subtracting them, and it again takes a
> lot of time.
>
> Not sure what could be done here.
>
> Thanks
>
> On Thu, Jun 30, 2016 at 10:10 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> Which version are you using here? If the underlying files change,
>> technically we should go through optimization again.
>>
>> Perhaps the real "fix" is to figure out why logical plan creation is so
>> slow for 700 columns.
>>
>>
>> On Thu, Jun 30, 2016 at 1:58 PM, Darshan Singh <da...@gmail.com>
>> wrote:
>>
>>> Is there a way I can reuse the same logical plan for a query? Everything
>>> will be the same except the underlying file will be different.
>>>
>>> The issue is that my query has around 700 columns, and generating the
>>> logical plan takes 20 seconds. This happens every 2 minutes, but each
>>> time the underlying file is different.
>>>
>>> I do not know these files in advance, so I can't create the table at the
>>> directory level. These files are created and then used in the final query.
>>>
>>> Thanks
>>>
>>
>>
>

Re: Logical Plan

Posted by Mich Talebzadeh <mi...@gmail.com>.
I don't think the Spark optimizer supports something like a statement
cache, where the plan is cached and bind variables (as in an RDBMS) are
used for different values, thus saving the parsing.

What you're stating is that the source and tempTable change but the plan
itself remains the same. I have not seen this in 1.6.1, and as I understand
it, Spark does not yet support CBO, not even in 2.0.


HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 30 June 2016 at 22:53, Darshan Singh <da...@gmail.com> wrote:

> I am using 1.5.2.
>
> I have a data frame with 10 columns, and then I pivot 1 column to generate
> the 700 columns.
>
> It is like:
>
> val df1 = sqlContext.read.parquet("file1")
> df1.registerTempTable("df1")
> val df2 = sqlContext.sql("select col1, col2, sum(case when col3 = 1 then
> col4 else 0.0 end) as col4_1, ..., sum(case when col3 = 700 then col4 else
> 0.0 end) as col4_700 from df1 group by col1, col2")
>
> Now this last statement takes around 20-30 seconds. I run this a number of
> times; the only difference is that the file for df1 is different.
> Everything else is the same.
>
> The actual statement takes 2-3 seconds, so it is a bit frustrating that
> just generating the plan for df2 takes so much time. Worse, this runs on
> the driver, so it is not parallelized.
>
> I have a similar issue in another query where, from these 700 columns, we
> generate more columns by adding or subtracting them, and it again takes a
> lot of time.
>
> Not sure what could be done here.
>
> Thanks
>
> On Thu, Jun 30, 2016 at 10:10 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> Which version are you using here? If the underlying files change,
>> technically we should go through optimization again.
>>
>> Perhaps the real "fix" is to figure out why logical plan creation is so
>> slow for 700 columns.
>>
>>
>> On Thu, Jun 30, 2016 at 1:58 PM, Darshan Singh <da...@gmail.com>
>> wrote:
>>
>>> Is there a way I can reuse the same logical plan for a query? Everything
>>> will be the same except the underlying file will be different.
>>>
>>> The issue is that my query has around 700 columns, and generating the
>>> logical plan takes 20 seconds. This happens every 2 minutes, but each
>>> time the underlying file is different.
>>>
>>> I do not know these files in advance, so I can't create the table at the
>>> directory level. These files are created and then used in the final query.
>>>
>>> Thanks
>>>
>>
>>
>

Re: Logical Plan

Posted by Darshan Singh <da...@gmail.com>.
I am using 1.5.2.

I have a data frame with 10 columns, and then I pivot 1 column to generate
the 700 columns.

It is like:

val df1 = sqlContext.read.parquet("file1")
df1.registerTempTable("df1")
val df2 = sqlContext.sql("select col1, col2, sum(case when col3 = 1 then
col4 else 0.0 end) as col4_1, ..., sum(case when col3 = 700 then col4 else
0.0 end) as col4_700 from df1 group by col1, col2")

Now this last statement takes around 20-30 seconds. I run this a number of
times; the only difference is that the file for df1 is different.
Everything else is the same.
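
A sketch of how such a statement can be built programmatically rather than
typed out by hand (assuming the 700 buckets are simply the values 1 to 700
of col3; pivotExprs is an illustrative name):

val pivotExprs = (1 to 700).map { i =>
  s"sum(case when col3 = $i then col4 else 0.0 end) as col4_$i"
}.mkString(", ")
val df2 = sqlContext.sql(
  s"select col1, col2, $pivotExprs from df1 group by col1, col2")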

The actual statement takes 2-3 seconds, so it is a bit frustrating that
just generating the plan for df2 takes so much time. Worse, this runs on
the driver, so it is not parallelized.
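
One rough way to confirm the time goes into planning rather than execution
(queryExecution is a developer API, so treat this as a diagnostic sketch):

val t0 = System.nanoTime()
df2.queryExecution.optimizedPlan  // runs analysis and optimization only; no job is launched
println(s"planning took ${(System.nanoTime() - t0) / 1e9} s")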

I have a similar issue in another query where, from these 700 columns, we
generate more columns by adding or subtracting them, and it again takes a
lot of time.

Not sure what could be done here.

Thanks

On Thu, Jun 30, 2016 at 10:10 PM, Reynold Xin <rx...@databricks.com> wrote:

> Which version are you using here? If the underlying files change,
> technically we should go through optimization again.
>
> Perhaps the real "fix" is to figure out why logical plan creation is so
> slow for 700 columns.
>
>
> On Thu, Jun 30, 2016 at 1:58 PM, Darshan Singh <da...@gmail.com>
> wrote:
>
>> Is there a way I can reuse the same logical plan for a query? Everything
>> will be the same except the underlying file will be different.
>>
>> The issue is that my query has around 700 columns, and generating the
>> logical plan takes 20 seconds. This happens every 2 minutes, but each
>> time the underlying file is different.
>>
>> I do not know these files in advance, so I can't create the table at the
>> directory level. These files are created and then used in the final query.
>>
>> Thanks
>>
>
>

Re: Logical Plan

Posted by Mich Talebzadeh <mi...@gmail.com>.
A logical plan should not change, assuming the same DAG diagram is used
throughout.


Have you tried the Spark GUI page under Stages? This is a Spark 2
example:

[image: screenshot of the Spark UI Stages page, not included in this plain-text archive]

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 30 June 2016 at 22:10, Reynold Xin <rx...@databricks.com> wrote:

> Which version are you using here? If the underlying files change,
> technically we should go through optimization again.
>
> Perhaps the real "fix" is to figure out why logical plan creation is so
> slow for 700 columns.
>
>
> On Thu, Jun 30, 2016 at 1:58 PM, Darshan Singh <da...@gmail.com>
> wrote:
>
>> Is there a way I can reuse the same logical plan for a query? Everything
>> will be the same except the underlying file will be different.
>>
>> The issue is that my query has around 700 columns, and generating the
>> logical plan takes 20 seconds. This happens every 2 minutes, but each
>> time the underlying file is different.
>>
>> I do not know these files in advance, so I can't create the table at the
>> directory level. These files are created and then used in the final query.
>>
>> Thanks
>>
>
>