Posted to user@spark.apache.org by Patrick <ti...@gmail.com> on 2017/02/18 21:23:22 UTC

Efficient Spark-Sql queries when only nth Column changes

Hi,

I have read 5 columns from Parquet into a DataFrame. My queries on the
Parquet table are of the type below:

val df1 = sqlContext.sql("select col1, col2, count(*) from table group by col1, col2")
val df2 = sqlContext.sql("select col1, col3, count(*) from table group by col1, col3")
val df3 = sqlContext.sql("select col1, col4, count(*) from table group by col1, col4")
val df4 = sqlContext.sql("select col1, col5, count(*) from table group by col1, col5")

And then I need to union the results from df1 to df4 into a single DataFrame.


So basically, only the second column changes. Is there an efficient way to
write the above in Spark SQL, instead of writing 4 separate queries (or a
loop) and unioning the results?
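
For reference, a minimal sketch of the loop-plus-union version I would like to avoid (column and table names as above; the literal column and the cast just put the four results on one common schema, and unionAll is the Spark 1.x method, union in 2.x):

val combined = Seq("col2", "col3", "col4", "col5").map { c =>
  sqlContext
    .sql(s"select col1, $c, count(*) as cnt from table group by col1, $c")
    .selectExpr("col1", s"'$c' as grouped_col", s"cast($c as string) as grouped_val", "cnt")
}.reduce(_ unionAll _)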


Thanks

Re: Efficient Spark-Sql queries when only nth Column changes

Posted by ayan guha <gu...@gmail.com>.
Try grouping sets.
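
A sketch of what that could look like (hiveContext being an org.apache.spark.sql.hive.HiveContext, since on Spark 1.x the GROUPING SETS syntax comes from the HiveQL parser; in each output row the columns that are not part of that row's grouping set come back as NULL):

val grouped = hiveContext.sql("""
  select col1, col2, col3, col4, col5, count(*) as cnt
  from table
  group by col1, col2, col3, col4, col5
  grouping sets ((col1, col2), (col1, col3), (col1, col4), (col1, col5))
""")

This scans the table once and returns the union of the four aggregations in a single DataFrame.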


-- 
Best Regards,
Ayan Guha

Re: Efficient Spark-Sql queries when only nth Column changes

Posted by Patrick <ti...@gmail.com>.
Hi,

Thanks all,

I checked both approaches; grouping sets worked better for me, because I
didn't want to cache the base DataFrame since I allocate a large fraction of
memory to shuffle operations. However, I could only use grouping sets through
HiveContext. I am using Spark 1.5, and I think SQLContext doesn't have this
functionality, so anyone who wants to use SQLContext will need to stick with
the cache option.
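
For anyone on Spark 1.x who has not set it up before, a minimal sketch of switching to HiveContext (assuming an existing SparkContext named sc):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // understands the HiveQL GROUPING SETS syntax
// the grouping-sets query from the earlier reply can then be run with hiveContext.sql(...)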


Thanks


Re: Efficient Spark-Sql queries when only nth Column changes

Posted by Yong Zhang <ja...@hotmail.com>.
If your group-bys share the same hierarchy, you can group by at the lowest level and cache it, then use the cached DataFrame to derive the higher-level aggregates. Spark will scan the original table only once and reuse the cache for everything after that.


val df_base = sqlContext.sql("select col1, col2, col3, col4, col5, count(*) as cnt from table group by col1, col2, col3, col4, col5").cache

df_base.registerTempTable("df_base")

val df1 = sqlContext.sql("select col1, col2, sum(cnt) as cnt from df_base group by col1, col2")

val df2 = // similar logic for (col1, col3), (col1, col4), (col1, col5)
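
Filling in the elided part, a possible sketch of the remaining derivations and the final union (sum(cnt) rather than count(*), so the derived numbers still count the original table rows instead of the cached groups; unionAll is the Spark 1.x method, union in 2.x):

val df2 = sqlContext.sql("select col1, col3, sum(cnt) as cnt from df_base group by col1, col3")
val df3 = sqlContext.sql("select col1, col4, sum(cnt) as cnt from df_base group by col1, col4")
val df4 = sqlContext.sql("select col1, col5, sum(cnt) as cnt from df_base group by col1, col5")

// align the schemas (the second column differs in each frame) before the union
val result = Seq(df1, df2, df3, df4)
  .map(df => df.selectExpr("col1", s"cast(${df.columns(1)} as string) as grouped_val", "cnt"))
  .reduce(_ unionAll _)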

Yong