Posted to user@spark.apache.org by KhajaAsmath Mohammed <md...@gmail.com> on 2017/03/28 14:35:41 UTC

Group by is faster in Impala than Spark SQL - any suggestions

Hi,

I am working on a requirement where I need to join two tables and do a group by
to get the max value of some fields.

Table1: 10 GB of data
Table2: 96 GB of data

The same query takes around 20 minutes in Impala, but it took almost 3
hours to run in Spark SQL.

I have added a repartition to the DataFrame and persisted it to memory and
disk, but the response is still very bad. Any suggestions?

val results_group_dataframe=sqlContext.sql("SELECT a.originalsamplingstate,a.vin,max(a.utctime) as geospatialtime FROM GeoSpatialTemp A GROUP BY a.VIN, a.OriginalSamplingState").repartition(numPartitions)

Thanks,

Asmath

Re: Group by is faster in Impala than Spark SQL - any suggestions

Posted by Ryan <ry...@gmail.com>.
Also, could you paste the stage and task information from the Spark UI?

On Wed, Mar 29, 2017 at 11:30 AM, Ryan <ry...@gmail.com> wrote:

> How long does it take if you remove the repartition and just collect the
> result? I don't think the repartition is needed here. There's already a
> shuffle for the group by.

Re: Group by is faster in Impala than Spark SQL - any suggestions

Posted by Ryan <ry...@gmail.com>.
How long does it take if you remove the repartition and just collect the
result? I don't think the repartition is needed here. There's already a
shuffle for the group by.
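
A minimal sketch of that suggestion, assuming the temp table GeoSpatialTemp
is already registered on the SQLContext. The object name, the method name and
the shufflePartitions parameter are hypothetical, introduced only for
illustration. The idea is that the GROUP BY already shuffles the data, and the
number of partitions used by that shuffle comes from
spark.sql.shuffle.partitions (default 200), so it can be tuned there instead
of adding a second shuffle with repartition():

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

object GroupByWithoutRepartition {
  // Hypothetical helper: runs the aggregation from the original post
  // without the trailing repartition(numPartitions).
  def latestTimePerVin(sqlContext: SQLContext, shufflePartitions: Int): DataFrame = {
    // The shuffle introduced by GROUP BY uses this many partitions,
    // so parallelism is controlled here rather than by repartition().
    sqlContext.setConf("spark.sql.shuffle.partitions", shufflePartitions.toString)

    sqlContext.sql(
      """SELECT a.originalsamplingstate, a.vin, max(a.utctime) AS geospatialtime
        |FROM GeoSpatialTemp a
        |GROUP BY a.vin, a.originalsamplingstate""".stripMargin)
  }
}
```

Something like the following could then be timed on its own. The value 400 is
only a placeholder, and count() is used here instead of collect() so that the
full result is not pulled into the driver:

```scala
val results = GroupByWithoutRepartition.latestTimePerVin(sqlContext, 400)
results.explain()  // prints the physical plan, including the shuffle (Exchange) for the aggregation
results.count()    // forces execution so the run time can be measured
```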
