Posted to user@spark.apache.org by KhajaAsmath Mohammed <md...@gmail.com> on 2017/03/28 14:35:41 UTC
Group by is faster in Impala than in Spark SQL - any suggestions?
Hi,
I am working on a requirement where I need to join two tables and do a group by
to get the max value of some fields.
Table1: 10 GB of data
Table2: 96 GB of data
The same query takes around 20 minutes in Impala, but it took almost 3
hours to run in Spark SQL.
I have added a repartition to the DataFrame and persisted it as memory and disk,
but the response is still very bad. Any suggestions?
val results_group_dataframe = sqlContext.sql(
  """SELECT a.originalsamplingstate, a.vin, max(a.utctime) AS geospatialtime
    |FROM GeoSpatialTemp a
    |GROUP BY a.vin, a.originalsamplingstate""".stripMargin
).repartition(numPartitions)
Thanks,
Asmath
Re: Group by is faster in Impala than in Spark SQL - any suggestions?
Posted by Ryan <ry...@gmail.com>.
Also, could you paste the stage and task information from the Spark UI?
On Wed, Mar 29, 2017 at 11:30 AM, Ryan <ry...@gmail.com> wrote:
> How long does it take if you remove the repartition and just collect the
> result? I don't think the repartition is needed here. The group by already
> triggers a shuffle.
Re: Group by is faster in Impala than in Spark SQL - any suggestions?
Posted by Ryan <ry...@gmail.com>.
How long does it take if you remove the repartition and just collect the
result? I don't think the repartition is needed here. The group by already
triggers a shuffle.
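
For illustration, here is a minimal sketch of the same query without the extra
repartition. It is not the poster's actual job: it assumes Spark 2.x
(SparkSession), a local master, and a tiny stand-in for the GeoSpatialTemp
table. The object name GroupBySketch, the helper latestTimes, and the value 8
for spark.sql.shuffle.partitions are all hypothetical. The idea is that the
number of partitions produced by the group by is governed by
spark.sql.shuffle.partitions, so a second shuffle via repartition may not be
needed, and explain() prints the physical plan so the shuffle (Exchange) can
be checked.

```scala
import org.apache.spark.sql.SparkSession

object GroupBySketch {
  // Runs the GROUP BY from the thread against a toy table and returns
  // (originalsamplingstate, vin) -> max(utctime).
  def latestTimes(spark: SparkSession): Map[(String, String), Long] = {
    import spark.implicits._

    // Toy stand-in for the real GeoSpatialTemp table (assumption).
    Seq(
      ("S1", "VIN1", 100L),
      ("S1", "VIN1", 250L),
      ("S2", "VIN2", 175L)
    ).toDF("originalsamplingstate", "vin", "utctime")
      .createOrReplaceTempView("GeoSpatialTemp")

    // No .repartition(...) here: the aggregation already shuffles by the
    // grouping keys.
    val results = spark.sql(
      """SELECT a.originalsamplingstate, a.vin, max(a.utctime) AS geospatialtime
        |FROM GeoSpatialTemp a
        |GROUP BY a.vin, a.originalsamplingstate""".stripMargin)

    // Print the physical plan to inspect the shuffle introduced by GROUP BY.
    results.explain()

    results.collect()
      .map(r => (r.getString(0), r.getString(1)) -> r.getLong(2))
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("groupby-sketch")
      .master("local[2]") // local mode, for the sketch only
      // Partition count used by the GROUP BY shuffle; 8 is a made-up value.
      .config("spark.sql.shuffle.partitions", "8")
      .getOrCreate()

    latestTimes(spark).foreach(println)
    spark.stop()
  }
}
```

On Spark 1.x, where only sqlContext is available, the same setting can be
applied with sqlContext.setConf("spark.sql.shuffle.partitions", "8") before
running the query.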