Posted to user@spark.apache.org by Kevin Paul <ke...@gmail.com> on 2014/11/01 01:30:50 UTC

Some of the statistics functions in SparkSQL are very slow

Hi all, some of the statistics functions that I tried in HiveContext are
very slow, notably percentile and var_samp. The symptom is the same as
what I described in my previous email (forwarded below). When I call
schemaRDD.collect on the resulting RDD, the shuffle size is around
1000 GB. Is there anything else I could do to speed this up?

Thanks,
Kevin Paul
---------- Forwarded message ----------
From: Kevin Paul <ke...@gmail.com>
Date: Sat, Oct 25, 2014 at 8:48 PM
Subject: HiveSQL percentile is query slow
To: user <us...@spark.apache.org>


Hi all, I tried to run the following SQL command in HiveContext with
my table loaded into memory:
  SELECT percentile(myColumn, array(0.1, 0.5)) FROM myTable

That query took more than 5 minutes to complete, whereas a query like
  SELECT min(myColumn), max(myColumn) FROM myTable
took only around 10 seconds to run.
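
One possible reading of the gap between the two queries, offered as an
illustrative sketch and not as a diagnosis of this cluster: min and max can
be computed as small per-partition partial results that are then merged,
whereas an exact percentile needs every value (or an equivalent full
summary) brought together before it can be ranked. The plain-Python sketch
below is not Spark or Hive code. The function names, the sample partitions,
and the linear-interpolation percentile definition are all assumptions made
for illustration.

```python
from functools import reduce

def partial_min_max(partition):
    # Each partition is reduced to a constant-size pair (min, max).
    return (min(partition), max(partition))

def merge_min_max(a, b):
    # Merging two partial results is cheap and order-independent.
    return (min(a[0], b[0]), max(a[1], b[1]))

def exact_percentiles(partitions, fractions):
    # An exact percentile has no constant-size partial result: all values
    # have to be gathered and sorted in one place before ranking.
    all_values = sorted(v for p in partitions for v in p)
    n = len(all_values)
    result = []
    for f in fractions:
        pos = f * (n - 1)            # fractional rank (assumed definition)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        # Linear interpolation between the two neighbouring values.
        result.append(all_values[lo] + (pos - lo) * (all_values[hi] - all_values[lo]))
    return result

# Hypothetical data standing in for the partitions of myColumn.
partitions = [[5, 1, 9], [3, 7], [2, 8, 4, 6]]

min_max = reduce(merge_min_max, map(partial_min_max, partitions))
percentiles = exact_percentiles(partitions, [0.1, 0.5])
print(min_max, percentiles)
```

In this sketch only three small pairs cross partition boundaries for
min/max, while the percentile path moves every value to a single place. If
something similar happens in the real job, it would be consistent with a
fast first stage followed by a slow, shuffle-heavy final stage.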

My Spark version is 1.2.0-SNAPSHOT, the cluster has 10 slaves, the
dataset is 10 GB, and I'm running in yarn-client mode.
The query ran in two stages:
 1. mapPartitions at Exchange.scala:86, with a duration of 9 s
 2. collect at SparkPlan.scala:85, with a duration of 5.3 min

I have attached the Summary Metrics for the collect task here.
Thanks,
Kevin Paul