Posted to dev@spark.apache.org by Sadhan Sood <sa...@gmail.com> on 2014/11/12 00:38:10 UTC

Partition caching taking too long

While testing Spark SQL on top of our Hive metastore, we were trying to
cache the data for one partition of the table in memory like this:

CACHE TABLE xyz_20141029 AS SELECT * FROM xyz where date_prefix = 20141029
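
For reference, a rough Scala equivalent of the statement above (just a
sketch; the sc and hiveContext names are assumptions, and the
SchemaRDD.count frame further down suggests CACHE TABLE ... AS SELECT
materializes the cache eagerly in a similar way):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // assumes an existing SparkContext named sc
val partition = hiveContext.sql("SELECT * FROM xyz WHERE date_prefix = 20141029")
partition.registerTempTable("xyz_20141029")  // expose the selection as a temp table
hiveContext.cacheTable("xyz_20141029")       // mark it for the in-memory columnar cache
hiveContext.sql("SELECT COUNT(*) FROM xyz_20141029").collect()  // touch it once to materialize the cache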

Table xyz is a Hive table partitioned by date_prefix. The data in the
date_prefix = 20141029 directory is a single Parquet file:

hdfs dfs -ls /event_logs/xyz/20141029

Found 1 items

-rw-r--r--   3 ubuntu hadoop  854521061 2014-11-11 22:20
/event_logs/xyz/20141029/part-01493178cd7f2-31eb-3f9d-b004-149a97ac4d79-r-01493.lzo.parquet

The file is no more than ~800 MB, yet the cache command takes longer than
an hour and, from what I can tell in the UI, reads multiple gigabytes of
data, with multiple task failures along the way.
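
As a first sanity check (nothing we have verified yet), I plan to run the
same partition filter without caching and compare the Input number the UI
reports for that job against the ~800 MB file size:

// Same hypothetical hiveContext as above; the point is only to see how much
// input a plain scan of the 20141029 partition reports in the UI.
hiveContext.sql("SELECT COUNT(*) FROM xyz WHERE date_prefix = 20141029").collect()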

Stage 0 (mapPartitions), which took the longest, showed up in the UI as follows:

mapPartitions at Exchange.scala:86

RDD: HiveTableScan [tid#46,compact#47,date_prefix#45], (MetastoreRelation
default, bid, None), Some((CAST(date_prefix#45, DoubleType) = 2.0141029E7))

org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:602)
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:86)
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45)
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128)
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:423)
org.apache.spark.sql.SchemaRDD.count(SchemaRDD.scala:343)
org.apache.spark.sql.execution.CacheTableCommand.sideEffectResult$lzycompute(commands.scala:168)
org.apache.spark.sql.execution.CacheTableCommand.sideEffectResult(commands.scala:159)
org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
org.apache.spark.sql.execution.CacheTableCommand.execute(commands.scala:153)
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:105)

Submitted: 2014/11/11 22:28:47
Duration: 40 min
Tasks (Succeeded/Total): 19546/19546
Input: 201.1 GB
Shuffle Write: 973.5 KB
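
One detail that stands out to me in the plan above is that the partition
predicate shows up as CAST(date_prefix#45, DoubleType) = 2.0141029E7.
Purely as a guess on my side, I want to compare the plans for the numeric
and quoted forms of the predicate to see whether that cast changes what
gets scanned (same hypothetical hiveContext as before):

// Neither query has been run yet; this is only a comparison I intend to try.
val numericForm = hiveContext.sql(
  "EXPLAIN SELECT * FROM xyz WHERE date_prefix = 20141029")
val quotedForm = hiveContext.sql(
  "EXPLAIN SELECT * FROM xyz WHERE date_prefix = '20141029'")
numericForm.collect().foreach(println)  // I expect the CAST(...) = 2.0141029E7 predicate here
quotedForm.collect().foreach(println)   // partition value kept as a string literal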


I need help understanding what is going on and how we can optimize the
caching.