Posted to user@spark.apache.org by "leosandylh@gmail.com" <le...@gmail.com> on 2014/01/11 08:47:56 UTC

Fwd: some problems about shark on spark





leosandylh@gmail.com

From: leosandylh@gmail.com
Sent: 2014-01-10 22:29
To: user; shark-users
Subject: some problems about shark on spark
Hi all,
How can I set the parameters MEMORY_ONLY_SER, spark.kryoserializer.buffer.mb, spark.default.parallelism, and spark.worker.timeout
when I run a Shark query?
Can I set these params in spark-env.sh or hive-site.xml instead, or set name=value in the Shark CLI?
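
For example, would something like this in conf/spark-env.sh work? This is just a sketch of the Java-system-property mechanism that Spark 0.8.x / 0.9.x reads; the values are placeholders, and I am assuming Shark inherits them from the Spark installation it runs on.

# conf/spark-env.sh -- sketch only; placeholder values, assuming Shark picks these up from Spark
export SPARK_JAVA_OPTS="-Dspark.kryoserializer.buffer.mb=32 \
  -Dspark.default.parallelism=8 \
  -Dspark.worker.timeout=120"
# MEMORY_ONLY_SER is a StorageLevel rather than a system property, so it presumably
# belongs with how a table is cached in Shark, not in this file.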

I ran a Shark query test:
table a is 38 bytes; table b is 23 bytes;
sql: select a.*, b.* from a join b on a.id = b.id;
It builds three stages:
stage 1 has two tasks:
task 1: rdd.HadoopRDD: input split of table a, 0+19;
task 2: rdd.HadoopRDD: input split of table a, 19+19;
stage 2 has two tasks:
task 1: rdd.HadoopRDD: input split of table b, 0+11;
task 2: rdd.HadoopRDD: input split of table b, 11+12;
stage 3 has one task:
task 1: it just fetches the map outputs for the shuffle and writes the result to an HDFS path.

Why do such small tables still need two tasks each to read them?
How can I control the number of reduce tasks in Shark? It seems to be computed from the largest parent RDD's partition count?
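
For comparison, plain Spark (Scala) seems to follow the same rule, and passing an explicit partition count to the join overrides it. A rough sketch, not my actual Shark job (the RDDs, names, and counts are only illustrative):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // pair-RDD functions such as join

object JoinPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "join-partitions-sketch")
    // Two partitions each, mirroring the two map tasks per table above.
    val rddA = sc.parallelize(Seq((1, "a1"), (2, "a2")), 2)
    val rddB = sc.parallelize(Seq((1, "b1"), (2, "b2")), 2)

    // With no parent partitioner and spark.default.parallelism unset, join picks a
    // HashPartitioner sized to the largest parent RDD's partition count.
    println(rddA.join(rddB).partitions.size)     // 2

    // An explicit partition count overrides that default (8 "reduce" tasks here).
    println(rddA.join(rddB, 8).partitions.size)  // 8

    sc.stop()
  }
}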

Thanks!




leosandylh@gmail.com