Posted to user@spark.apache.org by "leosandylh@gmail.com" <le...@gmail.com> on 2014/01/11 08:47:56 UTC
Fwd: some problems about shark on spark
leosandylh@gmail.com
From: leosandylh@gmail.com
Sent: 2014-01-10 22:29
To: user; shark-users
Subject: some problems about shark on spark
Hi all,
How can I set MEMORY_ONLY_SER, spark.kryoserializer.buffer.mb, spark.default.parallelism, and spark.worker.timeout
when I run a Shark query?
Can I set other parameters in spark-env.sh or hive-site.xml instead,
or set name=value in the Shark CLI?
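For concreteness, this is the kind of setting I mean (a sketch based on the Spark 0.8-era convention of passing Spark properties as Java system properties via SPARK_JAVA_OPTS; I am not sure Shark honours all of these, and the values are only placeholders):

```shell
# spark-env.sh -- Spark properties passed as Java system properties
# (0.8.x-era convention; property support may differ by Shark version)
export SPARK_JAVA_OPTS="-Dspark.kryoserializer.buffer.mb=32 \
  -Dspark.default.parallelism=8 \
  -Dspark.worker.timeout=120"
```

MEMORY_ONLY_SER is a storage level rather than a configuration key, so presumably it has to be chosen where the table is cached rather than in spark-env.sh.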
I ran a Shark query test:
table a is 38 bytes; table b is 23 bytes;
sql: select a.*, b.* from a join b on a.id = b.id;
it builds three stages:
stage1 has two tasks:
task1: rdd.HadoopRDD : input split table a 0+19 ;
task2: rdd.HadoopRDD : input split table a 19+19;
stage2 has two tasks:
task1: rdd.HadoopRDD : input split table b 0+11 ;
task2: rdd.HadoopRDD : input split table b 11+12;
stage3 has one task:
task1: just fetches the map outputs for the shuffle and writes them to an HDFS path.
These tables are so small; why does each one get two tasks to read it?
How can I control the number of reduce tasks in Shark? It seems to be computed from the partition count of the largest parent RDD?
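For context on what I have tried: I believe Spark's Hadoop input methods default to a minimum of two splits (defaultMinSplits), which might explain two read tasks even for a 38-byte table, and I wondered whether the reduce count can be pinned with a Hive-style set in the CLI (a sketch; I am not sure which of these properties Shark actually honours):

```sql
-- inside the Shark CLI (Hive-style syntax; property support may vary)
set mapred.reduce.tasks=1;   -- ask for a single reduce task
select a.*, b.* from a join b on a.id = b.id;
```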
THX !
leosandylh@gmail.com