Posted to user@spark.apache.org by HARIPRIYA AYYALASOMAYAJULA <ah...@gmail.com> on 2014/11/07 00:38:30 UTC

job works well on small data set but fails on large data set

Hello all,

I am running the following operations:
val part1 = mapOutput.toArray.flatten    // pull the RDD's contents to the driver and flatten locally
val part2 = sc.parallelize(part1)        // redistribute the flattened pairs as a new RDD
val reduceOutput = part2.combineByKey(
  (v) => (v, 1),
  (acc: (Double, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Double, Int), acc2: (Double, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
)
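
For reference, the combineByKey step computes a per-key (sum, count). A minimal, self-contained version of just that step (the keys and values here are made up) works as expected:

// toy pairs shaped like the elements described below:
// key = tuple of 4 strings, value = Double
val sample = sc.parallelize(Seq(
  (("a", "b", "c", "d"), 1.0),
  (("a", "b", "c", "d"), 2.0),
  (("e", "f", "g", "h"), 3.0)
))
val sumCount = sample.combineByKey(
  (v: Double) => (v, 1),
  (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),
  (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2)
)
sumCount.collect().foreach(println)
// prints ((a,b,c,d),(3.0,2)) and ((e,f,g,h),(3.0,1))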

Here mapOutput is the output of a map function; each element is a tuple (x, y), where x is a tuple of four strings and y is a Double value. When I used Float instead of Double, it worked on a small data set but failed on the large file.

I changed it to Double, and on the large file the job runs fine until mapOutput is produced. But as soon as I include the remaining part, it fails.
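
Should I be avoiding the collect-and-redistribute step entirely? Something like the sketch below is what I have in mind (this assumes mapOutput is an RDD whose elements are collections of these (key, value) pairs, which is what toArray.flatten suggests; I have not verified it on the large file):

// hypothetical: flatten on the cluster with flatMap instead of
// pulling everything to the driver via toArray and re-parallelizing
val part2Alt = mapOutput.flatMap(identity)
val reduceOutputAlt = part2Alt.combineByKey(
  (v: Double) => (v, 1),
  (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),
  (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2)
)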

Can someone please help me understand where I am going wrong?

Thank you for your time.
-- 
Regards,
Haripriya Ayyalasomayajula
Graduate Student
Department of Computer Science
University of Houston
Contact : 650-796-7112