Posted to user@spark.apache.org by amit karmakar <am...@gmail.com> on 2014/04/23 18:23:08 UTC

Spark hangs when I call parallelize + count on an ArrayList having 40k elements

Spark hangs after I perform the following operations:


ArrayList<byte[]> bytesList = new ArrayList<byte[]>();
/*
   add 40k entries to bytesList
*/

JavaRDD<byte[]> rdd = sparkContext.parallelize(bytesList);
System.out.println("Count=" + rdd.count());


If I add just one entry, it works.

It also works if I change
JavaRDD<byte[]> rdd = sparkContext.parallelize(bytesList);
to
JavaRDD<byte[]> rdd = sparkContext.parallelize(bytesList, 20);

There is nothing in the logs that helps explain the cause.

What could be the reason for this?


Regards,
Amit Kumar Karmakar

Re: Spark hangs when I call parallelize + count on an ArrayList having 40k elements

Posted by Xiangrui Meng <me...@gmail.com>.
How big is each entry, and how much memory do you have on each
executor? You generated all of the data on the driver, and
sc.parallelize(bytesList) will ship the entire dataset from the driver
to the executors, so you may be running into I/O or memory issues. If
the entries are generated, you should instead create a simple RDD with
sc.parallelize(0 until 20, 20) and call mapPartitions to generate them
in parallel. -Xiangrui
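
[Editor's note: a minimal sketch of the pattern Xiangrui describes, written
against the Java API used in the original question. It assumes Spark 2.x or
later, where the Java mapPartitions function returns an Iterator (on 1.x it
returned an Iterable). The makeEntry helper and the 20 x 2000 split are
hypothetical stand-ins for however the 40k entries are actually produced.]

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class GenerateOnExecutors {
    // Hypothetical stand-in for however each of the 40k entries is produced.
    private static byte[] makeEntry(int partition, int index) {
        return new byte[] { (byte) partition, (byte) index };
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("generate-on-executors"));

        int numPartitions = 20;
        int perPartition = 2000; // 20 * 2000 = 40k entries in total

        // Tiny seed RDD: one integer per partition. Only these 20 ints are
        // shipped from the driver, not the 40k byte arrays.
        List<Integer> seeds = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            seeds.add(i);
        }
        JavaRDD<Integer> seedRdd = sc.parallelize(seeds, numPartitions);

        // Generate the actual entries on the executors, in parallel.
        JavaRDD<byte[]> rdd = seedRdd.mapPartitions(it -> {
            List<byte[]> out = new ArrayList<>();
            while (it.hasNext()) {
                int seed = it.next();
                for (int j = 0; j < perPartition; j++) {
                    out.add(makeEntry(seed, j));
                }
            }
            return out.iterator();
        });

        System.out.println("Count=" + rdd.count());
        sc.stop();
    }
}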
