Posted to user@spark.apache.org by amit karmakar <am...@gmail.com> on 2014/04/23 18:23:08 UTC
Spark hangs when I call parallelize + count on an ArrayList
having 40k elements
Spark hangs after I perform the following operations:
ArrayList<byte[]> bytesList = new ArrayList<byte[]>();
/*
add 40k entries to bytesList
*/
JavaRDD<byte[]> rdd = sparkContext.parallelize(bytesList);
System.out.println("Count=" + rdd.count());
If I add just one entry, it works.

It also works if I modify
JavaRDD<byte[]> rdd = sparkContext.parallelize(bytesList);
to
JavaRDD<byte[]> rdd = sparkContext.parallelize(bytesList, 20);

There is nothing in the logs that helps explain the reason.
What could be the reason for this?
Regards,
Amit Kumar Karmakar
Re: Spark hangs when I call parallelize + count on an
ArrayList having 40k elements
Posted by Xiangrui Meng <me...@gmail.com>.
How big is each entry, and how much memory do you have on each
executor? You generated all the data on the driver, and
sc.parallelize(bytesList) will send the entire dataset to a single
executor, so you may run into I/O or memory issues. If the entries
are generated, you should instead create a simple seed RDD with
sc.parallelize(0 until 20, 20) and call mapPartitions to generate
them in parallel. -Xiangrui
On Wed, Apr 23, 2014 at 9:23 AM, amit karmakar
<am...@gmail.com> wrote:
> [quoted original message trimmed]
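Xiangrui's suggestion (a tiny seed RDD plus mapPartitions) can be sketched in the Java API as follows. This is a minimal illustration, not code from the thread: the class name, the local[*] master, the 1 KB placeholder payload, and the 2000-entries-per-partition split are all assumptions, and the lambda form assumes a Spark 2.x-or-later Java API where mapPartitions returns an Iterator (earlier versions expected an Iterable).

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class GenerateOnExecutors {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("generate-on-executors")
                .setMaster("local[*]"); // local master only for illustration
        JavaSparkContext sc = new JavaSparkContext(conf);

        final int numPartitions = 20;
        final int entriesPerPartition = 2000; // 20 * 2000 = 40k entries total

        // Tiny seed RDD: one integer per partition, instead of shipping
        // 40k byte[] entries from the driver as one big serialized task.
        List<Integer> seeds = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            seeds.add(i);
        }
        JavaRDD<Integer> seedRdd = sc.parallelize(seeds, numPartitions);

        // Generate the byte arrays on the executors, in parallel.
        JavaRDD<byte[]> rdd = seedRdd.mapPartitions(it -> {
            List<byte[]> out = new ArrayList<>();
            while (it.hasNext()) {
                it.next(); // seed value unused; each partition builds its share
                for (int j = 0; j < entriesPerPartition; j++) {
                    out.add(new byte[1024]); // placeholder payload
                }
            }
            return out.iterator();
        });

        System.out.println("Count=" + rdd.count());
        sc.stop();
    }
}
```

The key design point is that only the small list of seed integers crosses the driver-to-executor boundary; the bulky byte arrays are created where they will be processed, which avoids both the single-task serialization cost and the default partitioning that bit the original code.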