Posted to user@spark.apache.org by SK <sk...@gmail.com> on 2014/11/21 03:57:46 UTC

Spark failing when loading large amount of data

Hi,

I am using sc.textFile("shared_dir/*") to load all the files in a directory
on a shared partition. The total size of the files in this directory is 1.2
TB. We have a 16-node cluster with 3 TB of memory (1 node is the driver, 15
nodes are workers). But the loading fails after around 1 TB of data has been
read (in the mapPartitions stage); basically, there is no progress in
mapPartitions after 1 TB of input. The cluster seems to have sufficient
memory, so I am not sure why the program gets stuck.
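
For reference, the load is essentially the following (Scala, in spark-shell);
the cache() and count() at the end are only stand-ins for what the job does
after loading:

    // Load every file under the shared directory as a single RDD of lines.
    // "shared_dir/*" is the glob mentioned above; cache() and count() are
    // illustrative stand-ins for the rest of the job.
    val lines = sc.textFile("shared_dir/*")
    lines.cache()    // ask Spark to keep the partitions in memory
    lines.count()    // materialize the RDD; this is the stage that stalls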

1.2 TB of data divided across 15 worker nodes would require each node to
hold about 80 GB. Every node in the cluster is allocated around 170 GB of
memory. According to the Spark documentation, the default storage fraction
for RDDs is 60% of the allocated memory. I have increased that to 0.8 (by
setting --conf spark.storage.memoryFraction=0.8), so each node should have
around 136 GB of memory for storing RDDs. So I am not sure why the program
is failing in the mapPartitions stage, where it seems to be loading the
data.
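
For completeness, the programmatic equivalent of that setting looks roughly
like this (the application name is a placeholder; the storage fraction is
the value mentioned above):

    import org.apache.spark.{SparkConf, SparkContext}

    // Same effect as passing --conf spark.storage.memoryFraction=0.8 to spark-submit.
    val conf = new SparkConf()
      .setAppName("load-shared-dir")                 // placeholder app name
      .set("spark.storage.memoryFraction", "0.8")    // raise RDD storage from the 0.6 default
    val sc = new SparkContext(conf)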

I don't have a good understanding of Spark internals and would appreciate
any help in fixing this issue.

thanks
   



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-failing-when-loading-large-amount-of-data-tp19441.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Spark failing when loading large amount of data

Posted by Preeti Khurana <Pr...@guavus.com>.
The on-disk size does not accurately reflect the amount of memory needed to
store that data. Memory usage depends a lot on the data structures you are
using: the objects that hold the data carry their own overhead, so the
in-memory footprint can be considerably larger than the input.
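
One way to see that overhead directly is the SizeEstimator utility mentioned
in the Spark tuning guide (the sample string below is arbitrary):

    import org.apache.spark.util.SizeEstimator

    // A 15-character line occupies far more than 15 bytes as a JVM String.
    val line = "a,b,c,d,e,f,g,h"
    println(SizeEstimator.estimate(line))   // estimated in-memory size in bytes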

Try with a reduced input size or more memory.
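
As a minimal sketch of the first suggestion (the sub-glob is hypothetical;
the idea is just to cache a fraction of the files and check the Storage tab
of the web UI for the real footprint):

    // Load only part of the input and cache it to measure its in-memory size.
    // "shared_dir/part-000*" is a hypothetical subset of the full directory.
    val subset = sc.textFile("shared_dir/part-000*")
    subset.cache()
    subset.count()   // materializes the subset; the Storage tab then shows its cached size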



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org