Posted to user@spark.apache.org by yxzhao <yx...@ualr.edu> on 2014/06/21 21:55:52 UTC

Spark Processing Large Data Stuck

I ran the PageRank example on a large data set, 5 GB in size, using 48
machines. The job got stuck at 14/05/20 21:32:17, as the attached log shows.
It stayed stuck for more than 10 hours before I finally killed it, but I
could not find any information explaining why it hung. Any suggestions?
Thanks.

Spark_OK_48_pagerank.log
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n8075/Spark_OK_48_pagerank.log>  




Re: Spark Processing Large Data Stuck

Posted by Peng Cheng <pc...@uow.edu.au>.
The JVM will quit after it starts spending most of its time (about 95%) in
GC, but usually you have to wait a long time before that happens,
particularly if your job is already running at a massive scale.

Since it is hard to profile a job while it is running, it may be easier to
debug if you create a lot of partitions (so you can watch the progress bar
move), as in the sketch below, and post the last log lines before it froze.
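
A minimal sketch of what I mean, assuming the input is a plain-text edge list
on HDFS; the path and the partition count of 480 are placeholders, not tuned
values:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("PageRankDebug"))

// Ask for many small partitions when reading the input, so each task is
// short and the stage progress bar in the web UI advances visibly instead
// of sitting on a handful of huge tasks.
val lines = sc.textFile("hdfs:///path/to/edges.txt", 480)

// Or reshuffle an RDD you already have into more partitions:
val moreParts = lines.repartition(480)

With many partitions, the last completed task number in the log also narrows
down where the job stalled.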



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Processing-Large-Data-Stuck-tp8075p8086.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Processing Large Data Stuck

Posted by yxzhao <yx...@ualr.edu>.
Thanks, Krishna. I use a small cluster; each compute node has 16 GB of RAM
and 8 CPU cores at 2.66 GHz.

Re: Spark Processing Large Data Stuck

Posted by Krishna Sankar <ks...@gmail.com>.
Hi,

   - I have seen similar behavior before. As far as I can tell, the root
   cause is an out-of-memory condition; I verified this by monitoring memory
   usage.
      - I had a 30 GB file and was running on a single machine with 16 GB,
      so I knew it would fail.
      - But instead of raising an exception, some part of the system keeps
      on churning.
   - My suggestion is to review the JVM memory settings (try larger values),
   make sure the settings are propagated to all the workers, and monitor
   memory usage while the job is running (see the sketch after this list).
   - Another angle is to split the file and retry with progressively larger
   sizes.
   - I also see symptoms of failed connections. While I can't say for certain
   that this is the problem, check your topology and network connectivity.
   - Out of curiosity, what kind of machines are you running? Bare metal?
   EC2? How much memory? 64-bit OS?
      - I assume these are big machines, so the resources themselves may not
      be the problem.
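
A minimal sketch of the memory settings I mean, assuming a Spark 1.0-era
standalone cluster launched through spark-submit (which supplies the master
URL); the app name and the values below are illustrative placeholders, not
tuned recommendations for your nodes:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("PageRank")
  // Per-executor JVM heap. Because it is set on the SparkConf, the value is
  // passed to the cluster when executors are requested, so it reaches every
  // worker without editing each node's config.
  .set("spark.executor.memory", "12g")
  // Fraction of that heap reserved for cached RDDs (default 0.6); lowering
  // it leaves more room for shuffle and join buffers.
  .set("spark.storage.memoryFraction", "0.4")

val sc = new SparkContext(conf)

// The driver's own heap cannot be changed here after its JVM has started;
// pass --driver-memory to spark-submit instead, and then watch memory usage
// in the application web UI while the job runs.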

Cheers
<k/>

