You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Ashish Rangole <ar...@gmail.com> on 2013/10/03 22:10:55 UTC

Troubleshooting and how to interpret the logs

Hi,

Trying to figure out what does it mean when the application (driver
program) logs end with the the lines like the ones below. This is with the
application running on Spark 0.8.0 on EC2.

Any help will be greatly appreciated.

Thanks!


13/10/03 16:17:33 INFO cluster.ClusterTaskSetManager: Finished TID 1744 in
1183507 ms on ip-10-232-80-206.ec2.internal (progress: 46/60)
13/10/03 16:17:33 INFO scheduler.DAGScheduler: Completed ResultTask(4, 20)
13/10/03 16:17:57 WARN storage.BlockManagerMasterActor: Removing
BlockManager BlockManagerId(1, ip-10-170-16-83.ec2.internal, 46907, 0) with
no recent
heart beats: 68685ms exceeds 45000ms
13/10/03 16:18:57 WARN storage.BlockManagerMasterActor: Removing
BlockManager BlockManagerId(0, ip-10-232-27-176.ec2.internal, 55654, 0)
with no recent
 heart beats: 95376ms exceeds 45000ms
13/10/03 16:19:13 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
Registering block manager ip-10-232-27-176.ec2.internal:55654 with 19.3 GB
RAM
13/10/03 16:21:38 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
Added rdd_7_293 in memory on ip-10-232-27-176.ec2.internal:55654 (size:
1975.7
 KB, free: 19.2 GB)
13/10/03 16:21:38 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
Added rdd_7_293 in memory on ip-10-232-27-176.ec2.internal:55654 (size:
1975.7
 KB, free: 19.2 GB)
13/10/03 16:22:57 WARN storage.BlockManagerMasterActor: Removing
BlockManager BlockManagerId(0, ip-10-232-27-176.ec2.internal, 55654, 0)
with no recent
 heart beats: 78761ms exceeds 45000ms

Re: Troubleshooting and how to interpret the logs

Posted by Ashish Rangole <ar...@gmail.com>.
Hi Matei,

Thanks! Will look out for long GC pauses.
On Oct 3, 2013 3:00 PM, "Matei Zaharia" <ma...@gmail.com> wrote:

> Hi Ashish,
>
> Those "removing" messages mean that the node in question didn't
> communicate with your application for 45 seconds. Most likely the executor
> process on the node died, though there's also a chance that it was doing a
> super-long garbage collection or that there was a network problem. Look at
> the logs on the node to see if it crashed due to an exception or the OS
> killed it for some reason. For garbage collection debugging, you can turn
> on -XX:+PrintGCDetails in SPARK_JAVA_OPTS to log the lengths of GC pauses.
>
> Matei
>
> On Oct 3, 2013, at 1:10 PM, Ashish Rangole <ar...@gmail.com> wrote:
>
> > Hi,
> >
> > Trying to figure out what does it mean when the application (driver
> program) logs end with the the lines like the ones below. This is with the
> application running on Spark 0.8.0 on EC2.
> >
> > Any help will be greatly appreciated.
> >
> > Thanks!
> >
> >
> > 13/10/03 16:17:33 INFO cluster.ClusterTaskSetManager: Finished TID 1744
> in 1183507 ms on ip-10-232-80-206.ec2.internal (progress: 46/60)
> > 13/10/03 16:17:33 INFO scheduler.DAGScheduler: Completed ResultTask(4,
> 20)
> > 13/10/03 16:17:57 WARN storage.BlockManagerMasterActor: Removing
> BlockManager BlockManagerId(1, ip-10-170-16-83.ec2.internal, 46907, 0) with
> no recent
> > heart beats: 68685ms exceeds 45000ms
> > 13/10/03 16:18:57 WARN storage.BlockManagerMasterActor: Removing
> BlockManager BlockManagerId(0, ip-10-232-27-176.ec2.internal, 55654, 0)
> with no recent
> >  heart beats: 95376ms exceeds 45000ms
> > 13/10/03 16:19:13 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
> Registering block manager ip-10-232-27-176.ec2.internal:55654 with 19.3 GB
> RAM
> > 13/10/03 16:21:38 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
> Added rdd_7_293 in memory on ip-10-232-27-176.ec2.internal:55654 (size:
> 1975.7
> >  KB, free: 19.2 GB)
> > 13/10/03 16:21:38 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
> Added rdd_7_293 in memory on ip-10-232-27-176.ec2.internal:55654 (size:
> 1975.7
> >  KB, free: 19.2 GB)
> > 13/10/03 16:22:57 WARN storage.BlockManagerMasterActor: Removing
> BlockManager BlockManagerId(0, ip-10-232-27-176.ec2.internal, 55654, 0)
> with no recent
> >  heart beats: 78761ms exceeds 45000ms
> >
>
>

Re: Troubleshooting and how to interpret the logs

Posted by Matei Zaharia <ma...@gmail.com>.
Hi Ashish,

Those "removing" messages mean that the node in question didn't communicate with your application for 45 seconds. Most likely the executor process on the node died, though there's also a chance that it was doing a super-long garbage collection or that there was a network problem. Look at the logs on the node to see if it crashed due to an exception or the OS killed it for some reason. For garbage collection debugging, you can turn on -XX:+PrintGCDetails in SPARK_JAVA_OPTS to log the lengths of GC pauses.

Matei

On Oct 3, 2013, at 1:10 PM, Ashish Rangole <ar...@gmail.com> wrote:

> Hi,
> 
> Trying to figure out what does it mean when the application (driver program) logs end with the the lines like the ones below. This is with the application running on Spark 0.8.0 on EC2.
> 
> Any help will be greatly appreciated.
> 
> Thanks!
> 
> 
> 13/10/03 16:17:33 INFO cluster.ClusterTaskSetManager: Finished TID 1744 in 1183507 ms on ip-10-232-80-206.ec2.internal (progress: 46/60)
> 13/10/03 16:17:33 INFO scheduler.DAGScheduler: Completed ResultTask(4, 20)
> 13/10/03 16:17:57 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, ip-10-170-16-83.ec2.internal, 46907, 0) with no recent
> heart beats: 68685ms exceeds 45000ms
> 13/10/03 16:18:57 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-10-232-27-176.ec2.internal, 55654, 0) with no recent
>  heart beats: 95376ms exceeds 45000ms
> 13/10/03 16:19:13 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager ip-10-232-27-176.ec2.internal:55654 with 19.3 GB RAM
> 13/10/03 16:21:38 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Added rdd_7_293 in memory on ip-10-232-27-176.ec2.internal:55654 (size: 1975.7
>  KB, free: 19.2 GB)
> 13/10/03 16:21:38 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Added rdd_7_293 in memory on ip-10-232-27-176.ec2.internal:55654 (size: 1975.7
>  KB, free: 19.2 GB)
> 13/10/03 16:22:57 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-10-232-27-176.ec2.internal, 55654, 0) with no recent
>  heart beats: 78761ms exceeds 45000ms
>