Posted to user@spark.apache.org by Mohammed Guller <mo...@glassbeam.com> on 2014/07/02 02:57:09 UTC

Lost TID: Loss was due to fetch failure from BlockManagerId

I am running Spark 1.0 on a 4-node standalone Spark cluster (1 master + 3 workers). Our app fetches data from Cassandra and runs a basic filter, map, and countByKey on that data. I have run into a strange problem. Even when the number of rows in Cassandra is just 1M, the Spark job seems to go into an infinite loop and runs for hours. With a small amount of data (fewer than 100 rows) the job does finish, but it takes almost 30-40 seconds and we frequently see the warnings shown below. If we run the same application on single-node Spark (--master local[4]), we don't see these warnings and the job finishes in 6-7 seconds. Any idea what could cause these problems when we run our application on the standalone 4-node cluster?

14/06/30 19:30:16 WARN TaskSetManager: Lost TID 25036 (task 6.0:90)
14/06/30 19:30:16 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:18 WARN TaskSetManager: Lost TID 25310 (task 6.1:0)
14/06/30 19:30:18 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:19 WARN TaskSetManager: Lost TID 25582 (task 6.2:0)
14/06/30 19:30:19 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:21 WARN TaskSetManager: Lost TID 25882 (task 6.3:34)
14/06/30 19:30:21 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(0, 192.168.222.142, 39342, 0)
14/06/30 19:30:22 WARN TaskSetManager: Lost TID 26152 (task 6.4:0)
14/06/30 19:30:22 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(0, 192.168.222.142, 39342, 0)
14/06/30 19:30:23 WARN TaskSetManager: Lost TID 26427 (task 6.5:4)
14/06/30 19:30:23 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:25 WARN TaskSetManager: Lost TID 26690 (task 6.6:0)
14/06/30 19:30:25 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:26 WARN TaskSetManager: Lost TID 26959 (task 6.7:0)
14/06/30 19:30:26 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:28 WARN TaskSetManager: Lost TID 27449 (task 6.8:218)
14/06/30 19:30:28 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:30 WARN TaskSetManager: Lost TID 27718 (task 6.9:0)
14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:31 WARN TaskSetManager: Lost TID 27991 (task 6.10:1)
14/06/30 19:30:31 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:33 WARN TaskSetManager: Lost TID 28265 (task 6.11:0)
14/06/30 19:30:33 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:34 WARN TaskSetManager: Lost TID 28550 (task 6.12:0)
14/06/30 19:30:34 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:36 WARN TaskSetManager: Lost TID 28822 (task 6.13:0)
14/06/30 19:30:36 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:37 WARN TaskSetManager: Lost TID 29093 (task 6.14:0)
14/06/30 19:30:37 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:39 WARN TaskSetManager: Lost TID 29366 (task 6.15:0)
14/06/30 19:30:39 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:40 WARN TaskSetManager: Lost TID 29648 (task 6.16:9)
14/06/30 19:30:40 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:42 WARN TaskSetManager: Lost TID 29924 (task 6.17:0)
14/06/30 19:30:42 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:43 WARN TaskSetManager: Lost TID 30193 (task 6.18:0)
14/06/30 19:30:43 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:45 WARN TaskSetManager: Lost TID 30559 (task 6.19:98)
14/06/30 19:30:45 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(1, 192.168.222.152, 45896, 0)
14/06/30 19:30:46 WARN TaskSetManager: Lost TID 30826 (task 6.20:0)
14/06/30 19:30:46 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(1, 192.168.222.152, 45896, 0)
14/06/30 19:30:48 WARN TaskSetManager: Lost TID 31098 (task 6.21:0)
14/06/30 19:30:48 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(1, 192.168.222.152, 45896, 0)
14/06/30 19:30:50 WARN TaskSetManager: Lost TID 31370 (task 6.22:0)
14/06/30 19:30:50 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(1, 192.168.222.152, 45896, 0)
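For reference, here is a minimal sketch of the kind of job that produces this, assuming the DataStax spark-cassandra-connector (the keyspace, table, column names, and seed host below are placeholders, not our real schema):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._            // pair-RDD functions (countByKey)
import com.datastax.spark.connector._             // adds sc.cassandraTable(...)

object CassandraCountByKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CassandraCountByKey")
      .set("spark.cassandra.connection.host", "192.168.222.142") // placeholder seed node
    val sc = new SparkContext(conf)

    // filter -> map -> countByKey; countByKey triggers a shuffle, and the
    // shuffle is exactly where the "fetch failure" warnings above come from
    val counts = sc.cassandraTable("ks", "events")   // placeholder keyspace/table
      .filter(_.getInt("status") == 200)             // basic filter (placeholder column)
      .map(row => (row.getString("host"), 1L))       // key each row (placeholder column)
      .countByKey()                                  // action; returns a Map at the driver

    counts.foreach { case (host, n) => println(s"$host -> $n") }
    sc.stop()
  }
}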

Thanks.

Mohammed


RE: Lost TID: Loss was due to fetch failure from BlockManagerId

Posted by Mohammed Guller <mo...@glassbeam.com>.
Thanks, guys.

It turned out to be a firewall issue. All the worker nodes had iptables enabled, with rules that only allowed connections from the workers to the master node on the standard master port (7077). Once I opened up the other ports, it started working.
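For anyone who hits the same symptom: shuffle fetches go directly between executors on ephemeral ports, so allowing only the master port is not enough. Roughly the kind of rule that unblocked it for us (the subnet is an assumption based on the 192.168.222.x addresses in the logs):

# Run on every node; tighten as needed for your network.
iptables -I INPUT -p tcp -s 192.168.222.0/24 -j ACCEPT
# Persisting the rule is distro-specific; this is the RHEL/CentOS form.
service iptables save

A tighter setup would allow only the other cluster nodes' addresses instead of the whole subnet.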

Mohammed

From: Mayur Rustagi [mailto:mayur.rustagi@gmail.com]
Sent: Tuesday, July 1, 2014 10:45 PM
To: user@spark.apache.org
Subject: Re: Lost TID: Loss was due to fetch failure from BlockManagerId

It could be because you are out of memory on the worker nodes and blocks are not getting registered.
An older issue in 0.6.0 was dead nodes causing task loss and then resubmission of the data in an infinite loop... It was fixed in 0.7.0, though.
Are you seeing a crash in this log, or in the worker log at 192.168.222.164, or on any of the other machines named in the fetch failures?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Wed, Jul 2, 2014 at 7:51 AM, Yana Kadiyska <ya...@gmail.com> wrote:

[earlier messages in the thread snipped]


Re: Lost TID: Loss was due to fetch failure from BlockManagerId

Posted by Mayur Rustagi <ma...@gmail.com>.
It could be because you are out of memory on the worker nodes and blocks
are not getting registered.
An older issue in 0.6.0 was dead nodes causing task loss and then
resubmission of the data in an infinite loop... It was fixed in 0.7.0, though.
Are you seeing a crash in this log, or in the worker log at 192.168.222.164,
or on any of the other machines named in the fetch failures?
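If memory does turn out to be the problem, a quick test is to raise the
executor heap, e.g. with spark-submit (the master host, class name, jar
path, and the 2g figure below are all placeholders):

# Give each executor more heap; adjust to what the worker machines can spare.
spark-submit --master spark://master-host:7077 \
  --executor-memory 2g \
  --class com.example.CassandraCountByKey \
  app.jar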

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Wed, Jul 2, 2014 at 7:51 AM, Yana Kadiyska <ya...@gmail.com>
wrote:

> [earlier messages in the thread snipped]

Re: Lost TID: Loss was due to fetch failure from BlockManagerId

Posted by Yana Kadiyska <ya...@gmail.com>.
A lot of things can get funny when you run distributed as opposed to
local -- e.g. some jar not making it over. Do you see anything of
interest in the logs on the executor machines? I'm guessing
192.168.222.152/192.168.222.164. From
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
it seems the warning message is logged after the task fails -- but I
wonder if you might see something more useful about why it failed to
begin with. As an example, we've had cases in HDFS where a small
example would work but on a larger one we'd hit a bad file. The
executor log is usually pretty explicit about what happened...
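In standalone mode each executor's log ends up under the worker's work
directory (it is also linked from the worker web UI, port 8081 by
default) -- roughly:

# On each worker node; the app and executor IDs below are examples.
ls $SPARK_HOME/work/
#   app-20140630193016-0000/2/stderr   <- executor log with the real exception
tail -n 100 $SPARK_HOME/work/app-*/*/stderr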

On Tue, Jul 1, 2014 at 8:57 PM, Mohammed Guller <mo...@glassbeam.com> wrote:
> [original message and log lines snipped; see the top of the thread]