Posted to dev@spark.apache.org by Jacek Laskowski <ja...@japila.pl> on 2015/12/10 09:22:18 UTC

A bug in Spark standalone? Worker registration and deregistration

Hi,

While toying with Spark Standalone I've noticed the following messages
in the logs of the master:

INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
INFO Master: localhost:59920 got disassociated, removing it.
...
WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919 on 192.168.1.6:59919

Why does the message "WARN Master: Removing
worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
60 seconds" appear when the worker should've been gone already (as
pointed out in "INFO Master: localhost:59920 got disassociated,
removing it.")?

Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920?
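
One way to check this hypothesis against a master log is to compare the addresses in the "Registering worker" lines with the addresses in the "got disassociated" lines. The snippet below is only a sketch: the regular expressions are assumptions based on the message formats quoted above, and the function name is made up for illustration.

```python
import re

# Log lines copied from the message above (wrapped lines joined).
MASTER_LOG = """\
INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
INFO Master: localhost:59920 got disassociated, removing it.
WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919 on 192.168.1.6:59919
"""

# Assumed patterns, derived only from the log lines quoted in this thread.
REGISTERED = re.compile(r"Registering worker (\S+:\d+) with")
DISASSOCIATED = re.compile(r"Master: (\S+:\d+) got disassociated")


def unmatched_disassociations(log_text):
    """Return disassociated addresses that never appear in a registration line."""
    registered = set(REGISTERED.findall(log_text))
    disassociated = DISASSOCIATED.findall(log_text)
    return [addr for addr in disassociated if addr not in registered]


if __name__ == "__main__":
    # For the log above, the only disassociated address is one
    # that was never registered.
    print(unmatched_disassociations(MASTER_LOG))
```

For the log above this reports localhost:59920, which is consistent with the idea that the disassociation event and the registration refer to different addresses.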

I started the master using "./sbin/start-master.sh -h localhost" and the
workers using "./sbin/start-slave.sh spark://localhost:7077".

p.s. Are such questions appropriate for this mailing list?

Pozdrawiam,
Jacek

--
Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
http://blog.jaceklaskowski.pl
Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski
Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: A bug in Spark standalone? Worker registration and deregistration

Posted by Bryan Cutler <cu...@gmail.com>.
Hi Jacek,

I also recently noticed those messages, and some others, and am wondering
if there is an issue. I am also seeing the following when I have event
logging enabled: the first application is submitted and executes fine, but
all subsequent attempts produce an error log that the master fails to load.
I'm not sure whether this is related to the messages you see, but I would also
like to know if others can reproduce it. Here are the logs:

MASTER
15/12/09 21:19:10 INFO Master: Registering app Spark Pi
15/12/09 21:19:10 INFO Master: Registered app Spark Pi with ID app-20151209211910-0001
15/12/09 21:19:10 INFO Master: Launching executor app-20151209211910-0001/0 on worker worker-20151209211739-***
15/12/09 21:19:14 INFO Master: Received unregister request from application app-20151209211910-0001
15/12/09 21:19:14 INFO Master: Removing app app-20151209211910-0001
15/12/09 21:19:14 WARN Master: Application Spark Pi is still in progress, it may be terminated abnormally.
15/12/09 21:19:14 WARN Master: No event logs found for application Spark Pi in file:/home/bryan/git/spark/logs/.
15/12/09 21:19:14 INFO Master: localhost.localdomain:54174 got disassociated, removing it.
15/12/09 21:19:14 WARN Master: Got status update for unknown executor app-20151209211910-0001/0
15/12/09 21:21:59 WARN Master: Got status update for unknown executor app-20151209211830-0000/0
15/12/09 21:22:00 INFO Master: localhost.localdomain:54163 got disassociated, removing it.

WORKER
15/12/09 21:19:14 INFO Worker: Asked to kill executor app-20151209211910-0001/0
15/12/09 21:19:14 INFO ExecutorRunner: Runner thread for executor app-20151209211910-0001/0 interrupted
15/12/09 21:19:14 INFO ExecutorRunner: Killing process!
15/12/09 21:19:14 ERROR FileAppender: Error writing stream to file /home/bryan/git/spark/work/app-20151209211910-0001/0/stderr
java.io.IOException: Stream closed
    at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1730)
    at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
15/12/09 21:19:14 INFO Worker: Executor app-20151209211910-0001/0 finished with state KILLED exitStatus 143
15/12/09 21:19:14 INFO Worker: Cleaning up local directories for application app-20151209211910-0001
15/12/09 21:19:14 INFO ExternalShuffleBlockResolver: Application app-20151209211910-0001 removed, cleanupLocalDirs = true



Re: A bug in Spark standalone? Worker registration and deregistration

Posted by Jacek Laskowski <ja...@japila.pl>.
On Thu, Dec 10, 2015 at 8:10 PM, Shixiong Zhu <zs...@gmail.com> wrote:
> Jacek, could you create a JIRA for it? I just reproduced it. It's a bug in
> how Master handles the Worker disconnection.

Hi Shixiong,

I'm saved. Kept thinking I'm lost in the sources and see ghosts :-)

https://issues.apache.org/jira/browse/SPARK-12267

Pozdrawiam,
Jacek



Re: A bug in Spark standalone? Worker registration and deregistration

Posted by Shixiong Zhu <zs...@gmail.com>.
Jacek, could you create a JIRA for it? I just reproduced it. It's a bug in
how Master handles the Worker disconnection.
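
To make the symptom in this thread concrete, here is a toy model of how a disconnect notification for one address can leave a worker registered under another address until the heartbeat timeout fires. It is a hypothetical sketch, not Spark's Master code: the class, its methods, and its two dictionaries are invented for illustration, and only the addresses, the worker id, and the 60-second timeout come from the logs quoted earlier in the thread.

```python
class ToyMaster:
    """Hypothetical model of a master's bookkeeping; NOT Spark's Master class."""

    def __init__(self, timeout_s=60):
        self.timeout_s = timeout_s
        self.workers = {}     # worker id -> time of last heartbeat (seconds)
        self.by_address = {}  # registered address -> worker id

    def register(self, worker_id, address, now):
        """Record a worker under the address it registered with."""
        self.workers[worker_id] = now
        self.by_address[address] = worker_id

    def on_disconnected(self, address):
        """Remove the worker registered under `address`; return its id or None."""
        worker_id = self.by_address.pop(address, None)
        if worker_id is not None:
            del self.workers[worker_id]
        return worker_id

    def expire_dead_workers(self, now):
        """Remove workers whose last heartbeat is older than the timeout."""
        dead = [w for w, seen in self.workers.items()
                if now - seen > self.timeout_s]
        for worker_id in dead:
            del self.workers[worker_id]
            for addr, wid in list(self.by_address.items()):
                if wid == worker_id:
                    del self.by_address[addr]
        return dead


if __name__ == "__main__":
    master = ToyMaster(timeout_s=60)
    master.register("worker-20151210090708-192.168.1.6-59919",
                    "192.168.1.6:59919", now=0)
    # The disconnect arrives for a different address, so nothing is removed.
    print(master.on_disconnected("localhost:59920"))
    # Only the heartbeat timeout finally removes the worker.
    print(master.expire_dead_workers(now=61))
```

In this model the first call finds no worker for localhost:59920 and returns None, and the worker is only dropped by the later timeout check, which mirrors the order of the INFO and WARN messages in the original report.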

Best Regards,
Shixiong Zhu


Re: A bug in Spark standalone? Worker registration and deregistration

Posted by Jacek Laskowski <ja...@japila.pl>.
Hi,

I'm on yesterday's master HEAD.

Pozdrawiam,
Jacek



On Thu, Dec 10, 2015 at 9:50 AM, Sasaki Kai <sa...@treasure-data.com> wrote:
> Hi, Jacek
>
> What version of Spark do you use?
> I started the sbin/start-master.sh script as you did, against master HEAD, but I see no warning log like the one you pasted.
> While you can specify the hostname with the -h option, you can also omit it. The master's name can then be set automatically
> to the name returned by the `hostname` command. You can try that as well.
>
> Kai Sasaki
