Posted to dev@spark.apache.org by Niranda Perera <ni...@gmail.com> on 2016/04/12 09:16:19 UTC

Possible deadlock in registering applications in the recovery mode

Hi all,

I have encountered a small issue in the standalone recovery mode.

Let's say there was an application A running in the cluster. Due to some
issue, the entire cluster, together with application A, goes down.

Later on, the cluster comes back online and the master goes into the
'recovering' mode, because it sees from the persistence engine that some
apps, workers and drivers were already in the cluster. While the recovery
is in progress, the application comes back online, but now it has a
different ID, let's say B.

But then, as per the master's application registration logic, this
application B will NOT be added to 'waitingApps'; instead the message
"Attempted to re-register application at same address" is logged. [1]

  private def registerApplication(app: ApplicationInfo): Unit = {
    val appAddress = app.driver.address
    if (addressToApp.contains(appAddress)) {
      logInfo("Attempted to re-register application at same address: " +
appAddress)
      return
    }


The problem here is that the master is trying to recover application A,
which is not there anymore. Therefore, after the recovery process, app A
will be dropped. However, app A's successor, app B, was also omitted from
the 'waitingApps' list because it registered from the same address that
app A had used previously.

This creates a deadlock in the cluster: neither app A nor app B is
available in the cluster.

When the master is in the RECOVERING mode, shouldn't it first add all the
registering apps to a list, and then, after the recovery is completed
(once the unsuccessful recoveries are removed), deploy the apps that are
new?

This would sort out this deadlock, IMO.
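
Something along these lines is what I have in mind (just a rough sketch to
illustrate the idea; the buffer 'appsRegisteredDuringRecovery' is a name I
made up, and I'm only outlining how it could hook into the existing
registerApplication and completeRecovery methods):

  import scala.collection.mutable.ArrayBuffer

  // Apps that try to register while the master is still recovering are
  // buffered instead of being rejected against stale addressToApp entries.
  private val appsRegisteredDuringRecovery = new ArrayBuffer[ApplicationInfo]()

  private def registerApplication(app: ApplicationInfo): Unit = {
    if (state == RecoveryState.RECOVERING) {
      // Defer the decision until recovery has finished and the stale
      // entries (e.g. the old app A) have been cleaned out of addressToApp.
      appsRegisteredDuringRecovery += app
      return
    }
    val appAddress = app.driver.address
    if (addressToApp.contains(appAddress)) {
      logInfo("Attempted to re-register application at same address: " + appAddress)
      return
    }
    // ... existing registration logic ...
  }

  private def completeRecovery(): Unit = {
    // ... existing logic that drops apps/workers/drivers that did not respond,
    //     and sets the state back to ALIVE ...
    // Now that the stale entries are gone, register the apps (e.g. app B)
    // that arrived while we were recovering.
    appsRegisteredDuringRecovery.foreach(registerApplication)
    appsRegisteredDuringRecovery.clear()
  }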

Looking forward to hearing from you.

best

[1]
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834

-- 
Niranda
@n1r44 <https://twitter.com/N1R44>
+94-71-554-8430
https://pythagoreanscript.wordpress.com/

Re: Possible deadlock in registering applications in the recovery mode

Posted by Niranda Perera <ni...@gmail.com>.
Hi guys,

Any update on this?

Best

-- 
Niranda
@n1r44 <https://twitter.com/N1R44>
+94-71-554-8430
https://pythagoreanscript.wordpress.com/

Re: Possible deadlock in registering applications in the recovery mode

Posted by Niranda Perera <ni...@gmail.com>.
Hi Reynold,

I have created a JIRA for this [1]. I have also created a PR for the same
issue [2].

Would be very grateful if you could look into this, because it is a
blocker in our Spark deployment, which uses a number of custom Spark
extensions.

thanks
best

[1] https://issues.apache.org/jira/browse/SPARK-14736
[2] https://github.com/apache/spark/pull/12506

-- 
Niranda
@n1r44 <https://twitter.com/N1R44>
+94-71-554-8430
https://pythagoreanscript.wordpress.com/

Re: Possible deadlock in registering applications in the recovery mode

Posted by Reynold Xin <rx...@databricks.com>.
I haven't looked closely at this, but I think your proposal makes sense.



Re: Possible deadlock in registering applications in the recovery mode

Posted by Niranda Perera <ni...@gmail.com>.
Hi guys,

Any update on this?

Best

-- 
Niranda
@n1r44 <https://twitter.com/N1R44>
+94-71-554-8430
https://pythagoreanscript.wordpress.com/