Posted to notifications@jclouds.apache.org by "Ladislav Thon (JIRA)" <ji...@apache.org> on 2016/04/12 12:55:25 UTC

[jira] [Commented] (JCLOUDS-1092) Azure: ComputeService.resumeNode spins in a timeout loop that doesn't have a chance to exit early

    [ https://issues.apache.org/jira/browse/JCLOUDS-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236984#comment-15236984 ] 

Ladislav Thon commented on JCLOUDS-1092:
----------------------------------------

I'd like to start a discussion here: https://github.com/jclouds/jclouds-labs/pull/257

> Azure: ComputeService.resumeNode spins in a timeout loop that doesn't have a chance to exit early
> -------------------------------------------------------------------------------------------------
>
>                 Key: JCLOUDS-1092
>                 URL: https://issues.apache.org/jira/browse/JCLOUDS-1092
>             Project: jclouds
>          Issue Type: Bug
>          Components: jclouds-labs
>    Affects Versions: 1.9.2
>            Reporter: Ladislav Thon
>              Labels: azurecompute
>
> This is going to be a slightly longer text, so please bear with me.
> Invoking {{ComputeService.resumeNode}} with the Azure provider goes through these layers:
> - {{BaseComputeService.resumeNode}}
> - {{AdaptingComputeServiceStrategies.resumeNode}}
> - {{AzureComputeServiceAdapter.resumeNode}}
> The problem manifests when traversing the call stack back up, so let's assume we got down to {{AzureComputeServiceAdapter.resumeNode}}. Also, the problem only appears for us when calling {{suspendNode}} and then {{resumeNode}} in rapid succession, but that's outside the control of jclouds.
> When the {{trackRequest}} method returns (https://github.com/jclouds/jclouds-labs/blob/fe24698d81/azurecompute/src/main/java/org/jclouds/azurecompute/compute/AzureComputeServiceAdapter.java#L383), it means that the asynchronous operation "start node" succeeded -- but that doesn't mean that the node is already running. In fact, it's only just starting -- I was able to confirm that in the debugger by calling {{api.getDeploymentApiForService(id).get(id)}} and inspecting the {{roleInstanceList}}.
> When we get one layer back up, the {{AdaptingComputeServiceStrategies.resumeNode}} method calls {{getNode}} (see https://github.com/jclouds/jclouds/blob/b9322c583d/compute/src/main/java/org/jclouds/compute/strategy/impl/AdaptingComputeServiceStrategies.java#L164), which delegates to {{AzureComputeServiceAdapter.getNode}}.
> {{AzureComputeServiceAdapter.getNode}} only returns a non-{{null}} value when all of the deployment's role instances are in a settled (non-transient) state (see https://github.com/jclouds/jclouds-labs/blob/fe24698d81/azurecompute/src/main/java/org/jclouds/azurecompute/compute/AzureComputeServiceAdapter.java#L269 ). So when the node is only just starting, {{AzureComputeServiceAdapter.getNode}} will return {{null}}.
> Again one layer back up: {{AdaptingComputeServiceStrategies.getNode}} returns {{null}} and hence {{AdaptingComputeServiceStrategies.resumeNode}} also returns {{null}}.
> One more layer back up: {{BaseComputeService.resumeNode}} will call the {{nodeRunning}} predicate with an {{AtomicReference}} of {{null}} (see https://github.com/jclouds/jclouds/blob/b9322c583d/compute/src/main/java/org/jclouds/compute/internal/BaseComputeService.java#L470 ).
> The predicate is a {{ComputeServiceTimeoutsModule.RetryablePredicateGuardingNull}} which delegates to {{Predicates2.RetryablePredicate}} and through that to {{AtomicNodeRunning}}. That is a subclass of {{RefreshAndDoubleCheckOnFailUnlessStatusInvalid}}, which will always return {{false}} when the resource is {{null}} (see https://github.com/jclouds/jclouds/blob/b9322c583d/compute/src/main/java/org/jclouds/compute/predicates/internal/RefreshAndDoubleCheckOnFailUnlessStatusInvalid.java#L63 ). There's also some kind of status refreshing, but that will never happen if the resource (the node, in this case) is {{null}}, because there's nothing to refresh.
> All in all, the {{Predicates2.RetryablePredicate}} keeps spinning until it times out, because for a {{null}} node it has no chance to exit early.
> After the timeout, {{BaseComputeService.resumeNode}} prints that resuming the node was not successful and returns. The problems are:
> - the retrying predicate is spinning uselessly
> - we actually have no idea about the status of the node when {{resumeNode}} returns
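
To make the chain described above easier to follow, here is a minimal, self-contained sketch in plain Java. It is not jclouds code: the class ResumeNodeSketch, the Status enum and its values, and the methods getNode, nodeRunning, retry and retryUnlessNull are all hypothetical names invented for illustration. It only models the three behaviours the description relies on: a lookup that returns null while any role instance is in a transient state, a "running" check that can never pass for a null node, and a retry loop that therefore runs until its timeout. The last method shows one possible way such a loop could exit early; it is an assumption for discussion, not necessarily what the linked pull request does.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;

// Simplified stand-in for the call chain described above; none of this is jclouds code.
final class ResumeNodeSketch {

    // Hypothetical statuses; transientState marks states that are still changing.
    enum Status {
        STARTING(true), STOPPING(true), RUNNING(false), STOPPED(false);

        final boolean transientState;

        Status(boolean transientState) {
            this.transientState = transientState;
        }
    }

    // Models the described getNode: null unless every role instance has settled.
    static Status getNode(List<Status> roleInstances) {
        for (Status s : roleInstances) {
            if (s.transientState) {
                return null;
            }
        }
        return roleInstances.isEmpty() ? null : roleInstances.get(0);
    }

    // Models the described running check: a null node never passes and is never refreshed.
    static boolean nodeRunning(AtomicReference<Status> node) {
        return node.get() == Status.RUNNING;
    }

    // Generic retry loop: re-test every periodMillis until the check passes or the timeout elapses.
    static <T> boolean retry(Predicate<T> check, T input, long timeoutMillis, long periodMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!check.test(input)) {
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(periodMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    // Hypothetical guard: with a null node nothing can change, so skip the loop entirely.
    static boolean retryUnlessNull(AtomicReference<Status> node, long timeoutMillis, long periodMillis) {
        return node.get() != null && retry(ResumeNodeSketch::nodeRunning, node, timeoutMillis, periodMillis);
    }
}
```

With a reference holding the result of getNode for a single STARTING instance (that is, null), retry(ResumeNodeSketch::nodeRunning, node, 200, 20) returns false only after the full 200 ms, whereas retryUnlessNull returns false immediately. Neither variant tells the caller what state the node is actually in, which is the second problem listed above.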


