Posted to user@ignite.apache.org by Paolo Di Tommaso <pa...@gmail.com> on 2016/03/01 00:02:01 UTC

Re: "Failed to send message"

Val,

I guess I'm missing something, but I was expecting that in a two-node
cluster, one node would steal the waiting tasks from the other. What
defines the task topology, and how can I control it?


Cheers,
Paolo


On Mon, Feb 29, 2016 at 11:02 PM, vkulichenko <valentin.kulichenko@gmail.com> wrote:

> Paolo,
>
> This is not an error. This is a debug message which means that there is a
> node in topology (thief candidate) which is not in task topology - this is
> an absolutely legal situation. Does it break something for you?
>
> -Val
>
>
>
> --
> View this message in context:
> http://apache-ignite-users.70518.x6.nabble.com/Failed-to-send-message-tp3217p3265.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.
>

Re: "Failed to send message"

Posted by vkulichenko <va...@gmail.com>.

Paolo,

I found the ticket about this issue [1]. How about picking it up and fixing
it instead of implementing your own version of the SPI?

Removing the check completely is wrong, because it's possible that a node
doesn't belong to the cluster group on which the task was executed. But we
should check the original predicate instead of the collection of nodes sealed
during the map phase.
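
A minimal sketch of the distinction drawn above, in plain Java with no Ignite
dependency. Node IDs are modeled as strings, and the names allowedBySnapshot,
allowedByPredicate, and the "server-" prefix are hypothetical. This is not the
code in JobStealingCollisionSpi, only an illustration of why a sealed snapshot
rejects late joiners while the original predicate would not.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical model, not Ignite code: two ways to validate a thief node. */
final class TaskTopologySketch {
    /** Check against the node IDs captured when the task was mapped. */
    static boolean allowedBySnapshot(Set<String> sealedNodeIds, String thiefId) {
        return sealedNodeIds.contains(thiefId);
    }

    /** Check against the predicate that defines the cluster group. */
    static boolean allowedByPredicate(Predicate<String> groupFilter, String thiefId) {
        return groupFilter.test(thiefId);
    }

    public static void main(String[] args) {
        // Assumed naming scheme: server node IDs start with "server-".
        Predicate<String> serversOnly = id -> id.startsWith("server-");

        // Only server-1 was in the cluster when the task was mapped.
        Set<String> sealed = Set.of("server-1");

        for (String thief : List.of("server-2", "client-1")) {
            System.out.println(thief
                + " snapshot=" + allowedBySnapshot(sealed, thief)
                + " predicate=" + allowedByPredicate(serversOnly, thief));
        }
    }
}
```

In this sketch a node that joins after mapping ("server-2") fails the snapshot
check but passes the predicate check, while a node outside the group
("client-1") fails both.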

[1] https://issues.apache.org/jira/browse/IGNITE-1267

-Val



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Failed-to-send-message-tp3217p3317.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Re: "Failed to send message"

Posted by Paolo Di Tommaso <pa...@gmail.com>.

Considerably, because the idea is to resize the cluster dynamically,
launching new grid nodes so that they steal jobs that are starving in the
waiting state.

I'm wondering whether it would be enough to remove this check:

https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/spi/collision/jobstealing/JobStealingCollisionSpi.java#L734-L748

In that case it would not be so difficult, because I'm already considering
writing my own collision strategy by "cloning" the default one.

Thanks for your help.

Cheers,
Paolo

On Tue, Mar 1, 2016 at 10:28 PM, vkulichenko <va...@gmail.com>
wrote:

> Paolo,
>
> So you're saying that jobs are not stolen by the node that joined after the
> task is executed? I guess this is possible, because the task topology is
> sealed during mapping phase. How critical is this for you?
>
> -Val

Re: "Failed to send message"

Posted by vkulichenko <va...@gmail.com>.

Paolo,

So you're saying that jobs are not stolen by the node that joined after the
task is executed? I guess this is possible, because the task topology is
sealed during mapping phase. How critical is this for you?

-Val



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Failed-to-send-message-tp3217p3311.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Re: "Failed to send message"

Posted by Paolo Di Tommaso <pa...@gmail.com>.

Hi Valentin,

I've checked and it is not a client node because the GridDiscoveryManager
reports in the log "servers=2, clients=0" and I've launched only two
instances.

Also, I'm using the default cluster group, i.e. tasks are executed via the
Ignite#compute() method.

I'm starting to think that this happens because the second node joins the
topology *after* the tasks have been submitted. Could this be the reason?

Let me explain my use case better: I'm trying to use Ignite in a cloud
cluster to execute long-running jobs that run system commands. In this
scenario the cluster must be resized by adding new nodes, depending on the
runtime metrics. In other words, when a certain number of jobs are in the
waiting state, new cloud instances should be started and should begin to
steal the waiting jobs.

Is this possible?
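
The scale-out rule described above can be sketched as follows, in plain Java
with no Ignite dependency. Everything here is an assumption made for
illustration: the method name nodesToLaunch, its parameters, and the idea of
sizing new nodes by a fixed jobsPerNode capacity. How the waiting-job count is
obtained and how cloud instances are launched are left out.

```java
/** Hypothetical scale-out rule, not an Ignite API. */
final class ScaleOutSketch {
    /**
     * Returns how many nodes to launch for the given backlog.
     * Launches nothing while the backlog is at or below the threshold.
     */
    static int nodesToLaunch(int waitingJobs, int waitThreshold, int jobsPerNode, int maxNewNodes) {
        if (waitThreshold < 0 || jobsPerNode <= 0 || maxNewNodes < 0)
            throw new IllegalArgumentException("invalid scaling parameters");

        int excess = waitingJobs - waitThreshold;
        if (excess <= 0)
            return 0;

        // Ceiling division: one extra node for any partial batch of jobs.
        int needed = (excess + jobsPerNode - 1) / jobsPerNode;
        return Math.min(needed, maxNewNodes);
    }

    public static void main(String[] args) {
        // 25 waiting jobs, threshold 5, 8 jobs per new node, at most 4 new nodes.
        System.out.println(nodesToLaunch(25, 5, 8, 4));
    }
}
```

Capping the result with maxNewNodes keeps a sudden backlog from launching an
unbounded number of instances. Whether the new nodes can actually steal the
waiting jobs still depends on the topology check discussed in this thread.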

Cheers,
Paolo

On Tue, Mar 1, 2016 at 12:41 AM, vkulichenko <va...@gmail.com>
wrote:

> Paolo,
>
> From what I see in the code, it can be even a client node (you can check by
> the ID, btw). Task topology is defined by a cluster group that is used to
> get IgniteCompute. By default it's all server nodes.
>
> In any case, this is just a debug message and if job stealing works for you
> as expected, I would not worry about this. If it doesn't, please describe
> the issue you have.
>
> -Val

Re: "Failed to send message"

Posted by vkulichenko <va...@gmail.com>.

Paolo,

From what I see in the code, it can be even a client node (you can check by
the ID, btw). Task topology is defined by a cluster group that is used to
get IgniteCompute. By default it's all server nodes.

In any case, this is just a debug message and if job stealing works for you
as expected, I would not worry about this. If it doesn't, please describe
the issue you have.

-Val



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Failed-to-send-message-tp3217p3269.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.