Posted to user@flink.apache.org by AndreaKinn <ki...@hotmail.it> on 2017/10/16 16:01:07 UTC

Unbalanced job scheduling

Hi all,
I'd like to describe my program flow.

I have the following operators:

kafka-source -> timestamp-extractor -> map -> keyBy -> window -> apply ->
LEARN -> SELECT -> process -> cassandra-sink

The LEARN and SELECT operators belong to an external library for Flink.
LEARN is a very heavy operation compared to the other operators.

Unfortunately LEARN has a max parallelism of 1, so if I have a cluster of 2
TMs with 1 slot each and I set parallelism = 2, one TM runs one parallel
instance of every operator plus the single instance of LEARN, while the
other TM runs just the second parallel instance of every operator (clearly
there is no further instance of LEARN).
That's OK and I have no problem understanding it.

*** The problem:
Actually I have 2 identical flows like this, because I have two sensor
streams, so I really have 2 LEARN operators corresponding to two
independent streams.

However, I noticed that even in this case one TM takes the load of one
parallel instance of every operator AND the single instances of LEARN-1 and
LEARN-2, while the other TM runs just the second parallel instance of every
operator (no LEARN instances here).
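
The placement described above can be reproduced with a small model. This is
only a sketch under a simplifying assumption (subtask i of every operator
goes to slot i, and all operators share the default slot sharing group); it
is not Flink's actual scheduler, and the operator names map-1 and map-2 are
placeholders for the non-LEARN operators of each flow.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of slot sharing, NOT Flink's real scheduler:
// subtask i of every operator is placed in slot i.
class SlotPlacementSketch {

    // Returns, for each slot, the names of the operators that have a subtask there.
    static List<List<String>> place(Map<String, Integer> parallelism, int slots) {
        List<List<String>> contents = new ArrayList<>();
        for (int s = 0; s < slots; s++) {
            contents.add(new ArrayList<>());
        }
        for (Map.Entry<String, Integer> op : parallelism.entrySet()) {
            if (op.getValue() > slots) {
                throw new IllegalArgumentException("not enough slots for " + op.getKey());
            }
            for (int subtask = 0; subtask < op.getValue(); subtask++) {
                contents.get(subtask).add(op.getKey());
            }
        }
        return contents;
    }

    public static void main(String[] args) {
        // Two flows; each LEARN has parallelism 1, everything else parallelism 2.
        Map<String, Integer> parallelism = new LinkedHashMap<>();
        parallelism.put("map-1", 2);
        parallelism.put("LEARN-1", 1);
        parallelism.put("map-2", 2);
        parallelism.put("LEARN-2", 1);

        List<List<String>> contents = place(parallelism, 2);
        for (int s = 0; s < contents.size(); s++) {
            System.out.println("slot " + s + ": " + contents.get(s));
        }
        // slot 0: [map-1, LEARN-1, map-2, LEARN-2]
        // slot 1: [map-1, map-2]
    }
}
```

In this model every operator with parallelism 1 has only subtask 0, so both
LEARN instances end up in the same slot, which matches the observed load.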

Since LEARN is a heavy operator, this leads to a very unbalanced load on
the cluster, so much so that the first TM is killed during execution
(judging by the logs, probably because it runs out of memory; in fact the
sink is very slow, and LEARN seems to be the bottleneck).

Honestly I can't understand why Flink doesn't assign one LEARN operator to
one TM and the other LEARN to the other TM.
This prevents me from stressing the cluster properly, because I will always
have one TM very busy and the other one fairly idle.

Bye,
Andrea



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Unbalanced job scheduling

Posted by AndreaKinn <ki...@hotmail.it>.

I'm in contact with the author of the library to deal with the problem. I'm
also trying to understand how to implement slotSharingGroups myself.




Re: Unbalanced job scheduling

Posted by Fabian Hueske <fh...@gmail.com>.

Setting the slot sharing group is Flink's mechanism to solve this issue.
I'd consider this a limitation of the library that provides LEARN and
SELECT.

Did you consider opening an issue at (or contributing to) the library to
support setting the slot sharing group?
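
As a rough illustration of what separate slot sharing groups would cost in
slots: operators in different groups never share a slot, so each group needs
as many slots as its most parallel operator. The sketch below only does that
arithmetic. The group names "learn-1" and "learn-2" and the operator names
map-1 and map-2 are hypothetical, and the code does not call any Flink API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Back-of-the-envelope slot demand: sum over groups of the maximum
// parallelism inside each group. Plain Java, no Flink dependency.
class SlotDemandSketch {

    static final class Op {
        final String name;
        final String group;
        final int parallelism;

        Op(String name, String group, int parallelism) {
            this.name = name;
            this.group = group;
            this.parallelism = parallelism;
        }
    }

    static int requiredSlots(List<Op> ops) {
        Map<String, Integer> maxPerGroup = new HashMap<>();
        for (Op op : ops) {
            maxPerGroup.merge(op.group, op.parallelism, Integer::max);
        }
        int total = 0;
        for (int p : maxPerGroup.values()) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        // Everything in the default group, as in the original job.
        List<Op> shared = new ArrayList<>();
        shared.add(new Op("map-1", "default", 2));
        shared.add(new Op("LEARN-1", "default", 1));
        shared.add(new Op("map-2", "default", 2));
        shared.add(new Op("LEARN-2", "default", 1));

        // Each LEARN isolated in its own (hypothetical) group.
        List<Op> isolated = new ArrayList<>();
        isolated.add(new Op("map-1", "default", 2));
        isolated.add(new Op("LEARN-1", "learn-1", 1));
        isolated.add(new Op("map-2", "default", 2));
        isolated.add(new Op("LEARN-2", "learn-2", 1));

        System.out.println("shared:   " + requiredSlots(shared));   // 2
        System.out.println("isolated: " + requiredSlots(isolated)); // 4
    }
}
```

Under this reasoning the isolated layout would not fit on 2 TMs with 1 slot
each; the number of slots per TM is set with taskmanager.numberOfSlots in
flink-conf.yaml. Separate groups keep the two LEARN instances out of the
same slot, but by themselves they probably do not guarantee different TMs.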

2017-10-17 9:38 GMT+02:00 AndreaKinn <ki...@hotmail.it>:

> Yes, I considered them, but unfortunately I can't call the
> setSlotSharingGroup method on the LEARN and SELECT operators.
>
> I can call it on the other operators, but this means that the two LEARN
> operators will be constrained to the same "unnamed" slot.

Re: Unbalanced job scheduling

Posted by AndreaKinn <ki...@hotmail.it>.

Yes, I considered them, but unfortunately I can't call the
setSlotSharingGroup method on the LEARN and SELECT operators.

I can call it on the other operators, but this means that the two LEARN
operators will be constrained to the same "unnamed" slot.
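
One point that may be relevant here: the Flink documentation on resource
groups says that an operator without an explicit slot sharing group inherits
the group of its inputs when all inputs are in the same group, and falls
back to "default" otherwise. The sketch below models that rule for a linear
chain only. The group name "flow-1" is hypothetical, and whether LEARN
actually inherits a group depends on how the library builds its operators,
which this thread does not say.

```java
// Models the documented inheritance rule for a linear chain of operators:
// an operator with no explicit group takes the group of its single input;
// the first operator falls back to "default". Plain Java, no Flink dependency.
class GroupInheritanceSketch {

    // explicitGroups[i] is the group set on operator i, or null if none was set.
    static String[] resolve(String[] explicitGroups) {
        String[] effective = new String[explicitGroups.length];
        String current = "default";
        for (int i = 0; i < explicitGroups.length; i++) {
            if (explicitGroups[i] != null) {
                current = explicitGroups[i];
            }
            effective[i] = current;
        }
        return effective;
    }

    public static void main(String[] args) {
        String[] names = {"kafka-source", "apply", "LEARN", "SELECT", "process"};
        // Only "apply" gets an explicit (hypothetical) group.
        String[] explicit = {null, "flow-1", null, null, null};

        String[] effective = resolve(explicit);
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + " -> " + effective[i]);
        }
        // kafka-source -> default
        // apply -> flow-1
        // LEARN -> flow-1
        // SELECT -> flow-1
        // process -> flow-1
    }
}
```

If the rule applies to the library's operators, setting a different group on
the operator just before each LEARN might be enough to separate LEARN-1 from
LEARN-2 without calling anything on LEARN itself.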




Re: Unbalanced job scheduling

Posted by Fabian Hueske <fh...@gmail.com>.

Hi Andrea,

have you looked into assigning slot sharing groups [1]?

Best, Fabian

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/datastream_api.html#task-chaining-and-resource-groups
