Posted to user@flink.apache.org by Great Info <gu...@gmail.com> on 2022/06/13 16:01:42 UTC

Flink running same task on different Task Manager

Hi,
I have one Flink job which has two tasks:
Task1- Sources some static data over HTTPS and keeps it in memory; the
data is refreshed every hour.
Task2- Processes some real-time events from Kafka, uses the static data to
validate and transform them, then forwards them to another Kafka topic.

So far everything has been running on the same task manager (same node),
but due to some recent scaling requirements I need to enable partitioning
on Task2, which will make some partitions run on other task managers.
However, those other task managers don't have the static data.

Is there a way to run Task1 on all the task managers? I don't want to
enable broadcasting since the data is fairly large, and I also cannot
persist it in a DB due to data regulations.

Re: Flink running same task on different Task Manager

Posted by Lijie Wang <wa...@gmail.com>.
Hi Great,

-> Will these methods work?
I think it will not work. `cluster.evenly-spread-out-slots: true` only
ensures that slots are evenly distributed across the TMs; it cannot
control which tasks end up in which slots.
For example, suppose the Flink cluster has 2 TMs, each with 4 slots (8
slots in total), and your job has 2 * *Task1- Source* and 2 * *Task2-
Process* (4 tasks in total). If you set `cluster.evenly-spread-out-slots:
true`, then TM1 will run 2 tasks and TM2 will run 2 tasks. But the 2
tasks on TM1 can be any 2 of those 4 tasks (both might be *Task1-
Source*, which is not what you need).

-> Also, I did not find a way to set different parallelism for each slot
sharing group
Operators in the same slot sharing group can have different parallelisms.
See [1] for details; a minimal sketch follows.
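
For illustration (the sources, lambdas, and the group name "static-data"
here are placeholders, not your actual operators):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SharedGroupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Both operators are in the slot sharing group "static-data", but with
        // different parallelisms. The group then needs max(2, 12) = 12 slots:
        // every slot holds a subtask of the p=12 operator, and 2 of those
        // slots additionally hold a subtask of the p=2 operator.
        env.fromSequence(0, 999)                     // stand-in source
           .map(x -> x).returns(Types.LONG)          // "Task1"-like operator
           .slotSharingGroup("static-data")
           .setParallelism(2)
           .map(x -> x * 2).returns(Types.LONG)      // "Task2"-like operator
           .slotSharingGroup("static-data")
           .setParallelism(12)
           .print();

        env.execute("shared-group-sketch");
    }
}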

I think you can refer to the solution provided by Weihua and use Broadcast
to do this; a rough sketch of the broadcast state pattern is below.
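
This is only an outline of the pattern; the socket sources stand in for
your HTTPS and Kafka sources, and the names/types are invented:

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class BroadcastSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Descriptor for the broadcast ("static data") state.
        final MapStateDescriptor<String, String> staticDataDesc =
                new MapStateDescriptor<>("static-data", Types.STRING, Types.STRING);

        // Stand-ins for the periodic HTTPS fetch and the Kafka events.
        DataStream<String> staticUpdates = env.socketTextStream("localhost", 9999);
        DataStream<String> events = env.socketTextStream("localhost", 9998);

        BroadcastStream<String> broadcast = staticUpdates.broadcast(staticDataDesc);

        events.connect(broadcast)
              .process(new BroadcastProcessFunction<String, String, String>() {
                  @Override
                  public void processElement(String event, ReadOnlyContext ctx,
                                             Collector<String> out) throws Exception {
                      // Validate/transform the event against the broadcast state.
                      String v = ctx.getBroadcastState(staticDataDesc).get(event);
                      if (v != null) {
                          out.collect(event + "->" + v);
                      }
                  }

                  @Override
                  public void processBroadcastElement(String update, Context ctx,
                                                      Collector<String> out) throws Exception {
                      // Every parallel instance receives each update.
                      ctx.getBroadcastState(staticDataDesc).put(update, update);
                  }
              })
              .print();

        env.execute("broadcast-sketch");
    }
}

Each parallel instance receives every broadcast update, so all task
managers hold the same copy of the static data; the trade-off is the
memory cost you mentioned.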

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/flink-architecture/#task-slots-and-resources

Best,
Lijie

Re: Flink running same task on different Task Manager

Posted by Great Info <gu...@gmail.com>.
Kindly help with this, I got stuck.
-> If so, I think you can set Task1 and Task2 to the same parallelism and
set them in the same slot sharing group. In this way, Task1 and Task2 will
be deployed into the same slot (that is, the same task manager).

*Updating task details*
*Task1- Sources some static data over HTTPS and keeps it in memory (in a
static memory block), refreshing it every hour. Since this data is huge,
it cannot be broadcast.*

*Task2- Processes some real-time events from Kafka, uses the static data
to validate and transform them, then forwards them to another Kafka topic.*

Task2 needs more parallelism, so deploying both Task1 and Task2 on the
same node (task manager) is becoming difficult. I am using AWS KDA, which
has a limit of 8 tasks per node, and I now have a requirement to run Task2
with a parallelism of 12. My plan (sketched in code below):

1. Set a different SlotSharingGroup for task1 and task2.
2. Set parallelism to 12 for task2 (this real-time task needs to read from
12 different Kafka partitions, hence 12).
3. Set parallelism of task1 to 2.
4. Then set cluster.evenly-spread-out-slots: true.

Will these methods work? Also, I did not find a way to set a different
parallelism for each slot sharing group.
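
In job code, steps 1-3 would look roughly like this (the sources and map
bodies are placeholders for this sketch); step 4 is a cluster-level option:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SeparateGroupsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Steps 1 + 3: task1 in its own slot sharing group, parallelism 2.
        env.fromSequence(0, 999)                 // stand-in for the HTTPS static-data source
           .map(x -> x).returns(Types.LONG)
           .slotSharingGroup("task1")
           .setParallelism(2)
           .print();

        // Steps 1 + 2: task2 in a different group, parallelism 12
        // (one subtask per Kafka partition).
        env.fromSequence(0, 999)                 // stand-in for the Kafka source
           .map(x -> x * 2).returns(Types.LONG)
           .slotSharingGroup("task2")
           .setParallelism(12)
           .print();

        env.execute("separate-groups-sketch");
    }
}

// Step 4 goes in flink-conf.yaml, not in job code:
//   cluster.evenly-spread-out-slots: true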

Re: Flink running same task on different Task Manager

Posted by Great Info <gu...@gmail.com>.
-> If so, I think you can set Task1 and Task2 to the same parallelism and
set them in the same slot sharing group. In this way, Task1 and Task2 will
be deployed into the same slot (that is, the same task manager).

*Updating task details*
*Task1- Sources some static data over HTTPS and keeps it in memory (in a
static memory block), refreshing it every hour. Since this data is huge,
it cannot be broadcast.*

*Task2- Processes some real-time events from Kafka, uses the static data
to validate and transform them, then forwards them to another Kafka topic.*

Task2 needs more parallelism, so deploying both Task1 and Task2 on the
same node (task manager) is becoming difficult. I am using AWS KDA, which
has a limit of 8 tasks per node, and I now have a requirement to run Task2
with a parallelism of 12. My plan:

1. Set a different SlotSharingGroup for task1 and task2.
2. Set parallelism to 12 for task2 (this real-time task needs to read from
12 different Kafka partitions, hence 12).
3. Set parallelism of task1 to 2.
4. Then set cluster.evenly-spread-out-slots: true.

Will these methods work? Also, I did not find a way to set a different
parallelism for each slot sharing group.



Re: Flink running same task on different Task Manager

Posted by Lijie Wang <wa...@gmail.com>.
Hi Great,

-> Is there a way to set the restart strategy so that only tasks in the
same slot group will restart during failure?

No. On task failover, all tasks in the same pipelined region are restarted
together (to ensure data consistency).
You can get more details about failover strategies in [1].
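
As a side note, the failover granularity is controlled by a cluster-level
option; a flink-conf.yaml sketch (the restart-strategy values here are
only examples):

# flink-conf.yaml
jobmanager.execution.failover-strategy: region  # restart only the affected region ("region" is the default)
restart-strategy: fixed-delay                   # how the affected tasks are restarted
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s

Tasks in the same region always restart together either way; there is no
per-slot-sharing-group restart.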

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/task_failure_recovery/#failover-strategies

Best,
Lijie


Re: Flink running same task on different Task Manager

Posted by Great Info <gu...@gmail.com>.
Thanks for helping with some inputs.
Actually, I have created task1 and task2 in separate slot groups; I
thought it would be good if they ran in independent slots. I am also now
facing some issues during restarts: whenever task1 throws any exception,
the entire job restarts.

Is there a way to set the restart strategy so that only tasks in the same
slot group restart during a failure?

Re: Flink running same task on different Task Manager

Posted by Lijie Wang <wa...@gmail.com>.
Hi Great,

Do you mean there is a Task1 and a Task2 on each task manager?

If so, I think you can set Task1 and Task2 to the same parallelism and put
them in the same slot sharing group. In this way, each Task1 subtask and
the corresponding Task2 subtask will be deployed into the same slot (that
is, onto the same task manager); a minimal sketch follows.

You can find more details about slot sharing groups in [1] and how to set
a slot sharing group in [2].

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/flink-architecture/#task-slots-and-resources
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/overview/#set-slot-sharing-group

Best,
Lijie

Re: Flink running same task on different Task Manager

Posted by Weihua Hu <hu...@gmail.com>.
I don't really understand how task2 reads the static data from task1, but
I think you can integrate the logic of fetching the static data over HTTP
from task1 into task2 and keep only one kind of task.

Best,
Weihua


Re: Flink running same task on different Task Manager

Posted by Great Info <gu...@gmail.com>.
Thanks for helping with some inputs. Yes, I am using a rich function: the
objects are created in open(), and the network calls are made in run().
But currently I am stuck running this same task on *all task managers*
(nodes): when I submit the job, this task1 (the static data task) runs on
only one task manager, and I have 3 task managers in my Flink cluster.


Re: Flink running same task on different Task Manager

Posted by Weihua Hu <hu...@gmail.com>.
Hi,

IMO, Broadcast is a better way to do this, as it can also reduce the QPS
on the external service.
If you do not want to use Broadcast, try using a RichFunction: start a
thread in the open() method to refresh the data regularly, but be careful
to clean up your data and threads in the close() method, otherwise it will
lead to leaks. A sketch of that pattern is below.
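
A rough sketch of that pattern (the fetch logic, types, and names are
placeholders, not a real implementation):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Sketch: each parallel instance refreshes its own copy of the static data.
public class ValidatingMapper extends RichMapFunction<String, String> {

    private transient volatile Map<String, String> staticData;
    private transient ScheduledExecutorService refresher;

    @Override
    public void open(Configuration parameters) {
        staticData = new ConcurrentHashMap<>();
        refresher = Executors.newSingleThreadScheduledExecutor();
        // Refresh immediately and then every hour.
        refresher.scheduleAtFixedRate(this::refresh, 0, 1, TimeUnit.HOURS);
    }

    private void refresh() {
        // Placeholder: fetch the static data over HTTPS and swap it in.
        Map<String, String> fresh = new ConcurrentHashMap<>();
        // fresh.putAll(httpClient.fetch(...));   // hypothetical fetch call
        staticData = fresh;
    }

    @Override
    public String map(String event) {
        // Validate/transform the event against the in-memory static data.
        String v = staticData.get(event);
        return v == null ? event : event + "->" + v;
    }

    @Override
    public void close() {
        // Clean up the refresh thread, otherwise it leaks on task restarts.
        if (refresher != null) {
            refresher.shutdownNow();
        }
    }
}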

Best,
Weihua

