Posted to user@flink.apache.org by lec ssmi <sh...@gmail.com> on 2021/09/07 06:11:44 UTC

When using the batch api, the sink task is always in the created state.

Hi:
   I'm not familiar with the batch API. I wrote a program that is simply
"insert into tab_a select * from tab_b".
   From the picture, there are only two tasks: the source task, which is
in RUNNING state, and the sink task, which is always in CREATED state.
   According to the logs, the source task is currently reading the file I
specified; in other words, it is working normally.
   Doesn't Flink only start working after all operators are initialized?


[image: image.png]

Re: When using the batch api, the sink task is always in the created state.

Posted by Caizhi Weng <ts...@gmail.com>.
Hi!

At the streaming / Table / SQL API level Flink allows users to run
streaming and batch jobs with the same code; at the runtime level,
however, that user code is compiled into different operators for more
efficient execution.
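
For example, the statement from the question is mode-agnostic; whether it
compiles to streaming or batch physical operators is decided by the
environment it is submitted from, not by the SQL text itself:

```sql
-- The very same statement runs under both the streaming and the batch
-- runtime; Flink compiles it to different physical operators per mode.
INSERT INTO tab_a SELECT * FROM tab_b;
```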


Re: When using the batch api, the sink task is always in the created state.

Posted by lec ssmi <sh...@gmail.com>.
Thanks for your reply!
But doesn't Flink use streams to perform batch computations? From what
you said above, it is, to some extent, the same as true batch computing,
e.g. Spark.


Re: When using the batch api, the sink task is always in the created state.

Posted by Caizhi Weng <ts...@gmail.com>.
My previous mail intended to answer what is needed for all subtasks in a
batch job to run simultaneously. To merely run a batch job, the number of
task slots can be as small as 1; in that case the parallel instances of
each subtask run one by one.

Also, the scheduling of the subtasks depends on the shuffle mode
(table.exec.shuffle-mode). By default all network shuffles in batch jobs
are blocking, which means the downstream subtasks only start running
after the upstream subtasks finish. To run all subtasks simultaneously,
set this option to "pipelined" (Flink <= 1.11) or "ALL_EDGES_PIPELINED"
(Flink >= 1.12).


Re: When using the batch api, the sink task is always in the created state.

Posted by Caizhi Weng <ts...@gmail.com>.
Hi!

If you mean batch SQL, then you'll need to provide enough task slots for
all subtasks. The number of task slots needed is the sum of the
parallelisms of all subtasks, as there is no slot reuse in batch jobs.
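
As a concrete illustration with hypothetical numbers: a job whose source
has parallelism 2 and whose sink has parallelism 2 needs 2 + 2 = 4 slots
to run everything at once, which a single TaskManager can provide via
flink-conf.yaml:

```yaml
# flink-conf.yaml (hypothetical sizing): 4 slots cover a batch job
# with source parallelism 2 + sink parallelism 2
taskmanager.numberOfTaskSlots: 4
```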


Re: When using the batch api, the sink task is always in the created state.

Posted by lec ssmi <sh...@gmail.com>.
Also, my Flink version is 1.11.0.
