You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Anyang Hu <hu...@gmail.com> on 2019/09/05 14:02:03 UTC

suggestion of FLINK-10868

Hi Peter&Till:

As commented in the issue
<https://issues.apache.org/jira/browse/FLINK-10868#>，We have introduced the
FLINK-10868 <https://issues.apache.org/jira/browse/FLINK-10868> patch
(mainly batch tasks) online, what do you think of the following two
suggestions:

1) Parameter control time interval. At present, the default time interval
of 1 min is used, which is too short for batch tasks;

2)Parameter Control When the failed Container number reaches
MAXIMUM_WORKERS_FAILURE_RATE and JM disconnects whether to perform
OnFatalError so that the batch tasks can exit as soon as possible.

Best regards,
Anyang

Re: suggestion of FLINK-10868

Posted by Peter Huang <hu...@gmail.com>.

Hi Anyang and Till,

I think we agreed on making the interval configurable in this case. Let me
revise the current PR. You can review it after that.



Best Regards
Peter Huang

On Thu, Sep 12, 2019 at 12:53 AM Anyang Hu <hu...@gmail.com> wrote:

> Thanks Till, I will continue to follow this issue and see what we can do.
>
> Best regards,
> Anyang
>
> Till Rohrmann <tr...@apache.org> 于2019年9月11日周三 下午5:12写道：
>
>> Suggestion 1 makes sense. For the quick termination I think we need to
>> think a bit more about it to find a good solution also to support strict
>> SLA requirements.
>>
>> Cheers,
>> Till
>>
>> On Wed, Sep 11, 2019 at 11:11 AM Anyang Hu <hu...@gmail.com>
>> wrote:
>>
>>> Hi Till,
>>>
>>> Some of our online batch tasks have strict SLA requirements, and they
>>> are not allowed to be stuck for a long time. Therefore, we take a rude way
>>> to make the job exit immediately. The way to wait for connection recovery
>>> is a better solution. Maybe we need to add a timeout to wait for JM to
>>> restore the connection?
>>>
>>> For suggestion 1, make interval configurable, given that we have done
>>> it, and if we can, we hope to give back to the community.
>>>
>>> Best regards,
>>> Anyang
>>>
>>> Till Rohrmann <tr...@apache.org> 于2019年9月9日周一 下午3:09写道：
>>>
>>>> Hi Anyang,
>>>>
>>>> I think we cannot take your proposal because this means that whenever
>>>> we want to call notifyAllocationFailure when there is a connection problem
>>>> between the RM and the JM, then we fail the whole cluster. This is
>>>> something a robust and resilient system should not do because connection
>>>> problems are expected and need to be handled gracefully. Instead if one
>>>> deems the notifyAllocationFailure message to be very important, then one
>>>> would need to keep it and tell the JM once it has connected back.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Sun, Sep 8, 2019 at 11:26 AM Anyang Hu <hu...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Peter,
>>>>>
>>>>> For our online batch task, there is a scene where the failed Container
>>>>> reaches MAXIMUM_WORKERS_FAILURE_RATE but the client will not immediately
>>>>> exit (the probability of JM loss is greatly improved when thousands of
>>>>> Containers is to be started). It is found that the JM disconnection (the
>>>>> reason for JM loss is unknown) will cause the notifyAllocationFailure not
>>>>> to take effect.
>>>>>
>>>>> After the introduction of FLINK-13184
>>>>> <https://jira.apache.org/jira/browse/FLINK-13184> to start  the
>>>>> container with multi-threaded, the JM disconnection situation has been
>>>>> alleviated. In order to stably implement the client immediate exit, we use
>>>>> the following code to determine  whether call onFatalError when
>>>>> MaximumFailedTaskManagerExceedingException is occurd:
>>>>>
>>>>> @Override
>>>>> public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) {
>>>>>    validateRunsInMainThread();
>>>>>
>>>>>    JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId);
>>>>>    if (jobManagerRegistration != null) {
>>>>>       jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause);
>>>>>    } else {
>>>>>       if (exitProcessOnJobManagerTimedout) {
>>>>>          ResourceManagerException exception = new ResourceManagerException("Job Manager is lost, can not notify allocation failure.");
>>>>>          onFatalError(exception);
>>>>>       }
>>>>>    }
>>>>> }
>>>>>
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Anyang
>>>>>
>>>>>

Re: suggestion of FLINK-10868

Posted by Anyang Hu <hu...@gmail.com>.

Thanks Till, I will continue to follow this issue and see what we can do.

Best regards,
Anyang

Till Rohrmann <tr...@apache.org> 于2019年9月11日周三 下午5:12写道：

> Suggestion 1 makes sense. For the quick termination I think we need to
> think a bit more about it to find a good solution also to support strict
> SLA requirements.
>
> Cheers,
> Till
>
> On Wed, Sep 11, 2019 at 11:11 AM Anyang Hu <hu...@gmail.com> wrote:
>
>> Hi Till,
>>
>> Some of our online batch tasks have strict SLA requirements, and they are
>> not allowed to be stuck for a long time. Therefore, we take a rude way to
>> make the job exit immediately. The way to wait for connection recovery is a
>> better solution. Maybe we need to add a timeout to wait for JM to restore
>> the connection?
>>
>> For suggestion 1, make interval configurable, given that we have done it,
>> and if we can, we hope to give back to the community.
>>
>> Best regards,
>> Anyang
>>
>> Till Rohrmann <tr...@apache.org> 于2019年9月9日周一 下午3:09写道：
>>
>>> Hi Anyang,
>>>
>>> I think we cannot take your proposal because this means that whenever we
>>> want to call notifyAllocationFailure when there is a connection problem
>>> between the RM and the JM, then we fail the whole cluster. This is
>>> something a robust and resilient system should not do because connection
>>> problems are expected and need to be handled gracefully. Instead if one
>>> deems the notifyAllocationFailure message to be very important, then one
>>> would need to keep it and tell the JM once it has connected back.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Sun, Sep 8, 2019 at 11:26 AM Anyang Hu <hu...@gmail.com>
>>> wrote:
>>>
>>>> Hi Peter,
>>>>
>>>> For our online batch task, there is a scene where the failed Container
>>>> reaches MAXIMUM_WORKERS_FAILURE_RATE but the client will not immediately
>>>> exit (the probability of JM loss is greatly improved when thousands of
>>>> Containers is to be started). It is found that the JM disconnection (the
>>>> reason for JM loss is unknown) will cause the notifyAllocationFailure not
>>>> to take effect.
>>>>
>>>> After the introduction of FLINK-13184
>>>> <https://jira.apache.org/jira/browse/FLINK-13184> to start  the
>>>> container with multi-threaded, the JM disconnection situation has been
>>>> alleviated. In order to stably implement the client immediate exit, we use
>>>> the following code to determine  whether call onFatalError when
>>>> MaximumFailedTaskManagerExceedingException is occurd:
>>>>
>>>> @Override
>>>> public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) {
>>>>    validateRunsInMainThread();
>>>>
>>>>    JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId);
>>>>    if (jobManagerRegistration != null) {
>>>>       jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause);
>>>>    } else {
>>>>       if (exitProcessOnJobManagerTimedout) {
>>>>          ResourceManagerException exception = new ResourceManagerException("Job Manager is lost, can not notify allocation failure.");
>>>>          onFatalError(exception);
>>>>       }
>>>>    }
>>>> }
>>>>
>>>>
>>>> Best regards,
>>>>
>>>> Anyang
>>>>
>>>>

Re: suggestion of FLINK-10868

Posted by Till Rohrmann <tr...@apache.org>.

Suggestion 1 makes sense. For the quick termination I think we need to
think a bit more about it to find a good solution also to support strict
SLA requirements.

Cheers,
Till

On Wed, Sep 11, 2019 at 11:11 AM Anyang Hu <hu...@gmail.com> wrote:

> Hi Till,
>
> Some of our online batch tasks have strict SLA requirements, and they are
> not allowed to be stuck for a long time. Therefore, we take a rude way to
> make the job exit immediately. The way to wait for connection recovery is a
> better solution. Maybe we need to add a timeout to wait for JM to restore
> the connection?
>
> For suggestion 1, make interval configurable, given that we have done it,
> and if we can, we hope to give back to the community.
>
> Best regards,
> Anyang
>
> Till Rohrmann <tr...@apache.org> 于2019年9月9日周一 下午3:09写道：
>
>> Hi Anyang,
>>
>> I think we cannot take your proposal because this means that whenever we
>> want to call notifyAllocationFailure when there is a connection problem
>> between the RM and the JM, then we fail the whole cluster. This is
>> something a robust and resilient system should not do because connection
>> problems are expected and need to be handled gracefully. Instead if one
>> deems the notifyAllocationFailure message to be very important, then one
>> would need to keep it and tell the JM once it has connected back.
>>
>> Cheers,
>> Till
>>
>> On Sun, Sep 8, 2019 at 11:26 AM Anyang Hu <hu...@gmail.com> wrote:
>>
>>> Hi Peter,
>>>
>>> For our online batch task, there is a scene where the failed Container
>>> reaches MAXIMUM_WORKERS_FAILURE_RATE but the client will not immediately
>>> exit (the probability of JM loss is greatly improved when thousands of
>>> Containers is to be started). It is found that the JM disconnection (the
>>> reason for JM loss is unknown) will cause the notifyAllocationFailure not
>>> to take effect.
>>>
>>> After the introduction of FLINK-13184
>>> <https://jira.apache.org/jira/browse/FLINK-13184> to start  the
>>> container with multi-threaded, the JM disconnection situation has been
>>> alleviated. In order to stably implement the client immediate exit, we use
>>> the following code to determine  whether call onFatalError when
>>> MaximumFailedTaskManagerExceedingException is occurd:
>>>
>>> @Override
>>> public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) {
>>>    validateRunsInMainThread();
>>>
>>>    JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId);
>>>    if (jobManagerRegistration != null) {
>>>       jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause);
>>>    } else {
>>>       if (exitProcessOnJobManagerTimedout) {
>>>          ResourceManagerException exception = new ResourceManagerException("Job Manager is lost, can not notify allocation failure.");
>>>          onFatalError(exception);
>>>       }
>>>    }
>>> }
>>>
>>>
>>> Best regards,
>>>
>>> Anyang
>>>
>>>

Re: suggestion of FLINK-10868

Posted by Anyang Hu <hu...@gmail.com>.

Hi Till,

Some of our online batch tasks have strict SLA requirements, and they are
not allowed to be stuck for a long time. Therefore, we take a rude way to
make the job exit immediately. The way to wait for connection recovery is a
better solution. Maybe we need to add a timeout to wait for JM to restore
the connection?

For suggestion 1, make interval configurable, given that we have done it,
and if we can, we hope to give back to the community.

Best regards,
Anyang

Till Rohrmann <tr...@apache.org> 于2019年9月9日周一 下午3:09写道：

> Hi Anyang,
>
> I think we cannot take your proposal because this means that whenever we
> want to call notifyAllocationFailure when there is a connection problem
> between the RM and the JM, then we fail the whole cluster. This is
> something a robust and resilient system should not do because connection
> problems are expected and need to be handled gracefully. Instead if one
> deems the notifyAllocationFailure message to be very important, then one
> would need to keep it and tell the JM once it has connected back.
>
> Cheers,
> Till
>
> On Sun, Sep 8, 2019 at 11:26 AM Anyang Hu <hu...@gmail.com> wrote:
>
>> Hi Peter,
>>
>> For our online batch task, there is a scene where the failed Container
>> reaches MAXIMUM_WORKERS_FAILURE_RATE but the client will not immediately
>> exit (the probability of JM loss is greatly improved when thousands of
>> Containers is to be started). It is found that the JM disconnection (the
>> reason for JM loss is unknown) will cause the notifyAllocationFailure not
>> to take effect.
>>
>> After the introduction of FLINK-13184
>> <https://jira.apache.org/jira/browse/FLINK-13184> to start  the
>> container with multi-threaded, the JM disconnection situation has been
>> alleviated. In order to stably implement the client immediate exit, we use
>> the following code to determine  whether call onFatalError when
>> MaximumFailedTaskManagerExceedingException is occurd:
>>
>> @Override
>> public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) {
>>    validateRunsInMainThread();
>>
>>    JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId);
>>    if (jobManagerRegistration != null) {
>>       jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause);
>>    } else {
>>       if (exitProcessOnJobManagerTimedout) {
>>          ResourceManagerException exception = new ResourceManagerException("Job Manager is lost, can not notify allocation failure.");
>>          onFatalError(exception);
>>       }
>>    }
>> }
>>
>>
>> Best regards,
>>
>> Anyang
>>
>>

Re: suggestion of FLINK-10868

Posted by Till Rohrmann <tr...@apache.org>.

Hi Anyang,

I think we cannot take your proposal because this means that whenever we
want to call notifyAllocationFailure when there is a connection problem
between the RM and the JM, then we fail the whole cluster. This is
something a robust and resilient system should not do because connection
problems are expected and need to be handled gracefully. Instead if one
deems the notifyAllocationFailure message to be very important, then one
would need to keep it and tell the JM once it has connected back.

Cheers,
Till

On Sun, Sep 8, 2019 at 11:26 AM Anyang Hu <hu...@gmail.com> wrote:

> Hi Peter,
>
> For our online batch task, there is a scene where the failed Container
> reaches MAXIMUM_WORKERS_FAILURE_RATE but the client will not immediately
> exit (the probability of JM loss is greatly improved when thousands of
> Containers is to be started). It is found that the JM disconnection (the
> reason for JM loss is unknown) will cause the notifyAllocationFailure not
> to take effect.
>
> After the introduction of FLINK-13184
> <https://jira.apache.org/jira/browse/FLINK-13184> to start  the container
> with multi-threaded, the JM disconnection situation has been alleviated. In
> order to stably implement the client immediate exit, we use the following
> code to determine  whether call onFatalError when
> MaximumFailedTaskManagerExceedingException is occurd:
>
> @Override
> public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) {
>    validateRunsInMainThread();
>
>    JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId);
>    if (jobManagerRegistration != null) {
>       jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause);
>    } else {
>       if (exitProcessOnJobManagerTimedout) {
>          ResourceManagerException exception = new ResourceManagerException("Job Manager is lost, can not notify allocation failure.");
>          onFatalError(exception);
>       }
>    }
> }
>
>
> Best regards,
>
> Anyang
>
>

Re: suggestion of FLINK-10868

Posted by Anyang Hu <hu...@gmail.com>.

Hi Peter,

For our online batch task, there is a scene where the failed Container
reaches MAXIMUM_WORKERS_FAILURE_RATE but the client will not immediately
exit (the probability of JM loss is greatly improved when thousands of
Containers is to be started). It is found that the JM disconnection (the
reason for JM loss is unknown) will cause the notifyAllocationFailure not
to take effect.

After the introduction of FLINK-13184
<https://jira.apache.org/jira/browse/FLINK-13184> to start  the container
with multi-threaded, the JM disconnection situation has been alleviated. In
order to stably implement the client immediate exit, we use the following
code to determine  whether call onFatalError when
MaximumFailedTaskManagerExceedingException is occurd:

@Override
public void notifyAllocationFailure(JobID jobId, AllocationID
allocationId, Exception cause) {
   validateRunsInMainThread();

   JobManagerRegistration jobManagerRegistration =
jobManagerRegistrations.get(jobId);
   if (jobManagerRegistration != null) {
      jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId,
cause);
   } else {
      if (exitProcessOnJobManagerTimedout) {
         ResourceManagerException exception = new
ResourceManagerException("Job Manager is lost, can not notify
allocation failure.");
         onFatalError(exception);
      }
   }
}


Best regards,

Anyang

Re: suggestion of FLINK-10868

Posted by Peter Huang <hu...@gmail.com>.

Hi Till,

1) From Anyang's request, I think it is reasonable to use two parameters
for the rate as a batch job runs for a while. The failure rate in a small
interval is meaningless.
I think they need a failure count from the beginning as the failure
condition.

@Anyang Hu <hu...@gmail.com>
2) In the current implementation, the
MaximumFailedTaskManagerExceedingException is SuppressRestartsException. It
will exit immediately.


Best Regards
Peter Huang




On Sun, Sep 8, 2019 at 1:27 AM Anyang Hu <hu...@gmail.com> wrote:

> Hi Till,
> Thank you for the reply.
>
> 1. The batch processing may be customized according to the usage scenario.
> For our online batch jobs, we set the interval parameter to 8h.
> 2. For our usage scenario, we need the client to exit immediately when the
> failed Container reaches MAXIMUM_WORKERS_FAILURE_RATE.
>
> Best Regards,
> Anyang
>
> Till Rohrmann <tr...@apache.org> 于2019年9月6日周五 下午9:33写道：
>
>> Hi Anyang,
>>
>> thanks for your suggestions.
>>
>> 1) I guess one needs to make this interval configurable. A session
>> cluster could theoretically execute batch as well as streaming tasks and,
>> hence, I doubt that there is an optimal value. Maybe the default could be a
>> bit longer than 1 min, though.
>>
>> 2) Which component to do you want to let terminate immediately?
>>
>> I think we can consider your input while reviewing the PR. If it would be
>> a bigger change, then it would be best to create a follow up issue once
>> FLINK-10868 has been merged.
>>
>> Cheers,
>> Till
>>
>> On Fri, Sep 6, 2019 at 11:42 AM Anyang Hu <hu...@gmail.com> wrote:
>>
>>> Thank you for the reply and look forward to the advice of Till.
>>>
>>> Anyang
>>>
>>> Peter Huang <hu...@gmail.com> 于2019年9月5日周四 下午11:53写道：
>>>
>>>> Hi Anyang,
>>>>
>>>> Thanks for raising it up. I think it is reasonable as what you
>>>> requested is needed for batch. Let's wait for Till to give some more input.
>>>>
>>>>
>>>>
>>>> Best Regards
>>>> Peter Huang
>>>>
>>>> On Thu, Sep 5, 2019 at 7:02 AM Anyang Hu <hu...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Peter&Till:
>>>>>
>>>>> As commented in the issue
>>>>> <https://issues.apache.org/jira/browse/FLINK-10868#>，We have
>>>>> introduced the FLINK-10868
>>>>> <https://issues.apache.org/jira/browse/FLINK-10868> patch (mainly
>>>>> batch tasks) online, what do you think of the following two suggestions:
>>>>>
>>>>> 1) Parameter control time interval. At present, the default time
>>>>> interval of 1 min is used, which is too short for batch tasks;
>>>>>
>>>>> 2)Parameter Control When the failed Container number reaches
>>>>> MAXIMUM_WORKERS_FAILURE_RATE and JM disconnects whether to perform
>>>>> OnFatalError so that the batch tasks can exit as soon as possible.
>>>>>
>>>>> Best regards,
>>>>> Anyang
>>>>>
>>>>

Re: suggestion of FLINK-10868

Posted by Anyang Hu <hu...@gmail.com>.

Hi Till,
Thank you for the reply.

1. The batch processing may be customized according to the usage scenario.
For our online batch jobs, we set the interval parameter to 8h.
2. For our usage scenario, we need the client to exit immediately when the
failed Container reaches MAXIMUM_WORKERS_FAILURE_RATE.

Best Regards,
Anyang

Till Rohrmann <tr...@apache.org> 于2019年9月6日周五 下午9:33写道：

> Hi Anyang,
>
> thanks for your suggestions.
>
> 1) I guess one needs to make this interval configurable. A session cluster
> could theoretically execute batch as well as streaming tasks and, hence, I
> doubt that there is an optimal value. Maybe the default could be a bit
> longer than 1 min, though.
>
> 2) Which component to do you want to let terminate immediately?
>
> I think we can consider your input while reviewing the PR. If it would be
> a bigger change, then it would be best to create a follow up issue once
> FLINK-10868 has been merged.
>
> Cheers,
> Till
>
> On Fri, Sep 6, 2019 at 11:42 AM Anyang Hu <hu...@gmail.com> wrote:
>
>> Thank you for the reply and look forward to the advice of Till.
>>
>> Anyang
>>
>> Peter Huang <hu...@gmail.com> 于2019年9月5日周四 下午11:53写道：
>>
>>> Hi Anyang,
>>>
>>> Thanks for raising it up. I think it is reasonable as what you requested
>>> is needed for batch. Let's wait for Till to give some more input.
>>>
>>>
>>>
>>> Best Regards
>>> Peter Huang
>>>
>>> On Thu, Sep 5, 2019 at 7:02 AM Anyang Hu <hu...@gmail.com> wrote:
>>>
>>>> Hi Peter&Till:
>>>>
>>>> As commented in the issue
>>>> <https://issues.apache.org/jira/browse/FLINK-10868#>，We have
>>>> introduced the FLINK-10868
>>>> <https://issues.apache.org/jira/browse/FLINK-10868> patch (mainly
>>>> batch tasks) online, what do you think of the following two suggestions:
>>>>
>>>> 1) Parameter control time interval. At present, the default time
>>>> interval of 1 min is used, which is too short for batch tasks;
>>>>
>>>> 2)Parameter Control When the failed Container number reaches
>>>> MAXIMUM_WORKERS_FAILURE_RATE and JM disconnects whether to perform
>>>> OnFatalError so that the batch tasks can exit as soon as possible.
>>>>
>>>> Best regards,
>>>> Anyang
>>>>
>>>

Re: suggestion of FLINK-10868

Posted by Till Rohrmann <tr...@apache.org>.

Hi Anyang,

thanks for your suggestions.

1) I guess one needs to make this interval configurable. A session cluster
could theoretically execute batch as well as streaming tasks and, hence, I
doubt that there is an optimal value. Maybe the default could be a bit
longer than 1 min, though.

2) Which component to do you want to let terminate immediately?

I think we can consider your input while reviewing the PR. If it would be a
bigger change, then it would be best to create a follow up issue once
FLINK-10868 has been merged.

Cheers,
Till

On Fri, Sep 6, 2019 at 11:42 AM Anyang Hu <hu...@gmail.com> wrote:

> Thank you for the reply and look forward to the advice of Till.
>
> Anyang
>
> Peter Huang <hu...@gmail.com> 于2019年9月5日周四 下午11:53写道：
>
>> Hi Anyang,
>>
>> Thanks for raising it up. I think it is reasonable as what you requested
>> is needed for batch. Let's wait for Till to give some more input.
>>
>>
>>
>> Best Regards
>> Peter Huang
>>
>> On Thu, Sep 5, 2019 at 7:02 AM Anyang Hu <hu...@gmail.com> wrote:
>>
>>> Hi Peter&Till:
>>>
>>> As commented in the issue
>>> <https://issues.apache.org/jira/browse/FLINK-10868#>，We have introduced
>>> the FLINK-10868 <https://issues.apache.org/jira/browse/FLINK-10868> patch
>>> (mainly batch tasks) online, what do you think of the following two
>>> suggestions:
>>>
>>> 1) Parameter control time interval. At present, the default time
>>> interval of 1 min is used, which is too short for batch tasks;
>>>
>>> 2)Parameter Control When the failed Container number reaches
>>> MAXIMUM_WORKERS_FAILURE_RATE and JM disconnects whether to perform
>>> OnFatalError so that the batch tasks can exit as soon as possible.
>>>
>>> Best regards,
>>> Anyang
>>>
>>

Re: suggestion of FLINK-10868

Posted by Anyang Hu <hu...@gmail.com>.

Thank you for the reply and look forward to the advice of Till.

Anyang

Peter Huang <hu...@gmail.com> 于2019年9月5日周四 下午11:53写道：

> Hi Anyang,
>
> Thanks for raising it up. I think it is reasonable as what you requested
> is needed for batch. Let's wait for Till to give some more input.
>
>
>
> Best Regards
> Peter Huang
>
> On Thu, Sep 5, 2019 at 7:02 AM Anyang Hu <hu...@gmail.com> wrote:
>
>> Hi Peter&Till:
>>
>> As commented in the issue
>> <https://issues.apache.org/jira/browse/FLINK-10868#>，We have introduced
>> the FLINK-10868 <https://issues.apache.org/jira/browse/FLINK-10868> patch
>> (mainly batch tasks) online, what do you think of the following two
>> suggestions:
>>
>> 1) Parameter control time interval. At present, the default time
>> interval of 1 min is used, which is too short for batch tasks;
>>
>> 2)Parameter Control When the failed Container number reaches
>> MAXIMUM_WORKERS_FAILURE_RATE and JM disconnects whether to perform
>> OnFatalError so that the batch tasks can exit as soon as possible.
>>
>> Best regards,
>> Anyang
>>
>

Re: suggestion of FLINK-10868

Posted by Peter Huang <hu...@gmail.com>.

Hi Anyang,

Thanks for raising it up. I think it is reasonable as what you requested is
needed for batch. Let's wait for Till to give some more input.



Best Regards
Peter Huang

On Thu, Sep 5, 2019 at 7:02 AM Anyang Hu <hu...@gmail.com> wrote:

> Hi Peter&Till:
>
> As commented in the issue
> <https://issues.apache.org/jira/browse/FLINK-10868#>，We have introduced
> the FLINK-10868 <https://issues.apache.org/jira/browse/FLINK-10868> patch
> (mainly batch tasks) online, what do you think of the following two
> suggestions:
>
> 1) Parameter control time interval. At present, the default time interval
> of 1 min is used, which is too short for batch tasks;
>
> 2)Parameter Control When the failed Container number reaches
> MAXIMUM_WORKERS_FAILURE_RATE and JM disconnects whether to perform
> OnFatalError so that the batch tasks can exit as soon as possible.
>
> Best regards,
> Anyang
>