Posted to user@flink.apache.org by Cliff Resnick <cr...@gmail.com> on 2018/11/08 20:59:13 UTC

Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

I'm running a YARN cluster of 8 * 4-core instances = 32 cores, with a
configuration of 3 slots per TM. The cluster is dedicated to a single job
that runs at full capacity in "FLIP6" mode. So in this cluster, the
parallelism is 21 (7 TMs * 3 slots, with one container dedicated to the
Job Manager).
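
For concreteness, the launch looks roughly like this (flag names are from
the stock 1.6 CLI; memory values and paths are illustrative, not our exact
script):

    # Illustrative launch, roughly the shape of our deploy script.
    # With -p 21 and -ys 3, FLIP6 mode should request ceil(21 / 3) = 7 TMs;
    # the TM count is not passed explicitly, it falls out of parallelism/slots.
    flink run -m yarn-cluster \
        -p 21 -ys 3 \
        -yjm 2048 -ytm 4096 \
        -s "$SAVEPOINT_PATH" \
        path/to/our-job.jar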

When I run the job in 1.6.0, seven Task Managers are spun up as expected.
But if I run with 1.6.2, only four Task Managers spin up and the job
hangs waiting for more resources.

Our Flink distribution is set up by script after building from source,
so aside from the Flink jars, the 1.6.0 and 1.6.2 directories are
identical. The job is the same, restarting from a savepoint. The problem
is repeatable.

Has something changed in 1.6.2, and if so can it be remedied with a config
change?

Re: Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

Posted by Till Rohrmann <tr...@apache.org>.
Good to hear Cliff.

You're right that it's not a nice user experience. The problem with
queryable state is that one would need to inspect the actual user job to
decide whether it uses queryable state or not, but by then it's already
too late to start the infrastructure needed for querying the state.
You're right, though, that we should at least pick a random port by
default. I've created a corresponding issue for this:
https://issues.apache.org/jira/browse/FLINK-10866.
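
Until then, a possible workaround (assuming the 1.6 option names, and
assuming the failure really is colocated TMs fighting over the single
default port) is to configure a port range for the queryable state server
and proxy, e.g.:

    # Sketch: append a port range for the queryable state server/proxy to
    # flink-conf.yaml so colocated TMs on one node can each bind a free port.
    # Option names as documented for 1.6; please verify for your version.
    printf '%s\n' \
        'queryable-state.proxy.ports: 50100-50120' \
        'queryable-state.server.ports: 50200-50220' \
        >> conf/flink-conf.yaml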

Cheers,
Till

On Mon, Nov 12, 2018 at 11:16 PM Cliff Resnick <cr...@gmail.com> wrote:

> [...]

Re: Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

Posted by Cliff Resnick <cr...@gmail.com>.
Hi Till,

Yes, it turns out the problem was having
flink-queryable-state-runtime_2.11-1.6.2.jar in flink/lib. I guess
Queryable State bootstraps itself and, in my situation, it brought the
task manager down when it found no available ports. What's a little
troubling is that I had not configured Queryable State at all, so I would
not expect it to get in the way. I haven't looked further into it, but I
think that if Queryable State wants to enable itself then it should at
worst take an unused port by default, especially since many folks will be
running in shared environments like YARN.
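
In case anyone else runs into this, the fix on our side was simply to
keep that jar out of flink/lib; a sketch (the binary distribution ships
it in opt/, precisely so that it's opt-in):

    # Sketch: make queryable state opt-in again by moving the runtime jar
    # out of lib/ (run from the Flink distribution root).
    mv lib/flink-queryable-state-runtime_2.11-1.6.2.jar opt/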

But anyway, thanks for that! I'm now up with 1.6.2.

Cliff

On Mon, Nov 12, 2018 at 6:04 AM Till Rohrmann <tr...@apache.org> wrote:

> [...]

Re: Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

Posted by Till Rohrmann <tr...@apache.org>.
Hi Cliff,

the TaskManagers fail to start with exit code 31, which indicates an
initialization error on startup. If you check the TaskManager logs via
`yarn logs -applicationId <APP_ID>`, you should see why the TMs don't
start up.
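
For example (the application id below is a placeholder; the grep is just
a convenience for finding the failure):

    # Fetch the aggregated container logs and search for the startup error.
    yarn logs -applicationId application_1541700000000_0001 > app.log
    grep -i -E -B 2 -A 20 'exception|error' app.log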

Cheers,
Till

On Fri, Nov 9, 2018 at 8:32 PM Cliff Resnick <cr...@gmail.com> wrote:

> [...]

Re: Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

Posted by Cliff Resnick <cr...@gmail.com>.
Hi Till,

Here are the Job Manager logs for the same job in both 1.6.0 and 1.6.2,
at DEBUG level. I saw several errors in 1.6.2; hope it's informative!

Cliff

On Fri, Nov 9, 2018 at 8:34 AM Till Rohrmann <tr...@apache.org> wrote:

> [...]

Re: Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

Posted by Till Rohrmann <tr...@apache.org>.
Hi Cliff,

this doesn't sound right. Could you share the logs of the Yarn cluster
entrypoint with the community for further debugging, ideally on DEBUG
level? The Yarn logs would also be helpful for fully understanding the
problem. Thanks a lot!
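
To get DEBUG logs, you can bump the root logger in the distribution's
log4j config before redeploying; a sketch, assuming the stock
conf/log4j.properties that ships with Flink:

    # Sketch: raise Flink's root log level from INFO to DEBUG
    # (assumes the default log4j.properties shipped with the distribution).
    sed -i 's/^log4j.rootLogger=INFO/log4j.rootLogger=DEBUG/' conf/log4j.properties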

Cheers,
Till

On Thu, Nov 8, 2018 at 9:59 PM Cliff Resnick <cr...@gmail.com> wrote:

> [...]