Posted to user@flink.apache.org by Renjie Liu <li...@gmail.com> on 2017/03/23 10:09:02 UTC

Task manager number mismatch container number on mesos

Hi all,
We are using Flink 1.2.0 on Mesos. Occasionally, the number of task managers
does not match the number of Mesos containers: a Mesos container still exists,
but its task manager cannot be found on the Flink master's monitoring web page.
This doesn't happen often and is hard to reproduce.
-- 
Liu, Renjie
Software Engineer, MVAD

Re: Task manager number mismatch container number on mesos

Posted by Renjie Liu <li...@gmail.com>.
Attached are the task manager's log, jstack output, mixed-mode jstack output,
and heap usage.
[image: pasted1]
It seems that active threads are blocked while allocating memory, but no GC is
triggered and memory usage is low.
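
For readers who want to check a similar dump themselves, here is a minimal sketch that tallies thread states in a plain (not mixed-mode) jstack dump. It assumes the HotSpot text format, in which each Java thread is followed by a "java.lang.Thread.State:" line. The function name and the sample excerpt are hypothetical and are not taken from the attached files.

```python
import re
from collections import Counter

# Matches the state line that HotSpot's jstack prints under each Java thread header.
STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State: (\w+)", re.MULTILINE)


def count_thread_states(dump_text):
    """Return a Counter mapping thread state (e.g. BLOCKED) to thread count."""
    return Counter(STATE_RE.findall(dump_text))


# Hypothetical two-thread excerpt for illustration; not from the attached dump.
SAMPLE_DUMP = '''"worker-1" #12 prio=5 os_prio=0 tid=0x01 nid=0x10 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"worker-2" #13 prio=5 os_prio=0 tid=0x02 nid=0x11 runnable
   java.lang.Thread.State: RUNNABLE
'''

if __name__ == "__main__":
    # Print each state with its thread count, most common first.
    for state, count in count_thread_states(SAMPLE_DUMP).most_common():
        print(state, count)
```

A large BLOCKED or WAITING count alongside low heap usage would be consistent with the observation above, although it would not by itself explain why the task manager is lost.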

On Mon, Apr 10, 2017 at 2:06 PM Renjie Liu <li...@gmail.com> wrote:

> I'm using the Mesos 1.0.1 client, but our cluster runs Mesos 0.26.0. Could
> this be the cause?
>
> On Mon, Apr 10, 2017 at 2:05 PM Renjie Liu <li...@gmail.com>
> wrote:
>
> This happened again.
> I've checked the job manager's log, and it reports the loss of the task
> manager, as expected.
> However, there's nothing useful in the task manager's log. I've checked
> the jstack output, and what's interesting is that several threads are
> blocked while allocating memory, even though JVM heap usage is low and no
> GC occurs.
>
> On Thu, Mar 23, 2017 at 10:24 PM Renjie Liu <li...@gmail.com>
> wrote:
>
> I'm not sure how to reproduce this bug, and I'll post it next time it
> happens.
>
> On Thu, Mar 23, 2017 at 10:21 PM Robert Metzger <rm...@apache.org>
> wrote:
>
> Could you provide the logs of the task manager that still runs as a
> container but doesn't show up as a Taskmanager?
>
> On Thu, Mar 23, 2017 at 11:38 AM, Renjie Liu <li...@gmail.com>
> wrote:
>
> Permanent. I've waited for several minutes and the task manager is still
> lost.
>
> On Thu, Mar 23, 2017 at 6:34 PM Ufuk Celebi <uc...@apache.org> wrote:
>
> When it happens, is it temporary or permanent?
>
> Looping in Till and Eron who worked on the Mesos runner.
>
> – Ufuk
>
> On Thu, Mar 23, 2017 at 11:09 AM, Renjie Liu <li...@gmail.com>
> wrote:
> > Hi all,
> > We are using Flink 1.2.0 on Mesos. Occasionally, the number of task
> > managers does not match the number of Mesos containers: a Mesos container
> > still exists, but its task manager cannot be found on the Flink master's
> > monitoring web page.
> > This doesn't happen often and is hard to reproduce.
> > --
> > Liu, Renjie
> > Software Engineer, MVAD
-- 
Liu, Renjie
Software Engineer, MVAD

Re: Task manager number mismatch container number on mesos

Posted by Renjie Liu <li...@gmail.com>.
I'm using the Mesos 1.0.1 client, but our cluster runs Mesos 0.26.0. Could
this be the cause?

-- 
Liu, Renjie
Software Engineer, MVAD

Re: Task manager number mismatch container number on mesos

Posted by Renjie Liu <li...@gmail.com>.
This happened again.
I've checked the job manager's log, and it reports the loss of the task
manager, as expected.
However, there's nothing useful in the task manager's log. I've checked
the jstack output, and what's interesting is that several threads are
blocked while allocating memory, even though JVM heap usage is low and no
GC occurs.

-- 
Liu, Renjie
Software Engineer, MVAD

Re: Task manager number mismatch container number on mesos

Posted by Renjie Liu <li...@gmail.com>.
I'm not sure how to reproduce this bug, and I'll post it next time it
happens.

-- 
Liu, Renjie
Software Engineer, MVAD

Re: Task manager number mismatch container number on mesos

Posted by Robert Metzger <rm...@apache.org>.
Could you provide the logs of the task manager that still runs as a
container but doesn't show up as a Taskmanager?


Re: Task manager number mismatch container number on mesos

Posted by Renjie Liu <li...@gmail.com>.
Permanent. I've waited for several minutes and the task manager is still
lost.

-- 
Liu, Renjie
Software Engineer, MVAD

Re: Task manager number mismatch container number on mesos

Posted by Ufuk Celebi <uc...@apache.org>.
When it happens, is it temporary or permanent?

Looping in Till and Eron who worked on the Mesos runner.

– Ufuk
