Posted to dev@flink.apache.org by "dyana.rose" <dy...@salecycle.com> on 2019/05/01 08:17:38 UTC

Re: HA lock nodes, Checkpoints, and JobGraphs after failure

Like all the best problems, I can't get this to reproduce locally.

Everything has worked as expected. I started up a test job with 5 retained checkpoints, let it run and watched the nodes in ZooKeeper.

Then I shut down and restarted the Flink cluster.

The ephemeral lock nodes in the retained checkpoints transitioned from one lock id to another without a hitch.

So that's good.

As I understand it, if the ZooKeeper cluster is having a sync issue, ephemeral nodes may not get deleted when the session becomes inactive. We're new to running our own ZooKeeper so it may be down to that.

Re: HA lock nodes, Checkpoints, and JobGraphs after failure

Posted by Till Rohrmann <tr...@apache.org>.

Great to hear Dyana. Thanks for the update.

Cheers,
Till

On Fri, Jun 7, 2019 at 2:48 PM dyana.rose <dy...@salecycle.com> wrote:

> Just wanted to give an update on this.
>
> Our ops team and I independently came to the same conclusion that our
> ZooKeeper quorum was having syncing issues.
>
> After a bit more research, they have updated the initLimit and syncLimit
> in the quorum configs to:
> initLimit=10
> syncLimit=5
>
> After this change we no longer saw any of the issues we were having.
>
> Thanks,
> Dyana
>
> On 2019/05/02 08:43:14, Till Rohrmann <tr...@apache.org> wrote:
> > Thanks for the update Dyana. I'm also not an expert in running one's own
> > ZooKeeper cluster. It might be related to setting up the ZooKeeper cluster
> > properly. Maybe someone else from the community has experience with
> > this. Therefore, I'm cross posting this thread to the user ML again to have
> > a wider reach.
> >
> > Cheers,
> > Till
> >
> > On Wed, May 1, 2019 at 10:17 AM dyana.rose <dy...@salecycle.com> wrote:
> >
> > > Like all the best problems, I can't get this to reproduce locally.
> > >
> > > Everything has worked as expected. I started up a test job with 5 retained
> > > checkpoints, let it run and watched the nodes in ZooKeeper.
> > >
> > > Then I shut down and restarted the Flink cluster.
> > >
> > > The ephemeral lock nodes in the retained checkpoints transitioned from one
> > > lock id to another without a hitch.
> > >
> > > So that's good.
> > >
> > > As I understand it, if the ZooKeeper cluster is having a sync issue,
> > > ephemeral nodes may not get deleted when the session becomes inactive.
> > > We're new to running our own ZooKeeper so it may be down to that.
> > >
> >
>

Re: HA lock nodes, Checkpoints, and JobGraphs after failure

Posted by "dyana.rose" <dy...@salecycle.com>.

Just wanted to give an update on this.

Our ops team and I independently came to the same conclusion that our ZooKeeper quorum was having syncing issues.

After a bit more research, they have updated the initLimit and syncLimit in the quorum configs to:
initLimit=10
syncLimit=5

After this change we no longer saw any of the issues we were having.
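
For readers unfamiliar with these two settings: ZooKeeper expresses both in ticks, so the wall-clock limits depend on tickTime. initLimit bounds how long a follower may take to connect and sync to the leader, and syncLimit bounds how far a follower may lag before it is dropped. The sketch below converts the values above to milliseconds. The tickTime of 2000 ms is an assumption (it is the value in ZooKeeper's sample config, and the thread does not state what was used), and the helper names are hypothetical.

```python
def parse_zoo_cfg(text):
    """Parse simple key=value lines from a zoo.cfg-style string."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def quorum_timeouts_ms(tick_time_ms, init_limit, sync_limit):
    """Convert tick-based quorum limits into milliseconds."""
    return {
        "init_timeout_ms": init_limit * tick_time_ms,
        "sync_timeout_ms": sync_limit * tick_time_ms,
    }


EXAMPLE_CFG = """
# tickTime is an assumed value; the thread does not state it.
tickTime=2000
initLimit=10
syncLimit=5
"""

cfg = parse_zoo_cfg(EXAMPLE_CFG)
timeouts = quorum_timeouts_ms(
    int(cfg["tickTime"]), int(cfg["initLimit"]), int(cfg["syncLimit"])
)
print(timeouts)  # {'init_timeout_ms': 20000, 'sync_timeout_ms': 10000}
```

Under that assumed tickTime, followers get 20 seconds to connect and sync initially and may fall at most 10 seconds behind the leader. A different tickTime scales both limits proportionally.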

Thanks,
Dyana

On 2019/05/02 08:43:14, Till Rohrmann <tr...@apache.org> wrote: 
> Thanks for the update Dyana. I'm also not an expert in running one's own
> ZooKeeper cluster. It might be related to setting up the ZooKeeper cluster
> properly. Maybe someone else from the community has experience with
> this. Therefore, I'm cross posting this thread to the user ML again to have
> a wider reach.
> 
> Cheers,
> Till
> 
> On Wed, May 1, 2019 at 10:17 AM dyana.rose <dy...@salecycle.com> wrote:
> 
> > Like all the best problems, I can't get this to reproduce locally.
> >
> > Everything has worked as expected. I started up a test job with 5 retained
> > checkpoints, let it run and watched the nodes in ZooKeeper.
> >
> > Then I shut down and restarted the Flink cluster.
> >
> > The ephemeral lock nodes in the retained checkpoints transitioned from one
> > lock id to another without a hitch.
> >
> > So that's good.
> >
> > As I understand it, if the ZooKeeper cluster is having a sync issue,
> > ephemeral nodes may not get deleted when the session becomes inactive.
> > We're new to running our own ZooKeeper so it may be down to that.
> >
>

Re: HA lock nodes, Checkpoints, and JobGraphs after failure

Posted by Till Rohrmann <tr...@apache.org>.

Thanks for the update Dyana. I'm also not an expert in running one's own
ZooKeeper cluster. It might be related to setting up the ZooKeeper cluster
properly. Maybe someone else from the community has experience with
this. Therefore, I'm cross posting this thread to the user ML again to have
a wider reach.

Cheers,
Till

On Wed, May 1, 2019 at 10:17 AM dyana.rose <dy...@salecycle.com> wrote:

> Like all the best problems, I can't get this to reproduce locally.
>
> Everything has worked as expected. I started up a test job with 5 retained
> checkpoints, let it run and watched the nodes in ZooKeeper.
>
> Then I shut down and restarted the Flink cluster.
>
> The ephemeral lock nodes in the retained checkpoints transitioned from one
> lock id to another without a hitch.
>
> So that's good.
>
> As I understand it, if the ZooKeeper cluster is having a sync issue,
> ephemeral nodes may not get deleted when the session becomes inactive.
> We're new to running our own ZooKeeper so it may be down to that.
>
