Posted to dev@gobblin.apache.org by "Abhishek Tiwari (JIRA)" <ji...@apache.org> on 2017/08/15 20:57:00 UTC

[jira] [Resolved] (GOBBLIN-159) Gobblin Cluster graceful shutdown of master and workers

     [ https://issues.apache.org/jira/browse/GOBBLIN-159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Tiwari resolved GOBBLIN-159.
-------------------------------------
    Resolution: Fixed

> Gobblin Cluster graceful shutdown of master and workers
> -------------------------------------------------------
>
>                 Key: GOBBLIN-159
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-159
>             Project: Apache Gobblin
>          Issue Type: Bug
>            Reporter: Abhishek Tiwari
>            Assignee: Zhixiong Chen
>
> Relevant chat from Gitter channel: 
> *Joel Baranick @kadaan Jun 30 10:47*
> Up scaling seems to work great. But down scaling caused problems with the cluster.
> Basically, once the cpu dropped enough to start down scaling, something broke where it stopped processing jobs.
> I’m concerned that the down scaling is not graceful and that the cluster doesn’t respond nicely to workers leaving the cluster in the middle of processing.
> There are a couple problems I see. One is that the workers don't gracefully stop running tasks and allow them to be picked up by other nodes.
> The other is that if task publishing is used, partial data might be published when the node goes away. How does the task get completed without possibly duplicating data?
> *Joel Baranick @kadaan Jun 30 12:07*
> @abti What I'm wondering is how we can shutdown a worker node and have it gracefully stop working.
> *Joel Baranick @kadaan Jun 30 12:52*
> Also, seems like .../taskstates/... as well as the job...job.state file in NFS don't get purged.
> Our NFS is experiencing unbounded growth. Are we missing a setting or service?
> *Abhishek Tiwari @abti Jun 30 15:36*
> I didn’t fully understand the issue. Did you see the workers abruptly cancel the task or did they wait for it to finish before shutting down? If the worker waits around enough for Task to finish, the task level publish should be fine?
> *Joel Baranick @kadaan Jun 30 15:37*
> The workers never shut down.
> *Abhishek Tiwari @abti Jun 30 15:38*
> could be because they wait for graceful shutdown but do not leave the cluster and are assigned new tasks by Helix?
> *Joel Baranick @kadaan Jun 30 15:39*
> I think one issue is that there is an org.quartz.UnableToInterruptJobException in JobScheduler.shutDown which causes it to never run ExecutorsUtils.shutdownExecutorService(this.jobExecutor, Optional.of(LOG));
> *Abhishek Tiwari @abti Jun 30 15:40*
> also taskstates should get cleaned up, check with @htran1 too .. only wu probably should be left around
> we need to add some cleaning mechanism for that
> we don't recall seeing the lurking state files
> *Joel Baranick @kadaan Jun 30 15:47*
> In my EFS/NFS, I have tons (> 6000) of files remaining under .../_taskstates/... for jobs/tasks that have been completed for ages.
> *Abhishek Tiwari @abti Jun 30 16:29*
> wow, that's unexpected, did master switch while several jobs were going on?
> *Joel Baranick @kadaan Jun 30 17:23*
> There isn't a way for master to switch without jobs running as they don't cancel correctly.
> *Joel Baranick @kadaan Jul 05 14:22*
> @abti I was looking at fixing the cancellation problem.
> From what I can tell, GobblinHelixJob needs to implement InterruptableJob.
> And it needs to call jobLauncher.cancelJob(jobListener); when it is invoked.
> Does this seem right? Anything I'm missing?
> *Abhishek Tiwari @abti Jul 06 00:34*
> looks about right
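The first bug Joel identifies is that an org.quartz.UnableToInterruptJobException thrown during JobScheduler.shutDown prevents ExecutorsUtils.shutdownExecutorService from ever running. A minimal, self-contained sketch of the fix is below; the class and method names (SchedulerShutdownSketch, interruptAllJobs) are hypothetical stand-ins for the real Quartz and Gobblin types, and the point is only the try/catch/finally ordering that guarantees the executor shutdown always runs.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: these names stand in for the real Gobblin/Quartz
// classes; only the control flow (finally guaranteeing executor shutdown)
// reflects the fix discussed above.
public class SchedulerShutdownSketch {

    // Stand-in for org.quartz.UnableToInterruptJobException.
    static class UnableToInterruptJobException extends Exception {
        UnableToInterruptJobException(String msg) { super(msg); }
    }

    // Simulates a Quartz scheduler whose running jobs cannot be interrupted
    // (e.g. because they do not implement InterruptableJob).
    static void interruptAllJobs() throws UnableToInterruptJobException {
        throw new UnableToInterruptJobException(
            "job does not implement InterruptableJob");
    }

    // Shut down, making sure the job executor is closed even when
    // interrupting running Quartz jobs fails -- the bug described in
    // the chat was that the exception skipped the executor shutdown.
    static boolean shutDown(ExecutorService jobExecutor) {
        try {
            interruptAllJobs();
        } catch (UnableToInterruptJobException e) {
            // Log and continue; do NOT let this abort the shutdown sequence.
            System.err.println("Could not interrupt running jobs: "
                + e.getMessage());
        } finally {
            jobExecutor.shutdown();
        }
        return jobExecutor.isShutdown();
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        System.out.println("executor shut down: " + shutDown(executor));
    }
}
```

With the shutdown call in a finally block, the executor is closed whether or not the Quartz interrupt succeeds.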

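The fix Joel proposes for the cancellation problem is to have GobblinHelixJob implement Quartz's InterruptableJob and delegate to jobLauncher.cancelJob(jobListener) when interrupted. A self-contained sketch of that shape follows; the minimal interfaces here are hypothetical stand-ins for the real org.quartz.InterruptableJob and Gobblin's JobLauncher/JobListener, kept only to make the example runnable.

```java
// Hypothetical, self-contained sketch of the proposed fix: the nested
// interfaces stand in for the real Quartz and Gobblin types.
public class GobblinHelixJobSketch {

    // Stand-in for org.quartz.InterruptableJob.
    interface InterruptableJob {
        void interrupt() throws Exception;
    }

    // Stand-ins for Gobblin's job launcher and job listener.
    interface JobListener { }
    interface JobLauncher {
        void cancelJob(JobListener listener);
    }

    // Sketch of GobblinHelixJob with the interrupt() hook Joel describes:
    // when Quartz interrupts the job, cancellation is delegated to the
    // underlying job launcher so the worker can stop gracefully.
    static class GobblinHelixJob implements InterruptableJob {
        private final JobLauncher jobLauncher;
        private final JobListener jobListener;
        private volatile boolean cancelled = false;

        GobblinHelixJob(JobLauncher launcher, JobListener listener) {
            this.jobLauncher = launcher;
            this.jobListener = listener;
        }

        @Override
        public void interrupt() {
            this.jobLauncher.cancelJob(this.jobListener);
            this.cancelled = true;
        }

        boolean isCancelled() { return this.cancelled; }
    }

    public static void main(String[] args) {
        JobListener listener = new JobListener() { };
        JobLauncher launcher = l -> System.out.println("cancelJob called");
        GobblinHelixJob job = new GobblinHelixJob(launcher, listener);
        job.interrupt();
        System.out.println("cancelled: " + job.isCancelled());
    }
}
```

Implementing InterruptableJob is what lets JobScheduler's interrupt call succeed instead of throwing UnableToInterruptJobException, tying this fix back to the shutdown failure discussed earlier in the thread.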


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)