You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@storm.apache.org by "Jungtaek Lim (JIRA)" <ji...@apache.org> on 2018/01/05 06:50:00 UTC

[jira] [Assigned] (STORM-2879) Supervisor collapse continuously when there is a expired assignment for overdue storm

     [ https://issues.apache.org/jira/browse/STORM-2879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim reassigned STORM-2879:
-----------------------------------

    Assignee: Yuzhao Chen

> Supervisor collapse continuously when there is a expired assignment for overdue storm
> -------------------------------------------------------------------------------------
>
>                 Key: STORM-2879
>                 URL: https://issues.apache.org/jira/browse/STORM-2879
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core, storm-server
>    Affects Versions: 2.0.0, 1.x
>            Reporter: Yuzhao Chen
>            Assignee: Yuzhao Chen
>            Priority: Critical
>              Labels: patch, pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> For now, when a topology is reassigned or killed for a cluster, supervisor will delete 4 files for an overdue storm:
> - storm-code
> - storm-ser
> - storm-jar
> - LocalAssignment
> Slot.java
> static DynamicState cleanupCurrentContainer(DynamicState dynamicState, StaticState staticState, MachineState nextState) throws Exception {
>         assert(dynamicState.container != null);
>         assert(dynamicState.currentAssignment != null);
>         assert(dynamicState.container.areAllProcessesDead());
>         
>         dynamicState.container.cleanUp();
>         staticState.localizer.releaseSlotFor(dynamicState.currentAssignment, staticState.port);
>         DynamicState ret = dynamicState.withCurrentAssignment(null, null);
>         if (nextState != null) {
>             ret = ret.withState(nextState);
>         }
>         return ret;
>     }
> But we do not make a transaction to do this, if an exception occurred during deleting storm-code/ser/jar, an overdue local assignment will be left on disk.
> Then when supervisor restart from the exception above, the slots will be initial and container will be recovered from LocalAssignments, the blob store will fetch the files from Nimbus/Master, but will get a KeyNotFoundException, and supervisor collapses again.
> This will happens continuously and supervisor will never recover until we clean up all the local assignments manually.
> This is the stack:
> 2017-12-27 14:15:04.434 o.a.s.l.AsyncLocalizer [INFO] Cleaning up unused topologies in /opt/meituan/storm/data/supervisor/stormdist
> 2017-12-27 14:15:04.434 o.a.s.d.s.AdvancedFSOps [INFO] Deleting path /opt/meituan/storm/data/supervisor/stormdist/app_dpsr_realtime_shop_vane_allcates-14-1513685785
> 2017-12-27 14:15:04.445 o.a.s.d.s.Slot [INFO] STATE EMPTY msInState: 109 -> WAITING_FOR_BASIC_LOCALIZATION msInState: 1
> 2017-12-27 14:15:04.471 o.a.s.d.s.Supervisor [INFO] Starting supervisor with id 255d3fed-f3ee-4c7e-8a08-b693c9a6a072 at host gq-data-rt48.gq.sankuai.com.
> 2017-12-27 14:15:04.502 o.a.s.u.Utils [ERROR] An exception happened while downloading /opt/meituan/storm/data/supervisor/tmp/ca4f8174-59be-40a4-b431-dbc8b697f063/stormjar.jar from blob store.
> org.apache.storm.generated.KeyNotFoundException: null
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26656) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26624) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result.read(Nimbus.java:26555) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:86) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$Client.recv_beginBlobDownload(Nimbus.java:864) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$Client.beginBlobDownload(Nimbus.java:851) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.blobstore.NimbusBlobStore.getBlob(NimbusBlobStore.java:357) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorAttempt(Utils.java:598) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorImpl(Utils.java:582) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisor(Utils.java:574) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.downloadBaseBlobs(AsyncLocalizer.java:123) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:148) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:101) ~[storm-core-1.1.2-mt001.jar:?]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_76]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_76]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_76]
> 	at java.lang.Thread.run(Thread.java:745) [?:1.7.0_76]
> 2017-12-27 14:15:04.611 o.a.s.u.Utils [ERROR] An exception happened while downloading /opt/meituan/storm/data/supervisor/tmp/ca4f8174-59be-40a4-b431-dbc8b697f063/stormjar.jar from blob store.
> org.apache.storm.generated.KeyNotFoundException: null
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26656) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26624) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result.read(Nimbus.java:26555) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:86) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$Client.recv_beginBlobDownload(Nimbus.java:864) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$Client.beginBlobDownload(Nimbus.java:851) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.blobstore.NimbusBlobStore.getBlob(NimbusBlobStore.java:357) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorAttempt(Utils.java:598) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorImpl(Utils.java:582) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisor(Utils.java:574) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.downloadBaseBlobs(AsyncLocalizer.java:123) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:148) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:101) ~[storm-core-1.1.2-mt001.jar:?]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_76]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_76]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_76]
> 	at java.lang.Thread.run(Thread.java:745) [?:1.7.0_76]
> 2017-12-27 14:15:04.718 o.a.s.u.Utils [ERROR] An exception happened while downloading /opt/meituan/storm/data/supervisor/tmp/ca4f8174-59be-40a4-b431-dbc8b697f063/stormcode.ser from blob store.
> org.apache.storm.generated.KeyNotFoundException: null
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26656) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26624) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result.read(Nimbus.java:26555) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:86) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$Client.recv_beginBlobDownload(Nimbus.java:864) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$Client.beginBlobDownload(Nimbus.java:851) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.blobstore.NimbusBlobStore.getBlob(NimbusBlobStore.java:357) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorAttempt(Utils.java:598) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorImpl(Utils.java:582) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisor(Utils.java:574) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.downloadBaseBlobs(AsyncLocalizer.java:124) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:148) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:101) ~[storm-core-1.1.2-mt001.jar:?]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_76]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_76]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_76]
> 	at java.lang.Thread.run(Thread.java:745) [?:1.7.0_76]
> 2017-12-27 14:15:04.825 o.a.s.u.Utils [ERROR] An exception happened while downloading /opt/meituan/storm/data/supervisor/tmp/ca4f8174-59be-40a4-b431-dbc8b697f063/stormcode.ser from blob store.
> org.apache.storm.generated.KeyNotFoundException: null
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26656) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26624) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result.read(Nimbus.java:26555) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:86) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$Client.recv_beginBlobDownload(Nimbus.java:864) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$Client.beginBlobDownload(Nimbus.java:851) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.blobstore.NimbusBlobStore.getBlob(NimbusBlobStore.java:357) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorAttempt(Utils.java:598) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorImpl(Utils.java:582) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisor(Utils.java:574) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.downloadBaseBlobs(AsyncLocalizer.java:124) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:148) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:101) ~[storm-core-1.1.2-mt001.jar:?]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_76]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_76]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_76]
> 	at java.lang.Thread.run(Thread.java:745) [?:1.7.0_76]
> 2017-12-27 14:15:04.932 o.a.s.u.Utils [ERROR] An exception happened while downloading /opt/meituan/storm/data/supervisor/tmp/ca4f8174-59be-40a4-b431-dbc8b697f063/stormconf.ser from blob store.
> org.apache.storm.generated.KeyNotFoundException: null
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26656) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26624) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result.read(Nimbus.java:26555) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:86) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$Client.recv_beginBlobDownload(Nimbus.java:864) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$Client.beginBlobDownload(Nimbus.java:851) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.blobstore.NimbusBlobStore.getBlob(NimbusBlobStore.java:357) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorAttempt(Utils.java:598) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorImpl(Utils.java:582) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisor(Utils.java:574) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.downloadBaseBlobs(AsyncLocalizer.java:125) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:148) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:101) ~[storm-core-1.1.2-mt001.jar:?]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_76]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_76]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_76]
> 	at java.lang.Thread.run(Thread.java:745) [?:1.7.0_76]
> 2017-12-27 14:15:05.039 o.a.s.u.Utils [ERROR] An exception happened while downloading /opt/meituan/storm/data/supervisor/tmp/ca4f8174-59be-40a4-b431-dbc8b697f063/stormconf.ser from blob store.
> org.apache.storm.generated.KeyNotFoundException: null
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26656) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26624) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$beginBlobDownload_result.read(Nimbus.java:26555) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:86) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$Client.recv_beginBlobDownload(Nimbus.java:864) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.generated.Nimbus$Client.beginBlobDownload(Nimbus.java:851) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.blobstore.NimbusBlobStore.getBlob(NimbusBlobStore.java:357) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorAttempt(Utils.java:598) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorImpl(Utils.java:582) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.Utils.downloadResourcesAsSupervisor(Utils.java:574) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.downloadBaseBlobs(AsyncLocalizer.java:125) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:148) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:101) ~[storm-core-1.1.2-mt001.jar:?]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_76]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_76]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_76]
> 	at java.lang.Thread.run(Thread.java:745) [?:1.7.0_76]
> 2017-12-27 14:15:05.140 o.a.s.u.Utils [INFO] Could not extract resources from /opt/meituan/storm/data/supervisor/tmp/ca4f8174-59be-40a4-b431-dbc8b697f063/stormjar.jar
> 2017-12-27 14:15:05.142 o.a.s.d.s.Slot [INFO] STATE WAITING_FOR_BASIC_LOCALIZATION msInState: 697 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0
> 2017-12-27 14:15:05.142 o.a.s.l.AsyncLocalizer [WARN] Caught Exception While Downloading (rethrowing)... 
> java.io.FileNotFoundException: File '/opt/meituan/storm/data/supervisor/stormdist/app_dpsr_realtime_shop_vane_allcates-14-1513685785/stormconf.ser' does not exist
> 	at org.apache.storm.shade.org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:292) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.shade.org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1815) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.ConfigUtils.readSupervisorStormConfGivenPath(ConfigUtils.java:264) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.ConfigUtils.readSupervisorStormConfImpl(ConfigUtils.java:376) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.utils.ConfigUtils.readSupervisorStormConf(ConfigUtils.java:370) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBlobs.call(AsyncLocalizer.java:226) ~[storm-core-1.1.2-mt001.jar:?]
> 	at org.apache.storm.localizer.AsyncLocalizer$DownloadBlobs.call(AsyncLocalizer.java:213) ~[storm-core-1.1.2-mt001.jar:?]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_76]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_76]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_76]
> 	at java.lang.Thread.run(Thread.java:745) [?:1.7.0_76]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)