You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@storm.apache.org by "Yuzhao Chen (JIRA)" <ji...@apache.org> on 2018/03/31 06:57:00 UTC

[jira] [Updated] (STORM-2905) Supervisor still downloads storm blob files when the topology was killed.

     [ https://issues.apache.org/jira/browse/STORM-2905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuzhao Chen updated STORM-2905:
-------------------------------
    Description: 
When we kill a topology, at the moment of topology blob-files be removed, Supervisor executor still request blob-files and get an KeyNotFoundException.

I stepped in and found the reason:
1. We do not add a guarded lock on `topologyBlobs` of AsyncLocalizer which is singleton to a supervisor node.
2. And we remove jar/code/conf blob keys in `topologyBlobs` of killed storm only in a timer task: cleanUp() method of AsyncLocalizer, the remove condition is :[no one reference the blobs] AND [ blobs removed by master OR exceeds the max configured size ], the default scheduling interval is 30 seconds.
3. When we kill a storm on a node[ which means that the slot container are empty], 

This is the partial of Supervisor log:

2018-03-31 13:41:17.089 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /Users/danny0405/workspace/storm-2.x-test/supervisor1/apache-storm-2.0.0-SNAPSHOT/storm-local/workers/b50aa089-6584-498e-a5cc-85cba13e4cb0/pids/1115
 2018-03-31 13:41:17.090 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /Users/danny0405/workspace/storm-2.x-test/supervisor1/apache-storm-2.0.0-SNAPSHOT/storm-local/workers/b50aa089-6584-498e-a5cc-85cba13e4cb0/heartbeats
 2018-03-31 13:41:17.102 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /Users/danny0405/workspace/storm-2.x-test/supervisor1/apache-storm-2.0.0-SNAPSHOT/storm-local/workers/b50aa089-6584-498e-a5cc-85cba13e4cb0/pids
 2018-03-31 13:41:17.102 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /Users/danny0405/workspace/storm-2.x-test/supervisor1/apache-storm-2.0.0-SNAPSHOT/storm-local/workers/b50aa089-6584-498e-a5cc-85cba13e4cb0/tmp
 2018-03-31 13:41:17.102 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /Users/danny0405/workspace/storm-2.x-test/supervisor1/apache-storm-2.0.0-SNAPSHOT/storm-local/workers/b50aa089-6584-498e-a5cc-85cba13e4cb0
 2018-03-31 13:41:17.103 o.a.s.d.s.Container SLOT_6700 [INFO] REMOVE worker-user b50aa089-6584-498e-a5cc-85cba13e4cb0
 2018-03-31 13:41:17.103 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /Users/danny0405/workspace/storm-2.x-test/supervisor1/apache-storm-2.0.0-SNAPSHOT/storm-local/workers-users/b50aa089-6584-498e-a5cc-85cba13e4cb0
 2018-03-31 13:41:17.104 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Removed Worker ID b50aa089-6584-498e-a5cc-85cba13e4cb0
 2018-03-31 13:41:17.105 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE KILL msInState: 25 topo:word_count_fk_11-2-1522472558 worker:null -&gt; EMPTY msInState: 0
 2018-03-31 13:41:17.105 o.a.s.d.s.Slot SLOT_6700 [INFO] SLOT 6700: Changing current assignment from LocalAssignment(topology_id:word_count_fk_11-2-1522472558, executors:[ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:6, task_end:6), ExecutorInfo(task_start:5, task_end:5), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:3, task_end:3), ExecutorInfo(task_start:2, task_end:2), ExecutorInfo(task_start:1, task_end:1)], resources:WorkerResources(mem_on_heap:896.0, mem_off_heap:0.0, cpu:70.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=896.0, cpu.pcore.percent=70.0}, shared_resources:{}), owner:danny0405) to null
 2018-03-31 13:41:18.001 o.a.s.d.s.t.SupervisorHealthCheck timer [INFO] Running supervisor healthchecks...
 2018-03-31 13:41:18.002 o.a.s.h.HealthChecker timer [INFO] The supervisor healthchecks succeeded.
 2018-03-31 13:41:36.837 o.a.s.l.AsyncLocalizer AsyncLocalizer Executor - 1 [WARN] Failed to download blob LOCAL TOPO BLOB TOPO_JAR word_count_fk_11-2-1522472558 will try again in 100 ms
 org.apache.storm.generated.KeyNotFoundException: null
         at org.apache.storm.generated.Nimbus$getBlobMeta_result$getBlobMeta_resultStandardScheme.read(Nimbus.java:25225) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
         at org.apache.storm.generated.Nimbus$getBlobMeta_result$getBlobMeta_resultStandardScheme.read(Nimbus.java:25193) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
         at org.apache.storm.generated.Nimbus$getBlobMeta_result.read(Nimbus.java:25124) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
         at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86) ~[libthrift-0.9.3.jar:0.9.3]
         at org.apache.storm.generated.Nimbus$Client.recv_getBlobMeta(Nimbus.java:825) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
         at org.apache.storm.generated.Nimbus$Client.getBlobMeta(Nimbus.java:812) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
         at org.apache.storm.blobstore.NimbusBlobStore.getBlobMeta(NimbusBlobStore.java:318) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
         at org.apache.storm.localizer.LocallyCachedTopologyBlob.getRemoteVersion(LocallyCachedTopologyBlob.java:176) ~[storm-server-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
         at org.apache.storm.localizer.AsyncLocalizer.lambda$downloadOrUpdate$5(AsyncLocalizer.java:249) ~[storm-server-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
         at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626) [?:1.8.0_151]
         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_151]
         at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_151]
         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_151]
         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_151]
         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]

  was:
When we kill a topology, at the moment of topology blob-files be removed, Supervisor executor still request blob-files and get an KeyNotFoundException, this is the partial of Supervisor log:


2017-11-21 17:32:00.435 o.a.s.d.s.Slot SLOT_9703 [INFO] SLOT 9703: Changing current assignment from LocalAssignment(topology_id:word_count_fk_iii-3-1511256346, executors:[ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:5, task_end:5), ExecutorInfo(task_start:3, task_end:3), ExecutorInfo(task_start:1, task_end:1)], resources:WorkerResources(mem_on_heap:512.0, mem_off_heap:0.0, cpu:40.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:\{offheap.memory.mb=0.0, onheap.memory.mb=512.0, cpu.pcore.percent=40.0}, shared_resources:{}), owner:sankuai) to null
2017-11-21 17:32:10.520 o.a.s.l.AsyncLocalizer AsyncLocalizer Executor - 3 [WARN] Failed to download blob LOCAL TOPO BLOB TOPO_CODE word_count_fk_iii-3-1511256346 will try again in 100 ms
org.apache.storm.generated.KeyNotFoundException: null
 at org.apache.storm.generated.Nimbus$getBlobMeta_result$getBlobMeta_resultStandardScheme.read(Nimbus.java:24935) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 at org.apache.storm.generated.Nimbus$getBlobMeta_result$getBlobMeta_resultStandardScheme.read(Nimbus.java:24903) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 at org.apache.storm.generated.Nimbus$getBlobMeta_result.read(Nimbus.java:24834) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86) ~[libthrift-0.9.3.jar:0.9.3]
 at org.apache.storm.generated.Nimbus$Client.recv_getBlobMeta(Nimbus.java:812) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 at org.apache.storm.generated.Nimbus$Client.getBlobMeta(Nimbus.java:799) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 at org.apache.storm.blobstore.NimbusBlobStore.getBlobMeta(NimbusBlobStore.java:306) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 at org.apache.storm.localizer.LocallyCachedTopologyBlob.getRemoteVersion(LocallyCachedTopologyBlob.java:176) ~[storm-server-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 at org.apache.storm.localizer.AsyncLocalizer.lambda$downloadOrUpdate$5(AsyncLocalizer.java:249) ~[storm-server-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626) [?:1.8.0_151]
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_151]
 at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_151]
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_151]
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_151]
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
 at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]


> Supervisor still downloads storm blob files when the topology was killed.
> -------------------------------------------------------------------------
>
>                 Key: STORM-2905
>                 URL: https://issues.apache.org/jira/browse/STORM-2905
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-server
>    Affects Versions: 2.0.0
>            Reporter: Yuzhao Chen
>            Assignee: Yuzhao Chen
>            Priority: Major
>
> When we kill a topology, at the moment of topology blob-files be removed, Supervisor executor still request blob-files and get an KeyNotFoundException.
> I stepped in and found the reason:
> 1. We do not add a guarded lock on `topologyBlobs` of AsyncLocalizer which is singleton to a supervisor node.
> 2. And we remove jar/code/conf blob keys in `topologyBlobs` of killed storm only in a timer task: cleanUp() method of AsyncLocalizer, the remove condition is :[no one reference the blobs] AND [ blobs removed by master OR exceeds the max configured size ], the default scheduling interval is 30 seconds.
> 3. When we kill a storm on a node[ which means that the slot container are empty], 
> This is the partial of Supervisor log:
> 2018-03-31 13:41:17.089 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /Users/danny0405/workspace/storm-2.x-test/supervisor1/apache-storm-2.0.0-SNAPSHOT/storm-local/workers/b50aa089-6584-498e-a5cc-85cba13e4cb0/pids/1115
>  2018-03-31 13:41:17.090 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /Users/danny0405/workspace/storm-2.x-test/supervisor1/apache-storm-2.0.0-SNAPSHOT/storm-local/workers/b50aa089-6584-498e-a5cc-85cba13e4cb0/heartbeats
>  2018-03-31 13:41:17.102 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /Users/danny0405/workspace/storm-2.x-test/supervisor1/apache-storm-2.0.0-SNAPSHOT/storm-local/workers/b50aa089-6584-498e-a5cc-85cba13e4cb0/pids
>  2018-03-31 13:41:17.102 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /Users/danny0405/workspace/storm-2.x-test/supervisor1/apache-storm-2.0.0-SNAPSHOT/storm-local/workers/b50aa089-6584-498e-a5cc-85cba13e4cb0/tmp
>  2018-03-31 13:41:17.102 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /Users/danny0405/workspace/storm-2.x-test/supervisor1/apache-storm-2.0.0-SNAPSHOT/storm-local/workers/b50aa089-6584-498e-a5cc-85cba13e4cb0
>  2018-03-31 13:41:17.103 o.a.s.d.s.Container SLOT_6700 [INFO] REMOVE worker-user b50aa089-6584-498e-a5cc-85cba13e4cb0
>  2018-03-31 13:41:17.103 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /Users/danny0405/workspace/storm-2.x-test/supervisor1/apache-storm-2.0.0-SNAPSHOT/storm-local/workers-users/b50aa089-6584-498e-a5cc-85cba13e4cb0
>  2018-03-31 13:41:17.104 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Removed Worker ID b50aa089-6584-498e-a5cc-85cba13e4cb0
>  2018-03-31 13:41:17.105 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE KILL msInState: 25 topo:word_count_fk_11-2-1522472558 worker:null -&gt; EMPTY msInState: 0
>  2018-03-31 13:41:17.105 o.a.s.d.s.Slot SLOT_6700 [INFO] SLOT 6700: Changing current assignment from LocalAssignment(topology_id:word_count_fk_11-2-1522472558, executors:[ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:6, task_end:6), ExecutorInfo(task_start:5, task_end:5), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:3, task_end:3), ExecutorInfo(task_start:2, task_end:2), ExecutorInfo(task_start:1, task_end:1)], resources:WorkerResources(mem_on_heap:896.0, mem_off_heap:0.0, cpu:70.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=896.0, cpu.pcore.percent=70.0}, shared_resources:{}), owner:danny0405) to null
>  2018-03-31 13:41:18.001 o.a.s.d.s.t.SupervisorHealthCheck timer [INFO] Running supervisor healthchecks...
>  2018-03-31 13:41:18.002 o.a.s.h.HealthChecker timer [INFO] The supervisor healthchecks succeeded.
>  2018-03-31 13:41:36.837 o.a.s.l.AsyncLocalizer AsyncLocalizer Executor - 1 [WARN] Failed to download blob LOCAL TOPO BLOB TOPO_JAR word_count_fk_11-2-1522472558 will try again in 100 ms
>  org.apache.storm.generated.KeyNotFoundException: null
>          at org.apache.storm.generated.Nimbus$getBlobMeta_result$getBlobMeta_resultStandardScheme.read(Nimbus.java:25225) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
>          at org.apache.storm.generated.Nimbus$getBlobMeta_result$getBlobMeta_resultStandardScheme.read(Nimbus.java:25193) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
>          at org.apache.storm.generated.Nimbus$getBlobMeta_result.read(Nimbus.java:25124) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
>          at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86) ~[libthrift-0.9.3.jar:0.9.3]
>          at org.apache.storm.generated.Nimbus$Client.recv_getBlobMeta(Nimbus.java:825) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
>          at org.apache.storm.generated.Nimbus$Client.getBlobMeta(Nimbus.java:812) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
>          at org.apache.storm.blobstore.NimbusBlobStore.getBlobMeta(NimbusBlobStore.java:318) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
>          at org.apache.storm.localizer.LocallyCachedTopologyBlob.getRemoteVersion(LocallyCachedTopologyBlob.java:176) ~[storm-server-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
>          at org.apache.storm.localizer.AsyncLocalizer.lambda$downloadOrUpdate$5(AsyncLocalizer.java:249) ~[storm-server-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
>          at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626) [?:1.8.0_151]
>          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_151]
>          at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_151]
>          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_151]
>          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_151]
>          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
>          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
>          at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)