You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@storm.apache.org by GitBox <gi...@apache.org> on 2020/09/16 15:49:42 UTC

[GitHub] [storm] RuiLi8080 opened a new pull request #3336: [STORM-3701] clean-up topo directory with the check against latest topo blob cache

RuiLi8080 opened a new pull request #3336:
URL: https://github.com/apache/storm/pull/3336


   ## What is the purpose of the change
   
   When AsyncLocalizer cleanup happens, make sure it check against the latest topology cache information.
   Currently, the `safeTopologyIds` set might become stale due to delay at around https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java#L641 . Newly created topology entries might not appear here and clean-up will delete them by fault and cause further FNF issue.
   
   ## How was the change tested
   Reproduce the FNF issue with additional sleep. After the change, same delay won't cause the FNF anymore.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [storm] RuiLi8080 commented on a change in pull request #3336: [STORM-3701] clean-up topo directory with the check against latest topo blob cache

Posted by GitBox <gi...@apache.org>.
RuiLi8080 commented on a change in pull request #3336:
URL: https://github.com/apache/storm/pull/3336#discussion_r491122961



##########
File path: storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java
##########
@@ -641,7 +642,12 @@ void cleanup() {
 
             try {
                 forEachTopologyDistDir((p, topologyId) -> {
-                    if (!safeTopologyIds.contains(topologyId)) {
+                    String topoJarKey = ConfigUtils.masterStormJarKey(topologyId);
+                    String topoCodeKey = ConfigUtils.masterStormCodeKey(topologyId);
+                    String topoConfKey = ConfigUtils.masterStormConfKey(topologyId);
+                    if (!topologyBlobs.containsKey(topoJarKey)

Review comment:
       @agresch Technically there is a chance. But with this change it would be more unlikely to happen than before.
   Before this change, we compare topo dir to `safeTopologyIds` which is essentially a much older version of `topologyBlobs  snapshot since it could take 10s for the `forEachTopologyDistDir` loop to finish. In this case, race condition is much more likely to happen due to the gap. 
   Now we directly check the `topologyBlobs` map right before deletion. So the `forEachTopologyDistDir` loop could be aware of the concurrent additions from slots and further prevent most of the wrongful deletion to happen.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [storm] Ethanlm commented on a change in pull request #3336: [STORM-3701] clean-up topo directory with the check against latest topo blob cache

Posted by GitBox <gi...@apache.org>.
Ethanlm commented on a change in pull request #3336:
URL: https://github.com/apache/storm/pull/3336#discussion_r491163677



##########
File path: storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java
##########
@@ -641,7 +642,12 @@ void cleanup() {
 
             try {
                 forEachTopologyDistDir((p, topologyId) -> {
-                    if (!safeTopologyIds.contains(topologyId)) {
+                    String topoJarKey = ConfigUtils.masterStormJarKey(topologyId);
+                    String topoCodeKey = ConfigUtils.masterStormCodeKey(topologyId);
+                    String topoConfKey = ConfigUtils.masterStormConfKey(topologyId);
+                    if (!topologyBlobs.containsKey(topoJarKey)

Review comment:
       I guess your question is if line 650 (`&& !topologyBlobs.containsKey(topoConfKey)) {`) executes and before line 651 (`fsOps.deleteIfExists(p.toFile());`) runs, getTopoJar(topologyId) is called?
   Yes there is a chance. But current change is much better than what we have before.
   
   
   Just for future reference, one thing about Rui's comment that I want to note is that we still don't know where the 10s delay we observed from the incident we had in our production cluster between where `safeTopologyIds` is populated and `fsOps.deleteIfExists` is invoked comes from. It could come from `forEachTopologyDistDir` file operations. But we don't know yet. Also this issue doesn't alway happen. 
    




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [storm] Ethanlm commented on a change in pull request #3336: [STORM-3701] clean-up topo directory with the check against latest topo blob cache

Posted by GitBox <gi...@apache.org>.
Ethanlm commented on a change in pull request #3336:
URL: https://github.com/apache/storm/pull/3336#discussion_r490986544



##########
File path: storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java
##########
@@ -641,7 +642,12 @@ void cleanup() {
 
             try {
                 forEachTopologyDistDir((p, topologyId) -> {
-                    if (!safeTopologyIds.contains(topologyId)) {
+                    String topoJarKey = ConfigUtils.masterStormJarKey(topologyId);
+                    String topoCodeKey = ConfigUtils.masterStormCodeKey(topologyId);
+                    String topoConfKey = ConfigUtils.masterStormConfKey(topologyId);
+                    if (!topologyBlobs.containsKey(topoJarKey)

Review comment:
       When that happens, the directory will not be deleted. It is okay because when `getTopoJar` executes, it means the directory is needed. 
   
   The bottom line here is the code will delete the directory for dead topologies eventually (cleanup thread runs every 30s). But if there is anywhere that needs this directory, this code will not delete the directory
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [storm] Ethanlm merged pull request #3336: [STORM-3701] clean-up topo directory with the check against latest topo blob cache

Posted by GitBox <gi...@apache.org>.
Ethanlm merged pull request #3336:
URL: https://github.com/apache/storm/pull/3336


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [storm] Ethanlm commented on a change in pull request #3336: [STORM-3701] clean-up topo directory with the check against latest topo blob cache

Posted by GitBox <gi...@apache.org>.
Ethanlm commented on a change in pull request #3336:
URL: https://github.com/apache/storm/pull/3336#discussion_r490986857



##########
File path: storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java
##########
@@ -474,7 +474,7 @@ public void releaseSlotFor(LocalAssignment assignment, int port) throws IOExcept
             LOG.info("Port and assignment info: {}", pna);
             if (e instanceof FileNotFoundException) {
                 localResourceFileNotFoundWhenReleasingSlot.mark();
-                LOG.warn("Local base blobs have not been downloaded yet. ", e);
+                LOG.warn("Local base blobs are not available. ", e);

Review comment:
       Please update comments at line 470-471 too




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [storm] RuiLi8080 commented on a change in pull request #3336: [STORM-3701] clean-up topo directory with the check against latest topo blob cache

Posted by GitBox <gi...@apache.org>.
RuiLi8080 commented on a change in pull request #3336:
URL: https://github.com/apache/storm/pull/3336#discussion_r491122961



##########
File path: storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java
##########
@@ -641,7 +642,12 @@ void cleanup() {
 
             try {
                 forEachTopologyDistDir((p, topologyId) -> {
-                    if (!safeTopologyIds.contains(topologyId)) {
+                    String topoJarKey = ConfigUtils.masterStormJarKey(topologyId);
+                    String topoCodeKey = ConfigUtils.masterStormCodeKey(topologyId);
+                    String topoConfKey = ConfigUtils.masterStormConfKey(topologyId);
+                    if (!topologyBlobs.containsKey(topoJarKey)

Review comment:
       @agresch Technically there is a chance. But with this change it would be more unlikely to happen than before.
   Before this change, we compare topo dir to `safeTopologyIds` which is essentially a much older version of `topologyBlobs`  snapshot since it could take about 10s for the `forEachTopologyDistDir` loop to finish. In this case, race condition is much more likely to happen since the deletion loop keep looking at the snapshot 10s ago.
    
   Now we directly check the `topologyBlobs` map right before deletion. So the `forEachTopologyDistDir` loop could be aware of the concurrent additions from slots and further prevent most of the wrongful deletion to happen.

##########
File path: storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java
##########
@@ -641,7 +642,12 @@ void cleanup() {
 
             try {
                 forEachTopologyDistDir((p, topologyId) -> {
-                    if (!safeTopologyIds.contains(topologyId)) {
+                    String topoJarKey = ConfigUtils.masterStormJarKey(topologyId);
+                    String topoCodeKey = ConfigUtils.masterStormCodeKey(topologyId);
+                    String topoConfKey = ConfigUtils.masterStormConfKey(topologyId);
+                    if (!topologyBlobs.containsKey(topoJarKey)

Review comment:
       @agresch Technically there is a chance. But with this change it would be more unlikely to happen than before.
   
   Before this change, we compare topo dir to `safeTopologyIds` which is essentially a much older version of `topologyBlobs`  snapshot since it could take about 10s for the `forEachTopologyDistDir` loop to finish. In this case, race condition is much more likely to happen since the deletion loop keep looking at the snapshot 10s ago.
    
   Now we directly check the `topologyBlobs` map right before deletion. So the `forEachTopologyDistDir` loop could be aware of the concurrent additions from slots and further prevent most of the wrongful deletion to happen.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [storm] agresch commented on a change in pull request #3336: [STORM-3701] clean-up topo directory with the check against latest topo blob cache

Posted by GitBox <gi...@apache.org>.
agresch commented on a change in pull request #3336:
URL: https://github.com/apache/storm/pull/3336#discussion_r489551013



##########
File path: storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java
##########
@@ -641,7 +642,12 @@ void cleanup() {
 
             try {
                 forEachTopologyDistDir((p, topologyId) -> {
-                    if (!safeTopologyIds.contains(topologyId)) {
+                    String topoJarKey = ConfigUtils.masterStormJarKey(topologyId);
+                    String topoCodeKey = ConfigUtils.masterStormCodeKey(topologyId);
+                    String topoConfKey = ConfigUtils.masterStormConfKey(topologyId);
+                    if (!topologyBlobs.containsKey(topoJarKey)

Review comment:
       It seems like it may be possible to still have a race condition here where a topology is added to topologyBlobs?
   
   If so it seems like we should add synchronization to this code and the adding of a blob?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [storm] Ethanlm merged pull request #3336: [STORM-3701] clean-up topo directory with the check against latest topo blob cache

Posted by GitBox <gi...@apache.org>.
Ethanlm merged pull request #3336:
URL: https://github.com/apache/storm/pull/3336


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [storm] RuiLi8080 commented on a change in pull request #3336: [STORM-3701] clean-up topo directory with the check against latest topo blob cache

Posted by GitBox <gi...@apache.org>.
RuiLi8080 commented on a change in pull request #3336:
URL: https://github.com/apache/storm/pull/3336#discussion_r491123691



##########
File path: storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java
##########
@@ -474,7 +474,7 @@ public void releaseSlotFor(LocalAssignment assignment, int port) throws IOExcept
             LOG.info("Port and assignment info: {}", pna);
             if (e instanceof FileNotFoundException) {
                 localResourceFileNotFoundWhenReleasingSlot.mark();
-                LOG.warn("Local base blobs have not been downloaded yet. ", e);
+                LOG.warn("Local base blobs are not available. ", e);

Review comment:
       Updated.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [storm] RuiLi8080 commented on a change in pull request #3336: [STORM-3701] clean-up topo directory with the check against latest topo blob cache

Posted by GitBox <gi...@apache.org>.
RuiLi8080 commented on a change in pull request #3336:
URL: https://github.com/apache/storm/pull/3336#discussion_r489756042



##########
File path: storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java
##########
@@ -641,7 +642,12 @@ void cleanup() {
 
             try {
                 forEachTopologyDistDir((p, topologyId) -> {
-                    if (!safeTopologyIds.contains(topologyId)) {
+                    String topoJarKey = ConfigUtils.masterStormJarKey(topologyId);
+                    String topoCodeKey = ConfigUtils.masterStormCodeKey(topologyId);
+                    String topoConfKey = ConfigUtils.masterStormConfKey(topologyId);
+                    if (!topologyBlobs.containsKey(topoJarKey)

Review comment:
       My understanding is topo dir is always created after the entries of `topologyBlobs` are added. So whenever the topology entry is found, we should not delete the dir. 
   Also, `topologyBlobs` is `ConcurrentHashMap` which allows the slot and cleanup/update threads to access in parallel. 
   If we added synchronization, the file system dir cleanup might take seconds and block the slot threads to progress.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [storm] Ethanlm commented on a change in pull request #3336: [STORM-3701] clean-up topo directory with the check against latest topo blob cache

Posted by GitBox <gi...@apache.org>.
Ethanlm commented on a change in pull request #3336:
URL: https://github.com/apache/storm/pull/3336#discussion_r491163677



##########
File path: storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java
##########
@@ -641,7 +642,12 @@ void cleanup() {
 
             try {
                 forEachTopologyDistDir((p, topologyId) -> {
-                    if (!safeTopologyIds.contains(topologyId)) {
+                    String topoJarKey = ConfigUtils.masterStormJarKey(topologyId);
+                    String topoCodeKey = ConfigUtils.masterStormCodeKey(topologyId);
+                    String topoConfKey = ConfigUtils.masterStormConfKey(topologyId);
+                    if (!topologyBlobs.containsKey(topoJarKey)

Review comment:
       I guess your question is if line 650 executes and before line 651 runs, getTopoJar(topologyId) is called?
   Yes there is a chance. But current change is much better than what we have before.
   
   
   Just for future reference, one thing about Rui's comment that I want to note is that we still don't know where the 10s delay we observed from the incident we had in our production cluster between where `safeTopologyIds` is populated and `fsOps.deleteIfExists` is invoked comes from. It could come from `forEachTopologyDistDir` file operations. But we don't know yet. Also this issue doesn't alway happen. 
    




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [storm] agresch commented on a change in pull request #3336: [STORM-3701] clean-up topo directory with the check against latest topo blob cache

Posted by GitBox <gi...@apache.org>.
agresch commented on a change in pull request #3336:
URL: https://github.com/apache/storm/pull/3336#discussion_r490428155



##########
File path: storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java
##########
@@ -641,7 +642,12 @@ void cleanup() {
 
             try {
                 forEachTopologyDistDir((p, topologyId) -> {
-                    if (!safeTopologyIds.contains(topologyId)) {
+                    String topoJarKey = ConfigUtils.masterStormJarKey(topologyId);
+                    String topoCodeKey = ConfigUtils.masterStormCodeKey(topologyId);
+                    String topoConfKey = ConfigUtils.masterStormConfKey(topologyId);
+                    if (!topologyBlobs.containsKey(topoJarKey)

Review comment:
       What happens if line 648 executes and before line 649 runs getTopoJar(topologyId) is called?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org