You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "anishshri-db (via GitHub)" <gi...@apache.org> on 2023/05/08 05:54:38 UTC

[GitHub] [spark] anishshri-db opened a new pull request, #41089: [SPARK-43404] Skip reusing sst file for same version of RocksDB state store to avoid id mismatch error

anishshri-db opened a new pull request, #41089:
URL: https://github.com/apache/spark/pull/41089

   ### What changes were proposed in this pull request?
   Skip reusing sst file for same version of RocksDB state store to avoid id mismatch error
   
   ### Why are the changes needed?
   In case of task retry on the same executor, its possible that the original task completed the phase of creating the SST files and uploading them to the object store. In this case, we also might have added an entry to the in-memory map for `versionToRocksDBFiles` for the given version. When the retry task creates the local checkpoint, its possible the file name and size is the same, but the metadata ID embedded within the file may be different. So, when we try to load this version on successful commit, the metadata zip file points to the old SST file which results in a RocksDB mismatch id error.
   
   ```
   Mismatch in unique ID on table file 24220. Expected: {9692563551998415634,4655083329411385714} Actual: {9692563551998415639,10299185534092933087} in file /local_disk0/spark-f58a741d-576f-400c-9b56-53497745ac01/executor-18e08e59-20e8-4a00-bd7e-94ad4599150b/spark-5d980399-3425-4951-894a-808b943054ea/StateStoreId(opId=2147483648,partId=53,name=default)-d89e082e-4e33-4371-8efd-78d927ad3ba3/workingDir-9928750e-f648-4013-a300-ac96cb6ec139/MANIFEST-024212
   ```
   
   This change avoids reusing files for the same version on the same host based on the map entries to reduce the chance of running into the error above.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Unit test
   
   RocksDBSuite
   ```
   [info] Run completed in 35 seconds, 995 milliseconds.
   [info] Total number of tests run: 33
   [info] Suites: completed 1, aborted 0
   [info] Tests: succeeded 33, failed 0, canceled 0, ignored 0, pending 0
   [info] All tests passed.
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #41089: [SPARK-43404][SS] Skip reusing sst file for same version of RocksDB state store to avoid id mismatch error

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR commented on PR #41089:
URL: https://github.com/apache/spark/pull/41089#issuecomment-1539189818

   Thanks! Merging to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR closed pull request #41089: [SPARK-43404][SS] Skip reusing sst file for same version of RocksDB state store to avoid id mismatch error

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR closed pull request #41089: [SPARK-43404][SS] Skip reusing sst file for same version of RocksDB state store to avoid id mismatch error
URL: https://github.com/apache/spark/pull/41089


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] anishshri-db commented on pull request #41089: [SPARK-43404][SS] Skip reusing sst file for same version of RocksDB state store to avoid id mismatch error

Posted by "anishshri-db (via GitHub)" <gi...@apache.org>.
anishshri-db commented on PR #41089:
URL: https://github.com/apache/spark/pull/41089#issuecomment-1537792038

   @HeartSaVioR - please take a look. Thx


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #41089: [SPARK-43404][SS] Skip reusing sst file for same version of RocksDB state store to avoid id mismatch error

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR commented on PR #41089:
URL: https://github.com/apache/spark/pull/41089#issuecomment-1584108322

   PR for 3.4: https://github.com/apache/spark/pull/41530


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a diff in pull request #41089: [SPARK-43404][SS] Skip reusing sst file for same version of RocksDB state store to avoid id mismatch error

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR commented on code in PR #41089:
URL: https://github.com/apache/spark/pull/41089#discussion_r1187106223


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala:
##########
@@ -371,7 +371,8 @@ class RocksDBFileManager(
     // Get the immutable files used in previous versions, as some of those uploaded files can be
     // reused for this version
     logInfo(s"Saving RocksDB files to DFS for $version")
-    val prevFilesToSizes = versionToRocksDBFiles.values.asScala.flatten.map { f =>
+    val prevFilesToSizes = versionToRocksDBFiles.asScala.filterKeys(_ != version)

Review Comment:
   `_ < version` to fit with above code comment.
   
   Btw, my preference is actually invalidating entries of cache for the version if we figure out the version will be reprocessed. (Say, invalidating all entries of cache for versions where version in entry > parameter in RocksDB.load().) But I agree this is broader fix which may not be necessary, since listing up cache entries are only performed here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] anishshri-db commented on a diff in pull request #41089: [SPARK-43404][SS] Skip reusing sst file for same version of RocksDB state store to avoid id mismatch error

Posted by "anishshri-db (via GitHub)" <gi...@apache.org>.
anishshri-db commented on code in PR #41089:
URL: https://github.com/apache/spark/pull/41089#discussion_r1187648615


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala:
##########
@@ -371,7 +371,8 @@ class RocksDBFileManager(
     // Get the immutable files used in previous versions, as some of those uploaded files can be
     // reused for this version
     logInfo(s"Saving RocksDB files to DFS for $version")
-    val prevFilesToSizes = versionToRocksDBFiles.values.asScala.flatten.map { f =>
+    val prevFilesToSizes = versionToRocksDBFiles.asScala.filterKeys(_ != version)

Review Comment:
   Done. Yea, I thought about this too. But listing happens only here and we prob don't expect for this to happen too often. So went with the point fix for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #41089: [SPARK-43404][SS] Skip reusing sst file for same version of RocksDB state store to avoid id mismatch error

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR commented on PR #41089:
URL: https://github.com/apache/spark/pull/41089#issuecomment-1584104290

   I'll take care of it. If the new PR ends up with pure cherry-pick I can probably self approve and go.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Kimahriman commented on pull request #41089: [SPARK-43404][SS] Skip reusing sst file for same version of RocksDB state store to avoid id mismatch error

Posted by "Kimahriman (via GitHub)" <gi...@apache.org>.
Kimahriman commented on PR #41089:
URL: https://github.com/apache/spark/pull/41089#issuecomment-1573719968

   Should this be backported to 3.4? We just encountered this as well and it doesn't seem like there's anyway to recover from it beyond deleting or hacking your checkpoint


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #41089: [SPARK-43404][SS] Skip reusing sst file for same version of RocksDB state store to avoid id mismatch error

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR commented on PR #41089:
URL: https://github.com/apache/spark/pull/41089#issuecomment-1573809706

   @Kimahriman Why not? I missed that this is a non-trivial bugfix which is expected to port back to affected version(s). Thanks for the reminder.
   
   @anishshri-db Shall we have a cherry-pick PR against branch-3.4? Thanks in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Kimahriman commented on pull request #41089: [SPARK-43404][SS] Skip reusing sst file for same version of RocksDB state store to avoid id mismatch error

Posted by "Kimahriman (via GitHub)" <gi...@apache.org>.
Kimahriman commented on PR #41089:
URL: https://github.com/apache/spark/pull/41089#issuecomment-1573905188

   The error happens on the read during the batch after a checkpoint partition gets "corrupted" (where the ID in the metadata and the SST file don't match). So once you encounter this error as far as I know your checkpoint is corrupted beyond repair. Can only try to re-run the previous version by deleting the `commit` entry, but that would only work if your downstream is idempotent, which in our case writing to a Delta table downstream is not (without trying to rollback the table version which is an option). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org