Posted to issues@flink.apache.org by "Gaël Renoux (Jira)" <ji...@apache.org> on 2022/05/18 16:54:00 UTC

[jira] [Created] (FLINK-27687) SpanningWrapper shouldn't assume temp folder exists

Gaël Renoux created FLINK-27687:
-----------------------------------

             Summary: SpanningWrapper shouldn't assume temp folder exists
                 Key: FLINK-27687
                 URL: https://issues.apache.org/jira/browse/FLINK-27687
             Project: Flink
          Issue Type: New Feature
          Components: Runtime / Network
    Affects Versions: 1.14.4
            Reporter: Gaël Renoux


SpanningWrapper.createSpillingChannel assumes that the folder in which it creates the file exists. However, this is not always the case, as in the following scenario (which actually happened to us today):
 * The temp folders were created a while ago in /tmp (I assume on startup of the TaskManager). They weren't used for a while, probably because we didn't have any record big enough to need them.
 * The cleanup cron for /tmp did its job and deleted those old folders.
 * We deployed a new version of the job that actually needed the folders, and it crashed.
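The failure mode can be reproduced outside Flink with a minimal sketch. It uses plain java.nio calls, not Flink's actual SpanningWrapper code, and the folder prefix and file name are made up for illustration: a temp folder is created, deleted as the cron would do, and then a new file is opened inside it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

import static java.nio.file.StandardOpenOption.CREATE_NEW;
import static java.nio.file.StandardOpenOption.READ;
import static java.nio.file.StandardOpenOption.WRITE;

public class SpillDirDeletedRepro {

    // Returns true if creating a file inside a purged temp folder fails with NoSuchFileException.
    static boolean openFailsAfterPurge() {
        try {
            // Stand-in for the temp folder created when the TaskManager started.
            Path spillDir = Files.createTempDirectory("flink-spill-demo");
            // Stand-in for the /tmp cleanup cron removing the unused (empty) folder.
            Files.delete(spillDir);
            Path spillFile = spillDir.resolve("spill-file.channel");
            try (FileChannel ignored = FileChannel.open(spillFile, CREATE_NEW, WRITE, READ)) {
                return false; // only reached if the folder somehow still existed
            } catch (NoSuchFileException e) {
                return true; // the parent folder is gone, so the file cannot be created
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("open failed after purge: " + openFailsAfterPurge());
    }
}
```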

=> I'm not sure it should be SpanningWrapper's responsibility to recreate the folder if it no longer exists, and I'm not familiar enough with Flink's internals to guess which class should do it. The problem occurred to us in SpanningWrapper, but it can probably happen in other places as well.
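One possible mitigation, sketched with plain java.nio and not taken from Flink, is to recreate the folder right before opening the file. `openSpillFile` and `survivesPurge` are hypothetical helpers for illustration, not existing Flink methods.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

import static java.nio.file.StandardOpenOption.CREATE_NEW;
import static java.nio.file.StandardOpenOption.READ;
import static java.nio.file.StandardOpenOption.WRITE;

public class SpillDirRecreate {

    // Hypothetical helper (not a Flink API): opens a new spill file,
    // recreating the temp folder first in case it was purged.
    static FileChannel openSpillFile(Path spillDir, String fileName) throws IOException {
        // Does nothing if the folder exists; otherwise creates it and any missing parents.
        Files.createDirectories(spillDir);
        // CREATE_NEW fails if a file with this name already exists.
        return FileChannel.open(spillDir.resolve(fileName), CREATE_NEW, WRITE, READ);
    }

    // Simulates the purge, then checks that the helper still manages to create the file.
    static boolean survivesPurge() {
        try {
            Path spillDir = Files.createTempDirectory("flink-spill-demo");
            Files.delete(spillDir); // stand-in for the /tmp cleanup cron
            try (FileChannel channel = openSpillFile(spillDir, "spill-file.channel")) {
                return channel.isOpen() && Files.isDirectory(spillDir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("spill file created after purge: " + survivesPurge());
    }
}
```

This narrows the window rather than closing it: a purge that runs between `createDirectories` and `open` would still make the open fail.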

More generally, assuming that folders and files in /tmp will never be deleted doesn't seem correct to me. The [documentation for io.tmp.dirs|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/] recommends that it shouldn't be purged, but we do need to clean up at some point. If Flink is not meant to cope with this, then the documentation should be updated to state that this is mandatory rather than just a recommendation, and that purges will break the jobs (not just trigger a recovery).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)