Posted to dev@pig.apache.org by "Olga Natkovich (JIRA)" <ji...@apache.org> on 2008/02/22 23:33:19 UTC

[jira] Created: (PIG-116) pig leaves temp files behind

pig leaves temp files behind
----------------------------

                 Key: PIG-116
                 URL: https://issues.apache.org/jira/browse/PIG-116
             Project: Pig
          Issue Type: Bug
            Reporter: Olga Natkovich
            Assignee: Olga Natkovich


Currently, pig creates temp dirs via a call to FileLocalizer.getTemporaryPath. They are created on the client and are mainly used to store data between 2 M-R jobs. Pig then attempts to clean them up in the client's shutdown hook.

The problem with this approach is that, because there is no way to order the shutdown hooks, in some cases the DFS is already closed when we try to delete the files, in which case a substantial amount of data can be left in DFS. I see this issue more frequently with hadoop 0.16, perhaps because I had to add an extra shutdown hook to handle hod disconnects.

In the short term, I would like to propose the approach below:

(1) If trash is configured on the cluster, use the trash location to create a temp directory that will expire in 7 days. The hope is that most jobs don't run longer than 7 days. The user can specify a longer interval via a command line switch.
(2) If trash is not enabled on the cluster, the location that we use now will be used.
(3) In the shutdown hook, we will attempt to clean up. If the attempt fails and trash is enabled, we let trash handle it; otherwise we provide the list of locations to the user to clean. (I realize that this is not ideal but could not figure out a better way.)
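The three steps above could be sketched roughly as follows. This is a minimal sketch, not Pig's actual code: the class name, the fallback path, and the method are hypothetical placeholders, and a real implementation would work against Hadoop's FileSystem API and hook into the shutdown path for step (3).

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class TempPathPolicy {
    // Hypothetical stand-in for the temp location Pig uses today.
    static final String CURRENT_TEMP_ROOT = "/tmp/temp-pig";

    // Steps (1) and (2): pick the temp root depending on whether trash is enabled.
    // A dated directory under the trash root expires with the trash checkpoint
    // cycle, so leftovers are eventually reclaimed even if cleanup fails.
    static String chooseTempRoot(boolean trashEnabled, String trashRoot) {
        if (trashEnabled) {
            String stamp = new SimpleDateFormat("yyMMddHHmm").format(new Date());
            return trashRoot + "/" + stamp;
        }
        // Trash not configured: fall back to the existing location, and rely
        // on the shutdown-hook cleanup of step (3).
        return CURRENT_TEMP_ROOT;
    }
}
```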

Longer term, I am talking with the hadoop team about better temp file support: https://issues.apache.org/jira/browse/HADOOP-2815

Comments? Suggestions?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-116) pig leaves temp files behind

Posted by "Pi Song (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12571880#action_12571880 ] 

Pi Song commented on PIG-116:
-----------------------------

In the short term that will be fine, but I would like to give this a bit more thought: this approach seems to rely on the trash feature of HDFS, which is not common among other file systems.

Looking at commercial DBMSs, what they do is:

- Rely on a temp folder and temp files *on top of* file systems.
- The user can customize the location and size limit.
- The folder is cleaned up every time the system restarts (comparable to when pig restarts, not when hadoop restarts, so this can be more frequent).
- The old temp files are removed when the system is running low on temp space.
- Temp files whose lifecycle is explicitly known may be marked for collection at the appropriate time (like the concept of GC).

This approach looks good to me because:
- temp file management will not rely on the file system implementation
- it works well when running on a file system that has disk quotas enabled (I saw a discussion about disk quotas on HDFS a few months ago)
- if the JVM crashes during operation, it's still ok, because the next time pig runs the temp folder will be cleaned up

However, this doesn't consider checkpointing/resumed runs after a crash.
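A minimal sketch of the startup-cleanup idea described above, using plain java.io on a local directory. The class and method names are invented for illustration; a real version would also enforce the size limit, handle nested directories, and implement the GC-style marking mentioned above.

```java
import java.io.File;
import java.io.IOException;

public class LocalTempManager {
    // Hypothetical DBMS-style temp folder manager: one dedicated folder,
    // wiped every time the system (here: pig, not hadoop) starts up.
    private final File root;

    LocalTempManager(String path) {
        this.root = new File(path);
    }

    // Called once at startup: remove leftovers from previous (possibly
    // crashed) runs, then make sure the folder exists.
    void cleanOnStartup() {
        File[] leftovers = root.listFiles();
        if (leftovers != null) {
            for (File f : leftovers) {
                f.delete(); // flat layout assumed for brevity
            }
        }
        root.mkdirs();
    }

    // Hand out temp files under the managed root.
    File newTempFile(String name) {
        File f = new File(root, name);
        try {
            f.createNewFile();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return f;
    }

    // How many temp files currently live under the root.
    int fileCount() {
        File[] fs = root.listFiles();
        return fs == null ? 0 : fs.length;
    }
}
```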

My 2 cents.

[jira] Commented: (PIG-116) pig leaves temp files behind

Posted by "Pi Song (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574022#action_12574022 ] 

Pi Song commented on PIG-116:
-----------------------------

Again, I just want to emphasize that registering a clean-up function is not a common file system feature. We should stick to POSIX for file manipulation to keep it generic. Then I believe this can be implemented as a Hadoop-specific backend configuration.

[jira] Commented: (PIG-116) pig leaves temp files behind

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572283#action_12572283 ] 

Olga Natkovich commented on PIG-116:
------------------------------------

Pi, I agree that the temp space solution is the right way to go. In fact, I have already discussed this with the hadoop team, and they agreed in principle that this is the right thing to do, though it is not clear when that might happen.

Short term, we agreed on a better solution as well. They would provide the application a way to register cleanup functions that are guaranteed to run before DFS shuts down.

[jira] Commented: (PIG-116) pig leaves temp files behind

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12573971#action_12573971 ] 

Olga Natkovich commented on PIG-116:
------------------------------------

Update: we would add the call to clean up the files just before the main exit. This will solve the issue for normally completing pig jobs, but not for jobs that were killed.

The Hadoop team will consider fixing this in the 0.17 release.
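The change described here amounts to moving cleanup from the shutdown hook onto the main path, e.g. a try/finally around the job run. This toy sketch (all names hypothetical, with a flag standing in for the real DFS delete calls) shows why normal completion and exceptions are covered while a killed JVM is not: a kill never reaches the finally block.

```java
public class MainExitCleanup {
    // Stand-in flag so the sketch is observable; the real code would
    // issue DFS delete calls while the file system is still open.
    static boolean cleaned = false;

    static void deleteTempFiles() {
        cleaned = true;
    }

    static int runJob() {
        try {
            // Stand-in for launching the M-R pipeline.
            return 0;
        } finally {
            // Runs on normal completion and on exceptions, but a killed
            // JVM (e.g. kill -9) never executes this block.
            deleteTempFiles();
        }
    }
}
```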

[jira] Commented: (PIG-116) pig leaves temp files behind

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12571620#action_12571620 ] 

Olga Natkovich commented on PIG-116:
------------------------------------

The following config params in hadoop tell if trash is enabled and where:

<property>
  <name>fs.trash.root</name>
  <value>${hadoop.tmp.dir}/Trash</value>
  <description>The trash directory, used by FsShell's 'rm' command.
  </description>
</property>

<property>
  <name>fs.trash.interval</name>
  <value>0</value>
  <description>Number of minutes between trash checkpoints.
  If zero, the trash feature is disabled.
  </description>
</property>

The format of the directory names to create is yyMMddHHmm.
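A sketch of how a client might consume these two params and build a checkpoint-style directory name. A plain Map stands in for Hadoop's Configuration object to keep the example self-contained, and the class and method names are illustrative, not real Pig or Hadoop API.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Map;

public class TrashConfig {
    // fs.trash.interval is the number of minutes between trash checkpoints;
    // per the description above, zero means the trash feature is disabled.
    static boolean trashEnabled(Map<String, String> conf) {
        String interval = conf.getOrDefault("fs.trash.interval", "0");
        return Integer.parseInt(interval.trim()) > 0;
    }

    // Directories under the trash root (fs.trash.root) use the
    // yyMMddHHmm naming pattern.
    static String checkpointDirName(Date when) {
        return new SimpleDateFormat("yyMMddHHmm").format(when);
    }
}
```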
