You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2006/11/03 00:14:16 UTC

[jira] Created: (HADOOP-673) the task execution environment should have a current working directory that is task specific

the task execution environment should have a current working directory that is task specific
--------------------------------------------------------------------------------------------

                 Key: HADOOP-673
                 URL: http://issues.apache.org/jira/browse/HADOOP-673
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.7.2
            Reporter: Owen O'Malley
         Assigned To: Owen O'Malley
             Fix For: 0.9.0


The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12451451 ] 
            
Doug Cutting commented on HADOOP-673:
-------------------------------------

> if there was a convenient way to make the job cache files read only [ ... ]

runtime.exec(new String[] { "chmod" "-w", file });

Works under cygwin!

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>             Fix For: 0.9.0
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-673?page=all ]

Doug Cutting updated HADOOP-673:
--------------------------------

           Status: Resolved  (was: Patch Available)
    Fix Version/s: 0.10.0
       Resolution: Fixed

I just committed this.  Thanks, Mahadev!

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>             Fix For: 0.10.0
>
>         Attachments: cwd.patch, cwd_1.patch
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Assigned: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-673?page=all ]

Mahadev konar reassigned HADOOP-673:
------------------------------------

    Assignee: Mahadev konar  (was: Owen O'Malley)

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>             Fix For: 0.9.0
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12457484 ] 
            
Hadoop QA commented on HADOOP-673:
----------------------------------

+1, http://issues.apache.org/jira/secure/attachment/12346718/cwd_1.patch applied and successfully tested against trunk revision r483772

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>         Attachments: cwd.patch, cwd_1.patch
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12455819 ] 
            
Owen O'Malley commented on HADOOP-673:
--------------------------------------

I'm worried about the performance of using getCanonicalFile on every directory during the recursion, since it must traverse the entire path up to '/' looking for symlinks.

Another solution would be to just delete the directory first using File.delete() and if it is a symlink, it will work. (With the slight assumption, which could be checked, that the parent directory was writable.)

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>         Attachments: cwd.patch
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12451450 ] 
            
Owen O'Malley commented on HADOOP-673:
--------------------------------------

Ok, I propose that we handle this as a streaming option. So:

1. For all map/reduce jobs the task has current working directory set to the task specific directory.
2. For streaming if the property "mapred.create.symlink" is true, the contents of the job directory (other than the task directories) are symlinked into the task directory. Note that this property is already used for the file cache for a similar purpose and in streaming defaults to true.

If there was a convenient way to make the job cache files read only in Java, that would make me more comfortable about violating the sandbox isolation. But as it is, I guess it is upto the application writer to make sure they don't change any of the shared files.

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>             Fix For: 0.9.0
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12446770 ] 
            
Mahadev konar commented on HADOOP-673:
--------------------------------------

I do think the suggestion is right-- only that it would involve changes to hadoop-streaming and might break other applications that assume the current working directory to be the directory where the unjarred files lie. That was the only reason I chose the current working directory to be the same for all tasks, but it does seem ugly that all the tasks would share the same workspace. I can make the required changes if people agree on it.

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Owen O'Malley
>             Fix For: 0.9.0
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Richard Kasperski (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12451359 ] 
            
Richard Kasperski commented on HADOOP-673:
------------------------------------------

Owen,
  There are two primary reasons. 
1. Maybe I don't have access to the source of all programs that I might want to run, or I have the source and don't desire to make any changes. Having the data for an application be rooted in the current directory allows many more programs to be run with no changes.
2. I think of hadoop as an infrastructure to run other progams. not as something to write programs too. This means that I do not want to write into my progams hadoop'isms. 

Running streaming apps on hadoop should be as transparent as possible.

Files being a shared resource is an efficiency consideration not one required for proper operation. Is there an operational requirement for the files being shared such as I need to mmap the same file? Is this likely the way that most programs will be. 

It's not that I'm absolutely against the .. structure, just that it seems to me not to be the most useful structure. I would contend that if you desire the .. placement you need also, some how, provide for placement in the current working directory.

The above comments only apply to streaming. 

A more general comment, perhaps philosophical, perhaps religious.

The purpose of hadoop is to allow work to be divided and then executed with a belief of correctness that is the same as would have if I ran it as a single job on a single machine. Sharing data in a way that allows inadvertant/unintended manipulation of 'SHARED' data violates this. It seems to me that this should be used when efficiency is needed and the user can take an appropriate explicit action to enable it. 




> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>             Fix For: 0.9.0
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-673?page=all ]

Mahadev konar updated HADOOP-673:
---------------------------------

    Attachment: cwd.patch

this patch fixes the problem. It implements owens second proposal ie.
1) The current working dir for each task is the task directory
2) for streaming all the contents of job.jar are symlinked into this current working directory for the tasks.

This patch depends upon HADOOP-748. I have included the patch for HADOOP-748 in this patch.

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>         Attachments: cwd.patch
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: [jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by Arkady Borkovsky <ar...@yahoo-inc.com>.
+1 for symlinks.

On Nov 2, 2006, at 3:31 PM, Dick King (JIRA) wrote:

>     [  
> http://issues.apache.org/jira/browse/HADOOP-673? 
> page=comments#action_12446772 ]
>
> Dick King commented on HADOOP-673:
> ----------------------------------
>
> If you create links from the task-specific directory to the  
> per-job-directory files and make the task-specific directory you'll  
> avoid breaking existing code.  This assumes people don't write in the  
> per-job-directory but it's tough to do that in a principled manner so  
> I wouldn't expect that to happen.
>
>> the task execution environment should have a current working  
>> directory that is task specific
>> ---------------------------------------------------------------------- 
>> ----------------------
>>
>>                 Key: HADOOP-673
>>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>>             Project: Hadoop
>>          Issue Type: Bug
>>          Components: mapred
>>    Affects Versions: 0.7.2
>>            Reporter: Owen O'Malley
>>         Assigned To: Owen O'Malley
>>             Fix For: 0.9.0
>>
>>
>> The tasks should be run in a work directory that is specific to a  
>> single task. In particular, I'd suggest using the  
>> <local>/jobcache/<jobid>/<taskid> as the current working directory.
>
> -- 
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the  
> administrators:  
> http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see:  
> http://www.atlassian.com/software/jira
>
>


[jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Dick King (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12446772 ] 
            
Dick King commented on HADOOP-673:
----------------------------------

If you create links from the task-specific directory to the per-job-directory files and make the task-specific directory you'll avoid breaking existing code.  This assumes people don't write in the per-job-directory but it's tough to do that in a principled manner so I wouldn't expect that to happen.

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Owen O'Malley
>             Fix For: 0.9.0
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12455833 ] 
            
Andrzej Bialecki  commented on HADOOP-673:
------------------------------------------

One note re: Cygwin - I discovered that you can't access special filesystems provided by Cygwin, such as /proc, /tmp, mountpoints (/cygdrive/c) etc. exactly because the filesystem layer is implemented in JVM and is unaware of Cygwin, its path translations and its synthetic filesystems ...

So, when I tried to resolve "/proc/self", which is a symbolic link to the current PID under Cygwin, I got nothing because this path is inside Cygwin's "procfs", which invisible to JVM.

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>         Attachments: cwd.patch
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12455814 ] 
            
Sameer Paranjpye commented on HADOOP-673:
-----------------------------------------

Can the difference between getAbsoluteFile() and getCanonicalFile() be used reliably? What happens under cygwin?

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>         Attachments: cwd.patch
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12455852 ] 
            
Doug Cutting commented on HADOOP-673:
-------------------------------------

We depend on cygwin only to access OS features that are not otherwise available to Java.  This way we can access such features uniformly on unix and windows systems, with identical code in most places, letting cygwin figure out how to implement things on windows, rather than maintaining two implementations of each.

> when I tried to resolve "/proc/self" [...] under Cygwin, I got nothing

Right, that path is not a Windows path, and hence only makes sense to cygwin binaries, so it couldn't be accessed directly from Java, which only sees the Windows filesystem.  It could be accessed, e.g., by running 'cat /proc/self' in a sub-process, but that would probably get the sub-process, not the parent.


> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>         Attachments: cwd.patch
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-673?page=all ]

Doug Cutting updated HADOOP-673:
--------------------------------

    Status: Open  (was: Patch Available)

This patch changes fullyDelete() to run 'rm -rf'.  We should only rely on unix command-line tools when we absolutely must, and I don't think we have to here.  Instead, before listing a directory, we should compare getAbsoluteFile() to getCanonicalFile() to determine whether it's a symbolic link.  If it is a symbolic link, then we should ignore it.

Also, perhaps most of the code added to TaskRunner could be instead added to DistributedCache, since it's cache-specific stuff?

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>         Attachments: cwd.patch
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12455834 ] 
            
Raghu Angadi commented on HADOOP-673:
-------------------------------------

If we depend on Cygwin functionality, then shouldn't JVM also be Cygwin's JVM package (if one exists)? 

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>         Attachments: cwd.patch
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-673?page=all ]

Mahadev konar updated HADOOP-673:
---------------------------------

    Status: Patch Available  (was: Open)

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>         Attachments: cwd.patch
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Richard Kasperski (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12451795 ] 
            
Richard Kasperski commented on HADOOP-673:
------------------------------------------

Owen, 
 What you have proposed sounds great! It allows external programs written to run on the cluster and legacy apps to run. thanks

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>             Fix For: 0.9.0
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12450942 ] 
            
Owen O'Malley commented on HADOOP-673:
--------------------------------------

Richard, I don't understand why having the jar expanded in ".." is unacceptable to you. It seems to me the requirements are:
  1. there be an easy way to get to the expanded jar
  2. each task run in sandbox
In previous systems we've taken this approach and it has worked very well.

The problem with links is that they are not supported in Java, which is our implementation language, or in Windows, which is a supported platform. Furthermore, since the files are a SHARED resource it is not appropriate for them to be hardlinked into the sandbox. If the application uses "../foo.txt" they are aware that they are leaving the sandbox and that they need to play nicely.



> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>             Fix For: 0.9.0
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: [jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by Arkady Borkovsky <ar...@yahoo-inc.com>.
+1  on Richard's comments
+1  on Dick's (MR should symlink the files that belong to the task into  
the task running directory)
-10 on ../work

On Nov 17, 2006, at 4:25 PM, Richard Kasperski (JIRA) wrote:

>     [  
> http://issues.apache.org/jira/browse/HADOOP-673? 
> page=comments#action_12450936 ]
>
> Richard Kasperski commented on HADOOP-673:
> ------------------------------------------
>
> I think that it is very important that there be a way for an  
> application to run in a sandbox <local>/jobcache/<jobid>/<taskid>   
> that contains the contents of the jar file and is the current working  
> directory. How one accomplishes this is an implementation detail. If  
> the only way to do it is to unjar the archive more than once then I  
> guess that would have to be the solution. This saves the potential for  
> a lot of grief. No shared files are ever modified because they aren't  
> actually shared. This causes more unpacking of jars but I don't really  
> see that as a problem. How the jar's are copied to a node and the   
> subsequent reuse of the jar is important.
>
> That is the sandbox. Then there is the shared sandbox which is also  
> two different instance to memory map a file and pay a single cost.  
> This is best handled by either softlinks or hardlinks under  
> unix/linux.
>
> Why do I think these are important models? Most of the programs that I  
> write and that I use run out of the current directory and expect all  
> of the their configuration files/resources can be read from there. For  
> programs that have more sophisticated models of deployment the models  
> above are still ok. For the simpler programs the proposed external  
> repository doesn't work.
>
> OTOH I can always run a script that will hard link the files from the  
> directory where the jar to my current working directory. I just don't  
> believe that this should be forced on the users. Having the users do  
> system'ish things is potentially dangerous.
>
> Even more restrictive then the above sandbox would be one in which the  
> application is run chroot'd. That way there is no way it could muck  
> with anything system like on the nodes. This is an important  
> consideration when one lets arbitrary programs to be run.
>
>> the task execution environment should have a current working  
>> directory that is task specific
>> ---------------------------------------------------------------------- 
>> ----------------------
>>
>>                 Key: HADOOP-673
>>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>>             Project: Hadoop
>>          Issue Type: Bug
>>          Components: mapred
>>    Affects Versions: 0.7.2
>>            Reporter: Owen O'Malley
>>         Assigned To: Mahadev konar
>>             Fix For: 0.9.0
>>
>>
>> The tasks should be run in a work directory that is specific to a  
>> single task. In particular, I'd suggest using the  
>> <local>/jobcache/<jobid>/<taskid> as the current working directory.
>
> -- 
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the  
> administrators:  
> http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see:  
> http://www.atlassian.com/software/jira
>
>


[jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Richard Kasperski (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12450936 ] 
            
Richard Kasperski commented on HADOOP-673:
------------------------------------------

I think that it is very important that there be a way for an application to run in a sandbox <local>/jobcache/<jobid>/<taskid>  that contains the contents of the jar file and is the current working directory. How one accomplishes this is an implementation detail. If the only way to do it is to unjar the archive more than once then I guess that would have to be the solution. This saves the potential for a lot of grief. No shared files are ever modified because they aren't actually shared. This causes more unpacking of jars but I don't really see that as a problem. How the jar's are copied to a node and the  subsequent reuse of the jar is important. 

That is the sandbox. Then there is the shared sandbox which is also two different instance to memory map a file and pay a single cost. This is best handled by either softlinks or hardlinks under unix/linux. 

Why do I think these are important models? Most of the programs that I write and that I use run out of the current directory and expect all of the their configuration files/resources can be read from there. For programs that have more sophisticated models of deployment the models above are still ok. For the simpler programs the proposed external repository doesn't work.

OTOH I can always run a script that will hard link the files from the directory where the jar to my current working directory. I just don't believe that this should be forced on the users. Having the users do system'ish things is potentially dangerous.

Even more restrictive then the above sandbox would be one in which the application is run chroot'd. That way there is no way it could muck with anything system like on the nodes. This is an important consideration when one lets arbitrary programs to be run. 

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>             Fix For: 0.9.0
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-673?page=all ]

Mahadev konar updated HADOOP-673:
---------------------------------

    Status: Patch Available  (was: Open)

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>         Attachments: cwd.patch, cwd_1.patch
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12450931 ] 
            
Mahadev konar commented on HADOOP-673:
--------------------------------------

I have the following proposal:

The tasks run in the task specific directory that is 
<local>/jobcache/<jobid>/<taskid> to create a virtual isolated environment for the tasks 

and streaming and all other applications access the job.jar files 
using 
../work/

does that seem ok?
I will be implementing this soon so please comment if you have any suggestions or problems with the above proposal.

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Owen O'Malley
>             Fix For: 0.9.0
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12455829 ] 
            
Doug Cutting commented on HADOOP-673:
-------------------------------------

> Can the difference between getAbsoluteFile() and getCanonicalFile() be used reliably?

I think so.  The docs for getCanonicalFile() explicitly mention symbolic links, while getAbsoluteFile() is merely supposed to resolve relative paths against CWD.

> What happens under cygwin?

Cygwin shouldn't impact this, since these are implemented by the JVM.


> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>         Attachments: cwd.patch
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-673?page=all ]

Mahadev konar updated HADOOP-673:
---------------------------------

    Attachment: cwd_1.patch

This patch uses java for deleteting the symlinks and have moved the creating of symliks to DistributedCache. The java code tries deleting a directory first which succeeds if its empty or a symlink and if it does not succeed it tries deleting the directory recursively. This works both on cygwin and linux.

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>         Attachments: cwd.patch, cwd_1.patch
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (HADOOP-673) the task execution environment should have a current working directory that is task specific

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12455734 ] 
            
Hadoop QA commented on HADOOP-673:
----------------------------------

+1, http://issues.apache.org/jira/secure/attachment/12346398/cwd.patch applied and successfully tested against trunk revision r481432

> the task execution environment should have a current working directory that is task specific
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-673
>                 URL: http://issues.apache.org/jira/browse/HADOOP-673
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Owen O'Malley
>         Assigned To: Mahadev konar
>         Attachments: cwd.patch
>
>
> The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira