Posted to common-dev@hadoop.apache.org by "Amar Kamat (JIRA)" <ji...@apache.org> on 2009/04/15 14:37:15 UTC

[jira] Issue Comment Edited: (HADOOP-3578) mapred.system.dir should be accessible only to hadoop daemons

    [ https://issues.apache.org/jira/browse/HADOOP-3578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699170#action_12699170 ] 

Amar Kamat edited comment on HADOOP-3578 at 4/15/09 5:37 AM:
-------------------------------------------------------------

Some more details (a rough rpc sketch follows the list):
# The jobclient requests the jobtracker for a new job id
# Along with the libs/archives, the jobclient also uploads the job.jar to the DistributedCache and creates a symlink to it (here the TaskRunner will localize the jars). With HADOOP-4490 (and security in distributed cache), the taskrunner will run with the user's permissions and hence will be able to securely localize the job jar
# The jobclient now starts the transaction by passing the jobconf to the jobtracker. We expect the jobconf to be lightweight and hence pass it completely over the rpc.
  ## If the job (jobconf) fails the checks (ACLs etc.) at the jobtracker, this job is ignored
  ## The jobtracker now maintains the jobid to user mapping for this job. This is done to make sure that only the user who owns the job can upload/add the splits
  ## Finally, the jt localizes the jobconf to system-dir/jobid/job.xml so that the tasks are able to load the conf.
# The jobclient now uploads the job splits (in chunks of 1000 splits) to the jobtracker
  ## The jobtracker will check if the user is the owner of the job
  ## The jobtracker will maintain a mapping from jobid to the (split) file handle for that job
  ## This split file is opened as system-dir/jobid/job.split
  ## The jobtracker will stream all the splits passed by the client to this file
# The jobclient now finishes the transaction by invoking submitJob().
  ## The jobtracker will first close the open file handle for the jobsplit 
  ## The jt will clean up the structures maintained for the transaction
  ## do what is done today upon a job submission (note that by now job.split and job.jar are both present in the system dir)
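
Here is a rough sketch of what the client-facing rpc for the flow above could look like. The interface and method names (JobSubmissionSketch, startTransaction, addSplits etc.) are only illustrative and are not the existing JobSubmissionProtocol; the comments map each call back to the numbered steps.

{code:java}
import java.util.List;

/** Illustrative only: one possible shape for the chunked-submission rpc. */
public interface JobSubmissionSketch {

  // Step 1: the client asks the jobtracker for a fresh job id.
  String getNewJobId();

  // Step 3: the client opens the transaction by shipping the (lightweight)
  // jobconf over the rpc; the jobtracker runs the acl checks, records the
  // jobid -> user mapping and writes system-dir/jobid/job.xml.
  void startTransaction(String jobId, byte[] serializedJobConf);

  // Step 4: the client uploads splits in chunks (say 1000 at a time); the
  // jobtracker verifies that the caller owns the job and streams the splits
  // to system-dir/jobid/job.split through a cached open file handle.
  void addSplits(String jobId, List<byte[]> serializedSplits);

  // Step 5: the client closes the transaction; the jobtracker closes the
  // split file, drops the per-transaction state and then does whatever
  // submitJob() does today.
  void submitJob(String jobId);
}
{code}

A client would call getNewJobId(), push job.jar to the DistributedCache, then call startTransaction(), loop over addSplits() in chunks of 1000 and finish with submitJob().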

Questions:
# What if the jobconf is large? Do we need to page it too?
# How many (job-split) files to support in parallel (as the number of open file handles can lead to issues)?
  ## One way to do it would be to cap it at 200 uploads in parallel
# How to take care of dead jobclients?
  ## Start an expiry thread that will clean up dead/hung job submissions (every 5 mins); a rough sketch of this (together with the upload cap above) follows the list
# How to prevent the jobclients from passing too many splits (say 100,000 splits) in one rpc call?
  ## Looks like this should be capped at the rpc level. I am not sure if there is any provision for something like this. For now we can leave it as it is.
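
For questions 2 and 3 above, here is a minimal sketch of the bookkeeping the jobtracker could keep, assuming the 200-upload cap and the 5-minute scan mentioned above; the class, field and method names (SubmissionExpiryThread, register, touch, unregister) are made up for this discussion, and the 10-minute idle timeout is an assumed value.

{code:java}
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only: cap on concurrent uploads plus a periodic expiry scan. */
public class SubmissionExpiryThread extends Thread {

  /** Per-transaction bookkeeping kept while a client is still uploading splits. */
  static class PendingSubmission {
    volatile long lastTouched = System.currentTimeMillis();
    // the real structure would also hold the owner and the open handle
    // to system-dir/jobid/job.split
  }

  private static final int  MAX_CONCURRENT_UPLOADS = 200;        // question 2
  private static final long SCAN_INTERVAL = 5 * 60 * 1000L;      // question 3: every 5 mins
  private static final long IDLE_TIMEOUT  = 10 * 60 * 1000L;     // assumed value for a hung client

  private final Map<String, PendingSubmission> pending =
      new ConcurrentHashMap<String, PendingSubmission>();

  /** Called when a client starts a transaction. */
  public void register(String jobId) {
    if (pending.size() >= MAX_CONCURRENT_UPLOADS) {
      throw new IllegalStateException("too many concurrent job submissions");
    }
    pending.put(jobId, new PendingSubmission());
  }

  /** Called on every addSplits()/submitJob() rpc to mark the client as alive. */
  public void touch(String jobId) {
    PendingSubmission p = pending.get(jobId);
    if (p != null) {
      p.lastTouched = System.currentTimeMillis();
    }
  }

  /** Called once the transaction completes normally. */
  public void unregister(String jobId) {
    pending.remove(jobId);
  }

  @Override
  public void run() {
    while (!isInterrupted()) {
      try {
        Thread.sleep(SCAN_INTERVAL);
      } catch (InterruptedException e) {
        return;
      }
      long now = System.currentTimeMillis();
      Iterator<Map.Entry<String, PendingSubmission>> it = pending.entrySet().iterator();
      while (it.hasNext()) {
        if (now - it.next().getValue().lastTouched > IDLE_TIMEOUT) {
          it.remove(); // the real cleanup would also close the split file
                       // and delete system-dir/jobid
        }
      }
    }
  }
}
{code}

register() would be called from startTransaction(), touch() from every addSplits(), and unregister() from submitJob(); anything a dead client leaves behind gets swept by the periodic scan.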

----
Thoughts?

> mapred.system.dir should be accessible only to hadoop daemons 
> --------------------------------------------------------------
>
>                 Key: HADOOP-3578
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3578
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>
> Currently the jobclient accesses the {{mapred.system.dir}} to add job details. Hence the {{mapred.system.dir}} has the permissions of {{rwx-wx-wx}}. This could be a security loophole where the job files might get overwritten/tampered after the job submission. 
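
Just to make the intent of the summary concrete (this is not a patch): a minimal sketch, assuming the jobtracker process owns {{mapred.system.dir}}, that creates the directory with rwx------ so that only the daemon user can create or modify the job files in it. FsPermission and FileSystem.mkdirs()/setPermission() are the existing HDFS permission APIs; the fallback path below is arbitrary.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class SystemDirSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // the fallback value is arbitrary, only here to make the sketch runnable
    Path systemDir = new Path(conf.get("mapred.system.dir", "/tmp/hadoop/mapred/system"));
    FileSystem fs = systemDir.getFileSystem(conf);

    // rwx------ instead of the current rwx-wx-wx: only the daemon user that
    // runs the jobtracker can create or modify job files under the system dir
    FsPermission daemonOnly = new FsPermission((short) 0700);
    if (!fs.exists(systemDir)) {
      fs.mkdirs(systemDir, daemonOnly);
    } else {
      fs.setPermission(systemDir, daemonOnly);
    }
  }
}
{code}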
