
Posted to common-user@hadoop.apache.org by "Zhenhua (Gerald) Guo" <je...@gmail.com> on 2011/10/07 19:09:52 UTC

Is it possible to run multiple MapReduce against the same HDFS?

I plan to deploy an HDFS cluster that will be shared by multiple
MapReduce clusters.
I wonder whether this is possible.  Will it cause any conflicts among
the MapReduce clusters (e.g. different clusters trying to use the same
temp directory in HDFS)?
If it is possible, how should the security parameters be set up (e.g.
user identity, file permissions)?

Thanks,

Gerald

Re: Is it possible to run multiple MapReduce against the same HDFS?

Posted by Robert Evans <ev...@yahoo-inc.com>.

I am not positive how all of that works, and I may get some of this wrong, but I believe that the MapReduce user has special privileges in HDFS that allow it to act as another user and read data on that user's behalf.  I think that these privileges are granted by the user when the user connects to the JT.  I am not an expert on how security in Hadoop works, so if someone on the list wants to correct or confirm what I have said, that would be great.

--
Bobby Evans
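
The idea Bobby describes above can be pictured with a toy model. This is only a sketch under the assumptions in his message, not Hadoop's actual API: the class ToyHdfs, the account names "mapred" and "gerald", and the owner-only read rule are all hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHdfs:
    """Toy stand-in for a file system that checks who is reading a path."""
    owners: dict            # path -> owning user (owner-only reads, a simplification)
    trusted_services: set   # service accounts allowed to act for other users
    grants: set = field(default_factory=set)  # (user, service) pairs

    def grant(self, user: str, service: str) -> None:
        """Record that `user` lets `service` act on their behalf."""
        self.grants.add((user, service))

    def can_read(self, path: str, caller: str, on_behalf_of: str = None) -> bool:
        """Check a read made directly, or by a service acting for a user."""
        if on_behalf_of is None:
            # Direct access: the check is against the caller's own identity.
            return self.owners.get(path) == caller
        # Delegated access: the service must be trusted AND hold a grant
        # from the user; the check is then made as that user.
        delegated = (caller in self.trusted_services
                     and (on_behalf_of, caller) in self.grants)
        return delegated and self.owners.get(path) == on_behalf_of


hdfs = ToyHdfs(owners={"/user/gerald/input": "gerald"},
               trusted_services={"mapred"})

direct = hdfs.can_read("/user/gerald/input", "mapred")
before = hdfs.can_read("/user/gerald/input", "mapred", on_behalf_of="gerald")
hdfs.grant("gerald", "mapred")  # e.g. at job submission time
after = hdfs.can_read("/user/gerald/input", "mapred", on_behalf_of="gerald")
print(direct, before, after)
```

In this model the daemon account never gains access to the file in its own right; it can only read once the submitting user has granted it the ability to act for them.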

On 10/10/11 9:56 PM, "Zhenhua (Gerald) Guo" <je...@gmail.com> wrote:

Thanks, Robert.  I will look into hod.

When the MapReduce framework accesses data stored in HDFS, which
account is used: the account that the MapReduce daemons (e.g. the job
tracker) run as, or the account of the user who submits the job?  If
the HDFS and MapReduce clusters run under different accounts, can the
MapReduce cluster still access HDFS directories and files (if
authentication in HDFS is enabled)?

Thanks!

Gerald


Re: Is it possible to run multiple MapReduce against the same HDFS?

Posted by "Zhenhua (Gerald) Guo" <je...@gmail.com>.

Thanks, Robert.  I will look into hod.

When the MapReduce framework accesses data stored in HDFS, which
account is used: the account that the MapReduce daemons (e.g. the job
tracker) run as, or the account of the user who submits the job?  If
the HDFS and MapReduce clusters run under different accounts, can the
MapReduce cluster still access HDFS directories and files (if
authentication in HDFS is enabled)?

Thanks!

Gerald

On Mon, Oct 10, 2011 at 12:36 PM, Robert Evans <ev...@yahoo-inc.com> wrote:
> It should be possible for multiple map/reduce clusters to share the same HDFS.  You can look at HOD (Hadoop On Demand), which launches a JT on demand.  The only chance of collision that I can think of would be if, by some odd chance, both Job Trackers were started at exactly the same millisecond.  The JT uses the time it was started as part of the job ID for all of its jobs.  Those job IDs are assumed to be unique and are used to create files and directories in HDFS that store data for that job.
>
> --Bobby Evans

Re: Is it possible to run multiple MapReduce against the same HDFS?

Posted by Robert Evans <ev...@yahoo-inc.com>.

It should be possible for multiple map/reduce clusters to share the same HDFS.  You can look at HOD (Hadoop On Demand), which launches a JT on demand.  The only chance of collision that I can think of would be if, by some odd chance, both Job Trackers were started at exactly the same millisecond.  The JT uses the time it was started as part of the job ID for all of its jobs.  Those job IDs are assumed to be unique and are used to create files and directories in HDFS that store data for that job.

--Bobby Evans
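
The collision described above can be sketched in a few lines. This is an illustration only, not Hadoop code: the function names, the directory paths, and the timestamp resolution (TIME_FORMAT) are assumptions made for the example. Whatever the real resolution is, the argument is the same: two trackers that produce the same start-time string and share a base directory would produce the same per-job path.

```python
from datetime import datetime

# Assumed resolution of the start-time field; the real one may differ.
TIME_FORMAT = "%Y%m%d%H%M"


def tracker_identifier(start: datetime) -> str:
    """Turn a tracker's start time into the identifier embedded in job IDs."""
    return start.strftime(TIME_FORMAT)


def job_id(tracker_id: str, sequence: int) -> str:
    """Build a job ID from the tracker identifier and a per-tracker counter."""
    return f"job_{tracker_id}_{sequence:04d}"


def job_dir(base_dir: str, jid: str) -> str:
    """Directory in the shared file system where a job's data would live."""
    return f"{base_dir.rstrip('/')}/{jid}"


# Two trackers whose start times format to the same string, and a third
# that started later.
jt_a = tracker_identifier(datetime(2011, 10, 10, 12, 36, 5))
jt_b = tracker_identifier(datetime(2011, 10, 10, 12, 36, 40))
jt_c = tracker_identifier(datetime(2011, 10, 10, 12, 37, 0))

# Same identifier + same base directory -> same path (collision).
collide = job_dir("/mapred/system", job_id(jt_a, 1)) == \
    job_dir("/mapred/system", job_id(jt_b, 1))

# Different start time -> different path, even with a shared base directory.
distinct_time = job_dir("/mapred/system", job_id(jt_a, 1)) != \
    job_dir("/mapred/system", job_id(jt_c, 1))

# Hypothetical per-cluster base directories avoid the clash regardless.
distinct_base = job_dir("/mapred/system-a", job_id(jt_a, 1)) != \
    job_dir("/mapred/system-b", job_id(jt_b, 1))

print(job_id(jt_a, 1), collide, distinct_time, distinct_base)
```

Giving each cluster its own base directory, as in the last comparison, removes the dependence on start times being unique.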
