Posted to mapreduce-user@hadoop.apache.org by Anfernee Xu <an...@gmail.com> on 2014/01/28 19:04:11 UTC

HDFS Federation address performance issue

Hi,

Based on
http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/Federation.html#Key_Benefits,
overall performance can be improved by federation, but I'm not sure whether
federation addresses my use case. Could someone elaborate?

My use case: I have a single NN and several DNs, and a number of concurrent
MR jobs that create new files (plain files and subdirectories) under the
same parent directory. My question is:

1) Will these concurrent writes (new plain files and subdirectories under
the same parent directory) run sequentially because write-once control is
governed by a single NN?

I need this answer to estimate the necessity of moving to HDFS federation.

Thanks

-- 
--Anfernee

Re: HDFS Federation address performance issue

Posted by Anfernee Xu <an...@gmail.com>.
Thanks Daryn. I just want to confirm that I can get a performance
improvement with federation before I start the effort (I would have to
redesign my data schema so that the data can live in different namespaces).


On Tue, Jan 28, 2014 at 10:53 AM, Daryn Sharp <da...@yahoo-inc.com> wrote:

>  Hi Anfernee,
>
>  You will achieve improved performance with federation only if you stripe
> files across the multiple NNs.  Federation basically shares DN storage with
> multiple NNs with the expectation the namespace load will be distributed
> across the multiple NNs.  If everything writes to the exact same parent
> directory then no benefit is achieved over a single NN.  You will need to
> partition your jobs so some write to one NN, other jobs write to the other
> NN(s).
>
>  I hope this helps!
>
>  Daryn
>
>  On Jan 28, 2014, at 12:04 PM, Anfernee Xu <an...@gmail.com>
>  wrote:
>
>    Hi,
>
>  Based on
> http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/Federation.html#Key_Benefits,
> overall performance can be improved by federation, but I'm not sure whether
> federation addresses my use case. Could someone elaborate?
>
>  My use case: I have a single NN and several DNs, and a number of concurrent
> MR jobs that create new files (plain files and subdirectories) under the
> same parent directory. My question is:
>
> 1) Will these concurrent writes (new plain files and subdirectories under
> the same parent directory) run sequentially because write-once control is
> governed by a single NN?
>
>  I need this answer to estimate the necessity of moving to HDFS federation.
>
>  Thanks
>
> --
> --Anfernee
>
>
>


-- 
--Anfernee

Re: HDFS Federation address performance issue

Posted by Daryn Sharp <da...@yahoo-inc.com>.
Hi Anfernee,

You will achieve improved performance with federation only if you stripe files across the multiple NNs.  Federation basically shares DN storage with multiple NNs with the expectation the namespace load will be distributed across the multiple NNs.  If everything writes to the exact same parent directory then no benefit is achieved over a single NN.  You will need to partition your jobs so some write to one NN, other jobs write to the other NN(s).
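
One way to picture the partitioning described above is a small helper that assigns each job to a namespace by hashing its identifier. This is only a sketch under stated assumptions: the NameNode URIs, the base path, and the function name `output_dir_for` are hypothetical and not part of Hadoop, and a real deployment might instead partition by dataset or use a client-side mount table.

```python
import hashlib

# Hypothetical NameNode URIs (assumption); a real federated cluster would
# take these from its own configuration.
NAMESPACES = [
    "hdfs://nn1.example.com:8020",
    "hdfs://nn2.example.com:8020",
]


def output_dir_for(job_id: str, base: str = "/data/output") -> str:
    """Return an output URI for job_id, spreading jobs across NAMESPACES.

    The same job_id always maps to the same namespace, so a job's files
    stay together while different jobs land on different NameNodes.
    """
    digest = hashlib.md5(job_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(NAMESPACES)
    return f"{NAMESPACES[index]}{base}/{job_id}"


if __name__ == "__main__":
    for job in ("job_0001", "job_0002", "job_0003", "job_0004"):
        print(job, "->", output_dir_for(job))
```

The hash only decides which NameNode handles the metadata for a job's output. If every job still wrote under one parent directory on one NameNode, the mapping would buy nothing, which is the point made above.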

I hope this helps!

Daryn

On Jan 28, 2014, at 12:04 PM, Anfernee Xu <an...@gmail.com> wrote:

Hi,

Based on http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/Federation.html#Key_Benefits, overall performance can be improved by federation, but I'm not sure whether federation addresses my use case. Could someone elaborate?

My use case: I have a single NN and several DNs, and a number of concurrent MR jobs that create new files (plain files and subdirectories) under the same parent directory. My question is:

1) Will these concurrent writes (new plain files and subdirectories under the same parent directory) run sequentially because write-once control is governed by a single NN?

I need this answer to estimate the necessity of moving to HDFS federation.

Thanks

--
--Anfernee


Re: HDFS Federation address performance issue

Posted by Suresh Srinivas <su...@hortonworks.com>.
Response inline...


On Tue, Jan 28, 2014 at 10:04 AM, Anfernee Xu <an...@gmail.com> wrote:

> Hi,
>
> Based on
> http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/Federation.html#Key_Benefits,
> overall performance can be improved by federation, but I'm not sure whether
> federation addresses my use case. Could someone elaborate?
>
> My use case: I have a single NN and several DNs, and a number of concurrent
> MR jobs that create new files (plain files and subdirectories) under the
> same parent directory. My question is:
>
> 1) Will these concurrent writes (new plain files and subdirectories under
> the same parent directory) run sequentially because write-once control is
> governed by a single NN?
>

The NameNode commits multiple requests in a batch. Within the NameNode
itself, the lock for write operations makes them sequential. But the lock is
held only briefly, so from the perspective of multiple clients, file
creation appears simultaneous.

If you are talking about a single client with a single thread, then it
would be sequential.
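
The behaviour described above can be illustrated with a toy model. This is not the NameNode's actual implementation: the `ToyNamespace` class and the `run_clients` helper are hypothetical, and a plain mutex stands in for whatever lock the NameNode really uses. It shows only that a lock held briefly per operation orders the creates internally while several client threads still make progress together.

```python
import threading


class ToyNamespace:
    """Toy stand-in for a namespace: one lock guards all metadata updates."""

    def __init__(self):
        self._lock = threading.Lock()
        self._children = {}   # parent path -> set of child names
        self.order = []       # order in which creates were applied

    def create(self, parent, name):
        # The lock is held only for the in-memory update, so each create
        # is serialized but the critical section is very short.
        with self._lock:
            entries = self._children.setdefault(parent, set())
            if name in entries:
                raise FileExistsError(f"{parent}/{name}")
            entries.add(name)
            self.order.append(name)

    def count(self, parent):
        with self._lock:
            return len(self._children.get(parent, ()))


def run_clients(ns, parent, n_clients, files_per_client):
    """Start n_clients threads that each create files under one parent."""

    def client(cid):
        for i in range(files_per_client):
            ns.create(parent, f"client{cid}-file{i}")

    threads = [threading.Thread(target=client, args=(c,)) for c in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    ns = ToyNamespace()
    run_clients(ns, "/parent", n_clients=4, files_per_client=50)
    print("files created:", ns.count("/parent"))
```

Every create lands in `order` exactly once, which is the sequential view inside the lock; the client threads do not wait for one another's whole workload, only for each individual update.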

Hope that makes sense.

>
> I need this answer to estimate the necessity of moving to HDFS federation.
>
> Thanks
>
> --
> --Anfernee
>



-- 
http://hortonworks.com/download/

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.
