Posted to user@hadoop.apache.org by Razen Al Harbi <ra...@gmail.com> on 2013/06/10 15:36:37 UTC

ALL HDFS Blocks on the Same Machine if Replication factor = 1

Hello,

I have deployed Hadoop on a cluster of 20 machines and set the replication
factor to one. When I put a file (larger than the HDFS block size) into
HDFS, all of its blocks are stored on the machine where the hadoop fs -put
command is invoked.

For higher replication factors, I see the same behavior, but the replicated
blocks are spread randomly across the other machines.

Is this normal behavior? If not, what would be the cause?

Thanks,

Razen
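
For reference, the block-to-host mapping described here can be inspected
with hadoop fsck <path> -files -blocks -locations, or programmatically
through the standard HDFS Java client. A minimal sketch of the latter; the
path is made up for illustration:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS etc. from core-site.xml/hdfs-site.xml
        // on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; substitute the file you uploaded.
        Path file = new Path("/user/razen/bigfile.dat");
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode which DataNodes hold each block of the file.
        for (BlockLocation block :
                fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d hosts=%s%n",
                    block.getOffset(), Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}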

Re: ALL HDFS Blocks on the Same Machine if Replication factor = 1

Posted by Shahab Yunus <sh...@gmail.com>.
Yeah, Kai is right.

You can read more details at:

http://hadoop.apache.org/docs/stable/hdfs_design.html#Data+Replication

and straight from the horse's mouth (pp. 70-75):
http://books.google.com/books?id=drbI_aro20oC&pg=PA51&lpg=PA51&dq=hadoop+replication+factor+1+definitive+guide&source=bl&ots=tZDeyhhZj1&sig=Xq-0WrYhOKnER1SDbnBTmbaEfdk&hl=en&sa=X&ei=Dtu1UdnsCcO_rQG8jICoAw&ved=0CE0Q6AEwBA#v=onepage&q=hadoop%20replication%20factor%201%20definitive%20guide&f=false


On Mon, Jun 10, 2013 at 9:47 AM, Kai Voigt <k...@123.org> wrote:

> Hello,
>
> On 10.06.2013 at 15:36, Razen Al Harbi <ra...@gmail.com> wrote:
>
> > I have deployed Hadoop on a cluster of 20 machines and set the
> > replication factor to one. When I put a file (larger than the HDFS
> > block size) into HDFS, all of its blocks are stored on the machine
> > where the hadoop fs -put command is invoked.
> >
> > For higher replication factors, I see the same behavior, but the
> > replicated blocks are spread randomly across the other machines.
> >
> > Is this normal behavior? If not, what would be the cause?
>
> Yes, this is normal behavior. When an HDFS client happens to run on a
> host that is also a DataNode (always the case when a reducer writes its
> output), the first copy of a block is stored on that same node. This
> optimizes latency: it's faster to write to a local disk than across the
> network.
>
> The second copy of the block gets stored on a random host in another
> rack (if your cluster is configured to be rack-aware), to improve the
> distribution of the data.
>
> The third copy gets stored on another random host in that same remote
> rack.
>
> So your observations are correct.
>
> Kai
>
> --
> Kai Voigt
> k@123.org

Re: ALL HDFS Blocks on the Same Machine if Replication factor = 1

Posted by Kai Voigt <k...@123.org>.
Hello,

On 10.06.2013 at 15:36, Razen Al Harbi <ra...@gmail.com> wrote:

> I have deployed Hadoop on a cluster of 20 machines and set the replication factor to one. When I put a file (larger than the HDFS block size) into HDFS, all of its blocks are stored on the machine where the hadoop fs -put command is invoked.
> 
> For higher replication factors, I see the same behavior, but the replicated blocks are spread randomly across the other machines.
> 
> Is this normal behavior? If not, what would be the cause?

Yes, this is normal behavior. When an HDFS client happens to run on a host that is also a DataNode (always the case when a reducer writes its output), the first copy of a block is stored on that same node. This optimizes latency: it's faster to write to a local disk than across the network.

The second copy of the block gets stored on a random host in another rack (if your cluster is configured to be rack-aware), to improve the distribution of the data.

The third copy gets stored on another random host in that same remote rack.

So your observations are correct.

Kai

-- 
Kai Voigt
k@123.org
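
To experiment with the placement behavior Kai describes, note that the
replication factor can be set per file rather than only cluster-wide. A
minimal sketch using the standard HDFS Java API; the path and the sample
values are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/razen/demo.dat"); // hypothetical path

        // Create the file with an explicit replication factor of 3 and an
        // explicit 64 MB block size, overriding dfs.replication for this
        // file only.
        FSDataOutputStream out =
                fs.create(file, true, 4096, (short) 3, 64L * 1024 * 1024);
        out.writeBytes("hello hdfs\n");
        out.close();

        // The factor can also be changed after the file is written; the
        // NameNode then adds or drops replicas asynchronously.
        fs.setReplication(file, (short) 1);
        fs.close();
    }
}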





Re: ALL HDFS Blocks on the Same Machine if Replication factor = 1

Posted by Daryn Sharp <da...@yahoo-inc.com>.
It's normal. The default placement strategy stores the first replica on the local node for performance, then chooses a second node at random on another rack, then a third node on the same rack as the second. Using a replication factor of 1 is not advised if you value your data. However, if you want a better distribution of blocks with a single replica, consider using a non-DataNode host to upload your files.

Daryn

On Jun 10, 2013, at 8:36 AM, Razen Al Harbi wrote:

> Hello,
> 
> I have deployed Hadoop on a cluster of 20 machines and set the replication factor to one. When I put a file (larger than the HDFS block size) into HDFS, all of its blocks are stored on the machine where the hadoop fs -put command is invoked.
> 
> For higher replication factors, I see the same behavior, but the replicated blocks are spread randomly across the other machines.
> 
> Is this normal behavior? If not, what would be the cause?
> 
> Thanks, 
> 
> Razen
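
To illustrate Daryn's last point, here is a minimal sketch of an upload run
from a gateway host that is not a DataNode; the NameNode address and paths
are made up for illustration. This is essentially what hadoop fs -put does
under the hood, so running the plain put command from a non-DataNode host
works just as well:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadFromGateway {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from
        // core-site.xml (the key is fs.default.name on older releases).
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths. Because this client host is not a DataNode,
        // there is no local node to favor, so the single replica of each
        // block is placed on a DataNode chosen by the NameNode and the
        // blocks spread across the cluster.
        fs.copyFromLocalFile(new Path("/tmp/bigfile.dat"),
                             new Path("/user/razen/bigfile.dat"));
        fs.close();
    }
}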

