Posted to user@hadoop.apache.org by Razen Al Harbi <ra...@gmail.com> on 2013/06/10 15:36:37 UTC
ALL HDFS Blocks on the Same Machine if Replication factor = 1
Hello,
I have deployed Hadoop on a cluster of 20 machines. I set the replication
factor to one. When I put a file (larger than HDFS block size) into HDFS,
all the blocks are stored on the machine where the Hadoop put command is
invoked.
For higher replication factor, I see the same behavior but the replicated
blocks are stored randomly on all the other machines.
Is this normal behavior? If not, what would be the cause?
Thanks,
Razen
Re: ALL HDFS Blocks on the Same Machine if Replication factor = 1
Posted by Shahab Yunus <sh...@gmail.com>.
Yeah, Kai is right.
You can read more details for your understanding at:
http://hadoop.apache.org/docs/stable/hdfs_design.html#Data+Replication
and right from the horse's mouth (Pgs 70-75):
http://books.google.com/books?id=drbI_aro20oC&pg=PA51&lpg=PA51&dq=hadoop+replication+factor+1+definitive+guide&source=bl&ots=tZDeyhhZj1&sig=Xq-0WrYhOKnER1SDbnBTmbaEfdk&hl=en&sa=X&ei=Dtu1UdnsCcO_rQG8jICoAw&ved=0CE0Q6AEwBA#v=onepage&q=hadoop%20replication%20factor%201%20definitive%20guide&f=false
On Mon, Jun 10, 2013 at 9:47 AM, Kai Voigt <k...@123.org> wrote:
> Hello,
>
> Am 10.06.2013 um 15:36 schrieb Razen Al Harbi <ra...@gmail.com>:
>
> > I have deployed Hadoop on a cluster of 20 machines. I set the
> replication factor to one. When I put a file (larger than HDFS block size)
> into HDFS, all the blocks are stored on the machine where the Hadoop put
> command is invoked.
> >
> > For higher replication factor, I see the same behavior but the
> replicated blocks are stored randomly on all the other machines.
> >
> > Is this a normal behavior, if not what would be the cause?
>
> Yes, this is normal behavior. When an HDFS client happens to run on a host
> that is also a DataNode (always the case when a reducer writes its output),
> the first copy of a block is stored on that same node. This optimizes
> latency: writing to a local disk is faster than writing across the network.
>
> The second copy of the block gets stored onto a random host in another
> rack (if your cluster is configured to be rack-aware), to increase the
> distribution of the data.
>
> The third copy of the block gets stored onto another node in the same rack
> as the second copy.
>
> So your observations are correct.
>
> Kai
>
> --
> Kai Voigt
> k@123.org
>
>
>
>
>
Re: ALL HDFS Blocks on the Same Machine if Replication factor = 1
Posted by Kai Voigt <k...@123.org>.
Hello,
Am 10.06.2013 um 15:36 schrieb Razen Al Harbi <ra...@gmail.com>:
> I have deployed Hadoop on a cluster of 20 machines. I set the replication factor to one. When I put a file (larger than HDFS block size) into HDFS, all the blocks are stored on the machine where the Hadoop put command is invoked.
>
> For higher replication factor, I see the same behavior but the replicated blocks are stored randomly on all the other machines.
>
> Is this a normal behavior, if not what would be the cause?
Yes, this is normal behavior. When an HDFS client happens to run on a host that is also a DataNode (always the case when a reducer writes its output), the first copy of a block is stored on that same node. This optimizes latency: writing to a local disk is faster than writing across the network.
The second copy of the block gets stored onto a random host in another rack (if your cluster is configured to be rack-aware), to increase the distribution of the data.
The third copy of the block gets stored onto another node in the same rack as the second copy.
So your observations are correct.
Kai
--
Kai Voigt
k@123.org
Re: ALL HDFS Blocks on the Same Machine if Replication factor = 1
Posted by Daryn Sharp <da...@yahoo-inc.com>.
It's normal. The default placement strategy stores the first replica on the writer's own node for performance, then chooses a second random node on another rack, then a third node on the same rack as the second node. A replication factor of 1 is not advised if you value your data. However, if you want a better distribution of blocks with a single replica, consider uploading your files from a host that is not a DataNode.
Daryn
On Jun 10, 2013, at 8:36 AM, Razen Al Harbi wrote:
> Hello,
>
> I have deployed Hadoop on a cluster of 20 machines. I set the replication factor to one. When I put a file (larger than HDFS block size) into HDFS, all the blocks are stored on the machine where the Hadoop put command is invoked.
>
> For higher replication factor, I see the same behavior but the replicated blocks are stored randomly on all the other machines.
>
> Is this a normal behavior, if not what would be the cause?
>
> Thanks,
>
> Razen
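The default placement strategy both replies describe can be sketched in a few lines. This is a simplified model with made-up rack and node names, not Hadoop's actual BlockPlacementPolicyDefault implementation:

```python
import random

def place_replicas(writer, topology, replication):
    """Model the default HDFS replica placement:
    1st replica on the writer's node (if it is a DataNode),
    2nd on a random node in a different rack,
    3rd on another node in the 2nd replica's rack.

    topology: dict mapping rack name -> list of DataNode names.
    Assumes at least two racks and two nodes per rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    targets = []

    # 1st replica: the writer's own node if it is a DataNode, else any node.
    first = writer if writer in rack_of else random.choice(list(rack_of))
    targets.append(first)
    if replication == 1:
        return targets  # with replication = 1, everything stays on one node

    # 2nd replica: a random node in a different rack.
    remote_rack = random.choice([r for r in topology if r != rack_of[first]])
    second = random.choice(topology[remote_rack])
    targets.append(second)
    if replication == 2:
        return targets

    # 3rd replica: a different node in the same rack as the second.
    third = random.choice([n for n in topology[rack_of[second]] if n != second])
    targets.append(third)
    return targets
```

With replication = 1 the policy never gets past the first step, so every block of a file uploaded from one machine lands on that machine, which is exactly the behavior observed.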