Posted to common-user@hadoop.apache.org by C G <pa...@yahoo.com> on 2007/12/10 20:58:49 UTC

HDFS tool and replication questions...

Hi All:
   
  Is there a tool available that will provide information about how a file is replicated within HDFS?  I'm looking for something that will "prove" that a file is replicated across multiple nodes, and let me see how many nodes participated, etc.  This is of technical interest, but more importantly it's a matter of due diligence around data security and integrity accountability.
   
  Also, are there any metrics or best practices for choosing the replication factor based on the number of nodes in the grid?  Does HDFS attempt to involve all nodes in the grid in replication?  In other words, if I have 100 nodes in my grid and a replication factor of 6, will all 100 nodes wind up storing data for a given file, assuming the file is large enough?
   
  Thanks,
  C G

       

Re: HDFS tool and replication questions...

Posted by Ted Dunning <td...@veoh.com>.
More to the specific point: yes, all 100 nodes will wind up storing data
for large files, because blocks are assigned pretty much at random.

The exception is files that originate on a datanode.  There, the local
node gets one copy of each block.  The remaining replicas still follow
the random rule, however, so you wind up in the same place in the end.
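
You can check this for yourself: copy the same file into HDFS once from
a datanode and once from a machine outside the cluster, then ask fsck
where the blocks landed.  The upload that ran on the datanode should
show that node holding one replica of every block.  (The paths below are
just placeholders, and the exact fsck flags can vary by version, so
check the usage on yours.)

  # Copy a local file into HDFS, then list each block with its replica locations.
  bin/hadoop dfs -put /tmp/bigfile.dat /user/cg/bigfile.dat
  bin/hadoop fsck /user/cg/bigfile.dat -files -blocks -locations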


On 12/10/07 1:10 PM, "dhruba Borthakur" <dh...@yahoo-inc.com> wrote:

> The replication factor should be chosen to provide the level of
> availability and performance you need. HDFS attempts to distribute replicas of a
> block so that they reside across multiple racks. HDFS block replication
> is *purely* block-based and file-agnostic; i.e. blocks belonging to the
> same file are handled precisely the same way as blocks belonging to
> different files.
> 
> Hope this helps,
> dhruba


RE: HDFS tool and replication questions...

Posted by dhruba Borthakur <dh...@yahoo-inc.com>.
The replication factor should be chosen to provide the level of
availability and performance you need. HDFS attempts to distribute replicas of a
block so that they reside across multiple racks. HDFS block replication
is *purely* block-based and file-agnostic; i.e. blocks belonging to the
same file are handled precisely the same way as blocks belonging to
different files.
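
If you need a different factor for particular files, you can also change
it per path from the shell; the cluster-wide default comes from the
dfs.replication setting in your config.  (The path below is just a
placeholder, and the -w flag, which waits for re-replication to
complete, may not exist on older releases.)

  # Raise the replication factor of one file to 6 and wait for the change.
  bin/hadoop dfs -setrep -w 6 /user/cg/bigfile.dat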

Hope this helps,
dhruba
   
  Also, are there any metrics or best practices for choosing the
replication factor based on the number of nodes in the grid?  Does HDFS
attempt to involve all nodes in the grid in replication?  In other
words, if I have 100 nodes in my grid and a replication factor of 6,
will all 100 nodes wind up storing data for a given file, assuming the
file is large enough?
   
  Thanks,
  C G

       

Re: HDFS tool and replication questions...

Posted by Ted Dunning <td...@veoh.com>.
The web interface to the namenode will let you drill down to the file
itself.  That will tell you where the blocks are (scroll down to the
bottom).  You can also use hadoop fsck <filename>.

For example:

tdunning@metricsapp5:~/hadoop-0.15.1$ bin/hadoop fsck /user/rmobin/data/11/30
........................Status: HEALTHY
 Total size:    13728838080 B
 Total blocks:  216 (avg. block size 63559435 B)
 Total dirs:    0
 Total files:   24
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Target replication factor:     2
 Real replication factor:       2.0


The filesystem under path '/user/rmobin/data/11/30' is HEALTHY
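
If you want per-block detail, fsck will also list every file under the
path, each of its blocks, and the datanode addresses holding every
replica, which is about as direct a proof of replication across nodes as
you can get.  (Flags may vary slightly by version; output omitted here.)

  bin/hadoop fsck /user/rmobin/data/11/30 -files -blocks -locations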




On 12/10/07 11:58 AM, "C G" <pa...@yahoo.com> wrote:

> Hi All:
>    
>   Is there a tool available that will provide information about how a file is
> replicated within HDFS?  I'm looking for something that will "prove" that a
> file is replicated across multiple nodes, and let me see how many nodes
> participated, etc.  This is of technical interest, but more importantly it's
> a matter of due diligence around data security and integrity accountability.
>    
>   Also, are there any metrics or best practices for choosing the replication
> factor based on the number of nodes in the grid?  Does HDFS attempt to
> involve all nodes in the grid in replication?  In other words, if I have
> 100 nodes in my grid and a replication factor of 6, will all 100 nodes wind
> up storing data for a given file, assuming the file is large enough?
>    
>   Thanks,
>   C G