Posted to user@hadoop.apache.org by Philippe Kernévez <pk...@octo.com> on 2018/01/03 13:08:27 UTC

Re: queries regarding hadoop DFS

Hi Sachin,

On Mon, Dec 18, 2017 at 9:09 AM, Sachin Tiwari <im...@gmail.com> wrote:

> Hi
>
> I am trying to use hadoop as a distributed file storage system.
>
> I did a POC with a small cluster of 1 namenode and 4 datanodes, and I
> was able to get/put files using the hdfs client and monitor the datanodes'
> status at: http://master-machine:50070/dfshealth.html
>
> However, I have a few open questions that I would like to discuss with you
> guys before taking the solution to the next level.
>
> *Questions are as follows:*
>
> 1) Is hdfs good in handling binary data? Like executable, zip, VDI, etc?
>
Yes for storage: HDFS just stores bytes, so binary files (executables, zip,
VDI, etc.) are fine. Whether you can process them efficiently with YARN will
depend on your processing and on the file format.
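
For illustration, storing a binary file through the Java FileSystem API could
look like the sketch below (the class name and paths are made up for the
example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: copy an arbitrary binary file (zip, VDI, executable...) into
// HDFS. HDFS does not care about the content, it only stores the bytes.
public class PutBinary {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/tmp/image.vdi"),          // hypothetical local path
                             new Path("/data/images/image.vdi")); // hypothetical HDFS path
        fs.close();
    }
}

The hdfs client you already use (hdfs dfs -put) does essentially the same
thing.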



>
> 2) How many datanodes can a namenode handle? Assuming it's running on 24
> cores, 90GB RAM and handling files between 200MB and 1GB in size? (assuming
> default block size of 128MB)
>
A namenode is limited more by the number of files and blocks it has to track
than by the number of datanodes; it can manage several hundred datanodes.
An example of namenode heap sizing:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_command-line-installation/content/configuring-namenode-heap-size.html
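
If you want a rough back-of-envelope figure before reading that page, here is
a small sketch (all the numbers are assumptions for the example, and the
"1 GB of heap per million blocks" figure is only a commonly quoted rule of
thumb; the linked page has more precise tables):

public class NamenodeHeapEstimate {
    public static void main(String[] args) {
        long totalDataTb   = 500;  // assumed total data volume, for the example
        long avgFileSizeMb = 500;  // files between 200 MB and 1 GB
        long blockSizeMb   = 128;  // default dfs.blocksize

        long files         = totalDataTb * 1024 * 1024 / avgFileSizeMb;
        long blocksPerFile = (avgFileSizeMb + blockSizeMb - 1) / blockSizeMb; // ceiling
        long blocks        = files * blocksPerFile;

        // Rule of thumb: roughly 1 GB of namenode heap per million blocks.
        double heapGb = blocks / 1_000_000.0;
        System.out.printf("~%d files, ~%d blocks -> roughly %.1f GB of namenode heap%n",
                          files, blocks, heapGb);
    }
}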
​



>
> 3) Is there a way to tune the cluster setup, i.e. determine the best values
> for block size, replication factor, heap, etc.?
>
To tune the configuration you can change the values at the cluster level or
per folder/file.
The full list of values:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
At a minimum look at dfs.replication, dfs.replication.max,
dfs.namenode.replication.min and dfs.blocksize.
I am not sure what you mean by 'heap size' for HDFS; if you mean the namenode
heap, see the sizing link in the previous answer.
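
As an illustration of the per-file side (paths and values are made up), the
Java FileSystem API lets you override replication and block size for a single
file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side overrides of the cluster defaults:
        conf.set("dfs.replication", "2");
        conf.set("dfs.blocksize", String.valueOf(256L * 1024 * 1024)); // 256 MB

        FileSystem fs = FileSystem.get(conf);

        // Or set replication and block size explicitly for one file:
        Path path = new Path("/data/archive/big.zip"); // hypothetical path
        try (FSDataOutputStream out = fs.create(path,
                true,                                      // overwrite
                conf.getInt("io.file.buffer.size", 4096),  // buffer size
                (short) 2,                                 // replication for this file
                256L * 1024 * 1024)) {                     // block size for this file
            out.write(new byte[]{1, 2, 3});
        }

        // Replication can also be changed on an existing file:
        fs.setReplication(path, (short) 3);
        fs.close();
    }
}

The same per-file control is available from the command line with
hdfs dfs -setrep, and with -D dfs.blocksize=... on a put.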
​



>
> 4) I was also curious how much time the namenode service takes to
> acknowledge that a datanode has gone down?
>
It depends on your configuration.
There is a heartbeat (dfs.heartbeat.interval), 3 seconds by default, but
several parameters define the states a datanode can be in from the namenode's
point of view (live, stale and dead); see dfs.namenode.stale.datanode.interval
and dfs.namenode.heartbeat.recheck-interval.
With the default values a datanode is considered stale (and can be avoided for
new writes) after 30 seconds, and is declared dead, so its blocks get
re-replicated, after 10 minutes 30 seconds.
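
For reference, with the defaults the dead timeout is derived from those two
settings: 2 * dfs.namenode.heartbeat.recheck-interval + 10 *
dfs.heartbeat.interval = 2 * 300 s + 10 * 3 s = 630 s = 10 min 30 s.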



>
> 5) What happens next? That is, does the namenode start replicating blocks of
> the down datanode to other available datanodes to meet the replication
> factor?
>
Yes: once the datanode is declared dead, the namenode schedules re-replication
of its blocks on the remaining datanodes to restore the replication factor.


>
> 6) What happens when the datanode comes back up? Won't there be more block
> replicas in the system than expected, as the namenode has replicated them
> while it was down?
>
The extra replicas won't be used: the namenode detects the over-replicated
blocks and tells datanodes to delete the excess copies. Rebalancing with the
HDFS balancer is only needed if you want the data spread evenly across all
nodes.


>
>
> 7) Also, after coming up does the datanode perform cleanup for the files
> (blocks) that were pruned while the datanode was down? That is, reclaim the
> disk space by deleting blocks that were deleted while it was down?
>
See the previous answer: the namenode will tell the datanode to delete the
replicas it no longer needs, which reclaims the disk space.

>
> 8) During copying/replication does a datanode with more available space get
> priority over a datanode with comparatively less space?
>
No. The default block placement policy does not prioritize datanodes by free
space; the HDFS balancer is the tool for evening out disk usage.



>
> 9) What are your recommendations for a cluster of around 2500 machines
> with 24 cores and 90GB RAM and 500MB to 1TB disk space to spare for HDFS?
> Are there any good tools to manage such a huge cluster and track its health
> and other status?
>
A cluster of 2,500 nodes is very, very rare; it requires deep expertise (far
beyond the level of your current questions) and years of experience.
My first recommendation is to start with a small cluster (10-20 nodes), learn
with it, automate every provisioning step and, most importantly, hire experts.




>
> 10) For a non-networking guy like me, who doesn't own the network topology
> of the machines, what is your best recommendation to make the cluster
> rack-aware? I mean, what should I do to benefit from rack-awareness
> in the cluster?
>
Given your ambitions, hire or employ an expert for the cluster setup.
The rack-awareness configuration itself is quite simple; the network setup of
the nodes (getting enough bandwidth, bonding, etc.) is the more complex part:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/RackAwareness.html
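
For illustration only, here is a minimal sketch of the Java mapping variant
described on that page (configured through net.topology.node.switch.mapping.impl;
the shell topology script set via net.topology.script.file.name is the more
common route). The third-octet rule is an invented example, not a
recommendation:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

// Assumes datanodes are addressed by IPv4 and that the third octet happens to
// identify the rack -- purely an example convention.
public class ThirdOctetRackMapping implements DNSToSwitchMapping {

    @Override
    public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<>(names.size());
        for (String name : names) {
            String[] octets = name.split("\\.");
            // e.g. 10.1.42.17 -> /rack-42 ; anything unparseable -> default rack
            racks.add(octets.length == 4 ? "/rack-" + octets[2] : "/default-rack");
        }
        return racks;
    }

    @Override
    public void reloadCachedMappings() { /* nothing cached in this sketch */ }

    @Override
    public void reloadCachedMappings(List<String> names) { /* nothing cached */ }
}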

Regards,
Philippe



>
>
> Thanks,
> Sachin


-- 
Philippe Kernévez



Technical Director (Switzerland),
pkernevez@octo.com
+41 79 888 33 32

Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
OCTO Technology http://www.octo.ch