Posted to user@hadoop.apache.org by Sachin Tiwari <im...@gmail.com> on 2017/12/18 08:09:25 UTC

Fwd: queries regarding hadoop DFS

Hi

I am trying to use Hadoop as a distributed file storage system.

I did a POC with a small cluster of 1 namenode and 4 datanodes, and I was
able to get/put files using the HDFS client and monitor the datanodes'
status at: http://master-machine:50070/dfshealth.html

However, I have a few open questions that I would like to discuss with you
before taking the solution to the next level.

*Questions are as follows:*

1) Is HDFS good at handling binary data, like executables, zip archives, VDI images, etc.?

2) How many datanodes can a namenode handle? Assume it's running on 24
cores and 90GB RAM, and handling files between 200MB and 1GB in size (with
the default block size of 128MB).

3) Is there a way to tune the cluster setup, i.e. determine the best
values for block size, replication factor, heap, etc.?

4) I am also curious: how long does the namenode service take to
acknowledge that a datanode has gone down?

5) What happens next? That is, does the namenode start replicating the
blocks of the dead datanode to other available datanodes to meet the
replication factor?

6) What happens when the datanode comes back up? Won't there be more
block replicas in the system than expected, since the namenode replicated
them while it was down?

7) Also, after coming back up, does the datanode perform cleanup for the
files (blocks) that were pruned while it was down? That is, does it reclaim
disk space by deleting blocks that were deleted while it was down?

8) During copying/replication, does a datanode with more available space
get priority over a datanode with comparatively less space?

9) What are your recommendations for a cluster of around 2500 machines,
each with 24 cores, 90GB RAM, and 500MB to 1TB of disk space to spare for
HDFS? Are there any good tools to manage such a huge cluster and track its
health and other status?

10) As someone who is not a networking person and does not own the network
topology of the machines, what is your recommendation for making the
cluster rack-aware? That is, what should I do to benefit from
rack-awareness in the cluster?


Thanks,
Sachin

Re: queries regarding hadoop DFS

Posted by Philippe Kernévez <pk...@octo.com>.
Hi Sachin,

On Mon, Dec 18, 2017 at 9:09 AM, Sachin Tiwari <im...@gmail.com> wrote:

> Hi
>
> I am trying to use Hadoop as a distributed file storage system.
>
> I did a POC with a small cluster of 1 namenode and 4 datanodes, and I
> was able to get/put files using the HDFS client and monitor the datanodes'
> status at: http://master-machine:50070/dfshealth.html
>
> However, I have a few open questions that I would like to discuss with
> you before taking the solution to the next level.
>
> *Questions are as follows:*
>
> 1) Is HDFS good at handling binary data, like executables, zip archives, VDI images, etc.?
>
Yes for the storage. For processing with YARN, it will depend on your
processing and file format.



>
> 2) How many datanodes can a namenode handle? Assume it's running on 24
> cores and 90GB RAM, and handling files between 200MB and 1GB in size (with
> the default block size of 128MB).
>
A namenode is limited more by the number of files (and blocks) than by the
number of datanodes; a single namenode can manage several hundred datanodes.
An example of sizing:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_command-line-installation/content/configuring-namenode-heap-size.html
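The sizing guidance above can be turned into quick arithmetic. A rough sketch, assuming the commonly cited figure of roughly 1 GB of namenode heap per million blocks (treat the exact ratio as an assumption; the linked guide has the authoritative tables):

```python
# Back-of-the-envelope namenode heap estimate for the sizes in the question.
# The "~1 GB of heap per million blocks" figure is a rule of thumb, not an
# exact value; see the sizing guide linked above.

def blocks_per_file(file_size_mb, block_size_mb=128):
    """Number of HDFS blocks a file of the given size occupies."""
    return -(-file_size_mb // block_size_mb)  # ceiling division

def estimated_heap_gb(num_files, avg_file_size_mb, block_size_mb=128):
    """Rough namenode heap need: ~1 GB per million blocks (assumption)."""
    total_blocks = num_files * blocks_per_file(avg_file_size_mb, block_size_mb)
    return total_blocks / 1_000_000

# 1 million files of ~512 MB each => 4 blocks per file => ~4M blocks => ~4 GB heap
print(blocks_per_file(512))               # 4
print(estimated_heap_gb(1_000_000, 512))  # 4.0
```

So for the 200MB-1GB files in the question, the file count (not the datanode count) is what drives the namenode's memory.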
​



>
> 3) Is there a way to tune the cluster setup, i.e. determine the best
> values for block size, replication factor, heap, etc.?
>
To tune the configuration you can change the values at the cluster level or
per folder/file.
List of values:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
At least look at:
dfs.replication, dfs.replication.max, dfs.namenode.replication.min
and dfs.blocksize.
I don't understand what you mean by 'heap size' for HDFS.
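To see how dfs.blocksize and dfs.replication interact, a quick back-of-the-envelope sketch (the 128 MB block size and replication factor 3 used here are the defaults from hdfs-default.xml):

```python
# Sketch of how dfs.blocksize and dfs.replication drive block count and raw
# storage. Defaults assumed: 128 MB blocks, replication factor 3.

def raw_storage_mb(file_size_mb, replication=3):
    """Raw disk consumed: every byte is stored `replication` times."""
    return file_size_mb * replication

def block_replicas(file_size_mb, block_size_mb=128, replication=3):
    """Total block replicas the namenode must track for one file."""
    blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    return blocks * replication

print(raw_storage_mb(1024))   # 3072 -> a 1 GB file consumes 3 GB of raw disk
print(block_replicas(1024))   # 24   -> 8 blocks x 3 replicas
```

Raising the block size lowers the number of objects the namenode tracks; raising replication multiplies raw disk usage, which matters when budgeting the spare space per machine.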
​



>
> 4) I am also curious: how long does the namenode service take to
> acknowledge that a datanode has gone down?
>
Depends on your configuration.
There is a heartbeat (dfs.heartbeat.interval) of 3s by default.
But several parameters define several states for a datanode
(ok, stale and dead).
See dfs.namenode.stale.datanode.interval
and dfs.namenode.heartbeat.recheck-interval.
By default a datanode stops receiving writes after 30s (stale) and its
blocks are re-replicated after 10min30s (dead).
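The 10min30s figure follows from HDFS's dead-node timeout formula, which combines the two settings above. A quick sketch of the arithmetic, using the default values:

```python
# HDFS marks a datanode dead after:
#   timeout = 2 * dfs.namenode.heartbeat.recheck-interval
#           + 10 * dfs.heartbeat.interval
# Defaults: recheck-interval = 300000 ms (5 min), heartbeat = 3 s.

def dead_node_timeout_s(recheck_interval_ms=300_000, heartbeat_interval_s=3):
    """Seconds before the namenode declares a silent datanode dead."""
    return 2 * recheck_interval_ms / 1000 + 10 * heartbeat_interval_s

timeout = dead_node_timeout_s()
print(timeout)                                        # 630.0 seconds
print(f"{timeout // 60:.0f}min{timeout % 60:.0f}s")   # 10min30s
```

Lowering either setting makes failure detection faster at the cost of more false positives under load or GC pauses.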



>
> 5) What happens next? That is, does the namenode start replicating the
> blocks of the dead datanode to other available datanodes to meet the
> replication factor?
>
Yes.


>
> 6) What happens when the datanode comes back up? Won't there be more
> block replicas in the system than expected, since the namenode replicated
> them while it was down?
>
The extra block replicas won't be used. HDFS rebalancing will be required
if you want to spread data across all nodes.


>
> 7) Also, after coming back up, does the datanode perform cleanup for the
> files (blocks) that were pruned while it was down? That is, does it reclaim
> disk space by deleting blocks that were deleted while it was down?
>
See the previous answer.

>
> 8) During copying/replication, does a datanode with more available space
> get priority over a datanode with comparatively less space?
>
No.



>
> 9) What are your recommendations for a cluster of around 2500 machines,
> each with 24 cores, 90GB RAM, and 500MB to 1TB of disk space to spare for
> HDFS? Are there any good tools to manage such a huge cluster and track its
> health and other status?
>
Clusters with 2500 nodes are very, very rare; they need very deep expertise
(far beyond your current questions) and years of experience.
My first recommendation is to start with a small cluster (10-20 nodes),
learn with it, automate every provisioning step and, most importantly:
hire experts.




>
> 10) As someone who is not a networking person and does not own the
> network topology of the machines, what is your recommendation for making
> the cluster rack-aware? That is, what should I do to benefit from
> rack-awareness in the cluster?
>
Given your ambition, hire or employ an expert for the cluster setup.
The rack-awareness configuration itself is quite simple; the network
configuration on the nodes is more complex (to get enough bandwidth with
bonding):
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/RackAwareness.html
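For reference, the RackAwareness doc linked above has Hadoop resolve racks through a script named by net.topology.script.file.name: the script is invoked with datanode IPs/hostnames as arguments and must print one rack path per argument. A minimal sketch (the IP-to-rack mapping below is hypothetical):

```python
#!/usr/bin/env python3
# Minimal rack topology script in the style described by the RackAwareness
# doc: Hadoop calls it with one or more IPs/hostnames and reads one rack
# path per argument from stdout. The mapping below is a made-up example.
import sys

RACK_MAP = {                      # hypothetical; fill in your real topology
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.11": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"    # Hadoop's fallback rack name

def rack_for(host: str) -> str:
    """Return the rack path for a host, or the default rack if unknown."""
    return RACK_MAP.get(host, DEFAULT_RACK)

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_for(host))
```

Point net.topology.script.file.name in core-site.xml at the script and make it executable; with it in place, HDFS will put replicas on at least two racks, which is where the rack-awareness benefit comes from.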

Regards,
Philippe



>
>
> Thanks,
> Sachin
>


-- 
Philippe Kernévez



Technical Director (Switzerland),
pkernevez@octo.com
+41 79 888 33 32

Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
OCTO Technology http://www.octo.ch