Posted to common-user@hadoop.apache.org by Dibyendu Karmakar <di...@gmail.com> on 2013/04/11 12:19:12 UTC

UNDERSTANDING HADOOP PERFORMANCE

Hi everyone,
I am testing Hadoop performance. I have come across the following parameters:
1. dfs.replication
2. dfs.block.size
3. dfs.heartbeat.interval   (default: 3)
4. dfs.blockreport.intervalMsec   (default: 3600000)
5. dfs.namenode.handler.count   (default: 10)
6. dfs.datanode.handler.count   (default: 3)
7. dfs.replication.interval   (default: 3)
8. dfs.namenode.decommission.interval   (default: 300)

I have successfully tested parameters 1 and 2. But the rest of the
parameters, starting from dfs.heartbeat.interval, are confusing me a lot.
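For reference, I am changing 1 and 2 in hdfs-site.xml along these lines
(the values here are just examples, not the ones from my tests):

    <property>
      <name>dfs.replication</name>
      <value>2</value> <!-- default is 3 -->
    </property>
    <property>
      <name>dfs.block.size</name>
      <value>134217728</value> <!-- bytes: 128 MB; default is 64 MB -->
    </property>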

If I increase those parameters, will Hadoop perform better (considering
read and write operations separately)?
Or do I have to decrease them for Hadoop to perform better?

Can anyone please help? If possible, please explain
dfs.namenode.handler.count and dfs.datanode.handler.count, i.e. what do
these two parameters do?

Thank you
-- 
Dibyendu Karmakar,
< dibyendu.dets@gmail.com >

Re: UNDERSTANDING HADOOP PERFORMANCE

Posted by MARCOS MEDRADO RUBINELLI <ma...@buscapecompany.com>.
dfs.namenode.handler.count and dfs.datanode.handler.count control how many concurrent threads the namenode and each datanode, respectively, run to handle incoming RPC requests. The default values should be fine for smaller clusters, but if you have a lot of simultaneous HDFS operations, you may see performance gains by increasing these numbers. Just make sure you have the memory to spare, and adjust your heap sizes accordingly.
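If you want to experiment, something like this in hdfs-site.xml is a
reasonable starting point (the values are illustrative, not
recommendations; one rule of thumb I have seen for the namenode handler
count is roughly 20 * ln(number of datanodes)):

    <property>
      <name>dfs.namenode.handler.count</name>
      <!-- default 10; 20 * ln(30) is about 68 for a 30-node cluster -->
      <value>64</value>
    </property>
    <property>
      <name>dfs.datanode.handler.count</name>
      <!-- default 3; example value only -->
      <value>10</value>
    </property>

Each extra handler thread costs some heap and stack, so if you raise
these significantly, bump the daemon heaps as well (HADOOP_NAMENODE_OPTS
and HADOOP_DATANODE_OPTS in hadoop-env.sh).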

dfs.heartbeat.interval and dfs.blockreport.intervalMsec will affect performance in larger clusters. Datanodes send a message to the namenode saying they are still alive every dfs.heartbeat.interval seconds, and after dfs.namenode.stale.datanode.interval milliseconds without a heartbeat, the namenode will mark that datanode as stale. Similarly, each datanode sends a list of all the blocks it holds every dfs.blockreport.intervalMsec milliseconds. For a cluster of 30 machines, that means the namenode receives a heartbeat, on average, every 0.1 seconds, and a block report every 2 minutes, which should be a negligible load and worth the extra reliability. If your block reports are taking too long, that's a sign that you have too many small files, and you should look into archiving or consolidating them somehow. Personally, I ran into trouble around 1 million blocks per datanode.
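To make the arithmetic concrete for a 30-node cluster at the defaults
(this is just the back-of-the-envelope calculation, not a measurement):

    heartbeats:    30 datanodes, one every 3 s each        -> 10 per second, i.e. one every 0.1 s on average
    block reports: 30 datanodes, one every 3600000 ms each -> 30 per hour, i.e. one every 2 minutes on average

Raising either interval reduces namenode load proportionally, at the cost
of slower detection of dead datanodes and missing blocks. For reference,
the hdfs-site.xml entries at their defaults look like this:

    <property>
      <name>dfs.heartbeat.interval</name>
      <value>3</value> <!-- seconds -->
    </property>
    <property>
      <name>dfs.blockreport.intervalMsec</name>
      <value>3600000</value> <!-- milliseconds: one hour -->
    </property>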

dfs.namenode.decommission.interval is only used when removing datanodes from the cluster. You can safely ignore it.

Regards,
Marcos
