Posted to common-issues@hadoop.apache.org by "ayushtkn (via GitHub)" <gi...@apache.org> on 2023/02/15 06:04:00 UTC

[GitHub] [hadoop] ayushtkn commented on pull request #5396: HDFS-16918. Optionally shut down datanode if it does not stay connected to active namenode

ayushtkn commented on PR #5396:
URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1430800093

   By Admin I mean cluster administrator services: they can keep track of datanodes and decide what needs to be done with a datanode. If those services can trigger a restart when a datanode shuts down, they can just as well track the situations in which a datanode needs to be restarted (a rough sketch of such a check is below).
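   
   As a rough sketch, such an external check could poll the datanode's JMX servlet and look at the per-namenode actor state. Assumptions here: the Hadoop 3.x default datanode HTTP port 9864, and an "ActorState" field in the DataNodeInfo bean; field names vary across versions, so verify them against your own /jmx output.
   
       import java.io.IOException;
       import java.io.InputStream;
       import java.net.HttpURLConnection;
       import java.net.URL;
       import java.nio.charset.StandardCharsets;
   
       /**
        * Hypothetical external health probe: ask a datanode's JMX servlet
        * whether its BPServiceActors are still talking to a namenode, and
        * leave the alert/restart decision to the admin tooling.
        */
       public class DatanodeHealthProbe {
           public static void main(String[] args) throws IOException {
               // Hadoop 3.x default datanode HTTP port is 9864 (assumption: plain HTTP).
               String host = args.length > 0 ? args[0] : "localhost:9864";
               URL url = new URL("http://" + host
                       + "/jmx?qry=Hadoop:service=DataNode,name=DataNodeInfo");
   
               HttpURLConnection conn = (HttpURLConnection) url.openConnection();
               conn.setConnectTimeout(5000);
               conn.setReadTimeout(5000);
   
               String body;
               try (InputStream in = conn.getInputStream()) {
                   body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
               }
   
               // Crude string check; a real tool would parse the JSON. The
               // "ActorState" field name is illustrative -- check your version.
               boolean connected = body.contains("\"ActorState\" : \"RUNNING\"")
                       || body.contains("\"ActorState\":\"RUNNING\"");
   
               System.out.println(host + (connected
                       ? ": at least one namenode connection is RUNNING"
                       : ": no RUNNING namenode connection; alerting/restarting is the admin tool's call"));
           }
       }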
   
   I haven't checked the code, but a few comments:
   
   - If the datanode is connected to an observer namenode, it can still serve read requests, so why do we need to shut it down?
   - Even if it is only connected to a standby, a failover happens and it will be in good shape; whereas if you restart a bunch of datanodes, the new active namenode will be flooded with block reports, which just makes the problem worse.
   - If something gets messed up with the active namenode and you shut everything down, you force all the other namenodes to process the block reports again; block reports are already heavy, so you have made failover harder. And if it is actually a few faulty datanodes that lost the connection, you don't get alerted to that; the standby and observer namenodes just keep getting flooded with block reports, so if the active NN literally dies and fails over to any of the namenodes these datanodes were connected to, that namenode is fed an unnecessary load of block reports. (Block reports have an initial-delay option as well; it isn't as if they all bombard at once and you are sorted in 5-10 minutes. See the config sketch after this list.)
   - If something got messed up with the datanode itself, that may be why it can't connect to the active in the first place. Restarting destroys the evidence: anything held in memory but not persisted to disk, or the JMX and network parameters that could be used to figure out what went wrong, is lost.
   - That is why most cluster administration tools, in not-so-cool situations, show whether datanode XYZ is unhealthy or not; if in some case they don't, it should be handled over there, not in the datanode.
   - In the case of shared datanodes in a federated setup, say the datanode is connected to the active for one namespace and has completely lost touch with another. Then what? Restart to get both working? Don't restart so that at least one stays working? Both are correct in their own ways and situations, and the datanode shouldn't be the one deciding its own fate for such reasons.
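   
   For reference, the initial delay mentioned in the third bullet is the standard dfs.blockreport.initialDelay setting in hdfs-site.xml; the value is in seconds, and the 600 below is only an illustrative choice:
   
       <!-- hdfs-site.xml: stagger full block reports after a (re)start so a
            freshly restarted fleet doesn't bombard the namenodes at once.
            The default of 0 means no delay. -->
       <property>
         <name>dfs.blockreport.initialDelay</name>
         <value>600</value>
       </property>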
   
   We do terminate the Namenode under a bunch of conditions, for sure. I don't want to get deep into those reasons; it is more or less a preventive measure to terminate the Namenode if something serious has happened. By the architecture of HDFS itself, the same logic doesn't look very valid for datanodes.
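   
   (For context, those namenode-side terminations go through org.apache.hadoop.util.ExitUtil. A minimal sketch of that fail-fast pattern, with a simplified condition rather than an actual call site:)
   
       import org.apache.hadoop.util.ExitUtil;
   
       // Sketch of the fail-fast pattern HDFS daemons use on fatal
       // conditions; the condition here is simplified, not a real call site.
       public class FatalConditionSketch {
           static void checkJournals(boolean allJournalsFailed) {
               if (allJournalsFailed) {
                   // Logs and exits the JVM (or throws ExitException in
                   // tests where system exit is disabled).
                   ExitUtil.terminate(1, "All journal volumes failed; cannot persist edits");
               }
           }
       }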
   
   PS. Making anything configurable doesn't justify having it in. If we are letting any user turn this on via a config, then we should be sure enough that it is a necessary and good thing to do; we cannot say "oh, you configured it, now it is your problem"...
   
   I would say it is just pulling those cluster-administrator responsibilities into the datanode, the kind of thing Cloudera Manager or maybe Ambari should do.
   
   Not in favour of this...



