Posted to common-dev@hadoop.apache.org by Yongjun Zhang <yz...@cloudera.com> on 2014/12/18 00:02:00 UTC
supportability JIRAs

Hi,

When creating a JIRA, there is a "Labels" field where we can label the JIRA
as "supportability". Doing this would help us prioritize fixes that improve
supportability, especially for those JIRAs that tend to be small but have a
very high impact for end users.

The types of JIRAs that we can label "supportability" include the following
(see the more detailed descriptions at the bottom of this mail):

- improving logs and error messages
- simplifying workflows for admins
- reducing configuration complexity
- adding new metrics
- ...

For example,

HDFS-7281 <https://issues.apache.org/jira/browse/HDFS-7281> Missing block is
marked as corrupted block

HDFS-7497 <https://issues.apache.org/jira/browse/HDFS-7497> Inconsistent
report of decommissioning DataNodes between dfsadmin and NameNode webui

HDFS-6959 <https://issues.apache.org/jira/browse/HDFS-6959> Make the HDFS
home directory location customizable.

HDFS-6403 <https://issues.apache.org/jira/browse/HDFS-6403> Add metrics for
log warnings reported by JVM pauses

This email serves as a proposal to apply this label when creating new
JIRAs. We could also go back and label old JIRAs for reference when time
allows.

Comments are welcome.

Thanks.

--Yongjun

Below is a more detailed description of some relevant scenarios (thanks
to Andrew Wang and Todd Lipcon):

1) In the presence of configuration errors, detecting them preemptively
before they result in the system getting into a funky state. For example,
we used to have a possible configuration where the NN would start up bound
to 0.0.0.0 and then advertise 0.0.0.0 to the SNN as its remote IP. This
meant that checkpointing would not work, and it would fail with confusing
errors. Aborting at startup made this easier to support.

2) In the presence of environmental issues, detecting them and giving
meaningful errors. For example, the GC pause monitor that is now in the NN
is helpful because when something goes wrong, you have a smoking gun (even
though, in some cases, it's not exactly an NN bug that GC happens).

3) Changing non-specific error messages to specific ones. For example,
we've had cases before where we threw an NPE, and the "fix" was to check for
null and throw an IllegalArgumentException with a clear message instead.
It wasn't a bug that the system failed with that particular config, but
the new error message tells the user/supporter exactly what to do to fix
it.
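
As a rough illustration of scenarios 1) and 3) above, here is a minimal Java
sketch, not actual Hadoop code. The class name, method name, and the
configuration key "example.service.address" are all hypothetical, and the
host:port parsing is deliberately simplified (it does not handle IPv6
literals). It checks an address setting at startup and fails with a specific
IllegalArgumentException instead of letting a null or a wildcard address
cause confusing errors later.

```java
import java.util.HashMap;
import java.util.Map;

public class StartupConfigCheck {

    /**
     * Returns the host part of a "host:port" setting, or throws an
     * IllegalArgumentException that names the key and says how to fix it.
     * Hypothetical helper for illustration only.
     */
    public static String requireAdvertisableHost(Map<String, String> conf, String key) {
        String value = conf.get(key);
        if (value == null || value.trim().isEmpty()) {
            // Scenario 3: an explicit null check with a specific message,
            // rather than a NullPointerException somewhere downstream.
            throw new IllegalArgumentException(
                "Configuration key '" + key + "' is not set; set it to host:port.");
        }
        String trimmed = value.trim();
        int colon = trimmed.lastIndexOf(':');
        String host = colon >= 0 ? trimmed.substring(0, colon) : trimmed;
        if (host.isEmpty() || host.equals("0.0.0.0")) {
            // Scenario 1: a wildcard address is fine to bind to, but it cannot
            // be advertised to other daemons, so abort at startup.
            throw new IllegalArgumentException(
                "Configuration key '" + key + "' is '" + trimmed
                + "', which cannot be advertised to remote daemons; "
                + "set it to a resolvable hostname or IP address.");
        }
        return host;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("example.service.address", "0.0.0.0:8020");
        try {
            requireAdvertisableHost(conf, "example.service.address");
        } catch (IllegalArgumentException e) {
            // A real daemon would log this and exit with a non-zero status.
            System.err.println("Aborting startup: " + e.getMessage());
        }
    }
}
```

The point of the sketch is that the message names the offending key and the
corrective action, so whoever is supporting the cluster does not need to read
a stack trace to find out what to change.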