Posted to common-dev@hadoop.apache.org by "Robert Chansler (JIRA)" <ji...@apache.org> on 2007/12/08 01:48:43 UTC

[jira] Created: (HADOOP-2385) Validate configuration parameters

Validate configuration parameters
---------------------------------

                 Key: HADOOP-2385
                 URL: https://issues.apache.org/jira/browse/HADOOP-2385
             Project: Hadoop
          Issue Type: Improvement
          Components: dfs
    Affects Versions: 0.16.0
            Reporter: Robert Chansler


Configuration parameters should be fully validated before name nodes or data nodes begin service.

Required parameters must be present.
Required and optional parameters must have values of proper type and range.
Undefined parameters must not be present.

(I was recently observing some confusion whose root cause was a mis-spelled parameter.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-2385) Validate configuration parameters

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557478#action_12557478 ] 

Doug Cutting commented on HADOOP-2385:
--------------------------------------

> This means that the configuration classes should be public then, right?

Yes, if the parameters they access should be publicly accessible.  One might argue that certain parameters are only consumed internally and don't need public accessors, but, more typically, parameter accessors are on public classes.

> And it doesn't matter where the getters/setters are.
> In particular, we could combine all of them in one class
> or even place them in the Configuration class. Is that what you want?

They shouldn't be all in one place or all in Configuration for the same reason that we don't put everything in a single file: we should attempt to keep related things together, to localize changes.  So an HDFS-specific parameter accessor should be on an HDFS-specific class.  How fine-grained we localize isn't clear.  Generally, finer is better: find the most-specific public class that encompasses the use and add the accessor there.  So if something's only used in the Datanode, but used in a few different classes there, then it might best be on Datanode.

> What I meant is that we keep placing logically independent
> code inside e.g. FSNamesystem, which makes it bigger, while it could easily be made a separate class.
> And configuration is just an example of such a logically independent part.

If configuration stuff is not specific to FSNamesystem (i.e., logically independent) then it shouldn't go there.  If it is specific to FSNamesystem then it could go there, or perhaps on a new class that's used only by FSNamesystem, e.g., FSNamesystemParams.  If it's used equally by FSNamesystem and other classes then it could either go on an existing shared class (e.g., Namenode) or a new shared class (NamenodeParams).




[jira] Commented: (HADOOP-2385) Validate configuration parameters

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558880#action_12558880 ] 

Konstantin Shvachko commented on HADOOP-2385:
---------------------------------------------

I agree, visibility of accessors should depend on whether they are used publicly or internally.
Say, getters for parameters used only in FSNamesystem should be package private.
Some getters may be public, but the corresponding setters may not.

> find the most-specific public class that encompasses the use and add the accessor there.

I would prefer creating new classes solely dedicated to configuration logic rather than including the implementation of accessors in existing public classes. IMO this makes for better-structured code.




[jira] Commented: (HADOOP-2385) Validate configuration parameters

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557108#action_12557108 ] 

Konstantin Shvachko commented on HADOOP-2385:
---------------------------------------------

We already have JobConf derived from Configuration.
I am just proposing to extend this approach to other components.
The Configuration itself should remain the same for each component.
That is, it reads the same config files with the same override rules, and with the same properties inside.
It just exposes get methods specific to the component.
Conversion from one class to another can be easily provided by constructors like 
{code} FsConfiguration(Configuration conf) {code}
And then you can pass an RpcConfiguration as a parameter into it, since the latter
is a subclass of Configuration and contains all properties related to RPC.
So configuration class for each component is just a wrapper exposing methods and
providing validation and potentially conversion of the parameters related to the component.
I do not support the idea of placing static getters for configuration parameters in the (top-level) component 
classes because I would prefer to separate all configuration issues from the component code.
FSNamesystem is again almost 4K lines long; JobTracker and TaskTracker are over 2K lines.
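
The wrapper-with-conversion-constructor idea above can be sketched roughly as follows. This is a minimal illustration, not Hadoop's actual API: the `Configuration` stand-in and the `dfs.replication` checks are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for Hadoop's Configuration: a flat key/value store.
class Configuration {
    protected final Map<String, String> props = new HashMap<>();
    String get(String key) { return props.get(key); }
    void set(String key, String value) { props.put(key, value); }
}

// Component-specific view: same properties, plus typed, validating getters.
class FsConfiguration extends Configuration {
    // Conversion constructor: any Configuration (or subclass) can be passed in.
    FsConfiguration(Configuration conf) { props.putAll(conf.props); }

    int getReplication() {
        String v = get("dfs.replication");
        if (v == null) throw new IllegalStateException("dfs.replication missing");
        int r = Integer.parseInt(v);  // rejects non-numeric values
        if (r < 1) throw new IllegalStateException("dfs.replication < 1: " + r);
        return r;
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");
        System.out.println(new FsConfiguration(conf).getReplication());
    }
}
```

The validation and any conversion live entirely in the getter, so the component code that calls it never changes when a constraint is added.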




[jira] Commented: (HADOOP-2385) Validate configuration parameters

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559140#action_12559140 ] 

Doug Cutting commented on HADOOP-2385:
--------------------------------------

> I would prefer creating new classes solely dedicated to configuration logic [ ... ]

I think this varies case by case.  For a complex subsystem like HDFS, it may make sense to have dedicated configuration classes.  For standalone classes, like an InputFormat or a compression codec, it probably makes sense to put configuration accessors directly on the class in question.



[jira] Commented: (HADOOP-2385) Validate configuration parameters

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12551670 ] 

Konstantin Shvachko commented on HADOOP-2385:
---------------------------------------------

I think this should apply to all Hadoop components, including the HDFS client, secondary namenode, job tracker and task tracker.
They should all independently verify the parameters they depend upon.

I'd prefer if we had a separate configuration class for each component, say NamenodeConfiguration, DatanodeConfiguration, ...
derived from the base Configuration class, with explicit getters and setters for each parameter used in the component.
That way it will be easy to provide verification and backward compatibility.
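
A rough sketch of such a per-component subclass, showing how explicit accessors make both verification and backward compatibility easy to centralize. The handler-count key, its deprecated alias, and the default are hypothetical names chosen for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for Hadoop's Configuration class.
class Configuration {
    protected final Map<String, String> props = new HashMap<>();
    String get(String key) { return props.get(key); }
    void set(String key, String value) { props.put(key, value); }
}

// Per-component subclass with explicit accessors for each parameter it uses.
class NamenodeConfiguration extends Configuration {
    void setHandlerCount(int n) {
        set("dfs.namenode.handler.count", Integer.toString(n));
    }

    // Backward compatibility lives in one place: fall back to a
    // (hypothetical) deprecated key before applying the default.
    int getHandlerCount() {
        String v = get("dfs.namenode.handler.count");
        if (v == null) v = get("dfs.handler.count");  // old name
        return v == null ? 10 : Integer.parseInt(v);
    }

    public static void main(String[] args) {
        NamenodeConfiguration conf = new NamenodeConfiguration();
        conf.set("dfs.handler.count", "40");          // value from an old config file
        System.out.println(conf.getHandlerCount());   // resolved via the alias
    }
}
```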

That reminds me about one abandoned issue HADOOP-24 - an attempt to make a configuration interface.
What was the reason it had never been committed?

We should also use an XML schema for verification purposes. That would make verification automatic, and there would be no
need to write verification code.



[jira] Commented: (HADOOP-2385) Validate configuration parameters

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12551883 ] 

Doug Cutting commented on HADOOP-2385:
--------------------------------------

> I'd prefer if we had a separate configuration class for each component, say NamenodeConfiguration, DatanodeConfiguration, ...

This can be a problem when a configuration is passed between components.  For example, if we had an FsConfiguration, RpcConfiguration, etc., then calls from, e.g., mapred code to these would require the creation of a new configuration.  So this only works well for top-level components, when all calls from a component are to components that are configured by Configuration.  But it can be hard to know when something is top-level, e.g., we'd like to be able to run datanode within other daemons.

> That reminds me about one abandoned issue HADOOP-24 - an attempt to make a configuration interface.

Hmm.  It looks like I dropped the ball on that one!  It's not back-compatible, since any code that calls 'new Configuration()' would break, and it's missing some javadoc, etc.  But it might still be a nice change, although I don't see how it would help us here.

> We should also use xml schema for the verification purposes.

I don't see how that would do sufficient type checking any more easily.  Do you?




[jira] Commented: (HADOOP-2385) Validate configuration parameters

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550145 ] 

Doug Cutting commented on HADOOP-2385:
--------------------------------------

> Configuration parameters should be fully validated before name nodes or data nodes begin service.

Is the idea that namenode or datanode startup should fail fast when misconfigured?  Or that we be able to check parameters without starting daemons?

If the former, we can just be more careful to validate all critical parameters when they're first used.  If the latter, we could add an option to the main() that simply validates critical parameters and exits.  In either case, configuration validation should not be centralized in Hadoop, but rather implemented in each component that is configured.
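
A minimal sketch of what such a validation pass might look like, whether run at first use (fail fast) or behind a "check and exit" flag on main(). The parameter names and rules are illustrative assumptions, not Hadoop's actual checks:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ConfigCheck {
    // Returns one error message per bad parameter; empty means the config passed.
    static List<String> validate(Map<String, String> conf) {
        List<String> errors = new ArrayList<>();
        if (!conf.containsKey("fs.default.name"))
            errors.add("required parameter fs.default.name is missing");
        String repl = conf.get("dfs.replication");
        if (repl != null) {
            try {
                if (Integer.parseInt(repl) < 1)
                    errors.add("dfs.replication must be >= 1, got " + repl);
            } catch (NumberFormatException e) {
                errors.add("dfs.replication is not an integer: " + repl);
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // A "validate only" entry point: report every problem, then exit
        // instead of starting service.
        for (String err : validate(Map.of("dfs.replication", "three")))
            System.out.println(err);
    }
}
```

The same validate() method serves both styles Doug describes: a daemon can call it at startup and refuse to serve, or a command-line flag can call it and exit.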




[jira] Issue Comment Edited: (HADOOP-2385) Validate configuration parameters

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558880#action_12558880 ] 

shv edited comment on HADOOP-2385 at 1/14/08 4:44 PM:
----------------------------------------------------------------------

I agree, visibility of accessors should depend on whether they are used publicly or internally.
Say getters for parameters used only in FSNamesystem should be package private.
Some getters may be public, but the corresponding setters may not.

> find the most-specific public class that encompasses the use and add the accessor there.

I would prefer creating new classes solely dedicated to configuration logic rather than including the implementation of accessors in existing public classes. IMO this makes for better-structured code.




[jira] Commented: (HADOOP-2385) Validate configuration parameters

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557471#action_12557471 ] 

Konstantin Shvachko commented on HADOOP-2385:
---------------------------------------------

JobConf does not set hdfs parameters, but I agree users might want to set/access inter-component parameters.
My idea was to use a carefully designed hierarchy of configuration classes and interfaces.
For example, TaskTrackerConfiguration could inherit from DFSClientConfiguration.
But I see that the static approach you propose is simpler.
This means that the configuration classes should be public then, right?
And it doesn't matter where the getters/setters are. In particular, we could combine all of them in one class
or even place them in the Configuration class. Is that what you want?

> I don't find the argument that FSNamesystem is already too big compelling.

Yes, the size itself is not important. What I meant is that we keep placing logically independent
code inside e.g. FSNamesystem, which makes it bigger, while it could easily be made a separate class.
And configuration is just an example of such a logically independent part.
If you write a converter for a parameter or add a verification constraint, those changes belong to
the configuration only, namely to the implementation of the getters; why modify FSNamesystem if it only calls them?




[jira] Commented: (HADOOP-2385) Validate configuration parameters

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557431#action_12557431 ] 

Doug Cutting commented on HADOOP-2385:
--------------------------------------

> Why setters need to be static?

Users need to, e.g., be able to set HDFS parameters on a JobConf.  We can get away with a single subclass of Configuration that has setters, but once we add a second, it would be impossible to create a single configuration instance that can configure multiple components.
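
The static-accessor style Doug is arguing for might be sketched like this. The class and key names are illustrative, and the `Configuration` stand-in below is not Hadoop's actual class:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for Hadoop's Configuration class.
class Configuration {
    private final Map<String, String> props = new HashMap<>();
    String get(String key) { return props.get(key); }
    void set(String key, String value) { props.put(key, value); }
}

// Static accessors: any Configuration instance can carry HDFS settings,
// so mapred code can set them on its JobConf without a cast or a copy.
class DfsConfig {
    static void setReplication(Configuration conf, int n) {
        conf.set("dfs.replication", Integer.toString(n));
    }
    static int getReplication(Configuration conf) {
        String v = conf.get("dfs.replication");
        return v == null ? 3 : Integer.parseInt(v);
    }

    public static void main(String[] args) {
        Configuration jobConf = new Configuration();  // think: a JobConf
        DfsConfig.setReplication(jobConf, 2);         // HDFS setting on it
        System.out.println(DfsConfig.getReplication(jobConf));
    }
}
```

Because the accessors take the instance as a parameter rather than living on a subclass, a single configuration object can be handed to several components' accessor classes in turn.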

> Why per-package, not per-component?

That's fine too.  You seemed to be complaining that classes were too specific for this case, so I said I was okay with per-package if you thought that more appropriate here, although perhaps that's too general for your taste in this case, and you'd rather separate, e.g., Namenode from Datanode parameters.  That's fine with me too.  However, I don't find the argument that FSNamesystem is already too big compelling.  That's a separate issue: it should perhaps be decomposed into multiple classes, and when that's done, configuration accessors might move around; but if there are FSNamesystem-specific configuration accessors then I'd argue they belong in FSNamesystem, regardless of that class's current size.



[jira] Commented: (HADOOP-2385) Validate configuration parameters

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557359#action_12557359 ] 

Doug Cutting commented on HADOOP-2385:
--------------------------------------

> The Configuration itself should remain the same for each component.
> It just exposes get methods specific to the component.

Yes, that would work for getters, but not for setters.  In many cases we need setters too, and it would be confusing to implement getters and setters in different styles.  Setters are best implemented as static methods; thus, for symmetry, getters should be too.

> I do not support the idea of placing static getters for configuration parameters in the (top-level) component

I'm okay with having per-package config classes (e.g., DFSConfig) that centralize configuration setters and getters for that package, since, in some cases, the classes which consume these (e.g., FSNamesystem) are not public classes.




[jira] Commented: (HADOOP-2385) Validate configuration parameters

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557424#action_12557424 ] 

Konstantin Shvachko commented on HADOOP-2385:
---------------------------------------------

Why do setters need to be static? What is wrong with this:
{code}
class NameNodeConfig extends Configuration {
  void setHttpBindAddress(String host, int port) {
    set("dfs.http.bindAddress", host + ":" + port);
  }
}
{code}
Why per-package, not per-component?

Serialization is common for all of them: it's the one defined in Configuration.
And the configuration files (hadoop-default.xml, hadoop-site.xml) are all the same.
Just the classes are different because their accessors are different.



[jira] Commented: (HADOOP-2385) Validate configuration parameters

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549926 ] 

Owen O'Malley commented on HADOOP-2385:
---------------------------------------

I can see the motivation/frustration for this, but it doesn't work very well with the Hadoop configuration model, which is very open. In particular, any list of valid parameters would have to be very inclusive and configurable, since the same configuration file is used for map/reduce and HDFS.

I would think that the best tradeoff would be to have a configuration file that the namenode and datanodes use to declare that all attributes under dfs.* must be on a known list.
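
Such an allow-list check is what would have caught the misspelled parameter that motivated this issue. A minimal sketch, with a small illustrative subset of keys standing in for the real list:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Every dfs.* key must appear on a known list, so a misspelled
// parameter is reported instead of being silently ignored.
class DfsKeyCheck {
    static final Set<String> KNOWN =
        Set.of("dfs.replication", "dfs.name.dir", "dfs.data.dir");

    static List<String> unknownDfsKeys(Map<String, String> conf) {
        List<String> bad = new ArrayList<>();
        for (String key : conf.keySet())
            if (key.startsWith("dfs.") && !KNOWN.contains(key))
                bad.add(key);
        return bad;
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of(
            "dfs.replicaton", "3",        // misspelled: flagged
            "mapred.map.tasks", "7");     // not dfs.*: ignored
        System.out.println(DfsKeyCheck.unknownDfsKeys(conf));
    }
}
```

Keys outside the dfs.* prefix pass through untouched, which keeps the check compatible with sharing one config file between map/reduce and HDFS.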
