Posted to common-issues@hadoop.apache.org by "Ari Rabkin (JIRA)" <ji...@apache.org> on 2010/11/15 23:22:15 UTC

[jira] Created: (HADOOP-7036) spellcheck for configuration

spellcheck for configuration
----------------------------

                 Key: HADOOP-7036
                 URL: https://issues.apache.org/jira/browse/HADOOP-7036
             Project: Hadoop Common
          Issue Type: New Feature
          Components: conf
            Reporter: Ari Rabkin
            Assignee: Ari Rabkin


Hadoop does fairly limited correctness checks of its configuration. I propose a "configuration spellcheck" that can automatically catch errors, and particularly can catch cases where users mis-type the name of an option.

The system works as follows:

- Use program analysis to extract the set of options supported by each Hadoop version, annotated when possible with their types, into a 'dictionary file'.
- Distribute these extracted sets, per version.
- Run a script that reads a dictionary file, reads the Hadoop config from a specified directory, and reports deviations. In particular, the system can report when an option is set that Hadoop will never read, or when an invalid value is specified.
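
For illustration only, the checking step above could be sketched roughly as follows. This is not the code in confspellcheck.jar. The in-memory dictionary, its type names ("int", "uri"), and the helper names parse_config, is_valid and check are all assumptions made for this sketch; the real dictionary file format is not described in this thread. The only thing taken from Hadoop itself is the <configuration>/<property>/<name>/<value> layout of its XML config files.

```python
import difflib
import xml.etree.ElementTree as ET

# Hypothetical dictionary: option name -> expected type.
# In the proposal this would be loaded from a per-version dictionary file.
DICTIONARY = {
    "fs.default.name": "uri",
    "dfs.replication": "int",
    "io.file.buffer.size": "int",
}


def parse_config(xml_text):
    """Return {name: value} from Hadoop-style <configuration> XML."""
    root = ET.fromstring(xml_text)
    options = {}
    for prop in root.findall("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            options[name] = value
    return options


def is_valid(value, expected_type):
    """Check a value against a type; only 'int' is modeled in this sketch."""
    if expected_type == "int":
        try:
            int(value)
        except ValueError:
            return False
    return True  # types not modeled here are accepted as-is


def check(options, dictionary):
    """Report options missing from the dictionary and badly typed values."""
    problems = []
    for name, value in sorted(options.items()):
        if name not in dictionary:
            # Suggest the closest known option name, if any is similar enough.
            guess = difflib.get_close_matches(name, list(dictionary), n=1)
            hint = f" (did you mean {guess[0]}?)" if guess else ""
            problems.append(f"unknown option {name}{hint}")
        elif not is_valid(value, dictionary[name]):
            problems.append(
                f"invalid {dictionary[name]} value for {name}: {value!r}"
            )
    return problems
```

Calling check(parse_config(text), DICTIONARY) on the contents of a site file would then yield one message per deviation. The nearest-name suggestion via difflib is a choice made for this sketch, not something the proposal specifies.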

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-7036) spellcheck for configuration

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-7036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Rabkin updated HADOOP-7036:
-------------------------------

    Attachment: confspellcheck.jar

Jar for spellcheck. Should go in contrib/spellcheck directory.
Source code available from http://code.google.com/p/jchord/source/browse/#svn/trunk/conf_spellchecker/

Available under the BSD license.

[jira] Commented: (HADOOP-7036) spellcheck for configuration

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-7036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932245#action_12932245 ] 

Ari Rabkin commented on HADOOP-7036:
------------------------------------

Actually, thinking about it a bit more, an XML schema isn't quite as powerful as this approach. The spellchecker tool can check constraints like "this option must be a writable local file", which don't fit into a schema.
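
As a rough illustration of why such a constraint falls outside a schema: it depends on the state of the local filesystem, not on the text of the value. A minimal sketch follows; the function name is_writable_local_file and the exact rule (an existing writable file, or a not-yet-existing file in a writable directory) are assumptions for this example, not the tool's actual check.

```python
import os


def is_writable_local_file(path):
    """True if path is a writable file, or could be created as one."""
    if os.path.isdir(path):
        return False  # a directory is not a file
    if os.path.exists(path):
        return os.access(path, os.W_OK)
    # The file does not exist yet: it is creatable only if its parent
    # directory exists and is writable.
    parent = os.path.dirname(os.path.abspath(path))
    return os.path.isdir(parent) and os.access(parent, os.W_OK)
```

Note that os.access reports permissions for the user running the check, which may differ from the user the Hadoop daemons run as.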

[jira] Commented: (HADOOP-7036) spellcheck for configuration

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-7036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964916#action_12964916 ] 

Ari Rabkin commented on HADOOP-7036:
------------------------------------

The jar is now in Maven. I'll revise the patch to pull it in that way.

The tool is intended for ops folks to some extent, and novice users to an even greater extent. I'm not sure which way that pushes the packaging question. As I understand it, this differs between the v20 branch and the v21 branch. Is there a document somewhere summarizing what goes in contrib for each Hadoop branch?

I don't understand the right way to divide stuff up amongst projects. Seems like a hassle to have the Mapred dictionary in one patch against one project, the HDFS dictionary in another, and the common options and the invoke script in a third. Can that really be the right way to go?



[jira] Commented: (HADOOP-7036) spellcheck for configuration

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-7036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971995#action_12971995 ] 

Ari Rabkin commented on HADOOP-7036:
------------------------------------

The code is already released under the Apache license (on Google Code, not GitHub).

The reason I want it integrated with Hadoop is that it's a tool primarily designed to help novice users, who are very unlikely to go off and install some small little component that they've never heard of. Almost all the value inheres in being "on by default."

[jira] Commented: (HADOOP-7036) spellcheck for configuration

Posted by "Konstantin Boudnik (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-7036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933141#action_12933141 ] 

Konstantin Boudnik commented on HADOOP-7036:
--------------------------------------------

Good idea, I like it. However, the jar file needs to be included via an Ivy dependency declaration, not committed directly to SVN.

Also, it seems that it needs to be split between the projects. E.g. Common shouldn't know anything about HDFS- or MR-specific configuration options.

One more nit: the tool sounds more like a nice addition for Ops (cluster operation) folks or whoever else needs to create their own configurations. Perhaps it belongs in HDFS/MR contribs rather than Common.

+1 on the idea, though!

[jira] Commented: (HADOOP-7036) spellcheck for configuration

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-7036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932260#action_12932260 ] 

Ari Rabkin commented on HADOOP-7036:
------------------------------------

It is already a separate project, on Google Code; it's a subcomponent of http://code.google.com/p/jchord/

But I thought it made sense to include the Hadoop-specific scripts and the [hopefully human-checked] Hadoop dictionary files in Hadoop contrib. This improves visibility and also benefits the Hadoop community by helping users avoid what I gather is a significant problem -- mis-spelled option names.

[jira] Commented: (HADOOP-7036) spellcheck for configuration

Posted by "Konstantin Boudnik (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-7036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971963#action_12971963 ] 

Konstantin Boudnik commented on HADOOP-7036:
--------------------------------------------

I'd rather agree with you about the overhead. Unfortunately, this is how things have become after the 3-way split we experienced a couple of years ago. Perhaps Eli's point makes sense, and it would be a good idea to release it as a separate project under the Apache license, with its own artifact, on GitHub?

[jira] Commented: (HADOOP-7036) spellcheck for configuration

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-7036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971795#action_12971795 ] 

Ari Rabkin commented on HADOOP-7036:
------------------------------------

I've been overloaded with other things and am only now getting back to this.  I'm still unsure what the right way to package this is.

- The jar is now on Maven.
- I have separate HDFS and MapReduce dictionary files.

Where should I put the script that launches the thing?  Separate scripts for MapReduce and HDFS? That seems very wasteful.

Do I need to open a pair of new JIRAs, one each for MAPREDUCE and HDFS?

[jira] Commented: (HADOOP-7036) spellcheck for configuration

Posted by "Allen Wittenauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-7036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932236#action_12932236 ] 

Allen Wittenauer commented on HADOOP-7036:
------------------------------------------

Errr, don't we just need a schema definition, and then we can do this with any number of XML tools?

[jira] Updated: (HADOOP-7036) spellcheck for configuration

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-7036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Rabkin updated HADOOP-7036:
-------------------------------

    Attachment: hadoopSpellcheck.patch

Includes dictionary files for (and was tested with) 0.20.2 and 0.21.0

[jira] Commented: (HADOOP-7036) spellcheck for configuration

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-7036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932244#action_12932244 ] 

Ari Rabkin commented on HADOOP-7036:
------------------------------------

It would be possible to use an XML schema to do the enforcement. I opted for this strategy so that I could reuse the spellcheck component for other systems that use non-XML key-value configuration.

The hard part here isn't the enforcement per se; it's automatically extracting the schema and keeping it up to date for each version. That's the real contribution here; I'm undertaking to keep those up to date, using program analysis.
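
To make the extraction problem concrete, here is a deliberately crude stand-in: a regular-expression scan of Java source for calls such as conf.getInt("name", ...) with a string-literal key. This is not the program analysis described above and is far less precise; it misses option names built at run time and will also match unrelated .get("...") calls on maps. The names GETTER_TYPES, CALL and extract_options are made up for this sketch; the getter methods themselves (get, getInt, getLong, getFloat, getBoolean) do exist on Hadoop's Configuration class.

```python
import re

# Map each Configuration getter to the type it implies for the option.
GETTER_TYPES = {
    "get": "string",
    "getInt": "int",
    "getLong": "long",
    "getFloat": "float",
    "getBoolean": "boolean",
}

# Matches e.g.  .getInt("dfs.replication"  and captures getter and key.
CALL = re.compile(r'\.(get|getInt|getLong|getFloat|getBoolean)\(\s*"([^"]+)"')


def extract_options(java_source):
    """Return {option name: inferred type} for literal keys in the source."""
    found = {}
    for getter, name in CALL.findall(java_source):
        # Keep the first type seen for each option name.
        found.setdefault(name, GETTER_TYPES[getter])
    return found
```

Running something like this over each release's source tree would give a first approximation of a per-version dictionary, which is where the limits of pattern matching (and the case for real program analysis) become apparent.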

[jira] Commented: (HADOOP-7036) spellcheck for configuration

Posted by "Eli Collins (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-7036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932256#action_12932256 ] 

Eli Collins commented on HADOOP-7036:
-------------------------------------

Sounds like a good tool. Maybe better as a separate project, e.g. on GitHub, than as part of core Hadoop?


