Posted to issues@spark.apache.org by "Gavin Brown (JIRA)" <ji...@apache.org> on 2014/10/23 13:20:34 UTC

[jira] [Comment Edited] (SPARK-1473) Feature selection for high dimensional datasets

    [ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181238#comment-14181238 ] 

Gavin Brown edited comment on SPARK-1473 at 10/23/14 11:20 AM:
---------------------------------------------------------------

Hello, I am the first author of the paper being discussed.

Our paper did indeed separate the two tasks of (1) estimating the probability distributions, and (2) the process/dynamics of selecting features once you have those probabilities.  So, as Sam says, yes - it is entirely possible that bad estimation of those probabilities could lead to bad choices of features.

The task of estimating those probabilities, known as "entropy estimation", is an unsolved problem in general.  Sam rightly points out that you could use Laplace smoothing, or other smoothing methods; which one will work best on an arbitrary dataset is unknown in general.  However, many of these smoothing methods perform identically once you get a reasonable number of datapoints per feature.  Small-sample feature selection using information theoretic methods is an open problem - but one could use good heuristics like AIC or BIC to regularize.
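
To make the estimation step concrete, here is a minimal sketch of a plug-in entropy estimate with Laplace (add-alpha) smoothing.  It is in Scala only because this ticket targets MLlib; the object and method names are illustrative, not an existing API, and the choice of alpha is exactly the kind of smoothing decision discussed above.

    object EntropyEstimate {
      // Plug-in estimate of H(X) for a discrete feature, from raw value
      // counts, with Laplace (add-alpha) smoothing of the probabilities.
      def smoothedEntropy(counts: Array[Long], alpha: Double = 1.0): Double = {
        val k = counts.length                 // number of distinct feature values
        val total = counts.sum + alpha * k    // normaliser after smoothing
        counts.map { c =>
          val p = (c + alpha) / total         // smoothed probability estimate
          -p * math.log(p)                    // contribution to H(X), in nats
        }.sum
      }
    }

The same smoothed probabilities feed the mutual information and conditional mutual information estimates, so a poor choice at this stage propagates into whichever selection criterion you build on top of them.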

In answer to the question by Sam - "Is there any literature that has attempted to *formalize the way we introduce independence*" ... no, our paper is the only one I know of.

I think of these information theoretic filter methods as a way to *explore* large data. If you wanted to build the best possible classifier, with no limit on computation, you'd use a wrapper method around the classifier itself.  Filter methods by nature assume independence between the feature selection stage and the classifier building - an assumption which is clearly false, but which works well in practice on very large datasets where unlimited compute time is not available.  If you want to explore data, a filter is a good heuristic - and the information theoretic ones are the most theoretically grounded "heuristics" I know of.
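
In code the contrast is easy to see.  The sketch below is hypothetical Scala (the names are mine, not from the paper or from MLlib): a filter simply ranks features by a precomputed relevance score and keeps the top k, whereas a wrapper would put the classifier itself inside that loop.

    object FilterSelection {
      // Rank features by a precomputed relevance score (e.g. mutual
      // information with the class label) and keep the indices of the top k.
      // The classifier never appears here - that is the "filter" assumption;
      // a wrapper method would instead retrain and evaluate the classifier
      // for every candidate subset of features.
      def selectTopK(relevance: Array[Double], k: Int): Array[Int] =
        relevance.zipWithIndex                    // (score, featureIndex)
          .sortBy { case (score, _) => -score }   // best scores first
          .take(k)
          .map { case (_, idx) => idx }
    }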

Happy to answer further questions or give my perspective.  Everybody - thanks for the attention to the paper; very happy to see it is useful.


> Feature selection for high dimensional datasets
> -----------------------------------------------
>
>                 Key: SPARK-1473
>                 URL: https://issues.apache.org/jira/browse/SPARK-1473
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Ignacio Zendejas
>            Assignee: Alexander Ulanov
>            Priority: Minor
>              Labels: features
>
> For classification tasks involving large feature spaces, on the order of tens of thousands of features or more (e.g., text classification with n-grams, where n > 1), it is often useful to rank features and filter out the irrelevant ones, thereby reducing the feature space by at least one or two orders of magnitude without impacting performance on key evaluation metrics (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at least two methods should be implemented, with Information Gain as a priority since it has been shown to be amongst the most reliable (a sketch of such an interface follows below).
> Special consideration should be given in the design to wrapper methods (see the research papers below), which are more practical for lower-dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. "Conditional likelihood maximisation: a unifying framework for information theoretic feature selection." The Journal of Machine Learning Research 13 (2012): 27-66.
> * Forman, George. "An extensive empirical study of feature selection metrics for text classification." The Journal of Machine Learning Research 3 (2003): 1289-1305.
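>
> A minimal sketch of what such a pluggable interface could look like (illustrative Scala only; the trait and object names here are placeholders for discussion, not an existing MLlib API):
>
>     // A feature evaluator scores one feature from its contingency table
>     // with the class label: counts(x)(y) = #(X = x, Y = y).
>     trait FeatureEvaluator {
>       def score(counts: Array[Array[Long]]): Double
>     }
>
>     // Information Gain: IG(X; Y) = H(Y) - H(Y | X), using plug-in estimates.
>     object InfoGain extends FeatureEvaluator {
>       private def entropy(p: Seq[Double]): Double =
>         p.filter(_ > 0).map(q => -q * math.log(q)).sum
>
>       def score(counts: Array[Array[Long]]): Double = {
>         val n = counts.map(_.sum).sum.toDouble
>         val hY = entropy(counts.transpose.map(_.sum / n))    // H(Y)
>         val hYgivenX = counts.map { row =>                   // sum_x p(x) H(Y|X=x)
>           val nx = row.sum.toDouble
>           if (nx == 0) 0.0 else (nx / n) * entropy(row.map(_ / nx))
>         }.sum
>         hY - hYgivenX
>       }
>     }
>
> A second criterion (e.g. chi-squared, or one of the criteria from the Brown et al. framework above) would simply be another implementation of the same trait.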



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org