Posted to issues@spark.apache.org by "Erik J. Erlandson (JIRA)" <ji...@apache.org> on 2014/05/10 23:57:00 UTC

[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets

    [ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993742#comment-13993742 ] 

Erik J. Erlandson commented on SPARK-1473:
------------------------------------------

I'm fairly new to Spark, and hopefully what follows isn't old news...

Feature subsetting ought (imo) to be considered as part of a larger picture that involves various ETL-like tasks, such as:

*) data assessment -- examining data columns to assess data types (real, integer, categorical/binary), identify noise in the data (empty/missing values, bad values), and suggest possible quantizations

*) data quantization -- mapping values into byte encodings, sparse binary representations, etc.

*) dataset transposition -- moving from sample-wise to feature-wise orientation (e.g. decision tree training can work more efficiently when data can be traversed by feature)

*) feature extraction, augmentation, reduction

I don't yet have a strong feel for how these tasks should best work in Spark, but in my previous lives I've found they are common and closely integrated tasks when preparing for the care and feeding of ML models.
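To make a couple of these concrete, below are minimal sketches, not proposed interfaces -- the names are mine, and both make simplifying assumptions. First, quantization as equal-width binning of a real-valued column into byte codes (the assessment step above could of course suggest smarter bin boundaries, e.g. quantiles):

{code}
// Quantize a real-valued column into 256 equal-width bins encoded as bytes.
// Assumes a dense Array[Double] column; equal-width binning is just the
// simplest possible choice of bin boundaries.
def quantizeToBytes(values: Array[Double]): Array[Byte] = {
  val lo = values.min
  val width = (values.max - lo) / 256.0
  values.map { v =>
    val bin =
      if (width == 0.0) 0                            // constant column
      else math.min(((v - lo) / width).toInt, 255)   // clamp the max value
    (bin - 128).toByte                               // shift 0..255 into signed byte range
  }
}
{code}

And second, transposition from sample-wise to feature-wise orientation on a plain RDD (assumes equal-length dense rows; a groupByKey-based transpose shuffles every matrix cell, so this illustrates the orientation change rather than a scalable implementation):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object TransposeSketch {
  // Turn an RDD of sample rows into an RDD keyed by feature index, so that
  // training code (e.g. decision trees) can traverse the data by feature.
  def transpose(rows: RDD[Array[Double]]): RDD[(Int, Array[Double])] = {
    rows
      .zipWithIndex()                                // stable sample id per row
      .flatMap { case (row, sampleId) =>
        row.zipWithIndex.map { case (value, featureId) =>
          (featureId, (sampleId, value))             // one record per matrix cell
        }
      }
      .groupByKey()                                  // gather all values of one feature
      .mapValues(_.toArray.sortBy(_._1).map(_._2))   // restore sample order
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "transpose-sketch")
    val rows = sc.parallelize(Seq(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0)))
    transpose(rows).collect().foreach { case (f, col) =>
      println(s"feature $f -> [${col.mkString(", ")}]")
    }
    sc.stop()
  }
}
{code}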



> Feature selection for high dimensional datasets
> -----------------------------------------------
>
>                 Key: SPARK-1473
>                 URL: https://issues.apache.org/jira/browse/SPARK-1473
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Ignacio Zendejas
>            Priority: Minor
>              Labels: features
>             Fix For: 1.1.0
>
>
> For classification tasks involving large feature spaces, on the order of tens of thousands of features or higher (e.g., text classification with n-grams, where n > 1), it is often useful to rank features and filter out the irrelevant ones, thereby reducing the feature space by at least one or two orders of magnitude without impacting performance on key evaluation metrics (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at least two methods should be implemented, with Information Gain being a priority as it has been shown to be among the most reliable.
> Special consideration should be given in the design to wrapper methods (see the research papers below), which are more practical for lower-dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13, 27-66.
> * Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 3, 1289-1305.
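
For illustration, here is a minimal sketch of the Information Gain computation the description calls for, with a hypothetical scoring trait standing in for the flexible evaluation interface (FeatureScorer and InfoGainScorer are illustrative names, not MLlib APIs; it assumes discrete feature values and labels, already in memory):

{code}
object InfoGainSketch {

  // Anything that assigns a relevance score to one discrete feature column.
  trait FeatureScorer {
    def score(feature: Array[Int], labels: Array[Int]): Double
  }

  // Shannon entropy (base 2) of a distribution given by raw counts.
  private def entropy(counts: Iterable[Double]): Double = {
    val total = counts.sum
    counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * (math.log(p) / math.log(2))
    }.sum
  }

  // IG(Y; X) = H(Y) - H(Y|X), estimated from empirical counts.
  object InfoGainScorer extends FeatureScorer {
    def score(feature: Array[Int], labels: Array[Int]): Double = {
      require(feature.length == labels.length, "column length mismatch")
      val n = labels.length.toDouble
      val hY = entropy(labels.groupBy(identity).values.map(_.length.toDouble))
      val hYGivenX = feature.zip(labels).groupBy(_._1).values.map { cell =>
        (cell.length / n) * entropy(cell.groupBy(_._2).values.map(_.length.toDouble))
      }.sum
      hY - hYGivenX
    }
  }

  def main(args: Array[String]): Unit = {
    // Toy data: a binary feature column against binary labels.
    val feature = Array(1, 1, 0, 0, 1, 0)
    val labels  = Array(1, 1, 0, 0, 1, 1)
    println(f"information gain = ${InfoGainScorer.score(feature, labels)}%.4f bits")
  }
}
{code}

Ranking then amounts to scoring each feature column and keeping the top k; wrapper methods would slot in as a different FeatureScorer (or a search over feature subsets) behind the same kind of interface.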


