Posted to issues@hivemall.apache.org by "Makoto Yui (JIRA)" <ji...@apache.org> on 2018/04/04 02:47:00 UTC

[jira] [Commented] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selection

    [ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424912#comment-16424912 ] 

Makoto Yui commented on HIVEMALL-181:
-------------------------------------

[~takuti] is working on this kind of feature selection mechanism in our company.

It's named GUESS, a feature for selecting meaningful columns. It uses the [Chain of Responsibility|https://en.wikipedia.org/wiki/Chain-of-responsibility_pattern] pattern for its filtering rules.

There are a lot of rules, including heuristics to filter out ID columns from the explanatory variables. Using standard deviation is one of the most beneficial filtering rules.
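To make the design concrete, here is a minimal Python sketch of a Chain of Responsibility for column-filtering rules, loosely modeled on the mechanism described above. The class names, the ID-column heuristic, and the standard-deviation threshold are illustrative assumptions, not the actual GUESS implementation.

```python
import statistics

class FilterRule:
    """One link in a chain of column-filtering rules."""

    def __init__(self, successor=None):
        self.successor = successor

    def accepts(self, name, values):
        # A column survives only if every rule in the chain accepts it.
        if not self._check(name, values):
            return False
        if self.successor is not None:
            return self.successor.accepts(name, values)
        return True

    def _check(self, name, values):
        raise NotImplementedError


class IdColumnRule(FilterRule):
    """Heuristic: drop ID-like columns (ID-ish name and all-distinct values)."""

    def _check(self, name, values):
        looks_like_id = name.lower().endswith(("id", "_id", "key"))
        all_distinct = len(set(values)) == len(values)
        return not (looks_like_id and all_distinct)


class LowStddevRule(FilterRule):
    """Drop near-constant columns whose standard deviation is tiny."""

    def __init__(self, threshold=1e-6, successor=None):
        super().__init__(successor)
        self.threshold = threshold

    def _check(self, name, values):
        return statistics.pstdev(values) > self.threshold


def select_columns(table):
    """Return the column names that pass every rule in the chain."""
    chain = IdColumnRule(successor=LowStddevRule())
    return [name for name, values in table.items()
            if chain.accepts(name, values)]
```

For example, given a table with an all-distinct "user_id" column, a varying "age" column, and a constant column, only "age" survives the chain. New rules can be added by appending another `FilterRule` subclass as a successor, without touching the existing ones.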

> Plan rewriting rules to filter meaningful training data before feature selection
> --------------------------------------------------------------------------------
>
>                 Key: HIVEMALL-181
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-181
>             Project: Hivemall
>          Issue Type: Improvement
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Major
>              Labels: spark
>
> In machine learning and statistics, feature selection is one of the most useful techniques for choosing a subset of relevant data during model construction, both to simplify models and to shorten training times. scikit-learn provides APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection becomes a time-consuming process if the training data have a large number of columns (the number frequently exceeds 1,000 in business use cases).
> The objective of this ticket is to add new optimizer rules in Spark that filter meaningful training data before feature selection. As a simple example, Spark could filter out columns with low variances (this corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. The Spark optimizer could then push this `Project` node down into leaf nodes (e.g., `LogicalRelation`), making plan execution significantly faster. More sophisticated techniques have been proposed in [1, 2].
> I will open pull requests as sub-tasks and record relevant activities (papers and other OSS functionality) in this ticket to track them.
> References:
>  [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
>  [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017.
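The variance-threshold pruning quoted above can be sketched outside Spark in a few lines. This is a hedged illustration of the idea the optimizer rule would implement, not Hivemall or Spark code; the threshold default mirrors scikit-learn's `VarianceThreshold`, which drops zero-variance (constant) columns by default.

```python
import numpy as np

def variance_threshold(X, threshold=0.0):
    """Return indices of columns whose (population) variance exceeds `threshold`.

    A planner rule doing the same thing would emit a `Project` keeping
    only these columns, which the optimizer could push down to the scan.
    """
    X = np.asarray(X, dtype=float)
    variances = X.var(axis=0)  # per-column population variance
    return [j for j, v in enumerate(variances) if v > threshold]

# Columns 0 and 2 are constant, so only column 1 survives.
X = [[1.0, 5.0, 0.0],
     [1.0, 7.0, 0.0],
     [1.0, 6.0, 0.0]]
kept = variance_threshold(X)
```

The benefit in the Spark setting is that this column set can be computed from cheap statistics before training, so the pruning `Project` shrinks the data read from the leaf relation rather than after a full scan.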



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)