Posted to issues@hivemall.apache.org by "Makoto Yui (Jira)" <ji...@apache.org> on 2019/11/11 18:14:00 UTC

[jira] [Closed] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections

     [ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Makoto Yui closed HIVEMALL-181.
-------------------------------
    Resolution: Abandoned

> Plan rewriting rules to filter meaningful training data before feature selections
> ---------------------------------------------------------------------------------
>
>                 Key: HIVEMALL-181
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-181
>             Project: Hivemall
>          Issue Type: Improvement
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Major
>              Labels: spark
>         Attachments: fig1.png, fig2.png, fig3.png
>
>
> In machine learning and statistics, feature selection is a useful technique for choosing a subset of relevant data during model construction, which simplifies models and shortens training times; e.g., scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]). However, this selection becomes too time-consuming if the training data have a large number of columns and rows (for example, the number of columns frequently exceeds 1,000 in real business use cases).
> The objective of this ticket is to implement plan rewriting rules in Spark Catalyst to filter meaningful training data before feature selection. We assume the workflow below, from data extraction to model training:
> !fig1.png!
> In the example workflow above, one prepares the raw training data, R(v1, v2, v3, v4) in the figure, by joining and projecting input data (R1, R2, and R3) from various data sources (HDFS, S3, JDBC, ...). Sampling and feature selection are then applied to choose a relevant subset (the red box) of the raw data. In real business use cases, raw training data often contain many meaningless columns for historical reasons (e.g., redundant schema designs). If we could filter out these meaningless data in the data extraction phase, both the data extraction itself and the subsequent feature selection could run more efficiently. In the example above, we do not actually need to join the relation R3 because all of its columns are filtered out in feature selection. The join processing should also be faster if we could sample data directly from the input data (R1 and R2). The optimized workflow is as follows:
> !fig2.png!
> This optimization might be achieved by rewriting the plan tree for data extraction as follows:
> !fig3.png!
> Since Spark already has a pluggable optimizer interface (extendedOperatorOptimizationRules) and a framework to collect statistics for input data in data sources, the major task of this ticket is to add plan rewriting rules that filter meaningful training data before feature selection.
> As a simple first task, Spark could have a rule that filters out columns with low variance (this corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. The Spark optimizer could then push this `Project` node down toward the leaf nodes (e.g., `LogicalRelation`), which could make plan execution significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2, 3].
> I will make pull requests as sub-tasks and record relevant activities (research and other OSS functionality) in this ticket to track them.
>  
> *References:*
>  [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
>  [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017.
>  [3] Z. Zhao, R. Christensen, F. Li, X. Hu, K. Yi, Random Sampling over Joins Revisited, Proceedings of SIGMOD, 2018.
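
The variance-based column filter and the skipped R3 join described in the ticket can be illustrated with a small, self-contained sketch. It is plain Python, not Spark Catalyst code, and it is not the implementation proposed in the ticket. The function names (low_variance_columns, prune_plan), the sample values, and the assignment of columns v1..v4 to R1, R2, and R3 are all assumptions made for illustration, since the figures are not reproduced here. The sketch also ignores whether skipping a join changes the set of result rows, which a real rewriting rule would have to check.

```python
from statistics import pvariance


def low_variance_columns(rows, threshold=0.0):
    """Return the names of columns whose population variance is <= threshold.

    This mirrors the idea behind a variance threshold filter: a column that
    (almost) never changes carries little information for training.
    """
    columns = rows[0].keys()
    return {c for c in columns if pvariance([r[c] for r in rows]) <= threshold}


def prune_plan(relations, dropped):
    """Keep only the columns that survive the filter, per input relation.

    `relations` maps a relation name to the columns it contributes.
    A relation whose columns are all dropped is omitted entirely, which
    corresponds to not joining it at all.
    """
    pruned = {}
    for name, cols in relations.items():
        kept = [c for c in cols if c not in dropped]
        if kept:
            pruned[name] = kept
    return pruned


# Hypothetical sample of the raw training data R(v1, v2, v3, v4).
sample = [
    {"v1": 1.0, "v2": 10.0, "v3": 5.0, "v4": 0.0},
    {"v1": 2.0, "v2": 12.0, "v3": 5.0, "v4": 0.0},
    {"v1": 3.0, "v2": 11.0, "v3": 5.0, "v4": 0.0},
]

# Hypothetical mapping of columns to the input relations they come from.
relations = {"R1": ["v1"], "R2": ["v2", "v3"], "R3": ["v4"]}

dropped = low_variance_columns(sample)
plan = prune_plan(relations, dropped)

print(sorted(dropped))
print(plan)
```

With these sample values, v3 and v4 are constant and are dropped, so R3 contributes no surviving column and disappears from the pruned plan, while R2 is still read but only for v2.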



--
This message was sent by Atlassian Jira
(v8.3.4#803005)