Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/06/03 17:03:38 UTC
[jira] [Commented] (FLINK-2116) Make pipeline extension require less coding

    [ https://issues.apache.org/jira/browse/FLINK-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570965#comment-14570965 ] 

ASF GitHub Bot commented on FLINK-2116:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/772

    [FLINK-2116] [ml] Reusing predict operation for evaluation

    This PR adds an `evaluate` method to `Predictor` which takes a `DataSet[Testing]` and returns a `DataSet[(LabelType, LabelType)]`, where the first tuple field is the true label and the second field is the predicted label. The evaluation logic is defined via an `EvaluateDataSetOperation`.
    
    Since predicting on test data and evaluating test data both use the same prediction logic, a new level of abstraction was introduced. The old `PredictOperation` is now called `PredictDataSetOperation`, and a new `PredictOperation` was defined. The new `PredictOperation` takes an element of the dataset as well as the model of the associated `Predictor` and calculates one prediction.
    
    To implement the predict operation of a `Predictor`, one can work either at the level of `PredictDataSetOperation`, which gives access to the `DataSet` of input elements, or at the level of `PredictOperation`. In the latter case, the system automatically applies the operation to all elements of the input `DataSet` (see `Predictor.defaultPredictDataSetOperation`).
    
    Defining a `PredictOperation` makes it possible to call `evaluate` on this `Predictor` without having to define an `EvaluateDataSetOperation`. The only constraint is that the input data has to be a `DataSet[(TestingType, LabelType)]`, i.e. a tuple of a testing value and its true label. The system then calculates the prediction for the testing value and returns a `DataSet[(LabelType, LabelType)]` where the first tuple field is the true label and the second is the predicted label.
    
    What do you think of these changes? Will they ease the development of future `Predictor`s?
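
The layering described in the pull request text above can be illustrated with a rough, simplified sketch. It is not the code from the PR: `DataSet` is modelled as a plain `Seq` so the example runs without Flink, the trait signatures are assumptions reduced to the minimum, and `ThresholdModel` is a hypothetical toy predictor. Only the names `PredictOperation`, `PredictDataSetOperation`, `EvaluateDataSetOperation` and `defaultPredictDataSetOperation` come from the description.

```scala
// Simplified sketch; Seq stands in for Flink's DataSet (assumption).
object PredictSketch {
  type DataSet[T] = Seq[T]

  // Element level: one input element plus the model yields one prediction.
  trait PredictOperation[Model, Testing, Prediction] {
    def predict(value: Testing, model: Model): Prediction
  }

  // DataSet level: the operation sees the whole input DataSet.
  trait PredictDataSetOperation[Model, Testing, Prediction] {
    def predictDataSet(model: Model, input: DataSet[Testing]): DataSet[Prediction]
  }

  // DataSet level: returns (true label, predicted label) pairs.
  trait EvaluateDataSetOperation[Model, Testing, Label] {
    def evaluateDataSet(model: Model, input: DataSet[Testing]): DataSet[(Label, Label)]
  }

  // Lifts an element-level operation by applying it to every element.
  implicit def defaultPredictDataSetOperation[M, T, P](
      implicit op: PredictOperation[M, T, P]): PredictDataSetOperation[M, T, P] =
    new PredictDataSetOperation[M, T, P] {
      def predictDataSet(model: M, input: DataSet[T]): DataSet[P] =
        input.map(x => op.predict(x, model))
    }

  // Reuses the same element-level operation for (testing value, true label) input.
  implicit def defaultEvaluateDataSetOperation[M, T, L](
      implicit op: PredictOperation[M, T, L]): EvaluateDataSetOperation[M, (T, L), L] =
    new EvaluateDataSetOperation[M, (T, L), L] {
      def evaluateDataSet(model: M, input: DataSet[(T, L)]): DataSet[(L, L)] =
        input.map { case (x, truth) => (truth, op.predict(x, model)) }
    }

  // Entry points that pick up the DataSet-level operations implicitly.
  def predict[M, T, P](model: M, input: DataSet[T])(
      implicit op: PredictDataSetOperation[M, T, P]): DataSet[P] =
    op.predictDataSet(model, input)

  def evaluate[M, T, L](model: M, input: DataSet[T])(
      implicit op: EvaluateDataSetOperation[M, T, L]): DataSet[(L, L)] =
    op.evaluateDataSet(model, input)

  // Hypothetical toy predictor: only the element-level operation is defined.
  final case class ThresholdModel(threshold: Double)

  implicit val thresholdPredict: PredictOperation[ThresholdModel, Double, Boolean] =
    new PredictOperation[ThresholdModel, Double, Boolean] {
      def predict(value: Double, model: ThresholdModel): Boolean =
        value >= model.threshold
    }

  def main(args: Array[String]): Unit = {
    val model = ThresholdModel(0.5)
    println(predict(model, Seq(0.2, 0.9)))
    println(evaluate(model, Seq((0.2, false), (0.9, true), (0.6, false))))
  }
}
```

In this sketch the toy predictor defines a single `PredictOperation` and gets both `predict` and `evaluate` through the two default implicits, which is the reduction in boilerplate the description aims for.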

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink evaluatePredictor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/772.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #772
    
----
commit 49c02514a6a23d7ef95ce46966ff7ee7a1f407ad
Author: Till Rohrmann <tr...@apache.org>
Date:   2015-06-02T12:34:27Z

    [FLINK-2116] [ml] Adds evaluate method to Predictor. Adds PredictOperation which can be reused by evaluate if the input data is of the format (TestingType, LabelType) where the second tuple field represents the true label.

----


> Make pipeline extension require less coding
> -------------------------------------------
>
>                 Key: FLINK-2116
>                 URL: https://issues.apache.org/jira/browse/FLINK-2116
>             Project: Flink
>          Issue Type: Improvement
>          Components: Machine Learning Library
>            Reporter: Mikio Braun
>            Assignee: Till Rohrmann
>            Priority: Minor
>
> Right now, implementing methods from the pipelines for new types, or even adding new methods to pipelines requires many steps:
> 1) implementing methods for new types
>   implement implicit of the corresponding class encapsulating the operation in the companion object
> 2) adding methods to the pipeline
>   - adding a method
>   - adding a trait for the operation
>   - implement implicit in the companion object
> These are all objects which contain many generic parameters, so reducing the work would be great.
> The goal should be that you can really focus on the code to add, and have as little boilerplate code as possible.
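
The steps listed in the issue description above can be sketched as follows. This is a minimal illustration of the pattern under stated assumptions, not the actual Flink ML code: `Seq` stands in for `DataSet`, the shapes of `FitOperation` and `Estimator` are simplified guesses, and `MeanEstimator` is a hypothetical pipeline element.

```scala
// Simplified sketch of the boilerplate the issue refers to (assumed shapes).
object PipelineBoilerplateSketch {
  // "adding a trait for the operation": a type class encapsulating the operation.
  trait FitOperation[Self, Training] {
    def fit(instance: Self, input: Seq[Training]): Unit
  }

  // "adding a method": the pipeline method only delegates to the implicit operation.
  trait Estimator[Self] { self: Self =>
    def fit[Training](input: Seq[Training])(
        implicit op: FitOperation[Self, Training]): Unit =
      op.fit(this, input)
  }

  // Hypothetical pipeline element that stores the mean of its training data.
  class MeanEstimator extends Estimator[MeanEstimator] {
    var mean: Option[Double] = None
  }

  // "implement implicit in the companion object": one implicit per supported type.
  object MeanEstimator {
    implicit val fitDoubles: FitOperation[MeanEstimator, Double] =
      new FitOperation[MeanEstimator, Double] {
        def fit(instance: MeanEstimator, input: Seq[Double]): Unit =
          instance.mean = if (input.isEmpty) None else Some(input.sum / input.size)
      }

    // Supporting a new input type means writing another implicit like this one.
    implicit val fitInts: FitOperation[MeanEstimator, Int] =
      new FitOperation[MeanEstimator, Int] {
        def fit(instance: MeanEstimator, input: Seq[Int]): Unit =
          instance.mean =
            if (input.isEmpty) None else Some(input.sum.toDouble / input.size)
      }
  }
}
```

Even in this reduced form, each supported input type needs its own implicit instance carrying the full list of type parameters, which is the overhead the issue asks to reduce.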



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)