Posted to dev@spark.apache.org by Hao Ren <in...@gmail.com> on 2015/09/17 17:07:10 UTC

[MLlib] BinaryLogisticRegressionSummary on test set

I am working with spark.ml.classification.LogisticRegression.scala (Spark 1.5).

It would be useful to be able to create a summary for any given dataset, not
just the training set.
Currently, a BinaryLogisticRegressionTrainingSummary is only created when
the model is fit on the training set.
As usual, we also need to summarize the test set to assess the model's
performance.
However, we cannot create our own BinaryLogisticRegressionSummary for
another data set (of type DataFrame), because the Summary class is "private"
in the classification package.

Would it be better to remove the "private" access modifier and allow the
following code on the user side:

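val lr = new LogisticRegression()

val model = lr.fit(trainingSet)

// Score the test set, then summarize the predictions.
// The constructor takes column names as Strings, hence the getters.
val binarySummary =
  new BinaryLogisticRegressionSummary(
    model.transform(testSet),
    lr.getProbabilityCol,
    lr.getLabelCol
  )

binarySummary.roc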


This way, we could use the model to summarize any data set we want.

If there is already a way to summarize a test set, please let me know. I have
browsed LogisticRegression.scala, but failed to find one.

Thx.

-- 
Hao Ren

Data Engineer @ leboncoin

Paris, France

Re: [MLlib] BinaryLogisticRegressionSummary on test set

Posted by Feynman Liang <fl...@databricks.com>.
If you have the time, submitting a PR for it would be awesome! However, our
review bandwidth is limited, so you should not expect it to be reviewed
immediately. Let's continue the discussion of the name on JIRA.

On Fri, Sep 18, 2015 at 2:47 AM, Hao Ren <in...@gmail.com> wrote:

> Thank you for the reply.
>
> I have created a JIRA issue and pinged mengxr.
>
> Here is the link: https://issues.apache.org/jira/browse/SPARK-10691
>
> I could not find jkbradley on JIRA, though I saw he is on GitHub.
>
> BTW, should I create a pull request that removes the private modifier, for
> further discussion?
>
> Thx.
>
> --
> Hao Ren
>
> Data Engineer @ leboncoin
>
> Paris, France
>

Re: [MLlib] BinaryLogisticRegressionSummary on test set

Posted by Hao Ren <in...@gmail.com>.
Thank you for the reply.

I have created a JIRA issue and pinged mengxr.

Here is the link: https://issues.apache.org/jira/browse/SPARK-10691

I could not find jkbradley on JIRA, though I saw he is on GitHub.

BTW, should I create a pull request that removes the private modifier, for
further discussion?

Thx.

On Thu, Sep 17, 2015 at 6:44 PM, Feynman Liang <fl...@databricks.com>
wrote:

> We have kept that private because we need to decide on a name for the
> method which evaluates on a test set (see the TODO comment
> <https://github.com/apache/spark/pull/7099/files#diff-668c79317c51f40df870d3404d8a731fR272>);
> perhaps you could push for this to happen by creating a Jira and pinging
> jkbradley and mengxr. Thanks!


-- 
Hao Ren

Data Engineer @ leboncoin

Paris, France

Re: [MLlib] BinaryLogisticRegressionSummary on test set

Posted by Feynman Liang <fl...@databricks.com>.
We have kept that private because we need to decide on a name for the
method which evaluates on a test set (see the TODO comment
<https://github.com/apache/spark/pull/7099/files#diff-668c79317c51f40df870d3404d8a731fR272>);
perhaps you could push for this to happen by creating a Jira and pinging
jkbradley and mengxr. Thanks!
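
For readers who need test-set metrics while the summary constructor stays
private, one possible workaround is to build the metrics directly from the
output of model.transform(testSet) using the public
org.apache.spark.mllib.evaluation.BinaryClassificationMetrics class. This is
only a sketch and was not proposed in this thread. The helper name
binaryMetricsOn is hypothetical, the default column names assume the estimator's
defaults ("probability" and "label"), and the pattern match assumes the Spark
1.5 API, where the probability column holds mllib.linalg.Vector values.

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper, not part of Spark: builds binary classification
// metrics from a DataFrame of predictions, e.g. model.transform(testSet).
// Assumes a Vector-valued probability column and a Double-valued label column.
def binaryMetricsOn(
    predictions: DataFrame,
    probabilityCol: String = "probability",
    labelCol: String = "label"): BinaryClassificationMetrics = {
  val scoreAndLabels = predictions.select(probabilityCol, labelCol).rdd.map {
    // Index 1 of the probability vector is the positive-class probability.
    case Row(probability: Vector, label: Double) => (probability(1), label)
  }
  new BinaryClassificationMetrics(scoreAndLabels)
}

// Intended usage (model and testSet are assumed to exist already):
//   val metrics = binaryMetricsOn(model.transform(testSet))
//   metrics.roc()           // RDD of (false positive rate, true positive rate)
//   metrics.areaUnderROC()  // area under the ROC curve as a Double
```

If the column names were changed on the estimator, pass lr.getProbabilityCol
and lr.getLabelCol instead of relying on the defaults.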
