Posted to user@mahout.apache.org by Dmitriy Lyubimov <dl...@gmail.com> on 2011/03/31 19:31:26 UTC

Confidence interval for logistic regression

Hi,

In logistic regression we basically estimate the mean (or mode) of the
prediction.
Is there also a way to estimate the standard error in the same
learning pipeline?

Or do I need to set up another pipeline just for that, perhaps?

Thanks.
-Dmitriy

Re: Confidence interval for logistic regression

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Ted, thank you very much.

Let me check the references.

On Thu, Mar 31, 2011 at 1:53 PM, Ted Dunning <te...@gmail.com> wrote:
>
>
> On Thu, Mar 31, 2011 at 11:21 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>>
>> Thank you, Ted.
>>
>> (I think MAP is another way to say 'distribution mode' for the entire
>> training set?).
>
> Not quite.  That would be maximum likelihood.
> MAP is the distribution mode for the likelihood times the prior (aka the
> posterior distribution).  For the uniform prior, these are the same.  Of
> course, the uniform prior is often nonsensical mathematically.
>
>>
>> I think I am talking about uncertainty of the result, not the
>> parameters...
>
> Sure.  But the posterior of the parameters leads to the posterior of the
> result.
> The real problem here is that you often have strong interactions in the
> parameters that will lead to the same result.
> For instance, if you have one predictor variable repeated in your input, you
> have the worst case of collinearity.  The L_1 regularized SGD will be unable
> to pick either variable, but the sum of the weights on the two variables
> will be constant and your predicted value could be perfectly accurate even
> though the parameters each separately appear to be uncertain.  The actual
> posterior for the parameter space is a highly correlated distribution.  The
> problem here is that the correlation matrix has n^2 elements (though it is
> sparse) which makes the computation of correlation difficult if only because
> over-fitting is even worse for n^2 elements than for n.
> Without some decent correlation estimate, you can't get a good error
> estimate on the result.
> So that is why the problem is hard.
> Here are a few possible solutions:
> a) on-line bootstrapping
> One thing that might work is to clone the CrossFoldLearner to make a
> bootstrap learner.  Each training example would be passed to a subset of the
> different sub-learners at random.  The goal would be to approximate
> resampling.  Then you would look at the diversity of opinion between the
> different classifiers.
> This has some theoretical difficulties with the relationship between what a
> real resampling would do and what this does.  Also, you are giving the
> learners less data so that isn't going to estimate the error you would get
> with all the data.
>>
>> We want to say how confident the regressed value is. Or, roughly, what
>> the variance in the vicinity of X was, on average, in the
>> training set.
>>
>> Say we have traditional PPC (pay per click) advertising. So we might
>> use a binomial regression to compute CTR prediction (prob of a
>> click-thru).
>
> Sure.
> And, in general, you don't want variance, but instead want to be able to
> sample the posterior.  Then you can sample the posterior estimate of regret
> for all of your models and ads and decide which one to pick. This is
> delicate because you need a realistic prior to avoid blue-sky bets all the
> time.  There is a major role in this for multi-level models so that you get
> well-founded priors.
>>
>> Then we could just multiply that by the expectation of what a click is
>> worth (obtained through some bidding system) and hence obtain the expected
>> payoff of a particular ad in a given situation.
>
> Payoff is only part of the problem, of course, because you really have a
> bandit problem here.  You need to model payoff and opportunity cost of
> making the wrong decision now, but also to incorporate the estimated benefit
> that learning about a model might have.  Again, strong priors are very
> valuable in this.
>>
>> But there's a difference between saying 'rev(A) = 5c +- 0.1c'
>> and 'rev(A) = 5c +- 2c', because in the first case we are pretty much damn
>> sure B is almost always better than A, and in the second case we just say
>> 'oh, they are both about the same, so just rotate them'.
>
> This is regret, i.e. the expectation of opportunity cost:
>    C_j = \int_Y \left[ \max(Y) - y_j \right] \, dP(Y)
> where Y is the vector of payoffs and P(Y) is the multi-dimensional
> cumulative distribution of same.
>>
>> So one way I see to go about this is: if we have a regression for the mode
>> of the posterior,
>> http://latex.codecogs.com/gif.latex?y=\hat{y}\left(\boldsymbol{z},\boldsymbol{\beta}\right)
>> then we estimate the 'variance in the vicinity' by building a
>> regression for another target set composed of the squared errors,
>>
>> http://latex.codecogs.com/gif.latex?\left(y-\hat{y}\right)^{2}=\hat{s}\left(\boldsymbol{x},\boldsymbol{\beta}\right)
>>
>> and that would give us much more leverage when comparing ad
>> performance. In other words, we want a better handle on questions like
>> 'how often is ad A performing better than ad B?'
>
> This is Laplace's method for estimating posterior distributions.  See
> http://www.inference.phy.cam.ac.uk/mackay/laplace.pdf
> for instance, and the 1992 paper of his that he cites on the first page.
> MacKay's book is excellent on this and related topics.  See
> http://www.inference.phy.cam.ac.uk/itprnn/book.html
>
> These methods can be fruitful, but I don't know how to implement them in the
> presence of big data (i.e. in an on-line learner).  With small data, the
> bayesglm package in R may be helpful.   See
> http://www.stat.columbia.edu/~gelman/research/unpublished/priors7.pdf
> for more information.  I have used bayesglm in smaller data situations with
> very good results.
>

Re: Confidence interval for logistic regression

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Mar 31, 2011 at 11:21 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Thank you, Ted.
>
> (I think MAP is another way to say 'distribution mode' for the entire
> training set?).
>

Not quite.  That would be maximum likelihood.

MAP is the distribution mode for the likelihood times the prior (aka the
posterior distribution).  For the uniform prior, these are the same.  Of
course, the uniform prior is often nonsensical mathematically.
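
In symbols (generic notation, nothing Mahout-specific), with D the training data:

   \hat{\beta}_{ML}  = \arg\max_\beta p(D | \beta)
   \hat{\beta}_{MAP} = \arg\max_\beta p(D | \beta) \, p(\beta)

With a flat prior p(\beta) \propto 1 the two coincide; any informative prior
pulls the MAP estimate toward it.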



> I think I am talking about uncertainty of the result, not the parameters...
>

Sure.  But the posterior of the parameters leads to the posterior of the
result.

The real problem here is that you often have strong interactions in the
parameters that will lead to the same result.

For instance, if you have one predictor variable repeated in your input, you
have the worst case of collinearity.  The L_1 regularized SGD will be unable
to pick either variable, but the sum of the weights on the two variables
will be constant and your predicted value could be perfectly accurate even
though the parameters each separately appear to be uncertain.  The actual
posterior for the parameter space is a highly correlated distribution.  The
problem here is that the correlation matrix has n^2 elements (though it is
sparse) which makes the computation of correlation difficult if only because
over-fitting is even worse for n^2 elements than for n.

Without some decent correlation estimate, you can't get a good error
estimate on the result.

So that is why the problem is hard.

Here are a few possible solutions:

a) on-line bootstrapping

One thing that might work is to clone the CrossFoldLearner to make a
bootstrap learner.  Each training example would be passed to a subset of the
different sub-learners at random.  The goal would be to approximate
resampling.  Then you would look at the diversity of opinion between the
different classifiers.

This has some theoretical difficulties with the relationship between what a
real resampling would do and what this does.  Also, you are giving the
learners less data so that isn't going to estimate the error you would get
with all the data.
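
A minimal sketch of the idea, written against a stand-in learner interface
rather than the real CrossFoldLearner API.  It uses Poisson(1) example weights,
which is one standard way to approximate bootstrap resampling on-line (a
slightly different scheme than the random subsets described above):

import java.util.Random;

// Stand-in interface; not an actual Mahout class.
interface SimpleOnlineLearner {
  void train(int target, double[] features);
  double classify(double[] features);
}

class OnlineBootstrapLearner {
  private final SimpleOnlineLearner[] members;
  private final Random rand = new Random();

  OnlineBootstrapLearner(SimpleOnlineLearner[] members) {
    this.members = members;
  }

  void train(int target, double[] features) {
    // Each member sees the example a Poisson(1)-distributed number of times,
    // which approximates sampling with replacement as the data grows.
    for (SimpleOnlineLearner learner : members) {
      for (int i = poisson1(); i > 0; i--) {
        learner.train(target, features);
      }
    }
  }

  // The mean and spread of member predictions stand in for a bootstrap interval.
  double[] classifyWithSpread(double[] features) {
    double sum = 0, sumSq = 0;
    for (SimpleOnlineLearner learner : members) {
      double p = learner.classify(features);
      sum += p;
      sumSq += p * p;
    }
    double mean = sum / members.length;
    double variance = sumSq / members.length - mean * mean;
    return new double[] {mean, Math.sqrt(Math.max(variance, 0))};
  }

  // Knuth's method for sampling a Poisson variate with mean 1.
  private int poisson1() {
    double limit = Math.exp(-1);
    double p = 1.0;
    int k = 0;
    do {
      k++;
      p *= rand.nextDouble();
    } while (p > limit);
    return k - 1;
  }
}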

> We want to say how confident the regressed value is. Or, roughly, what
> the variance in the vicinity of X was, on average, in the
> training set.
>
> Say we have traditional PPC (pay per click) advertising. So we might
> use a binomial regression to compute CTR prediction (prob of a
> click-thru).
>

Sure.

And, in general, you don't want variance, but instead want to be able to
sample the posterior.  Then you can sample the posterior estimate of regret
for all of your models and ads and decide which one to pick. This is
delicate because you need a realistic prior to avoid blue-sky bets all the
time.  There is a major role in this for multi-level models so that you get
well-founded priors.
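
As a toy illustration of "sample the posterior and pick", with an independent
Beta posterior on CTR per ad (the prior pseudo-counts below are made up; in a
real system they would come from the multi-level model):

import org.apache.commons.math3.distribution.BetaDistribution;

class PosteriorSamplingChooser {
  // Made-up prior pseudo-counts standing in for a well-founded prior.
  private static final double PRIOR_CLICKS = 1.0;
  private static final double PRIOR_MISSES = 99.0;

  // Draw one CTR sample per ad from its Beta posterior and pick the ad with
  // the largest sampled payoff (sampled CTR times value per click).
  static int choose(long[] clicks, long[] impressions, double[] valuePerClick) {
    int best = -1;
    double bestPayoff = Double.NEGATIVE_INFINITY;
    for (int j = 0; j < clicks.length; j++) {
      double a = PRIOR_CLICKS + clicks[j];
      double b = PRIOR_MISSES + (impressions[j] - clicks[j]);
      double ctrSample = new BetaDistribution(a, b).sample();
      double payoff = ctrSample * valuePerClick[j];
      if (payoff > bestPayoff) {
        bestPayoff = payoff;
        best = j;
      }
    }
    return best;
  }
}

Ads with wide posteriors then get picked occasionally just because a draw comes
out high, which is the exploration side of the bandit problem below.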

> Then we could just multiply that by the expectation of what a click is
> worth (obtained through some bidding system) and hence obtain the expected
> payoff of a particular ad in a given situation.
>

Payoff is only part of the problem, of course, because you really have a
bandit problem here.  You need to model payoff and opportunity cost of
making the wrong decision now, but also to incorporate the estimated benefit
that learning about a model might have.  Again, strong priors are very
valuable in this.

> But there's a difference between saying 'rev(A) = 5c +- 0.1c'
> and 'rev(A) = 5c +- 2c', because in the first case we are pretty much damn
> sure B is almost always better than A, and in the second case we just say
> 'oh, they are both about the same, so just rotate them'.
>

This is regret, i.e. the expectation of opportunity cost:

   C_j = \int_Y \left[ \max(Y) - y_j \right] \, dP(Y)

where Y is the vector of payoffs and P(Y) is the multi-dimensional
cumulative distribution of same.
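
In Monte Carlo terms (my restatement), with S joint draws Y^{(1)}, ..., Y^{(S)}
from the posterior over payoffs,

   C_j \approx \frac{1}{S} \sum_{s=1}^{S} \left[ \max_k y_k^{(s)} - y_j^{(s)} \right]

which is exactly the quantity the posterior-sampling sketch above estimates.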


> So one way I see to go about this is: if we have a regression for the mode
> of the posterior,
> http://latex.codecogs.com/gif.latex?y=\hat{y}\left(\boldsymbol{z},\boldsymbol{\beta}\right)
> then we estimate the 'variance in the vicinity' by building a
> regression for another target set composed of the squared errors,
> http://latex.codecogs.com/gif.latex?\left(y-\hat{y}\right)^{2}=\hat{s}\left(\boldsymbol{x},\boldsymbol{\beta}\right)
>
> and that would give us much more leverage when comparing ad
> performance. In other words, we want a better handle on questions like
> 'how often is ad A performing better than ad B?'
>

This is Laplace's method for estimating posterior distributions.  See

http://www.inference.phy.cam.ac.uk/mackay/laplace.pdf

for instance, and the 1992 paper of his that he cites on the first page.
MacKay's book is excellent on this and related topics.  See

http://www.inference.phy.cam.ac.uk/itprnn/book.html
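
The gist of the Laplace approximation in symbols (standard form, not anything
specific to Mahout): expand the log posterior to second order around the MAP
estimate, giving

   p(\beta | D) \approx N(\hat{\beta}_{MAP}, H^{-1}),  where  H = -\nabla^2 \log p(\beta | D) at \beta = \hat{\beta}_{MAP}

so the variance of the linear score for a new example x is roughly
x^T H^{-1} x, which you then push through the logistic link to get an error
band on the predicted probability.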


These methods can be fruitful, but I don't know how to implement them in the
presence of big data (i.e. in an on-line learner).  With small data, the
bayesglm package in R may be helpful.   See

http://www.stat.columbia.edu/~gelman/research/unpublished/priors7.pdf

for more information.  I have used bayesglm in smaller data situations with
very good results.

Re: Confidence interval for logistic regression

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Thank you, Ted.

(I think MAP is another way to say 'distribution mode' for the entire
training set?).


I think I am talking about uncertainty of the result, not the parameters...

Here's the problem.


We want to say how confident the regressed value is. Or, roughly, what
the variance in the vicinity of X was, on average, in the
training set.

Say we have traditional PPC (pay per click) advertising. So we might
use a binomial regression to compute CTR prediction (prob of a
click-thru).

Then we could just multiply that by the expectation of what a click is
worth (obtained through some bidding system) and hence obtain the expected
payoff of a particular ad in a given situation.

Say we then move to a comparison of two ads, A and B, and the regression tells
us ad A is predicted to have a 5c gross revenue expectation and ad B is
predicted to have 6c.

But there's a difference between saying 'rev(A) = 5c +- 0.1c'
and 'rev(A) = 5c +- 2c', because in the first case we are pretty much damn
sure B is almost always better than A, and in the second case we just say
'oh, they are both about the same, so just rotate them'.

So one way I see to go about this is: if we have a regression for the mode
of the posterior, http://latex.codecogs.com/gif.latex?y=\hat{y}\left(\boldsymbol{z},\boldsymbol{\beta}\right)

then we estimate the 'variance in the vicinity' by building a
regression for another target set composed of the squared errors,
http://latex.codecogs.com/gif.latex?\left(y-\hat{y}\right)^{2}=\hat{s}\left(\boldsymbol{x},\boldsymbol{\beta}\right)

and that would give us much more leverage when comparing ad
performance. In other words, we want a better handle on questions like
'how often is ad A performing better than ad B?'
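
Roughly, the two-stage pipeline I have in mind, sketched against a stand-in
regressor interface rather than the actual Mahout classes:

// Stand-in interface; not a real Mahout class.
interface SimpleRegressor {
  void train(double target, double[] features);
  double predict(double[] features);
}

class MeanAndSpreadPipeline {
  private final SimpleRegressor meanModel;    // learns y-hat(x)
  private final SimpleRegressor spreadModel;  // learns E[(y - y-hat)^2 | x]

  MeanAndSpreadPipeline(SimpleRegressor meanModel, SimpleRegressor spreadModel) {
    this.meanModel = meanModel;
    this.spreadModel = spreadModel;
  }

  void train(double y, double[] x) {
    double yHat = meanModel.predict(x);
    // The second model is trained on the squared error of the current first
    // model, so its prediction reads as a local variance estimate.
    spreadModel.train((y - yHat) * (y - yHat), x);
    meanModel.train(y, x);
  }

  // Returns {prediction, rough standard error in the vicinity of x}.
  double[] predictWithSpread(double[] x) {
    double yHat = meanModel.predict(x);
    double s2 = Math.max(spreadModel.predict(x), 0);
    return new double[] {yHat, Math.sqrt(s2)};
  }
}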

On Thu, Mar 31, 2011 at 10:55 AM, Ted Dunning <te...@gmail.com> wrote:
> The SGD optimizer computes the MAP (maximum a posteriori) estimate.  This is
> like maximum likelihood, but it
> takes into account the prior distribution.
> It is possible to compute standard errors on the coefficients, but my
> memory is that this is a bit tricky to do in
> an on-line setting.  My guess is that the normal approximations made in
> these estimates will not be very reliable
> because of the small counts and very sparse data that these systems normally
> operate on.
> Normally, however, the use of standard error on the coefficient matrix is
> for the purposes of variable selection.  That
> generally turns out to be a very bad idea.  The reason is that variable
> selection is inherently similar to L_0 regularization
> and it is well known that this completely defeats convexity and leads to
> problems with massive numbers of local optima
> (massive here means exponential in the number of variables).
> In practice, L_1 regularization is a better choice, both because it is
> feasible and because the results are often better.
> But, once you apply a good regularizer, there isn't a question any more
> about whether a coefficient is bounded away
> from zero by the standard error because the regularizer will set the
> coefficient to zero if that is justifiable.
> Moreover, in the on-line Mahout setting, the final model isn't even the
> result of learning with a constant regularization
> parameter.  The regularization parameter evolves over time to give the best
> cross-validation estimate of performance.
> Can you say more about the underlying need that motivates this?
>
> On Thu, Mar 31, 2011 at 10:31 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>>
>> Hi,
>>
>> In logistic regression we basically estimate the mean (or mode) of the
>> prediction.
>> Is there also a way to estimate the standard error in the same
>> learning pipeline?
>>
>> Or do I need to set up another pipeline just for that, perhaps?
>>
>> Thanks.
>> -Dmitriy
>
>

Re: Confidence interval for logistic regression

Posted by Ted Dunning <te...@gmail.com>.
The SGD optimizer computes the MAP (maximum a posteriori) estimate.  This is
like maximum likelihood, but it
takes into account the prior distribution.

It is possible to compute standard errors on the coefficients, but my
memory is that this is a bit tricky to do in
an on-line setting.  My guess is that the normal approximations made in
these estimates will not be very reliable
because of the small counts and very sparse data that these systems normally
operate on.

Normally, however, the use of standard error on the coefficient matrix is
for the purposes of variable selection.  That
generally turns out to be a very bad idea.  The reason is that variable
selection is inherently similar to L_0 regularization
and it is well known that this completely defeats convexity and leads to
problems with massive numbers of local optima
(massive here means exponential in the number of variables).

In practice, L_1 regularization is a better choice, both because it is
feasible and because the results are often better.
But, once you apply a good regularizer, there isn't a question any more
about whether a coefficient is bounded away
from zero by the standard error because the regularizer will set the
coefficient to zero if that is justifiable.
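
Concretely (standard textbook form, not Mahout's exact objective), the
L_1-regularized fit

   \hat{\beta} = \arg\min_\beta \left[ \sum_i \log\left(1 + e^{-y_i \beta^T x_i}\right) + \lambda \|\beta\|_1 \right]

is the MAP estimate under an independent Laplace (double exponential) prior on
each coefficient, which is what drives coefficients exactly to zero when the
data doesn't justify them.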

Moreover, in the on-line Mahout setting, the final model isn't even the
result of learning with a constant regularization
parameter.  The regularization parameter evolves over time to give the best
cross-validation estimate of performance.

Can you say more about the underlying need that motivates this?


On Thu, Mar 31, 2011 at 10:31 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Hi,
>
> In logistic regression we basically estimate the mean (or mode) of the
> prediction.
> Is there also a way to estimate the standard error in the same
> learning pipeline?
>
> Or do I need to set up another pipeline just for that, perhaps?
>
> Thanks.
> -Dmitriy
>