Posted to user@mahout.apache.org by LiLeqiang <fr...@hotmail.com> on 2012/04/12 07:34:52 UTC

Mahout Logistic Regression Does Not Work Properly for Me

Hey guys,
I have a problem using Mahout's logistic regression classifiers.
I'm doing some study on ad click prediction. I've got a large collection of ad impression and click data for my study; the collection holds hundreds of millions of records. My goal is to predict the CTR (click-through rate) of a collection of ad creatives, given the id of the ad creative impressed, the region, the time, the OS and device type of the user, the network type of the user, and some other features.
To achieve this, I treat all the above features as predictor variables. The id of the ad creative is treated as categorical, and region and time (in hours) are also categorical.
My target variable is defined to be whether the user clicks on the impressed ad creative, so it has two values: 0 means no click and 1 means a click.
Here is the trick: suppose the model is already trained. When an ad request comes in, I first extract the predictor variables from the request. Then, for each of the advertisements available, I append the id of the current ad to the predictors and call the model's classifyScalar method to get the probability that the current ad will be clicked, which is my goal: the CTR.
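For concreteness, the scoring loop described above might look roughly like the following plain-Java sketch. It only illustrates the general idea of hashing categorical features into a weight vector and squashing the dot product through the logistic function, the way classifyScalar does for a two-class model; the class name, dimension, hash scheme, and zero-initialized weights are all made up for illustration and are not Mahout's actual API.

```java
public class CtrScoreSketch {
    static final int DIM = 1 << 10;            // assumed hashed feature space
    static double[] weights = new double[DIM]; // would come from the trained model

    // Hash a categorical "name=value" feature into one slot of the weight vector.
    static int slot(String feature) {
        return Math.floorMod(feature.hashCode(), DIM);
    }

    // Score one (request, ad) pair: sum the weights of its active features,
    // then squash through the logistic function to get a click probability.
    static double score(String[] requestFeatures, String adId) {
        double z = 0.0;
        for (String f : requestFeatures) z += weights[slot(f)];
        z += weights[slot("ad=" + adId)];     // the appended ad-id feature
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        String[] request = {"region=EU", "hour=14", "os=android", "net=wifi"};
        for (String ad : new String[]{"ad42", "ad99"}) {
            System.out.println(ad + " -> " + score(request, ad));
        }
    }
}
```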

I first used the OnlineLogisticRegression class to do the job, with the learning rate initially chosen to be 0.2. I found that when the number of training passes is large enough, the model converges, but the CTRs it predicts are unexpectedly high: over 20% for most ads in my test cases. This is unacceptable, since the actual CTR in my data is only around 1%-2%. I then tried different learning rates and saw the same phenomenon. I see other parameters on OnlineLogisticRegression, but I don't know exactly how they work.
Next, I tried AdaptiveLogisticRegression. I found that with a larger averaging window and interval the algorithm performs better, but the final result is just like the above: unexpectedly high.
I thought there must be something wrong, so I tried some feature interactions (just combining the string values of features to get new features; the features are all categorical), and the result was even worse.
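The "combine string values" interaction scheme could be sketched like this (a hypothetical helper for illustration, not Mahout code):

```java
import java.util.ArrayList;
import java.util.List;

public class InteractionSketch {
    // Build pairwise interaction features by joining the string values of
    // each pair of base categorical features.
    static List<String> withInteractions(List<String> base) {
        List<String> out = new ArrayList<>(base);
        for (int i = 0; i < base.size(); i++)
            for (int j = i + 1; j < base.size(); j++)
                out.add(base.get(i) + "&" + base.get(j));
        return out;
    }

    public static void main(String[] args) {
        List<String> base = List.of("region=EU", "os=android", "net=wifi");
        System.out.println(withInteractions(base));
    }
}
```

One plausible reason such interactions made results worse: each combined feature like region=EU&os=android occurs much more rarely than its parts, so with many categorical levels the interaction cells may be too sparse to estimate reliably.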
There must be something wrong, but I cannot figure it out. My choice of target variable may not be suitable, or my feature selection may not be done properly, or maybe I should try other approaches such as linear regression.
Has anybody encountered a situation like this? I would appreciate it if someone could give me some advice.

Thanks a lot,
Frelankie lee


Re: Mahout Logistic Regression Does Not Work Properly for Me

Posted by Ted Dunning <te...@gmail.com>.
So, the first thought that I have is that it sounds like you have dense
variables rather than sparse.  This may affect behavior of the Mahout
system.  If you have some text-like features of the ad, then you may get
cleaner results.

Secondly, I don't see any interaction features.  With as much training data
as you have, interactions with user id are probably warranted.

Regarding the predicted click-through, it is very hard to say if these
results are implausible or not purely on the predicted scores.  Logistic
regression as used here may or may not provide calibrated scores even if
working well.  In your case, we clearly have calibration issues, but I
think that a lift chart or Lorenz plot might be more useful for determining
whether you are actually getting reasonable results.
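A lift computation of the kind mentioned here takes only a few lines of plain Java (a hypothetical helper, not part of Mahout): rank impressions by predicted score and ask what fraction of all clicks the top slice captures.

```java
import java.util.Arrays;
import java.util.Comparator;

public class LiftSketch {
    // Fraction of all clicks captured by the top `topFrac` of impressions
    // when impressions are ranked by predicted score, descending.
    static double liftAtTop(double[] scores, boolean[] clicked, double topFrac) {
        Integer[] idx = new Integer[scores.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> -scores[i]));
        int top = (int) Math.ceil(topFrac * scores.length);
        int clicksTop = 0, clicksAll = 0;
        for (int r = 0; r < idx.length; r++) {
            if (clicked[idx[r]]) {
                clicksAll++;
                if (r < top) clicksTop++;
            }
        }
        return (double) clicksTop / clicksAll;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.2, 0.1};
        boolean[] clicked = {true, true, false, false};
        // A perfect ranking puts every click in the top half.
        System.out.println(liftAtTop(scores, clicked, 0.5)); // prints 1.0
    }
}
```

If the top decile of scored impressions holds far more than a tenth of the clicks, the model is ranking usefully even when its raw probabilities are miscalibrated.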

In general, to do off-line evaluation of an ad targeting system, you need
to include a random component in your current ad-targeting system so that
you don't have a grotesquely biased result.
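As a sketch, the "random component" could be as simple as epsilon-greedy serving (illustrative code only; the epsilon value and the idea of logging the branch taken are assumptions, not anything Mahout provides):

```java
import java.util.Random;

public class ExploreSketch {
    // With probability epsilon serve a uniformly random ad (and log that the
    // impression was exploratory); otherwise serve the model's top-scored ad.
    // Offline evaluation restricted to the exploratory slice is not biased
    // by the current targeting policy.
    static int chooseAd(double[] scores, double epsilon, Random rng) {
        if (rng.nextDouble() < epsilon) {
            return rng.nextInt(scores.length);        // exploration
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++)
            if (scores[i] > scores[best]) best = i;   // exploitation: argmax
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {0.1, 0.5, 0.3};
        System.out.println(chooseAd(scores, 0.0, new Random())); // prints 1 (argmax)
    }
}
```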

So the real question is not so much whether the score is accurately
predicting click-through, but whether high scores correlate well with
clicks (and vice versa).
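One standard way to check exactly this, independent of calibration, is AUC: the probability that a randomly chosen clicked impression outranks a randomly chosen unclicked one. A minimal sketch (hypothetical helper, not Mahout code):

```java
public class AucSketch {
    // AUC: probability that a random clicked impression scores higher than a
    // random unclicked one, with ties counting half. 0.5 means the ranking is
    // useless, 1.0 perfect; the value is unaffected by miscalibrated scores.
    static double auc(double[] scores, boolean[] clicked) {
        double wins = 0;
        long pairs = 0;
        for (int i = 0; i < scores.length; i++) {
            if (!clicked[i]) continue;
            for (int j = 0; j < scores.length; j++) {
                if (clicked[j]) continue;
                pairs++;
                if (scores[i] > scores[j]) wins += 1.0;
                else if (scores[i] == scores[j]) wins += 0.5;
            }
        }
        return wins / pairs;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.2, 0.1};
        boolean[] clicked = {true, true, false, false};
        System.out.println(auc(scores, clicked)); // prints 1.0
    }
}
```

The quadratic pairwise loop is fine for a sanity check on a sample; for hundreds of millions of impressions one would sort by score and count rank statistics instead.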

2012/4/11 LiLeqiang <fr...@hotmail.com>