Posted to user@spark.apache.org by kundan kumar <ii...@gmail.com> on 2016/07/12 09:21:06 UTC

Handling categorical variables in StreamingLogisticRegressionWithSGD

Hi,

I am trying to use StreamingLogisticRegressionWithSGD to build a CTR
prediction model.

The documentation at

http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression

states that numFeatures should be *constant*.

The problem I am facing is this: since most of my variables are
categorical, numFeatures has to be the final number of features after the
categorical variables are encoded and parsed into labeled point format.

Suppose a categorical variable x1 has 10 distinct values in the current
window.

But in the next window some new values get added to x1 and the number of
distinct values increases. How should I handle numFeatures in this case,
since it will now change?
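
To make the problem concrete, here is a minimal plain-Python sketch (not Spark code). The helper name `one_hot_width` and the sample values are made up for illustration; it only shows that an encoding fitted on each window separately yields a different vector width as soon as a new value appears.

```python
def one_hot_width(window_values):
    """Width of a one-hot encoding fitted on a single window only."""
    return len(set(window_values))


# Hypothetical values of x1 observed in two consecutive windows.
window_1 = ["a", "b", "c"]
window_2 = ["a", "b", "c", "d"]  # "d" is new in this window

width_1 = one_hot_width(window_1)
width_2 = one_hot_width(window_2)

# The widths differ, so a model created with a fixed numFeatures
# cannot consume both windows unchanged.
print(width_1, width_2)
```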

Basically, my question is: how should I handle new values of categorical
variables in a streaming model?

Thanks,
Kundan

Re: Handling categorical variables in StreamingLogisticRegressionWithSGD

Posted by kundan kumar <ii...@gmail.com>.

Hi Sean,

Thanks for the reply!

Is there anything already available in Spark that can fix the depth (the
number of levels) of a categorical variable? The OneHotEncoder changes the
length of the vector it creates depending on the number of distinct values
coming in the stream.

Is there any parameter on StringIndexer that lets me fix the number of
levels of a categorical variable, or will I need to write an
implementation of my own?
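
One way a hand-written implementation could fix the number of levels is to freeze a vocabulary up front and reserve one extra slot for any value not in it. The sketch below is plain Python under that assumption, not a Spark API; `build_index`, `encode`, and the device names are hypothetical.

```python
def build_index(known_values):
    """Map each known value to a slot; one extra slot is kept for unseen values."""
    return {v: i for i, v in enumerate(sorted(set(known_values)))}


def encode(value, index):
    """One-hot encode `value` into a vector of fixed length len(index) + 1."""
    size = len(index) + 1
    vec = [0.0] * size
    # Values missing from the frozen vocabulary all land in the last slot.
    vec[index.get(value, size - 1)] = 1.0
    return vec


index = build_index(["mobile", "desktop", "tablet"])
seen = encode("tablet", index)      # a value from the frozen vocabulary
unseen = encode("smart_tv", index)  # a value that shows up later in the stream
```

Both vectors have the same length, so the feature count stays constant, at the cost of lumping all new values together.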

Thanks,
Kundan

On Tue, Jul 12, 2016 at 5:43 PM, Sean Owen <so...@cloudera.com> wrote:

> Yeah, for this to work, you need to know the number of distinct values
> a categorical feature will take on, ever. Sometimes that's known,
> sometimes it's not.
>
> One option is to use an algorithm that can use categorical features
> directly, like decision trees.
>
> You could also consider hashing your features. So, you'd have maybe
> 10 indicator columns and you hash each value into one of those 10
> columns to figure out which one it corresponds to. Of course, by the
> time you have an 11th value, at least two values share a column and
> get conflated, but at least you can sort of proceed.
>
> This is more usually done with a large number of feature values, but
> maybe that's what you have. It's more problematic the smaller your
> hash space is.

Re: Handling categorical variables in StreamingLogisticRegressionWithSGD

Posted by Sean Owen <so...@cloudera.com>.

Yeah, for this to work, you need to know the number of distinct values
a categorical feature will take on, ever. Sometimes that's known,
sometimes it's not.

One option is to use an algorithm that can use categorical features
directly, like decision trees.

You could also consider hashing your features. So, you'd have maybe
10 indicator columns and you hash each value into one of those 10
columns to figure out which one it corresponds to. Of course, by the
time you have an 11th value, at least two values share a column and
get conflated, but at least you can sort of proceed.

This is more usually done with a large number of feature values, but
maybe that's what you have. It's more problematic the smaller your
hash space is.
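
The hashing idea above can be sketched in plain Python as follows. This is an illustration under assumptions, not Spark's own implementation: `NUM_BUCKETS`, `bucket`, `hash_features`, and the sample row are hypothetical names, and SHA-256 is used only because it gives the same result on every run and machine.

```python
import hashlib

# Assumed hash space, fixed up front; it plays the role of numFeatures.
NUM_BUCKETS = 1024


def bucket(feature, value, num_buckets=NUM_BUCKETS):
    """Stable bucket index for a (feature, value) pair."""
    key = f"{feature}={value}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_features(row, num_buckets=NUM_BUCKETS):
    """Encode a dict of categorical features as sorted (index, 1.0) pairs."""
    indices = sorted({bucket(f, v, num_buckets) for f, v in row.items()})
    return [(i, 1.0) for i in indices]


# A value never seen before still maps into the same fixed index range.
row = {"x1": "value_seen_today", "x2": "banner_top"}
sparse = hash_features(row)
```

The (index, value) pairs are in the shape that a sparse-vector constructor such as `Vectors.sparse(NUM_BUCKETS, sparse)` in `pyspark.mllib.linalg` accepts. Choosing a much larger hash space than the expected number of distinct values keeps collisions rare.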

