Posted to user@mahout.apache.org by sam wu <sw...@gmail.com> on 2013/01/03 22:28:37 UTC

classifier predicting only using beginning subset of time-based feature

Hi,

Normally a classifier predicts using the same set of features that was used
in training.
What happens if we need to predict using only an initial subset of the
time-based features?

Say we have an eCommerce web site.
A user transaction looks like this:
1. user logs in, 2. spends some time browsing/playing, 3. maybe buys some
goods, 4. exits.
A user can have several transactions per day over some time period.

Goal:
predict/classify the type of a new user (for simplicity, say we only have
MVP and non-MVP types).

The tricky part is that we'd like to make a decent prediction for a pretty new
user (say 5-7 days old),
but the classifier is basically trained on a much longer time period.

Suppose we have thousands of unlabeled senior users (with several months of
data) to start with.

First, to bootstrap the labels, I run a clustering to assign labels based
on the long-period attributes.
Then, the questions:
1. How do I find a meaningful threshold T (say 3, 5, 7, 10... days) for
forecasting a new user?
2. How do I infer the new training parameters (based on T) from the old
training parameters? Do I need another ML trick?

Ideas are highly appreciated.

Sam

Re: classifier predicting only using beginning subset of time-based feature

Posted by Ted Dunning <te...@gmail.com>.

On Thu, Jan 3, 2013 at 1:28 PM, sam wu <sw...@gmail.com> wrote:

> Hi,
>
> Normally classifier does prediction based on the same set of feature used
> in training.
> What happens if we need to predict only based on some beginning subset of
> time-based feature ?
>
> Say, we have an eCommerce web site,
> user transaction
> 1.users log in,  2. spend some time browsing/playing, 3. maybe buy some
> goods. 4. exit
> user can have several transactions per day over some time period.
>
> Goal:
> predict/classifier new user type (for simplicity, say we only have MVP, and
> Non-MVP type).
>
> The tricky part is that we'd like to do a decent prediction on a pretty new
> user (say 5-7 days old),
> but classifier is basically trained based on much longer time period

The classifier should only be trained on how those longer term users looked
when they were new.  If your database doesn't allow you to know what you
knew about those users back then, then you will have severe problems
training your model and may have to go back to logs in order to get a view
of what these users looked like when new.

If you don't do this correctly, then you are introducing what is known as a
time machine leak into your training data.  What this means is that you
will be training a classifier that will only work correctly if you also
have a time machine to gather its input.

> Suppose we have thousands unlabeled senior users(with several months data)
> to start with.
>

You have to truncate their data.
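
As a rough illustration of that truncation, here is a minimal Python sketch.
The event layout (dicts with "ts", "type" and "amount" keys) and the function
names are assumptions made for this example, not anything from Mahout or from
the original poster's data. The point is only that features for a senior user
are computed from the first T days after that user was first seen, so training
sees the same kind of input that a new user will provide at prediction time.

```python
from datetime import datetime, timedelta

def truncate_events(events, first_seen, days):
    """Keep only events from the first `days` days after `first_seen`."""
    cutoff = first_seen + timedelta(days=days)
    return [e for e in events if first_seen <= e["ts"] < cutoff]

def early_features(events, days):
    """Build features from a user's first `days` days only.

    `events` is a hypothetical list of dicts with keys "ts" (datetime),
    "type" ("login" or "buy") and, for purchases, "amount".
    """
    if not events:
        return {"sessions": 0, "purchases": 0, "spend": 0.0}
    first_seen = min(e["ts"] for e in events)
    window = truncate_events(events, first_seen, days)
    buys = [e for e in window if e["type"] == "buy"]
    return {
        "sessions": sum(1 for e in window if e["type"] == "login"),
        "purchases": len(buys),
        "spend": sum(e.get("amount", 0.0) for e in buys),
    }
```

The label for each senior user can still come from the full history (for
example, from the clustering step); only the input features are truncated.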

> Firstly, to bootstrap the label, I do a cluster to populate the label based
> on long period of attributes.
> Then, the question,
> 1. how do I find a meaningful threshold T(say 3 or 5 or 7,10..days) to
> forecast new user,
>

Try each one and see how well they work.  Then decide what the business
value is for each threshold.  You may decide to use many models to predict
things better and better as the user ages.
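
The "try each one" loop could be sketched as below. Everything here is a
stand-in chosen for brevity: the single feature (spend in the first T days),
the one-feature threshold rule used in place of a real classifier, and plain
accuracy used in place of a business-value metric are all assumptions for
illustration. Users are assumed to be (events, is_mvp) pairs, where events is
a list of (timestamp, amount) tuples.

```python
from datetime import datetime, timedelta

def spend_in_first_days(events, days):
    """Total amount spent within `days` of the user's first event."""
    if not events:
        return 0.0
    start = min(ts for ts, _ in events)
    cutoff = start + timedelta(days=days)
    return sum(amount for ts, amount in events if ts < cutoff)

def stump_accuracy(train, test):
    """Fit a one-feature threshold rule on train; return accuracy on test.

    Items are (feature_value, label) pairs; label is True for MVP.
    This is a toy stand-in for whatever classifier is actually used.
    """
    best_cut, best_hits = None, -1
    for cut in sorted({x for x, _ in train}):
        hits = sum((x >= cut) == y for x, y in train)
        if hits > best_hits:
            best_cut, best_hits = cut, hits
    return sum((x >= best_cut) == y for x, y in test) / len(test)

def sweep_thresholds(train_users, test_users, thresholds):
    """Map each candidate T to held-out accuracy using only T days of data."""
    results = {}
    for t in thresholds:
        train = [(spend_in_first_days(ev, t), y) for ev, y in train_users]
        test = [(spend_in_first_days(ev, t), y) for ev, y in test_users]
        results[t] = stump_accuracy(train, test)
    return results
```

Keeping one fitted model per T also gives the "many models" setup: a user who
is 3 days old is scored by the 3-day model, and later by the 7-day model.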

> 2. how do I infer the new training parameter(based on T)  from the old
> training param - do I do another ML trick?
>

I don't understand this question.