Posted to user@mahout.apache.org by Boris Fersing <bo...@fersing.eu> on 2012/03/06 16:48:36 UTC

Updating a classifier model on the fly

Hi all,

is there a way to update a classifier model on the fly, or do I need
to recompute everything each time I add a document to a category in
the training set?

I would like to build something similar to some spam filters, where
you can confirm whether a message is spam or not and thus train the
classifier.

regards,
Boris
-- 
42
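Whether a model can be updated in place depends on the learner. Count-based models such as multinomial naive Bayes (the classic spam-filter approach Boris describes) can absorb one labeled document at a time, because training is just incrementing counts. Below is a minimal self-contained sketch in plain Python, not Mahout code; Mahout's sgd package reportedly offers an OnlineLogisticRegression with a similar incremental train() call, but that claim should be checked against the Mahout docs.

```python
import math
from collections import defaultdict

class OnlineNaiveBayes:
    """Multinomial naive Bayes whose counts are updated one document at a time."""

    def __init__(self):
        self.class_counts = defaultdict(int)                       # docs per class
        self.word_counts = defaultdict(lambda: defaultdict(int))   # class -> word -> count
        self.vocab = set()

    def train(self, label, words):
        # Updating the model is just incrementing counts -- no full rebuild.
        self.class_counts[label] += 1
        for w in words:
            self.word_counts[label][w] += 1
            self.vocab.add(w)

    def predict(self, words):
        total_docs = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            lp = math.log(n_docs / total_docs)
            n_words = sum(self.word_counts[label].values())
            for w in words:
                # Laplace smoothing so unseen words don't zero the probability.
                lp += math.log((self.word_counts[label][w] + 1) /
                               (n_words + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = OnlineNaiveBayes()
nb.train("spam", ["cheap", "pills", "now"])
nb.train("ham", ["meeting", "tomorrow", "agenda"])
nb.train("spam", ["cheap", "watches"])
print(nb.predict(["cheap", "pills"]))   # -> spam
```

Gradient-based learners have the same property: each labeled example is a single update step, so no full rebuild is required per document.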

Re: Updating a classifier model on the fly

Posted by Boris Fersing <bo...@fersing.eu>.
Hi,

thank you all for your help. I think I'll recompute the entire model
once enough files have been added to a category (threshold to be
determined), because I may also want to add a new category in some
situations. Computing the model doesn't take that long anyway.

cheers,
Boris
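Boris's plan, buffer newly labeled documents, rebuild past a threshold, and rebuild immediately when a brand-new category shows up, might be sketched roughly as follows. The class and names are hypothetical, and the actual Mahout rebuild job is left out.

```python
class RetrainBuffer:
    """Collects newly labeled docs and signals when a full rebuild is due."""

    def __init__(self, threshold=100):
        self.threshold = threshold
        self.pending = []          # (label, doc) pairs since the last rebuild
        self.known_labels = set()

    def add(self, label, doc):
        """Record a labeled doc; return True when a full retrain is due."""
        self.pending.append((label, doc))
        # A brand-new category forces an immediate rebuild, per Boris's note.
        new_category = label not in self.known_labels and bool(self.known_labels)
        self.known_labels.add(label)
        return new_category or len(self.pending) >= self.threshold

    def drain(self):
        """Hand the pending batch to the rebuild job and start fresh."""
        batch, self.pending = self.pending, []
        return batch

buf = RetrainBuffer(threshold=3)
assert not buf.add("spam", "doc1")
assert not buf.add("spam", "doc2")
assert buf.add("ham", "doc3")      # new category -> retrain right away
```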


Re: Updating a classifier model on the fly

Posted by Paritosh Ranjan <pr...@xebia.com>.
You can look into ClusterIterator. It requires prior information but is 
able to train on the fly.



Re: Updating a classifier model on the fly

Posted by Temese Szalai <te...@gmail.com>.
One other thing to consider (and I don't know if Mahout supports this,
because I am very new to Mahout, although very experienced with text
classification specifically) is that I have seen unsupervised or
semi-supervised learning approaches work for an "on the fly"
re-computation of a model. This can be particularly helpful for data
bootstrapping, i.e., cases where you have a small initial set of data
and want to put some kind of filter and feedback loop in place to build
a curated data set.

This is different from classification, though, where you have a labeled
data set and train the classifier to identify things that look like
that data set.

On the one or two occasions I've seen unsupervised or semi-supervised
learning applied to create a "model", I've seen it work OK when there
is only one category. So, if you are building a classic binary
classifier, only care about one category, and your system will work
just fine with that (i.e., "is this spam? y/n"), this might be worth
looking into, provided your use cases and business needs really demand
something on the fly and can tolerate lower precision and recall while
the system learns.

I don't know if this is useful to you at all.

Temese
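One common semi-supervised recipe of the kind Temese describes is self-training: fit on the small labeled set, pseudo-label whatever unlabeled documents the model is confident about, fold those in, and repeat. A toy sketch, with a trivial keyword scorer standing in for a real classifier (every name and threshold here is illustrative):

```python
def self_train(labeled, unlabeled, score, confidence=0.8, rounds=3):
    """Grow a labeled set by pseudo-labeling confident unlabeled docs.

    `score(doc, labeled)` must return (label, confidence in [0, 1]).
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        kept = []
        for doc in pool:
            label, conf = score(doc, labeled)
            if conf >= confidence:
                labeled.append((doc, label))   # promote confident guesses
            else:
                kept.append(doc)               # try again next round
        if len(kept) == len(pool):             # nothing promoted: stop early
            break
        pool = kept
    return labeled, pool

# Toy scorer: fraction of a doc's words already seen in spam-labeled docs.
def toy_score(doc, labeled):
    spam_words = {w for d, lab in labeled if lab == "spam" for w in d.split()}
    words = doc.split()
    frac = sum(w in spam_words for w in words) / len(words)
    return ("spam", frac) if frac >= 0.5 else ("ham", 1 - frac)

labeled, leftover = self_train(
    [("cheap pills now", "spam")],
    ["cheap pills", "meeting tomorrow"],
    toy_score)
```

As Temese warns, precision and recall suffer while a loop like this learns, since early pseudo-labels can be wrong and then reinforce themselves.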


Re: Updating a classifier model on the fly

Posted by Boris Fersing <bo...@fersing.eu>.
Thanks Charles, I'll have a look at it.

cheers,
Boris


Re: Updating a classifier model on the fly

Posted by Charles Earl <ch...@me.com>.
Boris,
Have you looked at online decision trees and the like?
http://www.cs.washington.edu/homes/pedrod/papers/kdd01b.pdf
I think ultimately the concept boils down to Temese's observation of there being some measure (in the paper's case, concept drift)
that triggers re-training on the entire set.
C
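The trigger Charles describes can be approximated far more crudely than the Hoeffding-bound tests in the linked paper: track the recent error rate and flag a retrain when it drifts above the rate observed just after the last rebuild. A rough sketch (the window size and margin are arbitrary illustrative values, not anything from the paper):

```python
from collections import deque

class DriftMonitor:
    """Flags retraining when the recent error rate drifts above a reference rate.

    Much simpler than the Hoeffding-bound machinery of VFDT-style learners;
    it only captures the "some measure triggers a full retrain" idea.
    """

    def __init__(self, window=50, margin=0.10):
        self.recent = deque(maxlen=window)
        self.reference = None      # error rate right after the last retrain
        self.margin = margin

    def record(self, was_error):
        """Record one prediction outcome; return True when drift is detected."""
        self.recent.append(1 if was_error else 0)
        rate = sum(self.recent) / len(self.recent)
        if self.reference is None:
            if len(self.recent) == self.recent.maxlen:
                self.reference = rate   # first full window sets the baseline
            return False
        return rate > self.reference + self.margin

    def reset(self):
        """Call after retraining so a fresh baseline gets established."""
        self.recent.clear()
        self.reference = None

mon = DriftMonitor(window=4, margin=0.1)
for _ in range(4):
    mon.record(False)              # model starts out accurate: baseline 0.0
assert any(mon.record(True) for _ in range(4))   # errors pile up -> drift
```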


Re: Updating a classifier model on the fly

Posted by Boris Fersing <bo...@fersing.eu>.
Hi Temese,

thank you very much for this information.

Boris


Re: Updating a classifier model on the fly

Posted by Temese Szalai <te...@gmail.com>.
Hi Boris -

Unless Mahout has super-powers that I am not aware of, years of experience
in text classification tell me that, yes, you will have to rebuild the
classifier model regularly as new labeled data becomes available.

If you are building a system that incorporates a user feedback loop, as it
sounds like you are (i.e., "yes, this message is spam"), one thing that
might reduce the amount of classifier re-training would be to verify that
each new incoming labeled document is not already in your data set, i.e.,
not a dupe. Additionally, you probably want to wait to retrain until you
have some critical mass of newly labeled documents, or else a critical
data point to include.
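The dupe check Temese suggests can be as simple as fingerprinting a normalized form of each document before it is allowed to count toward retraining. A sketch, where the normalization is deliberately naive (real mail would want header stripping, MIME decoding, etc.):

```python
import hashlib

class LabeledSet:
    """Training set that rejects duplicate documents before they count toward retraining."""

    def __init__(self):
        self.seen = set()
        self.docs = []

    @staticmethod
    def fingerprint(text):
        # Normalize case and whitespace so trivial variants hash identically.
        canonical = " ".join(text.lower().split())
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def add(self, text, label):
        """Return True if the doc is new; False if it's a dupe we can ignore."""
        fp = self.fingerprint(text)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        self.docs.append((text, label))
        return True

ts = LabeledSet()
assert ts.add("Buy cheap pills NOW", "spam")
assert not ts.add("buy  cheap pills now", "spam")   # same doc, different casing
```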

If someone has the ability to say "no this is not spam", keeping that data
as labeled data to add to your anti-content/negative content set would be
valuable.
Best,
Temese
