Posted to user@mahout.apache.org by Mani Kumar <ma...@gmail.com> on 2009/12/28 20:16:49 UTC

Incremental training of Classifier

Hi All,

I have run the 20newsgroups example and got a very good idea of how the
classifier works on a defined dataset.

But I have a slightly different situation here.

* I have a few tens of thousands of documents (~50k).
* Every day I get about 1k new documents, of which roughly 600 are already
classified, so I only need to classify about 400 documents per day.

So my approach would be:

1. Get all the documents into HDFS
2. Train the classifier on the data in HDFS
3. Classify the new, unclassified documents.

Right now I don't see a way to add more training documents (the 600
already-classified docs) into the system. Am I missing something?

Also, I don't want to throw away the training model and build it again from
scratch.

Thanks!
Mani Kumar

Re: Incremental training of Classifier

Posted by Mani Kumar <ma...@gmail.com>.
Hi Ted,

Sure, I understand that. I will keep posting about my findings and requirements.

Thanks!
Mani Kumar

On Tue, Dec 29, 2009 at 10:52 AM, Ted Dunning <te...@gmail.com> wrote:

> Mani,
>
> Keep in mind that Mahout is very new and that you can have a substantial
> influence on direction at this stage of the project by becoming involved.
>
> On Mon, Dec 28, 2009 at 9:15 PM, Mani Kumar <manikumarchauhan@gmail.com
> >wrote:
>
> > @Ted: Currently i just started experimentation with mahout, and don't
> have
> > a
> > very clear picture of how it can work for us. I'll let you details as i
> get
> > more experience with mahout and more deeper understanding of our
> > requirement.
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Incremental training of Classifier

Posted by Ted Dunning <te...@gmail.com>.
Mani,

Keep in mind that Mahout is very new and that you can have a substantial
influence on direction at this stage of the project by becoming involved.

On Mon, Dec 28, 2009 at 9:15 PM, Mani Kumar <ma...@gmail.com>wrote:

> @Ted: Currently i just started experimentation with mahout, and don't have
> a
> very clear picture of how it can work for us. I'll let you details as i get
> more experience with mahout and more deeper understanding of our
> requirement.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Incremental training of Classifier

Posted by Robin Anil <ro...@gmail.com>.
On Tue, Dec 29, 2009 at 10:45 AM, Mani Kumar <ma...@gmail.com>wrote:

> @Robin: thanks! btw whats the reasoning behind using CBayes for >2
> categories? While bayes works for spam/not spam kinda classification, why
> not for > 2 categories. It'd great if you can give some pointers to read
> and
> understand.
>
There's just a slight difference in the math behind it. CBayes is Bayes, but
it tries to classify objects as *not* belonging to a class instead of as
belonging to one. For more insight you can read the Complementary Naive Bayes
paper (Rennie et al., "Tackling the Poor Assumptions of Naive Bayes Text
Classifiers"). Do a quick experiment on 20newsgroups with both CBayes and
Bayes and you will see the difference.
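Roughly, the difference can be sketched in plain Python on a toy corpus (an
illustration of the idea only, not Mahout's actual implementation; the corpus
and function names here are made up):

```python
import math
from collections import Counter

# Toy corpus: category -> list of tokenized documents
docs = {
    "sports": [["ball", "goal"], ["goal", "team"]],
    "tech":   [["code", "bug"], ["code", "ball"]],
}
vocab = {w for ds in docs.values() for d in ds for w in d}

def log_probs(classes, alpha=1.0):
    # Smoothed log P(word | the pooled documents of `classes`)
    counts = Counter(w for c in classes for d in docs[c] for w in d)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: math.log((counts[w] + alpha) / total) for w in vocab}

def nb_score(doc, cls):
    # Standard Bayes: evidence FOR cls, from cls's own counts
    lp = log_probs([cls])
    return sum(lp[w] for w in doc)

def cnb_score(doc, cls):
    # Complementary Bayes: evidence AGAINST cls, estimated from all
    # OTHER classes' counts; negate so that higher is still better
    lp = log_probs([c for c in docs if c != cls])
    return -sum(lp[w] for w in doc)

def classify(doc, score):
    return max(docs, key=lambda c: score(doc, c))
```

With many classes, each complement estimate pools the counts of all the other
classes, so it is built from more data and is less skewed by uneven class
sizes; that is where CBayes tends to beat plain Bayes.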



> @Ted: Currently i just started experimentation with mahout, and don't have
> a
> very clear picture of how it can work for us. I'll let you details as i get
> more experience with mahout and more deeper understanding of our
> requirement.
>
> Thanks!
> Mani Kumar
>
> On Tue, Dec 29, 2009 at 6:14 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > mani,
> >
> > You are sounding more and more like the poster child for an on-line
> > classifier.
> >
> > The idea would be that you would give your classified docs to the system
> > first for testing, then again for incremental training.  You can use the
> > results of the test to adjust the learning rate for the incremental
> > learning.
> >
> > See the work I have started with MAHOUT-228 for the beginnings of this.
> >  Let
> > me know where it should go to help with your needs (i.e. what entry
> points
> > that you would need).
> >
> > On Mon, Dec 28, 2009 at 1:33 PM, Mani Kumar <manikumarchauhan@gmail.com
> > >wrote:
> >
> > > lets talk about bigger numbers e.g. i have more than 1 million docs and
> i
> > > get 10k new docs every day out of which 6k is already classified.
> > >
> > > Monitoring performance is good but it can be done weekly instead of
> daily
> > > just to reduce cost.
> > >
> > > I actually wanted to avoid the retraining as much as possible because
> it
> > > comes with huge cost for large dataset.
> > >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>

Re: Incremental training of Classifier

Posted by Mani Kumar <ma...@gmail.com>.
@Robin: thanks! BTW, what's the reasoning behind using CBayes for >2
categories? If Bayes works for spam/not-spam kinds of classification, why not
for >2 categories? It'd be great if you could give some pointers to read and
understand.

@Ted: I have just started experimenting with Mahout and don't have a very
clear picture of how it can work for us. I'll share details as I get more
experience with Mahout and a deeper understanding of our requirements.

Thanks!
Mani Kumar

On Tue, Dec 29, 2009 at 6:14 AM, Ted Dunning <te...@gmail.com> wrote:

> mani,
>
> You are sounding more and more like the poster child for an on-line
> classifier.
>
> The idea would be that you would give your classified docs to the system
> first for testing, then again for incremental training.  You can use the
> results of the test to adjust the learning rate for the incremental
> learning.
>
> See the work I have started with MAHOUT-228 for the beginnings of this.
>  Let
> me know where it should go to help with your needs (i.e. what entry points
> that you would need).
>
> On Mon, Dec 28, 2009 at 1:33 PM, Mani Kumar <manikumarchauhan@gmail.com
> >wrote:
>
> > lets talk about bigger numbers e.g. i have more than 1 million docs and i
> > get 10k new docs every day out of which 6k is already classified.
> >
> > Monitoring performance is good but it can be done weekly instead of daily
> > just to reduce cost.
> >
> > I actually wanted to avoid the retraining as much as possible because it
> > comes with huge cost for large dataset.
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Incremental training of Classifier

Posted by Ted Dunning <te...@gmail.com>.
Mani,

You are sounding more and more like the poster child for an on-line
classifier.

The idea would be that you would give your classified docs to the system
first for testing, then again for incremental training.  You can use the
results of the test to adjust the learning rate for the incremental
learning.

See the work I have started with MAHOUT-228 for the beginnings of this. Let
me know where it should go to help with your needs (i.e., what entry points
you would need).
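The test-then-train loop could look something like this sketch (plain Python
with a toy perceptron-style learner; this is not the MAHOUT-228 code, and the
learning-rate rule is just one made-up choice):

```python
class OnlineLearner:
    """Tiny binary online learner over sparse term features."""
    def __init__(self, lr=0.5):
        self.w = {}      # feature -> weight
        self.lr = lr

    def predict(self, doc):
        return 1 if sum(self.w.get(f, 0.0) for f in doc) > 0 else 0

    def train(self, doc, label):
        # Perceptron-style update on mistakes only
        if self.predict(doc) != label:
            delta = self.lr if label == 1 else -self.lr
            for f in doc:
                self.w[f] = self.w.get(f, 0.0) + delta

def daily_update(model, labeled_batch):
    # 1) TEST: score yesterday's model on today's pre-classified docs
    acc = sum(model.predict(d) == y for d, y in labeled_batch) / len(labeled_batch)
    # 2) Adjust the learning rate: learn faster when accuracy is poor
    model.lr = max(0.05, 0.5 * (1.0 - acc))
    # 3) TRAIN: feed the same docs back in incrementally
    for d, y in labeled_batch:
        model.train(d, y)
    return acc
```

The point is that the already-classified docs do double duty: first as an
unbiased daily accuracy estimate, then as incremental training data.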

On Mon, Dec 28, 2009 at 1:33 PM, Mani Kumar <ma...@gmail.com>wrote:

> lets talk about bigger numbers e.g. i have more than 1 million docs and i
> get 10k new docs every day out of which 6k is already classified.
>
> Monitoring performance is good but it can be done weekly instead of daily
> just to reduce cost.
>
> I actually wanted to avoid the retraining as much as possible because it
> comes with huge cost for large dataset.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Incremental training of Classifier

Posted by Robin Anil <ro...@gmail.com>.
>
>
>
> > with a 50K set, you may/may not loose out on some features. Depends
> > entirely
> > on the data. If you dont mind answering - What is the number of
> categories
> > that you have?
> >
>
>   ~50 categories
>

Stick to the CBayes classifier for >2 categories.

Re: Incremental training of Classifier

Posted by Mani Kumar <ma...@gmail.com>.
Comments inline.

-Mani Kumar

On Tue, Dec 29, 2009 at 3:14 AM, Robin Anil <ro...@gmail.com> wrote:

> with a 50K set, you may/may not loose out on some features. Depends
> entirely
> on the data. If you dont mind answering - What is the number of categories
> that you have?
>

  ~50 categories

>
> I agree that re-training 1 million docs is cumbersome. But if i remember
> correctly, I trained(CBayes) on a 3GB subject of wikipedia on 6 pentium-4
> HT
> systems in 20 mins.


-- that's fast.


> I dont know how big your data or how big your cluster
> is. But a daily 1 hour map/reduce job is not that expensive (Maybe I am
> blind and have no sense of what is big after working at google). I say, try
> and estimate it yourself.


-- a daily 1-hour job is not an issue, but 6-8 hours daily would be.


>
> On the other hand. You could also try a dual fold approach. A sturdy 1
> million docs trained classifier and recent 50K docs classifier. And do some
> form of voting.
>
> I am sure you will not be able to load the 1mil model in to memory, you
> might need to use Hbase there. Instead you can use 50K model in  memory for
> fast classification. Then run a batch classification job daily to
> re-classify your dataset based on the 1mil model
>

-- yes, I'll have to use HBase. Thanks!


>
> Robin
>
>
>
> On Tue, Dec 29, 2009 at 3:03 AM, Mani Kumar <manikumarchauhan@gmail.com
> >wrote:
>
> > Thanks for the quick response.
> >
> > @Robin absolutely agree on your suggestion regarding using 600 docs for
> > monitoring performance.
> >
> > lets talk about bigger numbers e.g. i have more than 1 million docs and i
> > get 10k new docs every day out of which 6k is already classified.
> >
> > Monitoring performance is good but it can be done weekly instead of daily
> > just to reduce cost.
> >
> > I actually wanted to avoid the retraining as much as possible because it
> > comes with huge cost for large dataset.
> >
> > Better solution could that we'll use 50k docs from every category order
> by
> > created_at desc, to reduce the amount of data and stay tuned with latest
> > trends.
> >
> > Thanks a lot guys.
> >
> > -Mani Kumar
> >
> > On Tue, Dec 29, 2009 at 1:22 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > On Mon, Dec 28, 2009 at 11:24 AM, Robin Anil <ro...@gmail.com>
> > wrote:
> > >
> > > > Long answer, You can use your 600 docs to test the classifier and see
> > > your
> > > > accuracy. Then retrain with the entire documents and then test a test
> > > data
> > > > set. So daily you can choose to include or exclude the 600 documents
> > that
> > > > come and ensure that you keep your classifier at the top performance.
> > > >  After
> > > > some amount of documents, you dont get much benefit of retraining.
> > > Further
> > > > training would only add over fitting errors.
> > > >
> > >
> > > The suggestion that the 600 new documents be used to monitor
> performance
> > is
> > > an excellent one.
> > >
> > > It should be pretty easy to add the "train on incremental data" option
> to
> > > K-means.
> > >
> > > Also, the k-means algorithm definitely will reach a point of
> diminishing
> > > returns, but it should be very resistant to over training.
> > >
> >
>

Re: Incremental training of Classifier

Posted by Robin Anil <ro...@gmail.com>.
With a 50K set, you may or may not lose out on some features; it depends
entirely on the data. If you don't mind answering: what is the number of
categories that you have?

I agree that re-training on 1 million docs is cumbersome. But if I remember
correctly, I trained (CBayes) on a 3GB subset of Wikipedia on 6 Pentium-4 HT
systems in 20 mins. I don't know how big your data or your cluster is, but a
daily 1-hour map/reduce job is not that expensive (maybe I am blind and have
no sense of what is big after working at Google). I say, try and estimate it
yourself.

On the other hand, you could also try a dual-fold approach: a sturdy
classifier trained on the 1 million docs plus a classifier trained on the
recent 50K docs, with some form of voting between them.

I am sure you will not be able to load the 1M model into memory; you might
need to use HBase there. The 50K model, on the other hand, can be kept in
memory for fast classification. Then run a daily batch classification job to
re-classify your dataset based on the 1M model.
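The voting step could be as simple as a weighted blend of the two models'
per-category scores (a sketch; the function name and the 0.3 weight are
arbitrary choices, not anything in Mahout):

```python
def vote(big_scores, recent_scores, recent_weight=0.3):
    """Blend per-category scores from the sturdy 1M-doc model and the
    fresh 50K-doc model; the highest combined score wins."""
    cats = set(big_scores) | set(recent_scores)
    combined = {
        c: (1 - recent_weight) * big_scores.get(c, 0.0)
           + recent_weight * recent_scores.get(c, 0.0)
        for c in cats
    }
    return max(combined, key=combined.get)
```

Raising recent_weight makes the system track recent trends faster, at the
cost of stability from the big model.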

Robin



On Tue, Dec 29, 2009 at 3:03 AM, Mani Kumar <ma...@gmail.com>wrote:

> Thanks for the quick response.
>
> @Robin absolutely agree on your suggestion regarding using 600 docs for
> monitoring performance.
>
> lets talk about bigger numbers e.g. i have more than 1 million docs and i
> get 10k new docs every day out of which 6k is already classified.
>
> Monitoring performance is good but it can be done weekly instead of daily
> just to reduce cost.
>
> I actually wanted to avoid the retraining as much as possible because it
> comes with huge cost for large dataset.
>
> Better solution could that we'll use 50k docs from every category order by
> created_at desc, to reduce the amount of data and stay tuned with latest
> trends.
>
> Thanks a lot guys.
>
> -Mani Kumar
>
> On Tue, Dec 29, 2009 at 1:22 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > On Mon, Dec 28, 2009 at 11:24 AM, Robin Anil <ro...@gmail.com>
> wrote:
> >
> > > Long answer, You can use your 600 docs to test the classifier and see
> > your
> > > accuracy. Then retrain with the entire documents and then test a test
> > data
> > > set. So daily you can choose to include or exclude the 600 documents
> that
> > > come and ensure that you keep your classifier at the top performance.
> > >  After
> > > some amount of documents, you dont get much benefit of retraining.
> > Further
> > > training would only add over fitting errors.
> > >
> >
> > The suggestion that the 600 new documents be used to monitor performance
> is
> > an excellent one.
> >
> > It should be pretty easy to add the "train on incremental data" option to
> > K-means.
> >
> > Also, the k-means algorithm definitely will reach a point of diminishing
> > returns, but it should be very resistant to over training.
> >
>

Re: Incremental training of Classifier

Posted by Mani Kumar <ma...@gmail.com>.
Thanks for the quick response.

@Robin absolutely agree on your suggestion regarding using 600 docs for
monitoring performance.

Let's talk about bigger numbers: say I have more than 1 million docs and I
get 10k new docs every day, of which 6k are already classified.

Monitoring performance is good, but it can be done weekly instead of daily
just to reduce cost.

I actually wanted to avoid retraining as much as possible because it comes
with a huge cost for a large dataset.

A better solution could be that we use 50k docs from every category, ordered
by created_at desc, to reduce the amount of data and stay tuned to the
latest trends.

Thanks a lot guys.

-Mani Kumar

On Tue, Dec 29, 2009 at 1:22 AM, Ted Dunning <te...@gmail.com> wrote:

> On Mon, Dec 28, 2009 at 11:24 AM, Robin Anil <ro...@gmail.com> wrote:
>
> > Long answer, You can use your 600 docs to test the classifier and see
> your
> > accuracy. Then retrain with the entire documents and then test a test
> data
> > set. So daily you can choose to include or exclude the 600 documents that
> > come and ensure that you keep your classifier at the top performance.
> >  After
> > some amount of documents, you dont get much benefit of retraining.
> Further
> > training would only add over fitting errors.
> >
>
> The suggestion that the 600 new documents be used to monitor performance is
> an excellent one.
>
> It should be pretty easy to add the "train on incremental data" option to
> K-means.
>
> Also, the k-means algorithm definitely will reach a point of diminishing
> returns, but it should be very resistant to over training.
>

Re: Incremental training of Classifier

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Dec 28, 2009 at 11:24 AM, Robin Anil <ro...@gmail.com> wrote:

> Long answer, You can use your 600 docs to test the classifier and see your
> accuracy. Then retrain with the entire documents and then test a test data
> set. So daily you can choose to include or exclude the 600 documents that
> come and ensure that you keep your classifier at the top performance.
>  After
> some amount of documents, you dont get much benefit of retraining. Further
> training would only add over fitting errors.
>

The suggestion that the 600 new documents be used to monitor performance is
an excellent one.

It should be pretty easy to add the "train on incremental data" option to
K-means.

Also, the k-means algorithm will definitely reach a point of diminishing
returns, but it should be very resistant to overtraining.
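For k-means specifically, a "train on incremental data" option is natural
because a centroid is just a running mean that can absorb one point at a
time, e.g. (a toy sketch, not Mahout code):

```python
def fold_in(centroid, count, point):
    """Fold one new point into a centroid's running mean without
    revisiting any of the previously seen points."""
    count += 1
    centroid = [c + (p - c) / count for c, p in zip(centroid, point)]
    return centroid, count
```

Each incoming point would be assigned to its nearest centroid and folded in
this way, which is why incremental updates are cheap for k-means.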

Re: Incremental training of Classifier

Posted by Robin Anil <ro...@gmail.com>.
Hi Mani,
            Short answer: Currently you need to retrain the model.

Long answer: you can use your 600 docs to test the classifier and see your
accuracy, then retrain with the entire document set and test against a
held-out test set. So each day you can choose to include or exclude the 600
documents that come in, ensuring you keep your classifier at top performance.
After some number of documents, you don't get much benefit from retraining;
further training would only add overfitting errors.
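For what it's worth, the underlying model is count-based, so in principle
"adding the 600 docs" is just adding their counts, along the lines of this
sketch (illustrative only; this is not how Mahout's trainer is structured):

```python
from collections import Counter, defaultdict

class CountModel:
    """Word-count model in the NB family: an incremental update is just
    a counter merge, with no need to revisit old documents."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word counts
        self.doc_counts = Counter()              # category -> #docs seen

    def add_documents(self, labeled_docs):
        for tokens, label in labeled_docs:
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
```

The catch is that derived quantities (idf-style weights, normalizers) still
have to be recomputed after a merge, which is presumably why a full retrain
is the current answer.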

Robin


On Tue, Dec 29, 2009 at 12:46 AM, Mani Kumar <ma...@gmail.com>wrote:

> Hi All,
>
> I have ran 20newsgroups example. Got a very good idea of how cluster is
> working for a defined dataset.
>
> But i have a slightly different situation here.
>
> * I have few thousands of documents (50k).
> * Everyday i get some e.g. 1k documents and out of which 600 are already
> classified so i need to classify only 400 documents everyday.
>
> So my approach would be:
>
> 1. Get all the documents into hdfs
> 2. Train classifier based on data in hdfs
> 3. Classify new unclassified document.
>
> Right now i don't see a way to add more training documents (600 already
> classified docs) into system? Am i missing something?
>
> Also I don't want to remove and then create training model again.
>
> Thanks!
> Mani Kumar
>