Posted to user@mahout.apache.org by Salman Mahmood <sa...@gmail.com> on 2012/08/01 19:08:27 UTC

Clustering or Classification?

Hi all,

I am stuck deciding whether to apply classification or clustering to the
data set I have. The more I think about it, the more confused I get. Here's
what I am confronted with.

I have news documents (around 3000 and continuously increasing)
containing news about companies, investment, stocks, the economy, quarterly
income, etc. My goal is to sort the news so that I know which news item
corresponds to which company. E.g., for the news item "Apple
launches new iphone", I need to associate the company Apple with it. A
news item/document contains only a 'title' and a 'description', so I
have to analyze the text to find out which company the news
refers to. It could be multiple companies too.

To solve this, I turned to Mahout.

I started with clustering. I was hoping to get 'Apple', 'Google', 'Intel',
etc. as the top terms in my clusters, so that I would know the news in a
cluster corresponds to its cluster label, but things turned out differently.
I got 'investment', 'stocks', 'correspondence', 'green energy', 'terminal',
'shares', 'street', 'olympics' and lots of other terms as the top ones
(which makes sense, as clustering algorithms look for common terms). There
were some 'Apple' clusters, but very few news items were associated with
them. I thought maybe clustering is not suited to this kind of problem, as
much of the company news goes into more general clusters (investment,
profit) instead of the specific company cluster (Apple).

I started reading about classification, which requires training data. The
name was convincing too, as I actually want to 'classify' my news items into
'company names'. As I read on, I got the impression that the name
classification is a bit deceiving and that the technique is used more for
prediction than for classification. My other confusion was: how can I
prepare training data for news documents? Let's
assume I have a list of companies that I am interested in. I write a
program to produce training data for the classifier. The program checks
whether the news title or description contains the company name 'Apple'; if
so, it's a news story about Apple. Is this how I can prepare training data?
(Of course, I read that training data is actually a set of predictors and
target variables.) If so, then why should I use Mahout classification in the
first place? I should ditch Mahout and instead use this little program that I
wrote for training data (which actually does the classification).

You can see how confused I am about how to address this issue. Another
thing that concerns me is whether it's possible to make a system
intelligent enough that, if the news says 'iphone sales at a record high'
without using the word 'Apple', it can still classify the item as news
related to Apple.

Thank you in advance for pointing me in the right direction.

Re: Clustering or Classification?

Posted by syed kather <in...@gmail.com>.

Sorry, I had not seen Sean Owen's post, as the thread had not updated on my mobile.

Syed Abdul kather
send from Samsung S3

Re: Clustering or Classification?

Posted by syed kather <in...@gmail.com>.

Hi Salman Mahmood,
    Why don't you try applying clustering first? Once you have applied
high-level clustering, check the top terms. Set aside the clusters that look
good and concentrate on the clusters that look confused, until you find that
all the clusters are fine. To make the clusters cleaner, I indexed all the
documents into Solr, because with Solr I could remove stop words and apply a
Snowball filter and the like.
Once you have identified all the clusters, verify that each cluster's top
terms are good. Then use the cluster points to split the documents according
to their clusters. You will now have the documents in groups, and you can
apply classification and train on that set.

I hope you understood. This is the approach I followed. Let me know if
you have other ideas.
Syed Abdul kather
send from Samsung S3

Re: Clustering or Classification?

Posted by Sean Owen <sr...@gmail.com>.

I'm suggesting that classification sounds like the right solution for
the problem you have described. You can use Mahout (or anything else
that classifies) for that. Yes I am the same.

On Wed, Aug 1, 2012 at 6:50 PM, Salman Mahmood <sa...@gmail.com> wrote:
> Hi Sean,
>
> Thank you for the clarification. So are you saying that Mahout is not
> suitable in this case, or that clustering is not the right way to go
> and, if it's worth it, I should go for classification?
>
> Secondly are you the same Sean Owen who wrote Mahout in Action? :)
>

Re: Clustering or Classification?

Posted by Salman Mahmood <sa...@gmail.com>.

Hi Sean,

Thank you for the clarification. So are you saying that Mahout is not
suitable in this case, or that clustering is not the right way to go
and, if it's worth it, I should go for classification?

Secondly are you the same Sean Owen who wrote Mahout in Action? :)


Re: Clustering or Classification?

Posted by Sean Owen <sr...@gmail.com>.

Classifiers are supervised learning algorithms, so you need to provide
a bunch of examples of positive and negative classes. In your example,
it would be fine to label a bunch of articles as "about Apple" or not,
then use feature vectors derived from TF-IDF as input, with these
labels, to train a classifier that can tell when an article is "about
Apple".
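
To make the paragraph above concrete, here is a minimal sketch in plain
Python rather than Mahout. Everything in it is an assumption made for
illustration: the four hand-labeled titles are invented, the TF-IDF
weighting is one common smoothed variant, and the classifier is a simple
nearest-centroid rule standing in for whatever algorithm you would
actually train.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; a real pipeline would also
    # strip punctuation and stop words.
    return text.lower().split()

def fit_idf(texts):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n = len(texts)
    df = Counter(term for text in texts for term in set(tokenize(text)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in df.items()}

def tfidf(text, idf):
    # Unit-length TF-IDF vector; terms unseen in training are ignored.
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def train(labeled_docs):
    # One centroid per label: the sum of that label's TF-IDF vectors.
    idf = fit_idf([text for text, _ in labeled_docs])
    centroids = {}
    for text, label in labeled_docs:
        centroid = centroids.setdefault(label, Counter())
        for t, w in tfidf(text, idf).items():
            centroid[t] += w
    return idf, centroids

def classify(text, model):
    # Pick the label whose centroid has the highest cosine similarity.
    idf, centroids = model
    vec = tfidf(text, idf)

    def cosine(centroid):
        norm = math.sqrt(sum(w * w for w in centroid.values())) or 1.0
        return sum(w * centroid.get(t, 0.0) for t, w in vec.items()) / norm

    return max(centroids, key=lambda label: cosine(centroids[label]))

# Invented, hand-labeled examples ("apple" = about the company).
TRAINING = [
    ("apple launches new iphone", "apple"),
    ("iphone sales lift apple profit", "apple"),
    ("apple juice shown to reduce risk of dementia", "other"),
    ("oil stocks fall on street", "other"),
]
model = train(TRAINING)
print(classify("iphone sales at a record high", model))
print(classify("apple juice reduces dementia risk", model))
```

On this toy data the first title is assigned to "apple" even though the
word Apple is absent, because "iphone" and "sales" occur only in the
hand-labeled Apple examples. With a realistic corpus the result depends
entirely on how many labeled examples you can supply.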

I don't think it will quite work to automatically generate the
training set by labeling according to the simple rule, that it is
about Apple if 'Apple' is in the title. Well, if you do that, then
there is no point in training a classifier. You can make a trivial
classifier that achieves 100% accuracy on your test set by just
checking if 'Apple' is in the title! Yes, you are right, this gains
you nothing.

Clearly you want to learn something subtler from the classifier, so
that an article titled "Apple juice shown to reduce risk of dementia"
isn't classified as about the company. You'd really need to feed it
hand-classified documents.
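
A small sketch of why the keyword rule gains nothing, again in plain
Python. The three titles and their hand-assigned labels are invented for
illustration, and keyword_rule is a hypothetical stand-in for the "little
program" described in the question.

```python
def keyword_rule(title, company="apple"):
    # Label a title as being about the company if its name appears as a word.
    return company in title.lower().split()

# (title, hand-assigned label: is it really about Apple the company?)
titles = [
    ("Apple launches new iphone", True),
    ("Apple juice shown to reduce risk of dementia", False),
    ("iphone sales at a record high", True),
]

# If the labels are produced by the rule itself, the rule trivially
# agrees with them everywhere.
rule_labels = [(title, keyword_rule(title)) for title, _ in titles]
agreement = sum(keyword_rule(t) == label for t, label in rule_labels) / len(rule_labels)

# Against hand-assigned labels, the same rule gets two of three wrong.
mistakes = [t for t, about_company in titles if keyword_rule(t) != about_company]

print(agreement)
print(mistakes)
```

The rule scores perfectly against its own labels but mislabels both the
apple juice title and the iphone title that never mentions Apple, which are
exactly the cases a trained classifier would be expected to handle.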

That's the bad news, but, sure you can certainly train N classifiers
for N topics this way.
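
Training N classifiers for N topics can be sketched as one binary
classifier per company, so that a single article can receive several
company labels. The sketch below uses a plain perceptron on word counts
purely because it is short; the training titles, the company list and all
function names are invented for illustration and are not Mahout APIs.

```python
from collections import defaultdict

def tokens(text):
    return text.lower().split()

def train_perceptron(docs, company, epochs=10):
    # Binary perceptron: target is +1 if the article is labeled with
    # `company`, otherwise -1. Weights change only on mistakes.
    weights, bias = defaultdict(float), 0.0
    for _ in range(epochs):
        for text, companies in docs:
            target = 1 if company in companies else -1
            score = bias + sum(weights[t] for t in tokens(text))
            predicted = 1 if score > 0 else -1
            if predicted != target:
                for t in tokens(text):
                    weights[t] += target
                bias += target
    return weights, bias

def predict_companies(text, models):
    # Run every per-company classifier; keep each company that scores > 0.
    found = set()
    for company, (weights, bias) in models.items():
        score = bias + sum(weights.get(t, 0.0) for t in tokens(text))
        if score > 0:
            found.add(company)
    return found

# Invented hand-labeled articles; an article may name several companies.
DOCS = [
    ("apple launches new iphone", {"apple"}),
    ("google updates android search", {"google"}),
    ("apple and google settle patent dispute", {"apple", "google"}),
    ("oil stocks fall", set()),
]
models = {c: train_perceptron(DOCS, c) for c in ("apple", "google")}

print(predict_companies("google and apple announce new android phone", models))
print(predict_companies("oil stocks fall again", models))
```

Each classifier answers only "about this company or not", so adding a new
company means labeling examples for it and training one more model, without
touching the others.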

Classifiers put items into a class or not. They are not the same as
regression techniques which predict a continuous value for an input.
They're related but distinct.

Clustering has the advantage of being unsupervised. You don't need
labels. However the resulting clusters are not guaranteed to match up
to your notion of article topics. You may see a cluster that has a lot
of Apple articles, some about the iPod, but also some about Samsung
and laptops in general. I don't think this is the best tool for your
problem.


Re: Clustering or Classification?

Posted by Biju Balakrishnan <bi...@gmail.com>.

Hi Salman

> I have got news documents (around 3000 and continuously increasing)
> containing news about companies, investment, stocks, economy, quarterly
> income etc. My goal is to have the news sorted in such a way that I know
> which news correspond to which company. e.g for the news item "Apple
> launches new iphone", I need to associate the company Apple with it. A
> particular news item/document only contains 'title' and 'description' so I
> have to analyze the text in order to find out which company the news
> refers to. It could be multiple companies too.
>

If this is the problem you are trying to solve,
I would suggest a different solution. Since you want to classify by
company only,
it's better to use an NER (named-entity recognition) system to identify the
company names in the document and use those names to map the articles to
companies. This would be a simple and effective solution.


> You can see how confused I am about how to address this issue. Another
> thing that concerns me is that if its possible to make a system this
> intelligent, that if the news says 'iphone sales at a record high' without
> using the word 'Apple', the system can classify it as a news related to
> apple?
>

This is hard to achieve. You may need to spend a lot of time creating the
training set. Even then, the chance of such a system working through
classification is low.

But if you go with an NER-based solution, you could customize the NER
to identify entities such as "iPhone" and then map them to Apple.
This is achievable at low risk.
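
As a rough illustration of the mapping step only, here is a sketch in plain
Python. It is not an NER system: a trained recognizer would use context to
find entity mentions, whereas this stand-in is a simple alias lookup. The
ALIASES table and the companies_in function are hypothetical names invented
for this example.

```python
import re

# Hypothetical alias table: surface forms (company and product names),
# lowercased, mapped to the company they should be filed under.
ALIASES = {
    "apple": "Apple",
    "iphone": "Apple",
    "ipod": "Apple",
    "google": "Google",
    "android": "Google",
    "intel": "Intel",
}

def companies_in(text):
    # Split into lowercase alphanumeric words and look each one up.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {ALIASES[w] for w in words if w in ALIASES}

print(companies_in("iphone sales at a record high"))
print(companies_in("Intel and Google sign chip deal"))
```

A bare lookup like this would still file "Apple juice ..." under Apple,
which is the part a real NER system, with its use of surrounding context,
is meant to handle.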

Just a thought.
I would not recommend Mahout for such a problem.

-- 
*Biju*

Re: Clustering or Classification?

Posted by John Conwell <jo...@iamjohn.me>.

Here is an article I ran across a few weeks ago that I think describes what
you're after (at least at a high level):
http://blog.getprismatic.com/blog/2012/4/17/clustering-related-stories.html





-- 

Thanks,
John C

Re: Clustering or Classification?

Posted by Paritosh Ranjan <pr...@xebia.com>.

Would it help to find clusters and then map their top terms to categories?
I think mapping terms to categories will need to be a manual process, as
no software will be able to map iPhone to Apple by itself.

So, having a term -> category mapping beforehand, and applying this mapping
to each cluster's top terms, might help to categorize documents.
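
A minimal sketch of that idea in plain Python, assuming the clustering step
has already produced a list of top terms per cluster. The mapping table,
the example term lists and the label_cluster function are all invented for
illustration.

```python
from collections import Counter

# Hand-built term -> category mapping (hypothetical entries).
TERM_TO_CATEGORY = {
    "apple": "Apple",
    "iphone": "Apple",
    "ipod": "Apple",
    "android": "Google",
    "search": "Google",
}

def label_cluster(top_terms, mapping=TERM_TO_CATEGORY):
    # Each mapped top term votes for its category; unmapped terms are ignored.
    votes = Counter(mapping[t] for t in top_terms if t in mapping)
    if not votes:
        return None  # a generic cluster with no company-specific terms
    return votes.most_common(1)[0][0]

print(label_cluster(["iphone", "sales", "ipod", "record"]))
print(label_cluster(["investment", "stocks", "shares"]))
```

Clusters whose top terms are all generic (investment, stocks, shares) get
no category here, which matches the observation in the original question
that much of the company news lands in general clusters.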
