Posted to user@mahout.apache.org by Salman Mahmood <sa...@influestor.com> on 2012/08/13 11:58:00 UTC

Classifying items in more than one category

I am developing a news classification system that assigns each news item to an organization or company name. For instance, a news item titled "Apple to launch new iPhone in September 2012" is categorized as "Apple" news.
So far, training the classifier on a set of topics such as Apple news, Google news, Microsoft news, Samsung news, Bank of America news, etc. has worked well: a single trained model classifies almost 99% of instances correctly.
Now the problem is to classify a news item such as "Samsung and Google prep attack against Apple" into three topics: "Apple", "Samsung" and "Google".

My question is how I can use Mahout's classification to assign a single item to multiple classes. I saw a similar question in this thread: http://mail-archives.apache.org/mod_mbox/mahout-user/201206.mbox/%3C20120607223156.GA26283@opus.istwok.net%3E

Ted Dunning gave an interesting answer there: make a separate category for each combination of topics. In my case, though, there are far too many combinations. I have to classify news into almost 15,000 companies, and realistically any news item can involve any mixture of those companies, so making each combination a separate category is ruled out.
A second suggestion was to arrange the topics in a hierarchy, which also does not apply here, as the company names don't converge to any base category.
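
To put the combination count in perspective, here is a quick back-of-the-envelope sketch in Python (standard library only, Python 3.8+ for math.comb). The only input is the 15,000 figure from the message; the variable names are just for illustration.

```python
import math

COMPANIES = 15000  # company count quoted in the message

# Number of distinct categories needed if every pair or triple of
# companies had to be its own class.
pairs = math.comb(COMPANIES, 2)
triples = math.comb(COMPANIES, 3)

print(f"two-company categories:   {pairs:,}")
print(f"three-company categories: {triples:,}")
```

Even limiting items to two or three companies each gives far more categories than there could be labelled training examples, which is why combination classes do not scale here.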

Having 15,000 models for 15,000 topics would do it, but that does not sound very practical either!

So what is the correct way to classify multi-topic news?

Thanks!  
 

Re: Classifying items in more than one category

Posted by Ted Dunning <te...@gmail.com>.
The hierarchical solution actually may apply here.  The ontology would be
something like:

Company Stories > Tech companies > {apple, samsung, ibm, ...}

The key would be to extend the hierarchical approach to allow
multi-branches with many options (such as tech companies).  To do this,
don't train a single model for all companies, but instead train a single
binary model per company.
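
A minimal sketch of the "one binary model per company" idea, in plain Python rather than Mahout. Everything here is an assumption made for illustration: the toy headlines, the helper names (tokens, BinaryPerceptron, train_one_per_label, classify), and the use of a simple perceptron as the binary learner. In Mahout the per-company model would be whichever binary classifier you already train; only the structure (one yes/no decision per company, all applied to the same item) carries over.

```python
import re

def tokens(text):
    """Distinct lower-cased words; a stand-in for real feature extraction."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class BinaryPerceptron:
    """One yes/no model for one company (hypothetical helper, not a Mahout class)."""

    def __init__(self):
        self.weights = {}
        self.bias = 0.0

    def margin(self, words):
        return self.bias + sum(self.weights.get(w, 0.0) for w in words)

    def update(self, words, is_member):
        """Apply the perceptron rule; return True if the example was misclassified."""
        sign = 1.0 if is_member else -1.0
        if sign * self.margin(words) > 0:
            return False
        self.bias += sign
        for w in words:
            self.weights[w] = self.weights.get(w, 0.0) + sign
        return True

def train_one_per_label(examples, max_epochs=1000):
    """Train an independent binary model for every label seen in the data."""
    labels = sorted({label for _, tagged in examples for label in tagged})
    models = {label: BinaryPerceptron() for label in labels}
    for _ in range(max_epochs):
        mistakes = 0
        for text, tagged in examples:
            words = tokens(text)
            # Each item is a positive example for its own companies and a
            # negative example for every other company.
            for label, model in models.items():
                mistakes += model.update(words, label in tagged)
        if mistakes == 0:  # every model fits the training data
            break
    return models

def classify(models, text):
    """Return every company whose binary model says yes; may be empty."""
    words = tokens(text)
    return sorted(label for label, model in models.items() if model.margin(words) > 0)

# Toy training data, invented for this sketch.
EXAMPLES = [
    ("Apple to launch new iPhone", {"Apple"}),
    ("Apple reports record iPhone sales", {"Apple"}),
    ("Google updates search ranking", {"Google"}),
    ("Google releases Android update", {"Google"}),
    ("Samsung unveils Galaxy phone", {"Samsung"}),
    ("Samsung expands chip factory", {"Samsung"}),
    ("Samsung and Google partner on Android phone", {"Samsung", "Google"}),
    ("Bank of America posts quarterly profit", set()),
]

models = train_one_per_label(EXAMPLES)
print(classify(models, "Samsung and Google prep attack against Apple"))
```

Because each model decides independently, an item can receive zero, one, or many labels without any combination classes. What the sketch prints for an unseen headline depends entirely on the training data, and a real system would need a tuned decision threshold per company.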
