Posted to user@mahout.apache.org by Shrikar archak <sh...@gmail.com> on 2011/07/26 23:17:46 UTC

Classification on Techcrunch

Hi All,
I am new to machine learning and wanted to know more about Mahout in
general and how we can apply these algorithms to our applications.

I wanted to try out this example:

Techcrunch has a company database with information about what each
company does.
I was wondering whether we could use Mahout's classification algorithms to
take these info pages and classify the companies into different categories.

One more idea would be to look at their job descriptions, find out what
technologies they are using, and classify them on that basis.

What steps would be required to get this done?
I tried out the Twenty Newsgroups example
<https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups>,
in which we need to train a classifier. I assume we need to
do something similar for the problem described above.
Please let me know.

Thanks,
Shrikar

Re: Classification on Techcrunch

Posted by Ted Dunning <te...@gmail.com>.

Yep.

That sounds like a fine approach.

You should try several algorithms, but the basic text classification
approach should work reasonably well, especially if you include phrases and
are aggressive about getting rid of garbage text.
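
To make "include phrases" and "get rid of garbage text" concrete, here is a
minimal sketch in plain Python (not Mahout code). The stop list, the cleanup
rules, and the function names are all assumptions chosen for illustration; a
real pipeline would tune each of them.

```python
import re

# Hypothetical stop list for illustration; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "is", "that"}

def clean(text):
    """Strip markup remnants and URLs, then lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    return text.lower()

def features(text):
    """Return single words plus adjacent-word pairs ("phrases")."""
    words = [w for w in re.findall(r"[a-z]+", clean(text))
             if w not in STOPWORDS]
    # Pairs are built after stop-word removal, so a pair can span a dropped word.
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return words + bigrams

print(features("<p>Cloud storage for mobile developers</p> http://example.com"))
# ['cloud', 'storage', 'mobile', 'developers',
#  'cloud_storage', 'storage_mobile', 'mobile_developers']
```

The word pairs give the classifier terms like cloud_storage that carry more
meaning than either word alone.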


Re: Classification on Techcrunch

Posted by Ted Dunning <te...@gmail.com>.

This is a standard practice in machine learning.  You need to test and train
on different datasets to avoid having the learning algorithm learn about
irrelevant details and tell you that it will be perfect.

As an extreme example, a machine learning program working against email
data might learn exactly which email IDs were in which category.  This
would be a useless system because it would never see these emails again.  It
would appear to be perfectly accurate on the training data, but would fail
miserably on new data.

Separating training and test data is critical to avoid this kind of problem.
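
The extreme case above can be sketched in a few lines of plain Python (again,
not Mahout code). The message IDs, labels, and the 80/20 split are made up
for illustration; the "classifier" deliberately does nothing except remember
which ID had which label.

```python
import random

# Toy labeled data: (message_id, category). IDs and labels are invented.
data = [(f"msg-{i}", "spam" if i % 2 else "ham") for i in range(100)]

random.seed(0)                        # fixed seed so the split is repeatable
random.shuffle(data)
train, test = data[:80], data[80:]    # hold out 20% that training never sees

# A "classifier" that only memorizes which ID had which label.
memory = dict(train)

def predict(msg_id):
    # Unseen IDs get no prediction (None), since nothing general was learned.
    return memory.get(msg_id)

def accuracy(rows):
    return sum(predict(i) == label for i, label in rows) / len(rows)

train_acc = accuracy(train)
test_acc = accuracy(test)
print(train_acc, test_acc)   # 1.0 0.0
```

Scoring only on the training rows would report a perfect system. The held-out
rows show that it learned nothing that carries over to new messages.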

The third part of the Mahout in Action book talks a lot about issues like
this.

On Thu, Jul 28, 2011 at 9:48 AM, Shrikar archak <sh...@gmail.com> wrote:

> I didn't understand one thing here. Why do we have two datasets, one to
> train on and the actual dataset?
> What is the difference between those datasets? And for any application I am
> trying to build, how should I generate the dataset
> for training the classifier?
>

Re: Classification on Techcrunch

Posted by Shrikar archak <sh...@gmail.com>.

Thanks Ted and Sean. I will try that out.
Also, what should the training dataset be for Crunchbase? Is there
any document
that would help me understand what properties these datasets need, or any
prerequisites, such as the datasets
having to be in some particular format?

For example, in 20 newsgroups there are four steps:
1) Generate the input dataset (I assume it's taking raw data and
converting it into a format that the classifier understands).
2) Generate the test dataset (I am not sure what this is and how it is
different from the input dataset).
3) Take the generated input dataset and train the classifier.
4) Take the test data and run the classifier.

I didn't understand one thing here. Why do we have two datasets, one to
train on and the actual dataset?
What is the difference between those datasets? And for any application I am
trying to build, how should I generate the dataset
for training the classifier?

Shrikar

On Wed, Jul 27, 2011 at 12:20 AM, Sean Owen <sr...@gmail.com> wrote:

> This is Crunchbase?
> If your goal is to classify on what the company *does*, then I think
> you are best ignoring most data (funding, employees, etc.) and cluster
> their descriptions and/or text of articles about them as if they are
> documents. In this sense it is similar to 20 newsgroups, yes. You'd
> have to extract the text from Crunchbase first and with those as text
> docs, the process is the same.

Re: Classification on Techcrunch

Posted by Sean Owen <sr...@gmail.com>.

This is Crunchbase?
If your goal is to classify on what the company *does*, then I think
you are best ignoring most data (funding, employees, etc.) and cluster
their descriptions and/or text of articles about them as if they are
documents. In this sense it is similar to 20 newsgroups, yes. You'd
have to extract the text from Crunchbase first and with those as text
docs, the process is the same.
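
As a rough sketch of that extraction step in plain Python: the records and
field names below are hypothetical and do not reflect Crunchbase's real
schema, and the "category" label is assumed to come from somewhere (for
example, hand-labeling a sample). The output is one directory per category
with one text file per company, which is the general shape of the 20
newsgroups data.

```python
import tempfile
from pathlib import Path

# Hypothetical records; these field names are assumptions for illustration.
companies = [
    {"name": "Acme Cloud", "category": "enterprise", "funding": 5000000,
     "employees": 40, "description": "Hosted backup for small businesses."},
    {"name": "PixelPlay", "category": "games", "funding": 1200000,
     "employees": 12, "description": "Casual multiplayer games for phones."},
]

def write_corpus(records, root):
    """Write each description to <root>/<category>/<n>.txt, ignoring other fields."""
    root = Path(root)
    for n, rec in enumerate(records):
        label_dir = root / rec["category"]
        label_dir.mkdir(parents=True, exist_ok=True)
        (label_dir / f"{n}.txt").write_text(rec["description"], encoding="utf-8")
    # Return the relative paths written, for a quick sanity check.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.txt"))

paths = write_corpus(companies, tempfile.mkdtemp())
print(paths)   # ['enterprise/0.txt', 'games/1.txt']
```

Funding and employee counts are left out on purpose, so only the text that
says what the company does reaches the classifier.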
