Posted to user@mahout.apache.org by Martin Provencher <mp...@gmail.com> on 2011/03/31 17:46:36 UTC

Huge classification engine

Hi all,
    I need categorization/classification software able to categorize a
text or an HTML page into arbitrary categories. It needs to be scalable,
since I'll have to categorize a lot of elements per second. To do this, I
thought I could use the Wikipedia dump to get all the existing categories
in Wikipedia and the texts related to each category.

I've tried to use Mahout's Wikipedia example to do it. Here is what I've
done:
    1. Extract all the categories from the dump file
([[Category:#{category name}]]; a rough sketch of this step is below)
    2. Split the Wikipedia dump with wikipediaXMLSplitter
    3. Create the input data with wikipediaDataSetCreator, using the
categories found in step 1
    4. Train the algorithm with trainclassifier
 After that, I got a model of about 10 GB. I've tried to use Classify on
known data to check the result, but it takes far too long.
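
For illustration, a minimal sketch of what step 1 could look like; the
regex and class name are just placeholders, and a real job would stream
the bz2 dump rather than read a plain file:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CategoryExtractor {
      // Matches [[Category:Name]] and [[Category:Name|sort key]]
      private static final Pattern CATEGORY =
          Pattern.compile("\\[\\[Category:([^\\]|]+)");

      public static void main(String[] args) throws Exception {
        Set<String> categories = new HashSet<String>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
          Matcher m = CATEGORY.matcher(line);
          while (m.find()) {
            categories.add(m.group(1).trim());
          }
        }
        in.close();
        for (String category : categories) {
          System.out.println(category);
        }
      }
    }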

 Here are my questions:
    Is it possible to do something like that with Mahout? Do you know of
any sources explaining how to do it?
    Has anybody already tried to implement this idea?
    Do you know of any open source software able to do the categorization
that I need?

Regards,

Martin

Re: Huge classification engine

Posted by Ted Dunning <te...@gmail.com>.
Sorry I didn't recognize the project.  I just had an off-line talk with
Thomas.

This is going to take some new effort.  The main problem you will face is
the very large vocabulary of target categories; you will probably have to
use multi-level clustering on your models in order to get decent final
classifier performance.
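
As a rough illustration of the multi-level idea (a sketch only; the
grouping of categories into groups, the prior and the feature count are
all assumptions, and Mahout's SGD classes are just one possible choice):

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    public class TwoLevelClassifier {
      private final OnlineLogisticRegression topLevel;      // coarse groups of categories
      private final OnlineLogisticRegression[] secondLevel; // one model per group

      public TwoLevelClassifier(int numGroups, int[] categoriesPerGroup, int numFeatures) {
        topLevel = new OnlineLogisticRegression(numGroups, numFeatures, new L1());
        secondLevel = new OnlineLogisticRegression[numGroups];
        for (int i = 0; i < numGroups; i++) {
          secondLevel[i] =
              new OnlineLogisticRegression(categoriesPerGroup[i], numFeatures, new L1());
        }
      }

      /** Returns {groupId, categoryIdWithinGroup} for one document vector. */
      public int[] classify(Vector doc) {
        int group = topLevel.classifyFull(doc).maxValueIndex();
        int category = secondLevel[group].classifyFull(doc).maxValueIndex();
        return new int[] {group, category};
      }
    }

Training works the same way, level by level, using whatever grouping of
categories (for example, something derived from the Wikipedia category
hierarchy) defines the top level.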

You are also going to have a devil of a time because wikipedia is
dramatically different from web-pages at large.

On Thu, Mar 31, 2011 at 11:31 AM, Martin Provencher <mprovencher86@gmail.com
> wrote:

> The goal of the project is to receive random HTML link and be able to
> regroup them by category to be able to recommand them later on to the user.
>
> How many categories?  (why?)
> Since I don't know in which categories the users will be interested in,
> it's hard to say how many I need. That's why I try to choose all the
> category from the Wikipedia dump, but I think it was a lot too much
> categories. A lot of them don't really make sense. I've just started
> cleaning them.
>
> Is there some logical structure to your categories?
> It would be preferable, but I don't know how to structure them. So, I would
> say no for now.
>
> Where will you get your training data?  Will it really be reliable?
> I use the wikipedia dump :
> http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. I think the content will be reliable.
>
> How fast do you need?
> For now, we need to be able to classify at least 50 pages per second, but
> we intend to increase that number a lot (hadoop will be useful for that).
>
> Which web pages?
> Random. Depends on the users.
>
> Regards,
>
> Martin
>
>
> On Thu, Mar 31, 2011 at 1:19 PM, Ted Dunning <te...@gmail.com> wrote:
>
>>
>>
>> On Thu, Mar 31, 2011 at 8:46 AM, Martin Provencher <
>> mprovencher86@gmail.com> wrote:
>>
>>> Hi all,
>>>    I need a categorization/classification software able to categorize a
>>> text or an html page on any categories. I need it to be scalable since
>>> I'll
>>> need to categorize a lot of elements per seconds. To do it, I though I
>>> could
>>> use the Wikipedia dump to get all the existing category in Wikipedia and
>>> texts related to each categories.
>>> ...
>>>
>>>  Here are my questions :
>>>    Is it possible to do something like that with Mahout? Do you know any
>>> sources explaining how to do it?
>>>    Does anybody already tried to implement this idea?
>>>    Do you know any OpenSource software able to do the categorization that
>>> I
>>> need?
>>>
>>
>> Yes.  I know of people using Mahout to categorize web sites in production
>> into thousands of categories.
>>
>> Speed is very high on a single machine and very scalable across multiple
>> machines.
>>
>> You can look at the Mahout in Action book third section for theory,
>> practice and code examples.
>>
>> There are probably still a few missing bits.  It would help if you could
>> say more about your intended
>> application.
>>
>> How many categories?  (why?)
>>
>> Is there some logical structure to your categories?
>>
>> Where will you get your training data?  Will it really be reliable?
>>
>> How fast do you need?
>>
>> Which web pages?
>>
>> Almost inevitably, after you answer these questions, there will be another
>> round of questions centered
>> around whether your requirements are really the right ones to drive the
>> benefit that you want.
>>
>>
>>
>

Re: Huge classification engine

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Thanks a bunch, Julien!

On Fri, Apr 1, 2011 at 12:49 PM, Julien Nioche
<li...@gmail.com> wrote:
> Dmitriy,
>
> Have a look at Behemoth (https://github.com/jnioche/behemoth). It can be
> used as a bridge between Nutch and Mahout. I've written a module for Mahout
> which generates Mahout vectors from a Behemoth sequence file. There is also
> a IO module which can convert Nutch segments into Behemoth sequence files.
> The combination of both should do the trick + you can use text analysis
> components such as UIMA or GATE as well to generate additional attributes
> beyond simple tokens.
>
> HTH
>
> Julien
>
> On 1 April 2011 19:52, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> Yes. That's the problem. How to pull mahout vectorizers on the Nutch
>> data. If anybody knows of step-by-step howto, please do suggest. I
>> looked into it briefly and did not find a good solution immediately,
>> hence we are using a crawler other than nutch with a little clearer
>> documented api for document archivng than it immediately available in
>> Nutch documentation.
>>
>>
>> On Fri, Apr 1, 2011 at 1:24 AM, Dan Brickley <da...@danbri.org> wrote:
>> > On 1 April 2011 10:00, vineet yadav <vi...@gmail.com> wrote:
>> >> Hi,
>> >> I suggest you to use Map-reduce with crawler architecture for crawling
>> >> local file system. Since parsing HTML pages creates more overhead
>> >> delays.
>> >
>> > Apache Nutch being the obvious choice there - http://nutch.apache.org/
>> >
>> > I'd love to see some recipes documented that show Nutch and Mahout
>> > combined. For example scenario, crawling some site(s), classifying and
>> > having the results available in Lucene/Solr for search and other apps.
>> > http://wiki.apache.org/nutch/NutchHadoopTutorial looks like a good
>> > start for the Nutch side, but I'm unsure of the hooks / workflow for
>> > Mahout integration.
>> >
>> > Regarding training data for categorisation that targets Wikipedia
>> > categories, you can always pull in the textual content of *external*
>> > links referenced from Wikipedia. For this kind of app you can probably
>> > use the extractions from the DBpedia project, see the various download
>> > files at http://wiki.dbpedia.org/Downloads36 (you'll want at least the
>> > 'external links' file, perhaps 'homepages' and others too). Also the
>> > category information is extracted there, see: "article categories",
>> > "category labels", and "categories (skos)" downloads. The latter gives
>> > some hierarchy, which might be useful for filtering out noise like
>> > admin categories or those that are absurdly detailed or general.
>> >
>> > Another source of indicative text is to cross-reference these
>> > categories to DMoz (http://rdf.dmoz.org/) via common URLs. I started
>> > an investigation of that using Pig, which I should either finish or
>> > writeup. But Wikipedia's 'external links' plus using the category
>> > hierarchy info should be a good place to start, I'd guess.
>> >
>> > cheers,
>> >
>> > Dan
>> >
>>
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>

Re: Huge classification engine

Posted by Ted Dunning <te...@gmail.com>.
Note that you probably need to introduce an "OTHER" token so that you can
fix the vocabulary size.

Alternatively, hashed representations will let you keep an open vocabulary
but still have a fixed feature vector size.
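
A minimal sketch of the hashed approach using Mahout's feature encoders
(the feature count and probe count here are arbitrary; the same encoder
settings have to be used at training and classification time):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class HashedEncoding {
      private static final int FEATURES = 100000;  // fixed vector size, open vocabulary

      public static Vector encode(Iterable<String> tokens) {
        StaticWordValueEncoder encoder = new StaticWordValueEncoder("body");
        encoder.setProbes(2);  // hash each token to two positions to soften collisions
        Vector v = new RandomAccessSparseVector(FEATURES);
        for (String token : tokens) {
          encoder.addToVector(token, v);
        }
        return v;
      }
    }

With this scheme the vocabulary never has to be enumerated; collisions
just add a little noise.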

On Fri, Apr 1, 2011 at 12:49 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Dmitriy,
>
> Have a look at Behemoth (https://github.com/jnioche/behemoth). It can be
> used as a bridge between Nutch and Mahout. I've written a module for Mahout
> which generates Mahout vectors from a Behemoth sequence file. There is also
> a IO module which can convert Nutch segments into Behemoth sequence files.
> The combination of both should do the trick + you can use text analysis
> components such as UIMA or GATE as well to generate additional attributes
> beyond simple tokens.
>
> HTH
>
> Julien
>
> On 1 April 2011 19:52, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> > Yes. That's the problem. How to pull mahout vectorizers on the Nutch
> > data. If anybody knows of step-by-step howto, please do suggest. I
> > looked into it briefly and did not find a good solution immediately,
> > hence we are using a crawler other than nutch with a little clearer
> > documented api for document archivng than it immediately available in
> > Nutch documentation.
> >
> >
> > On Fri, Apr 1, 2011 at 1:24 AM, Dan Brickley <da...@danbri.org> wrote:
> > > On 1 April 2011 10:00, vineet yadav <vi...@gmail.com>
> wrote:
> > >> Hi,
> > >> I suggest you to use Map-reduce with crawler architecture for crawling
> > >> local file system. Since parsing HTML pages creates more overhead
> > >> delays.
> > >
> > > Apache Nutch being the obvious choice there - http://nutch.apache.org/
> > >
> > > I'd love to see some recipes documented that show Nutch and Mahout
> > > combined. For example scenario, crawling some site(s), classifying and
> > > having the results available in Lucene/Solr for search and other apps.
> > > http://wiki.apache.org/nutch/NutchHadoopTutorial looks like a good
> > > start for the Nutch side, but I'm unsure of the hooks / workflow for
> > > Mahout integration.
> > >
> > > Regarding training data for categorisation that targets Wikipedia
> > > categories, you can always pull in the textual content of *external*
> > > links referenced from Wikipedia. For this kind of app you can probably
> > > use the extractions from the DBpedia project, see the various download
> > > files at http://wiki.dbpedia.org/Downloads36 (you'll want at least the
> > > 'external links' file, perhaps 'homepages' and others too). Also the
> > > category information is extracted there, see: "article categories",
> > > "category labels", and "categories (skos)" downloads. The latter gives
> > > some hierarchy, which might be useful for filtering out noise like
> > > admin categories or those that are absurdly detailed or general.
> > >
> > > Another source of indicative text is to cross-reference these
> > > categories to DMoz (http://rdf.dmoz.org/) via common URLs. I started
> > > an investigation of that using Pig, which I should either finish or
> > > writeup. But Wikipedia's 'external links' plus using the category
> > > hierarchy info should be a good place to start, I'd guess.
> > >
> > > cheers,
> > >
> > > Dan
> > >
> >
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>

Re: Huge classification engine

Posted by Julien Nioche <li...@gmail.com>.
Dmitriy,

Have a look at Behemoth (https://github.com/jnioche/behemoth). It can be
used as a bridge between Nutch and Mahout. I've written a module for Mahout
which generates Mahout vectors from a Behemoth sequence file. There is also
an IO module which can convert Nutch segments into Behemoth sequence files.
The combination of both should do the trick + you can use text analysis
components such as UIMA or GATE as well to generate additional attributes
beyond simple tokens.

HTH

Julien

On 1 April 2011 19:52, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Yes. That's the problem. How to pull mahout vectorizers on the Nutch
> data. If anybody knows of step-by-step howto, please do suggest. I
> looked into it briefly and did not find a good solution immediately,
> hence we are using a crawler other than nutch with a little clearer
> documented api for document archivng than it immediately available in
> Nutch documentation.
>
>
> On Fri, Apr 1, 2011 at 1:24 AM, Dan Brickley <da...@danbri.org> wrote:
> > On 1 April 2011 10:00, vineet yadav <vi...@gmail.com> wrote:
> >> Hi,
> >> I suggest you to use Map-reduce with crawler architecture for crawling
> >> local file system. Since parsing HTML pages creates more overhead
> >> delays.
> >
> > Apache Nutch being the obvious choice there - http://nutch.apache.org/
> >
> > I'd love to see some recipes documented that show Nutch and Mahout
> > combined. For example scenario, crawling some site(s), classifying and
> > having the results available in Lucene/Solr for search and other apps.
> > http://wiki.apache.org/nutch/NutchHadoopTutorial looks like a good
> > start for the Nutch side, but I'm unsure of the hooks / workflow for
> > Mahout integration.
> >
> > Regarding training data for categorisation that targets Wikipedia
> > categories, you can always pull in the textual content of *external*
> > links referenced from Wikipedia. For this kind of app you can probably
> > use the extractions from the DBpedia project, see the various download
> > files at http://wiki.dbpedia.org/Downloads36 (you'll want at least the
> > 'external links' file, perhaps 'homepages' and others too). Also the
> > category information is extracted there, see: "article categories",
> > "category labels", and "categories (skos)" downloads. The latter gives
> > some hierarchy, which might be useful for filtering out noise like
> > admin categories or those that are absurdly detailed or general.
> >
> > Another source of indicative text is to cross-reference these
> > categories to DMoz (http://rdf.dmoz.org/) via common URLs. I started
> > an investigation of that using Pig, which I should either finish or
> > writeup. But Wikipedia's 'external links' plus using the category
> > hierarchy info should be a good place to start, I'd guess.
> >
> > cheers,
> >
> > Dan
> >
>



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Huge classification engine

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Yes. That's the problem: how to run the Mahout vectorizers on the Nutch
data. If anybody knows of a step-by-step how-to, please do suggest it. I
looked into it briefly and did not find a good solution immediately,
hence we are using a crawler other than Nutch, one with a more clearly
documented API for document archiving than what is immediately available
in the Nutch documentation.


On Fri, Apr 1, 2011 at 1:24 AM, Dan Brickley <da...@danbri.org> wrote:
> On 1 April 2011 10:00, vineet yadav <vi...@gmail.com> wrote:
>> Hi,
>> I suggest you to use Map-reduce with crawler architecture for crawling
>> local file system. Since parsing HTML pages creates more overhead
>> delays.
>
> Apache Nutch being the obvious choice there - http://nutch.apache.org/
>
> I'd love to see some recipes documented that show Nutch and Mahout
> combined. For example scenario, crawling some site(s), classifying and
> having the results available in Lucene/Solr for search and other apps.
> http://wiki.apache.org/nutch/NutchHadoopTutorial looks like a good
> start for the Nutch side, but I'm unsure of the hooks / workflow for
> Mahout integration.
>
> Regarding training data for categorisation that targets Wikipedia
> categories, you can always pull in the textual content of *external*
> links referenced from Wikipedia. For this kind of app you can probably
> use the extractions from the DBpedia project, see the various download
> files at http://wiki.dbpedia.org/Downloads36 (you'll want at least the
> 'external links' file, perhaps 'homepages' and others too). Also the
> category information is extracted there, see: "article categories",
> "category labels", and "categories (skos)" downloads. The latter gives
> some hierarchy, which might be useful for filtering out noise like
> admin categories or those that are absurdly detailed or general.
>
> Another source of indicative text is to cross-reference these
> categories to DMoz (http://rdf.dmoz.org/) via common URLs. I started
> an investigation of that using Pig, which I should either finish or
> writeup. But Wikipedia's 'external links' plus using the category
> hierarchy info should be a good place to start, I'd guess.
>
> cheers,
>
> Dan
>

Re: Huge classification engine

Posted by Ted Dunning <te...@gmail.com>.
To clarify, I expect that the problem here is that the training data and the
production data are just too different for the model to be able to
generalize.

On the other hand, a poorly performing model can be an excellent way to make
your hand tagging efforts more efficient, especially if you put the cut-offs
much lower than you would normally countenance.  The idea is that the
wikipedia model can give you training examples that have a good chance of
being relevant (compared to the web at large).  If you hand tag these as
positive and negative, then when you train a model on the hand tagged data,
it should perform much better than the wikipedia model alone.  In
production, you can run the secondary model bare (if that works well) or as
a post-processor on your original wikipedia model.  Bare is what I generally
recommend.
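
For concreteness, a sketch of that candidate-selection step (the cut-off
value and the use of Mahout's SGD classifier here are assumptions; any
classifier that produces per-category scores would do):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    public class CandidateSelector {
      /**
       * Uses the (weak) wikipedia-trained model with a deliberately low cut-off
       * to pre-select documents worth hand tagging for the secondary model.
       */
      public static List<Integer> selectForTagging(OnlineLogisticRegression wikipediaModel,
                                                   List<Vector> docs,
                                                   double lowCutoff) {
        List<Integer> candidates = new ArrayList<Integer>();
        for (int i = 0; i < docs.size(); i++) {
          Vector scores = wikipediaModel.classifyFull(docs.get(i));
          if (scores.maxValue() > lowCutoff) {   // e.g. 0.2 rather than the usual 0.5+
            candidates.add(i);
          }
        }
        return candidates;
      }
    }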

On Fri, Apr 1, 2011 at 9:34 AM, Xiaomeng Wan <sh...@gmail.com> wrote:

> As Ted predicted, the performance of the Wikipedia model is poor, and even
> the DMoz model was not as good as expected; the manually tagged
> pages achieved the best results.
>

Re: Huge classification engine

Posted by Xiaomeng Wan <sh...@gmail.com>.
Hi,
I did something similar before with Pig. I built three models, trained
on Wikipedia data, DMoz data and a set of manually tagged web pages
respectively, and wrote a Pig UDF which calls the Mahout classifier.
As Ted predicted, the performance of the Wikipedia model is poor, and
even the DMoz model was not as good as expected; the manually tagged
pages achieved the best results.
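
For reference, a stripped-down sketch of what such a UDF can look like
(model loading is left out; in practice the serialized model would be read
once, e.g. from the distributed cache, and the tokenization here is
deliberately naive):

    import java.io.IOException;

    import org.apache.mahout.classifier.AbstractVectorClassifier;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class ClassifyText extends EvalFunc<Integer> {

      private static final int FEATURES = 100000;
      private AbstractVectorClassifier model;   // deserialized elsewhere; omitted here

      @Override
      public Integer exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
          return null;
        }
        String text = (String) input.get(0);

        // hash the tokens into a fixed-size feature vector
        StaticWordValueEncoder encoder = new StaticWordValueEncoder("body");
        Vector v = new RandomAccessSparseVector(FEATURES);
        for (String token : text.toLowerCase().split("\\W+")) {
          if (token.length() > 0) {
            encoder.addToVector(token, v);
          }
        }
        // index of the most likely category
        return model.classifyFull(v).maxValueIndex();
      }
    }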

Regards,
Shawn

On Fri, Apr 1, 2011 at 9:57 AM, Martin Provencher
<mp...@gmail.com> wrote:
> Dan, I think what you propose make a lot of sense. I won't try to use Nutch
> for now since we already have our crawler. But in the future, I think it
> will be a great thing to add to our solution.
>
> Here what I think, I'll try :
> 1. Create a job to parse the DBPedia dumps (the 3 on categories) and extract
> all the valuable categories.
> 2. Use these categories to parse the Wikipedia dump to extract keywords for
> those categories.
> 3. Train an algorithm (I don't know between Bayesian or SGD)
> 4. Test it with some text extracted from HTML pages to verify
>
> After this try, I could try to add the external link of the Wikipedia dump
> in step 2 to get more keywords per categories. What do you think of this
> plan?
>
> Dan, for the dmoz categories, I'm not sure how to plug it in since I can't
> see an example of their dump (their link is down). I'll check that when I'll
> download the full dump.
>
> Thanks all for your useful answers
>
> Regards,
>
> Martin
>
> On Fri, Apr 1, 2011 at 11:19 AM, Ted Dunning <te...@gmail.com> wrote:
>
>> Bixo is another option.  http://bixolabs.com/
>>
>>
>> On Fri, Apr 1, 2011 at 1:24 AM, Dan Brickley <da...@danbri.org> wrote:
>>
>>> On 1 April 2011 10:00, vineet yadav <vi...@gmail.com> wrote:
>>> > Hi,
>>> > I suggest you to use Map-reduce with crawler architecture for crawling
>>> > local file system. Since parsing HTML pages creates more overhead
>>> > delays.
>>>
>>> Apache Nutch being the obvious choice there - http://nutch.apache.org/
>>>
>>> I'd love to see some recipes documented that show Nutch and Mahout
>>> combined. For example scenario, crawling some site(s), classifying and
>>> having the results available in Lucene/Solr for search and other apps.
>>> http://wiki.apache.org/nutch/NutchHadoopTutorial looks like a good
>>> start for the Nutch side, but I'm unsure of the hooks / workflow for
>>> Mahout integration.
>>>
>>> Regarding training data for categorisation that targets Wikipedia
>>> categories, you can always pull in the textual content of *external*
>>> links referenced from Wikipedia. For this kind of app you can probably
>>> use the extractions from the DBpedia project, see the various download
>>> files at http://wiki.dbpedia.org/Downloads36 (you'll want at least the
>>> 'external links' file, perhaps 'homepages' and others too). Also the
>>> category information is extracted there, see: "article categories",
>>> "category labels", and "categories (skos)" downloads. The latter gives
>>> some hierarchy, which might be useful for filtering out noise like
>>> admin categories or those that are absurdly detailed or general.
>>>
>>> Another source of indicative text is to cross-reference these
>>> categories to DMoz (http://rdf.dmoz.org/) via common URLs. I started
>>> an investigation of that using Pig, which I should either finish or
>>> writeup. But Wikipedia's 'external links' plus using the category
>>> hierarchy info should be a good place to start, I'd guess.
>>>
>>> cheers,
>>>
>>> Dan
>>>
>>
>>
>

Re: Huge classification engine

Posted by vineet yadav <vi...@gmail.com>.
Hi Martin,
If you are parsing DBpedia for identification  of categories and named
entities then I suggest you oliver blog article
(http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html).
Also check out linkedIn discussion on blog
http://www.linkedin.com/groups/Mining-Wikipedia-Hadoop-Pig-Natural-115439.S.39911336?goback=.gna_115439
Cheers
Vineet Yadav

On Fri, Apr 1, 2011 at 9:27 PM, Martin Provencher
<mp...@gmail.com> wrote:
> Dan, I think what you propose make a lot of sense. I won't try to use Nutch
> for now since we already have our crawler. But in the future, I think it
> will be a great thing to add to our solution.
>
> Here what I think, I'll try :
> 1. Create a job to parse the DBPedia dumps (the 3 on categories) and extract
> all the valuable categories.
> 2. Use these categories to parse the Wikipedia dump to extract keywords for
> those categories.
> 3. Train an algorithm (I don't know between Bayesian or SGD)
> 4. Test it with some text extracted from HTML pages to verify
>
> After this try, I could try to add the external link of the Wikipedia dump
> in step 2 to get more keywords per categories. What do you think of this
> plan?
>
> Dan, for the dmoz categories, I'm not sure how to plug it in since I can't
> see an example of their dump (their link is down). I'll check that when I'll
> download the full dump.
>
> Thanks all for your useful answers
>
> Regards,
>
> Martin
>
> On Fri, Apr 1, 2011 at 11:19 AM, Ted Dunning <te...@gmail.com> wrote:
>>
>> Bixo is another option.  http://bixolabs.com/
>>
>> On Fri, Apr 1, 2011 at 1:24 AM, Dan Brickley <da...@danbri.org> wrote:
>>>
>>> On 1 April 2011 10:00, vineet yadav <vi...@gmail.com> wrote:
>>> > Hi,
>>> > I suggest you to use Map-reduce with crawler architecture for crawling
>>> > local file system. Since parsing HTML pages creates more overhead
>>> > delays.
>>>
>>> Apache Nutch being the obvious choice there - http://nutch.apache.org/
>>>
>>> I'd love to see some recipes documented that show Nutch and Mahout
>>> combined. For example scenario, crawling some site(s), classifying and
>>> having the results available in Lucene/Solr for search and other apps.
>>> http://wiki.apache.org/nutch/NutchHadoopTutorial looks like a good
>>> start for the Nutch side, but I'm unsure of the hooks / workflow for
>>> Mahout integration.
>>>
>>> Regarding training data for categorisation that targets Wikipedia
>>> categories, you can always pull in the textual content of *external*
>>> links referenced from Wikipedia. For this kind of app you can probably
>>> use the extractions from the DBpedia project, see the various download
>>> files at http://wiki.dbpedia.org/Downloads36 (you'll want at least the
>>> 'external links' file, perhaps 'homepages' and others too). Also the
>>> category information is extracted there, see: "article categories",
>>> "category labels", and "categories (skos)" downloads. The latter gives
>>> some hierarchy, which might be useful for filtering out noise like
>>> admin categories or those that are absurdly detailed or general.
>>>
>>> Another source of indicative text is to cross-reference these
>>> categories to DMoz (http://rdf.dmoz.org/) via common URLs. I started
>>> an investigation of that using Pig, which I should either finish or
>>> writeup. But Wikipedia's 'external links' plus using the category
>>> hierarchy info should be a good place to start, I'd guess.
>>>
>>> cheers,
>>>
>>> Dan
>>
>
>

Re: Huge classification engine

Posted by Martin Provencher <mp...@gmail.com>.
Dan, I think what you propose makes a lot of sense. I won't try to use Nutch
for now since we already have our crawler, but in the future I think it will
be a great addition to our solution.

Here is what I think I'll try:
1. Create a job to parse the DBpedia dumps (the 3 on categories) and extract
all the valuable categories.
2. Use these categories to parse the Wikipedia dump and extract keywords for
those categories.
3. Train an algorithm (I haven't decided between Bayesian and SGD; a rough
SGD sketch is below)
4. Test it with some text extracted from HTML pages to verify the results

After this first attempt, I could add the external links from the Wikipedia
dump in step 2 to get more keywords per category. What do you think of this
plan?
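
For the SGD option, here is roughly what the training loop could look like
with Mahout's OnlineLogisticRegression (the hyperparameters are
placeholders, and numCategories/numFeatures have to match whatever encoding
is used at classification time):

    import java.util.List;

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    public class TrainSgd {
      /** Trains one SGD model over parallel lists of feature vectors and category ids. */
      public static OnlineLogisticRegression train(List<Vector> docs, List<Integer> labels,
                                                   int numCategories, int numFeatures,
                                                   int passes) {
        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(numCategories, numFeatures, new L1())
                .alpha(1)
                .lambda(1.0e-5)
                .learningRate(50);
        for (int pass = 0; pass < passes; pass++) {
          for (int i = 0; i < docs.size(); i++) {
            learner.train(labels.get(i), docs.get(i));
          }
        }
        learner.close();
        return learner;
      }
    }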

Dan, for the DMoz categories, I'm not sure how to plug them in since I can't
see an example of their dump (their link is down). I'll check that when I
download the full dump.

Thanks all for your useful answers

Regards,

Martin

On Fri, Apr 1, 2011 at 11:19 AM, Ted Dunning <te...@gmail.com> wrote:

> Bixo is another option.  http://bixolabs.com/
>
>
> On Fri, Apr 1, 2011 at 1:24 AM, Dan Brickley <da...@danbri.org> wrote:
>
>> On 1 April 2011 10:00, vineet yadav <vi...@gmail.com> wrote:
>> > Hi,
>> > I suggest you to use Map-reduce with crawler architecture for crawling
>> > local file system. Since parsing HTML pages creates more overhead
>> > delays.
>>
>> Apache Nutch being the obvious choice there - http://nutch.apache.org/
>>
>> I'd love to see some recipes documented that show Nutch and Mahout
>> combined. For example scenario, crawling some site(s), classifying and
>> having the results available in Lucene/Solr for search and other apps.
>> http://wiki.apache.org/nutch/NutchHadoopTutorial looks like a good
>> start for the Nutch side, but I'm unsure of the hooks / workflow for
>> Mahout integration.
>>
>> Regarding training data for categorisation that targets Wikipedia
>> categories, you can always pull in the textual content of *external*
>> links referenced from Wikipedia. For this kind of app you can probably
>> use the extractions from the DBpedia project, see the various download
>> files at http://wiki.dbpedia.org/Downloads36 (you'll want at least the
>> 'external links' file, perhaps 'homepages' and others too). Also the
>> category information is extracted there, see: "article categories",
>> "category labels", and "categories (skos)" downloads. The latter gives
>> some hierarchy, which might be useful for filtering out noise like
>> admin categories or those that are absurdly detailed or general.
>>
>> Another source of indicative text is to cross-reference these
>> categories to DMoz (http://rdf.dmoz.org/) via common URLs. I started
>> an investigation of that using Pig, which I should either finish or
>> writeup. But Wikipedia's 'external links' plus using the category
>> hierarchy info should be a good place to start, I'd guess.
>>
>> cheers,
>>
>> Dan
>>
>
>

Re: Huge classification engine

Posted by Ted Dunning <te...@gmail.com>.
Bixo is another option.  http://bixolabs.com/

On Fri, Apr 1, 2011 at 1:24 AM, Dan Brickley <da...@danbri.org> wrote:

> On 1 April 2011 10:00, vineet yadav <vi...@gmail.com> wrote:
> > Hi,
> > I suggest you to use Map-reduce with crawler architecture for crawling
> > local file system. Since parsing HTML pages creates more overhead
> > delays.
>
> Apache Nutch being the obvious choice there - http://nutch.apache.org/
>
> I'd love to see some recipes documented that show Nutch and Mahout
> combined. For example scenario, crawling some site(s), classifying and
> having the results available in Lucene/Solr for search and other apps.
> http://wiki.apache.org/nutch/NutchHadoopTutorial looks like a good
> start for the Nutch side, but I'm unsure of the hooks / workflow for
> Mahout integration.
>
> Regarding training data for categorisation that targets Wikipedia
> categories, you can always pull in the textual content of *external*
> links referenced from Wikipedia. For this kind of app you can probably
> use the extractions from the DBpedia project, see the various download
> files at http://wiki.dbpedia.org/Downloads36 (you'll want at least the
> 'external links' file, perhaps 'homepages' and others too). Also the
> category information is extracted there, see: "article categories",
> "category labels", and "categories (skos)" downloads. The latter gives
> some hierarchy, which might be useful for filtering out noise like
> admin categories or those that are absurdly detailed or general.
>
> Another source of indicative text is to cross-reference these
> categories to DMoz (http://rdf.dmoz.org/) via common URLs. I started
> an investigation of that using Pig, which I should either finish or
> writeup. But Wikipedia's 'external links' plus using the category
> hierarchy info should be a good place to start, I'd guess.
>
> cheers,
>
> Dan
>

Re: Huge classification engine

Posted by Dan Brickley <da...@danbri.org>.
On 1 April 2011 10:00, vineet yadav <vi...@gmail.com> wrote:
> Hi,
> I suggest you to use Map-reduce with crawler architecture for crawling
> local file system. Since parsing HTML pages creates more overhead
> delays.

Apache Nutch being the obvious choice there - http://nutch.apache.org/

I'd love to see some recipes documented that show Nutch and Mahout
combined. For example scenario, crawling some site(s), classifying and
having the results available in Lucene/Solr for search and other apps.
http://wiki.apache.org/nutch/NutchHadoopTutorial looks like a good
start for the Nutch side, but I'm unsure of the hooks / workflow for
Mahout integration.

Regarding training data for categorisation that targets Wikipedia
categories, you can always pull in the textual content of *external*
links referenced from Wikipedia. For this kind of app you can probably
use the extractions from the DBpedia project, see the various download
files at http://wiki.dbpedia.org/Downloads36 (you'll want at least the
'external links' file, perhaps 'homepages' and others too). Also the
category information is extracted there, see: "article categories",
"category labels", and "categories (skos)" downloads. The latter gives
some hierarchy, which might be useful for filtering out noise like
admin categories or those that are absurdly detailed or general.
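
A quick sketch of how one of those dump files could be turned into
article/category pairs; the DBpedia files are N-Triples, but the exact
predicates and URI forms assumed below should be checked against the
actual download:

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class DbpediaArticleCategories {
      // Reads an N-Triples file such as DBpedia's "article categories" dump
      // and prints "article<TAB>category" pairs.
      public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
          if (line.startsWith("#")) {
            continue;                   // comment lines
          }
          String[] parts = line.split("\\s+");
          if (parts.length < 3) {
            continue;
          }
          String article = parts[0];
          String category = parts[2];
          if (category.contains("Category:")) {
            System.out.println(strip(article) + "\t" + strip(category));
          }
        }
        in.close();
      }

      private static String strip(String uri) {
        // <http://dbpedia.org/resource/X>  ->  X
        int slash = uri.lastIndexOf('/');
        return uri.substring(slash + 1).replace(">", "");
      }
    }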

Another source of indicative text is to cross-reference these
categories to DMoz (http://rdf.dmoz.org/) via common URLs. I started
an investigation of that using Pig, which I should either finish or
writeup. But Wikipedia's 'external links' plus using the category
hierarchy info should be a good place to start, I'd guess.

cheers,

Dan

Re: Huge classification engine

Posted by vineet yadav <vi...@gmail.com>.
Hi,
I suggest using MapReduce with a crawler architecture for crawling the
local file system, since parsing HTML pages adds significant overhead.
Thanks
Vineet Yadav
On Fri, Apr 1, 2011 at 1:07 PM, Sreejith S <sr...@gmail.com> wrote:
> Mahout can handle huge amount of data set.As a personal experience,
> yesterday i run mahout classification on 4,00,000 reviews.
> Amazingly, it took 10-15 mins only.
> I guess there is no problem for a huge data set.Since mahout is scalable.
>
>
> On Fri, Apr 1, 2011 at 2:26 AM, Ted Dunning <te...@gmail.com> wrote:
>
>> This will be the easiest part if you parallelize the parsing and
>> tokenization.  The classifier will be able to handle hundreds of pages per
>> second per machine.
>>
>> On Thu, Mar 31, 2011 at 11:31 AM, Martin Provencher <
>> mprovencher86@gmail.com
>> > wrote:
>>
>> > For now, we need to be able to classify at least 50 pages per second, but
>> > we intend to increase that number a lot (hadoop will be useful for that).
>> >
>>
>
>
>
> --
> *********************************
> Sreejith.S
>
> http://sreejiths.emurse.com/
> http://srijiths.wordpress.com/
> tweet2sree@twitter
>
> *********************************
> ILUGCBE
> http://ilugcbe.techstud.org
>

Re: Huge classification engine

Posted by Sreejith S <sr...@gmail.com>.
Mahout can handle huge data sets. As a personal experience, yesterday I
ran Mahout classification on 400,000 reviews. Amazingly, it took only
10-15 minutes. I guess there is no problem with a huge data set, since
Mahout is scalable.


On Fri, Apr 1, 2011 at 2:26 AM, Ted Dunning <te...@gmail.com> wrote:

> This will be the easiest part if you parallelize the parsing and
> tokenization.  The classifier will be able to handle hundreds of pages per
> second per machine.
>
> On Thu, Mar 31, 2011 at 11:31 AM, Martin Provencher <
> mprovencher86@gmail.com
> > wrote:
>
> > For now, we need to be able to classify at least 50 pages per second, but
> > we intend to increase that number a lot (hadoop will be useful for that).
> >
>



-- 
*********************************
Sreejith.S

http://sreejiths.emurse.com/
http://srijiths.wordpress.com/
tweet2sree@twitter

*********************************
ILUGCBE
http://ilugcbe.techstud.org

Re: Huge classification engine

Posted by Ted Dunning <te...@gmail.com>.
This will be the easiest part if you parallelize the parsing and
tokenization.  The classifier will be able to handle hundreds of pages per
second per machine.
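
A bare-bones sketch of that parallelization on a single machine (the
PageClassifier interface is a stand-in for whatever parser, encoder and
Mahout model are actually used):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelClassify {

      /** Stand-in for parse + tokenize + classify of one page; hypothetical. */
      public interface PageClassifier {
        int classify(String html);
      }

      public static List<Integer> classifyAll(final PageClassifier classifier,
                                              List<String> pages, int threads)
          throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
        for (final String page : pages) {
          futures.add(pool.submit(new Callable<Integer>() {
            public Integer call() {
              // parsing and tokenization dominate the cost, so this is
              // where the parallelism pays off
              return classifier.classify(page);
            }
          }));
        }
        List<Integer> results = new ArrayList<Integer>();
        for (Future<Integer> f : futures) {
          results.add(f.get());
        }
        pool.shutdown();
        return results;
      }
    }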

On Thu, Mar 31, 2011 at 11:31 AM, Martin Provencher <mprovencher86@gmail.com
> wrote:

> For now, we need to be able to classify at least 50 pages per second, but
> we intend to increase that number a lot (hadoop will be useful for that).
>

Re: Huge classification engine

Posted by Martin Provencher <mp...@gmail.com>.
The goal of the project is to receive random HTML links and be able to
group them by category, so that we can recommend them to the user later on.

How many categories?  (why?)
Since I don't know which categories the users will be interested in, it's
hard to say how many I need. That's why I tried to take all the categories
from the Wikipedia dump, but I think that was far too many categories. A lot
of them don't really make sense. I've just started cleaning them.

Is there some logical structure to your categories?
It would be preferable, but I don't know how to structure them. So I would
say no for now.

Where will you get your training data?  Will it really be reliable?
I use the Wikipedia dump:
http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.
I think the content will be reliable.

How fast do you need?
For now, we need to be able to classify at least 50 pages per second, but we
intend to increase that number a lot (Hadoop will be useful for that).

Which web pages?
Random. Depends on the users.

Regards,

Martin

On Thu, Mar 31, 2011 at 1:19 PM, Ted Dunning <te...@gmail.com> wrote:

>
>
> On Thu, Mar 31, 2011 at 8:46 AM, Martin Provencher <
> mprovencher86@gmail.com> wrote:
>
>> Hi all,
>>    I need a categorization/classification software able to categorize a
>> text or an html page on any categories. I need it to be scalable since
>> I'll
>> need to categorize a lot of elements per seconds. To do it, I though I
>> could
>> use the Wikipedia dump to get all the existing category in Wikipedia and
>> texts related to each categories.
>> ...
>>
>>  Here are my questions :
>>    Is it possible to do something like that with Mahout? Do you know any
>> sources explaining how to do it?
>>    Does anybody already tried to implement this idea?
>>    Do you know any OpenSource software able to do the categorization that
>> I
>> need?
>>
>
> Yes.  I know of people using Mahout to categorize web sites in production
> into thousands of categories.
>
> Speed is very high on a single machine and very scalable across multiple
> machines.
>
> You can look at the Mahout in Action book third section for theory,
> practice and code examples.
>
> There are probably still a few missing bits.  It would help if you could
> say more about your intended
> application.
>
> How many categories?  (why?)
>
> Is there some logical structure to your categories?
>
> Where will you get your training data?  Will it really be reliable?
>
> How fast do you need?
>
> Which web pages?
>
> Almost inevitably, after you answer these questions, there will be another
> round of questions centered
> around whether your requirements are really the right ones to drive the
> benefit that you want.
>
>
>

Re: Huge classification engine

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Mar 31, 2011 at 8:46 AM, Martin Provencher
<mp...@gmail.com> wrote:

> Hi all,
>    I need a categorization/classification software able to categorize a
> text or an html page on any categories. I need it to be scalable since I'll
> need to categorize a lot of elements per seconds. To do it, I though I
> could
> use the Wikipedia dump to get all the existing category in Wikipedia and
> texts related to each categories.
> ...
>  Here are my questions :
>    Is it possible to do something like that with Mahout? Do you know any
> sources explaining how to do it?
>    Does anybody already tried to implement this idea?
>    Do you know any OpenSource software able to do the categorization that I
> need?
>

Yes.  I know of people using Mahout to categorize web sites in production
into thousands of categories.

Speed is very high on a single machine and very scalable across multiple
machines.

You can look at the third section of the Mahout in Action book for theory,
practice and code examples.

There are probably still a few missing bits.  It would help if you could say
more about your intended
application.

How many categories?  (why?)

Is there some logical structure to your categories?

Where will you get your training data?  Will it really be reliable?

How fast do you need?

Which web pages?

Almost inevitably, after you answer these questions, there will be another
round of questions centered
around whether your requirements are really the right ones to drive the
benefit that you want.