Posted to dev@nutch.apache.org by Alexander Aristov <al...@gmail.com> on 2012/07/01 21:02:51 UTC

Nutch and Mahout integration

People,

Can you give me some advice?

I want to integrate Nutch and Mahout to classify crawled pages.

First question: has anyone tried this, and are there any libraries available?

Next: which is better or easier? Extend Nutch by plugging a Mahout classifier
into the project, or extend Mahout so that it can read and write
Nutch files?

Best Regards
Alexander Aristov

Re: Nutch and Mahout integration

Posted by Mathijs Homminga <ma...@kalooga.com>.
We wrote a custom Nutch parse plugin that uses a Mahout classifier to classify docs.

Mathijs Homminga


Re: Nutch and Mahout integration

Posted by Julien Nioche <li...@gmail.com>.
Alexander,

> Can you give me some advice?
>
> I want to integrate Nutch and Mahout to classify crawled pages.
>
> First question: has anyone tried this, and are there any libraries available?
>

https://github.com/DigitalPebble/behemoth could be used to build a Nutch ->
Behemoth -> Mahout pipeline. The only problem is that there is no standard
format for Mahout classifiers, so you would need to write a bit of code for
that step. Behemoth also has a Solr plugin.

Alternatively, you can use our Text Classification API (
https://github.com/DigitalPebble/TextClassification) within a Nutch
indexing filter.
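
To make the indexing-filter option a bit more concrete, here is a minimal
sketch in plain Java. It does not use the real Nutch, Mahout or
TextClassification APIs: TextClassifier, KeywordClassifier, addCategory and
the "category" field name are all hypothetical stand-ins. In a real plugin the
same logic would presumably sit inside an implementation of Nutch's
IndexingFilter extension point, with the toy classifier replaced by a trained
model.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ClassifyingFilterSketch {

    /** Hypothetical stand-in for a trained model (for example one built with Mahout). */
    interface TextClassifier {
        String classify(String text);
    }

    /** Toy keyword classifier so that the sketch runs without any trained model. */
    static class KeywordClassifier implements TextClassifier {
        @Override
        public String classify(String text) {
            String lower = text.toLowerCase(Locale.ROOT);
            if (lower.contains("football") || lower.contains("league")) {
                return "sports";
            }
            if (lower.contains("election") || lower.contains("parliament")) {
                return "politics";
            }
            return "other";
        }
    }

    /**
     * Hypothetical filter step: returns a copy of the page metadata with the
     * predicted label stored under the (assumed) field name "category".
     * The input map is left unchanged.
     */
    static Map<String, String> addCategory(String pageText,
                                           Map<String, String> metadata,
                                           TextClassifier classifier) {
        Map<String, String> enriched = new HashMap<>(metadata);
        enriched.put("category", classifier.classify(pageText));
        return enriched;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("url", "http://example.com/page");

        Map<String, String> enriched = addCategory(
                "The football league resumes next week",
                metadata,
                new KeywordClassifier());

        // Print the enriched metadata, including the predicted category.
        System.out.println(enriched);
    }
}
```

Keeping the classifier behind a small interface means the model format (the
part that is not standardised on the Mahout side) stays isolated from the
filter code.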


>
> Next: which is better or easier? Extend Nutch by plugging a Mahout classifier
> into the project, or extend Mahout so that it can read and write
> Nutch files?
>

That depends on what you need to do with the data after classification.
Behemoth already handles the conversion from Nutch to Mahout, but again the
problem is the lack of a standard format on the Mahout side.

HTH

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble