Posted to user@mahout.apache.org by arijit <pa...@yahoo.com> on 2012/10/27 14:31:27 UTC

Fw: Injecting Mahout in the nutch-solr mix

Hi,
   As this topic concerns both Nutch and Mahout, I am forwarding the request for direction that I posted on the Nutch user mailing list to this list as well.
-Arijit


----- Forwarded Message -----
From: arijit <pa...@yahoo.com>
To: "user@nutch.apache.org" <us...@nutch.apache.org> 
Sent: Saturday, October 27, 2012 5:51 PM
Subject: Injecting Mahout in the nutch-solr mix
 

Hi,
   I have been using Nutch to crawl some wiki sites, with a plugin that uses:
   o a subclass of HtmlParseFilter to learn patterns from the crawled data, and
   o a subclass of IndexingFilter that applies the learning from the earlier step to add extra fields when the index info is sent to Solr.

   It works. However, it means I need to spend time writing specific code to recognize these various classes of documents. I am looking at Mahout to help with this intermediate job, and its clustering functionality seems well suited to clustering the crawled pages so that the cluster information can be added as extra dimensions in Solr.
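To make the idea concrete, here is a minimal, self-contained sketch of the intermediate step being described: cluster simple document vectors and use the resulting cluster id as the extra field value for Solr. This is plain Java with a toy nearest-centroid assignment, not the Nutch or Mahout API; the field name "cluster_id" and all class and method names here are illustrative assumptions only.

```java
import java.util.Arrays;

// Hypothetical sketch: assign each document vector to the nearest of k
// centroids; the resulting cluster id would become an extra Solr field
// (e.g. "cluster_id") added by the IndexingFilter subclass.
public class ClusterSketch {

    // Euclidean distance between two equal-length vectors.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Label each document vector with the index of its nearest centroid.
    static int[] assign(double[][] docs, double[][] centroids) {
        int[] labels = new int[docs.length];
        for (int i = 0; i < docs.length; i++) {
            double best = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double d = distance(docs[i], centroids[c]);
                if (d < best) {
                    best = d;
                    labels[i] = c;
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Two obvious groups of toy 2-d "document vectors".
        double[][] docs = {
            {0.1, 0.2}, {0.0, 0.1},   // group A
            {5.0, 5.1}, {4.9, 5.2}    // group B
        };
        double[][] centroids = { {0.0, 0.0}, {5.0, 5.0} };
        int[] labels = assign(docs, centroids);
        System.out.println(Arrays.toString(labels)); // prints [0, 0, 1, 1]
    }
}
```

In the real pipeline, Mahout's clustering (e.g. k-means over vectorized page text) would replace the toy assignment above, and the per-page cluster label would be attached to the NutchDocument before it is pushed to Solr.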

   Do you think this is a good way forward? Should I try to use Mahout as a library to help me do the plugin work that I described earlier? Or is there a better way to achieve the clustering before I add the indexes into Solr?

   Any help or direction on this is much appreciated.
-Arijit