Posted to user@nutch.apache.org by Marcus Herou <ma...@tailsweep.com> on 2007/08/06 15:42:29 UTC

Integration of Nutch

Hi.

I am building (yet another) crawler, parsing the crawled HTML files and
indexing them with Lucene. Then I came to think about it. Stupido! Why
aren't you using Nutch instead?

My use case is something like this.

100-1000 domains with an average depth of 3 to 5, I think. If I miss some
pages it is not the end of the world, so a tradeoff between depth and
crawl speed is acceptable.
All URLs must be crawled at least once a day, driven by cron.

I would like to have one Lucene dir which is optimized after each
reindexing, not one dir per crawl, so I need to create something like the
recrawl script published on the Wiki.
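
Roughly what I have in mind, sketched against the plain Lucene 2.x API
(class name and paths are just made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeCrawlIndex {
  public static void main(String[] args) throws Exception {
    // Open the single master index (create=false: it already exists).
    IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory("/data/index/master"),
        new StandardAnalyzer(), false);
    // Merge in the index produced by the latest crawl.
    writer.addIndexes(new Directory[] {
        FSDirectory.getDirectory("/data/crawl/index") });
    // Collapse to a single segment after each reindexing.
    writer.optimize();
    writer.close();
  }
}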

I would prefer to search the content myself by creating an IndexSearcher.
This is because I already index a whole lot of RSS feeds, so I would like
to do a "MultiIndex" search, and I think that will be hard to do without
doing it yourself.
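
By "MultiIndex" search I mean something like this (again just a sketch,
with made-up paths and field name, using Lucene's MultiSearcher):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;

public class MultiIndexSearch {
  public static void main(String[] args) throws Exception {
    // One searcher over the Nutch-built index, one over my RSS index.
    Searchable[] searchables = new Searchable[] {
        new IndexSearcher("/data/index/crawl"),
        new IndexSearcher("/data/index/rss")
    };
    MultiSearcher searcher = new MultiSearcher(searchables);
    QueryParser parser = new QueryParser("content", new StandardAnalyzer());
    Hits hits = searcher.search(parser.parse("canon eos 40d"));
    System.out.println(hits.length() + " hits");
    searcher.close();
  }
}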

I noticed the WAR file but I would prefer to create the templates myself.

Anyone have a good pattern regarding this?

Kindly

//Marcus Herou

-- 
Marcus Herou, Solution Architect & Core Java Developer, Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com

Re: Integration of Nutch

Posted by Renaud Richardet <re...@apache.org>.
hi Marcus,

[please respond to the list, and not to my email, so that we keep 
everybody in the loop ;-) ]

> Hi. Renaud thanks for your answer!
>
> Just so I get it clear: the index dir that the crawl puts the segment 
> files in is just a plain Lucene dir, like the one IndexWriter creates, 
> true|false?
yep, just try to open it up with Luke.
>
> So what you're saying is that I should have something like two dirs 
> named e.g. crawl & crawl_tmp?
>
> Something like this?
>
> 1. Do the crawl to "crawl_tmp"
> 2. Rename / move "crawl_tmp" to "crawl"
> 3. Notify my Lucene searcher, which points to "crawl", to reinit
yes, that should work.
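
Something along these lines should do for steps 2 and 3 (just a sketch,
all paths made up; note that on a typical Linux setup the old searcher
keeps working on its open files until you close it):

import java.io.File;
import org.apache.lucene.search.IndexSearcher;

public class SearcherHolder {
  private volatile IndexSearcher searcher;

  // Call after a crawl into crawl_tmp has completed successfully.
  public synchronized void swapAndReinit() throws Exception {
    File live = new File("/data/crawl");
    File fresh = new File("/data/crawl_tmp");
    File old = new File("/data/crawl_old");

    if (live.exists() && !live.renameTo(old)) {
      throw new IllegalStateException("could not move live index aside");
    }
    if (!fresh.renameTo(live)) {
      throw new IllegalStateException("could not promote new index");
    }
    IndexSearcher previous = searcher;
    // The Lucene index sits under <crawl>/index; reinit against it.
    searcher = new IndexSearcher(new File(live, "index").getAbsolutePath());
    if (previous != null) previous.close();
    // clean up /data/crawl_old at leisure
  }

  public IndexSearcher getSearcher() { return searcher; }
}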
>
> I save all RSS content into my DB, since the feeds are connected to 
> another table named "Site" that the users of our service enter info 
> about their site into. So I can link them by id rather than by the 
> harder-to-interpret domain/url.
> Lucene is just called from my save/update methods in the responsible 
> DAO. I can, with some persuasion, change my mind :)
>
> I fully understand that your crawler does a much better/faster job 
> than my "ThreadedJobPool", but in the RSS case I think I need to do it 
> myself.
whatever works for you :-)
>
> Oh, and I have another question: what code part generates the 
> summaries in the search result? I would like to look at that 
> "HtmlParser". I'm writing my own which counts how many <h1>, <h2>, <p> 
> etc. tags there are on a page and tries to use the <h> tags as title 
> (falling back to <title>), <p> tags as summaries, and cleansed <div>s 
> as a fallback.
check 
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/summary/lucene/package-summary.html

or tweak it in NutchBean:

public Summary[] getSummary(HitDetails[] hits, Query query)
    throws IOException {
  return summarizer.getSummary(hits, query);
}
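
For the heuristic you describe (first <h1> as title, <p> text as
summary), a rough regex-based sketch, just to illustrate the fallback
order; a real HTML parser like NekoHTML would be more robust:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeuristicSummarizer {
  private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
  private static final Pattern H1 = Pattern.compile("<h1[^>]*>(.*?)</h1>", FLAGS);
  private static final Pattern TITLE = Pattern.compile("<title[^>]*>(.*?)</title>", FLAGS);
  private static final Pattern PARA = Pattern.compile("<p[^>]*>(.*?)</p>", FLAGS);

  // Title: first <h1>, falling back to <title>.
  public static String title(String html) {
    Matcher m = H1.matcher(html);
    if (m.find()) return strip(m.group(1));
    m = TITLE.matcher(html);
    return m.find() ? strip(m.group(1)) : "";
  }

  // Summary: first <p> with some real text in it.
  public static String summary(String html) {
    Matcher m = PARA.matcher(html);
    while (m.find()) {
      String text = strip(m.group(1));
      if (text.length() > 40) return text;
    }
    return "";
  }

  // Drop nested tags and collapse whitespace.
  private static String strip(String s) {
    return s.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
  }
}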


HTH,
Renaud

>
> The bastard Google has a really nice summarizer (surprise); Yahoo's is 
> not as good. Look below:
> http://www.google.se/search?hl=sv&as_qdr=all&q=site%3Ahttp%3A%2F%2Fgadgets.fosfor.se%2Fcanon-eos-40d&btnG=S%C3%B6k&meta=
> Go to the URL http://gadgets.fosfor.se/canon-eos-40d/ and look at the 
> content; the damn summarizer is spot on. One probable reason is that he 
> runs Google Ads, which "wrap" the contextual content and make it really 
> easy to find.
>
> What are your thoughts regarding the summaries?
>
>
> Thanks again.
>
> Kindly
>
> //Marcus


Re: Integration of Nutch

Posted by Renaud Richardet <re...@apache.org>.
hi Marcus,
> Hi.
>
> I am building (yet another) crawler, parsing the crawled HTML files and
> indexing them with Lucene. Then I came to think about it. Stupido! Why
> aren't you using Nutch instead?
>
> My use case is something like this.
>
> 100-1000 domains with an average depth of 3 to 5, I think. If I miss some
> pages it is not the end of the world, so a tradeoff between depth and
> crawl speed is acceptable.
> All URLs must be crawled at least once a day, driven by cron.
>
> I would like to have one Lucene dir which is optimized after each
> reindexing, not one dir per crawl, so I need to create something like the
> recrawl script published on the Wiki.
>   
Not sure I understand: why don't you just throw away the old index once 
you have successfully created the new one (since you have to re-crawl 
the whole content daily)?
> I would prefer to search the content myself by creating an IndexSearcher.
> This is because I already index a whole lot of RSS feeds, so I would like
> to do a "MultiIndex" search, and I think that will be hard to do without
> doing it yourself.
>   
Or you could index the feeds with Nutch, too. There's a plugin for RSS...
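
If you go that route, it is mostly a matter of adding parse-rss to 
plugin.includes in conf/nutch-site.xml, something like this (the exact 
plugin list to keep depends on your setup, check the default in 
nutch-default.xml):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|rss)|index-basic|query-(basic|site|url)</value>
</property>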
> I noticed the WAR file but I would prefer to create the templates myself.
>   
Actually, the WAR is just a starter; you will have to implement your own 
layout in the JSPs anyway.

HTH,
Renaud