Posted to user@nutch.apache.org by Chris Newton <cd...@gmail.com> on 2006/07/13 15:58:18 UTC

nutch suitable for blogs?

  Hi all.  First off, I'm using Nutch 0.72.

  I've been playing with nutch for a couple weeks now, and have some
questions relating to indexing blog sites.

  Many blog platforms publish a changes.xml file on some schedule (
blogger.com/changes10.xml is updated every 10 minutes) that lists the blogs
updated in the last 10 minutes.  Others have an atom stream...  either way,
the URLs you need to index are handed to you: there are -always- new URLs to
crawl, I already know which ones have been updated, and I don't want Nutch to
automatically recrawl them when some time period elapses (like crawl again in
30 days).

  Nutch seems to be designed to be given a few seed URLs, which it can
inject into its DB, crawl, extract new links from, and then crawl those
too...  previously crawled sites will be recrawled automatically once the
time since the last crawl hits some predefined number (30 days by default).
ie: perfectly normal search engine behavior.

  For blogs... I want it to crawl the injected URLs, and none of the links
on the page.  I did this (I think!) by setting db.max.outlinks.per.page to
zero.  I want it to ONLY crawl the newly injected URLs (I did this by
setting urlfilter.prefix.file to the name of my file that has the list of
updated blog URLs).
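
  For reference, the two settings above would go in conf/nutch-site.xml.
This is a sketch, not my exact file -- the filename in urlfilter.prefix.file
is whatever your changes-list processing writes out, and it assumes the
prefix URL filter plugin is enabled:

```xml
<!-- nutch-site.xml fragment (0.7-era property format).
     "updated-urls.txt" is a placeholder for your own URL-list file. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>0</value>
  <description>Do not follow any links found on fetched pages.</description>
</property>
<property>
  <name>urlfilter.prefix.file</name>
  <value>updated-urls.txt</value>
  <description>Only URLs listed in the current changes file pass the
  prefix filter.</description>
</property>
```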

  I'm not sure this setup will ensure that, when 30 days rolls around, nutch
doesn't start automatically throwing old URLs into newly generated segments
for a recrawl.

  For this test, I have this cycle: download changes10.xml, process with
xsltproc to a plain text list of URLs.  Inject into db...  make sure
urlfilter.prefix.file is set to the file with this list of URLs.  Generate a
new segment, fetch, and index.
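
  The cycle above, sketched as a shell script.  The paths, the stylesheet
name, and the install location are assumptions for a 0.7-era setup, and the
script dry-runs by default (it prints the bin/nutch commands rather than
executing them) -- clear RUN to actually run it:

```shell
#!/bin/sh
# Hypothetical 10-minute indexing cycle -- a sketch, not a tested setup.
# RUN defaults to "echo" so this prints the commands (dry run); invoke
# with RUN= (empty) to execute them for real.
RUN=${RUN-echo}

NUTCH=${NUTCH_HOME:-/opt/nutch}/bin/nutch   # assumed install location
DB=crawl/db                                 # assumed webdb path
SEGMENTS=crawl/segments
URLS=updated-urls.txt                       # plain-text URL list, one per line

# 1. Pull the changes list and flatten it to plain URLs.
#    changes-to-urls.xsl is a hypothetical stylesheet you would write.
$RUN wget -q -O changes10.xml http://blogger.com/changes10.xml
$RUN xsltproc -o "$URLS" changes-to-urls.xsl changes10.xml

# 2. Inject only the freshly updated URLs into the db; urlfilter.prefix.file
#    points at $URLS, so nothing else survives filtering.
$RUN "$NUTCH" inject "$DB" -urlfile "$URLS"

# 3. Generate a fetchlist, fetch it, update the db, and index the segment.
$RUN "$NUTCH" generate "$DB" "$SEGMENTS"
SEGMENT=$(ls -d "$SEGMENTS"/* 2>/dev/null | tail -1)   # newest segment dir
$RUN "$NUTCH" fetch "$SEGMENT"
$RUN "$NUTCH" updatedb "$DB" "$SEGMENT"
$RUN "$NUTCH" index "$SEGMENT"
```

Run it from cron every 10 minutes; the separate 30-minute index merge stays
its own job.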

  This results in a new segment every 10 minutes.  Every 30 minutes I run
'merge' to merge the segment indexes into crawl/index.

  Now first... anyone see any problems with this setup?

  Second... I end up with a perpetually growing list of segments, which means
each 'merge' run takes longer than the last.  How do I fix this?

  Third...  just in general... it seems I've had to goof with nutch's config
enough to make this work this way that it makes me want to ask whether using
nutch for this purpose is indeed the correct path.  I know Technorati just
directly uses lucene for a similar purpose.  Should that be the path I take
(HTMLParser to fetch and extract text, lucene set up with incremental
indexes)?

Thanks for any help anyone can provide.

Chris



-- 
Chris Newton,
CTO Radian6, www.radian6.com
Phone: 506-452-9039

Re: nutch suitable for blogs?

Posted by Ken Krugler <kk...@transpac.com>.
Hi Chris,

>  Hi all.  First off, I'm using Nutch 0.72.
>
>  I've been playing with nutch for a couple weeks now, and have some
>questions relating to indexing blog sites.

[snip]

>  Third...  just in general... it seems I've had to goof with nutch's config
>enough to make this work this way that it makes me want to ask whether using
>nutch for this purpose is indeed the correct path.  I know Technorati just
>directly uses lucene for a similar purpose.  Should that be the path I take
>(HTMLParser to fetch and extract text, lucene set up with incremental
>indexes)?

We've done something similar, using Nutch to crawl code 
repositories. My advice would be to continue down your current path, 
as there's quite a lot in Nutch beyond just the fetching support 
that proves useful when processing and serving up web-based content.

Eventually you might decide to just use Lucene and various pieces of 
Nutch as a better solution, but until then I think it's probably 
faster to use Nutch as your starting point, and also if/when that 
time comes, you'll have a much better understanding of how best to 
slice and dice.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"