Posted to user@nutch.apache.org by J B <be...@hotmail.com> on 2005/06/01 17:58:45 UTC

Architecture for parallel crawling

Hello,

Forgive me for my dumb questions, but I couldn't find any guidance in the 
other postings.

I want to crawl about 20 pre-defined (larger) sites, once a day, preferably
in parallel to save time (threads?). Only the pages on those sites should
be crawled and not links pointing to other sites. When querying the indexed 
material, all 20 sources should be searched in the same query. The urls-file 
looks like this:

http://www.site1.com/
http://www.site2.com/
http://www.site3.com/
etc...

The file crawl-urlsfilter.txt looks like this:

+^http://([a-z0-9]*\.)*site1.com/
+^http://([a-z0-9]*\.)*site2.com/
+^http://([a-z0-9]*\.)*site3.com/
etc...

I have tried several different approaches and configurations of these two 
files, but I never get the desired result. There's always just one crawling 
process, and it never gets all 20 sites. Moreover, it follows external links 
to other sites...

Given the above, what "Nutch-architecture" should I use?

Best regards,

Jon

"I didn't realize that I was stupid until I got to know Nutch"



Re: Architecture for parallel crawling

Posted by ir <ir...@gmail.com>.
I'll give this a shot, although I'm still kind of new to Nutch, so take my
suggestions with a grain of salt.

First of all, you're not going to move on to site2 until site1 is finished
crawling... so depending on your depth and the size of the site, it might
take forever just to crawl one site. What I have had success doing is
setting up a separate crawl for just one site at a time. That way I can
have a URL filter tailored just for that site, which is much easier to keep
straight.
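
For example, a per-site filter file for the site1 crawl might look something
like this (just a sketch; escape the dots and end with a catch-all rule so
nothing outside the site gets through):

# accept pages on site1.com and its subdomains
+^http://([a-z0-9]*\.)*site1\.com/
# skip everything else
-.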

I won't go into too much detail because I wouldn't want to provide
incorrect info to the list... but I have multiple instances of "nutch
crawl" running at the same time, each with its own set of URL filters
designed for just one site. When the crawls are done I combine the
segments and index all the information. If you would like me to go into
further detail I can.

This would work well in your situation: you could create 20 instances of
Nutch and just kick off all 20 crawls at once (if your machine can handle
it). It would also make each URL filter much easier to develop and test.

You could even write a very simple script that kicks off all the crawls at
once and then combines them into one index when they are done.
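
Something roughly like this (an untested sketch; I keep one copy of Nutch per
site so each has its own conf/ and URL filter, and a urls.txt in each copy
holding that site's start URL -- the exact commands may differ in your
version):

#!/bin/sh
# kick off one "nutch crawl" per site, each from its own Nutch copy
for site in site1 site2 site3; do        # ... through site20
  ( cd nutch-$site && \
    bin/nutch crawl urls.txt -dir crawl-$site -depth 3 \
      > crawl-$site.log 2>&1 ) &
done
wait   # block until all the crawls have finished
# then index/merge the results (e.g. bin/nutch index and bin/nutch merge in
# my version; run bin/nutch with no arguments to list the exact commands)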

If you are interested, and someone who knows more than I do says this
method is acceptable, I will explain it further in a future mailing.


P.S. As far as which architecture to use, "Whole Web" or "Intranet": I
would stick with what you are doing until you understand it better, and
then decide whether the added flexibility of whole-web crawling is a
benefit to you.

RE: Architecture for parallel crawling

Posted by Chirag Chaman <de...@filangy.com>.
Jon,

First, we need to get rid of this thought
>> "I didn't realize that I was stupid until I got to know Nutch"

Gotta keep a positive view; this is not easy software to learn in a week
or so.

Now, 

1. Threads

Threading happens by default. It's specified in the conf file, and the
default values are good enough. I would encourage you to read through the
nutch-default.xml file, as that will give you an overview of all the
settings available in Nutch.
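
For example, the thread setting looks roughly like this in nutch-default.xml
(override it in nutch-site.xml rather than editing the default file):

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>The number of fetcher threads each fetch run will use.</description>
</property>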

2. Don't follow external links.

Check that you are using the newest version of Nutch. The older version had
a bug where links would get added to the DB without being filtered; this has
since been fixed. I would also urge you to apply Andrzej's fetcher patch.

For starters, I would recommend not following links and seeing if you can
get your initial URL list indexed (all of it, to figure out what could be
keeping some of the 20 sites from being indexed), then add link following
back. Take a look at http://www.siteXX.com/robots.txt manually to confirm
whether you are being blocked from the sites that are not being indexed.
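
A quick way to check that from the command line, for example:

wget -q -O - http://www.siteXX.com/robots.txt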

CC-




 
 
