Posted to user@nutch.apache.org by tsmori <ti...@ncsu.edu> on 2009/10/01 15:56:48 UTC

Nutch randomly skipping locations during crawl

This is strange. I manage the webservers for a large university library. On
our site we have a staff directory where each user has a location for
information. The URLs take the form of:

http://mydomain.edu/staff/userid

I've added the staff URL to the urls seed file. But even with a crawl set to
depth of 8 and unlimited files, i.e. no topN setting, the crawl still seems
to only fetch about 50% of the locations in this area of the site. 

What should I look for to find out why this is happening?
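(For reference, a crawl with the settings described above would typically be started with something like the command below. This is a hedged sketch for the Nutch 0.9/1.0 one-step crawl tool; the seed directory "urls" and output directory "crawl" are assumptions.)

  bin/nutch crawl urls -dir crawl -depth 8
  # no -topN argument, so the number of URLs fetched per round is not capped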




Re: Nutch randomly skipping locations during crawl

Posted by Andrzej Bialecki <ab...@getopt.org>.
tsmori wrote:
> Both good ideas. Unfortunately, the content for each user is the same. It's a
> static PHP file that simply pulls information out of our LDAP.
> 
> It's very strange because I cannot see any difference between the user
> files/directories that are fetched and those that aren't. In checking both
> the crawl log and the hadoop log, the missing users are not even fetched. 

Check the segment's crawl_generate and crawl_fetch, and also check your 
crawldb for status. Logs don't always contain this information.
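A hedged sketch of how those checks might look from the command line, assuming the crawl data lives under ./crawl and taking one of the missing staff pages as an example URL (the segment name below is a placeholder for one of your timestamped segments, and the dump file name may vary slightly by version):

  # status of a single URL in the crawldb
  bin/nutch readdb crawl/crawldb -url http://mydomain.edu/staff/userid

  # overall breakdown of crawldb statuses (fetched, unfetched, gone, ...)
  bin/nutch readdb crawl/crawldb -stats

  # dump a segment so the crawl_generate and crawl_fetch entries can be inspected
  bin/nutch readseg -dump crawl/segments/20091001123456 segdump
  # the dump typically ends up in segdump/dump
  grep -B2 -A8 'staff/userid' segdump/dump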

> The issue seems to be that they're not fetched and there's no indication in
> the logs why they aren't.

See above.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: Nutch randomly skipping locations during crawl

Posted by tsmori <ti...@ncsu.edu>.
Both good ideas. Unfortunately, the content for each user is the same. It's a
static PHP file that simply pulls information out of our LDAP.

It's very strange because I cannot see any difference between the user
files/directories that are fetched and those that aren't. In checking both
the crawl log and the hadoop log, the missing users are not even fetched. 

If it's a permissions issue, it's a very odd one. All the directories here
have the same group membership, and all files and directories under them are
owner-, group-, and world-readable/executable.

The issue seems to be that they're not fetched and there's no indication in
the logs why they aren't.
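(As an aside, a hedged way to double-check the logs for one specific missing user, assuming a default local install that logs to logs/hadoop.log and using "userid" as a placeholder:)

  grep -i 'staff/userid' logs/hadoop.log
  # no hits at all usually means the URL never made it onto a fetch list,
  # rather than a fetch that failed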


RE: Nutch randomly skipping locations during crawl

Posted by BELLINI ADAM <mb...@msn.com>.
Yes, also check whether some of the user IDs contain characters like ?, @, *, !, or =.

They are filtered out by default: -[?*!@=]
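For reference, that rule lives in the URL filter configuration, a sketch of the relevant default lines (in conf/crawl-urlfilter.txt when using the one-step crawl command, or conf/regex-urlfilter.txt otherwise):

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]

Relaxing or commenting out that line would let such URLs through, at the cost of also crawling query-style URLs.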






> Date: Thu, 1 Oct 2009 18:15:38 +0200
> From: ab@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: Nutch randomly skipping locations during crawl
> 
> tsmori wrote:
> > This is strange. I manage the webservers for a large university library. On
> > our site we have a staff directory where each user has a location for
> > information. The URLs take the form of:
> > 
> > http://mydomain.edu/staff/userid
> > 
> > I've added the staff URL to the urls seed file. But even with a crawl set to
> > depth of 8 and unlimited files, i.e. no topN setting, the crawl still seems
> > to only fetch about 50% of the locations in this area of the site. 
> > 
> > What should I look for to find out why this is happening?
> > 
> > 
> 
> * Check that the pages there are not forbidden by robot rules (which may 
> be embedded inside HTML meta tags of index.html, or the top-level 
> robots.txt).
> 
> * check that your crawldb actually contains entries for these pages - 
> perhaps they are being filtered out.
> 
> * check in your segments whether these URLs were scheduled for fetching,
> and if so, what the fetch status was.
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 

Re: Nutch randomly skipping locations during crawl

Posted by Andrzej Bialecki <ab...@getopt.org>.
tsmori wrote:
> This is strange. I manage the webservers for a large university library. On
> our site we have a staff directory where each user has a location for
> information. The URLs take the form of:
> 
> http://mydomain.edu/staff/userid
> 
> I've added the staff URL to the urls seed file. But even with a crawl set to
> depth of 8 and unlimited files, i.e. no topN setting, the crawl still seems
> to only fetch about 50% of the locations in this area of the site. 
> 
> What should I look for to find out why this is happening?
> 
> 

* Check that the pages there are not forbidden by robot rules (which may 
be embedded inside HTML meta tags of index.html, or the top-level 
robots.txt).

* check that your crawldb actually contains entries for these pages - 
perhaps they are being filtered out.

* check in your segments whether these URLs were scheduled for fetching, 
and if so, what the fetch status was.
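A hedged sketch of how the first two checks might be done by hand (curl, the example URL, and the exact tool invocation are assumptions about a typical Nutch 1.x setup):

  # site-wide robots rules
  curl http://mydomain.edu/robots.txt

  # per-page robots meta tag on one of the missing staff pages
  curl -s http://mydomain.edu/staff/userid | grep -i '<meta name="robots"'

  # if your Nutch version ships the URLFilterChecker tool, it reports whether
  # a URL passes the configured URL filters (it reads URLs from stdin)
  echo 'http://mydomain.edu/staff/userid' | \
    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined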


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com