Posted to user@nutch.apache.org by tsmori <ti...@ncsu.edu> on 2009/10/01 15:56:48 UTC
Nutch randomly skipping locations during crawl
This is strange. I manage the webservers for a large university library. On
our site we have a staff directory where each user has a location for
information. The URLs take the form of:
http://mydomain.edu/staff/userid
I've added the staff URL to the urls seed file. But even with a crawl set to
depth of 8 and unlimited files, i.e. no topN setting, the crawl still seems
to only fetch about 50% of the locations in this area of the site.
What should I look for to find out why this is happening?
--
View this message in context: http://www.nabble.com/Nutch-randomly-skipping-locations-during-crawl-tp25696893p25696893.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Nutch randomly skipping locations during crawl
Posted by Andrzej Bialecki <ab...@getopt.org>.
tsmori wrote:
> Both good ideas. Unfortunately, the content for each user is the same. It's a
> static php file that simply calls information out of our LDAP.
>
> It's very strange because I cannot see any difference between the user
> files/directories that are fetched and those that aren't. In checking both
> the crawl log and the hadoop log, the missing users are not even fetched.
Check the segment's crawl_generate and crawl_fetch, and also check your
crawldb for status. Logs don't always contain this information.
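In Nutch 1.x these checks can be done with the readdb and readseg command-line tools; a sketch, assuming the crawl output lives under crawl/ and substituting a real staff URL and segment name for the placeholders:

```shell
# Status of a single URL in the crawldb (db_unfetched, db_fetched, db_gone, ...)
bin/nutch readdb crawl/crawldb -url http://mydomain.edu/staff/userid

# Overall crawldb statistics: how many URLs are in each status
bin/nutch readdb crawl/crawldb -stats

# Dump a segment to inspect its crawl_generate and crawl_fetch entries
bin/nutch readseg -dump crawl/segments/20091001123456 /tmp/segdump \
  -nocontent -noparse -noparsedata -noparsetext
```

If a missing URL shows up as db_unfetched but never appears in any segment's crawl_generate, the generator is skipping it; if it appears in crawl_generate but not crawl_fetch, the fetch itself failed.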
> The issue seems to be that they're not fetched and there's no indication in
> the logs why they aren't.
See above.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
RE: Nutch randomly skipping locations during crawl
Posted by tsmori <ti...@ncsu.edu>.
Both good ideas. Unfortunately, the content for each user is the same. It's a
static php file that simply calls information out of our LDAP.
It's very strange because I cannot see any difference between the user
files/directories that are fetched and those that aren't. In checking both
the crawl log and the hadoop log, the missing users are not even fetched.
If it's a permissions issue, it's a very odd one. All the directories here
have the same group membership and all files and directories under it are
owner, group, and world readable/executable.
The issue seems to be that they're not fetched and there's no indication in
the logs why they aren't.
--
View this message in context: http://www.nabble.com/Nutch-randomly-skipping-locations-during-crawl-tp25696893p25705239.html
RE: Nutch randomly skipping locations during crawl
Posted by BELLINI ADAM <mb...@msn.com>.
Yes, also check whether any of the userids contain characters such as ?, *, !, @, or =.
URLs containing those are filtered out by default by the rule: -[?*!@=]
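The effect of that default reject rule can be simulated with grep; the userids below are made up for illustration:

```shell
# grep -v drops any URL containing one of ? * ! @ = ,
# mirroring the default url-filter rule -[?*!@=]
printf '%s\n' \
  "http://mydomain.edu/staff/jsmith" \
  "http://mydomain.edu/staff/j=doe" \
  | grep -v '[?*!@=]'
# Only the jsmith URL survives; j=doe would be silently skipped.
```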
> Date: Thu, 1 Oct 2009 18:15:38 +0200
> From: ab@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: Nutch randomly skipping locations during crawl
>
> tsmori wrote:
> > This is strange. I manage the webservers for a large university library. On
> > our site we have a staff directory where each user has a location for
> > information. The URLs take the form of:
> >
> > http://mydomain.edu/staff/userid
> >
> > I've added the staff URL to the urls seed file. But even with a crawl set to
> > depth of 8 and unlimited files, i.e. no topN setting, the crawl still seems
> > to only fetch about 50% of the locations in this area of the site.
> >
> > What should I look for to find out why this is happening?
> >
> >
>
> * Check that the pages there are not forbidden by robot rules (which may
> be embedded inside HTML meta tags of index.html, or the top-level
> robots.txt).
>
> * check that your crawldb actually contains entries for these pages -
> perhaps they are being filtered out.
>
> * check your segments whether these URLs were scheduled for fetching,
> and if so, then what was the status of fetching.
>
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
Re: Nutch randomly skipping locations during crawl
Posted by Andrzej Bialecki <ab...@getopt.org>.
tsmori wrote:
> This is strange. I manage the webservers for a large university library. On
> our site we have a staff directory where each user has a location for
> information. The URLs take the form of:
>
> http://mydomain.edu/staff/userid
>
> I've added the staff URL to the urls seed file. But even with a crawl set to
> depth of 8 and unlimited files, i.e. no topN setting, the crawl still seems
> to only fetch about 50% of the locations in this area of the site.
>
> What should I look for to find out why this is happening?
>
>
* Check that the pages there are not forbidden by robot rules (which may
be embedded inside HTML meta tags of index.html, or the top-level
robots.txt).
* check that your crawldb actually contains entries for these pages -
perhaps they are being filtered out.
* check your segments whether these URLs were scheduled for fetching,
and if so, then what was the status of fetching.
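For the first point, the robots rules can be inspected by hand; a quick sketch with curl, reusing the poster's placeholder host and userid:

```shell
# Top-level robots.txt: look for Disallow rules covering /staff/
curl -s http://mydomain.edu/robots.txt | grep -i 'disallow'

# Per-page robots meta tag (noindex / nofollow) in an individual staff page
curl -s http://mydomain.edu/staff/userid | grep -i '<meta[^>]*robots'
```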
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com