Posted to user@nutch.apache.org by lewis john mcgibbney <le...@gmail.com> on 2011/06/30 20:23:57 UTC

UPDATE to no more urls to fetch

Hi Tamanjit,

I thought I had sent this message earlier but obviously not; apologies for
that. I don't seem to be able to post to user@ when replying to your mail,
which is why you may not have received my replies.

A couple of things spring to mind. Before I cover them: it is usually
helpful to include the thread of previous posts so we can see what progress
(if any) has been made and what suggestions have already been offered.

1) Did you manually delete documents from your Solr index? Newer versions of
Nutch have commands for cleaning up the Solr index more effectively, e.g.
solrdedup and solrclean. Have you been using any of these?
2) In a situation like this (where we want information about a particular
URL), I have found it beneficial to use the command line options. The
documentation for Nutch <1.2 can be found here [1] and for Nutch 1.3 here
[2]. Using the various reader classes we can dump information about the
whole crawldb/linkdb or, alternatively, pass parameters for individual
links, which I think is what you are after in this case. This also helps us
understand what Nutch is doing during your breadth-first crawl of the web
graph.
3) When you say that it fetched a lot of sites in the index, do you mean in
the site-map? If so, you may need to increase http.content.limit (or a
similar property) in your nutch-site.xml, as any content beyond this limit
is truncated, so outlinks past that point are not picked up. This is
another reason to use the read commands: they show what different
configuration options give us for this type of crawl.
4) You may also wish to take a look at the time limits between successive
fetches of URLs (for example db.fetch.interval.default) in nutch-site.xml.
This may alter your results for obtaining the various links in the site-map
page you mentioned.
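
To illustrate point 3, a minimal sketch of a nutch-site.xml override might
look like the following. The 1 MB value is only an assumption for
illustration; pick whatever suits the size of your site-map page.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: values here override those in nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Illustrative value (1 MB, in bytes). The shipped default is 65536;
         a negative value such as -1 disables truncation altogether. -->
    <value>1048576</value>
  </property>
</configuration>
```

After changing this, the page needs to be fetched and parsed again before
any extra outlinks will show up in the crawldb.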

[1] http://wiki.apache.org/nutch/08CommandLineOptions
[2] http://wiki.apache.org/nutch/CommandLineOptions (please note this is
under construction)

-- 
*Lewis*