Posted to user@nutch.apache.org by class acts <cl...@gmail.com> on 2007/04/08 10:43:15 UTC

Incremental indexing and link exploration, /tmp full, nutch design

Hi All,

   First of all, thanks to all the developers working on this project;
from the looks of it, it has great potential.  I've been
playing around with version 0.9 for the past couple of days and I have
a few questions regarding its usage.

At this time I'm particularly interested in doing the following:

1. Mirroring a complete website like abc.com without leaving its
confines.  I followed the 0.8 tutorial and basically did:

# add one site to the list
mkdir sites && echo 'www.abc.com' > sites/urls

# inject that one site to the WebDB?
bin/nutch inject crawl/crawldb sites

# generate segments (whatever this means - I assume it will just add
# the www.abc.com link)
bin/nutch generate crawl/crawldb crawl/segments

# put the segment path into s1 (I use csh, hence the `set`)
set s1=`ls -d crawl/segments/2* | tail -1`
echo $s1

# seems to fetch 50 or so pages linked from abc.com
bin/nutch fetch $s1

# put the results in the actual database
bin/nutch updatedb crawl/crawldb $s1
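
To keep the crawl from leaving www.abc.com, I understand the URL filters
have to be tightened as well; a minimal whitelist would be something like
this (in conf/crawl-urlfilter.txt for the one-step crawl command, or
conf/regex-urlfilter.txt for the step-by-step tools above - at least
that's how I read the tutorial):

# accept anything inside abc.com, skip everything else
+^http://([a-z0-9]*\.)*abc\.com/
-.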


As far as I know, the above does one round of fetching; however, there are
still plenty of pages yet to be fetched.  So I did:

while (1)
   bin/nutch generate crawl/crawldb crawl/segments -topN 50
   set s1=`ls -d crawl/segments/2* | tail -1`
   bin/nutch fetch $s1
   bin/nutch updatedb crawl/crawldb $s1
end

but I noticed that it just fetches the same 50 sites over and over
again.  Isn't there a way to tell it to keep fetching only the
pages that haven't been fetched yet?  I would assume the first run
would have found enough links to keep going.  Is the above procedure
meant to be a one-time run only?  How do I know when the site has been
completely indexed?



- Crawling

I'm also interested in performing a crawl of a rather large intranet
(and maybe even the Internet?).  I ran a crawl yesterday starting at
www.freebsd.org with depth 30 and topN 500 (topN number might be
wrong, could be more)  and I noticed that it stopped after only
downloading 160MB saying:

/tmp: write failed, filesystem is full
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:232)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:209)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)
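
For reference, that run used the one-step crawl tool, i.e. something along
these lines (the seed and output directory names here are just placeholders):

bin/nutch crawl seeds -dir crawl.freebsd -depth 30 -topN 500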

Is there any way to override where it puts its scratch directory?
On my FreeBSD box my /tmp partition is 512MB, and I'm sure any sane Unix
distribution out there would be the same (even 512MB is quite high).
I think the command-line tools shouldn't use /tmp by default if they
seem to need so much space.  Perhaps using the working directory or
the crawldb directory is better, since the likelihood of there being
more free space there is greater.

Anyway, back to the crawling question.  From reading the posts on this
mailing list and the Nutch wiki, it seems to me that nutch crawl will
basically crawl the internet up to the point specified by the depth
argument (I'm not sure what topN really means).  I would like to perform a
crawl starting at some start point A and then do the indexing for
Lucene when it's finished so that I can start mining that data.  I
would also like to have Nutch "continue" the crawl from where it left
off (not re-crawl the same pages) so that it can add more pages and
find more links to crawl in the future.  How can I tell Nutch to "keep
going" or "re-crawl" the pages it has already visited?

Thanks in advance for your help

Re: Incremental indexing and link exploration, /tmp full, nutch design

Posted by Espen Amble Kolstad <es...@trank.no>.
On Sunday 08 April 2007 10:43:15 class acts wrote:
> Hi All,
>
>    First of all, thanks to all the developers working on this project;
> from the looks of it, it has great potential.  I've been
> playing around with version 0.9 for the past couple of days and I have
> a few questions regarding its usage.
>
> At this time I'm particularly interested in doing the following:
>
> 1. Mirroring a complete website like abc.com without leaving its
> confines.  I followed the 0.8 tutorial and basically did:
>
> # add one site to the list
> mkdir sites && echo 'www.abc.com' > sites/urls
>
> # inject that one site to the WebDB?
> bin/nutch inject crawl/crawldb sites
>
> # generate segments (whatever this means - I assume it will just add
> # the www.abc.com link)
> bin/nutch generate crawl/crawldb crawl/segments
>
> # put the segment path into s1 (I use csh, hence the `set`)
> set s1=`ls -d crawl/segments/2* | tail -1`
> echo $s1
>
> # seems to fetch 50 or so pages linked from abc.com
> bin/nutch fetch $s1
>
> # put the results in the actual database
> bin/nutch updatedb crawl/crawldb $s1
>
>
> As far as I know, the above does one round of fetching; however, there are
> still plenty of pages yet to be fetched.  So I did:
>
> while (1)
>    bin/nutch generate crawl/crawldb crawl/segments -topN 50
>    set s1=`ls -d crawl/segments/2* | tail -1`
>    bin/nutch fetch $s1
>    bin/nutch updatedb crawl/crawldb $s1
> end
>
> but I noticed that it just fetches the same 50 sites over and over
> again.  Isn't there a way to tell it to keep fetching only the
> pages that haven't been fetched yet?  I would assume the first run
> would have found enough links to keep going.  Is the above procedure
> meant to be a one-time run only?  How do I know when the site has been
> completely indexed?

Have you checked your logs? It seems the last step, bin/nutch updatedb ..., is
failing.
It would probably make sense to set -topN higher than 50; topN is the number
of pages to fetch for a given segment.
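
For example (1000 is an arbitrary value, pick whatever fits your site):

bin/nutch generate crawl/crawldb crawl/segments -topN 1000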

>
>
>
> - Crawling
>
> I'm also interested in performing a crawl of a rather large intranet
> (and maybe even the Internet?).  I ran a crawl yesterday starting at
> www.freebsd.org with depth 30 and topN 500 (topN number might be
> wrong, could be more)  and I noticed that it stopped after only
> downloading 160MB saying:
>
> /tmp: write failed, filesystem is full
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:232)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:209)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)
>
> Is there any way to override where it puts its scratch directory?
> On my FreeBSD box my /tmp partition is 512MB, and I'm sure any sane Unix
> distribution out there would be the same (even 512MB is quite high).
> I think the command-line tools shouldn't use /tmp by default if they
> seem to need so much space.  Perhaps using the working directory or
> the crawldb directory is better, since the likelihood of there being
> more free space there is greater.
Set tmp dir in conf/hadoop-site.xml with:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/path</value>
  <description>A base for other temporary directories.</description>
</property>
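
Note that the property has to sit inside the <configuration> element, so a
minimal conf/hadoop-site.xml would look something like this (the path is only
an example, point it at a partition with enough free space):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/nutch/hadoop-tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>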


>
> Anyway, back to the crawling question.  From reading the posts on this
> mailing list and the Nutch wiki, it seems to me that nutch crawl will
> basically crawl the internet up to the point specified by the depth
> argument (I'm not sure what topN really means).  I would like to perform a
> crawl starting at some start point A and then do the indexing for
> Lucene when it's finished so that I can start mining that data.  I
> would also like to have Nutch "continue" the crawl from where it left
> off (not re-crawl the same pages) so that it can add more pages and
> find more links to crawl in the future.  How can I tell Nutch to "keep
> going" or "re-crawl" the pages it has already visited?

The default setting is to refetch after 30 days. I guess bin/nutch generate
will generate empty segments when there are no more pages to fetch?
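
You can also check how many pages are still unfetched with the crawldb reader,
something like:

bin/nutch readdb crawl/crawldb -stats

which should print the number of URLs per status (fetched, unfetched and so on).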

You need to run bin/nutch invertlinks and bin/nutch index to be able to search
your fetched pages. You'll find more about this in the wiki.
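
Roughly, following the 0.8 tutorial (adjust the paths to your own layout):

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*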

We have successfully fetched, and will soon have indexed, about 70M pages on a
cluster with 3 nodes + a master (jobtracker and namenode), so it does work :)
However, we do not store content, only parsed data.
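
If you want to do the same, the switch for that should be fetcher.store.content
in conf/nutch-site.xml (if I remember the property name correctly):

<property>
  <name>fetcher.store.content</name>
  <value>false</value>
  <description>If true, fetcher will store content.</description>
</property>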

>
> Thanks in advance for your help


- Espen