Posted to user@nutch.apache.org by Bai Shen <ba...@gmail.com> on 2011/09/27 17:24:43 UTC

Understanding Nutch workflow

I'm trying to understand exactly what the Nutch workflow is and I have a few
questions.  From the tutorial:

bin/nutch inject crawldb urls

This takes a list of urls and creates a database of urls for nutch to fetch.
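
As a side note, an easy way to sanity-check what inject produced (assuming the
default layout from the tutorial, with the db in ./crawldb) seems to be the
crawldb reader, which prints URL counts and fetch statuses:

bin/nutch readdb crawldb -stats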

bin/nutch generate crawldb segments

This generates a segment which contains all of the urls that need to be
fetched.  From what I understand, you can generate multiple segments, but
I'm not sure I see the value in that, as fetching is primarily limited by
your connection rather than by any one machine.
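
For what it's worth, the generator seems to accept a -topN option (on top of
the limits in nutch-default.xml) to cap how many URLs go into each segment,
which looks like the usual way to end up with several smaller segments, e.g.:

bin/nutch generate crawldb segments -topN 50000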

s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1

This takes the segment and fetches all of the content and stores it in Hadoop
for the mapreduce job.  I'm not quite sure how that works, as when I ran the
fetch, my connection showed 12GB of data downloaded, yet the hadoop
directory was using over 40GB of space.  Is this normal?

bin/nutch parse $1

This parses the fetched data using hadoop in order to extract more urls to
fetch.  It doesn't do any actual indexing, however.  Is this correct?

bin/nutch updatedb crawldb $s1

Now the parsed urls are added back to the initial database of urls.

bin/nutch invertlinks linkdb -dir segments

I'm not exactly sure what this does.  The tutorial says "Before indexing we
first invert all of the links, so that we may index incoming anchor text
with the pages."  Does that mean that if there's a link such as < A
HREF="url" > click here for more info < /A > that it adds the "click here
for more info" to the database for indexing in addition to the actual link
content?

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
crawl/linkdb crawl/segments/*

This is where the actual indexing takes place, correct?  Or is Nutch just
posting the various documents to Solr and leaving Solr to do the actual
indexing?  Is this the only step that uses the schema.xml file?


Thanks.

Re: Understanding Nutch workflow

Posted by Markus Jelsma <ma...@openindex.io>.
You can use the segment reader to read downloaded content.
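
For example, something along these lines should dump the parsed text of a
segment to a local directory (the segment path is picked the same way as in
the tutorial; the flags just suppress the other parts of the segment):

s1=`ls -d crawl/segments/2* | tail -1`
bin/nutch readseg -dump $s1 segdump -nocontent -nofetch -nogenerate -noparse -noparsedata

What remains in the dump is the extracted full text per URL; dropping
-nocontent also includes the raw downloaded content.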

> this is helpful -- can someone also explain whether there is mechanism to
> extract full text of pages from where they are stored in mapreduce?

Re: Understanding Nutch workflow

Posted by Fred Zimmerman <wf...@nimblebooks.com>.
this is helpful -- can someone also explain whether there is a mechanism to
extract full text of pages from where they are stored in mapreduce?



Re: Fwd: Understanding Nutch workflow

Posted by Markus Jelsma <ma...@openindex.io>.
> > > How do I see the outpud of the mapred job?  I don't recall seeing
> > 
> > anything
> > 
> > > like that in the log file.
> > 
> > This output on stdout, which can be viewed realtime using the web gui:
> > 11/09/27 16:54:35 INFO mapred.JobClient: Job complete:
> > job_201109261414_0039
> > 11/09/27 16:54:37 INFO mapred.JobClient: Counters: 27
> > 11/09/27 16:54:37 INFO mapred.JobClient:   Job Counters
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Launched reduce tasks=9
> > 11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4561078
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all
> > reduces
> > waiting after reserving slots (ms)=0
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all maps
> > waiting after reserving slots (ms)=0
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Rack-local map tasks=2
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Launched map tasks=417
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Data-local map tasks=415
> > 11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=6166304
> > 11/09/27 16:54:37 INFO mapred.JobClient:   File Input Format Counters
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Read=10396521777
> > 11/09/27 16:54:37 INFO mapred.JobClient:   File Output Format Counters
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Written=917655979
> > 11/09/27 16:54:37 INFO mapred.JobClient:   FileSystemCounters
> > 11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_READ=3278262704
> > 11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_READ=10396613577
> > 11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6539342397
> > 11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=917655979
> > 11/09/27 16:54:37 INFO mapred.JobClient:   Map-Reduce Framework
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Map output materialized bytes=3250364133
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Map input records=7494536
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce shuffle bytes=3250360919
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Spilled Records=18455792
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Map output bytes=4421256434
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Map input bytes=10396451841
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Combine input records=42643906
> > 11/09/27 16:54:37 INFO mapred.JobClient:     SPLIT_RAW_BYTES=64218
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input records=6966070
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input groups=3065036
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Combine output records=13178184
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce output records=3065036
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Map output records=36431792
> > 
> web gui?  Is that something that's only available in deploy mode, or can
> you access it in local?

Perhaps in pseudo-distributed mode.
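
(For reference, with stock Hadoop settings the MapReduce web gui is the
JobTracker page, usually at http://<jobtracker-host>:50030/ ; in plain local
mode there is no JobTracker process, so there is nothing to look at.)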

> 
> > > I see.  I've been using 100 threads with 10 per host and it seems to
> > > saturate the current connection pretty well, and that's just from one
> > > machine.  Which is why I was wondering about your splitting of
> > > segments. What machine limitations have you run into?
> > 
> > 10 threads per host? That's a lot, doesn't seem polite to me. Segment
> > size doesn't affect bandwidth.
> > 
> When I run with less threads per host I get hung up in the fetching.  Some
> sites have more urls than others and the nice value kicks in and I end up
> with idle threads.  I was looking at the docs, but it seems the max urls
> per host is deprecated, so I'm not sure what settings to use in order to
> get them to distribute across the fetcher threads more evenly.

It's replaced by a new switch; check the config. This is in current 1.4:

  <property>
    <name>generate.count.mode</name>
    <value>domain</value>
  </property>
  <property>
    <name>generate.max.count</name>
    <value>10</value>
  </property>

1.4 also got a feature to kill threads when the fetcher wastes time on single
hosts.
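
For reference, the related fetcher knobs live in the same place; overrides go
into conf/nutch-site.xml, roughly like this (values only illustrative):

  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>2</value>
  </property>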

> 
> > BTW, do you know what the timeline is to have the documentation updated
> > for
> > 
> > > 1.3?
> > 
> > It is as we speak. Lewis did quite a good job for the wiki docs on 1.3.
> 
> Okay. Sounds good.

Re: Fwd: Understanding Nutch workflow

Posted by Bai Shen <ba...@gmail.com>.
I understand.  I'm not demanding or anything.  Just wondering what the
progress is.


Re: Fwd: Understanding Nutch workflow

Posted by Bai Shen <ba...@gmail.com>.
I'll see, but we're not at the point of deploying to a Hadoop cluster yet.


Re: Fwd: Understanding Nutch workflow

Posted by lewis john mcgibbney <le...@gmail.com>.
1.3 will cover 1.4. The main point was regarding the change in architecture
when taking into consideration the new runtime directory structure which was
introduced in Nutch 1.3.

Feel free to join me on getting a Hadoop tutorial for 1.4. It's been on the
agenda but somewhat shelved.



-- 
*Lewis*

Re: Fwd: Understanding Nutch workflow

Posted by Bai Shen <ba...@gmail.com>.
Gotcha.  Maybe I'll see about starting a 1.4 version of the tutorial.  Not
sure if I'll have time, though.


Re: Fwd: Understanding Nutch workflow

Posted by lewis john mcgibbney <le...@gmail.com>.
Thanks, this is now sorted out.

For reference, you can sign up and commit your own changes to the Nutch wiki.


Thanks for pointing this out.

lewis



-- 
*Lewis*

Re: Fwd: Understanding Nutch workflow

Posted by Bai Shen <ba...@gmail.com>.
BTW, found a typo in the tutorial.  It has the following.

bin/nutch parse $1


And it should be this.

bin/nutch parse $s1




Re: Fwd: Understanding Nutch workflow

Posted by lewis john mcgibbney <le...@gmail.com>.
OK I understand. I think the main task with the documentation was that the
fundamental architecture changed between Nutch 1.2 & 1.3. Generally speaking
the community appears to understand this, but I fully recognise that there
is still a good deal of work to do. Thanks for pointing this out.

Lewis



-- 
*Lewis*

Re: Fwd: Understanding Nutch workflow

Posted by Bai Shen <ba...@gmail.com>.
I'm not looking for anything to be created.  It's just that a lot of the
documentation seems to be marked as needing updates for 1.3 and I was
wondering what the timeline for completing it was.


Re: Fwd: Understanding Nutch workflow

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi,

What documentation apart from what we have marked as in construction or TODO
on the wiki would you like to see created?

It has been a pretty long process getting these resources up-to-date,
however we are getting there!

>
> >
> > BTW, do you know what the timeline is to have the documentation updated
> for
> > 1.3?
>
> It is as we speak. Lewis did quite a good job for the wiki docs on 1.3.
>



-- 
*Lewis*

Re: Fwd: Understanding Nutch workflow

Posted by Bai Shen <ba...@gmail.com>.
>

> > How do I see the outpud of the mapred job?  I don't recall seeing
> anything
> > like that in the log file.
>
> This output on stdout, which can be viewed realtime using the web gui:
> 11/09/27 16:54:35 INFO mapred.JobClient: Job complete:
> job_201109261414_0039
> 11/09/27 16:54:37 INFO mapred.JobClient: Counters: 27
> 11/09/27 16:54:37 INFO mapred.JobClient:   Job Counters
> 11/09/27 16:54:37 INFO mapred.JobClient:     Launched reduce tasks=9
> 11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4561078
> 11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all
> reduces
> waiting after reserving slots (ms)=0
> 11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 11/09/27 16:54:37 INFO mapred.JobClient:     Rack-local map tasks=2
> 11/09/27 16:54:37 INFO mapred.JobClient:     Launched map tasks=417
> 11/09/27 16:54:37 INFO mapred.JobClient:     Data-local map tasks=415
> 11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=6166304
> 11/09/27 16:54:37 INFO mapred.JobClient:   File Input Format Counters
> 11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Read=10396521777
> 11/09/27 16:54:37 INFO mapred.JobClient:   File Output Format Counters
> 11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Written=917655979
> 11/09/27 16:54:37 INFO mapred.JobClient:   FileSystemCounters
> 11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_READ=3278262704
> 11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_READ=10396613577
> 11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6539342397
> 11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=917655979
> 11/09/27 16:54:37 INFO mapred.JobClient:   Map-Reduce Framework
> 11/09/27 16:54:37 INFO mapred.JobClient:     Map output materialized
> bytes=3250364133
> 11/09/27 16:54:37 INFO mapred.JobClient:     Map input records=7494536
> 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce shuffle
> bytes=3250360919
> 11/09/27 16:54:37 INFO mapred.JobClient:     Spilled Records=18455792
> 11/09/27 16:54:37 INFO mapred.JobClient:     Map output bytes=4421256434
> 11/09/27 16:54:37 INFO mapred.JobClient:     Map input bytes=10396451841
> 11/09/27 16:54:37 INFO mapred.JobClient:     Combine input records=42643906
> 11/09/27 16:54:37 INFO mapred.JobClient:     SPLIT_RAW_BYTES=64218
> 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input records=6966070
> 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input groups=3065036
> 11/09/27 16:54:37 INFO mapred.JobClient:     Combine output
> records=13178184
> 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce output records=3065036
> 11/09/27 16:54:37 INFO mapred.JobClient:     Map output records=36431792
>
web gui?  Is that something that's only available in deploy mode, or can
you access it in local?


> > I see.  I've been using 100 threads with 10 per host and it seems to
> > saturate the current connection pretty well, and that's just from one
> > machine.  Which is why I was wondering about your splitting of segments.
> > What machine limitations have you run into?
>
> 10 threads per host? That's a lot, doesn't seem polite to me. Segment size
> doesn't affect bandwidth.
>
When I run with less threads per host I get hung up in the fetching.  Some
sites have more urls than others and the nice value kicks in and I end up
with idle threads.  I was looking at the docs, but it seems the max urls per
host is deprecated, so I'm not sure what settings to use in order to get
them to distribute across the fetcher threads more evenly.

> BTW, do you know what the timeline is to have the documentation updated
> for
> > 1.3?
>
> It is as we speak. Lewis did quite a good job for the wiki docs on 1.3.
>

Okay. Sounds good.

Re: Fwd: Understanding Nutch workflow

Posted by Markus Jelsma <ma...@openindex.io>.
> Not sure why gmail keeps sending my replies to people instead of back to
> the list.  Have to keep a better eye out for it.
> 
> ---------- Forwarded message ----------
> From: Bai Shen <ba...@gmail.com>
> Date: Tue, Sep 27, 2011 at 1:38 PM
> Subject: Re: Understanding Nutch workflow
> To: markus.jelsma@openindex.io
> 
> > > I didn't mean that the segment would contain every unfetched url that
> > > was in the db, if that's what you mean.
> > > 
> > > I don't think I've hit more than 5000 urls in my current segments.  At
> > > least that's the highest I've seen the queue.  Is there a way to
> > 
> > determine
> > 
> > > how many urls are in a segment?
> > 
> > Sure, segment X contains the same number of URL's as there are reduce
> > output
> > records in the partioner job for X. You can see that statistic in the
> > output
> > of every mapred job.
> 
> How do I see the outpud of the mapred job?  I don't recall seeing anything
> like that in the log file.

This output on stdout, which can be viewed realtime using the web gui:
11/09/27 16:54:35 INFO mapred.JobClient: Job complete: job_201109261414_0039
11/09/27 16:54:37 INFO mapred.JobClient: Counters: 27
11/09/27 16:54:37 INFO mapred.JobClient:   Job Counters 
11/09/27 16:54:37 INFO mapred.JobClient:     Launched reduce tasks=9
11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4561078
11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all reduces 
waiting after reserving slots (ms)=0
11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all maps 
waiting after reserving slots (ms)=0
11/09/27 16:54:37 INFO mapred.JobClient:     Rack-local map tasks=2
11/09/27 16:54:37 INFO mapred.JobClient:     Launched map tasks=417
11/09/27 16:54:37 INFO mapred.JobClient:     Data-local map tasks=415
11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=6166304
11/09/27 16:54:37 INFO mapred.JobClient:   File Input Format Counters 
11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Read=10396521777
11/09/27 16:54:37 INFO mapred.JobClient:   File Output Format Counters 
11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Written=917655979
11/09/27 16:54:37 INFO mapred.JobClient:   FileSystemCounters
11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_READ=3278262704
11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_READ=10396613577
11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6539342397
11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=917655979
11/09/27 16:54:37 INFO mapred.JobClient:   Map-Reduce Framework
11/09/27 16:54:37 INFO mapred.JobClient:     Map output materialized 
bytes=3250364133
11/09/27 16:54:37 INFO mapred.JobClient:     Map input records=7494536
11/09/27 16:54:37 INFO mapred.JobClient:     Reduce shuffle bytes=3250360919
11/09/27 16:54:37 INFO mapred.JobClient:     Spilled Records=18455792
11/09/27 16:54:37 INFO mapred.JobClient:     Map output bytes=4421256434
11/09/27 16:54:37 INFO mapred.JobClient:     Map input bytes=10396451841
11/09/27 16:54:37 INFO mapred.JobClient:     Combine input records=42643906
11/09/27 16:54:37 INFO mapred.JobClient:     SPLIT_RAW_BYTES=64218
11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input records=6966070
11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input groups=3065036
11/09/27 16:54:37 INFO mapred.JobClient:     Combine output records=13178184
11/09/27 16:54:37 INFO mapred.JobClient:     Reduce output records=3065036
11/09/27 16:54:37 INFO mapred.JobClient:     Map output records=36431792


> 
> > > What kind of connection do you use to fetch 500k urls?  What are your
> > > fetcher threads set to?
> > 
> > We usually don't exceed 30mbit/second in short bursts per node with 128
> > threads. This only happens for many small fetch queues, e.g. a few URL's
> > (e.g.
> > 2) for 250.000 domains. Then it's fast.
> > 
> > I see.  I've been using 100 threads with 10 per host and it seems to
> 
> saturate the current connection pretty well, and that's just from one
> machine.  Which is why I was wondering about your splitting of segments.
> What machine limitations have you run into?

10 threads per host? That's a lot, doesn't seem polite to me. Segment size 
doesn't affect bandwidth.

> 
> > > So the downloaded data gets stored in the segment directories, not the
> > > mapreduce temp files?  Why does mapreduce get so large then?
> > 
> > It is stored in the tmp during the job and writte to to the segment in
> > the reducer.
> > 
> > The mapred jobs require a factor of four for overhead?  The fetch
> 
> downloaded 12GB of data, but the mapred dir was around 50GB(I think).  Just
> trying to understand what it's doing to use all that space.
> 
> > > And any parse filter plugins are only used to search for urls, right? 
> > > So if I'm worried about additional indexing, this is not the place to
> > > be looking, correct?
> > 
> > Nono, a parse filter can, for instance, extract information from the
> > parsed DOM such as headings, meta elements or whatever and output it as
> > a field.
> 
> I see.  I don't think I'll need that, but we'll see once I get the rest
> working.
> 
> >  > What do you mean?  What is the current schema if not schema.xml?  My
> > > 
> > > understanding is that the schema.xml file in the Nutch conf dir should
> > > be the same as the schema.xml file in Solr.
> > 
> > The provided schema file is only an example, Nutch does not use it but
> > Solr does. You must copy the schema from Nutch to Solr, that's all. We
> > ship it for
> > completeness. Later we might ship other Solr files for better integration
> > on
> > the Solr side such as Velocity template files.
> > 
> > Ah, okay.  So I just need to add the Nutch fields to my current Solr
> 
> schema.

Yes.

> 
> > > If I want to modify and add additional indexing, how would I set that
> > > up? I swapped out the schema.xml file, but wasn't able to get the
> > > solrindex command to work.  It kicked back the error that I was
> > > missing the site field.
> > 
> > If you want to add new fields  you must create or modify indexing plugins
> > such
> > as index-basic, index-more, index-anchor.
> 
> Okay.  I'll take a look at the plugin documentation and see if I can figure
> that out.
> 
> 
> 
> BTW, do you know what the timeline is to have the documentation updated for
> 1.3?

It is as we speak. Lewis did quite a good job for the wiki docs on 1.3.

Fwd: Understanding Nutch workflow

Posted by Bai Shen <ba...@gmail.com>.
Not sure why gmail keeps sending my replies to people instead of back to the
list.  Have to keep a better eye out for it.

---------- Forwarded message ----------
From: Bai Shen <ba...@gmail.com>
Date: Tue, Sep 27, 2011 at 1:38 PM
Subject: Re: Understanding Nutch workflow
To: markus.jelsma@openindex.io


>

> > I didn't mean that the segment would contain every unfetched url that was
> > in the db, if that's what you mean.
> >
> > I don't think I've hit more than 5000 urls in my current segments.  At
> > least that's the highest I've seen the queue.  Is there a way to
> determine
> > how many urls are in a segment?
>
> Sure, segment X contains the same number of URL's as there are reduce
> output
> records in the partioner job for X. You can see that statistic in the
> output
> of every mapred job.
>
>
How do I see the output of the mapred job?  I don't recall seeing anything
like that in the log file.


>  >
> > What kind of connection do you use to fetch 500k urls?  What are your
> > fetcher threads set to?
>
> We usually don't exceed 30mbit/second in short bursts per node with 128
> threads. This only happens for many small fetch queues, e.g. a few URL's
> (e.g.
> 2) for 250.000 domains. Then it's fast.
>
> I see.  I've been using 100 threads with 10 per host and it seems to
saturate the current connection pretty well, and that's just from one
machine.  Which is why I was wondering about your splitting of segments.
What machine limitations have you run into?


> > So the downloaded data gets stored in the segment directories, not the
> > mapreduce temp files?  Why does mapreduce get so large then?
>
> It is stored in the tmp during the job and writte to to the segment in the
> reducer.
>
> The mapred jobs require a factor of four for overhead?  The fetch
downloaded 12GB of data, but the mapred dir was around 50GB (I think).  Just
trying to understand what it's doing to use all that space.


> >
> > And any parse filter plugins are only used to search for urls, right?  So
> > if I'm worried about additional indexing, this is not the place to be
> > looking, correct?
>
> Nono, a parse filter can, for instance, extract information from the parsed
> DOM such as headings, meta elements or whatever and output it as a field.
>
>
I see.  I don't think I'll need that, but we'll see once I get the rest
working.


>  > What do you mean?  What is the current schema if not schema.xml?  My
> > understanding is that the schema.xml file in the Nutch conf dir should be
> > the same as the schema.xml file in Solr.
>
> >
> The provided schema file is only an example, Nutch does not use it but Solr
> does. You must copy the schema from Nutch to Solr, that's all. We ship it
> for
> completeness. Later we might ship other Solr files for better integration
> on
> the Solr side such as Velocity template files.
>
> Ah, okay.  So I just need to add the Nutch fields to my current Solr
schema.


>  >
> > If I want to modify and add additional indexing, how would I set that up?
> > I swapped out the schema.xml file, but wasn't able to get the solrindex
> > command to work.  It kicked back the error that I was missing the site
> > field.
>
> If you want to add new fields  you must create or modify indexing plugins
> such
> as index-basic, index-more, index-anchor.
>

Okay.  I'll take a look at the plugin documentation and see if I can figure
that out.



BTW, do you know what the timeline is to have the documentation updated for
1.3?

Re: Understanding Nutch workflow

Posted by Markus Jelsma <ma...@openindex.io>.
> On Tue, Sep 27, 2011 at 11:52 AM, Markus Jelsma
> 
> <ma...@openindex.io>wrote:
> > > I'm trying to understand exactly what the Nutch workflow is and I have
> > > a few questions.  From the tutorial:
> > > 
> > > bin/nutch inject crawldb urls
> > > 
> > > This takes a list of urls and creates a database of urls for nutch to
> > > fetch.
> > 
> > Yes, but it also merges if crawldb already exists.
> 
> Ah, right.
> 
> > > bin/nutch generate crawldb segments
> > > 
> > > This generates a segment which contains all of the urls that need to be
> > > fetched.  From what I understand, you can generate multiple segments,
> > > but I'm not sure I see the value in that as the fetching is primarily
> > > limited by your connection, not any one machine.
> > 
> > It does not contain all URL's due for fetch.  The generator is limited by
> > many
> > options. Check nutch-default for settings and descriptions. Generating
> > multiple segments is useful. We prefer, due to hardware limitations,
> > segments
> > of no more than 500.000 URL's each. So we create many small segments,
> > it's easier to handle.
> 
> I didn't mean that the segment would contain every unfetched url that was
> in the db, if that's what you mean.
> 
> I don't think I've hit more than 5000 urls in my current segments.  At
> least that's the highest I've seen the queue.  Is there a way to determine
> how many urls are in a segment?

Sure, segment X contains the same number of URL's as there are reduce output 
records in the partitioner job for X. You can see that statistic in the output 
of every mapred job.
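
Another way to get the same number, assuming the standard segment layout, is
the segment reader's list mode, which prints generated/fetched/parsed counts
per segment (the first path is just a placeholder for one of your segments):

bin/nutch readseg -list crawl/segments/20110927161054
bin/nutch readseg -list -dir crawl/segments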

> 
> What kind of connection do you use to fetch 500k urls?  What are your
> fetcher threads set to?

We usually don't exceed 30mbit/second in short bursts per node with 128 
threads. This only happens for many small fetch queues, e.g. a few URL's (e.g. 
2) for 250.000 domains. Then it's fast.

> 
> > > s1=`ls -d crawl/segments/2* | tail -1`
> > > echo $s1
> > > bin/nutch fetch $s1
> > > 
> > > This takes the segment and fetchs all of the content and stores it in
> > > hadoop for the mapreduce job.  I'm not quite sure how that works, as
> > > when I ran the fetch, my connection showed 12GB of data downloaded,
> > > yet the hadoop directory was using over 40GB of space.  Is this
> > > normal?
> > 
> > The content dir in the segments contains actually downloaded data with
> > some overhead. The rest is generated by the various jobs.
> 
> So the downloaded data gets stored in the segment directories, not the
> mapreduce temp files?  Why does mapreduce get so large then?

It is stored in the tmp during the job and written to the segment in the 
reducer.

> 
> > > bin/nutch parse $1
> > > 
> > > This parses the fetched data using hadoop in order to extract more urls
> > 
> > to
> > 
> > > fetch.  It doesn't do any actual indexing, however.  Is this correct?
> > 
> > Correct. It also executes optional parse filter plugins and normalizes
> > and filters all extracted URL's.
> 
> And any parse filter plugins are only used to search for urls, right?  So
> if I'm worried about additional indexing, this is not the place to be
> looking, correct?

No, a parse filter can, for instance, extract information from the parsed 
DOM such as headings, meta elements or whatever and output it as a field. 

> 
> > > bin/nutch invertlinks linkdb -dir segments
> > > 
> > > I'm not exactly sure what this does.  The tutorial says "Before
> > > indexing
> > 
> > we
> > 
> > > first invert all of the links, so that we may index incoming anchor
> > > text with the pages."  Does that mean that if there's a link such as <
> > > A HREF="url" > click here for more info < /A > that it adds the "click
> > > here for more info" to the database for indexing in addition to the
> > > actual
> > 
> > link
> > 
> > > content?
> > 
> > It's not required anymore in current Nutch 1.4-dev It builds a data
> > structure
> > of all URL's with all their inlinks and anchors. You can use this to do
> > better
> > scoring of relevance in your search engine.
> 
> Gotcha.  I'm still on 1.3, however, so I'll need to keep it in the process.

Sure. :)

> 
> > > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
> > > crawl/linkdb crawl/segments/*
> > > 
> > > This is where the actual indexing takes place, correct?  Or is Nutch
> > > just posting the various documents to Solr and leaving Solr to do the
> > > actual indexing?  Is this the only step that uses the schema.xml file?
> > 
> > Correct but does not use the schema.xml. It will index all fields as
> > dictated
> > by your index filter plugins. Check the current schema, it lists the
> > fields anddthe plugins used for those fields.
> 
> What do you mean?  What is the current schema if not schema.xml?  My
> understanding is that the schema.xml file in the Nutch conf dir should be
> the same as the schema.xml file in Solr.

The provided schema file is only an example, Nutch does not use it but Solr 
does. You must copy the schema from Nutch to Solr, that's all. We ship it for 
completeness. Later we might ship other Solr files for better integration on 
the Solr side such as Velocity template files.
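
In other words, something like this (paths only an example, adjust for your
Solr home) followed by a Solr restart or core reload:

cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/example/solr/conf/schema.xml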

> 
> If I want to modify and add additional indexing, how would I set that up? 
> I swapped out the schema.xml file, but wasn't able to get the solrindex
> command to work.  It kicked back the error that I was missing the site
> field.

If you want to add new fields, you must create or modify indexing plugins such 
as index-basic, index-more, index-anchor.
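
Which plugins actually run is governed by the plugin.includes property; a rough
sketch of what that looks like in conf/nutch-site.xml (start from the value
shipped in nutch-default.xml rather than this example):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>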

Re: Understanding Nutch workflow

Posted by Bai Shen <ba...@gmail.com>.
On Tue, Sep 27, 2011 at 11:52 AM, Markus Jelsma
<ma...@openindex.io>wrote:

>
> > I'm trying to understand exactly what the Nutch workflow is and I have a
> > few questions.  From the tutorial:
> >
> > bin/nutch inject crawldb urls
> >
> > This takes a list of urls and creates a database of urls for nutch to
> > fetch.
>
> Yes, but it also merges if crawldb already exists.
>
>
Ah, right.


> >
> > bin/nutch generate crawldb segments
> >
> > This generates a segment which contains all of the urls that need to be
> > fetched.  From what I understand, you can generate multiple segments, but
> > I'm not sure I see the value in that as the fetching is primarily limited
> > by your connection, not any one machine.
>
> It does not contain all URL's due for fetch.  The generator is limited by
> many
> options. Check nutch-default for settings and descriptions. Generating
> multiple segments is useful. We prefer, due to hardware limitations,
> segments
> of no more than 500.000 URL's each. So we create many small segments, it's
> easier to handle.
>
>
I didn't mean that the segment would contain every unfetched url that was in
the db, if that's what you mean.

I don't think I've hit more than 5000 urls in my current segments.  At least
that's the highest I've seen the queue.  Is there a way to determine how
many urls are in a segment?

What kind of connection do you use to fetch 500k urls?  What are your
fetcher threads set to?


> >
> > s1=`ls -d crawl/segments/2* | tail -1`
> > echo $s1
> > bin/nutch fetch $s1
> >
> > This takes the segment and fetchs all of the content and stores it in
> > hadoop for the mapreduce job.  I'm not quite sure how that works, as when
> > I ran the fetch, my connection showed 12GB of data downloaded, yet the
> > hadoop directory was using over 40GB of space.  Is this normal?
>
> The content dir in the segments contains actually downloaded data with some
> overhead. The rest is generated by the various jobs.
>
>
So the downloaded data gets stored in the segment directories, not the
mapreduce temp files?  Why does mapreduce get so large then?


> >
> > bin/nutch parse $1
> >
> > This parses the fetched data using hadoop in order to extract more urls
> to
> > fetch.  It doesn't do any actual indexing, however.  Is this correct?
>
> Correct. It also executes optional parse filter plugins and normalizes and
> filters all extracted URL's.
>
>
And any parse filter plugins are only used to search for urls, right?  So if
I'm worried about additional indexing, this is not the place to be looking,
correct?


> > bin/nutch invertlinks linkdb -dir segments
> >
> > I'm not exactly sure what this does.  The tutorial says "Before indexing
> we
> > first invert all of the links, so that we may index incoming anchor text
> > with the pages."  Does that mean that if there's a link such as < A
> > HREF="url" > click here for more info < /A > that it adds the "click here
> > for more info" to the database for indexing in addition to the actual
> link
> > content?
>
> It's not required anymore in current Nutch 1.4-dev It builds a data
> structure
> of all URL's with all their inlinks and anchors. You can use this to do
> better
> scoring of relevance in your search engine.
>
>
Gotcha.  I'm still on 1.3, however, so I'll need to keep it in the process.


> >
> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
> > crawl/linkdb crawl/segments/*
> >
> > This is where the actual indexing takes place, correct?  Or is Nutch just
> > posting the various documents to Solr and leaving Solr to do the actual
> > indexing?  Is this the only step that uses the schema.xml file?
>
> Correct but does not use the schema.xml. It will index all fields as
> dictated
> by your index filter plugins. Check the current schema, it lists the fields
> anddthe plugins used for those fields.
>
>
What do you mean?  What is the current schema if not schema.xml?  My
understanding is that the schema.xml file in the Nutch conf dir should be
the same as the schema.xml file in Solr.

If I want to modify and add additional indexing, how would I set that up?  I
swapped out the schema.xml file, but wasn't able to get the solrindex
command to work.  It kicked back the error that I was missing the site
field.

Re: Understanding Nutch workflow

Posted by Markus Jelsma <ma...@openindex.io>.
> I'm trying to understand exactly what the Nutch workflow is and I have a
> few questions.  From the tutorial:
> 
> bin/nutch inject crawldb urls
> 
> This takes a list of urls and creates a database of urls for nutch to
> fetch.

Yes, but it also merges if crawldb already exists.

> 
> bin/nutch generate crawldb segments
> 
> This generates a segment which contains all of the urls that need to be
> fetched.  From what I understand, you can generate multiple segments, but
> I'm not sure I see the value in that as the fetching is primarily limited
> by your connection, not any one machine.

It does not contain all URL's due for fetch.  The generator is limited by many 
options. Check nutch-default for settings and descriptions. Generating 
multiple segments is useful. We prefer, due to hardware limitations, segments 
of no more than 500.000 URL's each. So we create many small segments, it's 
easier to handle.

> 
> s1=`ls -d crawl/segments/2* | tail -1`
> echo $s1
> bin/nutch fetch $s1
> 
> This takes the segment and fetchs all of the content and stores it in
> hadoop for the mapreduce job.  I'm not quite sure how that works, as when
> I ran the fetch, my connection showed 12GB of data downloaded, yet the
> hadoop directory was using over 40GB of space.  Is this normal?

The content dir in the segments contains actually downloaded data with some 
overhead. The rest is generated by the various jobs.
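
Concretely, a finished segment ends up with several subdirectories, and only
content/ holds the raw downloaded data; the rest is bookkeeping written by the
generate/fetch/parse jobs (the timestamped path is just a placeholder):

ls crawl/segments/20110927161054
# typically: content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text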

> 
> bin/nutch parse $1
> 
> This parses the fetched data using hadoop in order to extract more urls to
> fetch.  It doesn't do any actual indexing, however.  Is this correct?

Correct. It also executes optional parse filter plugins and normalizes and 
filters all extracted URL's.
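
The normalize/filter step is driven mainly by conf/regex-urlfilter.txt (plus
any other urlfilter-* plugins you enable); a minimal sketch of the file format,
where the first matching rule decides and the domain is only an example:

# skip some binary file types by extension
-\.(gif|jpg|png|zip|gz)$
# stay within an example domain
+^http://([a-z0-9]*\.)*example.com/
# reject everything else
-.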

> 
> bin/nutch updatedb crawldb $s1
> 
> Now the parsed urls are added back to the initial database of urls.

Correct.

> 
> bin/nutch invertlinks linkdb -dir segments
> 
> I'm not exactly sure what this does.  The tutorial says "Before indexing we
> first invert all of the links, so that we may index incoming anchor text
> with the pages."  Does that mean that if there's a link such as < A
> HREF="url" > click here for more info < /A > that it adds the "click here
> for more info" to the database for indexing in addition to the actual link
> content?

It's not required anymore in current Nutch 1.4-dev. It builds a data structure 
of all URL's with all their inlinks and anchors. You can use this to do better 
scoring of relevance in your search engine.
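
If you want to eyeball what invertlinks produced (the inlinks and anchor text
per URL), the linkdb reader can dump it as plain text; paths are placeholders:

bin/nutch readlinkdb crawl/linkdb -dump linkdb-dump
bin/nutch readlinkdb crawl/linkdb -url http://example.com/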

> 
> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
> crawl/linkdb crawl/segments/*
> 
> This is where the actual indexing takes place, correct?  Or is Nutch just
> posting the various documents to Solr and leaving Solr to do the actual
> indexing?  Is this the only step that uses the schema.xml file?

Correct but does not use the schema.xml. It will index all fields as dictated 
by your index filter plugins. Check the current schema, it lists the fields 
and the plugins used for those fields.

> 
> 
> Thanks.