Posted to user@nutch.apache.org by Bai Shen <ba...@gmail.com> on 2011/09/27 17:24:43 UTC

Understanding Nutch workflow

I'm trying to understand exactly what the Nutch workflow is and I have a few
questions.  From the tutorial:

bin/nutch inject crawldb urls

This takes a list of urls and creates a database of urls for nutch to fetch.
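
As a side note, an easy way to sanity-check what inject produced (assuming the
default layout from the tutorial, with the db in ./crawldb) seems to be the
crawldb reader, which prints URL counts and fetch statuses:

bin/nutch readdb crawldb -stats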

bin/nutch generate crawldb segments

This generates a segment which contains all of the urls that need to be
fetched.  From what I understand, you can generate multiple segments, but
I'm not sure I see the value in that, as fetching is primarily limited by
your connection rather than by any one machine.
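
For what it's worth, the generator seems to accept a -topN option (on top of
the limits in nutch-default.xml) to cap how many URLs go into each segment,
which looks like the usual way to end up with several smaller segments, e.g.:

bin/nutch generate crawldb segments -topN 50000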

s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1

This takes the segment and fetches all of the content and stores it in Hadoop
for the mapreduce job.  I'm not quite sure how that works, as when I ran the
fetch, my connection showed 12GB of data downloaded, yet the hadoop
directory was using over 40GB of space.  Is this normal?

bin/nutch parse $1

This parses the fetched data using hadoop in order to extract more urls to
fetch.  It doesn't do any actual indexing, however.  Is this correct?

bin/nutch updatedb crawldb $s1

Now the parsed urls are added back to the initial database of urls.

bin/nutch invertlinks linkdb -dir segments

I'm not exactly sure what this does.  The tutorial says "Before indexing we
first invert all of the links, so that we may index incoming anchor text
with the pages."  Does that mean that if there's a link such as < A
HREF="url" > click here for more info < /A > that it adds the "click here
for more info" to the database for indexing in addition to the actual link
content?

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
crawl/linkdb crawl/segments/*

This is where the actual indexing takes place, correct?  Or is Nutch just
posting the various documents to Solr and leaving Solr to do the actual
indexing?  Is this the only step that uses the schema.xml file?


Thanks.

Re: Understanding Nutch workflow

Posted by Markus Jelsma <ma...@openindex.io>.
You can use the segment reader to read downloaded content.
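
For example, something along these lines should dump the parsed text of a
segment to a local directory (the segment path is picked the same way as in
the tutorial; the flags just suppress the other parts of the segment):

s1=`ls -d crawl/segments/2* | tail -1`
bin/nutch readseg -dump $s1 segdump -nocontent -nofetch -nogenerate -noparse -noparsedata

What remains in the dump is the extracted full text per URL; dropping
-nocontent also includes the raw downloaded content.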

> this is helpful -- can someone also explain whether there is mechanism to
> extract full text of pages from where they are stored in mapreduce?

Re: Understanding Nutch workflow

Posted by Fred Zimmerman <wf...@nimblebooks.com>.
this is helpful -- can someone also explain whether there is a mechanism to
extract full text of pages from where they are stored in mapreduce?



Re: Fwd: Understanding Nutch workflow

Posted by Markus Jelsma <ma...@openindex.io>.
> > > How do I see the outpud of the mapred job?  I don't recall seeing
> > 
> > anything
> > 
> > > like that in the log file.
> > 
> > This output on stdout, which can be viewed realtime using the web gui:
> > 11/09/27 16:54:35 INFO mapred.JobClient: Job complete:
> > job_201109261414_0039
> > 11/09/27 16:54:37 INFO mapred.JobClient: Counters: 27
> > 11/09/27 16:54:37 INFO mapred.JobClient:   Job Counters
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Launched reduce tasks=9
> > 11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4561078
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all
> > reduces
> > waiting after reserving slots (ms)=0
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all maps
> > waiting after reserving slots (ms)=0
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Rack-local map tasks=2
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Launched map tasks=417
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Data-local map tasks=415
> > 11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=6166304
> > 11/09/27 16:54:37 INFO mapred.JobClient:   File Input Format Counters
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Read=10396521777
> > 11/09/27 16:54:37 INFO mapred.JobClient:   File Output Format Counters
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Written=917655979
> > 11/09/27 16:54:37 INFO mapred.JobClient:   FileSystemCounters
> > 11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_READ=3278262704
> > 11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_READ=10396613577
> > 11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6539342397
> > 11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=917655979
> > 11/09/27 16:54:37 INFO mapred.JobClient:   Map-Reduce Framework
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Map output materialized bytes=3250364133
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Map input records=7494536
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce shuffle bytes=3250360919
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Spilled Records=18455792
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Map output bytes=4421256434
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Map input bytes=10396451841
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Combine input records=42643906
> > 11/09/27 16:54:37 INFO mapred.JobClient:     SPLIT_RAW_BYTES=64218
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input records=6966070
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input groups=3065036
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Combine output records=13178184
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce output records=3065036
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Map output records=36431792
> > 
> web gui?  Is that something that's only available in deploy mode, or can
> you access it in local?

Perhaps in pseudo-distributed mode.
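
(For reference, with stock Hadoop settings the MapReduce web gui is the
JobTracker page, usually at http://<jobtracker-host>:50030/ ; in plain local
mode there is no JobTracker process, so there is nothing to look at.)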

> 
> > > I see.  I've been using 100 threads with 10 per host and it seems to
> > > saturate the current connection pretty well, and that's just from one
> > > machine.  Which is why I was wondering about your splitting of
> > > segments. What machine limitations have you run into?
> > 
> > 10 threads per host? That's a lot, doesn't seem polite to me. Segment
> > size doesn't affect bandwidth.
> > 
> When I run with less threads per host I get hung up in the fetching.  Some
> sites have more urls than others and the nice value kicks in and I end up
> with idle threads.  I was looking at the docs, but it seems the max urls
> per host is deprecated, so I'm not sure what settings to use in order to
> get them to distribute across the fetcher threads more evenly.

It's replaced by a new switch; check the config. This is in current 1.4:

  <property>
    <name>generate.count.mode</name>
    <value>domain</value>
  </property>
  <property>
    <name>generate.max.count</name>
    <value>10</value>
  </property>

1.4 also got a feature to kill threads when the fetcher wastes time on single
hosts.
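
For reference, the related fetcher knobs live in the same place; overrides go
into conf/nutch-site.xml, roughly like this (values only illustrative):

  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>2</value>
  </property>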

> 
> > BTW, do you know what the timeline is to have the documentation updated
> > for
> > 
> > > 1.3?
> > 
> > It is as we speak. Lewis did quite a good job for the wiki docs on 1.3.
> 
> Okay. Sounds good.

Re: Fwd: Understanding Nutch workflow

Posted by Bai Shen <ba...@gmail.com>.
I understand.  I'm not demanding or anything.  Just wondering what the
progress is.


Re: Fwd: Understanding Nutch workflow

Posted by Bai Shen <ba...@gmail.com>.
I'll see, but we're not at the point of deploying to a Hadoop cluster yet.


Re: Fwd: Understanding Nutch workflow

Posted by lewis john mcgibbney <le...@gmail.com>.
1.3 will cover 1.4. The main point was regarding the change in architecture
when taking into consideration the new runtime directory structure which was
introduced in Nutch 1.3.

Feel free to join me on getting a Hadoop tutorial for 1.4. It's been on the
agenda but somewhat shelved.



-- 
*Lewis*

Re: Fwd: Understanding Nutch workflow

Posted by Bai Shen <ba...@gmail.com>.
Gotcha.  Maybe I'll see about starting a 1.4 version of the tutorial.  Not
sure if I'll have time, though.


Re: Fwd: Understanding Nutch workflow

Posted by lewis john mcgibbney <le...@gmail.com>.
Thanks, this is now sorted out.

For reference, you can sign up and commit your own changes to the Nutch wiki.


Thanks for pointing this out.

lewis



-- 
*Lewis*

Re: Fwd: Understanding Nutch workflow

Posted by Bai Shen <ba...@gmail.com>.
BTW, found a typo in the tutorial.  It has the following.

bin/nutch parse $1


And it should be this.

bin/nutch parse $s1




Re: Fwd: Understanding Nutch workflow

Posted by lewis john mcgibbney <le...@gmail.com>.
OK I understand. I think the main task with the documentation was that the
fundamental architecture changed between Nutch 1.2 & 1.3. Generally speaking
the community appears to understand this, but I fully recognise that there
is still a good deal of work to do. Thanks for pointing this out.

Lewis



-- 
*Lewis*

Re: Fwd: Understanding Nutch workflow

Posted by Bai Shen <ba...@gmail.com>.
I'm not looking for anything to be created.  It's just that a lot of the
documentation seems to be marked as needing updates for 1.3 and I was
wondering what the timeline for completing it was.


Re: Fwd: Understanding Nutch workflow

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi,

What documentation apart from what we have marked as in construction or TODO
on the wiki would you like to see created?

It has been a pretty long process getting these resources up-to-date,
however we are getting there!

>
> >
> > BTW, do you know what the timeline is to have the documentation updated
> for
> > 1.3?
>
> It is as we speak. Lewis did quite a good job for the wiki docs on 1.3.
>



-- 
*Lewis*

Re: Fwd: Understanding Nutch workflow

Posted by Bai Shen <ba...@gmail.com>.
>

> > How do I see the outpud of the mapred job?  I don't recall seeing
> anything
> > like that in the log file.
>
> This output on stdout, which can be viewed realtime using the web gui:
> 11/09/27 16:54:35 INFO mapred.JobClient: Job complete:
> job_201109261414_0039
> 11/09/27 16:54:37 INFO mapred.JobClient: Counters: 27
> 11/09/27 16:54:37 INFO mapred.JobClient:   Job Counters
> 11/09/27 16:54:37 INFO mapred.JobClient:     Launched reduce tasks=9
> 11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4561078
> 11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all
> reduces
> waiting after reserving slots (ms)=0
> 11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 11/09/27 16:54:37 INFO mapred.JobClient:     Rack-local map tasks=2
> 11/09/27 16:54:37 INFO mapred.JobClient:     Launched map tasks=417
> 11/09/27 16:54:37 INFO mapred.JobClient:     Data-local map tasks=415
> 11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=6166304
> 11/09/27 16:54:37 INFO mapred.JobClient:   File Input Format Counters
> 11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Read=10396521777
> 11/09/27 16:54:37 INFO mapred.JobClient:   File Output Format Counters
> 11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Written=917655979
> 11/09/27 16:54:37 INFO mapred.JobClient:   FileSystemCounters
> 11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_READ=3278262704
> 11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_READ=10396613577
> 11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6539342397
> 11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=917655979
> 11/09/27 16:54:37 INFO mapred.JobClient:   Map-Reduce Framework
> 11/09/27 16:54:37 INFO mapred.JobClient:     Map output materialized
> bytes=3250364133
> 11/09/27 16:54:37 INFO mapred.JobClient:     Map input records=7494536
> 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce shuffle
> bytes=3250360919
> 11/09/27 16:54:37 INFO mapred.JobClient:     Spilled Records=18455792
> 11/09/27 16:54:37 INFO mapred.JobClient:     Map output bytes=4421256434
> 11/09/27 16:54:37 INFO mapred.JobClient:     Map input bytes=10396451841
> 11/09/27 16:54:37 INFO mapred.JobClient:     Combine input records=42643906
> 11/09/27 16:54:37 INFO mapred.JobClient:     SPLIT_RAW_BYTES=64218
> 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input records=6966070
> 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input groups=3065036
> 11/09/27 16:54:37 INFO mapred.JobClient:     Combine output
> records=13178184
> 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce output records=3065036
> 11/09/27 16:54:37 INFO mapred.JobClient:     Map output records=36431792
>
web gui?  Is that something that's only available in deploy mode, or can
you access it in local?


> > I see.  I've been using 100 threads with 10 per host and it seems to
> > saturate the current connection pretty well, and that's just from one
> > machine.  Which is why I was wondering about your splitting of segments.
> > What machine limitations have you run into?
>
> 10 threads per host? That's a lot, doesn't seem polite to me. Segment size
> doesn't affect bandwidth.
>
When I run with less threads per host I get hung up in the fetching.  Some
sites have more urls than others and the nice value kicks in and I end up
with idle threads.  I was looking at the docs, but it seems the max urls per
host is deprecated, so I'm not sure what settings to use in order to get
them to distribute across the fetcher threads more evenly.

> BTW, do you know what the timeline is to have the documentation updated
> for
> > 1.3?
>
> It is as we speak. Lewis did quite a good job for the wiki docs on 1.3.
>

Okay. Sounds good.

Re: Fwd: Understanding Nutch workflow

Posted by Markus Jelsma <ma...@openindex.io>.
> Not sure why gmail keeps sending my replies to people instead of back to
> the list.  Have to keep a better eye out for it.
> 
> ---------- Forwarded message ----------
> From: Bai Shen <ba...@gmail.com>
> Date: Tue, Sep 27, 2011 at 1:38 PM
> Subject: Re: Understanding Nutch workflow
> To: markus.jelsma@openindex.io
> 
> > > I didn't mean that the segment would contain every unfetched url that
> > > was in the db, if that's what you mean.
> > > 
> > > I don't think I've hit more than 5000 urls in my current segments.  At
> > > least that's the highest I've seen the queue.  Is there a way to
> > 
> > determine
> > 
> > > how many urls are in a segment?
> > 
> > Sure, segment X contains the same number of URL's as there are reduce
> > output
> > records in the partioner job for X. You can see that statistic in the
> > output
> > of every mapred job.
> 
> How do I see the outpud of the mapred job?  I don't recall seeing anything
> like that in the log file.

This output on stdout, which can be viewed realtime using the web gui:
11/09/27 16:54:35 INFO mapred.JobClient: Job complete: job_201109261414_0039
11/09/27 16:54:37 INFO mapred.JobClient: Counters: 27
11/09/27 16:54:37 INFO mapred.JobClient:   Job Counters 
11/09/27 16:54:37 INFO mapred.JobClient:     Launched reduce tasks=9
11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4561078
11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all reduces 
waiting after reserving slots (ms)=0
11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all maps 
waiting after reserving slots (ms)=0
11/09/27 16:54:37 INFO mapred.JobClient:     Rack-local map tasks=2
11/09/27 16:54:37 INFO mapred.JobClient:     Launched map tasks=417
11/09/27 16:54:37 INFO mapred.JobClient:     Data-local map tasks=415
11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=6166304
11/09/27 16:54:37 INFO mapred.JobClient:   File Input Format Counters 
11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Read=10396521777
11/09/27 16:54:37 INFO mapred.JobClient:   File Output Format Counters 
11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Written=917655979
11/09/27 16:54:37 INFO mapred.JobClient:   FileSystemCounters
11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_READ=3278262704
11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_READ=10396613577
11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6539342397
11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=917655979
11/09/27 16:54:37 INFO mapred.JobClient:   Map-Reduce Framework
11/09/27 16:54:37 INFO mapred.JobClient:     Map output materialized 
bytes=3250364133
11/09/27 16:54:37 INFO mapred.JobClient:     Map input records=7494536
11/09/27 16:54:37 INFO mapred.JobClient:     Reduce shuffle bytes=3250360919
11/09/27 16:54:37 INFO mapred.JobClient:     Spilled Records=18455792
11/09/27 16:54:37 INFO mapred.JobClient:     Map output bytes=4421256434
11/09/27 16:54:37 INFO mapred.JobClient:     Map input bytes=10396451841
11/09/27 16:54:37 INFO mapred.JobClient:     Combine input records=42643906
11/09/27 16:54:37 INFO mapred.JobClient:     SPLIT_RAW_BYTES=64218
11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input records=6966070
11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input groups=3065036
11/09/27 16:54:37 INFO mapred.JobClient:     Combine output records=13178184
11/09/27 16:54:37 INFO mapred.JobClient:     Reduce output records=3065036
11/09/27 16:54:37 INFO mapred.JobClient:     Map output records=36431792


> 
> > > What kind of connection do you use to fetch 500k urls?  What are your
> > > fetcher threads set to?
> > 
> > We usually don't exceed 30mbit/second in short bursts per node with 128
> > threads. This only happens for many small fetch queues, e.g. a few URL's
> > (e.g.
> > 2) for 250.000 domains. Then it's fast.
> > 
> > I see.  I've been using 100 threads with 10 per host and it seems to
> 
> saturate the current connection pretty well, and that's just from one
> machine.  Which is why I was wondering about your splitting of segments.
> What machine limitations have you run into?

10 threads per host? That's a lot, doesn't seem polite to me. Segment size 
doesn't affect bandwidth.

> 
> > > So the downloaded data gets stored in the segment directories, not the
> > > mapreduce temp files?  Why does mapreduce get so large then?
> > 
> > It is stored in the tmp during the job and writte to to the segment in
> > the reducer.
> > 
> > The mapred jobs require a factor of four for overhead?  The fetch
> 
> downloaded 12GB of data, but the mapred dir was around 50GB(I think).  Just
> trying to understand what it's doing to use all that space.
> 
> > > And any parse filter plugins are only used to search for urls, right? 
> > > So if I'm worried about additional indexing, this is not the place to
> > > be looking, correct?
> > 
> > Nono, a parse filter can, for instance, extract information from the
> > parsed DOM such as headings, meta elements or whatever and output it as
> > a field.
> 
> I see.  I don't think I'll need that, but we'll see once I get the rest
> working.
> 
> >  > What do you mean?  What is the current schema if not schema.xml?  My
> > > 
> > > understanding is that the schema.xml file in the Nutch conf dir should
> > > be the same as the schema.xml file in Solr.
> > 
> > The provided schema file is only an example, Nutch does not use it but
> > Solr does. You must copy the schema from Nutch to Solr, that's all. We
> > ship it for
> > completeness. Later we might ship other Solr files for better integration
> > on
> > the Solr side such as Velocity template files.
> > 
> > Ah, okay.  So I just need to add the Nutch fields to my current Solr
> 
> schema.

Yes.

> 
> > > If I want to modify and add additional indexing, how would I set that
> > > up? I swapped out the schema.xml file, but wasn't able to get the
> > > solrindex command to work.  It kicked back the error that I was
> > > missing the site field.
> > 
> > If you want to add new fields  you must create or modify indexing plugins
> > such
> > as index-basic, index-more, index-anchor.
> 
> Okay.  I'll take a look at the plugin documentation and see if I can figure
> that out.
> 
> 
> 
> BTW, do you know what the timeline is to have the documentation updated for
> 1.3?

It is as we speak. Lewis did quite a good job for the wiki docs on 1.3.

Fwd: Understanding Nutch workflow

Posted by Bai Shen <ba...@gmail.com>.
Not sure why gmail keeps sending my replies to people instead of back to the
list.  Have to keep a better eye out for it.

---------- Forwarded message ----------
From: Bai Shen <ba...@gmail.com>
Date: Tue, Sep 27, 2011 at 1:38 PM
Subject: Re: Understanding Nutch workflow
To: markus.jelsma@openindex.io


>

> > I didn't mean that the segment would contain every unfetched url that was
> > in the db, if that's what you mean.
> >
> > I don't think I've hit more than 5000 urls in my current segments.  At
> > least that's the highest I've seen the queue.  Is there a way to
> determine
> > how many urls are in a segment?
>
> Sure, segment X contains the same number of URL's as there are reduce
> output
> records in the partioner job for X. You can see that statistic in the
> output
> of every mapred job.
>
>
How do I see the output of the mapred job?  I don't recall seeing anything
like that in the log file.


>  >
> > What kind of connection do you use to fetch 500k urls?  What are your
> > fetcher threads set to?
>
> We usually don't exceed 30mbit/second in short bursts per node with 128
> threads. This only happens for many small fetch queues, e.g. a few URL's
> (e.g.
> 2) for 250.000 domains. Then it's fast.
>
> I see.  I've been using 100 threads with 10 per host and it seems to
saturate the current connection pretty well, and that's just from one
machine.  Which is why I was wondering about your splitting of segments.
What machine limitations have you run into?


> > So the downloaded data gets stored in the segment directories, not the
> > mapreduce temp files?  Why does mapreduce get so large then?
>
> It is stored in the tmp during the job and writte to to the segment in the
> reducer.
>
> The mapred jobs require a factor of four for overhead?  The fetch
downloaded 12GB of data, but the mapred dir was around 50GB (I think).  Just
trying to understand what it's doing to use all that space.


> >
> > And any parse filter plugins are only used to search for urls, right?  So
> > if I'm worried about additional indexing, this is not the place to be
> > looking, correct?
>
> Nono, a parse filter can, for instance, extract information from the parsed
> DOM such as headings, meta elements or whatever and output it as a field.
>
>
I see.  I don't think I'll need that, but we'll see once I get the rest
working.


>  > What do you mean?  What is the current schema if not schema.xml?  My
> > understanding is that the schema.xml file in the Nutch conf dir should be
> > the same as the schema.xml file in Solr.
>
> >
> The provided schema file is only an example, Nutch does not use it but Solr
> does. You must copy the schema from Nutch to Solr, that's all. We ship it
> for
> completeness. Later we might ship other Solr files for better integration
> on
> the Solr side such as Velocity template files.
>
> Ah, okay.  So I just need to add the Nutch fields to my current Solr
schema.


>  >
> > If I want to modify and add additional indexing, how would I set that up?
> > I swapped out the schema.xml file, but wasn't able to get the solrindex
> > command to work.  It kicked back the error that I was missing the site
> > field.
>
> If you want to add new fields  you must create or modify indexing plugins
> such
> as index-basic, index-more, index-anchor.
>

Okay.  I'll take a look at the plugin documentation and see if I can figure
that out.



BTW, do you know what the timeline is to have the documentation updated for
1.3?

Re: Understanding Nutch workflow

Posted by Markus Jelsma <ma...@openindex.io>.
> On Tue, Sep 27, 2011 at 11:52 AM, Markus Jelsma
> 
> <ma...@openindex.io>wrote:
> > > I'm trying to understand exactly what the Nutch workflow is and I have
> > > a few questions.  From the tutorial:
> > > 
> > > bin/nutch inject crawldb urls
> > > 
> > > This takes a list of urls and creates a database of urls for nutch to
> > > fetch.
> > 
> > Yes, but it also merges if crawldb already exists.
> 
> Ah, right.
> 
> > > bin/nutch generate crawldb segments
> > > 
> > > This generates a segment which contains all of the urls that need to be
> > > fetched.  From what I understand, you can generate multiple segments,
> > > but I'm not sure I see the value in that as the fetching is primarily
> > > limited by your connection, not any one machine.
> > 
> > It does not contain all URL's due for fetch.  The generator is limited by
> > many
> > options. Check nutch-default for settings and descriptions. Generating
> > multiple segments is useful. We prefer, due to hardware limitations,
> > segments
> > of no more than 500.000 URL's each. So we create many small segments,
> > it's easier to handle.
> 
> I didn't mean that the segment would contain every unfetched url that was
> in the db, if that's what you mean.
> 
> I don't think I've hit more than 5000 urls in my current segments.  At
> least that's the highest I've seen the queue.  Is there a way to determine
> how many urls are in a segment?

Sure, segment X contains the same number of URL's as there are reduce output 
records in the partitioner job for X. You can see that statistic in the output 
of every mapred job.
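
Another way to get the same number, assuming the standard segment layout, is
the segment reader's list mode, which prints generated/fetched/parsed counts
per segment (the first path is just a placeholder for one of your segments):

bin/nutch readseg -list crawl/segments/20110927161054
bin/nutch readseg -list -dir crawl/segments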

> 
> What kind of connection do you use to fetch 500k urls?  What are your
> fetcher threads set to?

We usually don't exceed 30mbit/second in short bursts per node with 128 
threads. This only happens for many small fetch queues, e.g. a few URL's (e.g. 
2) for 250.000 domains. Then it's fast.

> 
> > > s1=`ls -d crawl/segments/2* | tail -1`
> > > echo $s1
> > > bin/nutch fetch $s1
> > > 
> > > This takes the segment and fetchs all of the content and stores it in
> > > hadoop for the mapreduce job.  I'm not quite sure how that works, as
> > > when I ran the fetch, my connection showed 12GB of data downloaded,
> > > yet the hadoop directory was using over 40GB of space.  Is this
> > > normal?
> > 
> > The content dir in the segments contains actually downloaded data with
> > some overhead. The rest is generated by the various jobs.
> 
> So the downloaded data gets stored in the segment directories, not the
> mapreduce temp files?  Why does mapreduce get so large then?

It is stored in the tmp during the job and written to the segment in the 
reducer.

> 
> > > bin/nutch parse $1
> > > 
> > > This parses the fetched data using hadoop in order to extract more urls
> > 
> > to
> > 
> > > fetch.  It doesn't do any actual indexing, however.  Is this correct?
> > 
> > Correct. It also executes optional parse filter plugins and normalizes
> > and filters all extracted URL's.
> 
> And any parse filter plugins are only used to search for urls, right?  So
> if I'm worried about additional indexing, this is not the place to be
> looking, correct?

No, a parse filter can, for instance, extract information from the parsed 
DOM such as headings, meta elements or whatever and output it as a field. 

> 
> > > bin/nutch invertlinks linkdb -dir segments
> > > 
> > > I'm not exactly sure what this does.  The tutorial says "Before
> > > indexing
> > 
> > we
> > 
> > > first invert all of the links, so that we may index incoming anchor
> > > text with the pages."  Does that mean that if there's a link such as <
> > > A HREF="url" > click here for more info < /A > that it adds the "click
> > > here for more info" to the database for indexing in addition to the
> > > actual
> > 
> > link
> > 
> > > content?
> > 
> > It's not required anymore in current Nutch 1.4-dev It builds a data
> > structure
> > of all URL's with all their inlinks and anchors. You can use this to do
> > better
> > scoring of relevance in your search engine.
> 
> Gotcha.  I'm still on 1.3, however, so I'll need to keep it in the process.

Sure. :)

> 
> > > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
> > > crawl/linkdb crawl/segments/*
> > > 
> > > This is where the actual indexing takes place, correct?  Or is Nutch
> > > just posting the various documents to Solr and leaving Solr to do the
> > > actual indexing?  Is this the only step that uses the schema.xml file?
> > 
> > Correct but does not use the schema.xml. It will index all fields as
> > dictated
> > by your index filter plugins. Check the current schema, it lists the
> > fields anddthe plugins used for those fields.
> 
> What do you mean?  What is the current schema if not schema.xml?  My
> understanding is that the schema.xml file in the Nutch conf dir should be
> the same as the schema.xml file in Solr.

The provided schema file is only an example, Nutch does not use it but Solr 
does. You must copy the schema from Nutch to Solr, that's all. We ship it for 
completeness. Later we might ship other Solr files for better integration on 
the Solr side such as Velocity template files.
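
In other words, something like this (paths only an example, adjust for your
Solr home) followed by a Solr restart or core reload:

cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/example/solr/conf/schema.xml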

> 
> If I want to modify and add additional indexing, how would I set that up? 
> I swapped out the schema.xml file, but wasn't able to get the solrindex
> command to work.  It kicked back the error that I was missing the site
> field.

If you want to add new fields, you must create or modify indexing plugins such 
as index-basic, index-more, index-anchor.
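
Which plugins actually run is governed by the plugin.includes property; a rough
sketch of what that looks like in conf/nutch-site.xml (start from the value
shipped in nutch-default.xml rather than this example):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>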

Re: Understanding Nutch workflow

Posted by Bai Shen <ba...@gmail.com>.
On Tue, Sep 27, 2011 at 11:52 AM, Markus Jelsma
<ma...@openindex.io>wrote:

>
> > I'm trying to understand exactly what the Nutch workflow is and I have a
> > few questions.  From the tutorial:
> >
> > bin/nutch inject crawldb urls
> >
> > This takes a list of urls and creates a database of urls for nutch to
> > fetch.
>
> Yes, but it also merges if crawldb already exists.
>
>
Ah, right.


> >
> > bin/nutch generate crawldb segments
> >
> > This generates a segment which contains all of the urls that need to be
> > fetched.  From what I understand, you can generate multiple segments, but
> > I'm not sure I see the value in that as the fetching is primarily limited
> > by your connection, not any one machine.
>
> It does not contain all URL's due for fetch.  The generator is limited by
> many
> options. Check nutch-default for settings and descriptions. Generating
> multiple segments is useful. We prefer, due to hardware limitations,
> segments
> of no more than 500.000 URL's each. So we create many small segments, it's
> easier to handle.
>
>
I didn't mean that the segment would contain every unfetched url that was in
the db, if that's what you mean.

I don't think I've hit more than 5000 urls in my current segments.  At least
that's the highest I've seen the queue.  Is there a way to determine how
many urls are in a segment?

What kind of connection do you use to fetch 500k urls?  What are your
fetcher threads set to?


> >
> > s1=`ls -d crawl/segments/2* | tail -1`
> > echo $s1
> > bin/nutch fetch $s1
> >
> > This takes the segment and fetchs all of the content and stores it in
> > hadoop for the mapreduce job.  I'm not quite sure how that works, as when
> > I ran the fetch, my connection showed 12GB of data downloaded, yet the
> > hadoop directory was using over 40GB of space.  Is this normal?
>
> The content dir in the segments contains actually downloaded data with some
> overhead. The rest is generated by the various jobs.
>
>
So the downloaded data gets stored in the segment directories, not the
mapreduce temp files?  Why does mapreduce get so large then?


> >
> > bin/nutch parse $1
> >
> > This parses the fetched data using hadoop in order to extract more urls
> to
> > fetch.  It doesn't do any actual indexing, however.  Is this correct?
>
> Correct. It also executes optional parse filter plugins and normalizes and
> filters all extracted URL's.
>
>
And any parse filter plugins are only used to search for urls, right?  So if
I'm worried about additional indexing, this is not the place to be looking,
correct?


> > bin/nutch invertlinks linkdb -dir segments
> >
> > I'm not exactly sure what this does.  The tutorial says "Before indexing
> we
> > first invert all of the links, so that we may index incoming anchor text
> > with the pages."  Does that mean that if there's a link such as < A
> > HREF="url" > click here for more info < /A > that it adds the "click here
> > for more info" to the database for indexing in addition to the actual
> link
> > content?
>
> It's not required anymore in current Nutch 1.4-dev It builds a data
> structure
> of all URL's with all their inlinks and anchors. You can use this to do
> better
> scoring of relevance in your search engine.
>
>
Gotcha.  I'm still on 1.3, however, so I'll need to keep it in the process.


> >
> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
> > crawl/linkdb crawl/segments/*
> >
> > This is where the actual indexing takes place, correct?  Or is Nutch just
> > posting the various documents to Solr and leaving Solr to do the actual
> > indexing?  Is this the only step that uses the schema.xml file?
>
> Correct but does not use the schema.xml. It will index all fields as
> dictated
> by your index filter plugins. Check the current schema, it lists the fields
> anddthe plugins used for those fields.
>
>
What do you mean?  What is the current schema if not schema.xml?  My
understanding is that the schema.xml file in the Nutch conf dir should be
the same as the schema.xml file in Solr.

If I want to modify and add additional indexing, how would I set that up?  I
swapped out the schema.xml file, but wasn't able to get the solrindex
command to work.  It kicked back the error that I was missing the site
field.

Re: Understanding Nutch workflow

Posted by Markus Jelsma <ma...@openindex.io>.
> I'm trying to understand exactly what the Nutch workflow is and I have a
> few questions.  From the tutorial:
> 
> bin/nutch inject crawldb urls
> 
> This takes a list of urls and creates a database of urls for nutch to
> fetch.

Yes, but it also merges if crawldb already exists.

> 
> bin/nutch generate crawldb segments
> 
> This generates a segment which contains all of the urls that need to be
> fetched.  From what I understand, you can generate multiple segments, but
> I'm not sure I see the value in that as the fetching is primarily limited
> by your connection, not any one machine.

It does not contain all URL's due for fetch.  The generator is limited by many 
options. Check nutch-default for settings and descriptions. Generating 
multiple segments is useful. We prefer, due to hardware limitations, segments 
of no more than 500.000 URL's each. So we create many small segments, it's 
easier to handle.

> 
> s1=`ls -d crawl/segments/2* | tail -1`
> echo $s1
> bin/nutch fetch $s1
> 
> This takes the segment and fetchs all of the content and stores it in
> hadoop for the mapreduce job.  I'm not quite sure how that works, as when
> I ran the fetch, my connection showed 12GB of data downloaded, yet the
> hadoop directory was using over 40GB of space.  Is this normal?

The content dir in the segments contains actually downloaded data with some 
overhead. The rest is generated by the various jobs.
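
Concretely, a finished segment ends up with several subdirectories, and only
content/ holds the raw downloaded data; the rest is bookkeeping written by the
generate/fetch/parse jobs (the timestamped path is just a placeholder):

ls crawl/segments/20110927161054
# typically: content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text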

> 
> bin/nutch parse $1
> 
> This parses the fetched data using hadoop in order to extract more urls to
> fetch.  It doesn't do any actual indexing, however.  Is this correct?

Correct. It also executes optional parse filter plugins and normalizes and 
filters all extracted URL's.
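
The normalize/filter step is driven mainly by conf/regex-urlfilter.txt (plus
any other urlfilter-* plugins you enable); a minimal sketch of the file format,
where the first matching rule decides and the domain is only an example:

# skip some binary file types by extension
-\.(gif|jpg|png|zip|gz)$
# stay within an example domain
+^http://([a-z0-9]*\.)*example.com/
# reject everything else
-.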

> 
> bin/nutch updatedb crawldb $s1
> 
> Now the parsed urls are added back to the initial database of urls.

Correct.

> 
> bin/nutch invertlinks linkdb -dir segments
> 
> I'm not exactly sure what this does.  The tutorial says "Before indexing we
> first invert all of the links, so that we may index incoming anchor text
> with the pages."  Does that mean that if there's a link such as < A
> HREF="url" > click here for more info < /A > that it adds the "click here
> for more info" to the database for indexing in addition to the actual link
> content?

It's not required anymore in current Nutch 1.4-dev. It builds a data structure 
of all URL's with all their inlinks and anchors. You can use this to do better 
scoring of relevance in your search engine.
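
If you want to eyeball what invertlinks produced (the inlinks and anchor text
per URL), the linkdb reader can dump it as plain text; paths are placeholders:

bin/nutch readlinkdb crawl/linkdb -dump linkdb-dump
bin/nutch readlinkdb crawl/linkdb -url http://example.com/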

> 
> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
> crawl/linkdb crawl/segments/*
> 
> This is where the actual indexing takes place, correct?  Or is Nutch just
> posting the various documents to Solr and leaving Solr to do the actual
> indexing?  Is this the only step that uses the schema.xml file?

Correct but does not use the schema.xml. It will index all fields as dictated 
by your index filter plugins. Check the current schema, it lists the fields 
and the plugins used for those fields.

> 
> 
> Thanks.