Posted to user@nutch.apache.org by Bai Shen <ba...@gmail.com> on 2011/09/22 14:26:45 UTC

Nutch crawl vs other commands

So I was able to get Nutch up and working using the crawl command.  I set my
depth and topN and it ran and indexed the pages for me.

But now I'm trying to split out the separate pieces in order to distribute
them and add my own parser.  I'm running the following.

bin/nutch generate crawl/crawldb crawl/segments
export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
bin/nutch fetch $SEGMENT -noParsing
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize


I don't see any way to determine how deep to crawl.  Is this possible, or do
I have to manually manage the db?  And if so, how do I do that?

And as a side note, why does Nutch invoke hadoop during the fetch command
even though I have noParsing set?  After fetching my links, my machine
churns for around twenty minutes before finally ending, even though all the
fetch threads completed already.

Thanks.

Fwd: Nutch crawl vs other commands

Posted by Bai Shen <ba...@gmail.com>.
Not sure why this didn't go to the list.

---------- Forwarded message ----------
From: Markus Jelsma <ma...@openindex.io>
Date: Thu, Sep 22, 2011 at 3:17 PM
Subject: Re: Nutch crawl vs other commands
To: Bai Shen <ba...@gmail.com>


hi, reply to the list

> On Thu, Sep 22, 2011 at 2:01 PM, Markus Jelsma
>
> <ma...@openindex.io> wrote:
> > Not really. Once you've got many links pointing to each other, the
> > concept of depth no longer really applies. You don't have to manage the
> > DB manually, as it will regulate itself (or by using a custom fetch
> > scheduler). Nutch will select URLs due for fetch and will in the end
> > exhaust the full list of URLs, unless you're crawling the internet.
> > Fetched URLs will be refetched over time.
>
> So what's the best way to set up a schedule?  The fetching and parsing
> steps seem pretty linked due to the segments, etc.
>
> > Because the fetcher runs as a Hadoop mapred job. When the actual fetch
> > finishes Hadoop must write the contents, merge spilled records etc. This
> > is part of how mapred works.
>
> But shouldn't that be happening during the parse stage?  The fetcher is
> constantly writing out data to the mapred job while it's fetching.  Once
> it's done, that should be it AFAIK.  And then the parse command runs the
> mapred job.
>
> > Somewhere in the first 1.x version. Later it became a parse option that
> > actually never worked anyway until it was fixed in the current 1.4-dev.
> > Still, it's not recommended to parse during the fetch stage.
>
> -nods-  That's why I'm doing noParsing.  But you said that doesn't do
> anything anymore.  So what would the updated commands be for v1.3?
>
> > As I mentioned in the other reply, it writes out data:
> > http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java?view=markup
> >
> > This will take a while indeed and it won't log anything during its
> > execution.
>
> But that should be happening during the fetching, not after, right?
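
One simple answer to the scheduling question above: since each round is just
the same four commands, wrap them in a script and drive it from cron. A
minimal sketch, assuming a hypothetical wrapper script crawl-round.sh that
runs one generate/fetch/parse/updatedb round (see the loop sketch later in
this thread):

# run one crawl round every night at 02:00, appending output to a log
0 2 * * * /opt/nutch/bin/crawl-round.sh >> /var/log/nutch-round.log 2>&1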

Re: Nutch crawl vs other commands

Posted by Markus Jelsma <ma...@openindex.io>.
> I'm using 1.3.  This is a new setup, so I'm running the latest versions.
> 
> I did inject the urls already.  It's just that the part I was having issues
> with was the fetch, etc.  I'm using the steps at Lucid Imagination » Using
> Nutch with Solr <http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/>,
> except that I already had Nutch set up and configured.
> 
> When did noParsing change?  I noticed that the Nutch wiki is out of date,
> so I'm not sure what the current setups are.

Somewhere in the first 1.x version. Later it became a parse option that 
actually never worked anyway until it was fixed in the current 1.4-dev. Still, 
it's not recommended to parse during the fetch stage.

> 
> The log data made some mention of hadoop, but I don't remember what it was.
> I'll see if it happens again and post the message.

As I mentioned in the other reply, it writes out data:
http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java?view=markup

This will take a while indeed and it won't log anything during its execution.
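
For reference, a quick way to see what that output phase actually wrote is to
list the segment directory after each step. A sketch from memory of a typical
1.3 segment layout (directory names may differ slightly between versions):

ls $SEGMENT
# after generate: crawl_generate
# after fetch:    content, crawl_fetch
# after parse:    crawl_parse, parse_data, parse_text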

Re: Nutch crawl vs other commands

Posted by Markus Jelsma <ma...@openindex.io>.
excellent!

On Friday 23 September 2011 16:01:42 lewis john mcgibbney wrote:
> this has been fixed
> 
> Thanks for raising and looking into this, guys

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Nutch crawl vs other commands

Posted by lewis john mcgibbney <le...@gmail.com>.
excellent Bai,

thanks for pointing this out. It has been fixed.

On Fri, Sep 23, 2011 at 8:03 PM, Bai Shen <ba...@gmail.com> wrote:

> You need to change the two code blocks underneath that as well.  They still
> show the update before the parse.
>
> bin/nutch fetch $s2
> bin/nutch updatedb crawldb $s2
> bin/nutch parse $s2



-- 
*Lewis*

Re: Nutch crawl vs other commands

Posted by Bai Shen <ba...@gmail.com>.
You need to change the two code blocks underneath that as well.  They still
show the update before the parse.

bin/nutch fetch $s2
bin/nutch updatedb crawldb $s2
bin/nutch parse $s2



On Fri, Sep 23, 2011 at 10:01 AM, lewis john mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> this has been fixed
>
> Thanks for raising and looking into this, guys

Re: Nutch crawl vs other commands

Posted by lewis john mcgibbney <le...@gmail.com>.
this has been fixed

Thanks for raising and looking into this, guys

On Fri, Sep 23, 2011 at 2:32 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> Hmm, the wiki tutorial seems wrong. You must parse before updating any DB.



-- 
*Lewis*

Re: Nutch crawl vs other commands

Posted by Markus Jelsma <ma...@openindex.io>.
Hmm, the wiki tutorial seems wrong. You must parse before updating any DB.
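
So for a 1.3-style cycle the per-segment order should look like this (a
sketch using the same commands and $SEGMENT variable quoted earlier in the
thread):

bin/nutch generate crawl/crawldb crawl/segments
export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT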



On Friday 23 September 2011 15:15:45 Bai Shen wrote:
> I looked at the tutorial, and it's doing pretty much the same thing as the
> lucid link I referenced earlier.  It just leaves out the noParsing and also
> swaps the updatedb and parse commands.  Does the order make a difference?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Nutch crawl vs other commands

Posted by Bai Shen <ba...@gmail.com>.
Also, doing it the tutorial way is only giving me around 125 links at a
time.  Previously, the queue was in the 5k range.  As far as I can tell, the
only differences were not using the noParsing flag and not using the filter
and normalize flags.

On Fri, Sep 23, 2011 at 9:15 AM, Bai Shen <ba...@gmail.com> wrote:

> I looked at the tutorial, and it's doing pretty much the same thing as the
> lucid link I referenced earlier.  It just leaves out the noParsing and also
> swaps the updatedb and parse commands.  Does the order make a difference?

Re: Nutch crawl vs other commands

Posted by Bai Shen <ba...@gmail.com>.
I looked at the tutorial, and it's doing pretty much the same thing as the
lucid link I referenced earlier.  It just leaves out the noParsing and also
swaps the updatedb and parse commands.  Does the order make a difference?

On Fri, Sep 23, 2011 at 5:45 AM, lewis john mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Bai,
>
> I hope various comments have helped you somewhat; however, I have another
> small comment as well. Please see below.

Re: Nutch crawl vs other commands

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Bai,

I hope various comments have helped you somewhat; however, I have another
small comment as well. Please see below.

On Thu, Sep 22, 2011 at 6:08 PM, Bai Shen <ba...@gmail.com> wrote:

> I'm using 1.3.  This is a new setup, so I'm running the latest versions.
>
> I did inject the urls already.  It's just that the part I was having issues
> with was the fetch, etc.  I'm using the steps at Lucid Imagination » Using
> Nutch with Solr <http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/>,
> except that I already had Nutch set up and configured.
>
> When did noParsing change?  I noticed that the Nutch wiki is out of date,
> so
> I'm not sure what the current setups are.
>

You will find the official Nutch tutorial and command-line options (for what
you require) up to date; these can be found on the wiki. If you have
anything to add, please do.


> The log data made some mention of hadoop, but I don't remember what it was.
> I'll see if it happens again and post the message.



-- 
*Lewis*

Re: Nutch crawl vs other commands

Posted by Bai Shen <ba...@gmail.com>.
I'm using 1.3.  This is a new setup, so I'm running the latest versions.

I did inject the URLs already.  It's just that the part I was having issues
with was the fetch, etc.  I'm using the steps at Lucid Imagination » Using
Nutch with Solr <http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/>,
except that I already had Nutch set up and configured.

When did noParsing change?  I noticed that the Nutch wiki is out of date, so
I'm not sure what the current setups are.

The log data made some mention of hadoop, but I don't remember what it was.
I'll see if it happens again and post the message.

On Thu, Sep 22, 2011 at 10:44 AM, lewis john mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Bai,
>
> You haven't mentioned which Nutch version you're using... this would be
> good
> if you could.

Re: Nutch crawl vs other commands

Posted by Bai Shen <ba...@gmail.com>.
My hadoop.log file has this at the end.

2011-09-22 13:18:52,971 INFO  fetcher.Fetcher - -activeThreads=1,
spinWaiting=0, fetchQueues.totalSize=0
2011-09-22 13:18:53,971 INFO  fetcher.Fetcher - -activeThreads=1,
spinWaiting=0, fetchQueues.totalSize=0
2011-09-22 13:18:54,231 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=0
2011-09-22 13:18:54,972 INFO  fetcher.Fetcher - -activeThreads=0,
spinWaiting=0, fetchQueues.totalSize=0
2011-09-22 13:18:54,972 INFO  fetcher.Fetcher - -activeThreads=0
2011-09-22 13:19:23,425 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2011-09-22 13:31:37,660 INFO  fetcher.Fetcher - Fetcher: finished at
2011-09-22 13:31:37, elapsed: 00:24:06

I'm trying to figure out what's going on during that 10-15 minutes at the
end.  The machine keeps one core fully loaded during that time and nothing
shows up in the console.

On Thu, Sep 22, 2011 at 10:44 AM, lewis john mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Not sure why your machine is churning but this shouldn't be happening. Do
> you have any log data to suggest why this is the case?

Re: Nutch crawl vs other commands

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Bai,

You haven't mentioned which Nutch version you're using... it would be good
if you could include it.

You haven't injected any seed URLs into your crawldb. From memory I think
the -topN parameter should be passed to the generate command.

Just to note, it is not necessary to set noParsing while executing the fetch
command. This is already the default behaviour. Not sure why your machine is
churning, but this shouldn't be happening. Do you have any log data to
suggest why this is the case?
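
For example, to generate a segment capped at the 1000 highest-scoring URLs
(1000 is just an illustrative value):

bin/nutch generate crawl/crawldb crawl/segments -topN 1000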

On Thu, Sep 22, 2011 at 1:26 PM, Bai Shen <ba...@gmail.com> wrote:

> So I was able to get Nutch up and working using the crawl command.  I set my
> depth and topN and it ran and indexed the pages for me.



-- 
*Lewis*

Re: Nutch crawl vs other commands

Posted by Markus Jelsma <ma...@openindex.io>.
> So I was able to get Nutch up and working using the crawl command.  I set
> my depth and topN and it ran and indexed the pages for me.
> 
> But now I'm trying to split out the separate pieces in order to distribute
> them and add my own parser.  I'm running the following.
> 
> bin/nutch generate crawl/crawldb crawl/segments
> export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
> bin/nutch fetch $SEGMENT -noParsing
> bin/nutch parse $SEGMENT
> bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
> 
> 
> I don't see any way to determine how deep to crawl.  Is this possible, or
> do I have to manually manage the db?  And if so, how do I do that?

Not really. Once you've got many links pointing to each other, the concept
of depth no longer really applies. You don't have to manage the DB manually,
as it will regulate itself (or by using a custom fetch scheduler).
Nutch will select URLs due for fetch and will in the end exhaust the full
list of URLs, unless you're crawling the internet. Fetched URLs will be
refetched over time.
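
That said, if you want crawl-command-style depth behaviour from the separate
commands, a rough equivalent is simply to repeat the cycle a fixed number of
rounds, since each round fetches the links discovered by the previous one. A
minimal sketch (DEPTH and the -topN value are illustrative):

DEPTH=3
for i in `seq 1 $DEPTH`; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
done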

> 
> And as a side note, why does Nutch invoke hadoop during the fetch command
> even though I have noParsing set?  After fetching my links, my machine
> churns for around twenty minutes before finally ending, even though all the
> fetch threads completed already.

Because the fetcher runs as a Hadoop mapred job. When the actual fetch
finishes, Hadoop must write out the contents, merge spilled records, etc.
This is part of how mapred works.
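
If that post-fetch phase is painfully slow, one knob that may help (an
educated guess, not something verified against this setup) is the mapred
sort buffer: on the Hadoop 0.20.x line bundled with 1.3, io.sort.mb sets the
in-memory sort buffer, and raising it means fewer spills to merge. Something
like this in the Hadoop configuration:

<property>
  <!-- in-memory sort buffer in MB; the default is 100, larger = fewer spills -->
  <name>io.sort.mb</name>
  <value>256</value>
</property>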

> 
> Thanks.