Posted to user@nutch.apache.org by Jeff Zhou <je...@gmail.com> on 2011/02/18 14:40:37 UTC

What is the end point of a pure crawl?

Hi,

I want to separate parsing from crawling in Nutch. In other words, I want to
crawl thousands of URLs and save the contents on a local drive, then parse the
contents after crawling is completed.

What is the end point (Java class, line of code, etc.) for the crawling?

Thanks,
Jeff

Re: What is the end point of a pure crawl?

Posted by Alexander Aristov <al...@gmail.com>.
hi,

don't use the crawl command. It's a one-stop command that does everything
from injection to indexing.

Instead, use the dedicated commands: inject, generate, fetch, and so on.

See the step-by-step section of the Nutch tutorial:
http://wiki.apache.org/nutch/NutchTutorial
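As a rough sketch of how those dedicated commands chain together for one round
(directory names are illustrative, and exact options vary between Nutch
versions, so check bin/nutch usage first):

```shell
# One round of the step-by-step workflow; paths are illustrative.
bin/nutch inject crawl/crawldb urls                # seed the crawldb from the urls dir
bin/nutch generate crawl/crawldb crawl/segments    # write a fetch list into a new segment
s=$(ls -d crawl/segments/* | tail -1)              # pick the newest segment
bin/nutch fetch "$s"                               # download content only (parsing can
                                                   # be deferred, depending on config)
bin/nutch parse "$s"                               # parse the fetched content later
bin/nutch updatedb crawl/crawldb "$s"              # fold parse results back into crawldb
```

Indexing is then a separate step again; the exact command depends on your
version and backend (e.g. solrindex in the Solr-based setups).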


Best Regards
Alexander Aristov


On 20 February 2011 05:42, jeffersonzhou <je...@gmail.com> wrote:

> When I run the following command, Nutch does everything from URL injection
> to indexing.
>
> bin/nutch crawl urls -dir crawl_dir -depth 5 -topN 10
>
> I want to separate crawling from parsing from indexing by using dedicated
> commands. To be more specific, I want to focus on crawling first. When
> crawling is done, I want to run parsing and update the crawlDB. After all
> are done, I want to run indexing. To do so, I will need to know what the
> end points for each of the three modules are.
>
> Hope that clarifies what I want to do.
>
> Please advise!
>
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> Sent: Friday, February 18, 2011 8:57 AM
> To: user@nutch.apache.org
> Cc: Jeff Zhou
> Subject: Re: What is the end point of a pure crawl?
>
> I'm not sure what you mean, but generating a segment and just fetching it
> (possibly with the -noParse option, depending on your config) will just
> download the URLs into the segment.
>
> On Friday 18 February 2011 14:40:37 Jeff Zhou wrote:
> > Hi,
> >
> > I want to separate parsing from crawling in Nutch. In other words, I want
> > to crawl thousands of URLs and save the contents on a local drive, and
> > parse the contents later after crawling is completed.
> >
> > What is the end point (Java class, line of code, etc.) for the crawling?
> >
> > Thanks,
> > Jeff
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
>

RE: What is the end point of a pure crawl?

Posted by jeffersonzhou <je...@gmail.com>.
When I run the following command, Nutch does everything from URL injection
to indexing.

bin/nutch crawl urls -dir crawl_dir -depth 5 -topN 10

I want to separate crawling from parsing from indexing by using dedicated
commands. To be more specific, I want to focus on crawling first. When
crawling is done, I want to run parsing and update the crawlDB. After all
are done, I want to run indexing. To do so, I will need to know what the
end points for each of the three modules are.

Hope that clarifies what I want to do.

Please advise!
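One caveat with deferring parsing completely: in Nutch, outlinks are extracted
at parse time, so a fetch-only phase can only cover URLs already in the
crawldb. To go several levels deep while still keeping indexing separate, the
parse/updatedb steps need to run inside each round. A rough sketch of the
crawl phase (paths and -topN are illustrative, mirroring the crawl example
above; check bin/nutch usage for your version):

```shell
# Phase 1: the equivalent of "-depth 5" done by hand -- five rounds of
# generate/fetch/parse/updatedb.  Parsing per round is what discovers the
# outlinks that the next generate needs; indexing is still deferred.
for depth in 1 2 3 4 5; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 10
  s=$(ls -d crawl/segments/* | tail -1)   # newest segment from this round
  bin/nutch fetch "$s"
  bin/nutch parse "$s"
  bin/nutch updatedb crawl/crawldb "$s"
done
# Phase 2, later: run indexing as its own step over the crawldb and segments.
```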

-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Friday, February 18, 2011 8:57 AM
To: user@nutch.apache.org
Cc: Jeff Zhou
Subject: Re: What is the end point of a pure crawl?

I'm not sure what you mean, but generating a segment and just fetching it
(possibly with the -noParse option, depending on your config) will just
download the URLs into the segment.

On Friday 18 February 2011 14:40:37 Jeff Zhou wrote:
> Hi,
> 
> I want to separate parsing from crawling in Nutch. In other words, I want
> to crawl thousands of URLs and save the contents on a local drive, and
> parse the contents later after crawling is completed.
> 
> What is the end point (Java class, line of code, etc.) for the crawling?
> 
> Thanks,
> Jeff

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: What is the end point of a pure crawl?

Posted by Markus Jelsma <ma...@openindex.io>.
I'm not sure what you mean, but generating a segment and just fetching it
(possibly with the -noParse option, depending on your config) will just
download the URLs into the segment.
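If the configuration route is taken rather than a fetch flag, the property
involved is typically fetcher.parse in Nutch 1.x (an assumption here; verify
the exact name against your conf/nutch-default.xml). Setting it to false in
nutch-site.xml makes the fetcher store content without parsing it:

```xml
<!-- nutch-site.xml: store fetched content without parsing it during fetch.
     Property name assumed from Nutch 1.x defaults; verify in your
     conf/nutch-default.xml before relying on it. -->
<property>
  <name>fetcher.parse</name>
  <value>false</value>
</property>
```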

On Friday 18 February 2011 14:40:37 Jeff Zhou wrote:
> Hi,
> 
> I want to separate parsing from crawling in Nutch. In other words, I want
> to crawl thousands of URLs and save the contents on a local drive, and
> parse the contents later after crawling is completed.
> 
> What is the end point (Java class, line of code, etc.) for the crawling?
> 
> Thanks,
> Jeff

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350