Posted to user@nutch.apache.org by Cheng Li <ch...@usc.edu> on 2011/07/20 12:04:55 UTC

help, src modify to optimize the crawl

Hi,

    I tried to use Nutch to crawl craigslist. The seed list I use is:

http://losangeles.craigslist.org/wst/ctd/
http://losangeles.craigslist.org/sfv/ctd/
http://losangeles.craigslist.org/lac/ctd/
http://losangeles.craigslist.org/sgv/ctd/
http://losangeles.craigslist.org/lgb/ctd/
http://losangeles.craigslist.org/ant/ctd/

http://losangeles.craigslist.org/wst/cto/
http://losangeles.craigslist.org/sfv/cto/
http://losangeles.craigslist.org/lac/cto/
http://losangeles.craigslist.org/sgv/cto/
http://losangeles.craigslist.org/lgb/cto/
http://losangeles.craigslist.org/ant/cto/


  What I want to get are result pages like this one, for example,
http://losangeles.craigslist.org/lac/ctd/2501038362.html , which is a
specific car-for-sale page.
  What I DON'T want to get are result pages like this one, for example,
http://losangeles.craigslist.org/cta/ .

 However, in my crawl results I always get pages like
http://losangeles.craigslist.org/cta/ .

 Actually, I do get some of the pages I want from craigslist, but only
part of them, not all of them. I tried adjusting the crawl command-line
parameters, but it made little difference.
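
 (For reference, the kind of command line I have been running is
sketched below; the directory names and the depth/topN values are just
examples, not my exact settings.)

    # one-shot crawl: inject seeds from "urls/", crawl 3 hops deep,
    # fetching at most 1000 URLs per round
    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000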

 So what I plan to do is modify the crawl code in the Nutch source.
Where can I start? What kind of changes can I make in the source code
to optimize the crawl process?

-- 
Cheng Li

Re: help, src modify to optimize the crawl

Posted by lewis john mcgibbney <le...@gmail.com>.
I don't think this has anything to do with modifying the crawl src. In
fact, it doesn't have anything to do with optimization either. Try
using your URL filters, e.g. the regex URL filter
(conf/regex-urlfilter.txt).

It is important to try to understand what types of pages we can filter
out of a Nutch crawl using the filters provided.
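
For example, here is a minimal sketch of conf/regex-urlfilter.txt for
your seeds. The patterns assume Craigslist's URL layout as shown in
your seed list, so verify them against real URLs; rules are applied
top to bottom and the first matching rule wins:

    # keep the section index pages so the crawler can discover listing links
    +^http://losangeles\.craigslist\.org/(wst|sfv|lac|sgv|lgb|ant)/(ctd|cto)/$
    # keep individual listing pages, e.g. .../lac/ctd/2501038362.html
    +^http://losangeles\.craigslist\.org/(wst|sfv|lac|sgv|lgb|ant)/(ctd|cto)/[0-9]+\.html$
    # reject everything else (replace the default "+." at the end of the file)
    -.

Note that the section index pages still have to be fetched so that
their outlinks can be followed; if you don't want them in your search
results, filter them out at indexing or query time rather than at
fetch time.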

HTH

-- 
*Lewis*