Posted to user@nutch.apache.org by 高睿 <ga...@163.com> on 2013/01/15 06:07:11 UTC

What urls does Nutch crawl?

Hi,

I'm customizing Nutch 2.1 to crawl blogs from several authors. Each author's blog has a list page and article pages.

Say I want to crawl the articles in 50 article lists (each with 30 articles). I add the article list links to feed.txt and specify '-depth 2' and '-topN 2000'. My expectation is that each time I run Nutch it will crawl all the list pages and the articles in each list. But in practice the set of urls Nutch crawls keeps growing, and each run takes more and more time (3 hours -> more than 24 hours).

Could someone explain what is happening? Does Nutch 2.1 always start crawling from the seed folder and follow the 'depth' parameter? What should I do to meet my requirement?
Thanks.

Regards,
Rui

Re: Re: What urls does Nutch crawl?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Alvaro,
Are you please able to open a Jira ticket and submit a patch for this?
I would be very willing to help integrate this into the development
branch.
Thank you for dropping in on this one.
Lewis


On Tue, Apr 2, 2013 at 1:19 PM, Alvaro Cabrerizo <to...@gmail.com> wrote:

> Hi:
>
> What I've detected using Nutch 2 (bin/nutch crawl ...) is that on every
> cycle (generate, fetch, update) the system fetches all the urls stored in
> the database (Accumulo, MySQL or whatever you use). For example, if my
> seeds.txt contains http://localhost, which points to a local Apache with a
> welcome page (no outlinks), and I run:
>
>  ./bin/nutch crawl conf/urls -depth 10
>
> then on every step Nutch will fetch http://localhost, i.e. ten times. In a
> wild crawl, every step will fetch all the links fetched before plus all
> the new links it should fetch during the current round. Looking at the
> fetcher code (FetcherJob.java), the method "run" gets the batchId
> identifier from the arguments, but the generator stores this id in the
> configuration. As the fetcher can't get the generator's id, it fetches "all".
> A simple solution could be (it works for me :) ) to check whether the
> batchId argument is null and, if so, get the value from the configuration:
>
>
>  public Map<String,Object> run(Map<String,Object> args) throws Exception {
>    ...
>     String batchId = (String)args.get(Nutch.ARG_BATCH);
>     ...
>     if (batchId == null) { // the argument value is null
>       // get the value stored in the configuration, or fetch all if that is null
>       batchId = getConf().get(GeneratorJob.BATCH_ID, Nutch.ALL_BATCH_ID_STR);
>     }
>
>
>
> Regards.



-- 
*Lewis*

Re: Re: What urls does Nutch crawl?

Posted by Alvaro Cabrerizo <to...@gmail.com>.
Hi:

What I've detected using Nutch 2 (bin/nutch crawl ...) is that on every
cycle (generate, fetch, update) the system fetches all the urls stored in
the database (Accumulo, MySQL or whatever you use). For example, if my
seeds.txt contains http://localhost, which points to a local Apache with a
welcome page (no outlinks), and I run:

 ./bin/nutch crawl conf/urls -depth 10

then on every step Nutch will fetch http://localhost, i.e. ten times. In a
wild crawl, every step will fetch all the links fetched before plus all
the new links it should fetch during the current round. Looking at the
fetcher code (FetcherJob.java), the method "run" gets the batchId
identifier from the arguments, but the generator stores this id in the
configuration. As the fetcher can't get the generator's id, it fetches "all".
A simple solution could be (it works for me :) ) to check whether the
batchId argument is null and, if so, get the value from the configuration:


 public Map<String,Object> run(Map<String,Object> args) throws Exception {
   ...
    String batchId = (String)args.get(Nutch.ARG_BATCH);
    ...
    if (batchId == null) { // the argument value is null
      // get the value stored in the configuration, or fetch all if that is null
      batchId = getConf().get(GeneratorJob.BATCH_ID, Nutch.ALL_BATCH_ID_STR);
    }
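
To make the mismatch concrete, here is a small, self-contained sketch. It is
only an illustration, written with plain java.util.Map and stand-in key names
instead of the real Hadoop Configuration and the Nutch constants
(GeneratorJob.BATCH_ID, Nutch.ARG_BATCH, Nutch.ALL_BATCH_ID_STR), so the
values below are placeholders:

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only -- not Nutch code. The generate step records the
// batch id in the configuration, the crawl command never copies it into the
// args map, so the fetcher sees null unless it falls back to the configuration.
class BatchIdMismatchSketch {
  static final String BATCH_ID_KEY = "generate.batch.id"; // stand-in for GeneratorJob.BATCH_ID
  static final String ARG_BATCH = "batch";                // stand-in for Nutch.ARG_BATCH
  static final String ALL_BATCH_ID_STR = "-all";          // stand-in for Nutch.ALL_BATCH_ID_STR

  public static void main(String[] unused) {
    Map<String, String> conf = new HashMap<String, String>();      // plays the role of the job Configuration
    Map<String, Object> fetchArgs = new HashMap<String, Object>(); // what the crawl command hands to the fetcher

    // generate step: the batch id ends up only in the configuration
    conf.put(BATCH_ID_KEY, "1358226000-31337");

    // fetch step: the crawl command did not put the id into the args map
    String batchId = (String) fetchArgs.get(ARG_BATCH);            // null
    if (batchId == null) {
      // the proposed fallback: read it from the configuration, or fetch all
      batchId = conf.containsKey(BATCH_ID_KEY) ? conf.get(BATCH_ID_KEY) : ALL_BATCH_ID_STR;
    }
    System.out.println("fetcher would fetch batch: " + batchId);   // prints the generated id
  }
}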



Regards.






On Thu, Jan 17, 2013 at 11:47 AM, 高睿 <ga...@163.com> wrote:

> Yes, that is my case.
> Removing all previous data is an option, but then the data will be lost.
> I want to write a plugin that empties the 'outlinks' for article pages, so
> the crawl stops at the article urls and no additional links are stored in
> the DB.

Re:Re: What urls does Nutch crawl?

Posted by 高睿 <ga...@163.com>.
Yes, that is my case.
Removing all previous data is an option, but then the data will be lost.
I want to write a plugin that empties the 'outlinks' for article pages, so the crawl stops at the article urls and no additional links are stored in the DB.
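
For reference, below is a rough, untested sketch of what such a plugin could look
like against the 2.x ParseFilter extension point. The exact interface details
(method signature, imports) should be checked against the Nutch 2.1 sources, the
plugin still has to be wired up with its own plugin.xml and added to
plugin.includes, and the '/list/' test is only a placeholder for whatever
actually distinguishes your article urls from your list urls:

import java.util.Collection;
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseFilter;
import org.apache.nutch.storage.WebPage;
import org.w3c.dom.DocumentFragment;

// Rough sketch of an outlink-stripping parse filter for Nutch 2.x.
public class ArticleOutlinkFilter implements ParseFilter {

  private Configuration conf;

  public Parse filter(String url, WebPage page, Parse parse,
                      HTMLMetaTags metaTags, DocumentFragment doc) {
    // Placeholder rule: treat every page that is not a list page as an article page.
    if (parse != null && url != null && !url.contains("/list/")) {
      parse.setOutlinks(new Outlink[0]); // drop all outlinks so the crawl stops here
    }
    return parse;
  }

  public Collection<WebPage.Field> getFields() {
    return Collections.<WebPage.Field>emptyList(); // no extra WebPage fields needed
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

If the article urls follow a recognizable pattern, a url filter (for example a
rule in regex-urlfilter.txt) that only accepts the list and article urls may
achieve a similar effect without custom code.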




Re: What urls does Nutch crawl?

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

did I understand you correctly?
- feed.txt is placed in the seed url folder and
- contains URLs of the 50 article lists
If yes:
 -depth 2
will crawl these 50 URLs and for each article list all its 30 outlinks,
in total 50 + 50*30 = 1550 documents.

If you continue crawling, Nutch fetches the outlinks of the 1500 docs fetched
in the second cycle, then the links found in those, and so on: it will
continue to crawl the whole web. To limit the crawl to exactly the 1550 docs,
either remove all previously crawled data and start again from scratch,
or have a look at the plugin "scoring-depth" (it's new and,
unfortunately, not yet adapted to 2.x, see https://issues.apache.org/jira/browse/NUTCH-1331
and https://issues.apache.org/jira/browse/NUTCH-1508).
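
To see why the run time balloons, a toy round-by-round tally helps. The 20
outlinks per article page below is purely an illustrative assumption, not a
number from your crawl:

  round 1: 50 list pages fetched                        (50 total)
  round 2: 50 * 30 = 1500 article pages fetched         (1550 total)
  round 3: ~1500 * 20 = 30000 newly discovered pages    (and growing every round)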

The option name -depth does not mean "limitation to a certain linkage depth" (that's
the meaning in "scoring-depth") but the number of crawl cycles or rounds.
If a crawl is started from scratch, the results are identical in most cases.

Sebastian

On 01/15/2013 06:53 PM, 高睿 wrote:
> I'm not quite sure about your question here. I'm using the Nutch 2.1 default configuration and run the command: bin/nutch crawl urls -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 1000
> The 'urls' folder contains the blog index pages (each index page lists the article pages).
> I think the plugins 'parse-html' and 'parse-tika' are currently responsible for parsing the links from the html. Should I clear the outlinks in an additional Parse plugin in order to prevent Nutch from crawling the outlinks on the article pages?


Re:Re: What urls does Nutch crawl?

Posted by 高睿 <ga...@163.com>.
I'm not quite sure about your question here. I'm using the Nutch 2.1 default configuration and run the command: bin/nutch crawl urls -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 1000
The 'urls' folder contains the blog index pages (each index page lists the article pages).
I think the plugins 'parse-html' and 'parse-tika' are currently responsible for parsing the links from the html. Should I clear the outlinks in an additional Parse plugin in order to prevent Nutch from crawling the outlinks on the article pages?



At 2013-01-15 13:31:11,"Lewis John Mcgibbney" <le...@gmail.com> wrote:
>I take it you are updating the database with the crawl data? This will mark
>all links extracted during the parse phase (depending upon your config) as due
>for fetching. When you generate, these links will be populated within the
>batchIds and Nutch will attempt to fetch them.
>Please also search the list archives for the definition of the depth
>parameter.
>Lewis

Re: What urls does Nutch crawl?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
I take it you are updating the database with the crawl data? This will mark
all links extracted during the parse phase (depending upon your config) as due
for fetching. When you generate, these links will be populated within the
batchIds and Nutch will attempt to fetch them.
Please also search the list archives for the definition of the depth
parameter.
Lewis

On Monday, January 14, 2013, 高睿 <ga...@163.com> wrote:
> Hi,
>
> I'm customizing Nutch 2.1 to crawl blogs from several authors. Each
> author's blog has a list page and article pages.
>
> Say I want to crawl the articles in 50 article lists (each with 30
> articles). I add the article list links to feed.txt and specify
> '-depth 2' and '-topN 2000'. My expectation is that each time I run Nutch
> it will crawl all the list pages and the articles in each list. But in
> practice the set of urls Nutch crawls keeps growing, and each run takes
> more and more time (3 hours -> more than 24 hours).
>
> Could someone explain what is happening? Does Nutch 2.1 always start
> crawling from the seed folder and follow the 'depth' parameter? What
> should I do to meet my requirement?
> Thanks.
>
> Regards,
> Rui
>

-- 
*Lewis*