Posted to user@nutch.apache.org by Marek Bachmann <m....@uni-kassel.de> on 2011/08/16 15:16:53 UTC
Some question about the generator
Hello,
there are two things I don't understand regarding the generator:
1.) If I set generate.max.count to a value such as 3000, the value
seems to be ignored: in every run about 20000 pages are fetched.
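For reference, I set the property in conf/nutch-site.xml roughly like this (the generate.count.mode entry is shown only for completeness; as far as I know, host is its default):

```xml
<!-- limit the number of URLs per host (or domain) in a generated segment -->
<property>
  <name>generate.max.count</name>
  <value>3000</value>
</property>
<!-- counting unit for generate.max.count: host, domain or ip -->
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
```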
TOTAL urls: 102396
retry 0: 101679
retry 1: 325
retry 2: 392
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 33072
status 2 (db_fetched): 57146
status 3 (db_gone): 6878
status 4 (db_redir_temp): 2510
status 5 (db_redir_perm): 2509
status 6 (db_notmodified): 281
CrawlDb statistics: done
After a generate / fetch / parse / update cycle:
TOTAL urls: 122885
retry 0: 121816
retry 1: 677
retry 2: 392
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 32153
status 2 (db_fetched): 75366
status 3 (db_gone): 9167
status 4 (db_redir_temp): 2979
status 5 (db_redir_perm): 2878
status 6 (db_notmodified): 342
CrawlDb statistics: done
2.) The next thing is related to the first one:
The generator tells me in the log files:
2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain
cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
But when the fetcher runs, it fetches many URLs that the generator
told me it had skipped before, e.g.:
2011-08-16 13:56:31,119 INFO fetcher.Fetcher - fetching
http://cms.uni-kassel.de/unicms/index.php?id=27436
2011-08-16 13:56:31,706 INFO fetcher.Fetcher - fetching
http://cms.uni-kassel.de/unicms/index.php?id=24287&L=1
A second example:
2011-08-16 13:55:59,362 INFO crawl.Generator - Host or domain
www.iset.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
2011-08-16 13:56:30,783 INFO fetcher.Fetcher - fetching
http://www.iset.uni-kassel.de/abt/FB-I/publication/2011-017_Visualizing_and_Optimizing-Paper.pdf
2011-08-16 13:56:30,813 INFO fetcher.Fetcher - fetching
http://www.iset.uni-kassel.de/abt/FB-A/publication/2010/2010_Degner_Staffelstein.pdf
Did I do something wrong? I don't get it :)
Thank you all
Re: Some question about the generator
Posted by Markus Jelsma <ma...@openindex.io>.
It selects the number of fetch lists (and hence fetcher tasks): always one in
local mode and possibly multiple in distributed mode.
> What does the -numFetchers option of the generator do?
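To illustrate the idea, here is a toy sketch (not Nutch's actual code, and the partitioning details are an assumption): the generated URLs are split into numFetchers fetch lists, keeping all URLs of one host in the same list so per-host politeness still works within each fetcher task.

```python
import zlib

def partition_by_host(urls, num_fetchers):
    """Split (host, url) pairs into num_fetchers fetch lists.

    All URLs of a host land in the same list (deterministic hash of
    the host name), mirroring the idea that a given host is handled
    by a single fetcher task.
    """
    parts = [[] for _ in range(num_fetchers)]
    for host, url in urls:
        idx = zlib.crc32(host.encode()) % num_fetchers
        parts[idx].append(url)
    return parts

urls = [
    ("cms.uni-kassel.de", "http://cms.uni-kassel.de/unicms/index.php?id=27436"),
    ("www.iset.uni-kassel.de",
     "http://www.iset.uni-kassel.de/abt/FB-A/publication/2010/2010_Degner_Staffelstein.pdf"),
    ("cms.uni-kassel.de", "http://cms.uni-kassel.de/unicms/index.php?id=24287&L=1"),
]

fetch_lists = partition_by_host(urls, num_fetchers=2)
# both cms.uni-kassel.de URLs end up in the same fetch list
```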
Re: Some question about the generator
Posted by Radim Kolar <hs...@sendmail.cz>.
What does the -numFetchers option of the generator do?
Re: Some question about the generator
Posted by Marek Bachmann <m....@uni-kassel.de>.
On 16.08.2011 16:27, Julien Nioche wrote:
>>> 1) generate.max.count sets a limit on the number of URLs for a single
>>>> host or domain - this is different from the overall limit set by the
>>>> generate -top parameter.
>>>>
>>>> 2) the generator only skips the URLs which are beyond the max number
>>>> allowed for the host (in your case 3K). This does not mean that ALL
>> urls
>>>> for that host are skipped
>>>>
>>>> Makes sense?
>>>
>>> Hey Julien, thank you. Yes, your description makes sense for me. So if I
>>> want to fetch a list with only 3k urls, I just have to run:
>>>
>>> ./nutch parse $seg -topN 3000
>>
>> No, topN applies to the generator.
>>
>
> good catch Markus - I'd read generate.
> Marek - this has nothing to do with the parsing
Yeah, right, I meant generate. My fault. :-)
Re: Some question about the generator
Posted by Julien Nioche <li...@gmail.com>.
> > 1) generate.max.count sets a limit on the number of URLs for a single
> > > host or domain - this is different from the overall limit set by the
> > > generate -top parameter.
> > >
> > > 2) the generator only skips the URLs which are beyond the max number
> > > allowed for the host (in your case 3K). This does not mean that ALL
> urls
> > > for that host are skipped
> > >
> > > Makes sense?
> >
> > Hey Julien, thank you. Yes, your description makes sense for me. So if I
> > want to fetch a list with only 3k urls, I just have to run:
> >
> > ./nutch parse $seg -topN 3000
>
> No, topN applies to the generator.
>
good catch Markus - I'd read generate.
Marek - this has nothing to do with the parsing
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
Re: Some question about the generator
Posted by Markus Jelsma <ma...@openindex.io>.
On Tuesday 16 August 2011 16:17:26 Marek Bachmann wrote:
> On 16.08.2011 15:53, Julien Nioche wrote:
> > 1) generate.max.count sets a limit on the number of URLs for a single
> > host or domain - this is different from the overall limit set by the
> > generate -top parameter.
> >
> > 2) the generator only skips the URLs which are beyond the max number
> > allowed for the host (in your case 3K). This does not mean that ALL urls
> > for that host are skipped
> >
> > Makes sense?
>
> Hey Julien, thank you. Yes, your description makes sense for me. So if I
> want to fetch a list with only 3k urls, I just have to run:
>
> ./nutch parse $seg -topN 3000
No, topN applies to the generator.
>
> right?
>
> But I still don't get this message:
> 2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain
> cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
>
> What is meant by "more than 3000 URLs for all 1 segments"? Skipping
> means then, that "it will skip after 3k urls"?
With generate.max.count=3000, all URLs above 3000 for a given host/domain are
skipped when generating the segment.
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Some question about the generator
Posted by Julien Nioche <li...@gmail.com>.
> Yes, your description makes sense for me. So if I want to fetch a list with
> only 3k urls, I just have to run:
> ./nutch parse $seg -topN 3000
>
> right?
>
yes
>
> But I still don't get this message:
>
> 2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain
> cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
>
> What is meant by "more than 3000 URLs for all 1 segments"? Skipping means
> then, that "it will skip after 3k urls"?
>
yes
Re: Some question about the generator
Posted by Marek Bachmann <m....@uni-kassel.de>.
On 16.08.2011 15:53, Julien Nioche wrote:
> 1) generate.max.count sets a limit on the number of URLs for a single host
> or domain - this is different from the overall limit set by the generate
> -top parameter.
>
> 2) the generator only skips the URLs which are beyond the max number allowed
> for the host (in your case 3K). This does not mean that ALL urls for that
> host are skipped
>
> Makes sense?
Hey Julien, thank you. Yes, your description makes sense to me. So if I
want to fetch a list with only 3k URLs, I just have to run:
./nutch parse $seg -topN 3000
right?
But I still don't get this message:
2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain
cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
What is meant by "more than 3000 URLs for all 1 segments"? Does "skipping"
mean, then, that it will skip everything after the first 3k URLs?
But for now you helped to solve my problem. :)
Re: Some question about the generator
Posted by Julien Nioche <li...@gmail.com>.
1) generate.max.count sets a limit on the number of URLs for a single host
or domain - this is different from the overall limit set by the generate
-top parameter.
2) the generator only skips the URLs which are beyond the max number allowed
for the host (in your case 3K). This does not mean that ALL urls for that
host are skipped
Makes sense?
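The interaction of the two limits can be sketched like this (toy code, not what Nutch actually runs; host names are made up): with a per-host cap of 3 and an overall -topN of 5, host a.example contributes only its first 3 URLs, yet still shows up in the fetch list.

```python
from collections import defaultdict

def generate(urls, max_per_host, top_n):
    """Toy version of the generator's selection step.

    Keeps at most max_per_host URLs per host (generate.max.count)
    and at most top_n URLs overall (the -topN parameter).
    """
    per_host = defaultdict(int)
    selected = []
    for host, url in urls:
        if len(selected) >= top_n:
            break                      # overall -topN limit reached
        if per_host[host] >= max_per_host:
            continue                   # only URLs *beyond* the cap are skipped
        per_host[host] += 1
        selected.append(url)
    return selected

urls = [("a.example", "http://a.example/%d" % i) for i in range(5)] \
     + [("b.example", "http://b.example/%d" % i) for i in range(3)]

selected = generate(urls, max_per_host=3, top_n=5)
# a.example is capped at 3 URLs, but those 3 are still fetched
```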