Posted to user@nutch.apache.org by Marek Bachmann <m....@uni-kassel.de> on 2011/08/16 15:16:53 UTC
Some question about the generator
Hello,
there are two things I don't understand regarding the generator:
1.) If I set generate.max.count to a value such as 3000, the value
seems to be ignored: in every run about 20000 pages are fetched.
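For reference, I set the property in conf/nutch-site.xml roughly like this (the generate.count.mode entry is shown only for completeness; as far as I know, host is its default):

```xml
<!-- limit the number of URLs per host (or domain) in a generated segment -->
<property>
  <name>generate.max.count</name>
  <value>3000</value>
</property>
<!-- counting unit for generate.max.count: host, domain or ip -->
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
```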
TOTAL urls: 102396
retry 0: 101679
retry 1: 325
retry 2: 392
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 33072
status 2 (db_fetched): 57146
status 3 (db_gone): 6878
status 4 (db_redir_temp): 2510
status 5 (db_redir_perm): 2509
status 6 (db_notmodified): 281
CrawlDb statistics: done
After a generate / fetch / parse / update cycle:
TOTAL urls: 122885
retry 0: 121816
retry 1: 677
retry 2: 392
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 32153
status 2 (db_fetched): 75366
status 3 (db_gone): 9167
status 4 (db_redir_temp): 2979
status 5 (db_redir_perm): 2878
status 6 (db_notmodified): 342
CrawlDb statistics: done
2.) The next thing is related to the first one:
The generator tells me in the log files:
2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain
cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
But when the fetcher runs, it fetches many URLs that the generator
told me it had skipped before, e.g.:
2011-08-16 13:56:31,119 INFO fetcher.Fetcher - fetching
http://cms.uni-kassel.de/unicms/index.php?id=27436
2011-08-16 13:56:31,706 INFO fetcher.Fetcher - fetching
http://cms.uni-kassel.de/unicms/index.php?id=24287&L=1
A second example:
2011-08-16 13:55:59,362 INFO crawl.Generator - Host or domain
www.iset.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
2011-08-16 13:56:30,783 INFO fetcher.Fetcher - fetching
http://www.iset.uni-kassel.de/abt/FB-I/publication/2011-017_Visualizing_and_Optimizing-Paper.pdf
2011-08-16 13:56:30,813 INFO fetcher.Fetcher - fetching
http://www.iset.uni-kassel.de/abt/FB-A/publication/2010/2010_Degner_Staffelstein.pdf
Did I do something wrong? I don't get it :)
Thank you all
Re: Some question about the generator
Posted by Markus Jelsma <ma...@openindex.io>.
It selects the number of fetch lists (and hence fetcher tasks): always one in
local mode and possibly multiple in distributed mode.
> What does the -numFetchers option of the generator do?
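To illustrate the idea, here is a toy sketch (not Nutch's actual code, and the partitioning details are an assumption): the generated URLs are split into numFetchers fetch lists, keeping all URLs of one host in the same list so per-host politeness still works within each fetcher task.

```python
import zlib

def partition_by_host(urls, num_fetchers):
    """Split (host, url) pairs into num_fetchers fetch lists.

    All URLs of a host land in the same list (deterministic hash of
    the host name), mirroring the idea that a given host is handled
    by a single fetcher task.
    """
    parts = [[] for _ in range(num_fetchers)]
    for host, url in urls:
        idx = zlib.crc32(host.encode()) % num_fetchers
        parts[idx].append(url)
    return parts

urls = [
    ("cms.uni-kassel.de", "http://cms.uni-kassel.de/unicms/index.php?id=27436"),
    ("www.iset.uni-kassel.de",
     "http://www.iset.uni-kassel.de/abt/FB-A/publication/2010/2010_Degner_Staffelstein.pdf"),
    ("cms.uni-kassel.de", "http://cms.uni-kassel.de/unicms/index.php?id=24287&L=1"),
]

fetch_lists = partition_by_host(urls, num_fetchers=2)
# both cms.uni-kassel.de URLs end up in the same fetch list
```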
Re: Some question about the generator
Posted by Radim Kolar <hs...@sendmail.cz>.
What does the -numFetchers option of the generator do?
Re: Some question about the generator
Posted by Marek Bachmann <m....@uni-kassel.de>.
On 16.08.2011 16:27, Julien Nioche wrote:
>>> 1) generate.max.count sets a limit on the number of URLs for a single
>>>> host or domain - this is different from the overall limit set by the
>>>> generate -top parameter.
>>>>
>>>> 2) the generator only skips the URLs which are beyond the max number
>>>> allowed for the host (in your case 3K). This does not mean that ALL
>> urls
>>>> for that host are skipped
>>>>
>>>> Makes sense?
>>>
>>> Hey Julien, thank you. Yes, your description makes sense for me. So if I
>>> want to fetch a list with only 3k urls, I just have to run:
>>>
>>> ./nutch parse $seg -topN 3000
>>
>> No, topN applies to the generator.
>>
>
> good catch Markus - I'd read generate.
> Marek - this has nothing to do with the parsing
Yeah, right, I meant generate. My fault. :-)
Re: Some question about the generator
Posted by Julien Nioche <li...@gmail.com>.
> > 1) generate.max.count sets a limit on the number of URLs for a single
> > > host or domain - this is different from the overall limit set by the
> > > generate -top parameter.
> > >
> > > 2) the generator only skips the URLs which are beyond the max number
> > > allowed for the host (in your case 3K). This does not mean that ALL
> urls
> > > for that host are skipped
> > >
> > > Makes sense?
> >
> > Hey Julien, thank you. Yes, your description makes sense for me. So if I
> > want to fetch a list with only 3k urls, I just have to run:
> >
> > ./nutch parse $seg -topN 3000
>
> No, topN applies to the generator.
>
good catch Markus - I'd read generate.
Marek - this has nothing to do with the parsing
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
Re: Some question about the generator
Posted by Markus Jelsma <ma...@openindex.io>.
On Tuesday 16 August 2011 16:17:26 Marek Bachmann wrote:
> On 16.08.2011 15:53, Julien Nioche wrote:
> > 1) generate.max.count sets a limit on the number of URLs for a single
> > host or domain - this is different from the overall limit set by the
> > generate -top parameter.
> >
> > 2) the generator only skips the URLs which are beyond the max number
> > allowed for the host (in your case 3K). This does not mean that ALL urls
> > for that host are skipped
> >
> > Makes sense?
>
> Hey Julien, thank you. Yes, your description makes sense for me. So if I
> want to fetch a list with only 3k urls, I just have to run:
>
> ./nutch parse $seg -topN 3000
No, topN applies to the generator.
>
> right?
>
> But I still don't get this message:
> 2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain
> cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
>
> What is meant by "more than 3000 URLs for all 1 segments"? Skipping
> means then, that "it will skip after 3k urls"?
With generate.max.count=3000, all URLs above 3000 for a given host/domain are
skipped when generating the segment.
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Some question about the generator
Posted by Julien Nioche <li...@gmail.com>.
> Yes, your description makes sense for me. So if I want to fetch a list with
> only 3k urls, I just have to run:
> ./nutch parse $seg -topN 3000
>
> right?
>
yes
>
> But I still don't get this message:
>
> 2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain
> cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
>
> What is meant by "more than 3000 URLs for all 1 segments"? Skipping means
> then, that "it will skip after 3k urls"?
>
yes
Re: Some question about the generator
Posted by Marek Bachmann <m....@uni-kassel.de>.
On 16.08.2011 15:53, Julien Nioche wrote:
> 1) generate.max.count sets a limit on the number of URLs for a single host
> or domain - this is different from the overall limit set by the generate
> -top parameter.
>
> 2) the generator only skips the URLs which are beyond the max number allowed
> for the host (in your case 3K). This does not mean that ALL urls for that
> host are skipped
>
> Makes sense?
Hey Julien, thank you. Yes, your description makes sense to me. So if I
want to fetch a list with only 3k URLs, I just have to run:
./nutch parse $seg -topN 3000
right?
But I still don't get this message:
2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain
cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
What is meant by "more than 3000 URLs for all 1 segments"? Does "skipping"
mean, then, that it will skip everything after the first 3k URLs?
But for now you helped to solve my problem. :)
Re: Some question about the generator
Posted by Julien Nioche <li...@gmail.com>.
1) generate.max.count sets a limit on the number of URLs for a single host
or domain - this is different from the overall limit set by the generate
-top parameter.
2) the generator only skips the URLs which are beyond the max number allowed
for the host (in your case 3K). This does not mean that ALL urls for that
host are skipped
Makes sense?
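The interaction of the two limits can be sketched like this (toy code, not what Nutch actually runs; host names are made up): with a per-host cap of 3 and an overall -topN of 5, host a.example contributes only its first 3 URLs, yet still shows up in the fetch list.

```python
from collections import defaultdict

def generate(urls, max_per_host, top_n):
    """Toy version of the generator's selection step.

    Keeps at most max_per_host URLs per host (generate.max.count)
    and at most top_n URLs overall (the -topN parameter).
    """
    per_host = defaultdict(int)
    selected = []
    for host, url in urls:
        if len(selected) >= top_n:
            break                      # overall -topN limit reached
        if per_host[host] >= max_per_host:
            continue                   # only URLs *beyond* the cap are skipped
        per_host[host] += 1
        selected.append(url)
    return selected

urls = [("a.example", "http://a.example/%d" % i) for i in range(5)] \
     + [("b.example", "http://b.example/%d" % i) for i in range(3)]

selected = generate(urls, max_per_host=3, top_n=5)
# a.example is capped at 3 URLs, but those 3 are still fetched
```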