
Posted to user@nutch.apache.org by Vangelis karv <ka...@hotmail.com> on 2014/02/14 12:39:10 UTC

Threads

Hello people!

Let's say we choose 20 threads to fetch. How do they cooperate? I mean, who tells them which pages each of them will fetch?
Is it possible for some of them to collide or fetch the same page without knowing it?
I read the code and found that if the redirect is to the same page, it will not follow that redirect. Any advice would be very helpful!

Vangelis

RE: Scoring plugin

Posted by Vangelis karv <ka...@hotmail.com>.

I think I got what you mean! Thanks a lot for your help! I will post again if I run into trouble!

Re: Scoring plugin

Posted by Sebastian Nagel <wa...@googlemail.com>.

> But then, if I wanted to update the score of an inlink based on an image url I would call
> updateScore().inlinkedStoreData.setScore(whatever)?

1) Never call updateScore() or any other method defined in the plugin interface yourself!

> By doing that, I would change the score of a Score Datum, not the score of the actual Webpage. Am
> I missing something here?

Hasn't the score of a page containing 25 image links already been set to 251 (as expected) by
  row.setScore(aleks+adjust);
?
ScoreDatums are only used to transfer scores from a page to all pages it links to.
You can change a ScoreDatum's score inside distributeScoreToOutlinks() and read the saved
value in updateScore(). A ScoreDatum is a temporary "container" for scores inside the update job,
nothing persistent.

RE: Scoring plugin

Posted by Vangelis karv <ka...@hotmail.com>.


> If I understood correctly, you want the score to be based only on the number
> of image outlinks. Would an empty updateScore() method help?
> You don't want the score to be influenced by inlinks, right?

You are right! Maybe an empty updateScore() would help.

But then, if I wanted to update the score of an inlink based on an image url I would call updateScore().inlinkedStoreData.setScore(whatever)?
By doing that, I would change the score of a Score Datum, not the score of the actual Webpage. Am I missing something here?

Re: Scoring plugin

Posted by Sebastian Nagel <wa...@googlemail.com>.

> No. Just my plugin is used.
In this case you only need to implement
distributeScoreToOutlinks() and updateScore()
so that they do what you want! :)

> ScoreDatum's score is the same as the row's score with the same url?
ScoreDatum is used to transfer a score from a page to all its outlinks.
In distributeScoreToOutlinks() the ScoreDatum's score is usually set
for all outlinks, and in updateScore() the scores of all inlinks
are used to update the score of the target page.

Both methods are called automatically in the update phase of a
generate-fetch-parse-update cycle.

If I understood correctly, you want the score to be based only on the number
of image outlinks. Would an empty updateScore() method help?
You don't want the score to be influenced by inlinks, right?
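
The hand-off described above can be illustrated with a small toy model. This is only a sketch: Page, Datum, distribute() and update() are hypothetical stand-ins invented for illustration, not the Nutch WebPage, ScoreDatum or ScoringFilter types, and splitting the score evenly over the outlinks is an assumption, not what any particular plugin does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical stand-ins, not the real Nutch classes.
class ScoreTransferSketch {
    static class Page {
        float score;
        final List<String> outlinks = new ArrayList<>();
        Page(float score) { this.score = score; }
    }

    // Temporary container: carries a score from a source page to one target URL.
    static class Datum {
        final String targetUrl;
        float score;
        Datum(String targetUrl) { this.targetUrl = targetUrl; }
    }

    // Phase 1: the source page fills one datum per outlink (even split is an assumption).
    static List<Datum> distribute(Page from) {
        List<Datum> data = new ArrayList<>();
        for (String url : from.outlinks) {
            Datum d = new Datum(url);
            d.score = from.score / from.outlinks.size();
            data.add(d);
        }
        return data;
    }

    // Phase 2: the target page reads the data that arrived for it.
    static void update(Page target, List<Datum> inlinked) {
        for (Datum d : inlinked) {
            target.score += d.score;
        }
    }

    // One update cycle over a three-page toy database: "a" links to "b" and "c".
    static Map<String, Page> runCycle() {
        Map<String, Page> db = new HashMap<>();
        Page a = new Page(1.0f);
        a.outlinks.add("b");
        a.outlinks.add("c");
        db.put("a", a);
        db.put("b", new Page(0.0f));
        db.put("c", new Page(0.0f));

        // Collect all data first, grouped by target URL, then apply them.
        Map<String, List<Datum>> byTarget = new HashMap<>();
        for (Page p : db.values()) {
            for (Datum d : distribute(p)) {
                byTarget.computeIfAbsent(d.targetUrl, k -> new ArrayList<>()).add(d);
            }
        }
        for (Map.Entry<String, List<Datum>> e : byTarget.entrySet()) {
            Page target = db.get(e.getKey());
            if (target != null) {
                update(target, e.getValue());
            }
        }
        return db;
    }
}
```

In this model the Datum objects are discarded after the cycle; only the Page scores persist, which mirrors the "temporary container" role described for ScoreDatum.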


RE: Scoring plugin

Posted by Vangelis karv <ka...@hotmail.com>.

Hi Sebastian!


> Are there any other scoring plugins in use? Especially scoring-opic, which is on by default.

No. Just my plugin is used.

> The score is the result of sequentially calling the corresponding methods of all scoring filters.
>
> Second, the interface o.a.n.scoring.ScoringFilter defines more methods. Besides
> distributeScoreToOutlinks(), updateScore() is also run every cycle.

I have checked that interface. But that does not answer my previous questions.  :)

> So have a look at what these methods do. If needed, also check all other enabled scoring filters.
> Outside scoring filters the score of a page is never changed.

Is a ScoreDatum's score the same as the score of the row with the same URL? Whenever I try to change the score of an inlink through updateScore(), the score in MySQL is not the one I am expecting.

Re: Scoring plugin

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Vangelis,

Are there any other scoring plugins in use? Especially scoring-opic, which is on by default.
The score is the result of sequentially calling the corresponding methods of all scoring filters.

Second, the interface o.a.n.scoring.ScoringFilter defines more methods. Besides
distributeScoreToOutlinks(), updateScore() is also run every cycle.
So have a look at what these methods do. If needed, also check all other enabled scoring filters.
Outside scoring filters the score of a page is never changed.

Sebastian

Scoring plugin

Posted by Vangelis karv <ka...@hotmail.com>.

My exact problem is the following: I want to write a scoring function so that whenever a URL contains a .jpg image, the URL's score is increased by 10. In the method distributeScoreToOutlinks() I added this:

    // Add 10 to the adjustment for every outlink whose URL contains ".jpg".
    for (ScoreDatum free : scoreData) {
      String aleos = free.getUrl();
      if (aleos != null && aleos.contains(".jpg")) {
        adjust += 10.0f;
      }
    }

    // Add the adjustment to the page's current score.
    float aleks = row.getScore();
    row.setScore(aleks+adjust);

For example, http://www.uefa.com/ contains ~25 .jpg images and has a score of ~251 with my scoring plugin. At depth 2 that score goes to 502, at depth 3 to 1004, etc.
I want that page's score to stay at 251, and I do not want the page to be refetched and re-updated. I think my problem is that at the beginning of each cycle Nutch re-updates http://www.uefa.com/, which is my starting URL.

Any ideas?
Thank you in advance!
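
One way to picture a score that does not grow from cycle to cycle is to derive it from a fixed base instead of the stored score. The sketch below is not from the thread and does not use the Nutch API: JpgScoreSketch, imageBonus() and stableScore() are hypothetical names, and it assumes the growth comes from adding the bonus to the already-adjusted score on every update.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: computes an image bonus from outlink URLs.
class JpgScoreSketch {
    static final float BONUS_PER_IMAGE = 10.0f;

    // Returns 10 for every outlink URL that contains ".jpg" (case-insensitive).
    static float imageBonus(List<String> outlinkUrls) {
        float bonus = 0.0f;
        for (String url : outlinkUrls) {
            if (url != null && url.toLowerCase(Locale.ROOT).contains(".jpg")) {
                bonus += BONUS_PER_IMAGE;
            }
        }
        return bonus;
    }

    // Derives the score from a fixed base value, so calling it again with the
    // same inputs yields the same result instead of accumulating.
    static float stableScore(float baseScore, List<String> outlinkUrls) {
        return baseScore + imageBonus(outlinkUrls);
    }
}
```

Because stableScore() never reads the previously stored score, running it once per cycle keeps the result constant as long as the outlinks do not change.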

Re: Threads

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Vangelis,

Please open a new thread (in the mailing-list sense) for a new topic.

Thanks,
Sebastian

RE: Threads

Posted by Vangelis karv <ka...@hotmail.com>.

My exact problem is the following: I want to write a scoring function so that whenever a URL contains a .jpg image, the URL's score is increased by 10. In the method distributeScoreToOutlinks() I added this:

    // Add 10 to the adjustment for every outlink whose URL contains ".jpg".
    for (ScoreDatum free : scoreData) {
      String aleos = free.getUrl();
      if (aleos != null && aleos.contains(".jpg")) {
        adjust += 10.0f;
      }
    }

    // Add the adjustment to the page's current score.
    float aleks = row.getScore();
    row.setScore(aleks+adjust);

For example, http://www.uefa.com/ contains ~25 .jpg images and has a score of ~251 with my scoring plugin. At depth 2 that score goes to 502, at depth 3 to 1004, etc.
I want that page's score to stay at 251, and I do not want the page to be refetched and re-updated. I think my problem is that at the beginning of each cycle Nutch re-updates http://www.uefa.com/, which is my starting URL.

Any ideas?
Thank you in advance!

Re: Threads

Posted by Sebastian Nagel <wa...@googlemail.com>.

> I forgot to mention that I am using Nutch 2.2.1 and i can't find http.redirect.max. I guess that
> it is only in 1.x.
Yes.
> Any ideas on how to answer my 1st question? (I do not want the same page to be refetched).
In 2.x, redirects are only recorded, never followed immediately.
If a page has already been fetched, it will not be re-fetched (only after some "longer" time).

RE: Threads

Posted by Vangelis karv <ka...@hotmail.com>.

Thank you Sebastian for your trouble!

I forgot to mention that I am using Nutch 2.2.1 and I can't find http.redirect.max. I guess that it is only in 1.x.
Any ideas on how to answer my 1st question? (I do not want the same page to be refetched).

Re: Threads

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Vangelis,

> 1) If www.somesite.com redirects to www.somesite.com, will it fetch it at the next cycle?
Yes, if http.redirect.max == 0 (which is the default).

> 2) I understand that the whole set of urls to be fetched is saved at QueueFeeder.
QueueFeeder reads the generated list of URLs chunk by chunk and (re-)feeds FetchItemQueues, which is
a map <host/domain/ip, FetchItemQueue>. If the list of URLs is long, it is never stored entirely
in memory.

> Each thread will be assigned a number of urls to fetch equal to: (wholeSetToBeFetched) /
> (numberOfThreads) ?
After having fetched a URL, a FetcherThread asks for a new URL. If it does not get one because all
queues are blocked for politeness, it sleeps a second and tries again. The exact number of URLs
processed by a thread is random, but ideally the number should be approximately equal for each thread. Of
course, there should not be many more threads than queues (hosts, domains, IPs), at least if
fetcher.threads.per.queue == 1.

Sebastian
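
The pull model described above (a thread asks for the next URL, and backs off briefly when every queue is blocked) can be sketched roughly as follows. This is a simplified toy, not Nutch's Fetcher code: FetchQueuesSketch and its methods are hypothetical, the politeness rule is reduced to a minimum delay between two hand-outs for the same host, and the sleep is shortened to 1 ms so the example finishes quickly.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: hypothetical names, not Nutch's FetchItemQueues/FetcherThread.
class FetchQueuesSketch {
    private final Map<String, Queue<String>> queues = new HashMap<>(); // host -> pending URLs
    private final Map<String, Long> nextAllowed = new HashMap<>();     // host -> earliest next hand-out (ms)
    private final long delayMs;

    FetchQueuesSketch(long delayMs) { this.delayMs = delayMs; }

    synchronized void add(String host, String url) {
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    synchronized boolean isEmpty() {
        for (Queue<String> q : queues.values()) {
            if (!q.isEmpty()) return false;
        }
        return true;
    }

    // Hands out one URL from a host whose delay has passed, or null if all hosts are blocked.
    synchronized String poll() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Queue<String>> e : queues.entrySet()) {
            if (e.getValue().isEmpty()) continue;
            if (nextAllowed.getOrDefault(e.getKey(), 0L) > now) continue;
            nextAllowed.put(e.getKey(), now + delayMs);
            return e.getValue().poll();
        }
        return null;
    }

    // Starts the given number of worker threads and returns how often each URL was handed out.
    static Map<String, Integer> run(int threads) {
        FetchQueuesSketch q = new FetchQueuesSketch(5);
        for (int i = 0; i < 10; i++) {
            q.add("a.example", "http://a.example/" + i);
            q.add("b.example", "http://b.example/" + i);
        }
        Map<String, Integer> fetched = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                while (!q.isEmpty()) {
                    String url = q.poll();
                    if (url == null) {
                        // Everything is blocked for politeness: wait briefly and retry.
                        try {
                            Thread.sleep(1);
                        } catch (InterruptedException ex) {
                            Thread.currentThread().interrupt();
                            return;
                        }
                        continue;
                    }
                    fetched.merge(url, 1, Integer::sum); // stand-in for the actual fetch
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(ex);
            }
        }
        return fetched;
    }
}
```

With only two hosts and 20 threads, most threads spend their time sleeping, which illustrates why having many more threads than queues brings little benefit in this model.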


RE: Threads

Posted by Vangelis karv <ka...@hotmail.com>.

Thank you Markus for your fast response!
1) If www.somesite.com redirects to www.somesite.com, will it be fetched in the next cycle?
2) I understand that the whole set of URLs to be fetched is stored in the QueueFeeder. Will each thread be assigned a number of URLs to fetch equal to (wholeSetToBeFetched) / (numberOfThreads)?

Happy Valentine's Day!

RE: Threads

Posted by Markus Jelsma <ma...@openindex.io>.

Hi,

They take records (FetchItems) from the QueueFeeder. Queues are based on domain, host or IP, and a URL exists only once, so nothing collides. The redirect will be followed in the next fetch cycle.

Markus
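
As a rough illustration of queues keyed by host in which each URL appears only once, consider the sketch below. It is a toy under stated assumptions: HostQueueSketch and byHost() are hypothetical names, duplicates are dropped with a simple set, and only host-based grouping is shown (not domain or IP).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical helper, not Nutch's QueueFeeder.
class HostQueueSketch {
    // Groups URLs into one queue per host; a URL seen before is skipped.
    static Map<String, List<String>> byHost(List<String> urls) {
        Map<String, List<String>> queues = new LinkedHashMap<>();
        Set<String> seen = new HashSet<>();
        for (String u : urls) {
            if (!seen.add(u)) {
                continue; // already queued once
            }
            String host = URI.create(u).getHost();
            if (host == null) {
                host = "unknown"; // URLs without a parsable host share one queue
            }
            queues.computeIfAbsent(host, h -> new ArrayList<>()).add(u);
        }
        return queues;
    }
}
```

Since every URL sits in exactly one queue and is removed when a thread takes it, two threads cannot receive the same URL in this model.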

 
 