Posted to user@nutch.apache.org by Markus Jelsma <ma...@buyways.nl> on 2010/06/29 19:19:43 UTC

Fetch queue's total size

Hi,

I'm wondering why this value never exceeds 500? While watching the fetch log, I cannot determine the number of remaining fetches because as long as there are more than 500 due, the threads just wiggle between 490 and 500.

Is there a way to configure this? I haven't found a setting with a value of 500 anywhere, but it would be most convenient to know how much is left in the entire queue.

Thanks,

Re: Fetch queue's total size

Posted by Markus Jelsma <ma...@buyways.nl>.

On Tuesday 29 June 2010 20:50:01 Julien Nioche wrote:
> Markus,
> 
> 
> The depth of the queue is simply 50 * number of threads, so I gather that
> you are using 10 threads. There is a JIRA where we discussed making this
> value configurable.
> 
> > the threads just wiggle between 490 and 500.
> 
> 
> you probably mean 'the total size wiggles...'?
Yes.
> 
> The number of remaining URLs to fetch is not known from the Fetcher as it
> reads the fetchlist as it goes. The fact that you can see the real number
> of remaining URLs once it drops below 500 is simply due to the fact that
> all the input URLs have been read and all the remaining ones are in the queue.
> 
> 
> The 50x value could be set in a parameter; however, this is not the issue
> here. The point is that the queue is what's stored in memory whereas the
> total number of URLs is the queue + what's left to be read from HDFS.
> 
> I'd suggest using the mapreduce webapp to monitor the progress and not
> simply looking at the logs (i.e. you need to run it in distributed mode).
> There are now details about how many URLs have been fetched successfully or
> not + of course the progress of the map operations which indicates how much
> has been read from HDFS. Since you know how many URLs you put in the
> fetchlist in the first place, it would be trivial to work out what's left.
Thanks, I understand and will try the monitor.
> 
> HTH
> 
> Julien
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Fetch queue's total size

Posted by Julien Nioche <li...@gmail.com>.

Markus,



> I'm wondering why this value never exceeds 500? While watching the fetch
> log, I cannot determine the number of remaining fetches because as long as
> there are more than 500 due,


The depth of the queue is simply 50 * number of threads, so I gather that
you are using 10 threads. There is a JIRA where we discussed making this
value configurable.

> the threads just wiggle between 490 and 500.

you probably mean 'the total size wiggles...'?

The number of remaining URLs to fetch is not known from the Fetcher as it
reads the fetchlist as it goes. The fact that you can see the real number of
remaining URLs once it drops below 500 is simply due to the fact that all the
input URLs have been read and all the remaining ones are in the queue.
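
To make the behaviour concrete, here is a toy simulation of a feeder that
tops an in-memory queue up to 50 * threads while fetch threads drain it. It
is only a sketch and not Nutch code: simulate(), fetched_per_tick and the
2000-URL fetchlist are made-up names and numbers for illustration.

```python
from collections import deque

QUEUE_DEPTH_PER_THREAD = 50  # the "50x" factor mentioned in this thread


def simulate(total_urls, threads, fetched_per_tick):
    """Return the in-memory queue size observed after each tick (hypothetical model)."""
    cap = QUEUE_DEPTH_PER_THREAD * threads
    unread = total_urls  # URLs still sitting in the fetchlist, not yet read
    queue = deque()      # what the fetcher holds in memory
    observed = []
    next_id = 0
    while unread > 0 or queue:
        # Feeder: top the queue up to the cap from the unread fetchlist.
        while unread > 0 and len(queue) < cap:
            queue.append(next_id)
            next_id += 1
            unread -= 1
        # Fetch threads: consume a batch of queued URLs.
        for _ in range(min(fetched_per_tick, len(queue))):
            queue.popleft()
        observed.append(len(queue))
    return observed


sizes = simulate(total_urls=2000, threads=10, fetched_per_tick=10)
```

With these numbers the logged size sits just under the cap of 500 for as long
as unread URLs remain, and only starts counting down to 0 once the whole
fetchlist has been read.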


>
> Is there a way to configure this? I haven't found a setting with a value of
> 500 anywhere but it would be most convenient to know how much is left in the
> entire queue.
>


The 50x value could be set in a parameter; however, this is not the issue
here. The point is that the queue is what's stored in memory whereas the
total number of URLs is the queue + what's left to be read from HDFS.

I'd suggest using the mapreduce webapp to monitor the progress and not
simply looking at the logs (i.e. you need to run it in distributed mode).
There are now details about how many URLs have been fetched successfully or
not + of course the progress of the map operations which indicates how much
has been read from HDFS. Since you know how many URLs you put in the
fetchlist in the first place, it would be trivial to work out what's left.
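
The arithmetic is roughly the following. This is a sketch only:
remaining_urls() and its argument names are hypothetical, and the values
would be read off the job's page in the webapp by hand.

```python
def remaining_urls(fetchlist_size, fetched_ok, fetched_failed):
    """Estimate how many URLs are still to be fetched.

    fetchlist_size: number of URLs put in the fetchlist (known up front)
    fetched_ok / fetched_failed: counts reported so far for the running job
    """
    attempted = fetched_ok + fetched_failed
    if attempted > fetchlist_size:
        raise ValueError("more URLs attempted than were in the fetchlist")
    return fetchlist_size - attempted


# Example with made-up numbers: 2000 URLs generated, 1200 fetched, 50 failed.
left = remaining_urls(2000, 1200, 50)
```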

HTH

Julien

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com