Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2015/06/24 04:08:32 UTC

True Value of fetchQueues.totalSize

Hi Folks,

It is very common for us to see logging such as the following

fetching
> http://www.eclap.eu/drupal/?q=en-US/node/65229/og/forum&sort=asc&order=Topic
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414

What I've noticed for some time is that fetchQueues.totalSize never seems
to exceed 2500. Some primitive log analysis will verify this observation.
What this value represents (above) is that there are 414 items currently
queued up to be fetched. The fetching task will be more or less complete
once every member of the queue has been fetched.

The problem I see here is that when fetchQueues.totalSize=2500, this is
actually not a true reflection of how many URLs are queued! The queue
appears to be fed from some larger parent pool of more than 2500 URLs.

I'm going to debug and investigate what is going on here however I thought
I would ask if anyone else knows why the value seems to hit the ceiling at
2500.

Thanks
Lewis

-- 
*Lewis*

Re: True Value of fetchQueues.totalSize

Posted by feng lu <am...@gmail.com>.
Hi Lewis

The input URL queue size is controlled by the QueueFeeder class, which
limits the size of the FetchItemQueues. We can find the following code
in the QueueFeeder run method.

int feed = size - queues.getTotalSize();  // check if the queues are full
if (feed <= 0) {
  // queues are full - spin-wait until they have some free space
  try {
    Thread.sleep(1000);
  } catch (Exception e) {}
  continue;
} else {
  LOG.debug("-feeding " + feed + " input urls ...");
  ...
}
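The pattern above is a bounded producer with spin-wait back-pressure: the
feeder only tops the queue up to a fixed capacity and waits when it is full.
A minimal, self-contained sketch of the same idea (class and method names
here are illustrative, not Nutch's actual implementation):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch of the QueueFeeder back-pressure pattern.
public class BoundedFeeder {
    // Corresponds to threadCount * fetcher.queue.depth.multiplier in Nutch.
    static final int CAPACITY = 500;

    // Tops the queue up to CAPACITY and reports how many items were fed;
    // returns 0 when the queue is full (caller should sleep and retry).
    public static int feedOnce(ConcurrentLinkedQueue<String> queue,
                               Iterator<String> input) {
        int feed = CAPACITY - queue.size(); // free slots
        if (feed <= 0) {
            return 0; // queue full
        }
        int fed = 0;
        while (fed < feed && input.hasNext()) {
            queue.add(input.next());
            fed++;
        }
        return fed;
    }

    public static void main(String[] args) {
        ConcurrentLinkedQueue<String> q = new ConcurrentLinkedQueue<>();
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 1200; i++) urls.add("http://example.com/" + i);
        // Even with 1200 input URLs, a single feed never exceeds CAPACITY.
        System.out.println(feedOnce(q, urls.iterator())); // prints 500
        System.out.println(q.size());                     // prints 500
    }
}
```

This is why the observed fetchQueues.totalSize is a cap on what is queued
right now, not a count of everything left to fetch.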

The QueueFeeder is initialized in the run method of the Fetcher class.

    int threadCount = getConf().getInt("fetcher.threads.fetch", 10);
    if (LOG.isInfoEnabled()) { LOG.info("Fetcher: threads: " + threadCount); }

    ...

    int queueDepthMuliplier =
        getConf().getInt("fetcher.queue.depth.multiplier", 50);

    feeder = new QueueFeeder(input, fetchQueues,
        threadCount * queueDepthMuliplier);

So the default queue size is 500, because the default threadCount is 10 and
the default queueDepthMuliplier is 50. Why, then, is the limit 2500 in your
log output? Because you run the crawl via the bin/crawl script, and there
the fetch task sets the fetcher thread count through a command-line
parameter. We can find the fetch command in the bin/crawl file.
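The arithmetic is simply capacity = thread count * depth multiplier. A tiny
sketch to make the two cases concrete (the helper method is mine, not part
of Nutch):

```java
// The fetch queue capacity is threadCount * fetcher.queue.depth.multiplier.
public class QueueCapacity {
    static int capacity(int threadCount, int depthMultiplier) {
        return threadCount * depthMultiplier;
    }

    public static void main(String[] args) {
        System.out.println(capacity(10, 50)); // Fetcher defaults -> 500
        System.out.println(capacity(50, 50)); // bin/crawl numThreads=50 -> 2500
    }
}
```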

  # fetching the segment
  echo "Fetching : $SEGMENT"
  "$bin/nutch" fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch \
    "$CRAWL_PATH"/segments/$SEGMENT -noParsing -threads $numThreads

Here the number of fetch threads is set from the numThreads variable, whose
default value is 50:

  # num threads for fetching
  numThreads=50

So here the queue limit is 50 * 50 = 2500. If you want to increase the
fetch queue size, you can increase either the fetch thread count or the
fetcher queue depth multiplier. If you want to keep the default thread
count, the better way is to increase the value of the
fetcher.queue.depth.multiplier parameter in your configuration file.
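For example, to raise the ceiling to 5000 while keeping 50 fetch threads,
you could double the multiplier in nutch-site.xml. The property name comes
from the Fetcher code above; the value 100 is just an example:

```xml
<property>
  <name>fetcher.queue.depth.multiplier</name>
  <value>100</value>
</property>
```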






-- 
Don't Grow Old, Grow Up... :-)