Posted to dev@nutch.apache.org by Preetam Pradeepkumar Shingavi <sh...@usc.edu> on 2015/02/08 19:18:52 UTC

Fetch queue size, Multiple seed URLs and Maximum Depth

Hi,

I have configured Nutch with a seed URL in local/url/seed.txt, with just one
URL to test (depth=2):

https://www.aoncadis.org/home.htm

DOUBTS :
1. Fetch queue size :

Watching the logs during the first crawl, Nutch shows the following (note
fetchQueues.totalSize=29 below, which changes to something like 389 for the
same URL in the next run):

2015-02-07 17:56:47,530 INFO  fetcher.Fetcher - -activeThreads=50,
spinWaiting=50, fetchQueues.totalSize=29, fetchQueues.getQueueCount=1

After the crawl is done, if I want to crawl the same seed URL again, I would
expect the fetch queue to be empty (since the first run finished) and to show
the same fetchQueues.totalSize=29 as above, but the next run shows a fetch
queue size of 398, and it is really time-consuming to work through this queue.

How do I avoid this?

2. Do I give multiple seed URLs in seed.txt, one per line?

3. What is the maximum depth I can ask Nutch to crawl?

Thanks,
Preetam

Re: Fetch queue size, Multiple seed URLs and Maximum Depth

Posted by Preetam Pradeepkumar Shingavi <sh...@usc.edu>.
Cool.

Thanks & Regards,
Preetam

On Sun, Feb 8, 2015 at 11:16 AM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Thanks Preetam:
>
>
> >[..snip..]
> >Why would you want to?
> >
> >
> >
> >
> >Preetam: I was just curious whether this could be handled manually if
> >possible.
> >I was expecting that once the CrawlDB holds the data for all URLs crawled
> >to depth 2, the next run would not crawl the same URLs again.
> >Is it that URLs discovered at depth 2 are left unfetched in the queue (not
> >dequeued because the crawl hit the depth threshold), and are therefore
> >fetched in the next run, which explains the larger fetch queue size?
>
> +1. Yep.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>

Re: Fetch queue size, Multiple seed URLs and Maximum Depth

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Thanks Preetam:


>[..snip..]
>Why would you want to?
>
>
>
>
>Preetam: I was just curious whether this could be handled manually if
>possible.
>I was expecting that once the CrawlDB holds the data for all URLs crawled to
>depth 2, the next run would not crawl the same URLs again.
>Is it that URLs discovered at depth 2 are left unfetched in the queue (not
>dequeued because the crawl hit the depth threshold), and are therefore
>fetched in the next run, which explains the larger fetch queue size?

+1. Yep.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Fetch queue size, Multiple seed URLs and Maximum Depth

Posted by Preetam Pradeepkumar Shingavi <sh...@usc.edu>.
Comments inline .

Thanks,
Preetam

On Sun, Feb 8, 2015 at 10:56 AM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hi Preetam,
>
>
> -----Original Message-----
> From: Preetam Pradeepkumar Shingavi <sh...@usc.edu>
> Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
> Date: Sunday, February 8, 2015 at 10:18 AM
> To: "dev@nutch.apache.org" <de...@nutch.apache.org>, Chris Mattmann
> <ma...@usc.edu>
> Subject: Fetch queue size, Multiple seed URLs and Maximum Depth
>
> >Hi,
> >
> >
> >I have configured NUTCH with seed URL in local/url/seed.txt with just 1
> >URL to test (depth=2) :
> >
> >
> >https://www.aoncadis.org/home.htm
> >
> >
> >
> >DOUBTS :
> >1. Fetch queue size :
> >
> >
> >Watching the logs during the first crawl, Nutch shows the following (note
> >fetchQueues.totalSize=29 below, which changes to something like 389 for
> >the same URL in the next run):
> >
> >
> >2015-02-07 17:56:47,530 INFO  fetcher.Fetcher - -activeThreads=50,
> >spinWaiting=50,
> >fetchQueues.totalSize=29, fetchQueues.getQueueCount=1
> >
> >
> >
> >After the crawl is done, if I want to crawl the same seed URL again, I
> >would expect the fetch queue to be empty (since the first run finished)
> >and to show the same fetchQueues.totalSize=29 as above, but the next run
> >shows a fetch queue size of 398, and it is really time-consuming to work
> >through this queue.
>
> I’m not sure I understand your question. The fetch queue is never empty,
> since it’s driven by the URL DB. So, if there are URLs that are still
> unfetched when Nutch finishes its fetcher run (configured by
> numberOfRounds), they will be marked as such in the UrlDB, and on the next
> iteration it will pick up where it left off on those URLs.
>


> >
> >
> >How do I avoid this ?
>
> Why would you want to?
>

Preetam: I was just curious whether this could be handled manually if
possible.
I was expecting that once the CrawlDB holds the data for all URLs crawled to
depth 2, the next run would not crawl the same URLs again.
Is it that URLs discovered at depth 2 are left unfetched in the queue (not
dequeued because the crawl hit the depth threshold), and are therefore
fetched in the next run, which explains the larger fetch queue size?


> >
> >
> >2. Do I give multiple seed URLs in seed.txt, one per line?
>
> Yep.
> >
> >
> >3. What is the maximum depth I can ask Nutch to crawl?
>
> numberOfRounds controls this and you will have to experiment to determine
> the tradeoff here between depth and completeness.
>

*Preetam : Okay cool.*

>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>

Re: Fetch queue size, Multiple seed URLs and Maximum Depth

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi Preetam,


-----Original Message-----
From: Preetam Pradeepkumar Shingavi <sh...@usc.edu>
Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
Date: Sunday, February 8, 2015 at 10:18 AM
To: "dev@nutch.apache.org" <de...@nutch.apache.org>, Chris Mattmann
<ma...@usc.edu>
Subject: Fetch queue size, Multiple seed URLs and Maximum Depth

>Hi,
>
>
>I have configured NUTCH with seed URL in local/url/seed.txt with just 1
>URL to test (depth=2) :
>
>
>https://www.aoncadis.org/home.htm
>
>
>
>DOUBTS :
>1. Fetch queue size :
>
>
>Watching the logs during the first crawl, Nutch shows the following (note
>fetchQueues.totalSize=29 below, which changes to something like 389 for the
>same URL in the next run):
>
>
>2015-02-07 17:56:47,530 INFO  fetcher.Fetcher - -activeThreads=50,
>spinWaiting=50,
>fetchQueues.totalSize=29, fetchQueues.getQueueCount=1
>
>
>
>After the crawl is done, if I want to crawl the same seed URL again, I
>would expect the fetch queue to be empty (since the first run finished) and
>to show the same fetchQueues.totalSize=29 as above, but the next run shows a
>fetch queue size of 398, and it is really time-consuming to work through
>this queue.

I’m not sure I understand your question. The fetch queue is never empty,
since it’s driven by the URL DB. So, if there are URLs that are still
unfetched when Nutch finishes its fetcher run (configured by numberOfRounds),
they will be marked as such in the UrlDB, and on the next iteration it will
pick up where it left off on those URLs.
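
For reference, one quick way to see this between rounds (assuming a local 1.x
install; the crawl directory path below is only a placeholder) is the readdb
tool:

  # Dump CrawlDB stats after a round; the db_unfetched count is roughly
  # what the next round's fetch queue will pick up.
  bin/nutch readdb crawl/crawldb -stats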

>
>
>How do I avoid this ?

Why would you want to?

>
>
>2. Do I give multiple seed URLs in seed.txt, one per line?

Yep.
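
For example, seed.txt just lists one URL per line; a two-seed file would look
roughly like this (the second entry is only a placeholder):

  https://www.aoncadis.org/home.htm
  https://example.org/another-seed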

>
>
>3. What is the maximum depth I can ask Nutch to crawl?

numberOfRounds controls this and you will have to experiment to determine
the tradeoff here between depth and completeness.
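
For example, with the 1.x crawl script the number of rounds is the last
argument (exact arguments vary between Nutch versions, and the paths and Solr
URL below are placeholders):

  bin/crawl urls/ crawl/ http://localhost:8983/solr/ 2

Each round generates a segment, fetches and parses it, and updates the
CrawlDB, so more rounds roughly means following links further from the seeds.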

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++