Posted to user@nutch.apache.org by Marco Ebbinghaus <eb...@gmail.com> on 2018/10/22 16:33:26 UTC

Nutch 1.15: crawling single web page resulting in crawldb-DB_UNFETCHED counter decreasing until 0

Hi all,

I am trying to crawl a single website, and while it works for one 
website, it doesn't work for another:

The one that doesn't work is https://www.saturn.de/ (the one that works 
with the same configuration is https://www.gamestop.de/)

What I do is run the following commands:

bin/nutch inject crawlDir/crawldb conf/seed.txt (with content https://www.saturn.de/)
bin/nutch generate crawlDir/crawldb crawlDir/segments -topN 10
segment=`ls -d crawlDir/segments/* | tail -1`
bin/nutch fetch $segment -threads 2
bin/nutch parse $segment -threads 2
bin/nutch updatedb crawlDir/crawldb $segment
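
(By the way, I run these steps by hand; just as a sketch, the same cycle could 
also be wrapped in a small shell loop. The round count and the crawl directory 
below are only placeholders:)

#!/bin/bash
# Sketch: repeat the generate/fetch/parse/updatedb cycle for a fixed number of rounds.
# Assumes the crawldb has already been seeded with "bin/nutch inject".
CRAWL_DIR=crawlDir   # placeholder crawl directory
ROUNDS=3             # placeholder number of rounds
for i in $(seq 1 $ROUNDS); do
  bin/nutch generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments -topN 10
  segment=`ls -d $CRAWL_DIR/segments/* | tail -1`
  bin/nutch fetch $segment -threads 2
  bin/nutch parse $segment -threads 2
  bin/nutch updatedb $CRAWL_DIR/crawldb $segment
done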

After that, bin/nutch readdb -stats says:

root@d12560375098:~/nutch# bin/nutch readdb crawlDir/crawldb/ -stats
CrawlDb statistics start: crawlDir/crawldb/
Statistics for CrawlDb: crawlDir/crawldb/
TOTAL urls:    93
shortest fetch interval:    15 days, 00:00:00
avg fetch interval:    15 days, 00:00:00
longest fetch interval:    15 days, 00:00:00
earliest fetch time:    Mon Oct 22 15:59:00 UTC 2018
avg of fetch times:    Mon Oct 22 19:51:00 UTC 2018
latest fetch time:    Tue Nov 06 15:59:00 UTC 2018
retry 0:    93
score quantile 0.01:    0.00917431153357029
score quantile 0.05:    0.00917431153357029
score quantile 0.1:    0.00917431153357029
score quantile 0.2:    0.00917431153357029
score quantile 0.25:    0.00917431153357029
score quantile 0.3:    0.00917431153357029
score quantile 0.4:    0.00917431153357029
score quantile 0.5:    0.00917431153357029
score quantile 0.6:    0.00917431153357029
score quantile 0.7:    0.00917431153357029
score quantile 0.75:    0.00917431153357029
score quantile 0.8:    0.00917431153357029
score quantile 0.9:    0.01834862306714058
score quantile 0.95:    0.01834862306714058
score quantile 0.99:    0.5922935494221679
min score:    0.00917431153357029
avg score:    0.021505396853211105
max score:    1.0183485746383667
status 1 (db_unfetched):    92
status 2 (db_fetched):    1

Okay! So it crawled one page (the one from seed.txt) and found 92 links, right?
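
(To double-check what those 92 entries are, I could also dump the CrawlDb to 
plain text and look at the unfetched records. Just a minimal sketch; the dump 
directory name is only an example and the exact layout of the dump files may 
differ between versions:)

# dump the CrawlDb to a text file and count / inspect the unfetched entries
bin/nutch readdb crawlDir/crawldb/ -dump crawldb_dump
grep -c "db_unfetched" crawldb_dump/part*
less crawldb_dump/part*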

after that I do:

bin/nutch generate crawlDir/crawldb crawlDir/segments -topN 10
segment=`ls -d crawlDir/segments/* | tail -1`
bin/nutch fetch $segment -threads 2
bin/nutch parse $segment -threads 2
bin/nutch updatedb crawlDir/crawldb $segment

The output all seems okay; I see that 10 pages are fetched and parsed, etc. 
The output of the fetcher looks like this, for example:

root@d12560375098:~/nutch# bin/nutch fetch $segment -threads 2
Fetcher: starting at 2018-10-22 15:59:59
Fetcher: segment: crawlDir/segments/20181022155950
Fetcher: threads: 2
Fetcher: time-out divisor: 2
QueueFeeder finished: total 10 records hit by time limit : 0
FetcherThread 37 Using queue mode : byHost
FetcherThread 37 Using queue mode : byHost
FetcherThread 43 fetching https://www.saturn.de/de/category/_festplatten-speichermedien-286920.html (queue crawl delay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
robots.txt whitelist not configured.
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=9, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=9, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=9, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=9, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=9, fetchQueues.getQueueCount=1
FetcherThread 43 fetching https://www.saturn.de/de/category/_computer-tablet-235588.html (queue crawl delay=5000ms)
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=8, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=8, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=8, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=8, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=8, fetchQueues.getQueueCount=1
FetcherThread 43 fetching https://www.saturn.de/de/category/_pc-241041.html (queue crawl delay=5000ms)
-activeThreads=2, spinWaiting=1, fetchQueues.totalSize=7, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=7, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=7, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=7, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=7, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=7, fetchQueues.getQueueCount=1
FetcherThread 43 fetching https://www.saturn.de/de/category/_smartphones-tarife-235592.html (queue crawl delay=5000ms)
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=6, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=6, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=6, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=6, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=6, fetchQueues.getQueueCount=1
FetcherThread 43 fetching https://www.saturn.de/de/category/_tablets-252064.html (queue crawl delay=5000ms)
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=5, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=5, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=5, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=5, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=5, fetchQueues.getQueueCount=1
FetcherThread 43 fetching https://www.saturn.de/de/category/_notebooks-241042.html (queue crawl delay=5000ms)
-activeThreads=2, spinWaiting=1, fetchQueues.totalSize=4, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=4, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=4, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=4, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=4, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=4, fetchQueues.getQueueCount=1
FetcherThread 43 fetching https://www.saturn.de/de/category/_apple-322511.html (queue crawl delay=5000ms)
-activeThreads=2, spinWaiting=1, fetchQueues.totalSize=3, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=3, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=3, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=3, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=3, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=3, fetchQueues.getQueueCount=1
FetcherThread 43 fetching https://www.saturn.de/de/category/_monitore-241044.html (queue crawl delay=5000ms)
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=2, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=2, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=2, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=2, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=2, fetchQueues.getQueueCount=1
FetcherThread 43 fetching https://www.saturn.de/de/category/_netzwerk-241046.html (queue crawl delay=5000ms)
-activeThreads=2, spinWaiting=1, fetchQueues.totalSize=1, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=1, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=1, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=1, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=1, fetchQueues.getQueueCount=1
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=1, fetchQueues.getQueueCount=1
FetcherThread 43 fetching https://www.saturn.de/de/shop/cardinformation.html (queue crawl delay=5000ms)
FetcherThread 44 has no more work available
FetcherThread 44 -finishing thread FetcherThread, activeThreads=1
FetcherThread 43 has no more work available
FetcherThread 43 -finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
-activeThreads=0
Fetcher: finished at 2018-10-22 16:00:50, elapsed: 00:00:51

parse says:

root@d12560375098:~/nutch# bin/nutch parse $segment -threads 2
ParseSegment: starting at 2018-10-22 16:01:01
ParseSegment: segment: crawlDir/segments/20181022155950
Parsed (89ms):https://www.saturn.de/de/category/_apple-322511.html
Parsed (16ms):https://www.saturn.de/de/category/_computer-tablet-235588.html
Parsed (14ms):https://www.saturn.de/de/category/_festplatten-speichermedien-286920.html
Parsed (14ms):https://www.saturn.de/de/category/_monitore-241044.html
Parsed (8ms):https://www.saturn.de/de/category/_netzwerk-241046.html
Parsed (12ms):https://www.saturn.de/de/category/_notebooks-241042.html
Parsed (7ms):https://www.saturn.de/de/category/_pc-241041.html
Parsed (7ms):https://www.saturn.de/de/category/_smartphones-tarife-235592.html
Parsed (6ms):https://www.saturn.de/de/category/_tablets-252064.html
Parsed (8ms):https://www.saturn.de/de/shop/cardinformation.html
ParseSegment: finished at 2018-10-22 16:01:03, elapsed: 00:00:01

but when I do another readdb -stats it says:

Statistics for CrawlDb: crawlDir/crawldb/
TOTAL urls:    101
shortest fetch interval:    15 days, 00:00:00
avg fetch interval:    15 days, 00:00:00
longest fetch interval:    15 days, 00:00:00
earliest fetch time:    Mon Oct 22 15:59:00 UTC 2018
avg of fetch times:    Wed Oct 24 07:11:00 UTC 2018
latest fetch time:    Tue Nov 06 16:00:00 UTC 2018
retry 0:    101
score quantile 0.01:    1.7152949061710387E-4
score quantile 0.05:    1.73100212123245E-4
score quantile 0.1:    0.010499933175742628
score quantile 0.2:    0.010984613560140133
score quantile 0.25:    0.010984613560140133
score quantile 0.3:    0.010984613560140133
score quantile 0.4:    0.010984613560140133
score quantile 0.5:    0.010984613560140133
score quantile 0.6:    0.010984613560140133
score quantile 0.7:    0.010984613560140133
score quantile 0.75:    0.010984613560140133
score quantile 0.8:    0.010984613560140133
score quantile 0.9:    0.021969228982925415
score quantile 0.95:    0.021969228982925415
score quantile 0.99:    0.5157972617447326
min score:    1.6989465802907944E-4
avg score:    0.021709534201291528
max score:    1.0183485746383667
status 1 (db_unfetched):    90
status 2 (db_fetched):    11
CrawlDb statistics: done

As you can see, 10 more pages were fetched (before 1, now with -topN 10 --> 11) 
as intended, but db_unfetched decreased from 92 to 90. I have no clue why this 
happens. How can that be? It seems like all links from the 10 fetched pages are 
ignored instead of being put into the crawldb as unfetched.
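
(To see which outlinks were actually extracted from those 10 pages, I could 
dump the parse data of the segment and check a single URL with the parsechecker 
tool. Just a sketch, assuming the readseg/parsechecker options of 1.15; the 
layout of the dump output may differ slightly:)

# dump only the parse data of the last segment; outlinks are listed in the ParseData records
bin/nutch readseg -dump $segment segdump -nocontent -nofetch -nogenerate -noparsetext
grep "toUrl:" segdump/dump | head -20

# quick check of the outlinks extracted from a single page
bin/nutch parsechecker https://www.saturn.de/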

After one more crawling cycle:

bin/nutch generate crawlDir/crawldb crawlDir/segments -topN 10
segment=`ls -d crawlDir/segments/* | tail -1`
bin/nutch fetch $segment -threads 2
bin/nutch parse $segment -threads 2
bin/nutch updatedb crawlDir/crawldb $segment

readdb says:

root@d12560375098:~/nutch# bin/nutch readdb crawlDir/crawldb/ -stats
CrawlDb statistics start: crawlDir/crawldb/
Statistics for CrawlDb: crawlDir/crawldb/
TOTAL urls:    109
shortest fetch interval:    15 days, 00:00:00
avg fetch interval:    15 days, 03:18:09
longest fetch interval:    22 days, 12:00:00
earliest fetch time:    Mon Oct 22 15:59:00 UTC 2018
avg of fetch times:    Thu Oct 25 23:18:00 UTC 2018
latest fetch time:    Wed Nov 14 04:16:00 UTC 2018
retry 0:    109
score quantile 0.01:    0.0
score quantile 0.05:    1.694382077403134E-4
score quantile 0.1:    1.73100212123245E-4
score quantile 0.2:    0.012107410468161106
score quantile 0.25:    0.012107410468161106
score quantile 0.3:    0.012107410468161106
score quantile 0.4:    0.012107410468161106
score quantile 0.5:    0.012107410468161106
score quantile 0.6:    0.012107410468161106
score quantile 0.7:    0.012107410468161106
score quantile 0.75:    0.012107410468161106
score quantile 0.8:    0.012107410468161106
score quantile 0.9:    0.02421482279896736
score quantile 0.95:    0.02421482279896736
score quantile 0.99:    0.4389530326798525
min score:    0.0
avg score:    0.02122468467152447
max score:    1.0183485746383667
status 1 (db_unfetched):    86
status 2 (db_fetched):    19
status 3 (db_gone):    2
status 4 (db_redir_temp):    2
CrawlDb statistics: done

db_unfetched decreased again. This will continue until db_unfetched reaches 0 
and my crawler does nothing anymore.
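
(One could simply log the counters after each updatedb round to watch the 
trend; a trivial sketch, the log file name is just an example:)

bin/nutch readdb crawlDir/crawldb/ -stats | grep -E "db_unfetched|db_fetched" >> crawl_progress.log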

my nutch-site.xml looks like:

<property>
   <name>http.agent.name</name>
   <value>ENTER_NAME_HERE</value>
</property>

<property>
   <name>db.fetch.interval.default</name>
   <value>1296000</value>
</property>

<property>
   <name>db.fetch.schedule.adaptive.inc_rate</name>
   <value>0.4</value>
   <description>If a page is unmodified, its fetchInterval will be
   increased by this rate. This value should not
   exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>

<property>
   <name>db.fetch.schedule.adaptive.dec_rate</name>
   <value>0.2</value>
   <description>If a page is modified, its fetchInterval will be
   decreased by this rate. This value should not
   exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>

<property>
   <name>db.fetch.schedule.adaptive.min_interval</name>
   <value>86400.0</value>
   <description>Minimum fetchInterval, in seconds.</description>
</property>

<property>
   <name>db.fetch.schedule.adaptive.max_interval</name>
   <value>31536000.0</value>
   <description>Maximum fetchInterval, in seconds (365 days).
   NOTE: this is limited by db.fetch.interval.max. Pages with
   fetchInterval larger than db.fetch.interval.max
   will be fetched anyway.</description>
</property>

<property>
   <name>db.fetch.schedule.adaptive.sync_delta</name>
   <value>false</value>
   <description>If true, try to synchronize with the time of page change
   by shifting the next fetchTime by a fraction (sync_rate) of the difference
   between the last modification time and the last fetch time.</description>
</property>

<property>
   <name>db.fetch.schedule.class</name>
   <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
   <description>The implementation of fetch schedule. DefaultFetchSchedule simply
   adds the original fetchInterval to the last fetch time, regardless of
   page changes.</description>
</property>

<property>
   <name>db.signature.class</name>
   <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>

<property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
    <description>The maximum number of outlinks that we'll process for a page.
       If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
       will be processed for a page; otherwise, all outlinks will be processed.
    </description>
</property>

<!-- link-navigation --> <property>
   <name>db.ignore.internal.links</name>
   <value>false</value>
</property>

<property>
   <name>db.ignore.external.links</name>
   <value>true</value>
</property>

<!-- sitemap properties --> <property>
    <name>sitemap.strict.parsing</name>
    <value>false</value>
    <description>
       If true (default) the Sitemap parser rejects URLs not sharing the same
       prefix with the sitemap: a sitemap `http://example.com/catalog/sitemap.xml'
       may only contain URLs starting with `http://example.com/catalog/'.
       All other URLs are skipped.  If false the parser will allow any URLs contained
       in the sitemap.
    </description>
</property>

<property>
    <name>sitemap.url.filter</name>
    <value>false</value>
    <description>
       Filter URLs from sitemaps.
    </description>
</property>

<property>
    <name>sitemap.url.normalize</name>
    <value>false</value>
    <description>
       Normalize URLs from sitemaps.
    </description>
</property>

<!-- etc --> <property>
    <name>http.redirect.max</name>
    <value>6</value>
    <description>The maximum number of redirects the fetcher will follow when
       trying to fetch a page. If set to negative or 0, fetcher won't immediately
       follow redirected URLs, instead it will record them for later fetching.
    </description>
</property>

<property>
   <name>fetcher.verbose</name>
   <value>true</value>
   <description>If true, fetcher will log more verbosely.</description>
</property>

and my regex-urlfilter.txt only contains:

-(?i)\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|svg|SVG|mp3|MP3|mp4|MP4|pdf|PDF|json|JSON)$
+.

At first I had lots of filters in my regex-urlfilter.txt, but because I thought 
the fetched links were being filtered out by a wrong filter, I commented out 
nearly everything. Now that I have practically nothing left in my 
regex-urlfilter.txt, the db_unfetched counter still keeps decreasing. So the 
filters don't seem to be the problem.
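
(To rule the URL filters in or out, one can also run candidate outlinks through 
the configured filter chain with the filterchecker tool. Just a sketch; the 
exact options vary between Nutch versions, so check the usage printed by 
bin/nutch filterchecker for your build:)

# a '+' in front of the echoed URL means the filters accept it, '-' means it is rejected
echo "https://www.saturn.de/de/category/_pc-241041.html" | bin/nutch filterchecker -stdin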

I am out of ideas as to why this happens. Does anyone have a hint for me? I 
would really appreciate it! Thanks a lot in advance!

Greetings,

Marco


Re: Nutch 1.15: crawling single web page resulting in crawldb-DB_UNFETCHED counter decreasing until 0

Posted by Marco Ebbinghaus <eb...@gmail.com>.
Hi Sebastian,

No, I did not increase http.content.limit, and yes, that was absolutely the 
cause of my problem. I increased it, and now the db_unfetched counter increases 
as expected. Thank you so much for having a look! I created a ticket for a 
potential increase of the http.content.limit default value at 
https://issues.apache.org/jira/browse/NUTCH-2666. Thanks again! I will also 
have a look at whether I can extend my scripts to make use of that sitemap; 
thanks for that hint, too! :)

Greetings,

Marco

On 22.10.18 21:12, Sebastian Nagel wrote:
> Hi Marco,
>
> Did you increase http.content.limit? The default is 64 kB; saturn.de pages are much larger,
> and it may happen that the first 64 kB always contain the same set of navigation links
> (linking to product categories here).
>
> Feel free to open an issue on
>       https://issues.apache.org/jira/projects/NUTCH
> to discuss whether to increase the default. The default was chosen a long time ago;
> maybe it's time to increase it.
>
> Btw., saturn.de "announces" a sitemap in their robots.txt - maybe you want to use it:
>
>   % echo "https://www.saturn.de/sitemap/siteindex.xml" >sitemaps.txt
>   % bin/nutch sitemap  -Dhttp.content.limit=52428800 crawldb \
>       -sitemapUrls sitemaps.txt  -noStrict -noFilter -noNormalize
>   % bin/nutch readdb crawldb/ -stats
>   ...
>   status 1 (db_unfetched):        743307
>
> Also note the high value for http.content.limit (50 MB). It also should be increased, see
>    https://issues.apache.org/jira/browse/NUTCH-2511
>
> Best,
> Sebastian

Re: Nutch 1.15: crawling single web page resulting in crawldb-DB_UNFETCHED counter decreasing until 0

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Marco,

Did you increase http.content.limit? The default is 64 kB; saturn.de pages are much larger,
and it may happen that the first 64 kB always contain the same set of navigation links
(linking to product categories here).
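
For example, a larger limit can be set in nutch-site.xml roughly like this (the 
value is only an example; a negative value such as -1 disables truncation 
entirely):

<property>
   <name>http.content.limit</name>
   <!-- raise the 64 kB default so large pages are no longer truncated;
        1 MB here is only an example value, -1 disables truncation entirely -->
   <value>1048576</value>
</property>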

Feel free to open an issue on
     https://issues.apache.org/jira/projects/NUTCH
to discuss whether to increase the default. The default was chosen a long time ago;
maybe it's time to increase it.

Btw., saturn.de "announces" a sitemap in their robots.txt - maybe you want to use it:

 % echo "https://www.saturn.de/sitemap/siteindex.xml" >sitemaps.txt
 % bin/nutch sitemap  -Dhttp.content.limit=52428800 crawldb \
     -sitemapUrls sitemaps.txt  -noStrict -noFilter -noNormalize
 % bin/nutch readdb crawldb/ -stats
 ...
 status 1 (db_unfetched):        743307

Also note the high value for http.content.limit (50 MB). It also should be increased, see
  https://issues.apache.org/jira/browse/NUTCH-2511

Best,
Sebastian

