Posted to user@nutch.apache.org by nutchcase <ch...@yahoo.com> on 2009/10/20 22:06:48 UTC

crawl always stops at depth=3

My crawl always stops at depth=3. It gets documents but does not continue any
further.
Here is my nutch-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch-solr-integration</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>1000</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-(crawl|regex)|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>db.max.outlinks.per.page</name>
<value>1000</value>
</property>
</configuration>
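
(For reference, the depth mentioned here is the -depth argument of the one-step crawl command. A typical invocation, with placeholder directory names, looks something like this:

bin/nutch crawl urls -dir crawl -depth 10 -topN 1000

If -depth was set higher than 3, stopping early usually means the generator found nothing new to fetch in the next round.)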




Re: crawl always stops at depth=3

Posted by nutchcase <ch...@yahoo.com>.
All the urls that are queued are crawled; the problem is that it doesn't look
further than depth 3 for urls, so anything beyond that depth doesn't end up in
the segments. If I disable url filtering completely by removing it from
nutch-site.xml, it gets too many urls, so I guess it is a problem with my
filter definition. I just can't seem to get the filter right.
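
(If the accept rule in the quoted filter below mirrors the real one, the stray "." between the group and foo.com may be part of the problem: it requires one arbitrary extra character before foo.com, so http://foo.com/ and http://www.foo.com/ are never accepted, while a lookalike host such as http://www.xfoo.com/ is. A cleaned-up accept rule, with foo.com standing in for the real domain, might look like:

+^http://([a-zA-Z0-9-]+\.)*foo\.com/

The escaped dots make foo.com literal, and the group now matches whole subdomain labels only.)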

reinhard schwab wrote:
> 
> and you miss some urls to be crawled? which?
> 
> with
> 
> bin/nutch readdb crawl/crawldb -dump <some directory>
> 
> you can dump the content of the crawl db into a readable format.
> there you will see the next fetch times of the urls and their status.
> 
> with
> 
> bin/nutch readseg -dump crawl/segments/<segment_dir> <output_dir>
> 
> you can dump a segment into a readable format
> and see which links have been extracted.
> 
> nutchcase wrote:
>> Right, I have commented that part of the filter out and it gets urls with
>> queries, but only to a depth of 3. Here is my url filter:
>> -^(https|telnet|file|ftp|mailto):
>>
>> # skip some suffixes
>> -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> #-[?*!@=]
>>
>> # allow urls in foofactory.fi domain
>> +^http://([a-z0-9\-A-Z]*\.)*.foo.com/
>>
>> # deny anything else
>> #-.
>>  
>>
>> reinhard schwab wrote:
>>   
>>> the crawler has stopped fetching because all urls are already fetched.
>>> there are no unfetched urls left.
>>> do you expect to have more urls fetched?
>>>
>>> either you need more seed urls or you need to change your url filters.
>>> the default nutch url filter configuration excludes the deep web, i.e. every
>>> url with a query part (?).
>>>
>>>
>>>     
>>
>>   
> 
> 
> 



Re: crawl always stops at depth=3

Posted by reinhard schwab <re...@aon.at>.
and you miss some urls to be crawled? which?

with

bin/nutch readdb crawl/crawldb -dump <some directory>

you can dump the content of the crawl db into a readable format.
there you will see the next fetch times of the urls and their status.

with

bin/nutch readseg -dump crawl/segments/<segment_dir> <output_dir>

you can dump a segment into a readable format
and see which links have been extracted.
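
for example, with placeholder output directories and a placeholder segment timestamp:

bin/nutch readdb crawl/crawldb -dump crawldb-dump
bin/nutch readseg -dump crawl/segments/20091020231500 segment-dump

the crawldb dump ends up as plain-text part files (e.g. crawldb-dump/part-00000) and the segment dump as a text file named dump in the output directory.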

nutchcase wrote:
> Right, I have commented that part of the filter out and it gets urls with
> queries, but only to a depth of 3. Here is my url filter:
> -^(https|telnet|file|ftp|mailto):
>
> # skip some suffixes
> -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
>
> # allow urls in foofactory.fi domain
> +^http://([a-z0-9\-A-Z]*\.)*.foo.com/
>
> # deny anything else
> #-.
>  
>
> reinhard schwab wrote:
>   
>> the crawler has stopped fetching because all urls are already fetched.
>> there are no unfetched urls left.
>> do you expect to have more urls fetched?
>>
>> either you need more seed urls or you need to change your url filters.
>> the default nutch url filter configuration excludes the deep web, i.e. every
>> url with a query part (?).
>>
>>
>>     
>
>   


Re: crawl always stops at depth=3

Posted by nutchcase <ch...@yahoo.com>.
Right, I have commented that part of the filter out and it gets urls with
queries, but only to a depth of 3. Here is my url filter:
-^(https|telnet|file|ftp|mailto):

# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# allow urls in foofactory.fi domain
+^http://([a-z0-9\-A-Z]*\.)*.foo.com/

# deny anything else
#-.
 

reinhard schwab wrote:
> 
> the crawler has stopped fetching because all urls are already fetched.
> there are no unfetched urls left.
> do you expect to have more urls fetched?
> 
> either you need more seed urls or you need to change your url filters.
> the default nutch url filter configuration excludes the deep web, i.e. every
> url with a query part (?).
> 
> 



Re: crawl always stops at depth=3

Posted by reinhard schwab <re...@aon.at>.
the crawler has stopped fetching because all urls are already fetched.
there are no unfetched urls left.
do you expect to have more urls fetched?

either you need more seed urls or you need to change your url filters.
the default nutch url filter configuration excludes the deep web, i.e. every
url with a query part (?).
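
for reference, the rule in question in the stock conf/regex-urlfilter.txt is:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

commenting it out, as in the filter nutchcase posted, lets urls with a query string through, provided a later + rule still accepts them.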


nutchcase wrote:
> Here is the output from that:
> TOTAL urls:	297
> retry 0:	297
> min score:	0.0
> avg score:	0.023377104
> max score:	2.009
> status 2 (db_fetched):	295
> status 5 (db_redir_perm):	2
>
>
> reinhard schwab wrote:
>   
>> try
>>
>> bin/nutch readdb crawl/crawldb -stats
>>
>> are there any unfetched pages?
>>
>> nutchcase wrote:
>>     
>>> My crawl always stops at depth=3. It gets documents but does not continue
>>> any
>>> further.
>>> Here is my nutch-site.xml
>>> <?xml version="1.0"?>
>>> <configuration>
>>> <property>
>>> <name>http.agent.name</name>
>>> <value>nutch-solr-integration</value>
>>> </property>
>>> <property>
>>> <name>generate.max.per.host</name>
>>> <value>1000</value>
>>> </property>
>>> <property>
>>> <name>plugin.includes</name>
>>> <value>protocol-http|urlfilter-(crawl|regex)|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>> </property>
>>> <property>
>>> <name>db.max.outlinks.per.page</name>
>>> <value>1000</value>
>>> </property>
>>> </configuration>
>>>
>>>
>>>   
>>>       
>>
>>     
>
>   


Re: crawl always stops at depth=3

Posted by nutchcase <ch...@yahoo.com>.
Here is the output from that:
TOTAL urls:	297
retry 0:	297
min score:	0.0
avg score:	0.023377104
max score:	2.009
status 2 (db_fetched):	295
status 5 (db_redir_perm):	2


reinhard schwab wrote:
> 
> try
> 
> bin/nutch readdb crawl/crawldb -stats
> 
> are there any unfetched pages?
> 
> nutchcase wrote:
>> My crawl always stops at depth=3. It gets documents but does not continue
>> any
>> further.
>> Here is my nutch-site.xml
>> <?xml version="1.0"?>
>> <configuration>
>> <property>
>> <name>http.agent.name</name>
>> <value>nutch-solr-integration</value>
>> </property>
>> <property>
>> <name>generate.max.per.host</name>
>> <value>1000</value>
>> </property>
>> <property>
>> <name>plugin.includes</name>
>> <value>protocol-http|urlfilter-(crawl|regex)|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> </property>
>> <property>
>> <name>db.max.outlinks.per.page</name>
>> <value>1000</value>
>> </property>
>> </configuration>
>>
>>
>>   
> 
> 
> 



Re: crawl always stops at depth=3

Posted by reinhard schwab <re...@aon.at>.
try

bin/nutch readdb crawl/crawldb -stats

are there any unfetched pages?
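
if there are, the -stats output will contain a line like the following (the count here is only illustrative):

status 1 (db_unfetched):	42

if that line is missing, every url currently in the crawldb has already been fetched (or redirected), and the generator has nothing left to schedule.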

nutchcase wrote:
> My crawl always stops at depth=3. It gets documents but does not continue any
> further.
> Here is my nutch-site.xml
> <?xml version="1.0"?>
> <configuration>
> <property>
> <name>http.agent.name</name>
> <value>nutch-solr-integration</value>
> </property>
> <property>
> <name>generate.max.per.host</name>
> <value>1000</value>
> </property>
> <property>
> <name>plugin.includes</name>
> <value>protocol-http|urlfilter-(crawl|regex)|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
> <property>
> <name>db.max.outlinks.per.page</name>
> <value>1000</value>
> </property>
> </configuration>
>
>
>