Posted to user@nutch.apache.org by Kumar Krishnasami <ku...@vembu.com> on 2010/01/08 11:59:50 UTC

Crawl specific urls and depth argument

Hi,

I am a newbie to nutch and have just started looking at it. I have a
requirement to crawl and index only the urls that are specified under the
urls folder. I do not want nutch to crawl to any depth beyond the ones that
are listed in the urls folder.

Can I accomplish this by setting the depth argument for 'crawl' to "0"?

If I set the depth to 0, I get a message that says "No URLs to fetch -
check your seed list and URL filters."

Any help will be greatly appreciated.

Thanks,
Kumar.

Re: Crawl specific urls and depth argument

Posted by MilleBii <mi...@gmail.com>.
I agree, it is misleading at first.

-- 
-MilleBii-

Re: Crawl specific urls and depth argument

Posted by Kumar Krishnasami <ku...@vembu.com>.
Thanks, MilleBii. That explains it. All the docs I came across mentioned
something like "-depth depth indicates the link depth from the root page
that should be crawled" (from http://lucene.apache.org/nutch/tutorial8.html).




Re: Crawl specific urls and depth argument

Posted by MilleBii <mi...@gmail.com>.
The depth argument is only used by the crawl command; it is basically the
number of generate/fetch/updatedb cycles that it runs.
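
Put differently, a crawl like the following (the -dir name here is just
illustrative) runs inject once and then three generate/fetch/updatedb rounds:

  bin/nutch crawl urls -dir crawl -depth 3

With -depth 0 no round runs at all, which is why you get the "No URLs to
fetch" message.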



-- 
-MilleBii-

Re: Crawl specific urls and depth argument

Posted by Mischa Tuffield <mi...@garlik.com>.
Hi Kumar,

I am happy that that was of use to you. Sadly I have no feel for what the
"depth" argument does; I don't tend to ever use it. I tend to use nutch's
more specific commands: inject, generate, fetch, updatedb, merge, etc.
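
As a minimal sketch of that step-by-step workflow, assuming Nutch 1.x and
illustrative paths (urls/ for the seed list, crawl/ for the crawl data):

  # seed the crawldb from the urls folder (run once)
  bin/nutch inject crawl/crawldb urls

  # one fetch cycle
  bin/nutch generate crawl/crawldb crawl/segments
  segment=`ls -d crawl/segments/2* | tail -1`   # pick the newest segment
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment

(If your fetcher is configured not to parse while fetching, you would also
run "bin/nutch parse $segment" before the updatedb step.)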

Perhaps someone else could shed light on the crawl command. 

Regards, and happy new year!

Mischa



Re: Crawl specific urls and depth argument

Posted by Kumar Krishnasami <ku...@vembu.com>.
Thanks, Mischa. That worked!!

So, it looks like once this config property is set, crawl ignores the 
'depth' argument. Even if I set 'depth' to 2, 3 etc., it will never 
crawl any of the outlinks. Is that correct?

Regards,
Kumar.



Re: Crawl specific urls and depth argument

Posted by Mischa Tuffield <mi...@garlik.com>.
Hello Kumar,

There is a config property you can set in conf/nutch-site.xml, as follows
(note that it must sit outside any XML comment markers, or it will be
ignored):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>0</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

This will force nutch to fetch only items at depth "0", i.e. it won't
attempt to follow any of the outlinks from the pages you tell it to go and
fetch.
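
As a usage sketch, with the property above set in conf/nutch-site.xml, a
single-cycle crawl (the -dir name and -topN value are illustrative):

  bin/nutch crawl urls -dir crawl -depth 1 -topN 1000

should fetch exactly the seed urls and nothing more, since no outlinks are
ever added to the crawldb.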

Regards, 

Mischa

___________________________________
Mischa Tuffield
Email: mischa.tuffield@garlik.com
Homepage - http://mmt.me.uk/
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD