Posted to user@nutch.apache.org by Paul Williams <Pa...@becta.org.uk> on 2005/09/14 10:38:46 UTC

Whole web search depth

Hi,

 

I'm fairly new to using Nutch and so this is probably a newbie question
(I've already looked in the mailing lists and can't see an answer).

 

I'm trying to do a web search (limited to around 10 sites at the moment)
but I'm unsure how to set the depth of searching.  How is this done?

 

 

Cheers.


Re: Prune

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Yes, every page whose url matches your filter will be removed.
You can try it with:
 bin/nutch prune /srv/segments/ -dryrun -queries qry.txt -output del.txt
With "-dryrun" nothing is deleted; it only shows what would be deleted 
without it.
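
For reference, a minimal qry.txt along these lines would contain one pruning 
query per line; the domain below is only a placeholder:

 +url:example-com

Once the entries written to del.txt look right, re-running the same command 
without -dryrun performs the actual deletion.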

Gal Nitzan wrote:

> Hi Ferenc,
>
> Thank you for the information, regrettably I didn't figure it out yet.
>
> Do you mean to write a text file on which every line contains 
> +url:sample.com , and it shall remove that site from the index?
>
> Thanks,
>
> Gal
>
> yoursoft@freemail.hu wrote:
>
>> e.g.:
>> +url:sex-com
>>
>> You can try, how from sex.com to +url:sex-com with: bin/nutch 
>> org.apache.nutch.searcher.Query
>>
>> Regards,
>>    Ferenc
>>
>> Gal Nitzan wrote:
>>
>>> Hi,
>>>
>>> Does anyone know how to use the prune option?
>>>
>>> what should be in the [-queries filename] file?
>>>
>>> Regards,
>>>
>>> Gal
>>>
>>>
>>
>>
>> .
>>
>
>
>


Re: Prune

Posted by Gal Nitzan <gn...@usa.net>.
Hi Egor,

Thank you very much! That was what I was looking for :)

Gal

Egor Chernodarov wrote:
> Hello, Gal!
>
> As I understood all queries must be in the lucene syntax.
> Example from archive of maillist:
> ########Example of queries########
> #    delete docs from www.cnn.com
> url:"www cnn com"
>
> #    delete docs that contain "p0rn" in their content,
> #    but not "study" or "research", and which come from www.cnn.com
> content:p0rn -content:(study research) +url:"www cnn com"
>
> #    delete docs in Swahili language
> lang:sw
>
> Friday, September 16, 2005, 3:42:40 AM, you wrote:
>
> Gal Nitzan> Hi Ferenc,
>
> Gal Nitzan> Thank you for the information, regrettably I
> Gal Nitzan> didn't figure it out yet.
>
> Gal Nitzan> Do you mean to write a text file on which every line contains
> Gal Nitzan> +url:sample.com , and it shall remove that site from the index?
>
> Gal Nitzan> Thanks,
>
> Gal Nitzan> Gal
>
> Gal Nitzan> yoursoft@freemail.hu wrote:
>   
>>> e.g.:
>>> +url:sex-com
>>>
>>> You can try, how from sex.com to +url:sex-com with: bin/nutch 
>>> org.apache.nutch.searcher.Query
>>>
>>> Regards,
>>>    Ferenc
>>>
>>> Gal Nitzan wrote:
>>>
>>>       
>>>> Hi,
>>>>
>>>> Does anyone know how to use the prune option?
>>>>
>>>> what should be in the [-queries filename] file?
>>>>
>>>> Regards,
>>>>
>>>> Gal
>>>>
>>>>
>>>>         
>>> .
>>>
>>>       
>
>
>
>
>   


Re[2]: Prune

Posted by Egor Chernodarov <eg...@zarinsk.dem.ru>.
Hello, Gal!

As I understand it, all queries must be in Lucene syntax.
Here is an example from the mailing list archive:
########Example of queries########
#    delete docs from www.cnn.com
url:"www cnn com"

#    delete docs that contain "p0rn" in their content,
#    but not "study" or "research", and which come from www.cnn.com
content:p0rn -content:(study research) +url:"www cnn com"

#    delete docs in Swahili language
lang:sw
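
Once saved to a file (queries.txt below is just a placeholder name), the 
queries are passed to the prune tool, e.g.:

 bin/nutch prune <segments dir> -queries queries.txt

Adding -dryrun shows what would be removed without deleting anything.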

Friday, September 16, 2005, 3:42:40 AM, you wrote:

Gal Nitzan> Hi Ferenc,

Gal Nitzan> Thank you for the information, regrettably I
Gal Nitzan> didn't figure it out yet.

Gal Nitzan> Do you mean to write a text file on which every line contains
Gal Nitzan> +url:sample.com , and it shall remove that site from the index?

Gal Nitzan> Thanks,

Gal Nitzan> Gal

Gal Nitzan> yoursoft@freemail.hu wrote:
>> e.g.:
>> +url:sex-com
>>
>> You can try, how from sex.com to +url:sex-com with: bin/nutch 
>> org.apache.nutch.searcher.Query
>>
>> Regards,
>>    Ferenc
>>
>> Gal Nitzan wrote:
>>
>>> Hi,
>>>
>>> Does anyone know how to use the prune option?
>>>
>>> what should be in the [-queries filename] file?
>>>
>>> Regards,
>>>
>>> Gal
>>>
>>>
>>
>>
>> .
>>




-- 
Best regards,               
 Chernodarov Egor


Re: Prune

Posted by Gal Nitzan <gn...@usa.net>.
Hi Ferenc,

Thank you for the information; regrettably I haven't figured it out yet.

Do you mean I should write a text file in which every line contains 
+url:sample.com, and that will remove the site from the index?

Thanks,

Gal

yoursoft@freemail.hu wrote:
> e.g.:
> +url:sex-com
>
> You can try, how from sex.com to +url:sex-com with: bin/nutch 
> org.apache.nutch.searcher.Query
>
> Regards,
>    Ferenc
>
> Gal Nitzan wrote:
>
>> Hi,
>>
>> Does anyone know how to use the prune option?
>>
>> what should be in the [-queries filename] file?
>>
>> Regards,
>>
>> Gal
>>
>>
>
>
> .
>


Re: Prune

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
e.g.:
+url:sex-com

You can see how sex.com is turned into +url:sex-com by running: bin/nutch 
org.apache.nutch.searcher.Query

Regards,
    Ferenc

Gal Nitzan wrote:

> Hi,
>
> Does anyone know how to use the prune option?
>
> what should be in the [-queries filename] file?
>
> Regards,
>
> Gal
>
>


Prune

Posted by Gal Nitzan <gn...@usa.net>.
Hi,

Does anyone know how to use the prune option?

What should be in the [-queries filename] file?

Regards,

Gal

Re: Whole web search depth

Posted by Michael Nebel <mi...@nebel.de>.
Hi Paul,

just run the "generate - fetch - updatedb" loop as often as you want.  :-)

Perhaps "depth" is a misleading name for the parameter and causes the 
confusion. Depth does not mean that the crawler follows one link to a 
depth of x and then takes the next link. It means the number of times 
the "generate - fetch - updatedb" loop is run. Just take a look at the 
output of the crawl. The result of running the loop x times is (should 
be) the same as if you had followed each link to a depth of x.
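
For example, a crawl with a depth of 3 corresponds roughly to running these 
commands three times (a sketch only; the 0.7-style command names and the 
db/segments paths are assumptions, so adjust them to your installation):

 bin/nutch generate db segments
 s=`ls -d segments/2* | tail -1`
 bin/nutch fetch $s
 bin/nutch updatedb db $s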

Regards

	Michael

Paul Williams wrote:

> Hi,
> 
>  
> 
> I'm fairly new to using Nutch and so this is probably a newbie question
> (I've already looked in the mailing lists and can't see an answer).
> 
>  
> 
> I'm trying to do a web search (limited to around 10 sites at the moment)
> but I'm unsure on how to set the depth of searching.  How is this done?
> 
>  
> 
>  
> 
> Cheers.
> 
> 


-- 
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/