Posted to user@nutch.apache.org by Gal Nitzan <gn...@usa.net> on 2005/09/15 00:01:53 UTC

Prune

Hi,

Does anyone know how to use the prune option?

What should be in the [-queries filename] file?

Regards,

Gal

Re: Prune

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Yes, every page whose URL matches your filter will be removed.
You can try it with:
 bin/nutch prune /srv/segments/ -dryrun -queries qry.txt -output del.txt
With "-dryrun" nothing is actually deleted. It only shows what would be
deleted without it.
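
A rough sketch of the preview-then-delete routine described above. The variable names and the prune_cmd helper are made up for illustration; only the command line itself comes from this message. It assumes that dropping "-dryrun" is the only change needed for the real run, and it only prints the two commands so that nothing is deleted by accident.

```shell
# Sketch only: print the preview command and the real command side by side.
# Paths and file names are the ones used in the command above.
SEGMENTS=/srv/segments/
QUERIES=qry.txt
REPORT=del.txt

# prune_cmd 1 -> command with -dryrun (preview), prune_cmd 0 -> real run.
prune_cmd() {
  if [ "$1" = "1" ]; then
    echo "bin/nutch prune $SEGMENTS -dryrun -queries $QUERIES -output $REPORT"
  else
    echo "bin/nutch prune $SEGMENTS -queries $QUERIES -output $REPORT"
  fi
}

echo "Step 1 (preview): $(prune_cmd 1)"
echo "Step 2 (delete):  $(prune_cmd 0)"
```

Run step 1 first, read through del.txt, and run step 2 only once the list looks right.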

Gal Nitzan wrote:

> Hi Ferenc,
>
> Thank you for the information; regrettably, I haven't figured it out yet.
>
> Do you mean writing a text file in which every line contains
> +url:sample.com, and it will remove that site from the index?
>
> Thanks,
>
> Gal


Re: Prune

Posted by Gal Nitzan <gn...@usa.net>.
Hi Egor,

Thank you very much! That was what I was looking for :)

Gal

Egor Chernodarov wrote:
> Hello, Gal!
>
> As I understand it, all queries must be in Lucene syntax.
> An example from the mailing list archive:
> ########Example of queries########
> #    delete docs from www.cnn.com
> url:"www cnn com"
>
> #    delete docs that contain "p0rn" in their content,
> #    but not "study" or "research", and which come from www.cnn.com
> content:p0rn -content:(study research) +url:"www cnn com"
>
> #    delete docs in Swahili language
> lang:sw


Re[2]: Prune

Posted by Egor Chernodarov <eg...@zarinsk.dem.ru>.
Hello, Gal!

As I understand it, all queries must be in Lucene syntax.
An example from the mailing list archive:
########Example of queries########
#    delete docs from www.cnn.com
url:"www cnn com"

#    delete docs that contain "p0rn" in their content,
#    but not "study" or "research", and which come from www.cnn.com
content:p0rn -content:(study research) +url:"www cnn com"

#    delete docs in Swahili language
lang:sw
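
For illustration, two of the queries above could be saved to a file and checked like this. It is only a sketch: the file name qry.txt and the active_queries helper are hypothetical, and it assumes (as the example suggests) that lines starting with "#" are comments and that blank lines are ignored.

```shell
# Write two of the example queries to a file (the name is just an illustration).
cat > qry.txt <<'EOF'
# delete docs from www.cnn.com
url:"www cnn com"

# delete docs in Swahili language
lang:sw
EOF

# List the lines that would act as queries, assuming '#' lines are comments
# and blank lines are skipped.
active_queries() {
  grep -v '^[[:space:]]*#' "$1" | grep -v '^[[:space:]]*$'
}

active_queries qry.txt
```

Checking the active lines before a run is a cheap way to catch a query that was left commented out, or a comment that lost its "#".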

Friday, September 16, 2005, 3:42:40 AM, you wrote:

Gal Nitzan> Hi Ferenc,

Gal Nitzan> Thank you for the information; regrettably, I
Gal Nitzan> haven't figured it out yet.

Gal Nitzan> Do you mean writing a text file in which every line contains
Gal Nitzan> +url:sample.com, and it will remove that site from the index?

Gal Nitzan> Thanks,

Gal Nitzan> Gal





-- 
Best regards,               
 Chernodarov Egor


Re: Prune

Posted by Gal Nitzan <gn...@usa.net>.
Hi Ferenc,

Thank you for the information; regrettably, I haven't figured it out yet.

Do you mean writing a text file in which every line contains
+url:sample.com, and it will remove that site from the index?

Thanks,

Gal

yoursoft@freemail.hu wrote:
> e.g.:
> +url:sex-com
>
> You can see how sex.com becomes +url:sex-com with: bin/nutch
> org.apache.nutch.searcher.Query
>
> Regards,
>    Ferenc


Re: Prune

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
e.g.:
+url:sex-com

You can see how sex.com becomes +url:sex-com with: bin/nutch
org.apache.nutch.searcher.Query

Regards,
    Ferenc

Gal Nitzan wrote:

> Hi,
>
> Does anyone know how to use the prune option?
>
> What should be in the [-queries filename] file?
>
> Regards,
>
> Gal
>
>