Posted to user@nutch.apache.org by prashant_nutch <pr...@in.v2solutions.com> on 2007/03/30 08:54:13 UTC

Help on Activation of Subcollection at Indexing & searching

Is subcollection useful for searching specific URLs?
How do we activate subcollection at indexing and searching time?

In conf/subcollection,
if we include our URLs in the whitelist, is the search then limited to those URLs?
Is this the command for searching on a subcollection?

subcollection:<name of subcollection> <word for specific URL>


<?xml version="1.0" encoding="UTF-8"?>
<subcollections>
	<subcollection>
		<name>nutch</name>
		<id>nutch</id>
		<whitelist>
			http://lucene.apache.org/nutch/
			http://wiki.apache.org/nutch/
		</whitelist>
		<blacklist />
	</subcollection>
</subcollections>

Can anybody explain how the overall thing should work?
Is it useful for searching specific URLs? (We are using Nutch 0.8.1.)

-- 
View this message in context: http://www.nabble.com/Help-on-Activation-of-Subcollection-at-Indexing---searching-tf3490590.html#a9748196
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Help on Activation of Subcollection at Indexing & searching

Posted by Enis Soztutar <en...@gmail.com>.
prashant_nutch wrote:
> Hi,
> Thanks for your early response.
> Finally I got search results using subcollection, but there are still some issues.
> 1. Can we search on more than one subcollection at the same time?
>    The command is like
>    subcollection:<subcollection name1> <term for search> .......
>  
>  Can we extend this to subcollection:<subcollection name1> <term for search>
> <subcollection name2> <term for search2>,
>  or how can we achieve this?
>
>   
Actually you can, but it requires a little work. Nutch parses the query 
with a predefined syntax using JavaCC-generated classes, namely 
NutchAnalysis.java, generated from NutchAnalysis.jj (also see Query.parse()). 
Unfortunately, the query syntax does not allow multiple 
terms for a field, and it does not include a boolean OR 
operation. So a query like

<query_term> <field1> : <term1>, <term2>

is not possible, and neither is a query like
<query_term> (<field1> :<term1> OR <field1>:<term2>)

So for your case, you can add this functionality to NutchAnalysis and 
share it with the community, so that Nutch gains this much-wanted feature. 
Alternatively, you can add the clauses to the Query object 
programmatically if you know the field a priori.

> 2. In a subcollection, if we want to add URLs after crawling, remove URLs from
> the subcollection, or 
>    merge two subcollections, must we do a new crawl each time?
>
>   Can we manage our subcollections according to our requirements without
> recrawling? (For example, with subcollections A and B, we now want to add
> some URLs from A into B.)
>
>  
>
>   
As above, this is not an issue with subcollection either, but with 
Lucene itself. All the subcollection indexing extension does is add 
a subcollection field to the document, whose possible values are the 
subcollection names. Thus you can perform whatever operations you like 
on the index. I suggest you learn more about Lucene by reading its wiki 
or one of the books. You can also check out Solr, which manages the 
index more dynamically.
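
To make the A/B case from the question concrete, here is a minimal sketch of a subcollection configuration file with two subcollections. The names A and B and the example.com URLs are hypothetical. Since the subcollection field is derived from the URL when a document is indexed, editing a whitelist and re-running the indexing step over the already-fetched segments should, in principle, be enough, with no new crawl. That is an inference from the explanation above and has not been verified on 0.8.1.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: names A and B and the example.com URLs are hypothetical. -->
<subcollections>
	<subcollection>
		<name>A</name>
		<id>A</id>
		<whitelist>
			http://www.example.com/docs/
			http://www.example.com/wiki/
		</whitelist>
		<blacklist />
	</subcollection>
	<subcollection>
		<name>B</name>
		<id>B</id>
		<!-- The /wiki/ entry was copied here from A's whitelist. -->
		<whitelist>
			http://www.example.com/blog/
			http://www.example.com/wiki/
		</whitelist>
		<blacklist />
	</subcollection>
</subcollections>
```

With a file like this, pages under /wiki/ would be expected to match both A and B, so either subcollection:A or subcollection:B should find them once the index has been rebuilt.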




Re: Help on Activation of Subcollection at Indexing & searching

Posted by prashant_nutch <pr...@in.v2solutions.com>.
Hi,
Thanks for your early response.
Finally I got search results using subcollection, but there are still some issues.
1. Can we search on more than one subcollection at the same time?
   The command is like
   subcollection:<subcollection name1> <term for search> .......
 
 Can we extend this to subcollection:<subcollection name1> <term for search>
<subcollection name2> <term for search2>,
 or how can we achieve this?

2. In a subcollection, if we want to add URLs after crawling, remove URLs from
the subcollection, or 
   merge two subcollections, must we do a new crawl each time?

  Can we manage our subcollections according to our requirements without
recrawling? (For example, with subcollections A and B, we now want to add
some URLs from A into B.)

 



Re: Help on Activation of Subcollection at Indexing & searching

Posted by Enis Soztutar <en...@gmail.com>.
prashant_nutch wrote:
> Thanks for your valuable comments on subcollection,
> but I still have some issues.
> 1. Does enabling subcollection in nutch-site.xml apply at crawl time? Is it
> possible to apply it directly on the index (that is, at search time)?
>   
Nutch plugins can implement several extension points. Subcollection 
implements both the IndexingFilter extension point, so that 
subcollections are inserted into the index, and the QueryFilter extension 
point, so that you can search on the subcollection field. This means that 
if you enable the subcollection plugin in nutch-site.xml, both indexing 
and querying on the subcollection field are enabled.
> 2. In your message, can you explain the comment
>   "subcollection also includes a query plugin"?
>   
By enabling the subcollection plugin, you can search on the 
subcollection field. For example:

 <term1> subcollection:<term2>
> I followed the steps you mentioned,
> but when I execute a command like 
>
> subcollection:<name of subcollection> <word for search>
> I still get 0 hits......
>   
You should open your indexes in Luke or lucli and check whether the URLs 
are indexed correctly.
> Can you explain subcollection in more depth? Our aim is to search on
> specific URLs.
>   
Check the readme file in the src/plugin/subcollection directory.
> Is there any other way besides subcollection?
>   
I assume that you want to search on a set of URLs (matching a regular 
expression) rather than a single URL. If not, then there is no point in 
using subcollection.
>
>
>
>   


Re: Help on Activation of Subcollection at Indexing & searching

Posted by prashant_nutch <pr...@in.v2solutions.com>.
Thanks for your valuable comments on subcollection,
but I still have some issues.
1. Does enabling subcollection in nutch-site.xml apply at crawl time? Is it
possible to apply it directly on the index (that is, at search time)?
2. In your message, can you explain the comment
  "subcollection also includes a query plugin"?

I followed the steps you mentioned,
but when I execute a command like 

subcollection:<name of subcollection> <word for search>
I still get 0 hits......
Can you explain subcollection in more depth? Our aim is to search on
specific URLs.
Is there any other way besides subcollection?








Re: Help on Activation of Subcollection at Indexing & searching

Posted by Enis Soztutar <en...@gmail.com>.
prashant_nutch wrote:
> Is subcollection useful for searching specific URLs?
> How do we activate subcollection at indexing and searching time?
>
> In conf/subcollection,
> if we include our URLs in the whitelist, is the search then limited to those URLs?
> Is this the command for searching on a subcollection?
>
> subcollection:<name of subcollection> <word for specific URL>
>
>
> <?xml version="1.0" encoding="UTF-8"?>
> <subcollections>
> 	<subcollection>
> 		<name>nutch</name>
> 		<id>nutch</id>
> 		<whitelist>
> 			http://lucene.apache.org/nutch/
> 			http://wiki.apache.org/nutch/
> 		</whitelist>
> 		<blacklist />
> 	</subcollection>
> </subcollections>
>
> Can anybody explain how the overall thing should work?
> Is it useful for searching specific URLs? (We are using Nutch 0.8.1.)
>
>   
Subcollection is a very useful way to group a set of URLs and 
assign a label to them. You can use it to limit searching to certain URLs.

You should first enable subcollection in the nutch-site.xml file.
Then you should add collections to the conf/subcollections.xml file.
After indexing, the documents with matching URLs should have the 
subcollection field in the index.
After that, since subcollection also includes a query plugin, you can do 
searches like

      java subcollection:nutch

to limit the search to the nutch collection. You can consult the readme 
file in the plugin's directory.
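
The first step above, enabling the plugin, is normally done through the plugin.includes property. The following is only a sketch: the other plugin names in the value are an assumed example list, not taken from this thread, so keep whatever your own nutch-site.xml (or nutch-default.xml) already lists and append subcollection to it.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml (sketch): append "subcollection" to the existing plugin list. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Everything before "|subcollection" is an assumed example list; keep your own. -->
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|subcollection</value>
  </property>
</configuration>
```

Because the subcollection field is written by the indexing filter, documents indexed before the plugin was enabled will most likely not carry it, so the indexing step probably has to be run again before a query such as "java subcollection:nutch" returns hits.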