Posted to user@nutch.apache.org by prashant_nutch <pr...@in.v2solutions.com> on 2007/03/30 08:54:13 UTC
Help on Activation of Subcollection at Indexing & searching
Is subcollection useful for searching specific URLs?
How do we activate subcollection at indexing and searching time?
In conf/subcollection,
if we include our URLs in the whitelist, can we then search only on those URLs?
Command for searching on a subcollection:
subcollection:<name of subcollection> <word for specific URL>
<?xml version="1.0" encoding="UTF-8"?>
<subcollections>
<subcollection>
<name>nutch</name>
<id>nutch</id>
<whitelist>
http://lucene.apache.org/nutch/
http://wiki.apache.org/nutch/
</whitelist>
<blacklist />
</subcollection>
</subcollections>
Can anybody explain how the overall thing should work?
Is it useful for searching specific URLs? (We are using Nutch 0.8.1.)
--
View this message in context: http://www.nabble.com/Help-on-Activation-of-Subcollection-at-Indexing---searching-tf3490590.html#a9748196
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Help on Activation of Subcollection at Indexing & searching
Posted by Enis Soztutar <en...@gmail.com>.
prashant_nutch wrote:
> Hi,
> Thanks for your early response.
> I finally got search results using subcollection, but there are still some issues.
> 1. Can we search on two or more subcollections at the same time?
> The command looks like
> subcollection:<subcollection name1> <term for search> .......
>
> Can we extend this as subcollection:<subcollection name1> <term for search>
> <subcollection name2> <term for search2>,
> or how can we achieve this?
>
Actually you can, but it requires a little work. Nutch parses the query
with a predefined syntax using JavaCC-generated classes, namely
NutchAnalysis.java and NutchAnalysis.jj (also see Query.parse()).
Unfortunately the query syntax does not allow multiple terms for a
field, and it does not include a boolean OR operation. So a query like
<query_term> <field1>:<term1>, <term2>
is not possible, nor is a query like
<query_term> (<field1>:<term1> OR <field1>:<term2>)
So for your case, you can add this functionality to NutchAnalysis and
share it with the community, so that Nutch gains this wanted feature.
Alternatively, you can add the clauses to the Query object
programmatically if you know the field a priori.
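For the specific case of searching two subcollections at once, a
configuration-only workaround may be enough. The following is only a
sketch, not something taken from the plugin's readme: the names a, b and
a_or_b and the example.org URLs are made up for illustration, and it
assumes that a document can carry the names of several matching
subcollections in its subcollection field. The idea is to define a third
subcollection whose whitelist is the union of the other two:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<subcollections>
<!-- hypothetical subcollection "a" -->
<subcollection>
<name>a</name>
<id>a</id>
<whitelist>
http://a.example.org/
</whitelist>
<blacklist />
</subcollection>
<!-- hypothetical subcollection "b" -->
<subcollection>
<name>b</name>
<id>b</id>
<whitelist>
http://b.example.org/
</whitelist>
<blacklist />
</subcollection>
<!-- hypothetical union of a and b, standing in for "a OR b" -->
<subcollection>
<name>a_or_b</name>
<id>a_or_b</id>
<whitelist>
http://a.example.org/
http://b.example.org/
</whitelist>
<blacklist />
</subcollection>
</subcollections>
```

If that assumption holds, then after indexing with this file a query such
as
<term> subcollection:a_or_b
should behave like a search over both a and b, without any change to the
query syntax.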
> 2. If we want to add URLs to a subcollection after crawling, remove URLs
> from a subcollection, or merge two subcollections, do we have to do a new
> crawl each time?
>
> Can we manage our subcollections according to our requirements without
> recrawling? (For example, with subcollections A and B, we now want to move
> some URLs from A into B.)
>
Like the above, this is not an issue of subcollection but of Lucene
itself. All the subcollection indexing extension does is add a
subcollection field to the document, whose possible values are the
subcollection names. Thus you can do all the operations on the index as
you like. I suggest you learn more about Lucene by reading its wiki or
one of the books. You can also check out Solr, which manages the index
more dynamically.
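As a concrete illustration of the A and B case, here is a hedged sketch
of what conf/subcollection.xml might look like after moving one URL
prefix from A to B. The names A and B come from the question; the
example.org URLs are invented. My understanding is that the
subcollection field is filled in by the indexing filter, so re-running
the indexing step over the existing segments should be enough and no new
fetch should be needed, but please verify this on your version.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<subcollections>
<subcollection>
<name>A</name>
<id>A</id>
<whitelist>
http://www.example.org/docs/
</whitelist>
<blacklist />
</subcollection>
<subcollection>
<name>B</name>
<id>B</id>
<whitelist>
http://www.example.org/blog/
http://www.example.org/news/
</whitelist>
<!-- hypothetical: /news/ used to be listed under A and was moved here -->
<blacklist />
</subcollection>
</subcollections>
```

Documents in an index built before the change keep their old
subcollection values until they are indexed again.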
Re: Help on Activation of Subcollection at Indexing & searching
Posted by prashant_nutch <pr...@in.v2solutions.com>.
Hi,
Thanks for your early response.
I finally got search results using subcollection, but there are still some issues.
1. Can we search on two or more subcollections at the same time?
The command looks like
subcollection:<subcollection name1> <term for search> .......
Can we extend this as subcollection:<subcollection name1> <term for search>
<subcollection name2> <term for search2>,
or how can we achieve this?
2. If we want to add URLs to a subcollection after crawling, remove URLs
from a subcollection, or merge two subcollections, do we have to do a new
crawl each time?
Can we manage our subcollections according to our requirements without
recrawling? (For example, with subcollections A and B, we now want to move
some URLs from A into B.)
--
View this message in context: http://www.nabble.com/Help-on-Activation-of-Subcollection-at-Indexing---searching-tf3490590.html#a9786370
Re: Help on Activation of Subcollection at Indexing & searching
Posted by Enis Soztutar <en...@gmail.com>.
prashant_nutch wrote:
> Thanks for your valuable comments on subcollection,
> but I still have some issues.
> 1. Enabling subcollection in nutch-site.xml takes effect at crawl time; is
> it possible to apply it directly on the index (that is, at search time)?
>
Nutch plugins can implement several extension points. Subcollection
implements both the IndexingFilter extension point, so that
subcollections are inserted in the index, and the QueryFilter extension
point, so that you can search in the subcollection field. This means
that if you enable the subcollection plugin in nutch-site.xml, both
indexing and querying on the subcollection field are enabled.
> 2. In your message, can you explain the comment
> "subcollection also includes a query plugin"?
>
By enabling the subcollection plugin, you can search in the
subcollection field. For example:
<term1> subcollection:<term2>
> I followed the steps you mentioned,
> but when I execute a command like
>
> subcollection:<name of subcollection> <word for search>
> I still get 0 hits......
>
You should open your indexes in Luke or lucli and check whether the URLs
are indexed correctly.
> Can you explain subcollection in more depth? Our aim is to search on
> specific URLs.
>
Check the readme file in the src/plugin/subcollection directory.
> Is there any way other than subcollection?
>
I assume that you want to search on a set of URLs (matching a regular
expression) rather than a single URL. If not, then there is no point in
using subcollection.
Re: Help on Activation of Subcollection at Indexing & searching
Posted by prashant_nutch <pr...@in.v2solutions.com>.
Thanks for your valuable comments on subcollection,
but I still have some issues.
1. Enabling subcollection in nutch-site.xml takes effect at crawl time; is
it possible to apply it directly on the index (that is, at search time)?
2. In your message, can you explain the comment
"subcollection also includes a query plugin"?
I followed the steps you mentioned,
but when I execute a command like
subcollection:<name of subcollection> <word for search>
I still get 0 hits......
Can you explain subcollection in more depth? Our aim is to search on
specific URLs.
Is there any way other than subcollection?
--
View this message in context: http://www.nabble.com/Help-on-Activation-of-Subcollection-at-Indexing---searching-tf3490590.html#a9752653
Re: Help on Activation of Subcollection at Indexing & searching
Posted by Enis Soztutar <en...@gmail.com>.
prashant_nutch wrote:
> Is subcollection useful for searching specific URLs?
> How do we activate subcollection at indexing and searching time?
>
> In conf/subcollection,
> if we include our URLs in the whitelist, can we then search only on those URLs?
> Command for searching on a subcollection:
>
> subcollection:<name of subcollection> <word for specific URL>
>
> <?xml version="1.0" encoding="UTF-8"?>
> <subcollections>
> <subcollection>
> <name>nutch</name>
> <id>nutch</id>
> <whitelist>
> http://lucene.apache.org/nutch/
> http://wiki.apache.org/nutch/
> </whitelist>
> <blacklist />
> </subcollection>
> </subcollections>
>
> Can anybody explain how the overall thing should work?
> Is it useful for searching specific URLs? (We are using Nutch 0.8.1.)
>
Subcollection is a very useful way to group a set of URLs and assign a
label to them. You can use it to limit searching to certain URLs.
You should first enable subcollection in the nutch-site.xml file.
Then you should add collections to the conf/subcollection.xml file.
After indexing, the documents with matching URLs should have the
subcollection field in the index.
After that, since subcollection also includes a query plugin, you can do
searches like
java subcollection:nutch
to limit the search to the nutch collection. You can consult the readme
file in the plugin's directory.
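For the first step, enabling the plugin in nutch-site.xml would look
roughly like the fragment below. This is a sketch under assumptions: the
other entries in the value are only what I would expect a default 0.8
plugin.includes to look like, so keep whatever list your installation
already has and just append subcollection to it.

```xml
<?xml version="1.0"?>
<configuration>
<property>
<name>plugin.includes</name>
<!-- assumed existing plugin list, with "subcollection" appended at the end -->
<value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|subcollection</value>
</property>
</configuration>
```

The same setting has to be in effect both when you index and in the
configuration used by the search application, since the plugin provides
the indexing filter as well as the query filter.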