Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2006/01/12 16:10:54 UTC
NutchQuery adding non required Terms
Hi,
I would love to build a Nutch Query object via the API instead of
using the query parser.
In my case I need the complete set of boolean operators in the query:
required (AND), non required (OR), and prohibited (NOT) terms.
I notice that in general it would be possible to add such a clause to
the Query object, since the BasicQuery filter just copies the
isRequired and isProhibited parameters.
However, the clauses ArrayList is private and there is no method in
the Nutch Query object that allows adding custom terms or clauses
with isRequired and isProhibited.
Did I miss something in general to be able to support non required
terms in nutch?
Would people agree to adding a small method that allows adding terms
with these parameters?
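
To make the request concrete, here is a minimal sketch of the kind of
method I mean. SketchQuery, Clause, and addTerm are hypothetical names
made up for illustration, not the existing Nutch classes or signatures;
the only point is that a clause carries a required flag and a
prohibited flag, and that a term with both flags false is optional.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a query object; NOT the real Nutch Query class.
class SketchQuery {

    /** One term plus the two boolean flags discussed above. */
    static final class Clause {
        final String term;
        final boolean required;
        final boolean prohibited;

        Clause(String term, boolean required, boolean prohibited) {
            // A clause that is both required and prohibited can never match.
            if (required && prohibited) {
                throw new IllegalArgumentException(
                    "a clause cannot be both required and prohibited");
            }
            this.term = term;
            this.required = required;
            this.prohibited = prohibited;
        }

        @Override
        public String toString() {
            // "+" marks required, "-" marks prohibited, no prefix means optional.
            String prefix = required ? "+" : (prohibited ? "-" : "");
            return prefix + term;
        }
    }

    // Kept private, as in the situation described above.
    private final List<Clause> clauses = new ArrayList<>();

    /** The proposed addition: the caller chooses both flags. */
    void addTerm(String term, boolean required, boolean prohibited) {
        clauses.add(new Clause(term, required, prohibited));
    }

    /** Read-only view, so callers cannot bypass addTerm. */
    List<Clause> getClauses() {
        return Collections.unmodifiableList(clauses);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Clause c : clauses) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

With something like this, a caller could build a query with one
required, one optional, and one prohibited term through three addTerm
calls, without going through the query parser.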
Thanks for any comments.
Stefan
Re: NutchQuery adding non required Terms
Posted by Doug Cutting <cu...@nutch.org>.
Stefan Groschupf wrote:
> I would love to add non required terms and nesting to the Query object
> API. I will also provide some unit tests, but since I'm not a javacc
> geek it will only extend the Java API, not the query parser.
> Would such an extension be welcome?
I think we should start with just adding non-required terms, and leave
nesting as a subsequent step.
I also agree that we can leave this out of the query parser as a start.
Doug
Re: NutchQuery adding non required Terms
Posted by Stefan Groschupf <sg...@media-style.com>.
Thanks for the hint.
I would love to add non required terms and nesting to the Query
object API. I will also provide some unit tests, but since I'm not a
javacc geek it will only extend the Java API, not the query parser.
Would such an extension be welcome?
Stefan
On 12.01.2006 at 18:29, Doug Cutting wrote:
> Stefan Groschupf wrote:
>> Did I miss something in general to be able to support non
>> required terms in nutch?
>
> I left OR and nesting out of the API to simplify what query filters
> have to process. Nutch's query features are approximately what
> Google supported for its first three years. (Google did not add OR
> until 2000, I think.)
>
> If we permit optional clauses then we need to make sure that each
> query filter can handle them correctly.
>
> For example, the query "+A +B" is translated by query-basic into
> something like:
>
> +(title:a OR content:a OR anchors:a OR url:a OR host:a)
> +(title:b OR content:b OR anchors:b OR url:b OR host:b)
> title:"a b"~999
> content:"a b"~999
> anchors:"a b"~999
> url:"a b"~999
> host:"a b"~999
>
> The query "+A B" (where B is optional) should remove the plus in
> the second line above. So it should not be too hard to change
> query-basic to be able to handle optional terms in the default
> field. Perhaps that's the only query filter that would need to be
> updated. And it looks like LuceneQueryOptimizer already checks
> that filterized clauses are required.
>
> It would be good to have some unit tests for query filtering.
>
> Doug
>
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
Re: NutchQuery adding non required Terms
Posted by Doug Cutting <cu...@nutch.org>.
Stefan Groschupf wrote:
> Did I miss something in general to be able to support non required
> terms in nutch?
I left OR and nesting out of the API to simplify what query filters have
to process. Nutch's query features are approximately what Google
supported for its first three years. (Google did not add OR until 2000,
I think.)
If we permit optional clauses then we need to make sure that each query
filter can handle them correctly.
For example, the query "+A +B" is translated by query-basic into
something like:
+(title:a OR content:a OR anchors:a OR url:a OR host:a)
+(title:b OR content:b OR anchors:b OR url:b OR host:b)
title:"a b"~999
content:"a b"~999
anchors:"a b"~999
url:"a b"~999
host:"a b"~999
The query "+A B" (where B is optional) should remove the plus in the
second line above. So it should not be too hard to change query-basic
to be able to handle optional terms in the default field. Perhaps
that's the only query filter that would need to be updated. And it
looks like LuceneQueryOptimizer already checks that filterized clauses
are required.
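
As a rough illustration of the expansion described above, the sketch
below builds the same lines as plain strings. BasicExpansionSketch and
expand are hypothetical names, not the query-basic plugin's actual
code; the field list and the ~999 phrase slop are taken from the
example above, and terms are assumed to be lowercased already. The
only behavior it tries to show is that an optional term loses the
leading plus on its per-term group.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the query-basic plugin.
class BasicExpansionSketch {

    // Default fields, in the order used in the example above.
    static final String[] FIELDS = {"title", "content", "anchors", "url", "host"};

    /**
     * Expands default-field terms into one OR group per term, followed by
     * one sloppy phrase clause per field when there is more than one term.
     * required[i] decides whether the group for terms[i] gets a leading "+".
     */
    static List<String> expand(String[] terms, boolean[] required) {
        if (terms.length != required.length) {
            throw new IllegalArgumentException("terms and required must align");
        }
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            // Required terms keep the plus; optional terms drop it.
            StringBuilder group = new StringBuilder(required[i] ? "+(" : "(");
            for (int f = 0; f < FIELDS.length; f++) {
                if (f > 0) {
                    group.append(" OR ");
                }
                group.append(FIELDS[f]).append(':').append(terms[i]);
            }
            lines.add(group.append(')').toString());
        }
        if (terms.length > 1) {
            // Optional phrase clauses over all terms, one per field.
            String phrase = String.join(" ", terms);
            for (String field : FIELDS) {
                lines.add(field + ":\"" + phrase + "\"~999");
            }
        }
        return lines;
    }
}
```

Calling expand with terms {"a", "b"} and flags {true, false} gives the
"+A B" case: the group for a keeps its plus, the group for b does not,
and the five phrase lines are unchanged.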
It would be good to have some unit tests for query filtering.
Doug