
Posted to user@nutch.apache.org by Cristina Belderrain <cr...@gmail.com> on 2006/10/05 04:09:58 UTC

Lucene query support in Nutch

Hello,

we all know that Lucene supports, among others, boolean queries. Even
though Nutch is built on Lucene, boolean clauses are removed by Nutch
filters so boolean queries end up as "flat" queries where terms are
implicitly connected by an OR operator, as far as I can see.

Is there any simple way to turn off the filtering so a boolean query
remains as such after it is submitted to Nutch?

Just in case a simple way doesn't exist, Ravi Chintakunta suggests the
following workaround:

"We have to modify the analyzer and add more plugins to Nutch
to use the Lucene's query syntax. Or we have to directly use
Lucene's Query Parser. I tried the second approach by modifying
org.apache.nutch.searcher.IndexSearcher and that seems to work."

Can anyone please elaborate on what Ravi actually means by "modifying
org.apache.nutch.searcher.IndexSearcher"? Which methods are supposed
to be modified and how?

It would be really nice to know how to do this. I believe many other
Nutch users would also benefit from an answer to this question.

Thanks so much,

Cristina

Re: Lucene query support in Nutch

Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.

Hi,

yes, I guess having the full strength of Lucene-based queries would be
nice. That would also answer the boolean-query question I asked a few
days ago :-)

Ravi, doesn't Lucene also allow querying of other fields? Is there any
possibility to add that feature to your proposal?


In general: what is the advantage of the current Nutch parser over the
Lucene-based one?


Regards,
 Stefan

Ravi Chintakunta wrote:
> Hi Cristina,
> 
> You can achieve this by modifying the IndexSearcher to take the query
> String as an argument and then use
> org.apache.lucene.queryParser.QueryParser's parse(String) method to
> parse the query string. The modified method in IndexSearcher would
> look as below:
> 
> public Hits search(String queryString, int numHits,
>                    String dedupField, String sortField, boolean reverse)
>     throws IOException {
> 
>   org.apache.lucene.queryParser.QueryParser parser =
>       new org.apache.lucene.queryParser.QueryParser("content",
>           new org.apache.lucene.analysis.standard.StandardAnalyzer());
> 
>   org.apache.lucene.search.Query luceneQuery = parser.parse(queryString);
> 
>   return translateHits
>      (optimizer.optimize(luceneQuery, luceneSearcher, numHits,
>                          sortField, reverse),
>       dedupField, sortField);
> }
> 
> For this you have to modify the code in search.jsp and NutchBean too,
> so that you are passing on the raw query string to IndexSearcher.
> 
> Note that with this approach, you are limiting the search to the content
> field.
> 
> 
> - Ravi Chintakunta
> 
> 
> 

Re: Lucene query support in Nutch

Posted by Sami Siren <ss...@gmail.com>.

> Nevertheless, I agree that there should be an option to choose the 
> Lucene query engine instead of the Nutch flavour one because Nutch has 
> been proven to be equally suitable for areas which do not require as 
> efficient queries (like intranet crawling for instance) as an all-out 
> web indexing application.

I agree, too. Different query parsers could perhaps be made pluggable, or
at least configurable. Something like the current implementation could
remain the default, and by configuration one could switch to an
"intranet" mode.

Contributions anyone?
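
A minimal sketch of what such a configurable switch might look like. All
names here (QueryParserPlugin, QueryParserFactory, the property key
"searcher.query.parser" and its "intranet" value) are hypothetical and
are not existing Nutch classes or settings; the two parser classes are
stand-ins that only tag the query so the selection logic is visible.

```java
import java.util.Properties;

/** Hypothetical abstraction: turns a raw query string into a parsed form. */
interface QueryParserPlugin {
    String describe(String rawQuery);
}

/** Stand-in for the current Nutch-style parser, kept as the default. */
class NutchStyleParser implements QueryParserPlugin {
    public String describe(String rawQuery) {
        return "nutch:" + rawQuery;
    }
}

/** Stand-in for a parser that would delegate to Lucene's query syntax. */
class LuceneStyleParser implements QueryParserPlugin {
    public String describe(String rawQuery) {
        return "lucene:" + rawQuery;
    }
}

class QueryParserFactory {
    /** Hypothetical property name; not an existing Nutch setting. */
    static final String KEY = "searcher.query.parser";

    /** Returns the Lucene-style parser only when "intranet" mode is set. */
    static QueryParserPlugin fromConfig(Properties conf) {
        String mode = conf.getProperty(KEY, "default");
        if ("intranet".equals(mode)) {
            return new LuceneStyleParser();
        }
        return new NutchStyleParser();
    }
}
```

With no property set, the factory falls back to the Nutch-style parser,
so existing installations would keep their current behaviour.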

--
  Sami Siren

Re: Lucene query support in Nutch

Posted by Bill Goffe <go...@oswego.edu>.

Tomi said:

> In conclusion, my position is pragmatic: I welcome the simplest
> solution to implement the "or" search. I just believe that it'd be
> easiest to do that extending the nutch Analyzer.

This seems like a very reasonable approach. I too would very much like
OR. It would also be nice if it worked in 0.7.2 and I could drop it in,
but that may be asking for too much.

         - Bill

-- 
         *------------------------------------------------------*
         | Bill Goffe                 goffe@oswego.edu          |
         | Department of Economics    voice: (315) 312-3444     |
         | SUNY Oswego                fax:   (315) 312-5444     |
         | 416 Mahar Hall             <http://cook.rfe.org>     |          
         | Oswego, NY  13126                                    |
*--------*------------------------------------------------------*-----------*
| "Been there. Done that."                                                  |
|   -- Ed Viesturs as he looked up Mount Everest. He climbed it five times, |
|      twice without oxygen. He now plans to be the first American to scale |
|      all of the world's 8,000 meter mountains. "Climber for the Ages Has  |
|      Next Peak in View," New York Times, 2/13/00.                         |
*---------------------------------------------------------------------------*

Re: Lucene query support in Nutch

Posted by Tomi NA <he...@gmail.com>.

2006/10/10, Cristina Belderrain <cr...@gmail.com>:
> On 10/9/06, Tomi NA <he...@gmail.com> wrote:
>
> > This is *exactly* what I was thinking. Like Stefan, I believe the
> > nutch analyzer is a good foundation and should therefore be extended
> > to support the "or" operator, and possibly additional capabilities
> > when the need arises.
> >
> > t.n.a.
>
> Tomi, why would you extend Nutch's analyzer when Lucene's analyzer,
> which does exactly what you want, is already there?

Stefan has basically answered that question, but my opinion is
that Nutch's analyzer does its job well and lacks only one obvious
query capability: the "or" search. The fact that several users here
need this kind of functionality suggests it's not the beginning of a
landslide of new required capabilities. Lucene's analyzer, on the
other hand, is completely inadequate in this respect if search is
necessarily bound to a single (content) field.
In conclusion, my position is pragmatic: I welcome the simplest
solution to implement the "or" search. I just believe that it'd be
easiest to do that extending the nutch Analyzer.

t.n.a.

Re: Lucene query support in Nutch

Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.

Cristina Belderrain wrote:
> On 10/9/06, Tomi NA <he...@gmail.com> wrote:
> 
>> This is *exactly* what I was thinking. Like Stefan, I believe the
>> nutch analyzer is a good foundation and should therefore be extended
>> to support the "or" operator, and possibly additional capabilities
>> when the need arises.
>>
>> t.n.a.
> 
> Tomi, why would you extend Nutch's analyzer when Lucene's analyzer,
> which does exactly what you want, is already there?

From what I have understood so far in this thread, the Nutch
analyser/query-whatever seems to be more targeted and provides
additional features regarding distributed search, as well as maybe
speed improvements due to its nature etc. (Correct me if I'm wrong.)

One idea that has come up was to offer both as alternatives, so you could
use Lucene-based queries if you need their features on the one hand but
can live with the restrictions on the other.

However, from what has been mentioned so far, it seems that
Lucene queries can by default only search the document content (is that
right?), not e.g. site:www.example.org. Hmm ...
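
For what it's worth, Lucene's query syntax does let a term name its field
explicitly ("field:term"); the field passed to the QueryParser constructor
is only the default for unqualified terms. The sketch below illustrates
that resolution rule in plain Java. DefaultFieldSketch is a hypothetical
stand-in, not Lucene code, and whether a query such as
site:www.example.org then matches anything depends on how the site field
was indexed and analyzed, which this sketch does not address.

```java
import java.util.ArrayList;
import java.util.List;

class DefaultFieldSketch {
    /**
     * Splits a query on whitespace and qualifies every bare term with the
     * default field; terms already written as field:term are kept as-is.
     */
    static List<String> resolve(String query, String defaultField) {
        List<String> clauses = new ArrayList<String>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty query yields no clauses
            }
            int colon = token.indexOf(':');
            boolean hasField = colon > 0 && colon < token.length() - 1;
            clauses.add(hasField ? token : defaultField + ":" + token);
        }
        return clauses;
    }
}
```

Calling resolve("nutch site:www.example.org", "content") qualifies only
the first term with the content field and leaves the site clause alone.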

PS: Thank you all for help offered so far in this thread on how to get
Lucene-queries going. Unfortunately I couldn't make much use of "just
simply extend it here and there ..." :-(

Regards,
 Stefan

Re: Lucene query support in Nutch

Posted by Cristina Belderrain <cr...@gmail.com>.

On 10/9/06, Tomi NA <he...@gmail.com> wrote:

> This is *exactly* what I was thinking. Like Stefan, I believe the
> nutch analyzer is a good foundation and should therefore be extended
> to support the "or" operator, and possibly additional capabilities
> when the need arises.
>
> t.n.a.

Tomi, why would you extend Nutch's analyzer when Lucene's analyzer,
which does exactly what you want, is already there?

Regards,

Cristina

Re: Lucene query support in Nutch

Posted by Tomi NA <he...@gmail.com>.

2006/10/8, Stefan Neufeind <ap...@stefan-neufeind.de>:

> if it's not the full feature-set, maybe most people could live with it.
> But basic boolean queries I think were the root for this topic. Is there
> an "easier" way to allow this in Nutch as well instead of throwing quite
> a bit away and using the Lucene-syntax? As has just been pointed out: It

This is *exactly* what I was thinking. Like Stefan, I believe the
nutch analyzer is a good foundation and should therefore be extended
to support the "or" operator, and possibly additional capabilities
when the need arises.

t.n.a.

Re: Lucene query support in Nutch

Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.

Björn Wilmsmann wrote:
> 
> On 07.10.2006 at 17:40, Cristina Belderrain wrote:
> 
>> Let me remind you that all this must be done just to provide something
>> that's already there: Nutch is built on top of Lucene, after all. If
>> it's hard to understand why Lucene's capabilities were simply
>> neutralized in Nutch, it's even harder to figure out why no choice was
>> left to users by means of some configuration file.
> 
> I think this issue is rooted in the underlying philosophy of Nutch:
> Nutch was designed with the idea of a possible Google(and the
> likes)-sized crawler and indexer in mind. Regular expressions and
> wildcard queries do not seem to fit into this philosophy, as such
> queries would be way less efficient on a huge data set than simple
> boolean queries.
> 
> Nevertheless, I agree that there should be an option to choose the
> Lucene query engine instead of the Nutch flavour one because Nutch has
> been proven to be equally suitable for areas which do not require as
> efficient queries (like intranet crawling for instance) as an all-out
> web indexing application.

Hi,

Even if it's not the full feature set, maybe most people could live with
it. But basic boolean queries, I think, were what started this topic. Is
there an "easier" way to allow them in Nutch without throwing quite a bit
away and switching to the Lucene syntax? As has just been pointed out, it
seems quite a few things need to be changed to use Lucene search
instead of Nutch search. I don't think that's needed in most cases.
But I see several cases where a boolean query would make sense.

(Currently I fetch up to 10,000 or so results using OpenSearch and
filter them in a script myself, since no "AND (site:... or site:...)" is
possible yet.)
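
The client-side filtering described above could be sketched roughly like
this. It assumes the result URLs have already been fetched and extracted
from the OpenSearch response; SiteFilter and its method names are
hypothetical helpers, not part of Nutch.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SiteFilter {
    /** Keeps only the result URLs whose host is in the allowed set. */
    static List<String> keepSites(List<String> resultUrls,
                                  Set<String> allowedHosts) {
        List<String> kept = new ArrayList<String>();
        for (String url : resultUrls) {
            String host = hostOf(url);
            if (host != null && allowedHosts.contains(host)) {
                kept.add(url);
            }
        }
        return kept;
    }

    /** Returns the lower-cased host, or null if the URL cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null; // malformed URL: treat as "no host"
        }
    }
}
```

This emulates "(site:a or site:b)" after the fact, at the cost of
fetching many results that are then thrown away.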

Regards,
 Stefan

Re: Lucene query support in Nutch

Posted by Björn Wilmsmann <bj...@wilmsmann.de>.

Hi,

On 07.10.2006 at 17:40, Cristina Belderrain wrote:

> Let me remind you that all this must be done just to provide something
> that's already there: Nutch is built on top of Lucene, after all. If
> it's hard to understand why Lucene's capabilities were simply
> neutralized in Nutch, it's even harder to figure out why no choice was
> left to users by means of some configuration file.

I think this issue is rooted in the underlying philosophy of Nutch:  
Nutch was designed with the idea of a possible Google(and the likes)- 
sized crawler and indexer in mind. Regular expressions and wildcard  
queries do not seem to fit into this philosophy, as such queries  
would be way less efficient on a huge data set than simple boolean  
queries.
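
A rough illustration of the cost difference, assuming a sorted term
dictionary: an exact term needs a single lookup, whereas a prefix
wildcard such as "nut*" has to enumerate every matching term before the
search can run. WildcardCostSketch is a hypothetical toy in plain Java,
not Lucene's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

class WildcardCostSketch {
    /** Exact term: one membership probe in the term dictionary. */
    static boolean hasTerm(SortedSet<String> terms, String term) {
        return terms.contains(term);
    }

    /** Prefix wildcard: every term starting with the prefix is collected. */
    static List<String> expandPrefix(SortedSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<String>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so no later term can match
            }
            matches.add(term);
        }
        return matches;
    }
}
```

On a web-scale index the list returned by expandPrefix can become very
long, and each entry adds one more term to search for.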

Nevertheless, I agree that there should be an option to choose the  
Lucene query engine instead of the Nutch flavour one because Nutch  
has been proven to be equally suitable for areas which do not require  
as efficient queries (like intranet crawling for instance) as an all- 
out web indexing application.

--
Best regards,
Björn Wilmsmann

Re: Lucene query support in Nutch

Posted by Cristina Belderrain <cr...@gmail.com>.

Hello,

I just would like to confirm that the version of the search() method
shown in the previous post works fine, at least regarding boolean
queries. Anyway, I see no reason why it wouldn't work with any other
Lucene query (fuzzy, proximity, etc.).

Now, please be warned that the inclusion of this new method in
IndexSearcher has quite an impact on some other classes: besides
NutchBean, where you'll need to add the wrapper methods that will
allow its use there, you'll also need to add the new method signature
to the Searcher interface, which is implemented by IndexSearcher.

Since DistributedSearch implements the Searcher interface as well,
you'll need to provide a method with the new signature there, too.
Besides, depending on your needs, Summarizer and Query will demand
some changes in order to preserve phrases (composite search terms)
when they are highlighted in the summary.
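
The ripple effect can be sketched with simplified stand-ins. The types
below are hypothetical: the real Searcher interface has more methods,
returns Hits, declares IOException, and the new method takes the five
arguments shown earlier in the thread. The point is only that every
implementor has to supply the new signature, and that the distributed
one would forward the raw query string to its nodes.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the Searcher interface with the new raw-string method. */
interface RawQuerySearcher {
    List<String> search(String queryString, int numHits);
}

/** Stand-in for IndexSearcher: naive substring match over local docs. */
class LocalSearcherSketch implements RawQuerySearcher {
    private final List<String> docs;

    LocalSearcherSketch(List<String> docs) {
        this.docs = docs;
    }

    public List<String> search(String queryString, int numHits) {
        List<String> hits = new ArrayList<String>();
        for (String doc : docs) {
            if (hits.size() < numHits && doc.contains(queryString)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}

/** Stand-in for the distributed client: forwards the raw query string. */
class DistributedSearcherSketch implements RawQuerySearcher {
    private final List<RawQuerySearcher> nodes;

    DistributedSearcherSketch(List<RawQuerySearcher> nodes) {
        this.nodes = nodes;
    }

    public List<String> search(String queryString, int numHits) {
        List<String> merged = new ArrayList<String>();
        for (RawQuerySearcher node : nodes) {
            for (String hit : node.search(queryString, numHits)) {
                if (merged.size() < numHits) {
                    merged.add(hit);
                }
            }
        }
        return merged;
    }
}
```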

Let me remind you that all this must be done just to provide something
that's already there: Nutch is built on top of Lucene, after all. If
it's hard to understand why Lucene's capabilities were simply
neutralized in Nutch, it's even harder to figure out why no choice was
left to users by means of some configuration file.

Regards,

Cristina

Re: Lucene query support in Nutch

Posted by Cristina Belderrain <cr...@gmail.com>.

Hi Björn,

yes, the error you point out will happen indeed... A possible
workaround would be:

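    public Hits search(String queryString, int numHits,
        String dedupField, String sortField, boolean reverse)
        throws IOException {

        // Parse the raw query string with Lucene's own parser; unqualified
        // terms are matched against the "content" field.
        org.apache.lucene.queryParser.QueryParser parser =
            new org.apache.lucene.queryParser.QueryParser("content",
                new org.apache.lucene.analysis.standard.StandardAnalyzer());

        org.apache.lucene.search.Query luceneQuery;
        try {
            luceneQuery = parser.parse(queryString);
        } catch (org.apache.lucene.queryParser.ParseException ex) {
            // Report a malformed query instead of continuing with a null query.
            throw new IOException("Cannot parse query '" + queryString
                + "': " + ex.getMessage());
        }

        // optimize() only accepts a BooleanQuery, so wrap the parsed query
        // in one as a single required clause.
        org.apache.lucene.search.BooleanQuery boolQuery =
            new org.apache.lucene.search.BooleanQuery();
        boolQuery.add(luceneQuery,
            org.apache.lucene.search.BooleanClause.Occur.MUST);

        return translateHits(
            optimizer.optimize(boolQuery, luceneSearcher, numHits,
                sortField, reverse),
            dedupField, sortField);
    }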

Please notice that I'm not sure this will work as it should: right
now, it just compiles... I still need to modify the NutchBean class so
it can pass on the raw query, as Ravi says.
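
The typing issue behind this workaround can be shown with plain-Java
stand-ins. SketchQuery, SketchBooleanQuery and WrapSketch are
hypothetical and only mimic the relationship between Lucene's Query and
BooleanQuery. The sketch also reuses the parsed query when it is already
of the boolean subtype; that is an optional refinement, not something
the method above does.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for a generic parsed query. */
class SketchQuery {
    private final String text;

    SketchQuery(String text) {
        this.text = text;
    }

    public String toString() {
        return text;
    }
}

/** Stand-in for a boolean query holding required clauses. */
class SketchBooleanQuery extends SketchQuery {
    private final List<SketchQuery> required = new ArrayList<SketchQuery>();

    SketchBooleanQuery() {
        super("");
    }

    void addRequired(SketchQuery clause) {
        required.add(clause);
    }

    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (SketchQuery q : required) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("+(").append(q).append(')');
        }
        return sb.toString();
    }
}

class WrapSketch {
    /** Accepts only the boolean subtype, like the optimizer in the thread. */
    static String optimize(SketchBooleanQuery query) {
        return "optimized[" + query + "]";
    }

    /** Reuses a boolean query as-is, otherwise wraps it as one clause. */
    static SketchBooleanQuery asBoolean(SketchQuery parsed) {
        if (parsed instanceof SketchBooleanQuery) {
            return (SketchBooleanQuery) parsed;
        }
        SketchBooleanQuery wrapper = new SketchBooleanQuery();
        wrapper.addRequired(parsed);
        return wrapper;
    }
}
```

Passing a plain SketchQuery straight to optimize() would not compile,
which is the error Björn pointed out; asBoolean() is what makes the call
type-check.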

Regards,

Cristina


On 10/5/06, Björn Wilmsmann <bj...@wilmsmann.de> wrote:

> Hi everybody,
>
>
> On 05/10/2006 05:44 Ravi Chintakunta wrote:
>
> > public Hits search(String queryString, int numHits,
> >                    String dedupField, String sortField, boolean reverse)
> >     throws IOException {
> >
> >   org.apache.lucene.queryParser.QueryParser parser =
> >       new org.apache.lucene.queryParser.QueryParser("content",
> >           new org.apache.lucene.analysis.standard.StandardAnalyzer());
> >
> >   org.apache.lucene.search.Query luceneQuery = parser.parse(queryString);
> >
> >   return translateHits
> >      (optimizer.optimize(luceneQuery, luceneSearcher, numHits,
> >                          sortField, reverse),
> >       dedupField, sortField);
> > }
>
> This seems to be a good approach. I have not yet tried it out in
> detail, however, the method optimize() in LuceneQueryOptimizer does
> only take BooleanQuery as an argument, so the line 'return
> translateHits...'  would cause a compile error, wouldn't it?
>
>
> --
> Best regards,
> Björn Wilmsmann

Re: Lucene query support in Nutch

Posted by Björn Wilmsmann <bj...@wilmsmann.de>.

Hi everybody,


On 05/10/2006 05:44 Ravi Chintakunta wrote:

> public Hits search(String queryString, int numHits,
>                    String dedupField, String sortField, boolean reverse)
>     throws IOException {
>
>   org.apache.lucene.queryParser.QueryParser parser =
>       new org.apache.lucene.queryParser.QueryParser("content",
>           new org.apache.lucene.analysis.standard.StandardAnalyzer());
>
>   org.apache.lucene.search.Query luceneQuery = parser.parse(queryString);
>
>   return translateHits
>      (optimizer.optimize(luceneQuery, luceneSearcher, numHits,
>                          sortField, reverse),
>       dedupField, sortField);
> }

This seems to be a good approach. I have not yet tried it out in  
detail, however, the method optimize() in LuceneQueryOptimizer does  
only take BooleanQuery as an argument, so the line 'return  
translateHits...'  would cause a compile error, wouldn't it?


--
Best regards,
Björn Wilmsmann

Re: Lucene query support in Nutch

Posted by Ravi Chintakunta <ra...@gmail.com>.

Hi Cristina,

You can achieve this by modifying the IndexSearcher to take the query
String as an argument and then use
org.apache.lucene.queryParser.QueryParser's parse(String) method to
parse the query string. The modified method in IndexSearcher would
look as below:

public Hits search(String queryString, int numHits,
                   String dedupField, String sortField, boolean reverse)
    throws IOException {

  org.apache.lucene.queryParser.QueryParser parser =
      new org.apache.lucene.queryParser.QueryParser("content",
          new org.apache.lucene.analysis.standard.StandardAnalyzer());

  org.apache.lucene.search.Query luceneQuery = parser.parse(queryString);

  return translateHits
     (optimizer.optimize(luceneQuery, luceneSearcher, numHits,
                         sortField, reverse),
      dedupField, sortField);
}

For this you have to modify the code in search.jsp and NutchBean too,
so that you are passing on the raw query string to IndexSearcher.

Note that with this approach, you are limiting the search to the content field.


- Ravi Chintakunta


