Posted to java-user@lucene.apache.org by Sean O'Connor <se...@oconeco.com> on 2005/10/20 00:40:25 UTC

SpanQuery parser?

Hello,
    I have user entered search commands which I want to convert to 
SpanQueries. I have seen in the book "Lucene in Action" that no parser 
existed at time of publication, but there was someone working on a 
SpanQuery parser. Can anyone point me to that code, or provide any 
suggestions?

    I want to use SpanQueries for their detail on the number of hits 
from a query, and more importantly, the location (position start and 
end) of each hit. My application requires me to do precise hit 
highlighting.  I also need to perform calculations on the number of hits 
per document, as well as per query (sum of document hits).

    It is fairly critical I highlight the hits, and only the hits. From 
what I've read, SpanQueries (with dumpSpans) are a better approach than 
using 'regular' queries. I _think_ regular queries currently use a 
highlighter which shows all terms highlighted. This can give more 
highlighting than actual hits (i.e., false positives).
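
To make the false-positive concern concrete, here is a minimal sketch in
plain Java. It does not use Lucene or its highlighter; the class and method
names (HitOnlyHighlightSketch, highlight, phraseRanges, termRanges) are
hypothetical, and matching is done by simple substring search rather than
by analyzed tokens. It only contrasts marking every query term with marking
the exact phrase hit.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: plain Java, not Lucene's highlighter API.
final class HitOnlyHighlightSketch {

    /** Wraps each [start, end) character range in ** markers.
     *  Ranges are assumed to be sorted by start and non-overlapping. */
    static String highlight(String text, List<int[]> ranges) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] r : ranges) {
            out.append(text, cursor, r[0]);
            out.append("**").append(text, r[0], r[1]).append("**");
            cursor = r[1];
        }
        out.append(text.substring(cursor));
        return out.toString();
    }

    /** Character ranges of every occurrence of the exact phrase (substring match). */
    static List<int[]> phraseRanges(String text, String phrase) {
        List<int[]> ranges = new ArrayList<>();
        if (phrase.isEmpty()) {
            return ranges; // nothing to match; also avoids an endless loop
        }
        int from = text.indexOf(phrase);
        while (from >= 0) {
            ranges.add(new int[] {from, from + phrase.length()});
            from = text.indexOf(phrase, from + phrase.length());
        }
        return ranges;
    }

    /** Character ranges of every occurrence of any single word, phrase or not. */
    static List<int[]> termRanges(String text, String... words) {
        List<int[]> ranges = new ArrayList<>();
        for (String w : words) {
            ranges.addAll(phraseRanges(text, w));
        }
        ranges.sort((a, b) -> Integer.compare(a[0], b[0]));
        return ranges;
    }
}
```

For the text "the book rocks but that book has no rocks" and the phrase
"book rocks", the phrase-based ranges give
"the **book rocks** but that book has no rocks", while the term-based
ranges give "the **book** **rocks** but that **book** has no **rocks**",
which marks two words that are not part of any phrase hit.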

    So, that being said, should I stick with SpanQueries? Is there any 
current work on a parser to convert a string, or regular (Token, 
Boolean, Phrase, Prefix,...) query into a SpanQuery?

    I have written some very duct tape-ish code which will convert basic 
booleanOR and prefix queries into SpanQueries. I just realized I'm in 
deeper water than I expected when I tried converting my first query 
string containing several boolean queries, AND a phrase query. So now I 
am looking to either help an existing effort, or just continue with my 
own hacking.
Thanks,

Sean
ps: some of this message is repeated from previous postings just as 
background for my goal.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Search deleted docs?

Posted by Cyril Barlow <in...@fantasyfooty.org>.

Can you make a search that searches deleted docs as well as normal docs?

		

Re: SpanQuery parser? Update (ugly hack inside...)

Posted by Paul Elschot <pa...@xs4all.nl>.

On Tuesday 08 November 2005 00:05, Sean O'Connor wrote:
...
> When I looked at using surround queries, I believe I got stuck finding 
> access to the spans information. I was pressed for time, and only looked 

They are in the org.apache.lucene.search.spans package, not in the surround
language.

> it over for a couple of hours. A good portion of that was setup time. I 
> probably missed some of the more obvious points, so I will go back and 
> look over the surround code to see how unencumbered by the thought 
> process I was when I decided to reinvent my wheel.
> 
> >Also, PhraseQuery is more efficient than a combination of SpanTermQuery,
> >SpanOrQuery and SpanNearQuery, so perhaps PhraseQuery should have
> >a getSpans() method so it could be used as a SpanQuery, too.
> >  
> >
> This intrigues me. I know I don't understand Lucene past the basics, so 
> I'm a little lost on the implications of your suggestion.
> My impression was (and still is) that PhraseQuery is a significantly 
> different implementation from the SpanQuery family. Is it rather 
> straightforward to implement getSpans() for a PhraseQuery? Granted, I 
> seem to have missed that Surround is based on SpanQueries, so it looks 
> like I need to spend some more time getting my head straight on these 
> topics.

This is mainly a performance issue, so when SpanNearQuery works
for you, and you can live with the slower performance of SpanNearQuery
over PhraseQuery, you can use the surround parser as it is.
The surround parser does not generate PhraseQueries,
but it will generate SpanNearQueries, and it can also nest
SpanNearQueries.

Implementing getSpans() on PhraseQuery may not be straightforward,
because of the differences in the meanings of the slop for PhraseQueries
and SpanNearQueries.

PhraseQuery is built directly on TermPositions as returned from
an IndexReader. In the SpanQuery family, there is a Spans layer
between the TermPositions of the terms and a SpanNearQuery.
This allows SpanNearQueries to be nested, unlike PhraseQueries.
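
The nesting point can be sketched in plain Java. This is a toy model, not
Lucene code: the names (NestedSpanSketch, termSpans, orderedNear) are
hypothetical, spans are [start, end) token-position pairs within one
document, and "slop" is taken here simply as the gap allowed between the
end of the first span and the start of the second, which is an assumption
and not necessarily Lucene's exact definition. Because the near operator
consumes and produces the same span lists, its output can be fed back in
as an input.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a spans layer over term positions; not Lucene's classes.
final class NestedSpanSketch {

    /** Each term position p becomes the one-token span [p, p + 1). */
    static List<int[]> termSpans(int... positions) {
        List<int[]> spans = new ArrayList<>();
        for (int p : positions) {
            spans.add(new int[] {p, p + 1});
        }
        return spans;
    }

    /** Ordered "near": a span from 'first' followed by a span from 'second'
     *  with at most 'slop' positions between them. The result covers both. */
    static List<int[]> orderedNear(List<int[]> first, List<int[]> second, int slop) {
        List<int[]> out = new ArrayList<>();
        for (int[] a : first) {
            for (int[] b : second) {
                int gap = b[0] - a[1];
                if (gap >= 0 && gap <= slop) {
                    out.add(new int[] {a[0], b[1]});
                }
            }
        }
        return out;
    }
}
```

With "book" at positions 3 and 7 and "rocks" at 4 and 9,
orderedNear(termSpans(3, 7), termSpans(4, 9), 0) yields the single span
[3, 5). Passing that result as the second argument of another orderedNear
call, for example with termSpans(0) and a slop of 2, yields [0, 5). A
matcher that works directly on raw term positions has no such intermediate
result to reuse.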

The current implementation of SpanNearQuery is slower than
PhraseQuery, but the implementation of SpanNearQuery here may 
narrow the gap:
http://issues.apache.org/jira/browse/LUCENE-413
Another way to improve performance might be to implement
a getTermSpans() method directly on the IndexReader.

Regards,
Paul Elschot


Re: SpanQuery parser? Update (ugly hack inside...)

Posted by Sean O'Connor <se...@oconeco.com>.

Paul Elschot wrote:

                  <snip>

>>>The goal(s) I am trying to accomplish is rather specific I think,  
>>>so I imagine the use of my hacking is rather limited (i.e. just to  
>>>me).
>>>
>>>At the moment my code:
>>>
>>>   * parses the search text (i.e. user entered query)
>>>      
>>>
>>Are you using QueryParser?  If so, you'll also want to account for  
>>BooleanQuery, recursively.
>>    
>>
>
>The surround parser can create both boolean queries and span queries.
>
>Sean, as you seem to prefer not to use the surround syntax, do you think
>this syntax could be improved somehow? I recall trying to make it simpler,
>but when I made it I was not able to do so.
>  
>
Paul, I think the surround syntax is very logical, and looks quite 
powerful. From my perspective, it is the right fit syntax for the job. 
My main issue was probably a lack of understanding. I decided to stick 
with the tool I have now gained a little understanding of.

When I looked at using surround queries, I believe I got stuck finding 
access to the spans information. I was pressed for time, and only looked 
it over for a couple of hours. A good portion of that was setup time. I 
probably missed some of the more obvious points, so I will go back and 
look over the surround code to see how unencumbered by the thought 
process I was when I decided to reinvent my wheel.

>Also, PhraseQuery is more efficient than a combination of SpanTermQuery,
>SpanOrQuery and SpanNearQuery, so perhaps PhraseQuery should have
>a getSpans() method so it could be used as a SpanQuery, too.
>  
>
This intrigues me. I know I don't understand Lucene past the basics, so 
I'm a little lost on the implications of your suggestion.
My impression was (and still is) that PhraseQuery is a significantly 
different implementation from the SpanQuery family. Is it rather 
straightforward to implement getSpans() for a PhraseQuery? Granted, I 
seem to have missed that Surround is based on SpanQueries, so it looks 
like I need to spend some more time getting my head straight on these 
topics.

>Regards,
>Paul Elschot
>  
>
Thanks for your comments,

Sean




Re: SpanQuery parser? Update (ugly hack inside...)

Posted by Paul Elschot <pa...@xs4all.nl>.

On Saturday 05 November 2005 01:29, Erik Hatcher wrote:
> 
> On 4 Nov 2005, at 18:32, Sean O'Connor wrote:
> > I'm posting this primarily hoping to give back a tiny bit to a very  
> > helpful community. More likely however, someone else will open my  
> > eyes to an easier approach than what I outline below...
> >
> > I've come up with a very ugly conversion approach from regular  
> > Query objects into SpanQuery objects. I then use the converted  
> > SpanQuery to get span positions (currently both token #, and start/ 
> > end position). In effect, I have highlighting for simple queries  
> > with a very inefficient approach (yea for me!).
> 
> As you and I have talked about on a couple of face to face occasions,  
> this is the approach I am taking on a current consulting project.  My  
> conversion code is slightly different than yours in that I don't  
> rewrite the query, but translate it as-is into comparable SpanQuery  
> subclasses - and this is because I have a RegexQuery and  
> SpanRegexQuery that are comparable.  But rewriting is a good  
> pragmatic way to go for general query types that don't have a  
> comparable SpanQuery subclass.
> 
> > The goal(s) I am trying to accomplish is rather specific I think,  
> > so I imagine the use of my hacking is rather limited (i.e. just to  
> > me).
> >
> > At the moment my code:
> >
> >    * parses the search text (i.e. user entered query)
> 
> Are you using QueryParser?  If so, you'll also want to account for  
> BooleanQuery, recursively.

The surround parser can create both boolean queries and span queries.

Sean, as you seem to prefer not to use the surround syntax, do you think
this syntax could be improved somehow? I recall trying to make it simpler,
but when I made it I was not able to do so.

Also, PhraseQuery is more efficient than a combination of SpanTermQuery,
SpanOrQuery and SpanNearQuery, so perhaps PhraseQuery should have
a getSpans() method so it could be used as a SpanQuery, too.

Regards,
Paul Elschot

> 
> >    * rewrites the resulting query to expand wildcards and such against
> >      index
> >    * calls a recursive conversion function with very basic conversion
> >      understanding
> >          o TermQuery -> SpanTerm
> >          o PhraseQuery -> SpanNear
> >          o others in progress as time permits
> >
> > Currently, I only process simple query strings like:
> > "blue green yellow" => SpanOrQuery
> > "luce* acti*" => SpanOrQuery with wild cards expanded
> >    e.g.: lucene lucent action acting ... all or'ed together in a  
> > braindead fashion
> > "luce* acti* \"book rocks\"" => SpanOrQuery combining SpanTerms and  
> > SpanNear (no slop)
> >    er, hopefully you get the picture, I'm not up to showing a  
> > vector of this one... :-)
> >
> > I would be happy to discuss my approach if there is anyone  
> > interested. I assume I am pretty much alone in finding this  
> > inefficient approach useful. For me, it is the functionality that  
> > overrides performance issues.
> 
> What is inefficient about it?   The rewrite stuff is the main  
> difference, and perhaps that is the issue you're encountering.  Where  
> do you see the performance issues?
> 
> Converting a query, for me at least, is fast - perhaps because there  
> is no rewriting involved.
> 
> > I have something which can take user search strings and do hit  
> > highlighting for the exact hit found. This is really only useful  
> > for "termA near 'some phrase'" at the moment, but might become more  
> > advanced in the next 2-3 months.
> 
> I'm basically implementing this very thing.  I will likely be  
> enhancing the contrib/highlighter code in the next month to use  
> SpanQuery for highlighting, as well as adding field-aware highlighting.
> 
>      Erik



Re: SpanQuery parser? Update (ugly hack inside...)

Posted by Sean O'Connor <se...@oconeco.com>.

Erik Hatcher wrote:

>
> On 4 Nov 2005, at 18:32, Sean O'Connor wrote:
>
>> I'm posting this primarily hoping to give back a tiny bit to a very  
>> helpful community. More likely however, someone else will open my  
>> eyes to an easier approach than what I outline below...
>>
>> I've come up with a very ugly conversion approach from regular  Query 
>> objects into SpanQuery objects. I then use the converted  SpanQuery 
>> to get span positions (currently both token #, and start/ end 
>> position). In effect, I have highlighting for simple queries  with a 
>> very inefficient approach (yea for me!).
>
>
> As you and I have talked about on a couple of face to face occasions,  
> this is the approach I am taking on a current consulting project.  My  
> conversion code is slightly different than yours in that I don't  
> rewrite the query, but translate it as-is into comparable SpanQuery  
> subclasses - and this is because I have a RegexQuery and  
> SpanRegexQuery that are comparable.  But rewriting is a good  
> pragmatic way to go for general query types that don't have a  
> comparable SpanQuery subclass.
>
>> The goal(s) I am trying to accomplish is rather specific I think,  so 
>> I imagine the use of my hacking is rather limited (i.e. just to  me).
>>
>> At the moment my code:
>>
>>    * parses the search text (i.e. user entered query)
>
>
> Are you using QueryParser?  If so, you'll also want to account for  
> BooleanQuery, recursively.


I am using QueryParser. So far I have taken the easy route, and just 
deal with 'Or' BooleanQueries. The additional aspects of Boolean query 
(required and prohibited) should not be much of a stretch.

>
>
>>    * rewrites the resulting query to expand wildcards and such against
>>      index
>>    * calls a recursive conversion function with very basic conversion
>>      understanding
>>          o TermQuery -> SpanTerm
>>          o PhraseQuery -> SpanNear
>>          o others in progress as time permits
>>
>> Currently, I only process simple query strings like:
>> "blue green yellow" => SpanOrQuery
>> "luce* acti*" => SpanOrQuery with wild cards expanded
>>    e.g.: lucene lucent action acting ... all or'ed together in a  
>> braindead fashion
>> "luce* acti* \"book rocks\"" => SpanOrQuery combining SpanTerms and  
>> SpanNear (no slop)
>>    er, hopefully you get the picture, I'm not up to showing a  vector 
>> of this one... :-)
>>
>> I would be happy to discuss my approach if there is anyone  
>> interested. I assume I am pretty much alone in finding this  
>> inefficient approach useful. For me, it is the functionality that  
>> overrides performance issues.
>
>
> What is inefficient about it?   The rewrite stuff is the main  
> difference, and perhaps that is the issue you're encountering.  Where  
> do you see the performance issues?
> Converting a query, for me at least, is fast - perhaps because there  
> is no rewriting involved.


Good question. I haven't done any performance testing, nor am I seeing 
any performance problems with lucene. I just assumed that my approach 
was adding an extra (unoptimized) layer. So for now, forget I mentioned 
that :-).

>> I have something which can take user search strings and do hit  
>> highlighting for the exact hit found. This is really only useful  for 
>> "termA near 'some phrase'" at the moment, but might become more  
>> advanced in the next 2-3 months.
>
>
> I'm basically implementing this very thing.  I will likely be  
> enhancing the contrib/highlighter code in the next month to use  
> SpanQuery for highlighting, as well as adding field-aware highlighting.
>
Excellent. I will keep an eye out for it. Thanks for the heads up.

>     Erik


Sean




Re: SpanQuery parser? Update (ugly hack inside...)

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On 4 Nov 2005, at 18:32, Sean O'Connor wrote:
> I'm posting this primarily hoping to give back a tiny bit to a very  
> helpful community. More likely however, someone else will open my  
> eyes to an easier approach than what I outline below...
>
> I've come up with a very ugly conversion approach from regular  
> Query objects into SpanQuery objects. I then use the converted  
> SpanQuery to get span positions (currently both token #, and start/ 
> end position). In effect, I have highlighting for simple queries  
> with a very inefficient approach (yea for me!).

As you and I have talked about on a couple of face to face occasions,  
this is the approach I am taking on a current consulting project.  My  
conversion code is slightly different than yours in that I don't  
rewrite the query, but translate it as-is into comparable SpanQuery  
subclasses - and this is because I have a RegexQuery and  
SpanRegexQuery that are comparable.  But rewriting is a good  
pragmatic way to go for general query types that don't have a  
comparable SpanQuery subclass.

> The goal(s) I am trying to accomplish is rather specific I think,  
> so I imagine the use of my hacking is rather limited (i.e. just to  
> me).
>
> At the moment my code:
>
>    * parses the search text (i.e. user entered query)

Are you using QueryParser?  If so, you'll also want to account for  
BooleanQuery, recursively.

>    * rewrites the resulting query to expand wildcards and such against
>      index
>    * calls a recursive conversion function with very basic conversion
>      understanding
>          o TermQuery -> SpanTerm
>          o PhraseQuery -> SpanNear
>          o others in progress as time permits
>
> Currently, I only process simple query strings like:
> "blue green yellow" => SpanOrQuery
> "luce* acti*" => SpanOrQuery with wild cards expanded
>    e.g.: lucene lucent action acting ... all or'ed together in a  
> braindead fashion
> "luce* acti* \"book rocks\"" => SpanOrQuery combining SpanTerms and  
> SpanNear (no slop)
>    er, hopefully you get the picture, I'm not up to showing a  
> vector of this one... :-)
>
> I would be happy to discuss my approach if there is anyone  
> interested. I assume I am pretty much alone in finding this  
> inefficient approach useful. For me, it is the functionality that  
> overrides performance issues.

What is inefficient about it?   The rewrite stuff is the main  
difference, and perhaps that is the issue you're encountering.  Where  
do you see the performance issues?

Converting a query, for me at least, is fast - perhaps because there  
is no rewriting involved.

> I have something which can take user search strings and do hit  
> highlighting for the exact hit found. This is really only useful  
> for "termA near 'some phrase'" at the moment, but might become more  
> advanced in the next 2-3 months.

I'm basically implementing this very thing.  I will likely be  
enhancing the contrib/highlighter code in the next month to use  
SpanQuery for highlighting, as well as adding field-aware highlighting.

     Erik


Re: SpanQuery parser? Update (ugly hack inside...)

Posted by Sean O'Connor <se...@oconeco.com>.

I'm posting this primarily hoping to give back a tiny bit to a very 
helpful community. More likely however, someone else will open my eyes 
to an easier approach than what I outline below...

I've come up with a very ugly conversion approach from regular Query 
objects into SpanQuery objects. I then use the converted SpanQuery to 
get span positions (currently both token #, and start/end position). In 
effect, I have highlighting for simple queries with a very inefficient 
approach (yea for me!).

The goal(s) I am trying to accomplish is rather specific I think, so I 
imagine the use of my hacking is rather limited (i.e. just to me).

At the moment my code:

    * parses the search text (i.e. user entered query)
    * rewrites the resulting query to expand wildcards and such against
      index
    * calls a recursive conversion function with very basic conversion
      understanding
          o TermQuery -> SpanTerm
          o PhraseQuery -> SpanNear
          o others in progress as time permits

Currently, I only process simple query strings like:
"blue green yellow" => SpanOrQuery
"luce* acti*" => SpanOrQuery with wild cards expanded
    e.g.: lucene lucent action acting ... all or'ed together in a 
braindead fashion
"luce* acti* \"book rocks\"" => SpanOrQuery combining SpanTerms and 
SpanNear (no slop)
    er, hopefully you get the picture, I'm not up to showing a vector of 
this one... :-)

I would be happy to discuss my approach if there is anyone interested. I 
assume I am pretty much alone in finding this inefficient approach 
useful. For me, it is the functionality that overrides performance 
issues. I have something which can take user search strings and do hit 
highlighting for the exact hit found. This is really only useful for 
"termA near 'some phrase'" at the moment, but might become more advanced 
in the next 2-3 months.
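
The recursive conversion described above can be sketched roughly as
follows. This is not the code from this post: it is a self-contained toy in
plain Java, with hypothetical stand-in types (Term, Phrase, Or) in place of
Lucene's query classes, and it returns a textual description of the target
span query instead of building real SpanQuery objects. It assumes wildcards
have already been expanded by the rewrite step, and that a phrase maps to
an in-order near query with no slop.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-ins for the query types named in the post; not Lucene classes.
final class QueryToSpanSketch {

    interface Q {}

    static final class Term implements Q {
        final String text;
        Term(String text) { this.text = text; }
    }

    static final class Phrase implements Q {
        final List<String> terms;
        Phrase(List<String> terms) { this.terms = terms; }
    }

    static final class Or implements Q {
        final List<Q> clauses;
        Or(List<Q> clauses) { this.clauses = clauses; }
    }

    /** Recursively maps a query node to a description of its span counterpart. */
    static String toSpan(Q q) {
        if (q instanceof Term) {
            return "SpanTerm(" + ((Term) q).text + ")";
        }
        if (q instanceof Phrase) {
            List<String> parts = new ArrayList<>();
            for (String t : ((Phrase) q).terms) {
                parts.add("SpanTerm(" + t + ")");
            }
            return "SpanNear([" + String.join(", ", parts) + "], slop=0, inOrder=true)";
        }
        if (q instanceof Or) {
            List<String> parts = new ArrayList<>();
            for (Q clause : ((Or) q).clauses) {
                parts.add(toSpan(clause)); // recursion handles nested clauses
            }
            return "SpanOr([" + String.join(", ", parts) + "])";
        }
        // Query types without a known span counterpart are rejected, not guessed at.
        throw new IllegalArgumentException(
            "no span equivalent known for " + q.getClass().getSimpleName());
    }
}
```

For an Or of the term "blue" and the phrase "book rocks", toSpan returns
"SpanOr([SpanTerm(blue), SpanNear([SpanTerm(book), SpanTerm(rocks)], slop=0, inOrder=true)])".
Failing loudly on unsupported node types keeps the "others in progress"
cases visible instead of silently dropping clauses.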

Sean


Paul Elschot wrote:

>On Thursday 20 October 2005 00:40, Sean O'Connor wrote:
>  
>
>>Hello,
>>    I have user entered search commands which I want to convert to 
>>SpanQueries. I have seen in the book "Lucene in Action" that no parser 
>>existed at time of publication, but there was someone working on a 
>>SpanQuery parser. Can anyone point me to that code, or provide any 
>>suggestions?
>>
>>    I want to use SpanQueries for their detail on the number of hits 
>>from a query, and more importantly, the location (position start and 
>>end) of each hit. My application requires me to do precise hit 
>>highlighting.  I also need to perform calculations on the number of hits 
>>per document, as well as per query (sum of document hits).
>>    
>>
>
>You may want to use the getSpans() method of SpanQuery and operate
>on the result directly.
>
>  
>
>>    It is fairly critical I highlight the hits, and only the hits. From 
>>what I've read, SpanQueries (with dumpSpans) are a better approach than 
>>using 'regular' queries. I _think_ regular queries currently use a 
>>highlighter which shows all terms highlighted. This can give more 
>>highlighting than actual hits (i.e., false positives).
>>
>>    So, that being said, should I stick with SpanQueries? Is there any 
>>current work on a parser to convert a string, or regular (Token, 
>>Boolean, Phrase, Prefix,...) query into a SpanQuery?
>>
>>    I have written some very duct tape-ish code which will convert basic 
>>booleanOR and prefix queries into SpanQueries. I just realized I'm in 
>>deeper water than I expected when I tried converting my first query 
>>string containing several boolean queries, AND a phrase query. So now I 
>>am looking to either help an existing effort, or just continue with my 
>>own hacking.
>>    
>>
>
>:)
>
>Have a look at the surround query parser in the svn trunk:
>http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/surround/
>
>There is also some code that does highlighting based on Spans,
>but I don't know where that is. Hopefully someone else can point you at that.
>
>Regards,
>Paul Elschot




Re: SpanQuery parser?

Posted by Paul Elschot <pa...@xs4all.nl>.

On Thursday 20 October 2005 00:40, Sean O'Connor wrote:
> Hello,
>     I have user entered search commands which I want to convert to 
> SpanQueries. I have seen in the book "Lucene in Action" that no parser 
> existed at time of publication, but there was someone working on a 
> SpanQuery parser. Can anyone point me to that code, or provide any 
> suggestions?
> 
>     I want to use SpanQueries for their detail on the number of hits 
> from a query, and more importantly, the location (position start and 
> end) of each hit. My application requires me to do precise hit 
> highlighting.  I also need to perform calculations on the number of hits 
> per document, as well as per query (sum of document hits).

You may want to use the getSpans() method of SpanQuery and operate
on the result directly.
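
As a rough sketch of "operating on the result directly" for the counting
requirement: assume the matches have already been read out of the spans
into (doc, start, end) triples. That collection step is not shown, and
nothing below is Lucene API; the class and method names are hypothetical.
Hits per document and the per-query total then fall out of a single pass.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy aggregation over already-collected matches; each int[] is {doc, start, end}.
final class SpanHitCountSketch {

    /** Number of matching spans per document id, ordered by document id. */
    static Map<Integer, Integer> hitsPerDoc(List<int[]> spans) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int[] s : spans) {
            counts.merge(s[0], 1, Integer::sum);
        }
        return counts;
    }

    /** Hits for the whole query: the sum of the per-document hit counts. */
    static int totalHits(Map<Integer, Integer> perDoc) {
        int total = 0;
        for (int c : perDoc.values()) {
            total += c;
        }
        return total;
    }
}
```

The start and end values are carried along unchanged, so the same triples
can also drive the highlighting.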

>     It is fairly critical I highlight the hits, and only the hits. From 
> what I've read, SpanQueries (with dumpSpans) are a better approach than 
> using 'regular' queries. I _think_ regular queries currently use a 
> highlighter which shows all terms highlighted. This can give more 
> highlighting than actual hits (i.e., false positives).
> 
>     So, that being said, should I stick with SpanQueries? Is there any 
> current work on a parser to convert a string, or regular (Token, 
> Boolean, Phrase, Prefix,...) query into a SpanQuery?
> 
>     I have written some very duct tape-ish code which will convert basic 
> booleanOR and prefix queries into SpanQueries. I just realized I'm in 
> deeper water than I expected when I tried converting my first query 
> string containing several boolean queries, AND a phrase query. So now I 
> am looking to either help an existing effort, or just continue with my 
> own hacking.

:)

Have a look at the surround query parser in the svn trunk:
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/surround/

There is also some code that does highlighting based on Spans,
but I don't know where that is. Hopefully someone else can point you at that.

Regards,
Paul Elschot


