Posted to java-user@lucene.apache.org by Sean O'Connor <se...@oconeco.com> on 2005/09/06 08:21:52 UTC

Hits document offset information? Span query or Surround?

I believe I have heard that Span queries provide some way to access 
document offset information for their hits. Does anyone know if this 
is true, and if so, how I would go about it?

Alternatively (preferably actually) does the surround code from the SVN 
development area have a way of returning offsets for the matching hits?

I believe the current highlighter code matches all query terms in a hit 
document, not just those satisfying the query criteria. I need a more 
precise way to access the hit term offsets. I am working on hit 
highlighting, hit excerpts and summaries, and compound queries (is this 
called search vectors?). I am still working through the surround code in 
dev to see if that gives me the compound queries I need.

I am willing to spend a few days implementing offsets on the returned 
hits (or something similar) if this is not currently available. It is 
something I need, even at the cost of search efficiency.
Thanks

Sean



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Hits document offset information? Span query or Surround? - thanks

Posted by Sean O'Connor <se...@oconeco.com>.
Thanks for the input. I am looking at the suggested links now. If I make 
any progress I will return to see if any of my work would be appropriate 
to contribute back.

Sean


Paul Elschot wrote:

>On Tuesday 06 September 2005 08:52, markharw00d wrote:
>  
>
>> >>I believe I have heard that Span queries provide some way to access 
>>document offset information for their hits somehow.
>>
>>See http://marc.theaimsgroup.com/?l=lucene-user&m=112496111224218&w=2
>>
>>Faithfully selecting extracts based *exactly* on query criteria will be 
>>hard given complex queries eg with nested Boolean logic.
>>
>>The current highlighter matches based on ANY query terms found in the 
>>provided doc text
>>The proposal above matches based on any spans/phrases/terms
>>
>>Both options still fail to take into account any boolean logic and show 
>>the real basis for the match eg the query
>>    (author:"Doug Cutting" AND title:"Lucene in Action") OR (author:Erik AND author:Otis)
>>would still highlight references to "Doug Cutting" and "Lucene In 
>>Action" for the LIA book, despite the fact that the match was actually 
>>for Erik and Otis (the true authors).
>>For most people this is a problem they can live with.
>>    
>>
>
>The person who solves that might also write a SpanAndQuery :)
>
>Regards,
>Paul Elschot





Re: Hits document offset information? Span query or Surround?

Posted by Paul Elschot <pa...@xs4all.nl>.
On Tuesday 06 September 2005 08:52, markharw00d wrote:
>  >>I believe I have heard that Span queries provide some way to access 
> document offset information for their hits somehow.
> 
> See http://marc.theaimsgroup.com/?l=lucene-user&m=112496111224218&w=2
> 
> Faithfully selecting extracts based *exactly* on query criteria will be 
> hard given complex queries eg with nested Boolean logic.
> 
> The current highlighter matches based on ANY query terms found in the 
> provided doc text
> The proposal above matches based on any spans/phrases/terms
> 
> Both options still fail to take into account any boolean logic and show 
> the real basis for the match eg the query
>     (author:"Doug Cutting" AND title:"Lucene in Action") OR (author:Erik AND author:Otis)
> would still highlight references to "Doug Cutting" and "Lucene In 
> Action" for the LIA book, despite the fact that the match was actually 
> for Erik and Otis (the true authors).
> For most people this is a problem they can live with.

The person who solves that might also write a SpanAndQuery :)

Regards,
Paul Elschot




Re: Hits document offset information? Span query or Surround?

Posted by markharw00d <ma...@yahoo.co.uk>.
 >>I believe I have heard that Span queries provide some way to access 
document offset information for their hits somehow.

See http://marc.theaimsgroup.com/?l=lucene-user&m=112496111224218&w=2

Faithfully selecting extracts based *exactly* on query criteria will be 
hard given complex queries eg with nested Boolean logic.

The current highlighter matches on ANY query terms found in the 
provided doc text.
The proposal above matches on any spans/phrases/terms.

Both options still fail to take the boolean logic into account, so 
neither shows the real basis for the match. E.g. the query
    (author:"Doug Cutting" AND title:"Lucene in Action") OR (author:Erik AND author:Otis)
would still highlight references to "Doug Cutting" and "Lucene in 
Action" for the LIA book, despite the fact that the match was actually 
for Erik and Otis (the true authors).
For most people this is a problem they can live with.
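
To make the difference concrete, here is a minimal stand-alone sketch. 
It is not Lucene code and not how the highlighter is implemented: the 
class ClauseAwareHighlight, its two methods, and the idea of modelling 
a document as a set of "field:value" strings are all assumptions made 
for illustration. The query is modelled as an OR over AND-clauses, 
which is enough to express the example above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not Lucene code: a query is an OR over AND-clauses,
// and a document is just the set of "field:value" strings it contains.
class ClauseAwareHighlight {

    /** Returns every query term present in the doc, ignoring boolean structure. */
    static Set<String> anyTerm(List<List<String>> orOfAnds, Set<String> doc) {
        Set<String> hits = new LinkedHashSet<>();
        for (List<String> clause : orOfAnds) {
            for (String term : clause) {
                if (doc.contains(term)) {
                    hits.add(term);
                }
            }
        }
        return hits;
    }

    /** Returns only the terms of AND-clauses that the doc satisfies completely. */
    static Set<String> satisfiedClausesOnly(List<List<String>> orOfAnds, Set<String> doc) {
        Set<String> hits = new LinkedHashSet<>();
        for (List<String> clause : orOfAnds) {
            if (doc.containsAll(clause)) {
                hits.addAll(clause);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<List<String>> query = Arrays.asList(
                Arrays.asList("author:Doug Cutting", "title:Lucene in Action"),
                Arrays.asList("author:Erik", "author:Otis"));
        Set<String> liaBook = new LinkedHashSet<>(Arrays.asList(
                "title:Lucene in Action", "author:Erik", "author:Otis"));

        // prints [title:Lucene in Action, author:Erik, author:Otis]
        System.out.println(anyTerm(query, liaBook));
        // prints [author:Erik, author:Otis]
        System.out.println(satisfiedClausesOnly(query, liaBook));
    }
}
```

The first method highlights the title even though the clause it 
belongs to did not match; the second reports only the clause that was 
the real basis for the hit. Arbitrarily nested Boolean logic would 
need a recursive version of the same check.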

Cheers
Mark


		


Re: Hits document offset information? Span query or Surround?

Posted by Paul Elschot <pa...@xs4all.nl>.
On Tuesday 06 September 2005 08:21, Sean O'Connor wrote:
> I believe I have heard that Span queries provide some way to access 
> document offset information for their hits somehow. Does anyone know if 
> this is true, and if so, how I would go about it?
> 
> Alternatively (preferably actually) does the surround code from the SVN 
> development area have a way of returning offsets for the matching hits?

Calling getSpans(reader) on the span query returns the Spans that
match the query. A Spans iterates through begin/end pairs within the
matching docs; these are token positions rather than character offsets.
This is provided by Lucene.
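
As an illustration of what such begin/end pairs look like, here is a 
minimal stand-alone sketch. It deliberately does not use Lucene: 
SpanSketch, Span, and nearSpans are hypothetical names, and the 
matching rule (an ordered "first term followed by second term within 
slop tokens") is an assumption chosen to keep the example small. The 
pairs are token positions with an exclusive end; mapping them to 
character offsets for highlighting is a separate step not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone model of span matches; this is NOT the Lucene API.
class SpanSketch {

    /** One match: token positions [start, end) within a single document. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    /**
     * Finds each place where `first` is followed by `second` with at most
     * `slop` other tokens in between, and records the covering position pair.
     */
    static List<Span> nearSpans(String[] tokens, String first, String second, int slop) {
        List<Span> spans = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) {
                continue;
            }
            int limit = Math.min(tokens.length - 1, i + 1 + slop);
            for (int j = i + 1; j <= limit; j++) {
                if (tokens[j].equals(second)) {
                    spans.add(new Span(i, j + 1)); // end is exclusive
                    break; // keep only the nearest occurrence of `second`
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] tokens =
                "lucene in action by erik and otis covers lucene span queries".split(" ");
        for (Span s : nearSpans(tokens, "lucene", "action", 1)) {
            System.out.println(s); // prints [0,3)
        }
    }
}
```

A highlighter built on pairs like these would mark only the tokens 
inside each span, rather than every occurrence of a query term.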

> 
> I believe the current highlighter code matches all query terms in a hit 
> document, not just those satisfying the query criteria. I need a more 
> precise way to access the hit term offsets. I am working on hit 
> highlighting, hit excerpts and summaries, and compound queries  (is this 
> called search vectors?). I am still working through the surround code in 
> dev. to see if that gives me the compound queries I need.
> 
> I am willing to spend a few days to work on implementing adding offsets 
> to the returned hits (or something similar) if this is not currently 
> available. It is something I need, even at the cost of search efficiency.

See also the thread on better highlighting that started on 25 August
and this:
http://issues.apache.org/bugzilla/show_bug.cgi?id=35518

Regards,
Paul Elschot

