Posted to user@lucy.apache.org by goran kent <go...@gmail.com> on 2011/10/03 16:06:08 UTC

[lucy-user] Identifying relevant field in $hits

Hi,

I've scrounged around a bit, and I take it
http://mail-archives.apache.org/mod_mbox/incubator-lucy-dev/201109.mbox/%3C4E6C3489.6090005@peknet.com%3E
is the only way to identify which field triggered a $hit, right?

ie (roughly), flag all fields as highlightable, then if the
Lucy::Highlight::Highlighter actually highlights something in a field,
then that's your indication that something was found in it?

If so, feature request:  my @field_hit = $hits->relevant_field() would
be really nice ;)

My minor problem is:  I have inbound link text pointing to a page
which is indexed along with the page content itself.  Since it's never
displayed, you might have a hit on this 'hidden' text (but highly
relevant in my case) and no other hits, so the excerpt is void of any
highlighting (I can just hear the wailing and gnashing of teeth from
my future users).  It would be nice to be able to flag this kind of
search result as "Found your term in inbound text" or whatever.


Cheers

Re: [lucy-user] Identifying relevant field in $hits

Posted by goran kent <go...@gmail.com>.
On Tue, Oct 4, 2011 at 8:13 AM, Nathan Kurz <na...@verse.com> wrote:
> On Mon, Oct 3, 2011 at 7:06 AM, goran kent <go...@gmail.com> wrote
>> My minor problem is:  I have inbound link text pointing to a page
>> which is indexed along with the page content itself.  Since it's never
>> displayed, you might have a hit on this 'hidden' text (but highly
>> relevant in my case) and no other hits, so the excerpt is void of any
>> highlighting (I can just hear the wailing and gnashing of teeth from
>> my future users).  It would be nice to be able to flag this kind of
>> search result as "Found your term in inbound text" or whatever.
>
> I can't tell if you are tongue in cheek about this or not.  I agree
> with you, it will cause wailing and gnashing, but it's also been
> Google's default behavior for a long time.  Search for:
> http://www.google.com/search?q="links+pointing+to+this+page" for a
> litany of people complaining about just this.

lol Thanks for that.  It's nice when my experience of End Users and
their sometimes unreasonable expectations is confirmed.

Re: [lucy-user] Identifying relevant field in $hits

Posted by Nathan Kurz <na...@verse.com>.
On Mon, Oct 3, 2011 at 7:06 AM, goran kent <go...@gmail.com> wrote
> My minor problem is:  I have inbound link text pointing to a page
> which is indexed along with the page content itself.  Since it's never
> displayed, you might have a hit on this 'hidden' text (but highly
> relevant in my case) and no other hits, so the excerpt is void of any
> highlighting (I can just hear the wailing and gnashing of teeth from
> my future users).  It would be nice to be able to flag this kind of
> search result as "Found your term in inbound text" or whatever.

I can't tell if you are tongue in cheek about this or not.  I agree
with you, it will cause wailing and gnashing, but it's also been
Google's default behavior for a long time.  Search for:
http://www.google.com/search?q="links+pointing+to+this+page" for a
litany of people complaining about just this.

I don't know for sure, but I presume this is because Google has
similar "field" issues.  For reasons either technical or social,
they've decided they aren't worth fixing.  There are a bunch of
special purpose operators that purport to do this:
http://www.googleguide.com/advanced_operators.html

You'd have to go pretty deep into Lucy to take a more efficient
approach than the post-processing that Marvin and Peter suggest.

--nate

Re: [lucy-user] Identifying relevant field in $hits

Posted by Peter Karman <pe...@peknet.com>.
goran kent wrote on 10/03/2011 09:06 AM:

> If so, feature request:  my @field_hit = $hits->relevant_field() would
> be really nice ;)
> 

http://search.cpan.org/~karman/SWISH-Prog-Lucy-0.05/lib/SWISH/Prog/Lucy/Results.pm#SYNOPSIS


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-user] Identifying relevant field in $hits

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Oct 04, 2011 at 08:13:04AM +0200, goran kent wrote:
> On Tue, Oct 4, 2011 at 5:26 AM, Marvin Humphrey <ma...@rectangular.com> wrote:
> >> If so, feature request:  my @field_hit = $hits->relevant_field() would
> >> be really nice ;)
> 
> > Peter has provided one vision, in SWISH::Prog::Lucy::Results.  I confess that
> > I don't quite understand what you've shown us above.  Can you provide some
> > context illustrating how it would be used?
> 
> Sorry, is that last Q directed at me, or Peter?  If me, then
> @field_hit would contain my list of 'field' names in which the search
> terms were found, allowing me to do my thang.

OK, I think I basically understand.  Hits is an iterator, though, and the
fields which contributed to the score can vary per-document, so I think this
would have to be a property of the HitDoc (which happens to be how Peter has
done things).

However, I don't think it's ideal for Hits to provide an API which only works
when the user has specified an unrelated, unintuitive schema setting.  Why
does this functionality have anything to do with highlighting?  And the
granularity of 'highlightable' is per-field, while we need it to be on for
*all* indexed fields if we're to get a meaningful answer out of
relevant_fields().  Perhaps there ought to be something like an index-wide
attribute on Schema which triggers the creation of single-document inverted
indexes which include all indexed fields?

Until such high-level design issues are worked out, if we decide that there's
a pressing need for this feature, I'd rather see it in a LucyX class that
wraps Hits.

> Is it safe to change the schema so that my 'inbound_text' field is
> highlightable (currently OFF)? - ie, there will be a mixture of the
> 'inbound_text' field in various indexes (which will end up being
> merged to large searchable indexes) with/without highlightable being
> ON.

No, that will cause a schema conflict exception; existing documents would not
have highlighting data available for that field, and the setting applies to
all segments.  It will be necessary to regenerate.

Marvin Humphrey


Re: [lucy-user] Identifying relevant field in $hits

Posted by goran kent <go...@gmail.com>.
On Tue, Oct 4, 2011 at 5:26 AM, Marvin Humphrey <ma...@rectangular.com> wrote:
>> If so, feature request:  my @field_hit = $hits->relevant_field() would
>> be really nice ;)

> Peter has provided one vision, in SWISH::Prog::Lucy::Results.  I confess that
> I don't quite understand what you've shown us above.  Can you provide some
> context illustrating how it would be used?

Sorry, is that last Q directed at me, or Peter?  If me, then
@field_hit would contain my list of 'field' names in which the search
terms were found, allowing me to do my thang.

> I assume that you have stripped all HTML tags from your data.  (They would
> likely mess up scoring if left in).  Thus seeing if a highlighted excerpt
> contains a "<strong>" tag suffices to indicate that a field indeed matched.

Yes, it's clean text.

Is it safe to change the schema so that my 'inbound_text' field is
highlightable (currently OFF)? - ie, there will be a mixture of the
'inbound_text' field in various indexes (which will end up being
merged to large searchable indexes) with/without highlightable being
ON.

Re: [lucy-user] Identifying relevant field in $hits

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Oct 03, 2011 at 04:06:08PM +0200, goran kent wrote:
> Hi,
> 
> I've scrounged around a bit, and I take it
> http://mail-archives.apache.org/mod_mbox/incubator-lucy-dev/201109.mbox/%3C4E6C3489.6090005@peknet.com%3E
> is the only way to identify which field triggered a $hit, right?

I don't know of a better way.

> ie (roughly), flag all fields as highlightable, then if the
> Lucy::Highlight::Highlighter actually highlights something in a field,
> then that's your indication that something was found in it?
 
Yes.

> If so, feature request:  my @field_hit = $hits->relevant_field() would
> be really nice ;)

I don't think this feature is mature enough to be given a prominent core API
just yet.  We've struck upon the general approach of post-processing the hit
using the single-document mini-inverted-indexes needed for highlighting, but
the current implementation is arguably an abuse of Highlighter.

For now, I think it's OK that we support this feature with cookbook code or
via convenience methods in libraries which wrap Lucy.  But it's become a
popular feature request, and so it's good to think about what a Lucy API might
look like in the future.

Peter has provided one vision, in SWISH::Prog::Lucy::Results.  I confess that
I don't quite understand what you've shown us above.  Can you provide some
context illustrating how it would be used?

> My minor problem is:  I have inbound link text pointing to a page
> which is indexed along with the page content itself.  Since it's never
> displayed, you might have a hit on this 'hidden' text (but highly
> relevant in my case) and no other hits, so the excerpt is void of any
> highlighting (I can just hear the wailing and gnashing of teeth from
> my future users).  It would be nice to be able to flag this kind of
> search result as "Found your term in inbound text" or whatever.

OK, sure.  You can abuse Highlighter to achieve your ends.  :)

I assume that you have stripped all HTML tags from your data.  (They would
likely mess up scoring if left in).  Thus seeing if a highlighted excerpt
contains a "<strong>" tag suffices to indicate that a field indeed matched.

If the primary content field produces an excerpt that does *not* contain
"<strong>", but the "inbound_text" excerpt *does* contain "<strong>", then you
know to flag that particular result.

Cheers,

Marvin Humphrey
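
The "<strong>" check described in the message above can be sketched in plain
Perl.  This is an illustration only, not code from the thread or from Lucy:
the helper names matched_fields() and inbound_only() and the field name
'content' are assumptions, and the excerpt strings are made-up stand-ins for
what a Lucy::Highlight::Highlighter would produce for each highlightable
field of a hit.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of the fields whose excerpt
# contains the highlight tag.  $excerpts maps field name => excerpt
# string (undef when no excerpt was produced for that field).
sub matched_fields {
    my ($excerpts) = @_;
    my @fields = grep {
        defined $excerpts->{$_}
            && index( $excerpts->{$_}, '<strong>' ) >= 0
    } keys %$excerpts;
    my @sorted = sort @fields;
    return @sorted;
}

# Hypothetical helper: true when the hidden 'inbound_text' field matched
# but the displayed 'content' field did not.
sub inbound_only {
    my ($excerpts) = @_;
    my %matched = map { $_ => 1 } matched_fields($excerpts);
    return ( $matched{inbound_text} && !$matched{content} ) ? 1 : 0;
}

# Made-up excerpts for one hit, standing in for per-field Highlighter output.
my %excerpts = (
    content      => 'A page whose visible text never mentions the term.',
    inbound_text => 'see the <strong>lucy</strong> tutorial',
);

print "Matched fields: ", join( ', ', matched_fields( \%excerpts ) ), "\n";
print "Found your term in inbound text\n" if inbound_only( \%excerpts );
```

In a real search loop the hash would be filled once per hit, with one
Highlighter per field.  As noted earlier in the thread, this only works for
fields that were indexed as highlightable.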