Posted to user@lucy.apache.org by Peter Karman <pe...@peknet.com> on 2011/03/01 04:28:34 UTC
Re: [lucy-user] Feature question about Lucy vs. Ferret

Andrew S. Townley wrote on 2/26/11 7:17 AM:

>> You have to generate that information after the fact, by post-processing
>> the Hits that come back.  Lucy, Lucene, and Ferret all have the same
>> behavior in this regard.
>> 
>> Matching and scoring are highly abstracted for speed.  The matching engine 
>> does not scan raw document content, a la an RDBMS full table scan --
>> instead, it iterates over heavily optimized data structures devoid of
>> introspection overhead.  At the end of a search, you will only have
>> documents and scores -- not sophisticated metadata about what part of the
>> subquery matched and what parts didn't and how much each matching part
>> contributed to the score. Keeping track of such metadata during the
>> matching phase would be prohibitively expensive.
> 
> I can understand the need to abstract a lot of things for speed.  I'm no
> search expert as I've said before, but I don't understand why at the very
> least the field information (e.g. name) can't be encoded in this data
> structure in such a way that you can determine this information at match
> time.  Highlighting and offsets are a different matter, and I never thought
> it was doing a full-text scan or a table scan like an RDBMS.  If I wanted
> that, I'd just use regex searches (which I do in some cases for small
> datasets).
> 
> Obviously, I'm missing something here, but I don't see why it matters to
> keep track of fields at all if, when the match information comes back, you
> can't tell which field and which term matched an "all fields" or "multiple
> field" search query.  Obviously,
> actually finding the offsets is a much more expensive operation, and I'm ok
> with having to do that after the search is completed--even if I have to do my
> own matching without API support for highlighting.  However, this is only
> possible if I know what term and what field and don't have to effectively
> perform the search again on the document (which is what Ferret seems to
> require).
> 

I miss this feature too (native interrogation of HitDoc objects to discover
which field(s) generated the hit).

Marvin, where would be the appropriate place to extend Lucy in this way? I'm
guessing Search::Searcher and Search::MatchDoc?
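
For reference, the post-processing workaround described in the quoted text
might look roughly like the sketch below. It is not Lucy, Lucene, or Ferret
API: the function name matching_fields, the plain dict standing in for a
fetched hit, and the regex tokenizer are all assumptions made for
illustration. A real index applies its own analyzer (stemming, stopwords,
and so on), so this naive re-check can disagree with what the engine
actually matched.

```python
import re


def matching_fields(doc, terms):
    """Return {field_name: sorted query terms found in that field}.

    `doc` is a hypothetical stand-in for a fetched hit: a dict mapping
    field names to their stored text. `terms` are the raw query terms.
    """
    wanted = {t.lower() for t in terms}
    result = {}
    for field, text in doc.items():
        # Naive tokenization; an assumption, not the index's real analyzer.
        tokens = set(re.findall(r"\w+", text.lower()))
        found = sorted(wanted & tokens)
        if found:
            result[field] = found
    return result


# Hypothetical hit with two stored fields.
hit = {
    "title": "Lucy vs. Ferret",
    "body": "Matching and scoring are abstracted for speed.",
}
print(matching_fields(hit, ["ferret", "speed"]))
# prints {'title': ['ferret'], 'body': ['speed']}
```

This is exactly the "search the document again" cost that native support
would avoid, since every stored field of every hit has to be re-tokenized.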


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com