Posted to user@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2012/10/25 03:59:12 UTC

[lucy-user] ClusterSearcher - prefetching hits

On Wed, Oct 24, 2012 at 5:08 AM, Dag Lem <da...@nimrod.no> wrote:

> Furthermore fetch_doc and fetch_doc_vec should be replaced with
> something like fetch_docs and fetch_docs_vec, facilitating the
> fetching of several documents with a single request / response.

The place to address batch-fetching of documents is the Lucy::Search::Hits
iterator class.

Right now, Hits doesn't pre-fetch -- it just retrieves individual docs on
demand.

    HitDoc*
    Hits_next(Hits *self) {
        MatchDoc *match_doc
            = (MatchDoc*)VA_Fetch(self->match_docs, self->offset);
        self->offset++;

        if (!match_doc) {
            /** Bail if there aren't any more *captured* hits. (There may
             * be more total hits.) */
            return NULL;
        }
        else {
            // Lazily fetch HitDoc, set score.
            HitDoc *hit_doc = Searcher_Fetch_Doc(self->searcher,
                                                 match_doc->doc_id);
            HitDoc_Set_Score(hit_doc, match_doc->score);
            return hit_doc;
        }
    }

We could modify Hits by giving it a `prefetch_count` member variable and a
`set_prefetch_count()` method.  The default for `prefetch_count` would be 0,
preserving the current behavior, but ClusterSearcher could set that count
before returning the Hits object so that all documents are prefetched by
default on the first call to `next()`.  The result will be to cut down fetches
from one round-trip per hit to one round-trip per shard-with-hits.
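
A minimal, self-contained sketch of the prefetching idea follows. It does not use Lucy's real classes: `MockSearcher`, its batch method `MockSearcher_fetch_docs()`, the cache fields on `Hits`, and the `count_round_trips()` helper are all hypothetical names invented for illustration, and the mock stands in for a single remote shard that only counts simulated round-trips.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for Lucy's MatchDoc and HitDoc. */
typedef struct { int32_t doc_id; float score; } MatchDoc;
typedef struct { int32_t doc_id; float score; } HitDoc;

/* Mock searcher for a single remote shard; it only counts round-trips. */
typedef struct { int round_trips; } MockSearcher;

/* Models the existing per-document fetch: one round-trip per call. */
HitDoc
MockSearcher_fetch_doc(MockSearcher *self, int32_t doc_id) {
    HitDoc doc = { doc_id, 0.0f };
    self->round_trips++;
    return doc;
}

/* Hypothetical batch fetch: one round-trip for `count` documents. */
void
MockSearcher_fetch_docs(MockSearcher *self, const MatchDoc *matches,
                        size_t count, HitDoc *out) {
    self->round_trips++;
    for (size_t i = 0; i < count; i++) {
        out[i].doc_id = matches[i].doc_id;
        out[i].score  = 0.0f;
    }
}

typedef struct {
    MockSearcher   *searcher;
    const MatchDoc *match_docs;
    size_t          num_matches;
    size_t          offset;
    size_t          prefetch_count; /* 0 preserves lazy fetching */
    HitDoc         *cache;          /* docs from the most recent batch */
    size_t          cache_start;    /* offset that cache[0] corresponds to */
    size_t          cache_len;
} Hits;

/* Returns 1 and fills `out` if a hit was available, 0 when exhausted. */
int
Hits_next(Hits *self, HitDoc *out) {
    if (self->offset >= self->num_matches) { return 0; }
    size_t tick = self->offset++;
    const MatchDoc *match = &self->match_docs[tick];

    if (self->prefetch_count == 0) {
        /* Current behavior: lazily fetch one doc per call. */
        *out = MockSearcher_fetch_doc(self->searcher, match->doc_id);
    }
    else {
        if (tick >= self->cache_start + self->cache_len) {
            /* Cache exhausted (or never filled): fetch the next batch. */
            size_t remaining = self->num_matches - tick;
            size_t want = remaining < self->prefetch_count
                          ? remaining : self->prefetch_count;
            HitDoc *buf = (HitDoc*)realloc(self->cache, want * sizeof(HitDoc));
            assert(buf != NULL);
            self->cache = buf;
            MockSearcher_fetch_docs(self->searcher, match, want, buf);
            self->cache_start = tick;
            self->cache_len   = want;
        }
        *out = self->cache[tick - self->cache_start];
    }
    out->score = match->score;
    return 1;
}

/* Iterate over `num_hits` fake hits; return round-trips, or -1 on a bad hit. */
int
count_round_trips(size_t num_hits, size_t prefetch_count) {
    MockSearcher searcher = { 0 };
    MatchDoc *matches
        = (MatchDoc*)calloc(num_hits ? num_hits : 1, sizeof(MatchDoc));
    assert(matches != NULL);
    for (size_t i = 0; i < num_hits; i++) {
        matches[i].doc_id = (int32_t)(i + 1);
        matches[i].score  = 1.0f / (float)(i + 1);
    }
    Hits hits = { &searcher, matches, num_hits, 0, prefetch_count, NULL, 0, 0 };
    HitDoc hit;
    size_t seen = 0;
    int ok = 1;
    while (Hits_next(&hits, &hit)) {
        if (hit.doc_id != matches[seen].doc_id
            || hit.score != matches[seen].score) { ok = 0; }
        seen++;
    }
    free(hits.cache);
    free(matches);
    return (ok && seen == num_hits) ? searcher.round_trips : -1;
}
```

In this mock, `count_round_trips(10, 0)` costs 10 round-trips, `count_round_trips(10, 10)` costs 1, and `count_round_trips(10, 4)` costs 3. A real ClusterSearcher would presumably also have to group the prefetched doc ids by shard, which this single-shard sketch leaves out.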

There's no need to make `set_prefetch_count()` public yet -- it can remain
an implementation detail for the time being.

The question of what to do about `fetch_doc_vec()` is harder.  Highlighter is
the only place that calls `fetch_doc_vec()`, but it can't prefetch because it
only deals with one hit at a time.

Perhaps we ought to explore integrating Highlighter with Hits instead of
limiting it to dealing with individual Doc objects.  That way, Hits could
assume responsibility for prefetching both Doc and DocVector objects at the
same time.

Marvin Humphrey

Re: [lucy-user] ClusterSearcher - prefetching hits

Posted by Dag Lem <da...@nimrod.no>.
Marvin Humphrey <ma...@rectangular.com> writes:

[...]

> We could modify Hits by giving it a `prefetch_count` member variable and a
> `set_prefetch_count()` method.  The default for `prefetch_count` would be 0,
> preserving the current behavior, but ClusterSearcher could set that count
> before returning the Hits object so that all documents are prefetched by
> default on the first call to `next()`.  The result will be to cut down fetches
> from one round-trip per hit to one round-trip per shard-with-hits.

This will undoubtedly be a big win!

> There's no need to make `set_prefetch_count()` public yet -- it can remain
> an implementation detail for the time being.
> 
> The question of what to do about `fetch_doc_vec()` is harder.  Highlighter is
> the only place that calls `fetch_doc_vec()`, but it can't prefetch because it
> only deals with one hit at a time.
> 
> Perhaps we ought to explore integrating Highlighter with Hits instead of
> limiting it to dealing with individual Doc objects.  That way, Hits could
> assume responsibility for prefetching both Doc and DocVector objects at the
> same time.

This sounds very reasonable to me. To avoid unnecessary fetches for
applications without the need for highlighting, you could conceivably
control whether highlights should be prefetched by adding a second
member variable to Hits, e.g. 'prefetch_highlights'.

Stealing from the documentation of Lucy::Highlight::Highlighter,
perhaps you'd end up with something like the example below? This is
assuming that you want to give the user control over the process,
while also allowing him to shoot himself in the foot, of course :-)

    my $highlighter = Lucy::Highlight::Highlighter->new(
        searcher => $searcher,
        query    => $query,
        field    => 'body'
    );
    my $hits = $searcher->hits( query               => $query,
                                prefetch_count      => 100,
                                prefetch_highlights => 1 );
    while ( my $hit = $hits->next ) {
        my $excerpt = $highlighter->create_excerpt($hit);
        ...
    }

I think the defaults you outlined for prefetch_count are sound. The
default for prefetch_highlights should probably be 1 for Hits returned
from ClusterSearcher (a user would have to set prefetch_highlights to
0 in order to squeeze the last bit of performance out of an
application without highlighting, but on the other hand he wouldn't
inadvertently end up with gazillions of network roundtrips in an
application which does use highlighting).
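
To make the trade-off concrete, here is a rough back-of-the-envelope sketch in C. Everything in it is an assumption for illustration: the `HitsOptions` struct, `HitsOptions_defaults()` and `estimate_round_trips()` are hypothetical names, the local default of 0 for `prefetch_highlights` is a guess (only the ClusterSearcher default is proposed above), and the model assumes that a nonzero `prefetch_count` covers all hits and that prefetched DocVectors travel in the same request as the documents.

```c
#include <stddef.h>

/* Hypothetical option set; field names follow the proposal in this thread. */
typedef struct {
    size_t prefetch_count;      /* 0 = fetch each hit on demand */
    int    prefetch_highlights; /* nonzero = fetch DocVectors with the docs */
} HitsOptions;

/* Defaults as discussed: lazy for local searchers (assumed), prefetch
 * everything including highlight data for ClusterSearcher. */
HitsOptions
HitsOptions_defaults(int is_cluster, size_t num_wanted) {
    HitsOptions opts = { 0, 0 };
    if (is_cluster) {
        opts.prefetch_count      = num_wanted;
        opts.prefetch_highlights = 1;
    }
    return opts;
}

/* Estimated network round-trips needed to iterate over all hits.
 * Assumes any nonzero prefetch_count is large enough to cover every hit. */
size_t
estimate_round_trips(HitsOptions opts, size_t num_hits,
                     size_t shards_with_hits, int uses_highlighter) {
    if (num_hits == 0) { return 0; }
    int batched = opts.prefetch_count > 0;

    /* Documents: one request per shard when batched, else one per hit. */
    size_t doc_trips = batched ? shards_with_hits : num_hits;

    /* DocVectors: free riders when prefetched with the docs; otherwise
     * the Highlighter needs one fetch_doc_vec round-trip per hit. */
    size_t vec_trips = 0;
    if (uses_highlighter && !(batched && opts.prefetch_highlights)) {
        vec_trips = num_hits;
    }
    return doc_trips + vec_trips;
}
```

Under these assumptions, 100 highlighted hits spread over 3 shards cost 200 round-trips with lazy fetching, 103 with `prefetch_count` set but `prefetch_highlights` off, and 3 with both enabled.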

-- 
Best regards,

Dag Lem