Posted to user@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2012/10/25 03:53:44 UTC

[lucy-user] ClusterSearcher statistics

On Wed, Oct 24, 2012 at 5:08 AM, Dag Lem <da...@nimrod.no> wrote:

> Here, doc_freq and top_docs should be replaced with something like
> docs_freq_and_top_docs, i.e. only one request / response per query.

It is true that calls to `doc_freq()` are responsible for a disproportionate
amount of network traffic.  However, it is not currently feasible to
consolidate the calls to `doc_freq()` and `top_docs()` into a single round
trip.

The `doc_freq()` invocations are part of query weighting -- they tell us how
many documents a given term occurs in, allowing us to increase the weight of
rare terms and decrease the weight of common ones.  It is important that the
query weighting be exactly the same for each shard, because otherwise hits
from different shards will have scores which are not comparable to each other.

To know how common a term is across the entire collection we need to survey
all shards and sum the results.  All these calls must be completed before we
can finish weighting the query, allowing us to call `top_docs()`.
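To make the arithmetic concrete, here is a rough sketch of summing per-shard `doc_freq()` results and deriving an IDF-style weight.  The names and the exact formula are illustrative only (classic IDF), not Lucy's actual internals -- the point is simply that every shard must report before the weight can be computed:

```perl
use strict;
use warnings;

# Hypothetical per-shard survey results for one field/term pair.
my @shard_doc_freqs = ( 120, 7, 431 );            # doc_freq per shard
my @shard_doc_maxes = ( 10_000, 1_000, 50_000 );  # total docs per shard

my $doc_freq = 0;
$doc_freq += $_ for @shard_doc_freqs;
my $doc_max = 0;
$doc_max += $_ for @shard_doc_maxes;

# Classic IDF: rare terms get large weights, common terms small ones.
my $idf = 1 + log( $doc_max / ( 1 + $doc_freq ) );
printf "doc_freq=%d doc_max=%d idf=%.3f\n", $doc_freq, $doc_max, $idf;
```

Because every shard contributes to the same global `$doc_freq`, every shard also sees the same weight -- which is what keeps scores comparable.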

The calls to `doc_freq()` also cannot be consolidated together easily, because
they are invoked by nested weighting methods within an arbitrarily complex
compound query object.

As an alternative, how about adding this new method to ClusterSearcher?

    =head2 set_stat_source

        my $local_searcher = Lucy::Search::IndexSearcher->new(
            index => '/path/to/index',
        );
        $cluster_searcher->set_stat_source($local_searcher);

    Set the Searcher which will be used to find index statistics.

    By default, ClusterSearcher gathers index statistics such as doc_freq()
    from all shards when performing certain calculations.  This is accurate,
    but slow because it involves numerous network round-trips.

    If a local IndexSearcher is consulted instead, network costs are
    eliminated, speeding up processes such as query weighting considerably.

    NB: Use with caution -- relevancy may be degraded to the extent that the
    content of the stat source Searcher differs from the content of the
    collection across all shards.

So long as the ClusterSearcher runs on the same machine as a large,
representative shard, using a local IndexSearcher should be a decent
workaround.  Scoring will be messed up if e.g. the local shard is completely
missing a term which is common on other shards, but at least it will be messed
up in the same way for all hits across all shards.

In the future, the best way to handle this problem will be to provide a local
cache of doc_freq stats and create a specialized Searcher subclass which wraps
the cache and knows how to respond to `doc_freq()`.  It's not important that the
stats be updated in real time, nor is it important that the cache contain rare
terms; for decent query weighting, term stats only have to be in the ballpark,
not exact.
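In plain Perl, the cache half of that idea might look something like the sketch below.  All names here are hypothetical; a real implementation would wrap this lookup in a Searcher subclass (in the manner of set_stat_source above) so that nested weighting routines hit the cache instead of the network:

```perl
use strict;
use warnings;

# Hypothetical doc_freq cache: { field => { term => doc_freq } },
# refreshed periodically rather than in real time.  Terms missing from
# the cache (typically rare ones) fall back to a ballpark estimate --
# good enough for query weighting.
my %stats_cache = (
    content => { the => 95_000, lucy => 1_200 },
);

sub cached_doc_freq {
    my ( $field, $term ) = @_;
    my $cached = $stats_cache{$field}{$term};
    return defined $cached ? $cached : 1;    # unseen-term estimate
}

print cached_doc_freq( 'content', 'lucy' ), "\n";      # cached stat
print cached_doc_freq( 'content', 'aardwolf' ), "\n";  # falls back to 1
```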

Marvin Humphrey

Re: [lucy-user] ClusterSearcher statistics

Posted by Dag Lem <da...@nimrod.no>.
Marvin Humphrey <ma...@rectangular.com> writes:

[...]

> First, regarding term extraction, it does not suffice to walk the query tree
> looking for TermQueries -- PhraseQueries also have terms, but more crucially,
> so do arbitrary user-defined Query subclasses.  In order to get at terms
> within arbitrary Query objects, Query needs an `extract_terms()` method which
> subclasses may have to override.
> 
> Second, once you obtain an array of terms via `$query->extract_terms()` and
> bulk-fetch their stats from the remote shards, you need to cache the stats in
> a hash and override `doc_freq()`.  That way, when nested query weighting
> routines invoke `$searcher->doc_freq`, they get the stat which was bulk
> fetched moments before.

[...]

> There's a lot of dissatisfaction in Lucy-land with our labyrinthine
> search-time Query weighting mechanism.  The Lucene architecture we inherited
> is ridiculously convoluted and we've already been through a couple rounds of
> refactoring trying to simplify it.  The last thing we want to do is make it
> harder to write a custom query subclass when our users already struggle with
> the complexity of that task.

OK, so how about this poor man's solution?

1. Add a private function to Searcher to switch between three
   different behaviors of doc_freq() - normal operation, store
   field/term in cache, or retrieve freq from cache.

2. For ClusterSearcher, insert an extra call to QueryParser::parse to
   store field/term pairs in the cache (discarding the returned query),
   and call the new function doc_freqs() to add the freqs to the cache.
   Then, let the existing call to QueryParser::parse retrieve from the
   cache and build the actual query.

Sure, it's a hack, but as far as I can tell it would not be very
intrusive nor change the public API.
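A rough self-contained sketch of the mode switching (the names, the dummy values, and the stand-in for the network fetch are all mine, not proposed API):

```perl
use strict;
use warnings;

# Three hypothetical doc_freq() behaviors: normal operation, recording
# field/term pairs during a throwaway parse, and answering from the
# cache during the real parse.
my $mode = 'normal';
my %cache;       # { field => { term => doc_freq } }
my @recorded;    # field/term pairs seen while recording

sub doc_freq {
    my (%args) = @_;
    my ( $field, $term ) = @args{qw( field term )};
    if ( $mode eq 'record' ) {
        push @recorded, [ $field, $term ];
        return 1;    # dummy value; the parsed query is discarded anyway
    }
    elsif ( $mode eq 'cached' ) {
        return $cache{$field}{$term} || 0;
    }
    return fetch_doc_freq_from_shards( $field, $term );    # normal path
}

sub fetch_doc_freq_from_shards {42}    # stand-in for the network calls

$mode = 'record';
doc_freq( field => 'content', term => 'lucy' );    # throwaway parse
$cache{content}{lucy} = 4;    # one bulk doc_freqs() call fills the cache
$mode = 'cached';
print doc_freq( field => 'content', term => 'lucy' ), "\n";    # real parse
```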

> Besides, bulk-fetching of term stats is only an optimization to begin with,
> and it's a sub-optimal optimization in comparison to the approach of obtaining
> term stats locally.

That depends. IMHO the advantages of a fully distributed solution can
in many cases handily trump the theoretical (and far from achievable
in practice) 2x performance win of a local statistics
database. E.g. if I envision, some time in the future, *several*
clients querying the same massive sharded Lucy index, it smells like
unwanted complexity to have to maintain a local index for each
client.

Sure, if you have a single client where performance is paramount, and
adding more shards is not practical, then local statistics would be
very nice.

I'd say as Winnie-the-Pooh: Both! :-)

[...]

> > I guess this would be nice to have for applications which are
> > extremely performance sensitive.
> 
> Doesn't that include your use case?

Not at all, really :) I've only been doing some tests on Lucy to see
whether it could be used in a possible future project. This would
cover a batch oriented system without any hard limits on performance.
I simply wanted to see just how fast things could run (faster is
always better), tested SearchServer / ClusterSearcher, and you know
the rest :-)

> I was hoping that this approach would meet your immediate needs. :\

Rest assured that Lucy would without a doubt cover my needs, if the
project should materialize! :-)

> No problem! :)
> 
>     package EmptyStatSource;
>     use base qw( Lucy::Search::Searcher );
> 
>     sub doc_freq {1}

Nice :-)

-- 
Best regards,

Dag Lem

Re: [lucy-user] ClusterSearcher statistics

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Oct 25, 2012 at 2:51 AM, Dag Lem <da...@nimrod.no> wrote:
> I've thought a bit about this one; couldn't the problem in principle
> be solved as follows?
>
> 1. Walk the query tree, storing references to all TermQuery objects in
>    a list.
> 2. Call a new function, doc_freqs(), with a list of field/term pairs
>    from the TermQuery objects as its argument. doc_freqs() would in
>    essence call the existing doc_freq() for each field/term pair, and
>    return some form of list of field/term/count triplets.
> 3. Store the returned counts in the corresponding TermQuery objects.
> 4. Replace all calls to doc_freq with lookups of precomputed counts in
>    the TermQuery objects. (Alternatively, the calls can be kept by
>    renaming the original doc_freq to something else for the call in
>    2., and implementing a replacement doc_freq to do the lookup).
>
> Or some workable variant of the above - you get the idea :-)
>
> The upshot of this would be that only one network round-trip per server
> would be necessary in order to get hold of all of the document counts
> per field/term, simply by replacing doc_freq() with doc_freqs() in the
> application protocol.
>
> What do you think?

Your instincts are good -- we used to have something like that, inherited from
Lucene.  Here's the relevant section of Lucene 1.9's MultiSearcher.java:

    http://s.apache.org/PI6  (link to svn.apache.org)

There were two parts to the old strategy: extracting terms from arbitrary
Query objects, and capturing the retrieved stats to a cache.

First, regarding term extraction, it does not suffice to walk the query tree
looking for TermQueries -- PhraseQueries also have terms, but more crucially,
so do arbitrary user-defined Query subclasses.  In order to get at terms
within arbitrary Query objects, Query needs an `extract_terms()` method which
subclasses may have to override.

Second, once you obtain an array of terms via `$query->extract_terms()` and
bulk-fetch their stats from the remote shards, you need to cache the stats in
a hash and override `doc_freq()`.  That way, when nested query weighting
routines invoke `$searcher->doc_freq`, they get the stat which was bulk
fetched moments before.

Here's an approximation of how ClusterSearcher#doc_freq would need to look:

    sub doc_freq {
        my ($self, %args) = @_;
        my $field = $args{field};
        my $term  = $args{term};
        my $cache = $self->_get_doc_freq_cache;
        if ($cache->{$field}) {
            my $doc_freq = $cache->{$field}{$term};
            if (defined $doc_freq) {
                return $doc_freq;
            }
        }
        return 0;
    }

We killed this approach off a long time ago, though.

There's a lot of dissatisfaction in Lucy-land with our labyrinthine
search-time Query weighting mechanism.  The Lucene architecture we inherited
is ridiculously convoluted and we've already been through a couple rounds of
refactoring trying to simplify it.  The last thing we want to do is make it
harder to write a custom query subclass when our users already struggle with
the complexity of that task.

Besides, bulk-fetching of term stats is only an optimization to begin with,
and it's a sub-optimal optimization in comparison to the approach of obtaining
term stats locally.

In general, changing a public API to support an optimization should only be
done under extraordinary circumstances.  In this specific case, it's hard to
justify adding complexity to a public API which is already too complicated in
order to support an optimization when better alternatives exist.

>> So long as the ClusterSearcher runs on the same machine as a large,
>> representative shard, using a local IndexSearcher should be a decent
>> workaround.  Scoring will be messed up if e.g. the local shard is completely
>> missing a term which is common on other shards, but at least it will be messed
>> up in the same way for all hits across all shards.
>
> I guess this would be nice to have for applications which are
> extremely performance sensitive.

Doesn't that include your use case?

I was hoping that this approach would meet your immediate needs. :\

For what it's worth, it would be easy to implement.

> Another idea would be to have the
> possibility of omitting the fetching of any statistics whatsoever, if
> there should be use cases where relevancy based on term frequencies is
> not needed.

No problem! :)

    package EmptyStatSource;
    use base qw( Lucy::Search::Searcher );

    sub doc_freq {1}

    ...

    $cluster_searcher->set_stat_source(EmptyStatSource->new);

> Note, however, that assuming the solution I proposed above is
> workable, the theoretical possible speedup for using local statistics
> is no more than 2 (half the number of network roundtrips, assuming
> zero cost for everything else), at the inconvenience of increased
> infrastructural complexity and decreased accuracy of hit relevancy.

That's true, but we'll need a better mechanism eventually, anyway.

Right now, we only support sharded search, but we plan to add support for
sharded indexing at some point.  I anticipate that we'll tackle maintaining a
term stats cache at that time.

Here's a paper describing the issue:

    http://wortschatz.uni-leipzig.de/~fwitschel/papers/ipm1152.pdf

    Global Term Weights in Distributed Environments, H. Witschel, 2007

    This paper examines the estimation of global term weights (such as IDF) in
    information retrieval scenarios where a global view on the collection is
    not available.  In particular, the two options of either sampling
    documents or of using a reference corpus independent of the retrieval
    collection are compared using standard IR test collections. In addition,
    the possibility of pruning term lists based on frequency is evaluated.

Pruning low-frequency terms seems like the most appropriate approach for a
generic tool like Lucy, since we cannot guarantee that randomly sampled
documents are truly representative of a larger collection:

    The pruning approach taken here is very simple: terms with low frequency
    in the sample are pruned from the list, i.e. they are treated as if they
    had not occurred in the sample by assigning them the weight estimate for
    unseen terms.

However, sampling works well most of the time and the sample doesn't even need
to be that large -- so the approach of using a single shard to supply stats
should work pretty well as long as the shard is representative.
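The pruning step from the paper is simple enough to sketch directly.  Everything here is illustrative (sample data, threshold, and the unseen-term estimate are arbitrary):

```perl
use strict;
use warnings;

# Frequency-based pruning as described in the paper: keep only the
# frequent terms from a sampled shard, and give everything else the
# single weight estimate reserved for unseen terms.
my %sample_df = ( the => 900, search => 120, lucy => 40, aardwolf => 2 );
my $threshold = 10;    # prune terms rarer than this in the sample
my $unseen_df = 1;     # weight estimate for unseen/pruned terms

my %pruned = map  { ( $_ => $sample_df{$_} ) }
             grep { $sample_df{$_} >= $threshold } keys %sample_df;

sub estimated_df {
    my ($term) = @_;
    return exists $pruned{$term} ? $pruned{$term} : $unseen_df;
}

print estimated_df('search'), " ", estimated_df('aardwolf'), "\n";   # 120 1
```

The result is the "extended list of stop words" the paper describes: a short table of frequent terms, with everything else collapsed to one estimate.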

    This tells us that what is really needed for good retrieval performance is
    just a small list of very frequent terms (one could call it an “extended
    list of stop words”).

    ...

    Sampling is generally attractive in terms of effectiveness: in most cases,
    a sample of less than 5% of the collection was sufficient for close to
    optimal performance.

Marvin Humphrey

Re: [lucy-user] ClusterSearcher statistics

Posted by Dag Lem <da...@nimrod.no>.
Marvin Humphrey <ma...@rectangular.com> writes:

[...]

> To know how common a term is across the entire collection we need to survey
> all shards and sum the results.  All these calls must be completed before we
> can finish weighting the query, allowing us to call `top_docs()`.

OK.

> The calls to `doc_freq()` also cannot be consolidated together easily, because
> they are invoked by nested weighting methods within an arbitrarily complex
> compound query object.

I've thought a bit about this one; couldn't the problem in principle
be solved as follows?

1. Walk the query tree, storing references to all TermQuery objects in
   a list.
2. Call a new function, doc_freqs(), with a list of field/term pairs
   from the TermQuery objects as its argument. doc_freqs() would in
   essence call the existing doc_freq() for each field/term pair, and
   return some form of list of field/term/count triplets.
3. Store the returned counts in the corresponding TermQuery objects.
4. Replace all calls to doc_freq with lookups of precomputed counts in
   the TermQuery objects. (Alternatively, the calls can be kept by
   renaming the original doc_freq to something else for the call in
   2., and implementing a replacement doc_freq to do the lookup).

Or some workable variant of the above - you get the idea :-)

The upshot of this would be that only one network round-trip per server
would be necessary in order to get hold of all of the document counts
per field/term, simply by replacing doc_freq() with doc_freqs() in the
application protocol.

What do you think?
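For what it's worth, the batching in step 2 could be sketched like this (a toy model in plain Perl -- the shard hashes stand in for remote servers, and doc_freqs_for_shard is a hypothetical name):

```perl
use strict;
use warnings;

# One hypothetical doc_freqs() request per shard: look up many
# field/term pairs at once, then sum the per-shard counts client-side.
sub doc_freqs_for_shard {
    my ( $shard_stats, $pairs ) = @_;
    return [ map { $shard_stats->{ $_->[0] }{ $_->[1] } || 0 } @$pairs ];
}

my @pairs  = ( [ content => 'lucy' ], [ content => 'search' ] );
my @shards = (
    { content => { lucy => 3, search => 10 } },
    { content => { lucy => 1, search => 7 } },
);

my @totals = (0) x @pairs;
for my $shard (@shards) {
    my $counts = doc_freqs_for_shard( $shard, \@pairs );  # one round trip
    $totals[$_] += $counts->[$_] for 0 .. $#pairs;
}
print "@totals\n";    # summed doc freqs, one entry per field/term pair
```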

> As an alternative, how about adding this new method to ClusterSearcher?
> 
>     =head2 set_stat_source
> 
>         my $local_searcher = Lucy::Search::IndexSearcher->new(
>             index => '/path/to/index',
>         );
>         $cluster_searcher->set_stat_source($local_searcher);
> 
>     Set the Searcher which will be used to find index statistics.

[...]

> So long as the ClusterSearcher runs on the same machine as a large,
> representative shard, using a local IndexSearcher should be a decent
> workaround.  Scoring will be messed up if e.g. the local shard is completely
> missing a term which is common on other shards, but at least it will be messed
> up in the same way for all hits across all shards.

I guess this would be nice to have for applications which are
extremely performance sensitive. Another idea would be to have the
possibility of omitting the fetching of any statistics whatsoever, if
there should be use cases where relevancy based on term frequencies is
not needed.

Note, however, that assuming the solution I proposed above is
workable, the theoretical possible speedup for using local statistics
is no more than 2 (half the number of network roundtrips, assuming
zero cost for everything else), at the inconvenience of increased
infrastructural complexity and decreased accuracy of hit relevancy.

-- 
Best regards,

Dag Lem