Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2011/11/05 19:23:03 UTC

[lucy-dev] ClusterSearcher

cc to lucy-dev...

On Sat, Nov 05, 2011 at 08:28:46AM +0200, goran kent wrote:
> On 11/4/11, Marvin Humphrey <ma...@rectangular.com> wrote:
> > Sounds like the nodes are being accessed serially rather than in parallel.
> > I'll look into it.
> 
> I'd love to sniff around with some debug prints, etc, can you point me
> to the relevant code where this might (not) be occurring?

The serialized requests are initiated by Polysearcher -- see
PolySearcher_top_docs() in the source file
trunk/core/Lucy/Search/PolySearcher.c.  Here is the problematic loop, with
explanatory comments inserted.

    // Loop over an array of Searcher objects.  In this case, each of the
    // Searchers is a LucyX::Remote::SearchClient.
    for (i = 0, max = VA_Get_Size(searchers); i < max; i++) {
        // Extract an individual Searcher and its corresponding doc id offset.
        Searcher   *searcher   = (Searcher*)VA_Fetch(searchers, i);
        int32_t     base       = I32Arr_Get(starts, i);
        // This line triggers a call to the top_docs() subroutine within
        // SearchClient.pm.  It blocks until top_docs() returns, and thus the
        // total time to process all remote requests in this loop is the sum
        // of all child node response times.
        TopDocs    *top_docs   = Searcher_Top_Docs(searcher, (Query*)compiler,
                                                   num_wanted, sort_spec);
        /* ... */
    }

To process the searches in parallel, we need a select loop[1].  However,
PolySearcher can only access SearchClient via the abstract
Lucy::Search::Searcher interface -- it knows nothing about the socket calls
that are being made by SearchClient.pm.  PolySearcher would have to pierce
encapsulation in order to get at those sockets and multiplex the requests.

The most straightforward solution is to eliminate PolySearcher from the
equation and to create a class that combines the functionality of PolySearcher
and SearchClient.  Fortunately, neither of them is particularly large or
complex, so the task is very doable.

I propose that we name this new class LucyX::Remote::ClusterSearcher.  

  * Fork SearchClient.pm to ClusterSearcher.pm and t/510-remote.t to
    t/550-cluster_searcher.t.
  * Give ClusterSearcher the ability to talk to multiple SearchServers.
  * Change to a two-stage RPC mechanism:
    1. Fire off the requests to the individual SearchServers in a "for" loop.
    2. Gather the responses into an array using a select() loop (powered by an 
       IO::Select object).
  * Adapt each of the Searcher methods that ClusterSearcher implements to
    assemble a sensible return value from the array of responses using
    PolySearcher's techniques.
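The two-stage RPC above can be sketched as follows. This is a toy Python illustration (the actual proposal uses Perl's IO::Select; the socket protocol, `fake_node`, and `cluster_top_docs` names here are all hypothetical): fire every request first, then multiplex the replies in a select loop so the total wait tracks the slowest node rather than the sum of all nodes.

```python
# Sketch of the proposed two-stage RPC: fire all requests, then gather
# responses with a select loop so total wait ~ max(node latencies), not sum.
# Toy wire protocol and helper names -- for illustration only.
import selectors
import socket
import threading
import time

def fake_node(listener, delay):
    """Toy SearchServer stand-in: read a request, sleep, echo a response."""
    conn, _ = listener.accept()
    req = conn.recv(1024)
    time.sleep(delay)
    conn.sendall(b"top_docs for " + req)
    conn.close()

def cluster_top_docs(addresses, request):
    """Stage 1: fire requests to every node; stage 2: select() until all reply."""
    sel = selectors.DefaultSelector()
    # Stage 1: send the request to each node in a plain "for" loop.
    for addr in addresses:
        sock = socket.create_connection(addr)
        sock.sendall(request)
        sock.setblocking(False)
        sel.register(sock, selectors.EVENT_READ)
    # Stage 2: gather responses as they become readable.
    responses = []
    while len(responses) < len(addresses):
        for key, _ in sel.select():
            responses.append(key.fileobj.recv(4096))
            sel.unregister(key.fileobj)
            key.fileobj.close()
    return responses

# Spin up two toy nodes, each taking 0.2 s to answer.
listeners = []
for delay in (0.2, 0.2):
    lst = socket.socket()
    lst.bind(("127.0.0.1", 0))
    lst.listen(1)
    threading.Thread(target=fake_node, args=(lst, delay), daemon=True).start()
    listeners.append(lst)

start = time.time()
results = cluster_top_docs([l.getsockname() for l in listeners], b"query")
elapsed = time.time() - start
print(len(results))  # both replies arrive in ~0.2 s, not the 0.4 s serial sum
```

The point is the seam: the serial PolySearcher loop pays 0.2 s + 0.2 s, while the two-stage version pays roughly max(0.2, 0.2).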

This won't be the end of our iterating if we want to build a robust clustering
system, because it doesn't yet address either node availability issues or
near-real-time updates.  However, it provides the functionality that we meant
to make available via PolySearcher/SearchServer/SearchClient, allowing Goran
to evaluate whether the system meets his basic requirements, and moves us
incrementally towards a highly desirable goal: a ClusterSearcher object backed
by multiple search nodes that is just as easy to use as an IndexSearcher
backed by one index on one machine.

PS: Goran...I'm under the weather right now, so if you're counting on me to
code this up, I'm not sure how quickly I'll get to it.

Marvin Humphrey

[1] http://www.perlfect.com/articles/select.shtml


Re: [lucy-dev] ClusterSearcher

Posted by Nathan Kurz <na...@verse.com>.
On Sun, Nov 27, 2011 at 1:45 AM, Dan Markham <dm...@gmail.com> wrote:
> The best way to describe what i plan to do and currently do: 50% name/value key pairs.. and 50% full text search.

That sounds close to my intended use.  I'm particularly interested in
hybrids between recommendations, full text search, and filtered
categories.  I'd be searching within a particular domain (such as
movies, music, academic papers) and want to return results that meet
the search criteria but are ordered in a highly personalized way.

>> How fast is it
>> changing?
> I'm thinking avg. number of changes will be about ~15 a second.

OK, so manageable on the changes, but you want to make sure that
updates are immediately viewable by the updater.


> What's a ballpark for the searches
>> per second you'd like to handle?
>
> 1k/second (name/value style searches) with the 98th percentile search under 30ms.
> 1k/second (full text with nasty OR queries with large posting files) with the 98th percentile search under 300ms.

Do you really require such a low latency on the simple searches, or
are you basing this on back-calculating from requests per second?  I
think you'd get better throughput (requests/sec) if you could relax
this requirement and allow for some serialized access to nodes.

>>  Do the shards fit in memory?
> Yes and no...
> Will have some servers with low query requirements overloaded to disk..
> High profile Indexes with low search SLA's yes.

The hope is that the mmap() approach should degrade gracefully up to a
point, so this should work as long as loads truly are light and the tail
is short --- only short posting lists are going to be small enough to be
read from disk in the amount of time you are talking about.
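A back-of-envelope check on that worry, using assumed figures (roughly 10 ms per random seek on rotating media, ~100 MB/s sequential once positioned -- plausible 2011-era numbers, not measurements):

```python
# With ~10 ms per random disk seek, a 30 ms latency budget allows only a
# handful of seeks, so only short posting lists can come off disk in time.
# All figures below are assumptions for illustration.
seek_ms = 10.0           # assumed average seek + rotational latency
budget_ms = 30.0         # Dan's 98th-percentile target for simple searches
transfer_mb_per_s = 100  # assumed sequential throughput once positioned

seeks_affordable = budget_ms // seek_ms
# If one seek positions us at a posting list, the rest of the budget buys:
mb_readable = (budget_ms - seek_ms) / 1000 * transfer_mb_per_s
print(int(seeks_affordable), round(mb_readable, 1))  # 3 seeks, ~2 MB
```

Three seeks and a couple of megabytes per query is not much room for long posting lists, which is exactly the "short tail" caveat above.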

> I'm thinking you're talking about the top_docs call getting to use a hinted low watermark in its priority queue?

Yes, I'm thinking that a "low watermark" would help to establish some
known minimum, as well as a "high watermark" to allow for offsets in
results.  This would help when trying to get (for example) results
1000-1100 without having to save or send the top 1100 from each node.

In addition, one could save network traffic by adding a roundtrip
between the central and the nodes, where the high and low are returned
first and the central then sends a request for the details after the
preliminary responses are tabulated.
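A rough sketch of why the watermark saves traffic for deep paging (toy data and hypothetical helper, not Lucy's API): if the central searcher learns the overall 1100th-best score in a first round, a second round can ask each node only for hits scoring above that mark.

```python
import random

# Naive deep paging: to serve results 1000-1100, every node ships its top
# 1100 hits.  With a known "low watermark" (the 1100th-best score overall),
# a second round asks each node only for hits above that mark.
def node_hits_above(node_scores, low_watermark):
    """What a node returns in the second round: only competitive hits."""
    return [s for s in node_scores if s > low_watermark]

# Toy data: 3 nodes, 2000 scored hits each.
random.seed(1)
nodes = [[random.random() for _ in range(2000)] for _ in range(3)]

# Round 1 (summaries only, details omitted): central learns the overall
# 1100th-best score across all nodes.
all_scores = sorted((s for ns in nodes for s in ns), reverse=True)
low_watermark = all_scores[1099]

# Round 2: nodes send only hits above the watermark.
second_round = [node_hits_above(ns, low_watermark) for ns in nodes]
shipped = sum(len(h) for h in second_round)
naive = 3 * 1100
print(shipped, naive)  # 1099 docs shipped in total vs 3300 naively
```

The extra roundtrip costs latency, but the volume crossing the wire drops from num-nodes * 1100 to roughly 1100 total.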

I also reached the conclusion at some point that the "score" returned
should be allowed to include a string, so that results can be arranged
alphabetically based on the value of a given field in all matching
records:  FULLTEXT MATCHES "X" SORT BY FIELD "Y".
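In Python terms the idea is tiny (toy data, not Lucy's result types): if each hit carries a field value as its "score", the central merge is just a sort on that key.

```python
# If the per-hit "score" may be a string (here, a field value), merging
# results alphabetically is an ordinary sort on that key.
hits_from_node_a = [("doc3", "banana"), ("doc1", "apple")]
hits_from_node_b = [("doc7", "cherry"), ("doc5", "apricot")]

merged = sorted(hits_from_node_a + hits_from_node_b, key=lambda h: h[1])
print([doc for doc, _ in merged])  # ['doc1', 'doc5', 'doc3', 'doc7']
```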

>> how to handle distributed TF/IDF...
>>
>
> This is *easy* to solve on a per-index basis with insider knowledge about the index and how it's segmented. Doing it perfectly for everyone, and fast, sounds hard. Spreading out the cost of caching/updating the TF/IDF i think is key.
> I like the idea of sampling a node or 2 to get the cache started (service the search) and then finishing the cache out of band to get a better, more complete picture. Unless you're adding/updating an index with an all-new term mix quickly.. i don't think the TF/IDF cache needs to move quickly.

I'm strongly (perhaps even belligerently?) of the opinion that the
specific values for TF need to be part of the query that is sent to
the nodes, rather than something local to each node's scoring
mechanism.  Term frequency should be converted to a weight for each
clause of the query, and that weight (boost) should be used by the
scorer.  This boost can be equivalent to local, global, or approximate
term frequency as desired, but get it out of the core!

With this in mind, if you truly need an exact term frequency, and are
unable to assume that any single index is a reasonable approximation
for the entire corpus, I think the only solution is to have a standalone
TF index.  These will be small enough that they should be easy to
manage per node if necessary.  Every few seconds the TF updates are
broadcast from the indexing machine and replicated as necessary.
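A minimal sketch of "get it out of the core" (hypothetical structures; the classic idf formula stands in for whatever weighting Lucy's Similarity actually uses): the central searcher turns its global picture of document frequency into a per-clause boost, and nodes score with the boost alone.

```python
import math

# The central searcher computes a weight (boost) per query clause from
# whatever doc-frequency picture it has -- exact, sampled, or stale --
# and ships those weights with the query.  Nodes just multiply; they
# never consult their local corpus statistics.
def clause_boosts(doc_freqs, total_docs):
    """Classic idf = log(N / df), computed once at the central searcher."""
    return {term: math.log(total_docs / df) for term, df in doc_freqs.items()}

# Central searcher's (possibly approximate) global stats:
global_stats = {"lucy": 120, "search": 4000}
boosts = clause_boosts(global_stats, total_docs=100_000)

# The wire query carries the weights; a node scores with them blindly.
def node_score(term_freqs_in_doc, boosts):
    return sum(tf * boosts[t] for t, tf in term_freqs_in_doc.items())

score = node_score({"lucy": 3, "search": 1}, boosts)
print(round(score, 3))
```

Whether the boosts come from exact global stats, a sampled node, or a standalone TF index changes only how `global_stats` is filled in -- the scorer on each node stays the same.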


> So the way i fixed the 100 shard problem (in my head) was to build a pyramid of MultiSearchers -- but that doesn't really work either, and i think it now makes things worse.

My instinct is that this would not be a good architecture.  While
the recursive approach is superficially appealing (and very OO),
I think it would be a performance nightmare. I may be teaching my
grandmother to suck eggs, but I'm pretty sure that the limiting factor
for full text search at the speeds you desire is going to be memory
bandwidth:  how fast can you sift through a bunch of RAM?

I'd love if someone can check my numbers, but my impression is that
current servers can plow through the order of 10 GB/second from RAM,
which means that each core can do several integer ops per multi-byte
position read.  Multiply by 4, 8, or 16 cores, and we are quickly
memory bound.   I'm not sure where the crossover is, but adding more
shards per node is quickly going to hurt rather than help.
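The arithmetic behind that claim, with the figures treated as assumptions to be checked rather than measurements:

```python
# If a server streams ~10 GB/s from RAM and each posting/position read
# touches ~4 bytes, memory feeds about 2.5e9 reads per second.  A 3 GHz
# core doing a few integer ops per read saturates that with only a few
# cores -- extra shards per node then fight over the same bandwidth.
# All figures are assumptions for illustration.
ram_gb_per_s = 10
bytes_per_position = 4
reads_per_s = ram_gb_per_s * 1e9 / bytes_per_position

core_hz = 3e9
ops_per_read = 3  # assumed integer ops spent per position read
cores_to_saturate = reads_per_s * ops_per_read / core_hz
print(int(reads_per_s), round(cores_to_saturate, 1))  # 2.5e9 reads/s, ~2.5 cores
```

Under these assumptions the crossover comes after only a handful of cores, which is the sense in which piling shards onto one node stops helping.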

> How do i generate the low watermark we pass to nodes without getting data back from one node?

I think you've given the best and possibly only answer to your
question:  make another round trip.  Or for maximum performance, make
multiple round trips.  Do you really need a response in 30 ms?

--nate

Re: [lucy-dev] ClusterSearcher

Posted by Dan Markham <dm...@gmail.com>.
On Nov 26, 2011, at 1:48 AM, Nathan Kurz wrote:

> Dan --
> 
> I took a glance.  Sounds promising.  Could you talk a bit about the use
> case you are anticipating?  
> What are you indexing?  

The best way to describe what i plan to do and currently do: 50% name/value key pairs.. and 50% full text search.
Both have rather large documents, and lots of sort fields. I truly do abuse the crap out of KinoSearch currently. Highlighting is about the only thing i don't use heavily.

> How fast is it
> changing?
I'm thinking avg. number of changes will be about ~15 a second.
During more bulky style changes... I hope much faster.


>  Do the shards fit in memory?  
Yes and no...
Will have some servers with low query requirements overloaded to disk.. 
High profile Indexes with low search SLA's yes.


> What's a ballpark for the searches
> per second you'd like to handle?

1k/second (name/value style searches) with the 98th percentile search under 30ms.
1k/second (full text with nasty OR queries with large posting files) with the 98th percentile search under 300ms.
 
> My first thought is that you may be able to trade off some latency for
> increased throughput by sticking with partially serialized requests if you
> were able to pass a threshold score along to each node/shard so you could
> speed past low scoring results.

More detail!
I'm thinking you're talking about the top_docs call getting to use a hinted low watermark in its priority queue?

If so.. i was chatting with Marvin about this the other day.. i was scared of creating a cluster with 100 nodes.
One reason was the sheer number of docs i would need to push over the network with num_wanted => 10, offset => 200.

The thing that killed the idea for me...
How do i generate the low watermark we pass to nodes without getting data back from one node?

So the way i fixed the 100 shard problem (in my head) was to build a pyramid of MultiSearchers -- but that doesn't really work either, and i think it now makes things worse. I'm thinking that by the time i start worrying about 100+ nodes, sampling and early termination will be a must. Another crazy idea while i have you this far off track: top_docs requests go into a "pool", all 100 nodes try to run the queries in the pool, and each places the score of its lowest scoring doc into the response pool for that node. The top_docs query submitter can then decide how long to wait for responses/how many responses to wait for, and knows which nodes he will need top_docs from.




So what i'm doing in lucy_cluster is not trying to solve the 100-node issue just yet, and keeping the number of nodes small (< 10), but at the same time keeping the number of shards at about 30.  Mainly so i can rebalance nodes by just moving shards... and nodes can search more than one shard locally with a multi-searcher.

>  But this brings up Marvin's points about
> how to handle distributed TF/IDF...
> 

This is *easy* to solve on a per-index basis with insider knowledge about the index and how it's segmented. Doing it perfectly for everyone, and fast, sounds hard. Spreading out the cost of caching/updating the TF/IDF i think is key.
I like the idea of sampling a node or 2 to get the cache started (service the search) and then finishing the cache out of band to get a better, more complete picture. Unless you're adding/updating an index with an all-new term mix quickly.. i don't think the TF/IDF cache needs to move quickly.


-Dan


Re: [lucy-dev] Re: ClusterSearcher

Posted by Nathan Kurz <na...@verse.com>.
Dan --

I took a glance.  Sounds promising.  Could you talk a bit about the use
case you are anticipating?  What are you indexing?  How fast is it
changing?  Do the shards fit in memory?  What's a ballpark for the searches
per second you'd like to handle?

My first thought is that you may be able to trade off some latency for
increased throughput by sticking with partially serialized requests if you
were able to pass a threshold score along to each node/shard so you could
speed past low scoring results.  But this brings up Marvin's points about
how to handle distributed TF/IDF...

--nate

[lucy-dev] Re: ClusterSearcher

Posted by Dan <dm...@gmail.com>.
I'm itching to test out some of these ideas with a dirty prototype.
So i have started a github project and plan on hacking something together
in a slightly different style than LucyX/Remote/*.  If nothing more, just
to get my head around all the challenges and provoke thought.  I'm also
pulling in half of CPAN: ZeroMQ/MessagePack/EV.  So even if this beats all
expectations, getting anything more than the ideas back into Core will be
a pain.

https://github.com/dmarkham/lucy_cluster
What code i have done is on a branch, not master.

I would love some hecklers

-Dan

[lucy-dev] Re: [lucy-user] ClusterSearcher

Posted by Dan <dm...@gmail.com>.
>> For Lucy as a whole, I think there are some meta-questions that should
>> be resolved before we go down this path.
>>
>> 1) How core is this to Lucy's functionality?
Support feels *very* core.. Lucy designed without great cluster
search/write support seems broken when scaled, hacked at best.

Building a cluster search implementation in the core Lucy project? no idea.

>> 2) How much should we depend on outside libraries?

 Core Lucy:  I hope as little as possible.
 Cluster Search: Best tools for job.


>> 3) How independent should the Searcher and the Clients be?
 What do you mean by this? I keep getting tripped up: with a clustered
search there are 2 servers and 2 clients, and it's easy to be talking
about the wrong ones (for me).

    (client #1)               (server #1 and client #2)          (server #2)
 python/perl user --->       Cluster request receiver   --->     nodes 1-N

Even if the "Cluster request receiver" is just a normal node/master
search collator/etc, you still need a way to take requests and ask your
peers for data you don't have.
I'm sure someone has good names.


>> 4) How future-proof and scalable do we want this solution to be?



Current things rattling around in my head as we have been talking
about long-term cluster support.

Search or Write Optimized?
   I *think* most people would agree we should lean towards search optimized.
   But real-time/fast reopens is a big feature we have over others currently,
   and I would not want to lose it (or the perception of real-time).
   I also do not want to lose fast bulk adds.

What Type of Cluster?
   a. Multi-Master?
   b. Master-Slave(s)?
   c. Sharded-Master-Slave(s)?

When can a searcher see the new data?
 a. intermittently as it's replicated to all nodes?
 b. only after all nodes have a copy?
 c. Instantly for the client that added.. but (a) from everyone else?

Document Versioning.
 a. Last guy to write wins (deletes become interesting!)?
 b. Vector clocks with client resolution?
 c. currently Lucy docs have no "primary key";
     this feels like it would need to change *if* versioning is
     required for clustering
 d. not needed at all?

Delete by Query.
 a. Seems a little tricky to me in clustered search with replication and
    possible out-of-order execution.
    I'm sure something is doable; just something to think about.

Replicating Data.
 a. Index copies?
 b. Segment copies?
 c. Doc copies?

Server failover properties.
 a. Auto-Rebalancing (shard/segment/index)?
 b. Always-writeable model (multi-master)?
 c. Slave auto-promote?

Schema Changes:
 a. Will every node have to be updated simultaneously to prevent
    search/write failures?
    I'm talking about things like adding a new field, not changing a field.



-Dan

Re: [lucy-dev] ClusterSearcher

Posted by Dan <dm...@gmail.com>.
>>> Regardless of the path we go for building / shipping clustered search
>>> solution.  I'm mostly interested in the api's to the lower level lucy that
>>> make it possible and how to make them better.
>>
>> Well, my main concern, naturally, is the potential burden of exposing low-level
>> internals as public APIs, constraining future Lucy core development.
>
> It's a good concern, and I'm not certain what Dan is envisioning, but
> I'm hoping that improving the API's means _less_ exposure of the
> internals.  Rather than passing around Searcher and Index objects
> everywhere, I'd love to make it explicitly clear what information is
> available to whom:  if a remote client doesn't return it, you can't
> use it.  Instead of increasing exposure for remote clients, we'd
> simplify the interface to local Searchers.

My vision would be something like:
Core Lucy has the APIs exposed that are needed for a cluster search client
and server.

Then we build an implementation of "ClusterSearch" on top of Core Lucy.
Regardless of whether the project is core/non-core, it is built on top of
well defined interfaces in Core Lucy.
So maybe Lucy Core does not require or use ZeroMQ/MessagePack but
"ClusterSearch" does.
And this is done by sub-classing Core Lucy and changing the
serialization/message passing to work across the network.
Mainly, "ClusterSearch" doesn't get to cheat.  Anyone could build a
different one with different properties/use cases and still be a first
class citizen.


-Dan

Re: [lucy-dev] ClusterSearcher

Posted by Nathan Kurz <na...@verse.com>.
On Mon, Nov 7, 2011 at 1:50 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Sun, Nov 06, 2011 at 08:39:51PM -0800, Dan Markham wrote:
>> ZeroMQ and Google's Protocol Buffers are both looking great for building a
>> distributed search solution.
>
> The idea of normalizing our current ad-hoc serialization mechanism using
> Google Protocol Buffers seems interesting, though it looks like it might be a
> lot of work and messy besides.
>
> First, Protocol Buffers doesn't support C -- only C++, Python and Java -- so
> we'd have to write our own custom plugin.  Dunno how hard that is.

While I'm relying on Google rather than experience, I don't think that
C support is actually a problem.
There seem to be C bindings: http://code.google.com/p/protobuf-c/
Or roll your own:
http://blog.reverberate.org/2008/07/12/100-lines-of-c-that-can-parse-any-protocol-buffer/

> Second, the Protocol Buffers compiler is a heavy dependency -- too big to
> bundle.  We'd have to capture the generated source files in version control.

Alternatively, it could just be a dependency.  While I recognize your
desire to keep the core free of such, I think it's entirely reasonable
for LucyX packages to require outside libraries and tools.  The
question would be whether it's reasonable or desirable to relegate
ClusterSearch to non-core.

> Further investigation seems warranted.  It would sure be nice if we could lower
> our costs for developing and maintaining serialization routines.
>
On Mon, Nov 7, 2011 at 2:39 PM, Nick Wellnhofer <we...@aevum.de> wrote:
> MessagePack might be worth a look. See http://msgpack.org/

Yes, that looks good too.  I'm not suggesting that we restrict ourselves
to Protocol Buffers, only that it should be possible to use them for
interprocess communication, among other options.  A good architecture
(in my opinion) would be one that allows the over-the-wire protocol to
change without requiring in-depth knowledge of Lucy's internals.  I
think the key is to have a clear definition of what "information" is
required by each layer of Lucy, rather than serializing and
deserializing raw objects.
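That separation can be sketched in a few lines (hypothetical field names, not Lucy's actual wire format; JSON stands in for whatever codec is chosen): the message is plain data describing exactly what top_docs() needs, and the encoder is a pluggable seam.

```python
import json

# "Define the information, not the objects": the wire format is a plain
# data structure listing what top_docs() needs, and the codec (JSON here
# as a stand-in; Protocol Buffers or MessagePack would slot into the same
# seam) is pluggable.  Field names are hypothetical.
def make_top_docs_request(query_text, num_wanted, offset, boosts):
    """Everything a node needs -- no Searcher or Query objects cross the wire."""
    return {
        "method": "top_docs",
        "query": query_text,
        "num_wanted": num_wanted,
        "offset": offset,
        "boosts": boosts,
    }

encode = json.dumps  # the only line that changes if the protocol changes
decode = json.loads

wire = encode(make_top_docs_request("apache lucy", 10, 0, {"lucy": 2.5}))
request = decode(wire)
print(request["method"], request["num_wanted"])  # top_docs 10
```

Swapping the over-the-wire protocol then means swapping `encode`/`decode`, with no knowledge of Lucy's internals required.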

> As for ZeroMQ, it's LGPL which pretty much rules it out for us -- nothing
> released under the Apache License 2.0 can have a required LGPL dependency.

You know these rules better than I do, but I worry that your
interpretations are often stricter than required by Apache's legal
counsel.
There's room for optional dependencies:
http://www.apache.org/legal/resolved.html#optional
For example, it looks like Apache Thrift (another alternative protocol
to consider) isn't scared of ZeroMQ:
https://issues.apache.org/jira/browse/THRIFT-812

>> Regardless of the path we go for building / shipping clustered search
>> solution.  I'm mostly interested in the api's to the lower level lucy that
>> make it possible and how to make them better.
>
> Well, my main concern, naturally, is the potential burden of exposing low-level
> internals as public APIs, constraining future Lucy core development.

It's a good concern, and I'm not certain what Dan is envisioning, but
I'm hoping that improving the API's means _less_ exposure of the
internals.  Rather than passing around Searcher and Index objects
everywhere, I'd love to make it explicitly clear what information is
available to whom:  if a remote client doesn't return it, you can't
use it.  Instead of increasing exposure for remote clients, we'd
simplify the interface to local Searchers.

> If we actually had a working networking layer, we'd have a better idea about
> what sort of APIs we'd need to expose in order to facilitate alternate
> implementations.  Rapid-prototyping a networking layer in Perl under LucyX with
> a very conservative API exposure and without hauling in giganto dependencies
> might help with that. :)

Yes!  I don't want to stand in the way of progress.  Prototyping
something that works is a great idea.   I don't have the fear of
dependencies that you do, but if you think it's faster to build
something simple from the ground up rather than using a complex
existing package, have at it!

--nate

Re: [lucy-dev] ClusterSearcher

Posted by Dan <dm...@gmail.com>.
> MessagePack might be worth a look. See http://msgpack.org/

+1  This looks very interesting and seems to have legs.

-Dan

Re: [lucy-dev] ClusterSearcher

Posted by Nick Wellnhofer <we...@aevum.de>.
On 07/11/11 22:50, Marvin Humphrey wrote:
> Further investigation seems warranted.  It would sure be nice if we could lower
> our costs for developing and maintaining serialization routines.

MessagePack might be worth a look. See http://msgpack.org/

Nick


Re: [lucy-dev] ClusterSearcher

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, Nov 06, 2011 at 08:39:51PM -0800, Dan Markham wrote:
> ZeroMQ and Google's Protocol Buffers are both looking great for building a
> distributed search solution.

The idea of normalizing our current ad-hoc serialization mechanism using
Google Protocol Buffers seems interesting, though it looks like it might be a
lot of work and messy besides.

First, Protocol Buffers doesn't support C -- only C++, Python and Java -- so
we'd have to write our own custom plugin.  Dunno how hard that is.

Second, the Protocol Buffers compiler is a heavy dependency -- too big to
bundle.  We'd have to capture the generated source files in version control.
That's theoretically doable -- it's how we're handling the Flex file which is
part of the Clownfish compiler -- but that one Flex file isn't likely to
change much from here on out, whereas developing serialization routines is an
ongoing task.

Further investigation seems warranted.  It would sure be nice if we could lower
our costs for developing and maintaining serialization routines.

As for ZeroMQ, it's LGPL which pretty much rules it out for us -- nothing
released under the Apache License 2.0 can have a required LGPL dependency.

In contrast, the libev license looks compatible:

    http://cvs.schmorp.de/libev/LICENSE?view=markup

Any networking layer that is going to require a dependency like libev should
be released separately from Lucy, though.

> Regardless of the path we go for building / shipping clustered search
> solution.  I'm mostly interested in the api's to the lower level lucy that
> make it possible and how to make them better.

Well, my main concern, naturally, is the potential burden of exposing low-level
internals as public APIs, constraining future Lucy core development.

If we actually had a working networking layer, we'd have a better idea about
what sort of APIs we'd need to expose in order to facilitate alternate
implementations.  Rapid-prototyping a networking layer in Perl under LucyX with
a very conservative API exposure and without hauling in giganto dependencies
might help with that. :)

> I'm sure few will have my exact use-cases so flexibility in the core Lucy is
> key for me.

I'm not convinced that we will be unable to meet those needs. :)

> 1. Keeping the snapshot around long enough for a searcher to come back and ask for doc_ids.
>        Our index moves quickly (real-time), many docs/segments a second.
>        This is mainly an issue because we insist on reopening the index
>        for every write to maintain a real-time feel.

This can be achieved if we mod Lucy to enable deletion policies that leave
obsolete snapshots around for some amount of time.

> 2. Replication.
>     One copy never seems to be enough (boxes crash,networking,high-load you
>     name it) so replication of data to other boxes and keeping the
>     perception of real time is always a challenge for us.  
 
> I'm sure once we flesh out the plan... we'll have lots of fun things to chat
> about: deletes/sort-caches/TF-IDF cache.

No doubt. :)
 
> I'm all for any api's that help in replication, maintaining indexes in
> distributed setups.

We'll certainly need this eventually, but I think that we can get distributed
search functionality working first and then follow up with the indexing layer
later.

Marvin Humphrey


[lucy-dev] Re: [lucy-user] ClusterSearcher

Posted by Dan Markham <dm...@gmail.com>.
I'm so looking forward to this discussion. 

We have built a closed-source multi-master system with replication.  Currently we are not using a PolySearcher to query more than one server.  Each box has a full index copy.  What we have done is multiplex queries within an index on a single box by using different processes per segment (with help from Marvin).  It's been pretty slick if your index isn't larger than RAM and you have a few spare CPU cores to spread the work out.
 
ZeroMQ and Google's Protocol Buffers are both looking great for building a distributed search solution. 

Regardless of the path we take for building/shipping a clustered search solution,
I'm mostly interested in the APIs to the lower-level Lucy that make it possible and how to make them better. I'm sure few will have my exact use-cases, so flexibility in the core Lucy is key for me.

Challenges we have seen in regards to distributed search. 
1. Keeping the snapshot around long enough for a searcher to come back and ask for doc_ids.
       Our index moves quickly (real-time), many docs/segments a second. 
       This is mainly an issue because we insist on reopening the index for every write to maintain a real-time feel.
2. Replication.
    One copy never seems to be enough (boxes crash,networking,high-load you name it) so replication of 
    data to other boxes and keeping the perception of real time is always a challenge for us.  

I'm sure once we flesh out the plan... we'll have lots of fun things to chat about: deletes/sort-caches/TF-IDF cache.


I'm all for any api's that help in replication, maintaining indexes in distributed setups.


-Dan


> For Lucy as a whole, I think there are some meta-questions that should
> be resolved before we go down this path.
> 
> 1) How core is this to Lucy's functionality?
> 2) How much should we depend on outside libraries?
> 3) How independent should the Searcher and the Clients be?
> 4) How future-proof and scalable do we want this solution to be?
> 
> My position would be that while search clusters are essential to Lucy,
> our core competency is fast search rather than reliable networking,
> and thus we should use well-tested external libraries rather than
> expanding our scope.  I think the remote Clients and the central
> Searcher should be essentially independent of each other and of this
> networking layer.   And I think that we should aim to make it scale to
> the moon.
> 
> Fleshing this out a little bit, I think we should prefer libev in C
> over IO::Select in Perl, and that  that we should prefer something
> high level like ZeroMQ over dealing with libev.  I think we should
> have a well defined query and response format using something like
> Google's Protocol Buffers rather than serializing objects directly.  I
> think a good goal would be allowing Lucene with a wrapper to act as a
> Client.
> 
> Marvin: could you offer a high-level overview of how cluster search
> would work ideally, with particular emphasis on what gets passed over
> the wire and what out-of-band coordination is needed between Searcher
> and Clients?
> 
> --nate


[lucy-dev] Re: [lucy-user] ClusterSearcher

Posted by Nathan Kurz <na...@verse.com>.
On Sat, Nov 5, 2011 at 11:23 AM, Marvin Humphrey <ma...@rectangular.com> wrote:
> The most straightforward solution is to eliminate PolySearcher from the
> equation and to create a class that combines the functionality of PolySearcher
> and SearchClient.  Fortunately, neither of them is particularly large or
> complex, so the task is very doable.
>
> I propose that we name this new class LucyX::Remote::ClusterSearcher.

For Goran who is trying to get something working quickly in Perl, this
seems like a great solution.  Get to it!

For Lucy as a whole, I think there are some meta-questions that should
be resolved before we go down this path.

1) How core is this to Lucy's functionality?
2) How much should we depend on outside libraries?
3) How independent should the Searcher and the Clients be?
4) How future-proof and scalable do we want this solution to be?

My position would be that while search clusters are essential to Lucy,
our core competency is fast search rather than reliable networking,
and thus we should use well-tested external libraries rather than
expanding our scope.  I think the remote Clients and the central
Searcher should be essentially independent of each other and of this
networking layer.   And I think that we should aim to make it scale to
the moon.

Fleshing this out a little bit, I think we should prefer libev in C
over IO::Select in Perl, and that  that we should prefer something
high level like ZeroMQ over dealing with libev.  I think we should
have a well defined query and response format using something like
Google's Protocol Buffers rather than serializing objects directly.  I
think a good goal would be allowing Lucene with a wrapper to act as a
Client.

Marvin: could you offer a high-level overview of how cluster search
would work ideally, with particular emphasis on what gets passed over
the wire and what out-of-band coordination is needed between Searcher
and Clients?

--nate

Re: [lucy-user] ClusterSearcher

Posted by Nathan Kurz <na...@verse.com>.
On Sat, Nov 5, 2011 at 11:23 AM, Marvin Humphrey <ma...@rectangular.com> wrote:
> The most straightforward solution is to eliminate PolySearcher from the
> equation and to create a class that combines the functionality of PolySearcher
> and SearchClient.  Fortunately, neither of them is particularly large or
> complex, so the task is very doable.
>
> I propose that we name this new class LucyX::Remote::ClusterSearcher.

For Goran who is trying to get something working quickly in Perl, this
seems like a great solution.  Get to it!

For Lucy as a whole, I think there are some meta-questions that should
be resolved before we go down this path.

1) How core is this to Lucy's functionality?
2) How much should we depend on outside libraries?
3) How independent should the Searcher and the Clients be?
4) How future-proof and scalable do we want this solution to be?

My position would be that while search clusters are essential to Lucy,
our core competency is fast search rather than reliable networking,
and thus we should use well-tested external libraries rather than
expanding our scope.  I think the remote Clients and the central
Searcher should be essentially independent of each other and of this
networking layer.  And I think that we should aim to make it scale to
the moon.

Fleshing this out a little bit, I think we should prefer libev in C
over IO::Select in Perl, and that we should prefer something
high-level like ZeroMQ over dealing with libev directly.  I think we should
have a well defined query and response format using something like
Google's Protocol Buffers rather than serializing objects directly.  I
think a good goal would be allowing Lucene with a wrapper to act as a
Client.
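
To make the wire-format idea concrete, here is a purely hypothetical
Protocol Buffers sketch -- none of these message or field names exist in
Lucy; they just illustrate what a well-defined, language-neutral
query/response format might look like, which is what would let a wrapped
Lucene node act as a Client:

```proto
// Hypothetical sketch only -- not part of Lucy.

package lucy.remote;

message SearchRequest {
  required string query      = 1;  // query, serialized in an agreed format
  optional uint32 num_wanted = 2 [default = 10];
  optional string sort_spec  = 3;
}

message HitDoc {
  required uint32 doc_id = 1;  // doc id local to the responding node
  required float  score  = 2;
}

message SearchResponse {
  required uint32 total_hits = 1;
  repeated HitDoc hits       = 2;  // top hits from this node only
}
```

The central Searcher would merge the per-node `SearchResponse` messages
(offsetting each node's local doc ids) rather than deserializing
language-specific objects off the wire.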

Marvin: could you offer a high-level overview of how cluster search
would work ideally, with particular emphasis on what gets passed over
the wire and what out-of-band coordination is needed between Searcher
and Clients?

--nate

Re: [lucy-user] ClusterSearcher

Posted by goran kent <go...@gmail.com>.
On 11/5/11, Marvin Humphrey <ma...@rectangular.com> wrote:
>         // This line triggers a call to the top_docs() subroutine within
>         // SearchClient.pm.  It blocks until top_docs() returns, and thus the
>         // total time to process all remote requests in this loop is the sum
>         // of all child node response times.

/releases held breath

My faith in Lucy is restored :)  I was dreading a response to the
effect that the remote search stuff was immutable and couldn't be
significantly improved.

> To process the searches in parallel, we need a select loop[1].  However,
> PolySearcher can only access SearchClient via the abstract
> Lucy::Search::Searcher interface -- it knows nothing about the socket calls
> that are being made by SearchClient.pm.  PolySearcher would have to pierce
> encapsulation in order to get at those sockets and multiplex the requests.

Yup, sounds like that approach is buggered.

> The most straightforward solution is to eliminate PolySearcher from the
> equation and to create a class that combines the functionality of
> PolySearcher and SearchClient.  Fortunately, neither of them is
> particularly large or complex, so the task is very doable.
>
> I propose that we name this new class LucyX::Remote::ClusterSearcher.
>
>   * Fork SearchClient.pm to ClusterSearcher.pm and t/510-remote.t to
>     t/550-cluster_searcher.t.
>   * Give ClusterSearcher the ability to talk to multiple SearchServers.
>   * Change to a two-stage RPC mechanism:
>     1. Fire off the requests to the individual SearchServers in a "for"
>        loop.
>     2. Gather the responses into an array using a select() loop (powered
>        by an IO::Select object).
>   * Adapt each of the Searcher methods that ClusterSearcher implements to
>     assemble a sensible return value from the array of responses using
>     PolySearcher's techniques.
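
The two-stage RPC in the quoted plan above can be sketched with a toy
example.  This is a minimal illustration, not Lucy code: it uses Python's
stdlib selectors module and socketpair() to stand in for SearchClient's
sockets and the remote SearchServers; the Perl version would use
IO::Select in the same shape.  The point is that total wait time becomes
bounded by the slowest node rather than the sum of all node latencies.

```python
import selectors
import socket

NUM_NODES = 3

# socketpair() stands in for TCP connections to remote SearchServers;
# the "server" end of each pair echoes back a canned reply.
clients = []
servers = []
for i in range(NUM_NODES):
    c, s = socket.socketpair()
    clients.append(c)
    servers.append(s)

# Stage 1: fire off all requests without waiting for any replies.
for i, c in enumerate(clients):
    c.sendall(b"top_docs")
    # Fake server: read the request and reply immediately.
    req = servers[i].recv(1024)
    servers[i].sendall(b"node%d:%s" % (i, req))

# Stage 2: gather the responses with a select loop, handling each
# socket as it becomes readable.
sel = selectors.DefaultSelector()
for i, c in enumerate(clients):
    sel.register(c, selectors.EVENT_READ, data=i)

responses = [None] * NUM_NODES
pending = NUM_NODES
while pending:
    for key, _ in sel.select():
        i = key.data                      # which node answered
        responses[i] = key.fileobj.recv(1024)
        sel.unregister(key.fileobj)
        pending -= 1

print(responses)  # each node's reply, indexed by node
```

A real ClusterSearcher would then merge these per-node responses into a
single TopDocs using PolySearcher's doc-id-offset techniques.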
>
> This won't be the end of our iterating if we want to build a robust
> clustering system, because it doesn't yet address either node
> availability issues or near-real-time updates.  However, it provides
> the functionality that we meant to make available via
> PolySearcher/SearchServer/SearchClient, allowing Goran to evaluate
> whether the system meets his basic requirements, and moves us
> incrementally towards a highly desirable goal: a ClusterSearcher
> object backed by multiple search nodes that is just as easy to use as
> an IndexSearcher backed by one index on one machine.
>
> PS: Goran...I'm under the weather right now, so if you're counting on me to
> code this up, I'm not sure how quickly I'll get to it.

No worries, you take care of yourself and get better.

I would like to thank you for your responsiveness on this list; you
and others have been top notch.  Even though it's a bit disappointing
that the cluster search functionality doesn't work 100% right now, the
level of commitment is gratifying to see, and it's reassuring to know
that this shortcoming will be addressed soon (hopefully before we
launch our service... ;)

I feel vindicated in my decision to move from another library to Lucy.  w00t!