Posted to dev@lucy.apache.org by Dan <dm...@gmail.com> on 2011/11/26 09:45:05 UTC

[lucy-dev] Re: ClusterSearcher

I'm itching to test out some of these ideas with a dirty prototype.
So I have started a GitHub project and plan on hacking something together
in a slightly different style than LucyX/Remote/*.  If nothing more, it's
just to get my head around all the challenges and provoke thought. I'm
also pulling in half of CPAN (ZeroMQ, MessagePack, EV), so even if this
beats all expectations, getting anything more than the ideas back into
core will be a pain.

https://github.com/dmarkham/lucy_cluster
What code I have done so far is on a branch, not master.

I would love some hecklers

-Dan

Re: [lucy-dev] ClusterSearcher

Posted by Nathan Kurz <na...@verse.com>.
On Sun, Nov 27, 2011 at 1:45 AM, Dan Markham <dm...@gmail.com> wrote:
> The best way to describe what I plan to do and currently do: 50% name/value key pairs and 50% full-text search.

That sounds close to my intended use.  I'm particularly interested in
hybrids between recommendations, full text search, and filtered
categories.  I'd be searching within a particular domain (such as
movies, music, academic papers) and want to return results that meet
the search criteria but are ordered in a highly personalized way.

>> How fast is it
>> changing?
> I'm thinking the average number of changes will be ~15 a second.

OK, so manageable on the changes, but you want to make sure that
updates are immediately viewable by the updater.


> What's a ballpark for the searches
>> per second you'd like to handle?
>
> 1k/second (name/value-style searches) with the 98th percentile search under 30ms.
> 1k/second (full text with nasty OR queries over large posting files) with the 98th percentile search under 300ms.

Do you really require such a low latency on the simple searches, or
are you basing this on back-calculating from requests per second?  I
think you'd get better throughput (requests/sec) if you could relax
this requirement and allow for some serialized access to nodes.

>>  Do the shards fit in memory?
> Yes and no...
> Some servers with low query requirements will be overloaded to disk.
> High-profile indexes with low search SLAs: yes.

The hope is that the mmap() approach should degrade gracefully up to a
point, so this should work as long as loads truly are light.  It also
depends on how long the tail is: only short posting lists are going to
be small enough to be read from disk in the amount of time you are
talking about.

> I'm thinking you're talking about the top_docs call getting to use a hinted low watermark in its priority queue?

Yes, I'm thinking that a "low watermark" would help to establish some
known minimum, as well as a "high watermark" to allow for offsets in
results.  This would help when trying to get (for example) results
1000-1100 without having to save or send the top 1100 from each node.
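
To make that concrete, here is a rough sketch in plain Perl of the merge
the central searcher could do once each node has been told to return only
hits whose scores fall between the watermarks.  None of this is LucyX
API; the function and parameter names are invented:

    use strict;
    use warnings;

    # $per_node_hits: one arrayref per node of [ $doc_id, $score ] pairs.
    # $offset is whatever offset remains after hits above the high
    # watermark have already been accounted for.
    sub merge_with_watermarks {
        my ( $per_node_hits, $low, $high, $offset, $num_wanted ) = @_;
        my @candidates;
        for my $hits (@$per_node_hits) {
            push @candidates, grep { $_->[1] > $low && $_->[1] <= $high } @$hits;
        }
        @candidates = sort { $b->[1] <=> $a->[1] } @candidates;
        my $last = $offset + $num_wanted - 1;
        $last = $#candidates if $last > $#candidates;
        return [ @candidates[ $offset .. $last ] ];
    }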

In addition, one could save network traffic by adding a roundtrip
between the central searcher and the nodes: the high and low watermarks
are returned first, and the central searcher then requests the details
after the preliminary responses are tabulated.
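
Here is a toy version of that exchange, just to show the shape of it.
The node names and scores are made up, and the real thing would go over
the wire rather than sit in one hash:

    use strict;
    use warnings;
    use List::Util qw( max );

    # Round 1: each node reports only its best and Nth-best scores.
    my %round1 = (
        node_a => { best => 9.1, nth_best => 4.2 },
        node_b => { best => 7.8, nth_best => 5.0 },
    );

    # Every node already holds N hits at or above its own Nth-best score,
    # so the global Nth-best score is at least the max of those values.
    my $cutoff = max( map { $_->{nth_best} } values %round1 );

    # Round 2: ask each node only for hits scoring at or above the cutoff.
    printf "ask %s for hits with score >= %.1f\n", $_, $cutoff
        for sort keys %round1;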

I also reached the conclusion at some point that the "score" returned
should be allowed to include a string, so that results can be arranged
alphabetically based on the value of a given field in all matching
records:  FULLTEXT MATCHES "X" SORT BY FIELD "Y".
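
For the single-searcher case Lucy can already order matches by a field via
SortSpec/SortRule; the distributed twist would be shipping the field value
back in place of (or alongside) the numeric score.  A sketch, with a
made-up index path and field names, and assuming the field is marked
sortable in the schema:

    use strict;
    use warnings;
    use Lucy::Search::IndexSearcher;
    use Lucy::Search::SortSpec;
    use Lucy::Search::SortRule;

    my $searcher  = Lucy::Search::IndexSearcher->new( index => '/path/to/index' );
    my $sort_spec = Lucy::Search::SortSpec->new(
        rules => [ Lucy::Search::SortRule->new( field => 'y' ) ],
    );
    my $hits = $searcher->hits(
        query     => 'x',
        sort_spec => $sort_spec,
    );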

>> how to handle distributed TF/IDF...
>>
>
> This is *easy* to solve on a per-index basis with insider knowledge about the index and how it's segmented. Doing it perfectly for everyone, and fast, sounds hard. Spreading out the cost of caching/updating the TF/IDF is, I think, key.
> I like the idea of sampling a node or two to get the cache started (to service the search) and then finishing the cache out of band to get a better, more complete picture. Unless you're adding/updating an index with an all-new term mix quickly, I don't think the TF/IDF cache needs to move quickly.

I'm strongly (perhaps even belligerently?) of the opinion that the
specific values for TF need to be part of the query that is sent to
the nodes, rather than something local to each node's scoring
mechanism.  Term frequency should be converted to a weight for each
clause of the query, and that weight (boost) should be used by the
scorer.  This boost can be equivalent to local, global, or approximate
term frequency as desired, but get it out of the core!
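
In other words, something along these lines, where the %weight map stands
in for whatever doc-freq statistics the central searcher trusts (the field
name and numbers are made up; note that stock Lucy would still fold its
node-local IDF into the score, which is exactly the part I'm arguing
should come out of core):

    use strict;
    use warnings;
    use Lucy::Search::TermQuery;
    use Lucy::Search::ORQuery;

    my %weight = ( perl => 0.4, lucy => 2.3 );

    # Bake the weights into the query as per-clause boosts so the nodes'
    # scorers never have to consult their own term frequencies.
    my @clauses;
    for my $term ( keys %weight ) {
        my $tq = Lucy::Search::TermQuery->new( field => 'content', term => $term );
        $tq->set_boost( $weight{$term} );
        push @clauses, $tq;
    }
    my $query = Lucy::Search::ORQuery->new( children => \@clauses );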

With this in mind, if you truly need an exact term frequency, and are
unable to assume that any single index is a reasonable approximation
for the entire corpus, I think the only solution is to have a standalone
TF index.  These will be small enough that they should be easy to
manage per node if necessary.  Every few seconds the TF updates are
broadcast from the indexing machine and replicated as necessary.
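
Since Dan already has MessagePack in the mix, the broadcast payload could
be as dumb as a packed term-to-doc-freq hash.  This is only a sketch of
that shape (numbers made up, transport left out):

    use strict;
    use warnings;
    use Data::MessagePack;

    my $mp       = Data::MessagePack->new();
    my %doc_freq = ( perl => 120_000, lucy => 4_200 );
    my $payload  = $mp->pack( \%doc_freq );   # ship via ZeroMQ pub/sub, HTTP, whatever

    # ... on a search node, after the payload arrives:
    my %cache;
    my $fresh = $mp->unpack($payload);
    $cache{$_} = $fresh->{$_} for keys %$fresh;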


> So the way I fixed the 100-shard problem (in my head) was to build a pyramid of MultiSearchers. This doesn't really work either, and I think it now makes things worse.

My instinct is that this would not be a good architecture.  While
the recursive approach is superficially appealing (and very OO),
I think it would be a performance nightmare. I may be teaching my
grandmother to suck eggs, but I'm pretty sure that the limiting factor
for full text search at the speeds you desire is going to be memory
bandwidth:  how fast can you sift through a bunch of RAM?

I'd love if someone can check my numbers, but my impression is that
current servers can plow through on the order of 10 GB/second from RAM,
which means that each core can do several integer ops per multi-byte
position read.  Multiply by 4, 8, or 16 cores, and we are quickly
memory bound.   I'm not sure where the crossover is, but adding more
shards per node is quickly going to hurt rather than help.
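
Back-of-envelope, with assumed (not measured) numbers:

    use strict;
    use warnings;

    my $bytes_per_sec  = 10 * 2**30;   # ~10 GB/s of RAM bandwidth per box (assumed)
    my $cores          = 8;
    my $bytes_per_post = 4;            # say ~4 bytes per position/posting read
    my $posts_per_core = $bytes_per_sec / $cores / $bytes_per_post;
    printf "~%d million postings/sec/core before RAM is the wall\n",
        $posts_per_core / 1e6;
    # That's ~335 million/sec/core; a ~3 GHz core gets roughly 9 cycles
    # per posting at that rate, i.e. only a handful of integer ops each.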

> How do I generate the low watermark we pass to nodes without getting data back from one node?

I think you've given the best and possibly only answer to your
question:  make another round trip.  Or for maximum performance, make
multiple round trips.  Do you really need a response in 30 ms?

--nate

Re: [lucy-dev] ClusterSearcher

Posted by Dan Markham <dm...@gmail.com>.
On Nov 26, 2011, at 1:48 AM, Nathan Kurz wrote:

> Dan --
> 
> I took a glance.  Sounds promising.  Could you talk a bit about the use
> case you are anticipating?  
> What are you indexing?  

The best way to describe what I plan to do and currently do: 50% name/value key pairs and 50% full-text search.
Both have rather large documents and lots of sort fields. I truly do abuse the crap out of KinoSearch currently; highlighting is about the only thing I don't use heavily.

> How fast is it
> changing?
I'm thinking the average number of changes will be ~15 a second.
During bulkier changes... I hope for much faster.


>  Do the shards fit in memory?  
Yes and no...
Some servers with low query requirements will be overloaded to disk.
High-profile indexes with low search SLAs: yes.


> What's a ballpark for the searches
> per second you'd like to handle?

1k/second (name/value-style searches) with the 98th percentile search under 30ms.
1k/second (full text with nasty OR queries over large posting files) with the 98th percentile search under 300ms.
 
> My first thought is that you may be able to trade off some latency for
> increased throughput by sticking with partially serialized requests if you
> were able to pass a threshold score along to each node/shard so you could
> speed past low scoring results.

More detail!
I'm thinking you're talking about the top_docs call getting to use a hinted low watermark in its priority queue?

If so: I was chatting with Marvin about this the other day, and I was scared of creating a cluster with 100 nodes.
One reason was the sheer number of docs I would need to push over the network with num_wanted => 10, offset => 200.

The thing that killed the idea for me...
How do I generate the low watermark we pass to nodes without getting data back from one node?

So the way I fixed the 100-shard problem (in my head) was to build a pyramid of MultiSearchers. This doesn't really work either, and I think it now makes things worse. I'm thinking that by the time I start worrying about 100+ nodes, sampling and early termination will be a must.

Another crazy idea while I have you this far off track: top_docs requests go into a "pool". All 100 nodes try to run the queries in the pool, and each places the score of its lowest-scoring doc into the response pool for that node. The top_docs query submitter can then decide how long to wait for responses and how many responses to wait for, and knows which nodes it will need full top_docs from.




So what I'm doing in lucy_cluster is not trying to solve the 100-node issue just yet; I'm keeping the number of nodes small (< 10), but at the same time keeping the number of shards at around 30. That's mainly so I can rebalance nodes by just moving shards, and so nodes can search more than one shard locally with a multi-searcher.
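
Roughly like this on each node (paths made up; in lucy_cluster the shard
list would come from whatever the rebalancer hands the node):

    use strict;
    use warnings;
    use Lucy::Search::IndexSearcher;
    use Lucy::Search::PolySearcher;

    # One node, several local shards, exposed as a single searcher.
    my @shard_searchers = map { Lucy::Search::IndexSearcher->new( index => $_ ) }
                          glob('/srv/shards/node1/*');
    my $searcher = Lucy::Search::PolySearcher->new(
        schema    => $shard_searchers[0]->get_schema,
        searchers => \@shard_searchers,
    );
    my $hits = $searcher->hits( query => 'foo', num_wanted => 10 );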

>  But this brings up Marvin's points about
> how to handle distributed TF/IDF...
> 

This is *easy* to solve on a per-index basis with insider knowledge about the index and how it's segmented. Doing it perfectly for everyone, and fast, sounds hard. Spreading out the cost of caching/updating the TF/IDF is, I think, key.
I like the idea of sampling a node or two to get the cache started (to service the search) and then finishing the cache out of band to get a better, more complete picture. Unless you're adding/updating an index with an all-new term mix quickly, I don't think the TF/IDF cache needs to move quickly.


-Dan


Re: [lucy-dev] Re: ClusterSearcher

Posted by Nathan Kurz <na...@verse.com>.
Dan --

I took a glance.  Sounds promising.  Could you talk a bit about the use
case you are anticipating?  What are you indexing?  How fast is it
changing?  Do the shards fit in memory?  What's a ballpark for the searches
per second you'd like to handle?

My first thought is that you may be able to trade off some latency for
increased throughput by sticking with partially serialized requests if you
were able to pass a threshold score along to each node/shard so you could
speed past low scoring results.  But this brings up Marvin's points about
how to handle distributed TF/IDF...

--nate