Posted to solr-user@lucene.apache.org by Alexey Verkhovsky <al...@gmail.com> on 2012/02/16 07:30:05 UTC

Using Solr for a rather busy "Yellow Pages"-type index - good idea or not really?

Hi, all,

I'm new here. Used Solr on a couple of projects before, but didn't need to
dive deep into anything until now. These days, I'm doing a spike for a
"yellow pages" type search server with the following technical requirements:

~10 mln listings in the database. A listing has a name, address,
description, coordinates and a number of tags / filtering fields; no more
than a kilobyte all told; i.e. theoretically the whole thing should fit in
RAM without sharding. A typical query is either "all text matches on name
and/or description within a bounding box", or "some combination of tag
matches within a bounding box". Bounding boxes are 1 to 50 km wide, and
contain up to 10^5 unfiltered listings (the average is more like 10^3).
More than 50% of all the listings are in the frequently requested bounding
boxes, however a vast majority of listings are almost never displayed
(because they don't match the other filters).
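
For concreteness, here is roughly the kind of query I have in mind (field
names are made up, and I'm assuming the bbox syntax from the SpatialSearch
wiki page):

    q=name:pizza OR description:pizza
      &fq={!bbox sfield=location pt=45.6789,-123.0123 d=25}
      &fq=tag:(restaurant OR cafe)
      &fl=id,score&rows=20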

Data "never changes" (i.e., a daily batch update; rebuild of the entire
index and restart of all search servers is feasible, as long as it takes
minutes, not hours). This thing ideally should serve up to 10^3 requests
per second on a small (as in, "less than 10 commodity boxes") cluster. In
other words, a typical request should be CPU bound and take ~100-200 msec
to process. Because of coordinates (that are almost never the same),
caching of queries makes no sense; from what little I understand about
Lucene internals, caching of filters probably doesn't make sense either.

After perusing documentation and some googling (but almost no source code
exploring yet), I understand what the schema and the queries will look like,
and now have to figure out a specific configuration that fits the
performance/scalability requirements. Here is what I'm thinking:

1. Search server is an internal service that uses embedded Solr for the
indexing part. RAMDirectoryFactory as index storage.
2. All data is in some sort of persistent storage on a file system, and is
loaded into memory when a search server starts up.
3. Data updates are handled as "update the persistent storage, start
another cluster, load the world into RAM, flip the load balancer, kill the
old cluster"
4. Solr returns IDs with relevance scores; actual presentations of listings
(as JSON documents) are constructed outside of Solr and cached in
Memcached, as a mostly static content with a few templated bits, like
<distance><%=DISTANCE_TO(-123.0123, 45.6789) %>.
5. All Solr caching is switched off.
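
To make item 1 concrete, this is the sort of solrconfig.xml fragment I have
in mind, if RAMDirectoryFactory turns out to be viable (untested sketch):

    <directoryFactory name="DirectoryFactory"
                      class="solr.RAMDirectoryFactory"/>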

Obviously, we are not the first people to do something like this with Solr,
so I'm hoping for some collective wisdom on the following:

Does this sound like a feasible set of requirements in terms of
performance and scalability for Solr? Are we on the right path to solving
this problem well? If not, what should we be doing instead? What nasty
technical/architectural gotchas are we probably missing at this stage?

One particular piece of advice I'd be really happy to hear is "you may not
need RAMDirectoryFactory if you use <some combination of fast distributed
file system and caching> instead".

Also, is there a blog, wiki page or mailing-list thread where a similar
problem is discussed? Yes, we have seen
http://www.ibm.com/developerworks/opensource/library/j-spatial; it's a good
introduction, but it's outdated and doesn't go into the nasty bits anyway.

Many thanks in advance,
-- Alex Verkhovsky

Re: Using Solr for a rather busy "Yellow Pages"-type index - good idea or not really?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Feb 16, 2012 at 4:06 PM, Alexey Verkhovsky
<al...@gmail.com> wrote:
> ... only need ids, scores and total number of results out of Solr. Presentation of
> selected entities will have to include some write-heavy data (from RDBMS
> and/or memcached), therefore won't be Solr's business anyway.

It depends on whether you're going to be doing distributed search - there
may be some scenarios there where it's used, but in general the query
cache is the least useful.
The filterCache is useful in a ton of ways if you're doing faceting too.

-Yonik
lucidimagination.com

Re: Using Solr for a rather busy "Yellow Pages"-type index - good idea or not really?

Posted by Alexey Verkhovsky <al...@gmail.com>.
On Thu, Feb 16, 2012 at 1:32 PM, Yonik Seeley <yo...@lucidimagination.com>wrote:

> You're making many assumptions about how Solr works internally.
>

True that. If this spike turns into a project, digging through the source
code will come. Meantime, we have to start somewhere, and the default
configuration may not be the greatest starting point for this problem.

We don't need highlighting, and only need ids, scores and total number of
results out of Solr. Presentation of selected entities will have to include
some write-heavy data (from RDBMS and/or memcached), therefore won't be
Solr's business anyway.

From what you said, I guess it won't hurt to give it a small document
cache, just big enough to prevent streaming the same document twice within
the same query. Still don't have a reason to have a query cache - because
of lon/lat coming from the mobile devices, there are virtually no repeated
queries in our production logs. Or am I making a bad assumption here, too?
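
Concretely, I'm thinking of something like this in solrconfig.xml (the size
is a wild guess; autowarmCount stays 0, since AFAIK the document cache keys
on internal Lucene ids and can't be autowarmed anyway):

    <documentCache class="solr.LRUCache"
                   size="1024"
                   initialSize="1024"
                   autowarmCount="0"/>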

-- 
Alexey Verkhovsky
http://alex-verkhovsky.blogspot.com/
CruiseControl.rb [http://cruisecontrolrb.thoughtworks.com]

Re: Using Solr for a rather busy "Yellow Pages"-type index - good idea or not really?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Feb 16, 2012 at 3:03 PM, Alexey Verkhovsky
<al...@gmail.com> wrote:
>> 5. All Solr caching is switched off.
>
>> But why?
>>
>
> Because (a) I shouldn't need to cache documents, if they are all in memory
> anyway;

You're making many assumptions about how Solr works internally.

One example of many:
  Solr streams documents (requests the stored fields right before they
are written to the response stream) to support returning any number of
documents.
If you highlight documents, the stored fields need to be retrieved
first.  When streaming those same documents later, Solr will retrieve
the stored fields again - relying on the fact that they should be
cached by the document cache since they were just used.

There are tons of examples of how things are architected to take
advantage of the caches - it pretty much never makes sense to outright
disable them.  If they take up too much memory, then just reduce the
size.

-Yonik
lucidimagination.com

Re: Using Solr for a rather busy "Yellow Pages"-type index - good idea or not really?

Posted by Alexey Verkhovsky <al...@gmail.com>.
On Thu, Feb 16, 2012 at 3:37 AM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Everybody starts with a daily bounce, but ends up with an UPDATED_AT column
> and delta updates; just consider the urgent-content-fix use case. I don't
> think it's worth relying on a daily bounce as a cornerstone of the
> architecture.
>

I'd be happy to avoid it, for all the obvious reasons.

I do know that the performance of this type of service tends to be not that
great (as in "700 to 5000 msec"), and there should be ways to do it several
times faster than that.


> you can snap the coordinates to a grid to reduce their entropy


I don't understand this statement. Can you elaborate, please?

Since my bounding boxes are small, one [premature optimization] idea could
be to divide the Earth into 2x2-degree overlapping tiles at a 1-degree step
in both directions (such that any bounding box fits within at least one of
them, and any location belongs to 4 of them), then use tileId=X as a cached
filter and geofilt as a post-filter. Is that along the lines of what you
are talking about?
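
In query terms, assuming a hypothetical tileId field and the cache/cost
local params from the advanced-filter-caching post quoted below, I imagine
something like:

    fq=tileId:N45W124
      &fq={!geofilt cache=false cost=100 sfield=location pt=45.6789,-123.0123 d=25}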


> <http://yonik.wordpress.com/2012/02/10/advanced-filter-caching-in-solr/>
> > from what little I understand about
> > Lucene internals, caching of filters probably doesn't make sense either.
> But Solr does cache them: http://wiki.apache.org/solr/SolrCaching#filterCache
>

I didn't realize that multiple fq's in the same query were applied in
parallel as set intersections. In that case, the non-geography filters
should be cached (and added to the prewarming routine, I guess) even though
they are usually far less specific than the bounding box. Makes sense.
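
For the prewarming part, I suppose a newSearcher listener along these lines
would do (the filter values are made up):

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">*:*</str><str name="fq">tag:restaurant</str></lst>
        <lst><str name="q">*:*</str><str name="fq">tag:open_late</str></lst>
      </arr>
    </listener>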


> > 1. Search server is an internal service that uses embedded Solr for the
> > indexing part. RAMDirectoryFactory as index storage.
> Bad idea. It's intended mostly for tests; the closest production-oriented
> analogue is org.apache.lucene.store.instantiated.InstantiatedIndex
>
...

> AFAIK the state of the art is to use a file-based directory (MMap or
> whatever) and rely on the Linux file-system cache.
>

OK, I may as well start the spike from this angle, too. By the way, this is
precisely the kind of advice I was hoping for. Thanks a lot.

> > 5. All Solr caching is switched off.

> But why?
>

Because (a) I shouldn't need to cache documents, if they are all in memory
anyway; (b) query caching will have an abysmal hit/miss ratio because of the
spatial component; and (c) I misunderstood how query filters work. So now I'm
thinking of a FastLFU filter cache for the non-geo filters.


> Btw, if you need a multivalued geo field, please vote for SOLR-2155
>
Our data has one lon/lat pair per entity... so no, I don't need it. Or at
least haven't figured out that I do yet. :)

-- 
Alexey Verkhovsky
http://alex-verkhovsky.blogspot.com/
CruiseControl.rb [http://cruisecontrolrb.thoughtworks.com]

Re: Using Solr for a rather busy "Yellow Pages"-type index - good idea or not really?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Please find my replies inline.

On Thu, Feb 16, 2012 at 10:30 AM, Alexey Verkhovsky <
alexey.verkhovsky@gmail.com> wrote:

> Hi, all,
>
> I'm new here. Used Solr on a couple of projects before, but didn't need to
> dive deep into anything until now. These days, I'm doing a spike for a
> "yellow pages" type search server with the following technical
> requirements:
>
> ~10 mln listings in the database. A listing has a name, address,
> description, coordinates and a number of tags / filtering fields; no more
> than a kilobyte all told; i.e. theoretically the whole thing should fit in
> RAM without sharding. A typical query is either "all text matches on name
> and/or description within a bounding box", or "some combination of tag
> matches within a bounding box". Bounding boxes are 1 to 50 km wide, and
> contain up to 10^5 unfiltered listings (the average is more like 10^3).
> More than 50% of all the listings are in the frequently requested bounding
> boxes, however a vast majority of listings are almost never displayed
> (because they don't match the other filters).
>
> Data "never changes" (i.e., a daily batch update; rebuild of the entire
> index and restart of all search servers is feasible, as long as it takes
> minutes, not hours).

Everybody starts with a daily bounce, but ends up with an UPDATED_AT column
and delta updates; just consider the urgent-content-fix use case. I don't
think it's worth relying on a daily bounce as a cornerstone of the
architecture.


> This thing ideally should serve up to 10^3 requests
> per second on a small (as in, "less than 10 commodity boxes") cluster. In
> other words, a typical request should be CPU bound and take ~100-200 msec
> to process. Because of coordinates (that are almost never the same),
> caching of queries makes no sense;

you can snap the coordinates to a grid to reduce their entropy; if you filter
by a bounding box, the argument is the bounding box, not the raw coordinates.
Anyway, use post-filtering and cache=false for such filters:
http://yonik.wordpress.com/2012/02/10/advanced-filter-caching-in-solr/
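
E.g. (the field name and grid are just for illustration): either snap the
point so the filter string repeats and can actually hit the filterCache (pad
d to cover the snapping error), or mark the precise filter as a non-cached
post-filter:

    fq={!bbox sfield=location pt=45.68,-123.01 d=26}
    fq={!geofilt cache=false cost=100 sfield=location pt=45.6789,-123.0123 d=25}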


> from what little I understand about
> Lucene internals, caching of filters probably doesn't make sense either.
>
But Solr does cache them: http://wiki.apache.org/solr/SolrCaching#filterCache
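
E.g. the stock configuration from the example solrconfig.xml (tune size and
autowarmCount to taste):

    <filterCache class="solr.FastLRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="0"/>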

>
> After perusing documentation and some googling (but almost no source code
> exploring yet), I understand what the schema and the queries will look like,
> and now have to figure out a specific configuration that fits the
> performance/scalability requirements. Here is what I'm thinking:
>
> 1. Search server is an internal service that uses embedded Solr for the
> indexing part. RAMDirectoryFactory as index storage.
>
Bad idea. It's intended mostly for tests; the closest production-oriented
analogue is org.apache.lucene.store.instantiated.InstantiatedIndex


> 2. All data is in some sort of persistent storage on a file system, and is
> loaded into memory when a search server starts up.
>
AFAIK the state of the art is to use a file-based directory (MMap or
whatever) and rely on the Linux file-system cache. Also, Solr (and partially
Lucene) caches some stuff on the heap itself:
http://wiki.apache.org/solr/SolrCaching#Types_of_Caches_and_Example_Configuration.
So this is mostly done already.
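
E.g., assuming your Solr version ships solr.MMapDirectoryFactory (otherwise
the default directory factory already picks a sensible implementation for
the platform):

    <directoryFactory name="DirectoryFactory"
                      class="solr.MMapDirectoryFactory"/>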


> 3. Data updates are handled as "update the persistent storage, start
> another cluster, load the world into RAM, flip the load balancer, kill the
> old cluster"
>
No again. Lucene has a pretty cool model of segments and generations designed
for incremental updates, and Solr does a lot to search the old generation and
warm up the new one simultaneously (it just takes some memory, you know - two
times). I don't think a manual A/B scheme is applicable. Anyway, you can (but
don't really need to) play around with the replication facilities, e.g.
disable traffic to half of the nodes, push the new index to them, let them
warm up, and re-enable traffic (such machinery never works smoothly, due to
the number of moving parts)


> 4. Solr returns IDs with relevance scores; actual presentations of listings
> (as JSON documents) are constructed outside of Solr and cached in
> Memcached, as a mostly static content with a few templated bits, like
> <distance><%=DISTANCE_TO(-123.0123, 45.6789) %>.
>
Using separate nodes to do the search and other nodes to stream the content
sounds good (it's mentioned in every book). It also looks like, besides the
score, you can return the distance to the user, i.e. there's no need for
<%=DISTANCE_TO(-123.0123, 45.6789) %>, just <%= doc.DISTANCE %>; see
http://wiki.apache.org/solr/SpatialSearch?#Returning_the_distance
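
E.g. one way, per that page, is to make the distance be the score (the
location field name is just for illustration):

    q={!func}geodist()
      &sfield=location&pt=45.6789,-123.0123
      &fq={!bbox sfield=location pt=45.6789,-123.0123 d=25}
      &sort=score asc&fl=id,score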



> 5. All Solr caching is switched off.
>
But why?



>
> Obviously, we are not the first people to do something like this with Solr,
> so I'm hoping for some collective wisdom on the following:
>
> Does this sound like a feasible set of requirements in terms of
> performance and scalability for Solr? Are we on the right path to solving
> this problem well? If not, what should we be doing instead? What nasty
> technical/architectural gotchas are we probably missing at this stage?
>
> One particular piece of advice I'd be really happy to hear is "you may not
> need RAMDirectoryFactory if you use <some combination of fast distributed
> file system and caching> instead".
>
> Also, is there a blog, wiki page or mailing-list thread where a similar
> problem is discussed? Yes, we have seen
> http://www.ibm.com/developerworks/opensource/library/j-spatial; it's a good
> introduction, but it's outdated and doesn't go into the nasty bits anyway.
>
Btw, if you need a multivalued geo field, please vote for SOLR-2155


> Many thanks in advance,
> -- Alex Verkhovsky
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>