Posted to solr-user@lucene.apache.org by samabhiK <qe...@gmail.com> on 2012/07/15 01:33:48 UTC

Solr - Spatial Search for Specific Areas on Map

Hi,

I am new to Solr spatial search and would like to understand whether Solr can
be used successfully for very large data sets, in the range of 4 billion
records. I need to search some filtered data based on a region - maybe a set
of lat/lons or a polygon area. Is that possible in Solr? How fast is it at
such a data size? Will it be able to handle a load of 10,000 req/sec? If so,
how? Do you think Solr can beat the performance of PostGIS? As I am about to
choose the right technology for my new project, I need some expert comments
from the community.

Regards
Sam

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Spatial-Search-for-Specif-Areas-on-Map-tp3995051.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr - Spatial Search for Specific Areas on Map

Posted by "David Smiley (@MITRE.org)" <DS...@mitre.org>.
Thinking more about this, the way to get a Lucene-based system to scale to
the maximum extent possible for geospatial queries would be to get a
geospatial query satisfied by just one (usually) Lucene index segment.
It would take quite a bit of customization and work to make this happen.  I
suppose you could always optimize a Solr index and thus get one Lucene
segment, but deploy 10-20x the number of Solr shards (aka "Solr cores") that
one normally would, and that wouldn't be that hard.  There would be some
work in determining which Solr core (== Lucene segment) a given document
should belong to and which ones to query.
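A minimal sketch of that routing layer might look like the following. To be clear, this is not code from any Solr release; the region names, core names, and region-extraction logic are all invented for illustration:

```python
# Hypothetical sketch: route each document to a region-specific Solr core
# so that a region-constrained query only has to touch one core (== one
# Lucene segment, after an optimize). All names here are made up.

REGION_CORES = {
    "us-east": "core_us_east",
    "us-west": "core_us_west",
    "eu": "core_eu",
}

def core_for_document(region: str) -> str:
    """Pick the Solr core a document belongs to, based on its region tag."""
    try:
        return REGION_CORES[region]
    except KeyError:
        raise ValueError(f"no core configured for region {region!r}")

def cores_for_query(regions) -> list:
    """A query constrained to one region hits one core; a query that
    straddles a region boundary hits two (the worst case David mentions)."""
    return sorted({REGION_CORES[r] for r in regions})
```

The indexing side calls `core_for_document` per document; the query side calls `cores_for_query` with the region(s) the search covers and fans out only to those cores.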

~ David

-----
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book

Re: Solr - Spatial Search for Specific Areas on Map

Posted by "David Smiley (@MITRE.org)" <DS...@mitre.org>.
samabhiK wrote
> 
> David,
> 
> Thanks for such a detailed response. The data volume I mentioned is the
> total set of records we have - but we would never ever need to search the
> entire base in one query; we would divide the data by region or zip code.
> So, in that case I assume that for a single region, we would not have more
> than 200M records (this is real, we have a region with that many
> records).
> 
> So, I can assume that I can create shards based on regions and the
> requests would get distributed among these region servers, right?
> 

The fact that your searches are always per region (or almost always) helps
things a lot.  Instead of doing a distributed search to all shards, you
would search the specific shard, or worst case 2 shards, and not burden the
other shards with queries you know won't be satisfied.  This new information
suggests that the total 10k queries per second volume would be divided
amongst your shards, so 10k / 40 shards = 250 queries per second.  Now we
are approaching something reasonable.  If any of your regions needs to scale
up (more query volume) or out (big region), then you can do that on a
case-by-case basis.  I can think of ways to optimize that for spatial.

Thinking in terms of pure queries per second on a machine, say one with 16
CPU cores, then 250/16 = ~16 queries per second per CPU core of a
shard.  I think that's plausible, but you would really need to determine how
many exactly you could do.  I assume the spatial index is going to fit in
RAM.  If successful, this means ~40 machines (one per region).
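The arithmetic above can be checked directly. All of the inputs are the rough estimates from this thread, not benchmarks:

```python
# Back-of-the-envelope check of the figures above: the projected query
# volume split evenly across region shards, then across CPU cores.
total_qps = 10_000        # Sam's projected query volume
shards = 40               # one shard per region
cpu_cores_per_machine = 16

qps_per_shard = total_qps / shards                       # 250.0
qps_per_cpu_core = qps_per_shard / cpu_cores_per_machine # ~15.6

print(qps_per_shard, round(qps_per_cpu_core, 1))
```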



>  You also mentioned ~20 concurrent queries per shard - do you have
> links to some benchmarks? I am very interested to know about the hardware
> sizing details for such a setup.
> 

The best I can offer is on the geospatial side: 
https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=12988316&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12988316

But this was an index of "only" 2M distinct points.  It may be that these
figures still hold if the overhead of the spatial part of the query is so low
that other constant elements dominate the times, but I really don't know.
To be clear, this is older code that is not the same as the latest, but they
are algorithmically the same.  The current code applies an error epsilon to
the query shape, which helps it scale further.  There is plenty more
optimization that could be done, like a more efficient binary grid scheme,
using Hilbert curves, and using an optimizer to find the hotspots and try to
optimize them.



> About setting up Solr for a single shard, I think I will go by your
> advice.  Will see how much a single shard can handle on a decent machine
> :)
> 
> The reason why I came up with that figure was, I have a user base of 500k
> and there's a lot of activity which would happen on the map - every time
> someone moves the tiles, zooms in/out, or scrolls, we are going to send a
> server-side request to fetch some data (I agree we can benefit much from
> caching, but I believe Solr itself has its own local cache). I might be a
> bit unrealistic with my 10K rps projections, but I have read about 9K rps
> to map servers from some sources on the internet.
> 
> And, NO, I don't work for Google :) But who knows we might be building
> something that can get so much traffic to us in a while. :D
> 
> BTW, my question still remains - can we do search on polygonal areas on
> the map? If so, do you have any link where I can get more details? The
> bounding box approach won't work for me, I guess :(
> 
> Sam
> 

Polygons are supported; I've been doing them for years now.  But it requires
some extensions.  Today, you need the latest Solr trunk, you need to
apply the Solr adapters for Lucene 4 spatial (SOLR-3304), and you need to
have the JTS jar on your classpath, something you download separately.  BTW,
here are some basic docs:
http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4
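Following those docs, a polygon filter is expressed as a WKT shape inside an `Intersects` predicate on the spatial field. A sketch of building such a request, where the field name `geo`, the host, and the polygon coordinates are all assumptions for illustration:

```python
# Sketch of a Solr polygon filter query using WKT, per the wiki page
# above. Field name "geo", host, and coordinates are assumptions.
from urllib.parse import urlencode

# WKT polygons list lon/lat vertex pairs; the ring must close on itself.
polygon_wkt = "POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"

params = {
    "q": "*:*",
    "fq": f'geo:"Intersects({polygon_wkt})"',  # spatial filter query
    "rows": 10,
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

The resulting request asks Solr for any 10 documents whose indexed shape intersects the polygon; with JTS on the classpath, the WKT is parsed into a real polygon rather than a bounding box.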




Re: Solr - Spatial Search for Specific Areas on Map

Posted by samabhiK <qe...@gmail.com>.
David,

Thanks for such a detailed response. The data volume I mentioned is the
total set of records we have - but we would never ever need to search the
entire base in one query; we would divide the data by region or zip code.
So, in that case I assume that for a single region, we would not have more
than 200M records (this is real, we have a region with that many records).

So, I can assume that I can create shards based on regions and the requests
would get distributed among these region servers, right? You also mentioned
~20 concurrent queries per shard - do you have links to some
benchmarks? I am very interested to know about the hardware sizing details
for such a setup.

About setting up Solr for a single shard, I think I will go by your advice. 
Will see how much a single shard can handle on a decent machine :)

The reason why I came up with that figure was, I have a user base of 500k
and there's a lot of activity which would happen on the map - every time
someone moves the tiles, zooms in/out, or scrolls, we are going to send a
server-side request to fetch some data (I agree we can benefit much from
caching, but I believe Solr itself has its own local cache). I might be a bit
unrealistic with my 10K rps projections, but I have read about 9K rps to map
servers from some sources on the internet.

And, NO, I don't work for Google :) But who knows we might be building
something that can get so much traffic to us in a while. :D

BTW, my question still remains - can we do search on polygonal areas on the
map? If so, do you have any link where I can get more details? The bounding
box approach won't work for me, I guess :(

Sam



Re: Solr - Spatial Search for Specific Areas on Map

Posted by "David Smiley (@MITRE.org)" <DS...@mitre.org>.
Sam,

These are big numbers you are throwing around, especially the query volume.
How big are these records that you have 4 billion of -- or, put another way,
how much space would they take up in a pure form like CSV?  And should I
assume the searches you are doing are more than geospatial?  In any case, a
Solr solution here is going to involve many machines.  The biggest number
you propose is 10k queries per second, which is hard to imagine.

I've seen some say Solr 4 might handle 100M records per shard, although there
is a good deal of variability -- as usual, YMMV.  But let's go with that for
this paper-napkin calculation.  You would need 40 shards of 100M documents
each to get to 4000M (4B) documents.  That is a lot of shards, but people
have done it, I believe.  This scales out to your document collection but
not up to your query volume, which is extremely high.  I have some old
benchmarks suggesting ~10ms spatial queries for SOLR-2155,
which was rolled into the spatial code in Lucene 4 (Solr adapters are on the
way).  But to account for full query overhead, and for a safer estimate, let's
say 50ms.  So perhaps you might get 20 concurrent queries per second (which
seems high, but we'll go with it).  But you require 10k/sec(!), so this means
you need 500 times the 20 qps, which means 500 *times* the base hardware to
support the 40 shards I mentioned before.  In other words, the 4B documents
need to be replicated 500 times to support 10k queries/second.  So
theoretically, we're talking 500 clusters, each cluster having 40 shards --
at ~4 shards/machine, that is 10 machines per cluster: 5,000 machines in
total.  Wow.  Doesn't seem realistic.  If you have a reference to some
system, or a person's experience with any system that can do this, Solr or
not, then please share.
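Written out, the paper-napkin calculation above is just a few divisions; every input is a rough estimate from this message, not a measurement:

```python
# Reproduce the paper-napkin estimate: machines needed to hold 4B
# documents AND serve 10k queries/sec, given ~20 qps per full copy
# of the data (the pessimistic 50ms-per-query assumption).
docs_total = 4_000_000_000
docs_per_shard = 100_000_000
target_qps = 10_000
qps_per_cluster = 20        # one cluster = one full copy of the data
shards_per_machine = 4

shards = docs_total // docs_per_shard                 # 40 shards
clusters = target_qps // qps_per_cluster              # 500 full replicas
machines_per_cluster = shards // shards_per_machine   # 10 machines
machines_total = clusters * machines_per_cluster      # 5,000 machines

print(shards, clusters, machines_per_cluster, machines_total)
```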

If you or anyone were to attempt to see whether Solr scales for their needs, a
good approach is to consider just one non-replicated shard, or, even better, a
handful that would all exist on one machine.  Optimize it as much as you
can.  Then see how much data you can put on this machine and with what
query volume.  From this point, it's basic math to see how many more such
machines are required to scale out to your data size and up to your query
volume.
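That sizing procedure reduces to one formula. This is only a sketch of the "basic math" described above; the function name is invented, and the example inputs are placeholders you would replace with your own single-machine measurements:

```python
import math

def machines_needed(total_docs, total_qps, docs_per_machine, qps_per_copy):
    """Rough sizing from a single-machine benchmark: scale OUT for data
    volume, then replicate the whole cluster to scale UP for query volume.
    Assumes each query fans out to every shard of one full data copy, so a
    copy's throughput is roughly one machine's measured qps."""
    machines_per_copy = math.ceil(total_docs / docs_per_machine)  # scale out
    copies = math.ceil(total_qps / qps_per_copy)                  # scale up
    return machines_per_copy * copies

# Placeholder inputs matching the earlier napkin math: 400M docs/machine
# (4 shards x 100M) and 20 qps per full copy of the data.
print(machines_needed(4_000_000_000, 10_000, 400_000_000, 20))
```

Plugging in the thread's rough numbers reproduces the 5,000-machine figure, which is the point of the exercise: measure one machine honestly, then multiply.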

Care to explain why so much data needs to be searched at such a volume? 
Maybe you work for Google ;-)

To your question on scalability vs PostGIS, I think Solr shines in its
ability to scale out if you have the resources to do it.

~ David Smiley
