You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Smiley, David W." <ds...@mitre.org> on 2010/12/28 16:47:50 UTC

Re: Geospatial search in Lucene/Solr

(I was emailing Grant RE geospatial search and it occurred to me I should be doing this on the Lucene dev list so I've taken it there)

On Dec 28, 2010, at 8:31 AM, Grant Ingersoll wrote:
On Dec 27, 2010, at 11:52 PM, Smiley, David W. wrote:

Hi Grant.

I saw your latest blog post at Lucid Imagination in which you mentioned you were going to work on adding polygon search.  FWIW, I finished polygon search last week to my code base based onhttps://issues.apache.org/jira/browse/SOLR-2155 (geohash prefix implementation).

I've scanned this, but haven't had time to look in depth.

I used the JTS library to do the heavy lifting.  I’d be happy to release the code.  I’ve iterated more on SOLR-2155 on my local project code and plan to re-integrate it with the patch at some point.  There was almost 2 months of time in-between me releasing SOLR-2155 and me having other priorities but I’m back at it.

Cool.  The problem w/ JTS is it is LGPL.

At least it's not GPL.  Can you simply use JTS and make it an optional library, like Solr does for some other libs?  There's a lot of expertise in that library that's been refined over the last 10 years.  Even if you refuse to touch anything non-Apache licensed, I highly recommend you look through its code to see how geospatial point-in-polygon is efficiently done.  It has a concept of a "prepared geometry" object that is optimized to make large numbers of point-in-polygon queries more efficient.  It is implemented by putting the line segments of the polygon in an in-memory R-tree index.  If you'd like me to point you at specific classes, I'd be happy to.  Better yet, I could release an update to SOLR-2155 and you could debug step through.  FWIW  I used a separate class for the polygon search that implements my GeoShape interface.  If a user doesn't need to do a polygon search (which is not a common requirement of geospatial search), then the JTS library need not ship with Lucene/Solr.

Presently, I’m working on Lucene’s benchmark contrib module to evaluate the performance of SOLR-2155 compared to the LatLon type (i.e. a pair of lat-lon range queries), and then I’ll work on a more efficient probably non-geohash implementation but based on the same underlying concept of a hierarchical grid.  I’m using the geonames.org<http://geonames.org/> data set.  Unfortunately, the benchmark code seems very oriented to a generic title-body document whereas I’m looking to create lat-lon pairs… and furthermore to create documents containing multiple lat-lon pairs, and even furthermore a query generator that generates random box queries centered on a random location from the data set.  I seem to be stretching the benchmark framework beyond the use-case it was designed for and so perhaps it won’t be committable but at least I’ll have a patch for other geospatial birds-of-a-feather like you to use.

Stretch away.  The Title/Body orientation is just a relic of what we have done in the past, it doesn't have to stay that way.

I am interested in your thoughts on evaluating the performance of geospatial queries.  Since reading LIA2, the Lucene's benchmark contrib module seems like the ideal way to test Lucene. I thought about programmatically generating points but I've warmed to the idea of using geonames.org<http://geonames.org> data as the space of possible points.  Geonames has 7.5M unique lat-lon pairs[1], including population.  In SOLR-2155, there are multiple points per document as a key feature.  For testing, I want to be able to configure 1 point per document for comparison to algorithms that only support that but I must support a random variable number of them too.  Consequently, each place does NOT correspond 1-1 to a document.  The number of documents indexed should be configurable which will be randomly generated based on randomly picking one or more points from the input data set.  The number of points available from the input data set should be configurable too (i.e. < 7.5M).   Assuming that a more populace place is more likely to be referenced than a less populace one, I want to skew choosing a place weighted by the place's population.  This creates a much more realistic skew of documents mapped to points than an evenly distributed one, which is important.  On the query side, I'd like to generate random queries centered on a random one of these points, with a radius that is either 1km, 10km, 100km, 1000km, or 10000km, in order to match a wide variety of documents.  For reporting, I'd like to see a chart of response time vs number of documents matched.

I'm perhaps half-done with implementing all this. Because I need to randomly choose points in the data set, I can't stream it in, I need to read it all into memory as a singleton object used by both the indexing side, and the query side (since the queries need to pick a random point).

[1] http://www.geonames.org/about.html

~ David Smiley

Re: Geospatial search in Lucene/Solr

Posted by "David Smiley (@MITRE.org)" <DS...@mitre.org>.
As a follow-up to this thread, I've contributed my geospatial benchmark
performance code here:
https://issues.apache.org/jira/browse/LUCENE-2844 "benchmark geospatial
performance based on geonames.org"

-----
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Re-Geospatial-search-in-Lucene-Solr-tp2157146p2186103.html
Sent from the Solr - Dev mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Geospatial search in Lucene/Solr

Posted by Lance Norskog <go...@gmail.com>.
Great!

I would suggest a new /modules for gis. It is worthwhile to have a
/modules/gis/geonames for large-scale tests/demos/benchmarks, with ant
scripts to download datasets and run the tests.

About demos: there is a lot of GEO code out there: libraries
(http://www.openmap.org/), data (geonames, openstreetmap) and map
server protocols and implementations. Geosoftware is really tricky to
get right.

On Mon, Jan 3, 2011 at 11:22 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Dec 28, 2010, at 1:02 PM, Robert Muir wrote:
>
>> On Tue, Dec 28, 2010 at 11:59 AM, Smiley, David W. <ds...@mitre.org> wrote:
>>> Thanks for letting me know about this Rob.  I think geonames is much simpler (and much less data) to work with than wikipedia.  It's plain tab-delimited and I like that it includes the population.  I'll press forward with my benchmark module based patch.  I can relatively easily switch between the lat-lon type and my geohash type since they both conform to the SpatialQueriable interface, and so consequently I don't need two complete Lucene checkouts.  I had to add Solr & spatial as dependencies to the benchmark module but it's worth it to me.
>>>
>>
>> well in my opinion there isn't a reason why benchmark couldn't be
>> moved to /modules and depend on solr..
>
> +1.  I've often thought it as well.  Config's could even reference schema/solrconfig.xml to use (longer term)
>
>> . others might disagree but i
>> would prefer that benchmark be a module where you can benchmark
>> everything.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>



-- 
Lance Norskog
goksron@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Geospatial search in Lucene/Solr

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 28, 2010, at 1:02 PM, Robert Muir wrote:

> On Tue, Dec 28, 2010 at 11:59 AM, Smiley, David W. <ds...@mitre.org> wrote:
>> Thanks for letting me know about this Rob.  I think geonames is much simpler (and much less data) to work with than wikipedia.  It's plain tab-delimited and I like that it includes the population.  I'll press forward with my benchmark module based patch.  I can relatively easily switch between the lat-lon type and my geohash type since they both conform to the SpatialQueriable interface, and so consequently I don't need two complete Lucene checkouts.  I had to add Solr & spatial as dependencies to the benchmark module but it's worth it to me.
>> 
> 
> well in my opinion there isn't a reason why benchmark couldn't be
> moved to /modules and depend on solr..

+1.  I've often thought it as well.  Config's could even reference schema/solrconfig.xml to use (longer term)

> . others might disagree but i
> would prefer that benchmark be a module where you can benchmark
> everything.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Geospatial search in Lucene/Solr

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Dec 28, 2010 at 11:59 AM, Smiley, David W. <ds...@mitre.org> wrote:
> Thanks for letting me know about this Rob.  I think geonames is much simpler (and much less data) to work with than wikipedia.  It's plain tab-delimited and I like that it includes the population.  I'll press forward with my benchmark module based patch.  I can relatively easily switch between the lat-lon type and my geohash type since they both conform to the SpatialQueriable interface, and so consequently I don't need two complete Lucene checkouts.  I had to add Solr & spatial as dependencies to the benchmark module but it's worth it to me.
>

well in my opinion there isn't a reason why benchmark couldn't be
moved to /modules and depend on solr... others might disagree but i
would prefer that benchmark be a module where you can benchmark
everything.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Geospatial search in Lucene/Solr

Posted by "Smiley, David W." <ds...@mitre.org>.
Thanks for letting me know about this Rob.  I think geonames is much simpler (and much less data) to work with than wikipedia.  It's plain tab-delimited and I like that it includes the population.  I'll press forward with my benchmark module based patch.  I can relatively easily switch between the lat-lon type and my geohash type since they both conform to the SpatialQueriable interface, and so consequently I don't need two complete Lucene checkouts.  I had to add Solr & spatial as dependencies to the benchmark module but it's worth it to me.

~ David

On Dec 28, 2010, at 11:18 AM, Robert Muir wrote:

> On Tue, Dec 28, 2010 at 10:47 AM, Smiley, David W. <ds...@mitre.org> wrote:
>> Presently, I’m working on Lucene’s benchmark contrib module to evaluate the
>> performance of SOLR-2155 compared to the LatLon type (i.e. a pair of lat-lon
>> range queries), and then I’ll work on a more efficient probably non-geohash
>> implementation but based on the same underlying concept of a hierarchical
>> grid.  I’m using the geonames.org data set.  Unfortunately, the benchmark
>> code seems very oriented to a generic title-body document whereas I’m
>> looking to create lat-lon pairs… and furthermore to create documents
>> containing multiple lat-lon pairs, and even furthermore a query generator
>> that generates random box queries centered on a random location from the
>> data set.  I seem to be stretching the benchmark framework beyond the
>> use-case it was designed for and so perhaps it won’t be committable but at
>> least I’ll have a patch for other geospatial birds-of-a-feather like you to
>> use.
>> 
>> Stretch away.  The Title/Body orientation is just a relic of what we have
>> done in the past, it doesn't have to stay that way.
> 
> just for reference, a couple of us are using a python front-end to
> contrib/benchmark that Mike developed:
> 
> http://code.google.com/p/luceneutil/
> 
> This is nice as its designed for you to just declare 'competitors' (2
> checkouts of solrcene), and then you run the python script and it
> gives you the relative comparison... because they are 2 different
> checkouts its simple to compare different approaches, and each
> checkout can run with a different index (e.g. different codecs or test
> index format changes).
> 
> I thought it might be interesting to you, because there's a variety of
> queries tested here like numeric range, sorting, primary-key lookup,
> span queries etc beyond the "standard" set of queries. The framework
> also ensures that you are bringing back the same results in the same
> order, runs multiple iterations (including iterations in new JVMs),
> makes it easy to test optimized, optimized with deletions,
> multi-segment, multi-segment with deletions, and can output to txt,
> html, jira format for convenience.
> 
> currently we are generally testing with a line file format from
> wikipedia, but besides geonames i wanted to point out that wikipedia
> does include lat/long information for many articles (this is a major
> source for much of geonames place data!).
> 
> it would definitely be cool if we could test spatial queries with this
> as well... e.g by parsing out the lat/long from the wikipedia XML and
> adding to the line files, and adding some spatial queries to the
> default list of queries being tested.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Geospatial search in Lucene/Solr

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Dec 28, 2010 at 10:47 AM, Smiley, David W. <ds...@mitre.org> wrote:
> Presently, I’m working on Lucene’s benchmark contrib module to evaluate the
> performance of SOLR-2155 compared to the LatLon type (i.e. a pair of lat-lon
> range queries), and then I’ll work on a more efficient probably non-geohash
> implementation but based on the same underlying concept of a hierarchical
> grid.  I’m using the geonames.org data set.  Unfortunately, the benchmark
> code seems very oriented to a generic title-body document whereas I’m
> looking to create lat-lon pairs… and furthermore to create documents
> containing multiple lat-lon pairs, and even furthermore a query generator
> that generates random box queries centered on a random location from the
> data set.  I seem to be stretching the benchmark framework beyond the
> use-case it was designed for and so perhaps it won’t be committable but at
> least I’ll have a patch for other geospatial birds-of-a-feather like you to
> use.
>
> Stretch away.  The Title/Body orientation is just a relic of what we have
> done in the past, it doesn't have to stay that way.

just for reference, a couple of us are using a python front-end to
contrib/benchmark that Mike developed:

http://code.google.com/p/luceneutil/

This is nice as its designed for you to just declare 'competitors' (2
checkouts of solrcene), and then you run the python script and it
gives you the relative comparison... because they are 2 different
checkouts its simple to compare different approaches, and each
checkout can run with a different index (e.g. different codecs or test
index format changes).

I thought it might be interesting to you, because there's a variety of
queries tested here like numeric range, sorting, primary-key lookup,
span queries etc beyond the "standard" set of queries. The framework
also ensures that you are bringing back the same results in the same
order, runs multiple iterations (including iterations in new JVMs),
makes it easy to test optimized, optimized with deletions,
multi-segment, multi-segment with deletions, and can output to txt,
html, jira format for convenience.

currently we are generally testing with a line file format from
wikipedia, but besides geonames i wanted to point out that wikipedia
does include lat/long information for many articles (this is a major
source for much of geonames place data!).

it would definitely be cool if we could test spatial queries with this
as well... e.g by parsing out the lat/long from the wikipedia XML and
adding to the line files, and adding some spatial queries to the
default list of queries being tested.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org