Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2010/10/13 06:18:32 UTC

[jira] Created: (SOLR-2155) Geospatial search using geohash prefixes

Geospatial search using geohash prefixes
----------------------------------------

                 Key: SOLR-2155
                 URL: https://issues.apache.org/jira/browse/SOLR-2155
             Project: Solr
          Issue Type: Improvement
            Reporter: David Smiley


There currently isn't a solution in Solr for doing geospatial filtering on documents that have a variable number of points.  This scenario occurs when location extraction (e.g. via a "gazateer") is performed on free text.  None, one, or many geospatial locations might be extracted from any given document, and users want to limit their search results to those occurring in a user-specified area.

I've implemented this by furthering the GeoHash-based work in Lucene/Solr with a geohash prefix based filter.  A geohash refers to a lat-lon box on the earth.  Each successive character added further subdivides the box into a 4x8 (or 8x4, depending on the even/odd length of the geohash) grid.  The first step in this scheme is figuring out which geohash grid squares cover the user's search query.  I've added various extra methods to GeoHashUtils (and added tests) to assist in this purpose.  The next step is an actual Lucene Filter, GeoHashPrefixFilter, that uses these geohash prefixes in TermsEnum.seek() to skip to relevant grid squares in the index.  Once a matching geohash grid is found, the points therein are compared against the user's query to see if they match.  I created an abstraction GeoShape, extended by subclasses named PointDistance... and CartesianBox...., to support different queried shapes so that the filter need not care about these details.
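To make the subdivision concrete, here is a minimal standalone geohash encoder in plain Java (an illustrative sketch, not the GeoHashUtils code from the patch; class and method names are hypothetical):

```java
class GeohashSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Encode a lat/lon into a geohash of the given length by repeatedly
    // bisecting the longitude and latitude ranges and interleaving the
    // resulting bits, 5 bits (one base-32 character) at a time.
    static String encode(double lat, double lon, int length) {
        double[] latRange = {-90, 90};
        double[] lonRange = {-180, 180};
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true; // longitude bits come first
        int bit = 0, ch = 0;
        while (hash.length() < length) {
            double[] range = evenBit ? lonRange : latRange;
            double v = evenBit ? lon : lat;
            double mid = (range[0] + range[1]) / 2;
            ch <<= 1;
            if (v >= mid) { ch |= 1; range[0] = mid; } else { range[1] = mid; }
            evenBit = !evenBit;
            if (++bit == 5) {
                hash.append(BASE32.charAt(ch));
                bit = 0;
                ch = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        // Nearby points share a prefix; each extra character narrows the cell.
        System.out.println(GeohashSketch.encode(42.3601, -71.0589, 8)); // Boston
        System.out.println(GeohashSketch.encode(42.3736, -71.1097, 8)); // Cambridge
    }
}
```

Boston and Cambridge, a few kilometers apart, encode to hashes sharing the prefix "drt"; truncating a hash to a prefix names the larger grid square containing the point, which is what the prefix filter exploits.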

This work was presented at Lucene Revolution in Boston on October 8th.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923300#action_12923300 ] 

Lance Norskog commented on SOLR-2155:
-------------------------------------

You're right, this had no context! 

From the [geohash site|http://geohash.org/site/tips.html]: _Geohashes offer properties like arbitrary precision, similar prefixes for nearby positions, and the possibility of gradually removing characters from the end of the code to reduce its size (and gradually lose precision)._

If you store geohashes in a bitwise format, you get the "N leading bits" trick: the length of the shared leading-bit prefix of two hashes acts as a proximity measure. The more matching bits, starting from the highest or "hemisphere" bit, the closer the two points are.

You can use this to search bounding boxes at a given level of detail (LOD) by comparing only the first N bits (the LOD is, of course, a power of 2).
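A sketch of that comparison, assuming the hash is stored as a 64-bit value with the interleaved bits left-aligned (helper names are hypothetical):

```java
class LeadingBits {
    // Number of leading bits two 64-bit bitwise geohashes share.
    static int commonLeadingBits(long a, long b) {
        return (a == b) ? 64 : Long.numberOfLeadingZeros(a ^ b);
    }

    // LOD test: the two hashes fall in the same grid cell at level n
    // iff their first n bits match.
    static boolean sameCellAtLevel(long a, long b, int n) {
        return commonLeadingBits(a, b) >= n;
    }

    public static void main(String[] args) {
        long a = 0b1011_0110L << 56; // two hashes differing only in a low bit
        long b = 0b1011_0010L << 56;
        System.out.println(commonLeadingBits(a, b));
    }
}
```

XOR zeroes out every matching bit, so counting leading zeros of the XOR gives the length of the shared prefix in one cheap CPU instruction.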

The core problem with this bitwise comparison trick is that one zero crossing runs through Greenwich, in Greater London, and the other is at the equator. So the bitwise search trick works in most of the world, just not in London or at the Equator. Street mapping and "find the nearest X" are major use cases for geo-search. So we have an ultra-fast bounding box search *that blows up in London*. (Of course, it's not just London: everything at longitude 0.00.)

Now rotate the scheme: pick a zero-crossing longitude that goes through Unalaska, in an area with no roads, so that the crossing blows up only in a sparsely inhabited area. Then, instead of the Equator, use the North Pole as the other zero crossing. That longitude passes through the island where there are no roads, and there are no streets (yet) at the North Pole.  _Street mapping applications would work perfectly well with a rotated geohash._ Thus, rotating the geohash gives a variable-LOD bitwise search that always works and is very fast.

Super-fast bitwise proximity search may not be an interesting goal any more, since CPUs are so fast, so rotating the basis of the geohash is probably not worthwhile. It would also generate loads of confused traffic on solr-user.

Does this help?


> Geospatial search using geohash prefixes
> ----------------------------------------
>
>                 Key: SOLR-2155
>                 URL: https://issues.apache.org/jira/browse/SOLR-2155
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: David Smiley
>         Attachments: GeoHashPrefixFilter.patch
>
>
> There currently isn't a solution in Solr for doing geospatial filtering on documents that have a variable number of points.  This scenario occurs when there is location extraction (i.e. via a "gazateer") occurring on free text.  None, one, or many geospatial locations might be extracted from any given document and users want to limit their search results to those occurring in a user-specified area.
> I've implemented this by furthering the GeoHash based work in Lucene/Solr with a geohash prefix based filter.  A geohash refers to a lat-lon box on the earth.  Each successive character added further subdivides the box into a 4x8 (or 8x4 depending on the even/odd length of the geohash) grid.  The first step in this scheme is figuring out which geohash grid squares cover the user's search query.  I've added various extra methods to GeoHashUtils (and added tests) to assist in this purpose.  The next step is an actual Lucene Filter, GeoHashPrefixFilter, that uses these geohash prefixes in TermsEnum.seek() to skip to relevant grid squares in the index.  Once a matching geohash grid is found, the points therein are compared against the user's query to see if it matches.  I created an abstraction GeoShape extended by subclasses named PointDistance... and CartesianBox.... to support different queried shapes so that the filter need not care about these details.
> This work was presented at LuceneRevolution in Boston on October 8th.



[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920498#action_12920498 ] 

Robert Muir commented on SOLR-2155:
-----------------------------------

Since a geohash is a textual representation of a Morton number (interleaved bits), wouldn't it be better to use the Morton number directly (a numeric representation), so that NumericRangeQuery/Filter could be used instead of PrefixQuery or TermRangeQuery?

It would give the same results, only faster, since it would need to visit fewer terms at search time.
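The bit interleaving Robert describes can be sketched like this (an illustrative helper, not Lucene code):

```java
class Morton {
    // Interleave the bits of x and y into one Morton number:
    // x occupies the even bit positions, y the odd ones. Nearby
    // (x, y) cells map to numerically nearby Morton values, so a
    // grid cell corresponds to a contiguous numeric range.
    static long interleave(int x, int y) {
        long result = 0;
        for (int i = 0; i < 32; i++) {
            result |= ((long) ((x >>> i) & 1)) << (2 * i);
            result |= ((long) ((y >>> i) & 1)) << (2 * i + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        // A block of cells becomes one numeric range, which is what a
        // numeric range query/filter could exploit.
        System.out.println(Long.toBinaryString(interleave(0b11, 0b01)));
    }
}
```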




[jira] Issue Comment Edited: (SOLR-2155) Geospatial search using geohash prefixes

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924098#action_12924098 ] 

Lance Norskog edited comment on SOLR-2155 at 10/22/10 8:32 PM:
---------------------------------------------------------------

I've reread the patch a few times and I understand it better now, and yes there should be no equator/prime meridian problems. I retract any overt or implied criticism.

bq. First of all, I'm re-using the existing geohash field support in Solr which indexes the lat-lons as actual geohashes (i.e. the character representation), not in a bitwise fashion. But that doesn't really matter - it would be a worthwhile optimization to index them in that fashion as it would be more compact.

Using the canonical geohash gives facet values that can be copied and pasted into other software. Thinking about it again, this is a great feature. Would it be worth optimizing geohash with a Trie version?  Trie fields (can be made to) show up correctly in facets.

And thank you for the word [gazateer|http://encyclopedia2.thefreedictionary.com/Gazateer].

About unit tests: I've stumbled so many times with floating point that I only trust real-world data. A good unit test would be indexing a [gazateer|http://encyclopedia2.thefreedictionary.com/Gazateer] of world data and randomly comparing points. OpenStreetMap or Wikipedia locations, for example.


      was (Author: lancenorskog):
    I've reread the patch a few times and I understand it better now, and yes there should be no equator/prime meridian problems. I retract any overt or implied criticism.

bq. First of all, I'm re-using the existing geohash field support in Solr which indexes the lat-lons as actual geohashes (i.e. the character representation), not in a bitwise fashion. But that doesn't really matter - it would be a worthwhile optimization to index them in that fashion as it would be more compact.

Using the canonical geohash gives facet values that can be copy&pasted with other software. Thinking again, this is a great feature. Would it be worth optimizing geohash with a Trie version?  Trie fields (can be made to) show up correctly in facets.

And thank you for the word [gazateer|http://encyclopedia2.thefreedictionary.com/Gazateer].

About unit tests: I've stumbled so many times with floating point that I only trust real-world data. A good unit test would be indexing a [gazateer|http://encyclopedia2.thefreedictionary.com/Gazateer] of world data and randomly comparing points. OpenStreetMaps or Wikipedia locations for example.

  


[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes

Posted by "David Smiley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12922068#action_12922068 ] 

David Smiley commented on SOLR-2155:
------------------------------------

The presentation I gave at Lucene Revolution on this subject was finally uploaded:  http://www.lucidimagination.com/files/Lucene%20Rev%20Preso%20Smiley%20Spatial%20Search.pdf



[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920554#action_12920554 ] 

Robert Muir commented on SOLR-2155:
-----------------------------------

{quote}
One area that I know nothing about is how scoring/sorting actually works within Lucene. For the work here I wasn't in need of that but many people clearly want that. In your opinion Rob, is there any opportunity for geo sorting/relevancy code to take advantage of any efficiencies done here or are they completely unrelated?
{quote}

That's a good question; I'm not sure this stuff will help with that. But there are also a lot of people who, like you, don't need or want to integrate it into scoring, and maybe just want to filter on distance really fast and score based on something else.

I'm not a spatial guy and don't understand the spatial goings-on, but it seems like people who want to do relevance based on distance could achieve that some other way, and use the trie value just to have a really fast bounding "box" filter.

Maybe they use a Solr function query, or however this is done, but the filter would speed it up tremendously, of course with some loss of precision. But this is search; it's not like the textual component has perfect precision, and a lot of people aren't going across meridians or the earth's poles or anything.




[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925245#action_12925245 ] 

Lance Norskog commented on SOLR-2155:
-------------------------------------

bq. I'm not sure how a "trie version" of geohash is developed. I already spoke of further refining the implementation to index the geohashes at each grid level and I think that is very similar to what trie does for numbers.

This is why I mentioned the Trie classes: they seemed like the same tool, and the lessons learned in making facets etc. work with them are worth knowing. It seems like a Trie for geohash would just be a character-by-character trie.
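To illustrate the character-by-character idea, here is a toy in-memory trie over geohash strings (a sketch only, not a Solr Trie field; each trie node corresponds to a grid cell at one precision level):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal character-by-character trie over geohash strings.
class GeohashTrie {
    private final Map<Character, GeohashTrie> children = new HashMap<>();
    private int count; // points indexed under this cell

    void insert(String hash) {
        GeohashTrie node = this;
        for (char c : hash.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new GeohashTrie());
            node.count++;
        }
    }

    // Number of indexed points whose geohash starts with the given
    // prefix, i.e. points falling in that grid cell.
    int countInCell(String prefix) {
        GeohashTrie node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return 0;
        }
        return node.count;
    }

    public static void main(String[] args) {
        GeohashTrie root = new GeohashTrie();
        root.insert("drt2");
        root.insert("drt3");
        System.out.println(root.countInCell("drt"));
    }
}
```

Walking one prefix character at a time is exactly a coarse-to-fine descent through the grid levels, which is the analogy to Trie numeric fields.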



[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924098#action_12924098 ] 

Lance Norskog commented on SOLR-2155:
-------------------------------------

I've reread the patch a few times and I understand it better now, and yes there should be no equator/prime meridian problems. I retract any overt or implied criticism.

bq. First of all, I'm re-using the existing geohash field support in Solr which indexes the lat-lons as actual geohashes (i.e. the character representation), not in a bitwise fashion. But that doesn't really matter - it would be a worthwhile optimization to index them in that fashion as it would be more compact.

Using the canonical geohash gives facet values that can be copy&pasted with other software. Thinking again, this is a great feature. Would it be worth optimizing geohash with a Trie version?  Trie fields (can be made to) show up correctly in facets.

And thank you for the word [gazateer|http://encyclopedia2.thefreedictionary.com/Gazateer].

About unit tests: I've stumbled so many times with floating point that I only trust real-world data. A good unit test would be indexing a [gazateer|http://encyclopedia2.thefreedictionary.com/Gazateer] of world data and randomly comparing points. OpenStreetMaps or Wikipedia locations for example.




[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes

Posted by "David Smiley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920432#action_12920432 ] 

David Smiley commented on SOLR-2155:
------------------------------------

My next step is to measure performance, perhaps using Lucene's benchmark contrib module with some as yet unidentified source of lat-lons.  I can then do some tuning.  I've identified several areas for improvement I will tackle, most notably indexing geohashes at all character lengths, thereby enabling an algorithm that can do faster queries covering many points.  I've heard others call this "geospatial tiers", which is in effect what I'll be doing.  I'll also add a PolygonGeoShape.
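The "index at all character lengths" idea amounts to emitting every prefix of each point's geohash as a separate term (a sketch with a hypothetical helper; the patch itself does not yet do this):

```java
import java.util.ArrayList;
import java.util.List;

class PrefixTerms {
    // Emit every prefix of a geohash. Indexing all of them lets a query
    // over a large area match one short term per big cell instead of
    // scanning many long, fine-grained terms.
    static List<String> allPrefixes(String geohash) {
        List<String> terms = new ArrayList<>();
        for (int i = 1; i <= geohash.length(); i++) {
            terms.add(geohash.substring(0, i));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(allPrefixes("drt2"));
    }
}
```

The trade-off is a larger index (one extra term per precision level per point) in exchange for far fewer term seeks at query time, much like the precision steps of Lucene's Trie numeric fields.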




[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes

Posted by "David Smiley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923925#action_12923925 ] 

David Smiley commented on SOLR-2155:
------------------------------------

Lance, did you look at my patch, or just skim the issue description?  There seems to be a big disconnect between what I'm doing and what you think I'm doing.

First of all, I'm re-using the existing geohash field support in Solr which indexes the lat-lons as actual geohashes (i.e. the character representation), not in a bitwise fashion.  But that doesn't really matter -- it would be a worthwhile optimization to index them in that fashion as it would be more compact.

Secondly, as stated in the issue description, this filter finds _multiple_ geohash grid squares to cover the queried area.  It doesn't matter where the boundaries are: 0, 90, 180, Unalaska, the North Pole, whatever.  A search over the London or Greenwich areas will yield grid squares that have no prefixes in common, but that doesn't matter; +each grid square is subsequently searched independently against the user's query+.

Thirdly, the *only* distance measurements in this patch are between _resolved_ latitude-longitude points (e.g. decoded geohashes) in the index and the user's query, and only *if* that query is point-radius (vs a lat-lon bounding box, which need not calculate distance).  This uses haversine, but I'm not using geohashes for distance calculation.
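For reference, the haversine point-radius check described here looks roughly like this (a generic textbook formulation, not the patch's exact code; the Earth radius constant is an assumption):

```java
class Haversine {
    static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius, assumed

    // Great-circle distance between two lat/lon points, in kilometers.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // Point-radius filter: does the decoded indexed point fall inside
    // the query circle?
    static boolean withinRadius(double qLat, double qLon, double radiusKm,
                                double pLat, double pLon) {
        return distanceKm(qLat, qLon, pLat, pLon) <= radiusKm;
    }

    public static void main(String[] args) {
        // One degree of longitude at the equator, roughly 111 km.
        System.out.println(distanceKm(0, 0, 0, 1));
    }
}
```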

If you still insist there is a shortcoming of my implementation, then I challenge you to add a unit test proving that my implementation here doesn't work.  The existing tests I used in unit

> Geospatial search using geohash prefixes
> ----------------------------------------
>
>                 Key: SOLR-2155
>                 URL: https://issues.apache.org/jira/browse/SOLR-2155
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: David Smiley
>         Attachments: GeoHashPrefixFilter.patch
>
>
> There currently isn't a solution in Solr for doing geospatial filtering on documents that have a variable number of points.  This scenario occurs when there is location extraction (e.g. via a "gazetteer") occurring on free text.  None, one, or many geospatial locations might be extracted from any given document, and users want to limit their search results to those occurring in a user-specified area.
> I've implemented this by furthering the GeoHash based work in Lucene/Solr with a geohash prefix based filter.  A geohash refers to a lat-lon box on the earth.  Each successive character added further subdivides the box into a 4x8 (or 8x4 depending on the even/odd length of the geohash) grid.  The first step in this scheme is figuring out which geohash grid squares cover the user's search query.  I've added various extra methods to GeoHashUtils (and added tests) to assist in this purpose.  The next step is an actual Lucene Filter, GeoHashPrefixFilter, that uses these geohash prefixes in TermsEnum.seek() to skip to relevant grid squares in the index.  Once a matching geohash grid is found, the points therein are compared against the user's query to see if it matches.  I created an abstraction GeoShape extended by subclasses named PointDistance... and CartesianBox.... to support different queried shapes so that the filter need not care about these details.
> This work was presented at LuceneRevolution in Boston on October 8th.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.




[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes

Posted by "David Smiley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920536#action_12920536 ] 

David Smiley commented on SOLR-2155:
------------------------------------

Yes, absolutely Rob.  I went with geohashes because it was a straightforward implementation to prove out the concept.  It appears my patch is the first of its kind for Lucene/Solr.  For a more efficient Morton representation, I have already looked at the work going on at javageomodel: http://code.google.com/p/javageomodel/  which was built for use with Google BigTable.  Keep in mind that the code there is largely pure Java.  It's the same concept, but it uses a dictionary of size 16 (representable by 4 bits), which results in cleaner algorithms than geohashes' 5-bit dictionary with its awkward even/odd rules.  But yes, it would be more efficient to store the actual intended bits, not characters.
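For reference, the bit interleaving behind a Morton code can be sketched as below.  The 16-bit quantization and all names here are my own illustrative assumptions, not taken from javageomodel or the patch:

```java
// Sketch of Morton (Z-order) encoding: quantize lat/lon to fixed-width
// cell indexes, then interleave their bits. Storing these bits directly
// is the compact alternative to character geohashes discussed above.
public class MortonSketch {
    /** Quantize a longitude in [-180, 180) to a 16-bit cell index. */
    static int quantizeLon(double lon) {
        return (int) ((lon + 180.0) / 360.0 * 65536.0);
    }

    /** Quantize a latitude in [-90, 90) to a 16-bit cell index. */
    static int quantizeLat(double lat) {
        return (int) ((lat + 90.0) / 180.0 * 65536.0);
    }

    /** Interleave the low 16 bits of x and y; x bits land in even positions. */
    static int interleave16(int x, int y) {
        int result = 0;
        for (int i = 0; i < 16; i++) {
            result |= ((x >> i) & 1) << (2 * i);
            result |= ((y >> i) & 1) << (2 * i + 1);
        }
        return result;
    }

    /** 32-bit Morton code for a point; prefixes name coarser cells. */
    static int mortonOf(double lat, double lon) {
        return interleave16(quantizeLon(lon), quantizeLat(lat));
    }
}
```

As with geohash characters, truncating leading bits of the Morton code names progressively coarser grid squares, which is what makes a prefix-style filter possible over either representation.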

One area that I know nothing about is how scoring/sorting actually works within Lucene.  For the work here I wasn't in need of that but many people clearly want that.  In your opinion Rob, is there any opportunity for geo sorting/relevancy code to take advantage of any efficiencies done here or are they completely unrelated?

(I meant to track you down at LuceneRevolution to say hi but I missed the opportunity.)



[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes

Posted by "David Smiley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925001#action_12925001 ] 

David Smiley commented on SOLR-2155:
------------------------------------

bq. Using the canonical geohash gives facet values that can be copy&pasted with other software. Thinking again, this is a great feature. Would it be worth optimizing geohash with a Trie version? Trie fields (can be made to) show up correctly in facets.

The geohash usage is purely internal to the implementation; users don't see it when they use this field.  And even if the geohashes were exposed, they can be generated on demand -- I've even seen JavaScript code that does this.  So I'm not married to using geohashes; it's their underlying hierarchical/gridded nature that is key.  I'm not sure how a "trie version" of geohash would be developed.  I already spoke of further refining the implementation to index the geohashes at each grid level, and I think that is very similar to what trie does for numbers.
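Indexing a geohash "at each grid level" amounts to indexing every prefix of the hash as its own term, just as trie fields index a number at several precisions.  A minimal sketch (names are illustrative, not from the patch):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of multi-resolution indexing: emit one term per geohash prefix,
// coarsest grid square first, so a filter can match a coarse cell with a
// single term instead of scanning all the full-precision hashes under it.
public class PrefixTerms {
    /** All prefixes of the geohash, from coarsest cell to full precision. */
    static List<String> termsFor(String geohash) {
        List<String> terms = new ArrayList<>();
        for (int len = 1; len <= geohash.length(); len++) {
            terms.add(geohash.substring(0, len));
        }
        return terms;
    }
}
```

For example, a 12-character geohash would index 12 terms; the index grows, but a query box covered by a coarse cell is then answered by one term lookup.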

Thanks for the suggestion of using OpenStreetMap to get locations; I'll look into that.  I want to put together a useful data set -- using real data as much as possible is good.  However, I'll need to randomly synthesize a one-to-many mapping from documents to points.  And I'll need to come up with various random lat-lon box queries to perform.  I'd like to use Lucene's benchmark contrib module as a framework to develop the performance test.  I read about it in LIA2 and it seems to fit the bill.



[jira] Issue Comment Edited: (SOLR-2155) Geospatial search using geohash prefixes

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921761#action_12921761 ] 

Lance Norskog edited comment on SOLR-2155 at 10/16/10 5:23 PM:
---------------------------------------------------------------

The problem with Geohash is that it puts zeros in Greater London and at the Equator, so every computation that uses it has to dodge these points. More to the point, the Hamming distance trick does not work, so a simple super-fast scan of an array of Lucene Trie-Integers does not work.

On the Aleutian island of Unalaska, there is a line of longitude which goes through a mountainous region with no roads. That longitude does not touch any other land north of Antarctica.

[Unalaska|http://maps.google.com/maps?q=65%C2%B037%E2%80%B221%E2%80%B3N+168%C2%B020%E2%80%B242%E2%80%B3W]

65°37′21″N 168°20′42″W

If you rotate the geohash frame of reference to latitude North Pole and a longitude through Unalaska, you can cheerfully ignore all of the zero points, at the cost of alienating some Inuit.

Seriously, if you limit Geohash accuracy to land-based services (like maps), with an accuracy warning about comparing Nome and Vladivostok, sacrificing a road-less part of an Aleutian island seems a small price to pay. 



[jira] Updated: (SOLR-2155) Geospatial search using geohash prefixes

Posted by "David Smiley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated SOLR-2155:
-------------------------------

    Attachment: GeoHashPrefixFilter.patch

The attached patch is tested, both at the GeoHashUtils level and via SpatialFilterTest; I added tests to both.  I also added ASF headers.





[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes

Posted by "David Smiley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12922066#action_12922066 ] 

David Smiley commented on SOLR-2155:
------------------------------------

Lance, sorry, you've completely lost me; I don't understand anything you've said.  Can you please explain your points in more detail, any of them at least?

I don't see what's so special about the point on the earth you've drawn attention to:
65°37′21″N 168°20′42″W        geohash: b7b01fqvuff1
No point or location in the geohash scheme is special or problematic.  There are gridlines at every resolution which need to be dealt with -- and it's not hard.
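To see why no point is special, here is the standard geohash encoding in full: it just alternately bisects longitude and latitude, emitting 5 bits per base-32 character, and every point falls into exactly one cell at each resolution.  This sketch is for illustration; the patch reuses Lucene/Solr's existing GeoHashUtils:

```java
// Standard geohash encoding: alternately bisect the longitude and
// latitude ranges, appending one bit per bisection and one base-32
// character per 5 bits. Illustrative sketch, not the patch's code.
public class GeohashEncode {
    static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    static String encode(double lat, double lon, int length) {
        double minLat = -90, maxLat = 90, minLon = -180, maxLon = 180;
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true; // start with a longitude bisection
        int bit = 0, ch = 0;
        while (hash.length() < length) {
            if (evenBit) {
                double mid = (minLon + maxLon) / 2;
                if (lon >= mid) { ch = (ch << 1) | 1; minLon = mid; }
                else            { ch = ch << 1;       maxLon = mid; }
            } else {
                double mid = (minLat + maxLat) / 2;
                if (lat >= mid) { ch = (ch << 1) | 1; minLat = mid; }
                else            { ch = ch << 1;       maxLat = mid; }
            }
            evenBit = !evenBit;
            if (++bit == 5) { // 5 bits per base-32 character
                hash.append(BASE32.charAt(ch));
                bit = 0;
                ch = 0;
            }
        }
        return hash.toString();
    }
}
```

Note the prefix property that the filter relies on: a shorter encoding of the same point is always a prefix of the longer one.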
