Posted to solr-user@lucene.apache.org by gwk <gi...@eyefi.nl> on 2009/09/08 10:00:28 UTC

Geographic clustering

Hi,

I'm working on a search-on-map interface for our website. I've created a
little proof of concept which uses the MarkerClusterer
(http://code.google.com/p/gmaps-utility-library-dev/), which clusters the
markers nicely. But because sending tens of thousands of markers over
Ajax is not as fast as I would like, I'd prefer to do the clustering on
the server side. I've considered a few options: storing the Morton order
and throwing away precision to cluster, assigning all locations to a
grid position, or simply clustering by country/region/city depending on
zoom level by adding latitude and longitude fields for each zoom level
(so that for smaller countries you have to be zoomed in further to get
the next level of clustering).
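For concreteness, the Morton-order option could be sketched like this
(illustrative only; the class name, the 15-bit grid, and the
quantization bounds are my own assumptions, not anything from an
existing library):

```java
// Quantize lat/lng to a 15-bit grid, interleave the bits into a single
// Morton code, then drop low-order bits so neighboring cells collapse
// into one cluster key.
public class MortonSketch {
    static final int BITS = 15;

    // Map a coordinate in [min, max] onto a 15-bit integer grid cell.
    static int quantize(double v, double min, double max) {
        int cells = 1 << BITS;
        int cell = (int) ((v - min) / (max - min) * cells);
        return Math.min(cell, cells - 1);
    }

    // Interleave the bits of x and y: bit i of x goes to bit 2i,
    // bit i of y goes to bit 2i + 1.
    static int interleave(int x, int y) {
        int code = 0;
        for (int i = 0; i < BITS; i++) {
            code |= ((x >> i) & 1) << (2 * i);
            code |= ((y >> i) & 1) << (2 * i + 1);
        }
        return code;
    }

    // Dropping 2 * levels low bits merges each 2^levels x 2^levels
    // block of grid cells into a single cluster key.
    static int clusterKey(int morton, int levels) {
        return morton >>> (2 * levels);
    }

    public static void main(String[] args) {
        int x = quantize(5.1214, -180, 180);  // longitude
        int y = quantize(52.0907, -90, 90);   // latitude
        System.out.println(clusterKey(interleave(x, y), 8));
    }
}
```

Documents sharing a cluster key at a given level would then be counted
together; coarser zoom levels just drop more bits.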

I was wondering if anybody else has worked on something similar and if 
so what their solutions are.

Regards,

gwk

Re: Geographic clustering

Posted by gwk <gi...@eyefi.nl>.
Hi all,

I've just got my geographic clustering component working (somewhat);
I've attached a sample result set to this mail. It seems to work pretty
well and it's pretty fast. I have one issue I need help with concerning
the API, though. At the moment my Hilbert field is a sortable integer,
and I make the following call to get the count for a specific cluster:

Query rangeQ = new TermRangeQuery("geo_hilbert", lowI, highI, true, true);
searcher.numDocs(rangeQ, docs);

But I'd like to further reduce the DocSet by the latitude and longitude
bounds given in the geocluster arguments (swlat, swlng, nelat and
nelng), but only for the purposes of clustering; I don't want to add fq
arguments to the query, because I want my non-geocluster results (like
facet counts and numFound) to be unaffected by the selected range. So
how would I achieve the effect of filter queries (including the awesome
caching) by manipulating either rangeQ or docs? And since the snippet
above is called multiple times with different rangeQ but the same
(filtered) DocSet, I'm guessing that manipulating docs would be faster.
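One shape that matches the filter-query caching is to build the
bounding-box DocSet once and intersect it with docs before running the
per-cluster counts (in Solr this might go through something like
searcher.getDocSet(boundsQuery), but check the SolrIndexSearcher API
for your version; that call is an assumption on my part). The
intersect-once, count-many pattern itself, with java.util.BitSet
standing in for DocSet:

```java
import java.util.BitSet;

// BitSet stands in for Solr's DocSet: pay for the bounding-box
// intersection once, then every per-cluster range count reuses it.
public class FilterOnce {
    // Count docs in [from, to) that survive the cached filter.
    static int countInRange(BitSet filtered, int from, int to) {
        BitSet range = new BitSet();
        range.set(from, to);
        range.and(filtered);
        return range.cardinality();
    }

    public static void main(String[] args) {
        BitSet base = new BitSet();    // docs matching the main query
        base.set(0, 16);
        BitSet bounds = new BitSet();  // docs inside swlat/swlng..nelat/nelng
        bounds.set(2, 10);

        // Intersect once, up front (the expensive, cacheable part).
        BitSet filtered = (BitSet) base.clone();
        filtered.and(bounds);

        // Each per-cluster rangeQ then only pays for a cheap intersection.
        System.out.println(countInRange(filtered, 0, 4));  // docs 2, 3 -> 2
        System.out.println(countInRange(filtered, 8, 12)); // docs 8, 9 -> 2
    }
}
```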

Regards,

gwk

gwk wrote:
> [...]


Re: Geographic clustering

Posted by gwk <gi...@eyefi.nl>.
Hi Joe,

Thanks for the link, I'll check it out, I'm not sure it'll help in my 
situation though since the clustering should happen at runtime due to 
faceted browsing (unless I'm mistaken at what the preprocessing does).

More on my progress: I thought some more about using Hilbert curve
mapping and it seems really suited to what I want. I've just added a
Hilbert field to my schema (Trie Integer field) with latitude and
longitude at 15 bits of precision (I didn't use 16 bits, to avoid the
sign bit), so I have a 30-bit number in that field. Getting facet counts
for 0 to (2^30 - 1) should give me the entire map, while getting counts
for 0 to (2^28 - 1), 2^28 to (2^29 - 1), 2^29 to (2^29 + 2^28 - 1) and
(2^29 + 2^28) to (2^30 - 1) should give me counts for four equal
quadrants, all the way down to 0 to 3, 4 to 7, 8 to 11, ...,
(2^30 - 4) to (2^30 - 1), and of course faceting on every separate term.
Of course, if you're zoomed in far enough to need such fine-grained
clustering you'll be looking at a small portion of the map, and only
part of the whole range should be counted, but that should be doable by
calculating the Hilbert numbers for the lower and upper bounds.
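The range arithmetic above can be sketched as follows (a minimal
illustration; the class and method names are made up, but the 30-bit
total and the quadrant ranges follow the scheme described):

```java
// At depth d (d = 0 is the whole map), the 30-bit Hilbert codes split
// into 4^d equal buckets of size 2^(30 - 2d); bucket q covers
// [q * size, (q + 1) * size - 1]. Each bucket is one quadtree cell.
public class HilbertRanges {
    static final int TOTAL_BITS = 30;

    static long bucketSize(int depth) {
        return 1L << (TOTAL_BITS - 2 * depth);
    }

    static long[] bucketRange(int depth, long q) {
        long size = bucketSize(depth);
        return new long[] { q * size, (q + 1) * size - 1 };
    }

    public static void main(String[] args) {
        // depth 1 reproduces the four quadrants from the mail:
        // 0..2^28-1, 2^28..2^29-1, 2^29..2^29+2^28-1, 2^29+2^28..2^30-1
        for (long q = 0; q < 4; q++) {
            long[] r = bucketRange(1, q);
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

Each range would become one TermRangeQuery (or facet range) against the
Hilbert field.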

The only problem is the location of the clusters: with this method I'll
only have the Hilbert number and the number of items in that part of
what is essentially a quadtree. But I suppose I can calculate the facet
counts at one precision level finer than the requested precision and use
a weighted average of the four parts of each cluster; I'll have to see
if that is accurate enough.
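The weighted-average idea might look like this (an illustrative sketch;
decoding the child Hilbert numbers back to coordinates is assumed to
happen elsewhere, and the sample numbers are invented):

```java
// Place a cluster marker at the count-weighted average of its four
// child-cell centers, so the marker is pulled toward where the
// documents actually are instead of sitting at the cell's midpoint.
public class WeightedCentroid {
    // centers[i] = {x, y} of child cell i; counts[i] = docs in that cell.
    static double[] centroid(double[][] centers, int[] counts) {
        double sx = 0, sy = 0;
        int total = 0;
        for (int i = 0; i < counts.length; i++) {
            sx += centers[i][0] * counts[i];
            sy += centers[i][1] * counts[i];
            total += counts[i];
        }
        return total == 0 ? null : new double[] { sx / total, sy / total };
    }

    public static void main(String[] args) {
        double[][] centers = { {0, 0}, {1, 0}, {0, 1}, {1, 1} };
        int[] counts = { 3, 1, 0, 0 };
        double[] c = centroid(centers, counts);
        // Pulled toward the heavily populated quadrant.
        System.out.println(c[0] + ", " + c[1]);
    }
}
```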

Hopefully I'll have the time to complete this today or tomorrow. I'll 
report back if it has worked.

Regards,

gwk

Joe Calderon wrote:
> [...]


Re: Geographic clustering

Posted by Joe Calderon <ca...@gmail.com>.
There are clustering libraries like
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/ that have
bindings to Perl/Python; you can preprocess your results and create
clusters for each zoom level.

On Tue, Sep 8, 2009 at 8:08 AM, gwk<gi...@eyefi.nl> wrote:
> [...]

Re: Geographic clustering

Posted by gwk <gi...@eyefi.nl>.
Hi,

I just completed a simple proof-of-concept clusterer component which
naively clusters with a specified bounding box around each position,
similar to what the JavaScript MarkerClusterer does. It's currently very
slow, as I loop over the entire DocSet and request the longitude and
latitude of each document (not to mention that my unfamiliarity with
Lucene/Solr isn't helping the implementation's performance; most code is
copied from grep-ing the Solr source). Clustering a set of about 80,000
documents takes about 5-6 seconds. I'm currently looking into storing
the Hilbert curve mapping in Solr and clustering using facet counts on
numerical ranges of that mapping, but I'm not sure it will pan out.
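The naive bounding-box sweep looks roughly like this (an illustrative
reconstruction, not the actual component; it shows the
O(n x clusters) shape that makes it slow on large result sets):

```java
import java.util.ArrayList;
import java.util.List;

// Sweep the points; fold each one into the first existing cluster whose
// center lies within a fixed box around it, otherwise start a new
// cluster. Every point may scan every cluster, hence the slowness.
public class NaiveBoxClusterer {
    static class Cluster {
        double lat, lng;
        int count = 1;
        Cluster(double lat, double lng) { this.lat = lat; this.lng = lng; }
    }

    // points[i] = {lat, lng}; box is the half-width of the bounding box.
    static List<Cluster> cluster(double[][] points, double box) {
        List<Cluster> clusters = new ArrayList<>();
        for (double[] p : points) {
            Cluster home = null;
            for (Cluster c : clusters) {
                if (Math.abs(c.lat - p[0]) <= box
                        && Math.abs(c.lng - p[1]) <= box) {
                    home = c;
                    break;
                }
            }
            if (home == null) clusters.add(new Cluster(p[0], p[1]));
            else home.count++;
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[][] pts = { {52.0, 5.0}, {52.1, 5.1}, {48.8, 2.3} };
        System.out.println(cluster(pts, 0.5).size()); // two clusters
    }
}
```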

Regards,

gwk

Grant Ingersoll wrote:
> [...]



Re: Geographic clustering

Posted by Grant Ingersoll <gs...@apache.org>.
Not directly related to geo clustering, but
http://issues.apache.org/jira/browse/SOLR-769 is all about a pluggable
interface to clustering implementations. It currently has Carrot2
implemented, but the APIs are marked as experimental. I would definitely
be interested in hearing about your experience implementing your
clustering algorithm in it.

-Grant

On Sep 8, 2009, at 4:00 AM, gwk wrote:

> [...]

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search