Posted to solr-user@lucene.apache.org by "David Smiley (@MITRE.org)" <DS...@mitre.org> on 2012/08/09 06:56:02 UTC

Re: question(s) re lucene spatial toolkit aka LSP aka spatial4j

Hi! 

Sorry for the belated response; my google alerts didn't kick in for some
weird reason until you posted to the lucene dev list.


solr-user wrote
> 
> hopefully someone is using the lucene spatial toolkit aka LSP aka
> spatial4j, and can answer this question
> 
> we are using this spatial tool for doing searches.  overall, it seems to
> work very well.  however, finding documentation is difficult.
> 
> 

I'm using it ;-)

The current in-progress documentation is here:
http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4


solr-user wrote
> 
> 
> I have a couple of questions:
> 
> 1. I have a geohash field in my solr schema that contains indexed
> geographic polygon data.  I want to find all docs where that polygon
> intersects a given lat/long.  I was experimenting with returning distance
> in the resultset and with sorting by distance and found that the following
> query works.  However, I don't know what distance means in the query.  i.e.
> is it distance from point to the polygon centroid, to the closest outer
> edge of the polygon, or just a useless random value? Does anyone know??
> 
> http://solrserver:solrport/solr/core0/select?q=*:*&fq={!v=$geoq%20cache=false}&geoq=wkt_search:%22Intersects(Circle(-97.057%2047.924%20d=0.000001))%22&sort=query($geoq)+asc&fl=catchment_wkt1_trimmed,school_name,latitude,longitude,dist:query($geoq,-1),loc_city,loc_state
> 

It's from the center of the indexed shape to the center of the query shape.

At some point soonish, the score of a geo query is going to be something
like the inverse of the distance, so that it's a better relevancy metric,
which is what scores should be.  I expect some alternative means to show up
to actually get the distance in search results -- perhaps a special Solr
function query.


solr-user wrote
> 
> 2. some of the polygons, being geographic representations, are very big
> (ie state/province polygons).  when solr starts processing a spatial query
> (like the one above), I can see ("INFO: Building Cache [xxxxxx]") it fills
> in some sort of memory cache
> (org.apache.lucene.spatial.strategy.util.ShapeFieldCache) of the indexed
> polygon data.  We are encountering Java OOM issues when this occurs (even
> when we boosted the mem to 7GB). I know that some of the polygons can
> have more than 2300 points, but heavy trimming isn't really an option due
> to level of detail issues. Can we control this caching, or the indexing of
> the polygons, in any way to reduce the memory requirements??
> 

All center points get cached into memory upon first use in a score.  I'm
unsatisfied with the current details of this, which is not real-time-search
friendly and is a memory pig, since it's an ArrayList of ArrayList of
PointImpl objects.  If you have a single shape value per field, then I
suggest indexing the center point into a solr.LatLonType field for sorting,
which uses the Lucene FieldCache and will use much less memory.  Consider
making it float based to halve your memory requirements further.
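
As a rough back-of-envelope illustration of the memory point above, here is
a minimal Python sketch.  The numbers in it are assumptions, not
measurements: it assumes a field-cache style layout of two primitive arrays
(latitude and longitude) with one slot per document, and for the object-based
cache it assumes a 16-byte object header, two 8-byte doubles per point and an
8-byte list reference.  The document count is only an illustrative figure,
and per-list overhead is ignored.

```python
def field_cache_bytes(num_docs: int, bytes_per_coord: int) -> int:
    """Two primitive arrays (lat and lon), one slot per document."""
    return num_docs * 2 * bytes_per_coord


def point_object_bytes(num_points: int,
                       header_bytes: int = 16,
                       ref_bytes: int = 8) -> int:
    """One heap object per point: assumed header + two doubles + one reference."""
    return num_points * (header_bytes + 2 * 8 + ref_bytes)


if __name__ == "__main__":
    num_docs = 200_000  # illustrative figure, one center point per document
    print("double-based arrays:", field_cache_bytes(num_docs, 8), "bytes")
    print("float-based arrays: ", field_cache_bytes(num_docs, 4), "bytes")
    print("point objects:      ", point_object_bytes(num_docs), "bytes")
```

Under these assumptions the float-based layout is exactly half the size of
the double-based one, and both are several times smaller than one object per
point.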

p.s. I suggest "watching" this JIRA issue:
https://issues.apache.org/jira/browse/SOLR-3304

~ David Smiley



-----
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: http://lucene.472066.n3.nabble.com/question-s-re-lucene-spatial-toolkit-aka-LSP-aka-spatial4j-tp3997757p4000024.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: question(s) re lucene spatial toolkit aka LSP aka spatial4j

Posted by "David Smiley (@MITRE.org)" <DS...@mitre.org>.

On Aug 9, 2012, at 4:16 PM, solr-user [via Lucene] wrote:

> I didn't know how the cache got triggered and the "needScore=false" now allows some of my problem queries to finally work, and well within 2gb of mem.

needScore is an unfortunate hack in the Solr adapter to the Lucene spatial module to work around the fact that Solr only knows how to get queries from a field type, not filters.  Unlike filters, queries have scores.  For spatial, scores are expensive (lots of RAM) and you may not even want them!  Consider voting for this issue:

https://issues.apache.org/jira/browse/SOLR-2883

~ David





Re: question(s) re lucene spatial toolkit aka LSP aka spatial4j

Posted by solr-user <so...@hotmail.com>.

Thanks David.  You are a life saver.  

I didn't know how the cache got triggered and the "needScore=false" now
allows some of my problem queries to finally work, and well within 2gb of
mem.

will look at your other suggestion when I can. 

MANY thanks again.




Re: question(s) re lucene spatial toolkit aka LSP aka spatial4j

Posted by "David Smiley (@MITRE.org)" <DS...@mitre.org>.

solr-user wrote
> 
> Thanks David.  No worries about the delay; am always happy and
> appreciative when someone responds.
> 
> I don't understand what you mean by "All center points get cached into
> memory upon first use in a score" in question 2 about the Java OOM errors
> I am seeing.
> 

The underlying field type receives one internal Shape instance per WKT
string that is handed to it, no matter whether that WKT is a MultiGeometry or
not.  The center point of that shape is indexed in such a way that it can be
read into a cache later.  It doesn't matter how many vertices/coordinates
your geometries have or how many shapes exist in a single WKT
string; it results in one point given one WKT string value.  Just wanted to
be clear on that.  STNumPoints is the wrong statistic since that counts
internal coordinates, from my reading of its documentation just now. 
STNumGeometries isn't right either if your WKT uses any of the Multi* type
geometries.


solr-user wrote
> 
> The Solr instance I have set up for testing has around 200k docs, with one
> WKT field per doc (indexed and stored and set to multivalue).
> 
> I did a count of the number of points that get indexed in Solr (computed
> in MS SQL by counting the number of points (using STNumPoints) for each
> geometry (using STNumGeometries) in the WKT data I am indexing), and I
> have around 35M points total.
> 
> If only the center points for 190K docs get cached, wouldn't that easily
> fit in 7GB of heap? 
> 
> Even if Solr was caching 35M points, that still doesn't sound like 7GB
> worth of data.
> 

Yeah... the memory cache may be pig-ish but not that bad.  There's something
about the implementation that tells me there could be a bug if any of your
polygon shapes are small and/or you index at a high resolution.  Given that
you have multi-valued spatial data per document, you can't simply use
solr.LatLonType.  Try this -- create a new field called centerPoints or
something like that, and also use the same field type as for the geohash one
you are already using.  But for this one, hand Solr the center-points of
your shape data.  Hopefully it's straightforward for you to calculate this. 
Then when you do sorting by distance or need to retrieve the distance via a
dist:query(...) etc., be sure to use this field and NOT the main shape one
that has the full shape indexed.  To be sure the spatial module doesn't load
the center points for the main shape field, pass needScore=false as a Solr
local-param in your filter query for it.

Hopefully that fixes it.  If it does, there is a bug and I know what it is.
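
To make the suggestion above concrete, here is a minimal Python sketch that
builds such a request.  It reuses the wkt_search field and the Circle syntax
from the original query and the centerPoints field name suggested above.
The following are assumptions for illustration only: the helper name
build_query_url, the centerq parameter name, the center_radius value, and
the exact placement of needScore=false next to cache=false inside the
local-params of the filter query.

```python
from urllib.parse import urlencode, parse_qs


def build_query_url(base_url: str, lon: float, lat: float,
                    filter_radius: str = "0.000001",
                    center_radius: str = "1") -> str:
    """Filter on the full shapes without scoring; score on the center points.

    Radii are passed as strings so that small values are not rendered in
    exponent notation (Python would print 0.000001 as 1e-06).
    """
    shape_query = f'wkt_search:"Intersects(Circle({lon} {lat} d={filter_radius}))"'
    # Hypothetical second query against the separate center-point field.
    center_query = f'centerPoints:"Intersects(Circle({lon} {lat} d={center_radius}))"'
    params = [
        ("q", "*:*"),
        # needScore=false: assumed to keep the main shape field out of the cache.
        ("fq", "{!v=$geoq cache=false needScore=false}"),
        ("geoq", shape_query),
        ("centerq", center_query),
        ("sort", "query($centerq,-1) asc"),
        ("fl", "school_name,dist:query($centerq,-1)"),
    ]
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    url = build_query_url("http://localhost:8983/solr/core0/select", -97.057, 47.924)
    for name, values in parse_qs(url.split("?", 1)[1]).items():
        print(name, "=", values[0])
```

Documents whose center point falls outside center_radius get the default
value of query() (-1 here) rather than a distance, so that radius needs to
be wide enough to cover the centers you care about.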

~ David




Re: question(s) re lucene spatial toolkit aka LSP aka spatial4j

Posted by solr-user <so...@hotmail.com>.

Thanks David.  No worries about the delay; am always happy and appreciative
when someone responds.

I don't understand what you mean by "All center points get cached into
memory upon first use in a score" in question 2 about the Java OOM errors I
am seeing.

The Solr instance I have set up for testing has around 200k docs, with one
WKT field per doc (indexed and stored and set to multivalue).

I did a count of the number of points that get indexed in Solr (computed in
MS SQL by counting the number of points (using STNumPoints) for each
geometry (using STNumGeometries) in the WKT data I am indexing), and I have
around 35M points total.

If only the center points for 190K docs get cached, wouldn't that easily fit
in 7GB of heap? 

Even if Solr was caching 35M points, that still doesn't sound like 7GB worth
of data.


