Posted to java-user@lucene.apache.org by Sascha Fahl <sa...@googlemail.com> on 2008/07/19 11:53:33 UTC

GeoSort approach - your opinion

Hi,

last week I implemented an approach for GeoSort in Lucene. Inspired by
"Lucene in Action", I modified the algorithm in the following way. When
an IndexReader for a certain index is created, a cache for
geoinformation is created; this is simply a two-dimensional int array.
This makes it possible to cache geoinformation for 1,000,000 docs in
around 8 MB. Every time the ScoreDocComparator.compare(ScoreDoc i,
ScoreDoc j) method is called, I fetch the int array with the geoinfo
from the cache and calculate the distance.
I think this is quite a good solution:
1. Only the distances of actual hits are calculated, so only the
necessary operations are performed.
2. The geoinformation is not fetched via IndexReader.doc(i) but
directly from the cache, which is held in RAM.
3. All hits are returned, because this approach does not use a
bounding-box model that excludes documents outside a certain
radius (this is very annoying if there is a hit with a distance of 51
km and the radius is 50 km).
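
To make the idea above concrete, here is a minimal sketch of a comparison that reads both documents' coordinates from an in-memory int cache and orders them by distance to a query point. It is not the code from this thread and it does not use the Lucene API: the class name GeoDistanceSketch, the microdegree encoding of the ints, and the haversine formula are all assumptions chosen for illustration. In Lucene the compare method would be called from ScoreDocComparator.compare with the two ScoreDoc ids.

```java
import java.util.Arrays;

/** Hypothetical sketch; not part of Lucene and not the code from this thread. */
final class GeoDistanceSketch {
    private static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius

    // coordinates[0][doc] = latitude, coordinates[1][doc] = longitude,
    // both stored as microdegrees (an assumed encoding).
    private final int[][] coordinates;
    private final double originLatRad;
    private final double originLonRad;

    GeoDistanceSketch(int[][] coordinates, double originLatDeg, double originLonDeg) {
        this.coordinates = coordinates;
        this.originLatRad = Math.toRadians(originLatDeg);
        this.originLonRad = Math.toRadians(originLonDeg);
    }

    /** Great-circle distance from the query point to the document (haversine). */
    double distanceKm(int doc) {
        double lat = Math.toRadians(coordinates[0][doc] / 1e6);
        double lon = Math.toRadians(coordinates[1][doc] / 1e6);
        double sinLat = Math.sin((lat - originLatRad) / 2);
        double sinLon = Math.sin((lon - originLonRad) / 2);
        double a = sinLat * sinLat
                + Math.cos(originLatRad) * Math.cos(lat) * sinLon * sinLon;
        // Clamp to 1.0 to guard against rounding slightly above the asin domain.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /** Orders two document ids by distance, nearest first. */
    int compare(int docI, int docJ) {
        return Double.compare(distanceKm(docI), distanceKm(docJ));
    }

    public static void main(String[] args) {
        int[][] coordinates = new int[2][];
        coordinates[0] = new int[] {0, 0, 2_000_000};       // latitudes
        coordinates[1] = new int[] {1_000_000, 500_000, 0}; // longitudes
        GeoDistanceSketch sorter = new GeoDistanceSketch(coordinates, 0.0, 0.0);

        Integer[] hits = {0, 1, 2};
        Arrays.sort(hits, sorter::compare);
        System.out.println(Arrays.toString(hits));
    }
}
```

Note that this recomputes the distance on every comparison, as described above. Whether caching the computed distances per search would pay off depends on the number of hits and is not covered here.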

What do you think about this approach? The only possible drawback, I
think, is the cache, because I do not really know whether the JVM is
good at handling 10 MB of data in RAM.


Kind regards,

Sascha Fahl
sascha.fahl@gmail.com




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: GeoSort approach - your opinion

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.

On Sat, 2008-07-19 at 11:53 +0200, Sascha Fahl wrote:
> last week I implemented an approach for GeoSort in Lucene. Inspired by
> "Lucene in Action", I modified the algorithm in the following way. When
> an IndexReader for a certain index is created, a cache for
> geoinformation is created; this is simply a two-dimensional int array.
> This makes it possible to cache geoinformation for 1,000,000 docs in
> around 8 MB.

Be aware that every array object carries its own memory overhead, so
you'll want to use only 3 arrays in total and not 1000001:

int[][] coordinates = new int[2][];
coordinates[0] = new int[1000000];
coordinates[1] = new int[1000000];
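
As a rough illustration of why the layout matters, the sketch below estimates the footprint of both layouts. The 16-byte array header and 4-byte reference size are assumptions; the real values depend on the JVM and its settings, and alignment padding is ignored. The class and method names are made up for this example.

```java
/** Hypothetical back-of-the-envelope estimate; the sizes are assumptions. */
final class ArrayLayoutEstimate {
    static final long ASSUMED_ARRAY_HEADER_BYTES = 16; // JVM-dependent assumption
    static final long ASSUMED_REFERENCE_BYTES = 4;     // JVM-dependent assumption
    static final long INT_BYTES = 4;                   // fixed by the Java language

    /** int[2][docs]: one small outer array plus two long inner arrays. */
    static long threeArraysBytes(long docs) {
        long outer = ASSUMED_ARRAY_HEADER_BYTES + 2 * ASSUMED_REFERENCE_BYTES;
        long inner = ASSUMED_ARRAY_HEADER_BYTES + docs * INT_BYTES;
        return outer + 2 * inner;
    }

    /** int[docs][2]: one long outer array plus one tiny inner array per doc. */
    static long arrayPerDocBytes(long docs) {
        long outer = ASSUMED_ARRAY_HEADER_BYTES + docs * ASSUMED_REFERENCE_BYTES;
        long inner = ASSUMED_ARRAY_HEADER_BYTES + 2 * INT_BYTES;
        return outer + docs * inner;
    }

    public static void main(String[] args) {
        long docs = 1_000_000;
        System.out.println("3 arrays:      " + threeArraysBytes(docs) + " bytes");
        System.out.println("array per doc: " + arrayPerDocBytes(docs) + " bytes");
    }
}
```

Under these assumptions the three-array layout stays at about 8 MB for 1,000,000 docs, while one array per document comes to roughly 28 MB, most of it headers and references rather than coordinates.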

[...]

> What do you think about this approach?

Sounds fine when the index rarely changes.

> The only possible drawback, I think, is the cache, because I do not
> really know whether the JVM is good at handling 10 MB of data in RAM.

The Sun JVM is perfectly capable of handling large arrays efficiently.
We use an array-based structure of ints and longs, approximately
300 MB in size, for quick facet lookup.
