You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by Uwe Schindler <uw...@thetaphi.de> on 2008/11/25 20:10:35 UTC

RE: TrieRangeQuery for contrib?

Hi Mike, hi Paul,

Mike: You are right, the algorithm has one advantage and one disadvantage:

- There is not only a logarithmic bound, there is a hard upper bound. In my
algorithm with 8 precisions (so from 1 to 8 bytes length of keys) the
maximum numbers of terms to be visited is limited to 3825 terms (see the
java docs or the cited paper). The upper limit only applies, if you make a
very big range and have a lot of different homogenous distributed terms
inside the range (without the optimization). Testing with a 500,000 document
index (6 GB) with numeric values (doubles) has shown, that even with large
ranges the maximum number of terms, you mostly only get about 300 terms to
be visited. This is not related to index size!
- The index size is bigger: You store for each numeric field not only one
term in index, but eight, so index size increases. But this is neglectible
in my opinion and for large indexes the speed increase is great.
- Inside Luke, the values of such "Trie" fields are not human readable
(because of the encoding). Even when stored, the current implementation uses
the special encoding to store the field. For displaying the field you have
to use the decoder from the TrieUtils class. But this is the same with
current DateUtils from Lucene (but they are more readable :-) )
 
Comparisions with the above 500,000 doc index showed that the old RangeQuery
(with raised BooleanQuery clause count) took about 30-40 secs to complete,
ConstantScoreRangeQuery took 5 secs and TrieRangeQuery took <100ms to
complete (on an Opteron64 machine, Java 1.5). You can test a little bit on
http://www.pangaea.de/advanced/advsearch.php by entering something into the
geographic bounding box or temporal coverage). As you can see, the usage of
this range query type is optimal for geographic searches using doubles (not
fixed decimals!), longs or dates as keys.

I have no problem with including it into Lucene contrib. I just have some
questions/comments:

- Code is Apache 2.0 licensed, so it is simple to include. I would change
the package prefix, update the JavaDocs and create a contrib patch out of
it. References to commons logging can be removed (they are just for
debugging). Code is Java 1.5 (using StringBuilder), but this could easily be
changed.
- I want to be able to develop the code further once in contrib, is this
possible? How would be the best to handle this? Let the code stay in my SVN
and you update it or let me commit to the contrib folder in Lucene?
Currently the code is in SVN of panFMP (www.panfmp.org) that uses it. When
donated to Apache, I would put a dependency into panFMP to the contrib
Package, once released and remove it from my tree. I do not want to get the
code into a dead end or start a fork of it inside contrib, because I want to
actively maintain it.

My intentions for giving the code to Lucene were some questions from other
projects (from geographic information systems), to also use the optimized
range queries for such type of geo queries, e.g. GeoNetwork Opensource, also
using Lucene, is interested. Maybe Solr wants to make use of it (using
another field data type). Instead of distributing the code to different
projects, I wanted to make it available as plugin for everybody from Lucene
itself.

I would start an issue in JIRA and attach the patch.

Uwe

> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Tuesday, November 25, 2008 6:41 PM
> To: java-dev@lucene.apache.org
> Subject: Re: TrieRangeQuery for contrib? was: [jira] Commented: (LUCENE-
> 1461) Cached filter for a single term field
> 
> 
> I would love to see this find it's way into Lucene!
> 
> It puts a logarithmic bound (I think?) on the number of terms whose
> docs must be visited to implement the range query, at the expense
> of an increase in index size for those fields that need range search.
> 
> For large indexes this should make an incredible difference on
> RangeQuery performance.
> 
> Mike
> 
> Uwe Schindler wrote:
> 
> > I do not want to attach this to the issue report, just for
> > completeness:
> >
> > I implemented (based on RangeFilter) another approach for faster
> > RangeQueries, based on longs stored in index in a special format.
> >
> > The idea behind this is to store the longs in different precision in
> > index
> > and partition the query range in such a way, that the outer
> > boundaries are
> > search using terms from the highest precision, but the center of the
> > search
> > Range with lower precision. The implementation stores the longs in 8
> > different precisions (using a class called TrieUtils). It also has
> > support
> > for Doubles, using the IEEE 754 floating-point "double format" bit
> > layout
> > with some bit mappings to make them binary sortable. The approach is
> > used in
> > rather big indexes, query times are even on low performance desktop
> > computers <<100 ms (!) for very big ranges on indexes with 500000
> > docs.
> >
> > I called this RangeQuery variant and format "TrieRangeRange" query
> > because
> > the idea looks like the well-known Trie structures (but it is not
> > identical
> > to real tries, but algorithms are related to it).
> >
> > I was thinking about supplying this to Lucene contrib, any thoughts
> > about
> > it?
> >
> > More infos from the JavaDoc of the project:
> >
> http://www.panfmp.org/apidocs/de/pangaea/metadataportal/search/TrieRangeQu
> er
> > y.html
> >
> http://www.panfmp.org/apidocs/de/pangaea/metadataportal/utils/TrieUtils.ht
> ml
> >
> >
> > Source:
> >
> http://panfmp.svn.sourceforge.net/viewvc/panfmp/main/trunk/src/de/pangaea/
> me
> > tadataportal/search/TrieRangeQuery.java?view=markup
> >
> http://panfmp.svn.sourceforge.net/viewvc/panfmp/main/trunk/src/de/pangaea/
> me
> > tadataportal/utils/TrieUtils.java?view=markup
> >
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: TrieRangeQuery for contrib?

Posted by Andrzej Bialecki <ab...@getopt.org>.

Uwe Schindler wrote:

> - Inside Luke, the values of such "Trie" fields are not human readable
> (because of the encoding). Even when stored, the current implementation uses
> the special encoding to store the field. For displaying the field you have
> to use the decoder from the TrieUtils class. But this is the same with
> current DateUtils from Lucene (but they are more readable :-) )

Recent versions of Luke can present field contents using different 
decoders. I can add a DateUtils decoder, as well as TrieRange decoder 
once it becomes a Lucene contrib module.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

RE: TrieRangeQuery for contrib?

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi Mike,

> > - Inside Luke, the values of such "Trie" fields are not human readable
> > (because of the encoding). Even when stored, the current
> > implementation uses
> > the special encoding to store the field. For displaying the field
> > you have
> > to use the decoder from the TrieUtils class. But this is the same with
> > current DateUtils from Lucene (but they are more readable :-) )
> 
> These seems OK, for starters.  Eventually maybe such a "range field"
> could provide an interface that knows how to "subdivide" intervals on
> its space of all terms, assigning more human readable labels to these
> subdivisions, instead of always casting to unsigned long.

This maybe a way to encode the vaules different. The other approach would be
to simpy "store" the field value as plain human-readable string, but have
the terms for searching encoded. For my project panFMP the encoded variant
was fitted cleaner in the framework.

The interface would be more complicated to work, but maybe possible (I
haven't thought about it). To note: The encoded values in the different
precisions are not stored in different fields, all different precissions
have the same field name, which makes index maintenance easier. The values
with lower precision are prefixed by a "precision" marker that makes it
possible to distinguish them and put the Term iterator to the lower bound in
the current precision for the range. Without prefix, the values for the
different precisions would be mixed. The first version of TrieRangeQuery
used different field names, but by the prefix trick, all terms could be in
one field.

> > Comparisions with the above 500,000 doc index showed that the old
> > RangeQuery
> > (with raised BooleanQuery clause count) took about 30-40 secs to
> > complete,
> > ConstantScoreRangeQuery took 5 secs and TrieRangeQuery took <100ms to
> > complete (on an Opteron64 machine, Java 1.5). You can test a little
> > bit on
> > http://www.pangaea.de/advanced/advsearch.php by entering something
> > into the
> > geographic bounding box or temporal coverage). As you can see, the
> > usage of
> > this range query type is optimal for geographic searches using
> > doubles (not
> > fixed decimals!), longs or dates as keys.
> 
> Wow it's very fast!  I first searched for "water", which returned
> ~428K docs, then bounded it roughly around Africa and it returned ~78K
> docs, very quickly.  Now I'd really love to get this into Lucene!

Thank you! As I see, you also tested the map :)

> > - I want to be able to develop the code further once in contrib, is
> > this
> > possible? How would be the best to handle this? Let the code stay in
> > my SVN
> > and you update it or let me commit to the contrib folder in Lucene?
> > Currently the code is in SVN of panFMP (www.panfmp.org) that uses
> > it. When
> > donated to Apache, I would put a dependency into panFMP to the contrib
> > Package, once released and remove it from my tree. I do not want to
> > get the
> > code into a dead end or start a fork of it inside contrib, because I
> > want to
> > actively maintain it.
> 
> I think for starters open an issue, attach a patch, and then we
> iterate from there?  Probably having the code in Apache's SVN, with
> the eventual goal of giving you commit rights to contrib, is what we
> should aim for?

I think, this would be good. Give me some time, to prepare the patch and we
discuss it then in JIRA.

Thanks,
Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: TrieRangeQuery for contrib?

Posted by Michael McCandless <lu...@mikemccandless.com>.

Uwe Schindler wrote:

> Hi Mike, hi Paul,
>
> Mike: You are right, the algorithm has one advantage and one  
> disadvantage:
>
> - There is not only a logarithmic bound, there is a hard upper  
> bound. In my
> algorithm with 8 precisions (so from 1 to 8 bytes length of keys) the
> maximum numbers of terms to be visited is limited to 3825 terms (see  
> the
> java docs or the cited paper). The upper limit only applies, if you  
> make a
> very big range and have a lot of different homogenous distributed  
> terms
> inside the range (without the optimization). Testing with a 500,000  
> document
> index (6 GB) with numeric values (doubles) has shown, that even with  
> large
> ranges the maximum number of terms, you mostly only get about 300  
> terms to
> be visited. This is not related to index size!

Awesome!

> - The index size is bigger: You store for each numeric field not  
> only one
> term in index, but eight, so index size increases. But this is  
> neglectible
> in my opinion and for large indexes the speed increase is great.

I agree this should be a very small cost, especially because it's only  
the range-tokens, often one per doc per ranged-field, that have  
increased storage.  I would imagine in practice it's quite low.

> - Inside Luke, the values of such "Trie" fields are not human readable
> (because of the encoding). Even when stored, the current  
> implementation uses
> the special encoding to store the field. For displaying the field  
> you have
> to use the decoder from the TrieUtils class. But this is the same with
> current DateUtils from Lucene (but they are more readable :-) )

These seems OK, for starters.  Eventually maybe such a "range field"  
could provide an interface that knows how to "subdivide" intervals on  
its space of all terms, assigning more human readable labels to these  
subdivisions, instead of always casting to unsigned long.

> Comparisions with the above 500,000 doc index showed that the old  
> RangeQuery
> (with raised BooleanQuery clause count) took about 30-40 secs to  
> complete,
> ConstantScoreRangeQuery took 5 secs and TrieRangeQuery took <100ms to
> complete (on an Opteron64 machine, Java 1.5). You can test a little  
> bit on
> http://www.pangaea.de/advanced/advsearch.php by entering something  
> into the
> geographic bounding box or temporal coverage). As you can see, the  
> usage of
> this range query type is optimal for geographic searches using  
> doubles (not
> fixed decimals!), longs or dates as keys.

Wow it's very fast!  I first searched for "water", which returned  
~428K docs, then bounded it roughly around Africa and it returned ~78K  
docs, very quickly.  Now I'd really love to get this into Lucene!

> I have no problem with including it into Lucene contrib. I just have  
> some
> questions/comments:
>
> - Code is Apache 2.0 licensed, so it is simple to include. I would  
> change
> the package prefix, update the JavaDocs and create a contrib patch  
> out of
> it. References to commons logging can be removed (they are just for
> debugging). Code is Java 1.5 (using StringBuilder), but this could  
> easily be
> changed.

Contrib code can be Java 1.5.

> - I want to be able to develop the code further once in contrib, is  
> this
> possible? How would be the best to handle this? Let the code stay in  
> my SVN
> and you update it or let me commit to the contrib folder in Lucene?
> Currently the code is in SVN of panFMP (www.panfmp.org) that uses  
> it. When
> donated to Apache, I would put a dependency into panFMP to the contrib
> Package, once released and remove it from my tree. I do not want to  
> get the
> code into a dead end or start a fork of it inside contrib, because I  
> want to
> actively maintain it.

I think for starters open an issue, attach a patch, and then we  
iterate from there?  Probably having the code in Apache's SVN, with  
the eventual goal of giving you commit rights to contrib, is what we  
should aim for?

> My intentions for giving the code to Lucene were some questions from  
> other
> projects (from geographic information systems), to also use the  
> optimized
> range queries for such type of geo queries, e.g. GeoNetwork  
> Opensource, also
> using Lucene, is interested. Maybe Solr wants to make use of it (using
> another field data type). Instead of distributing the code to  
> different
> projects, I wanted to make it available as plugin for everybody from  
> Lucene
> itself.

I agree, this should be in Lucene.

> I would start an issue in JIRA and attach the patch.

Excellent!

Mike


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org