Posted to dev@lucene.apache.org by Uwe Schindler <uw...@thetaphi.de> on 2009/04/13 17:40:53 UTC

Move TrieRange to Core/Module and integration issues

Hi,

this has been discussed many times on this list, but we never reached a
decision on whether TrieRange should be included in the core or not.

After thinking about it and looking at the latest developments around
TrieRange (TokenStreams for indexing), I plan to do the following:

a) Put the classes into the correct packages:
- (Int|Long)TrieRangeFilter into o.a.l.search, maybe under the new name
(Int|Long)NumericRangeFilter, or possibly both in one class
NumericRangeFilter (this is possible; the only problem is that you would
have two otherwise identical ctors taking long or int, and auto-casting in
the compiler can do bad things...)
- (Int|Long)TrieTokenStream into o.a.l.analysis as NumericTokenStream (same
note as above)
- ShiftAttribute into o.a.l.analysis.tokenattributes
- TrieUtils as the new NumberUtils in a place yet to be decided:
o.a.l.utils? o.a.l.document?
- The TrieValueSource for LUCENE-831 would move to o.a.l.search (see the
patch there)
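
To illustrate the ctor concern in the first item above, here is a minimal
sketch in plain Java. NumericRangeSketch and its valueBits field are
hypothetical stand-ins invented for this example, not Lucene classes; the
point is only how overload resolution picks between an int and a long ctor.

```java
// Hypothetical stand-in for a combined NumericRangeFilter; not Lucene API.
final class NumericRangeSketch {
    // Records which ctor ran: 32 for the int ctor, 64 for the long ctor.
    final int valueBits;

    NumericRangeSketch(int min, int max) {
        this.valueBits = 32;
    }

    NumericRangeSketch(long min, long max) {
        this.valueBits = 64;
    }

    public static void main(String[] args) {
        int lower = 10;
        // Both arguments are int, so the int ctor is selected.
        System.out.println(new NumericRangeSketch(lower, 20).valueBits);
        // A single long argument widens the other one and silently
        // selects the long ctor instead.
        System.out.println(new NumericRangeSketch(lower, 20L).valueBits);
    }
}
```

If the two ctors index or search with different term encodings, a stray
long literal would change behavior without any compiler warning, which is
one argument for keeping separate Int/Long classes or explicit factory
methods.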

b) Make NumericRangeQuery (which does not yet exist as its own class) a
subclass of the new MultiTermQuery. This gives you the ConstantScore and
BooleanQuery rewrites and the Filter for free. To enable this, I must create
a Numeric/TrieRangeTermEnum, for which I propose some changes:

MultiTermQuery's protected getEnum() method is declared to return
FilteredTermEnum. For TrieRange, the return type should be changed to
TermEnum; a FilteredTermEnum is not needed (FilteredTermEnum is only one
implementation, so the method should return the abstract TermEnum). If this
is fixed, I can write a NumericRangeTermEnum extends TermEnum that
enumerates the terms of all sub-ranges. This is not possible with
FilteredTermEnum, so it must be its "own" extension. FilteredTermEnum could
be used if it were possible to access its inner enum and term members
(currently private), but that would be a completely "unclean hack". The
NumericRangeTermEnum would get the range bounds from
TrieUtils.RangeBuilder, and its next() method would return the terms,
automatically skipping to the correct term whenever the range changes (by
getting a new enum from the IndexReader, until TermEnum.skipTo() performs
well).
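
The enumeration described above can be sketched roughly as follows. This is
not Lucene code: SubRangeTermEnumSketch is a hypothetical name, a sorted
NavigableSet of strings stands in for the term dictionary, and tailSet()
stands in for getting a fresh enum from the IndexReader positioned at the
lower bound of the next sub-range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical sketch of an enum that walks several sub-ranges in order.
final class SubRangeTermEnumSketch {
    private final NavigableSet<String> dictionary;
    // Each entry is {lowerInclusive, upperInclusive}, in ascending order.
    private final Iterator<String[]> ranges;
    private Iterator<String> current;
    private String upper;
    private String term;

    SubRangeTermEnumSketch(NavigableSet<String> dictionary, List<String[]> ranges) {
        this.dictionary = dictionary;
        this.ranges = ranges.iterator();
    }

    /** Advances to the next matching term; false once all sub-ranges are done. */
    boolean next() {
        while (true) {
            if (current != null && current.hasNext()) {
                String candidate = current.next();
                if (candidate.compareTo(upper) <= 0) {
                    term = candidate;
                    return true;
                }
            }
            // The current sub-range is exhausted: seek to the next lower bound.
            if (!ranges.hasNext()) {
                term = null;
                return false;
            }
            String[] range = ranges.next();
            upper = range[1];
            current = dictionary.tailSet(range[0], true).iterator();
        }
    }

    String term() {
        return term;
    }

    /** Convenience helper that drains the enum into a list. */
    static List<String> collect(NavigableSet<String> dictionary, List<String[]> ranges) {
        SubRangeTermEnumSketch termEnum = new SubRangeTermEnumSketch(dictionary, ranges);
        List<String> out = new ArrayList<>();
        while (termEnum.next()) {
            out.add(termEnum.term());
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableSet<String> dictionary =
            new TreeSet<>(Arrays.asList("a", "b", "c", "d", "e", "f", "g"));
        List<String[]> ranges =
            Arrays.asList(new String[] {"b", "c"}, new String[] {"e", "f"});
        System.out.println(collect(dictionary, ranges)); // prints [b, c, e, f]
    }
}
```

The caller only ever sees next() and term(), so the jump from one sub-range
to the next stays an internal detail of the enum.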


Any thoughts? How to proceed with TrieRange?

Something else: How about storing the "type" information in FieldInfos and
inventing an AbstractField subclass for numbers (NumberField) that returns
the TrieTokenStream in tokenStream()? This could help people with indexing.
When searching, query parsers could use the information to construct the
right queries, sorting would automatically choose the right
ValueSource/Parser, and so on.
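
As a rough illustration of that idea, the sketch below records a type per
field once and lets search-side code consult it. Everything here is
hypothetical (FieldTypeSketch, FieldType, and the returned labels are made
up for this example); it does not reflect how FieldInfos works.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-field type registry; not Lucene's FieldInfos.
final class FieldTypeSketch {
    enum FieldType { TEXT, INT, LONG }

    private final Map<String, FieldType> types = new HashMap<>();

    /** Records the type once, e.g. at indexing time. */
    void declare(String field, FieldType type) {
        types.put(field, type);
    }

    /** Picks a range query kind from the recorded type; undeclared fields count as TEXT. */
    String rangeQueryKind(String field) {
        FieldType type = types.getOrDefault(field, FieldType.TEXT);
        switch (type) {
            case INT:
                return "numeric-range(int)";
            case LONG:
                return "numeric-range(long)";
            default:
                return "term-range";
        }
    }

    public static void main(String[] args) {
        FieldTypeSketch schema = new FieldTypeSketch();
        schema.declare("price", FieldType.LONG);
        System.out.println(schema.rangeQueryKind("price")); // prints numeric-range(long)
        System.out.println(schema.rangeQueryKind("title")); // prints term-range
    }
}
```

A query parser or sort setup could make the same lookup, so the application
would declare the type in one place only.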

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


RE: Move TrieRange to Core/Module and integration issues

Posted by Uwe Schindler <uw...@thetaphi.de>.
> > For me it would not be a problem, I would use a FilteredTermEnum and
> > subclass it, but would only implement next() and the other abstract methods
> > would be dummies (including difference() returning 1.0f). Only the enum and
> > the term should have a protected access or a getter in this class.
> 
> Seems like this is simplest (relax FilteredTermEnum so that it could
> be extended, and then you can subclass it with dummies)?

I will prepare a patch that makes the members protected instead of private,
with appropriate javadocs, and then I can go on with a first implementation
of the TermEnum. I will do all of this in the contrib area with the current
package name. Moving/renaming can be done later (but hopefully before 2.9).

> This sounds like a great step forward overall; it's nice to have all
> queries that are based on multiple terms share MultiTermQuery.
> 
> Moving things to the proper sub-packages, and renaming, also makes
> alot of sense.
> 
> Re core or module or contrib, it's still being discussed under
> uber-thread "Modularization".

I wanted to direct attention to that again :-)

> > Something other: How about storing the "type" information in FieldInfos and
> > invent a AbstractField subclass for numbers (NumberField) returning the
> > TrieTokenStream in tokenSteam()? This could help people to index. When
> > searching, query parsers could use the information and construct the right
> > queries, sorting would automatically choose the right ValueSource/Parser and
> > so on.
> 
> I would love to do something along these lines (LUCENE-1597 is also
> exploring better typed fields/documents).
> 
> Once FieldInfos can properly store the fact that a given field has
> NumericType (which'd have options to turn on sorting, range filtering,
> etc.), then we could default many things properly without requiring
> the app to do "per field" things in N different places.

Yes. Just for comparison: this is exactly how it is done in Solr (with the
schema for the index in XML format). Maybe this can serve as a reference for
what such a schema could look like. The important things in the schema for a
field are the type, how to get a ValueSource (both for function queries and
for sorting), and how to convert the type to Java objects (toObject() in
Solr returns e.g. java.lang.Integer and so on). But Solr's schema has too
many things compared to Lucene, like analyzers and so on. It should be
stripped down to the important things.

Uwe




Re: Move TrieRange to Core/Module and integration issues

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Mon, Apr 13, 2009 at 12:05 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

> For me it would not be a problem, I would use a FilteredTermEnum and
> subclass it, but would only implement next() and the other abstract methods
> would be dummies (including difference() returning 1.0f). Only the enum and
> the term should have a protected access or a getter in this class.

Seems like this is simplest (relax FilteredTermEnum so that it could
be extended, and then you can subclass it with dummies)?

This sounds like a great step forward overall; it's nice to have all
queries that are based on multiple terms share MultiTermQuery.

Moving things to the proper sub-packages, and renaming, also makes
a lot of sense.

Re core or module or contrib, it's still being discussed under
uber-thread "Modularization".

> Something other: How about storing the "type" information in FieldInfos and
> invent a AbstractField subclass for numbers (NumberField) returning the
> TrieTokenStream in tokenSteam()? This could help people to index. When
> searching, query parsers could use the information and construct the right
> queries, sorting would automatically choose the right ValueSource/Parser and
> so on.

I would love to do something along these lines (LUCENE-1597 is also
exploring better typed fields/documents).

Once FieldInfos can properly store the fact that a given field has
NumericType (which'd have options to turn on sorting, range filtering,
etc.), then we could default many things properly without requiring
the app to do "per field" things in N different places.

Mike



RE: Move TrieRange to Core/Module and integration issues

Posted by Uwe Schindler <uw...@thetaphi.de>.
> > MultiTermQuery has in its protected getEnum() returning FilteredTermEnum.
> > For TrieRange, the return should be changed to TermEnum, it is not needed to
> > have a FilteredTermEnum (FilteredTermEnum is only an implementation, the
> > method should return an abstract TermEnum). If this is fixed, I can write a
> > NumericRangeTermEnum extends TermEnum, that enumerates the terms for all
> > sub-ranges (with FilteredTermEnum this is not possible), so it must be a
> > "own" extension. FilteredTermEnum could be used if it would be possible to
> > access the inner enum and term members (currently private), but this would
> > be a completely "unclean hack".
> Have you considered how to fix this? Fuzzy is what expects the
> FilteredTermEnum - it could just be changed to cast though, but we still
> have a back compat issue changing that method. I think we'd have to
> deprecate and add another call? TrieRange could throw USOE with the old
> enum?

Ahhh, I forgot the difference() method.

That would not be a problem for me: I would subclass FilteredTermEnum, but
implement only next() for real; the other abstract methods would be dummies
(including difference() returning 1.0f). For that, only the enum and the
term need protected access or a getter in this class.
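
A minimal sketch of that approach, using a self-contained stand-in rather
than the real Lucene class: FilteredEnumStandIn and PassThroughEnumSketch
are hypothetical names, strings replace Term, and an Iterator replaces the
wrapped TermEnum. The method names termCompare() and endEnum() are
assumptions for this example; only next() and difference() are named in
this thread.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in for a filtered enum whose members are protected.
abstract class FilteredEnumStandIn {
    protected Iterator<String> actualEnum; // protected instead of private
    protected String currentTerm;          // protected instead of private

    protected abstract boolean termCompare(String term);
    public abstract float difference();
    protected abstract boolean endEnum();

    /** Default behavior: return the next term accepted by termCompare(). */
    public boolean next() {
        while (actualEnum != null && actualEnum.hasNext()) {
            String candidate = actualEnum.next();
            if (termCompare(candidate)) {
                currentTerm = candidate;
                return true;
            }
            if (endEnum()) {
                break;
            }
        }
        currentTerm = null;
        return false;
    }

    public String term() {
        return currentTerm;
    }
}

// Subclass that overrides next() and leaves the rest as dummies.
final class PassThroughEnumSketch extends FilteredEnumStandIn {
    PassThroughEnumSketch(Iterator<String> terms) {
        this.actualEnum = terms;
    }

    @Override
    public boolean next() {
        // Own iteration logic; relies on the protected members above.
        if (actualEnum.hasNext()) {
            currentTerm = actualEnum.next();
            return true;
        }
        currentTerm = null;
        return false;
    }

    @Override
    protected boolean termCompare(String term) {
        return true; // dummy, never consulted by the overridden next()
    }

    @Override
    public float difference() {
        return 1.0f; // dummy, as suggested above
    }

    @Override
    protected boolean endEnum() {
        return false; // dummy
    }

    public static void main(String[] args) {
        PassThroughEnumSketch termEnum =
            new PassThroughEnumSketch(Arrays.asList("x", "y").iterator());
        while (termEnum.next()) {
            System.out.println(termEnum.term() + " " + termEnum.difference());
        }
    }
}
```

The subclass compiles only because the two members are protected, which is
the small relaxation under discussion.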

> I think thats worth fixing in either case.

I do not know how to do it correctly.

Uwe




Re: Move TrieRange to Core/Module and integration issues

Posted by Mark Miller <ma...@gmail.com>.
Uwe Schindler wrote:
> MultiTermQuery has in its protected getEnum() returning FilteredTermEnum.
> For TrieRange, the return should be changed to TermEnum, it is not needed to
> have a FilteredTermEnum (FilteredTermEnum is only an implementation, the
> method should return an abstract TermEnum). If this is fixed, I can write a
> NumericRangeTermEnum extends TermEnum, that enumerates the terms for all
> sub-ranges (with FilteredTermEnum this is not possible), so it must be a
> "own" extension. FilteredTermEnum could be used if it would be possible to
> access the inner enum and term members (currently private), but this would
> be a completely "unclean hack". 
Have you considered how to fix this? Fuzzy is what expects the
FilteredTermEnum; it could simply be changed to cast, but we would still
have a back-compat issue from changing that method. I think we'd have to
deprecate it and add another call? TrieRange could throw USOE
(UnsupportedOperationException) with the old enum?

I think that's worth fixing in either case.

-- 
- Mark

http://www.lucidimagination.com



