You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Ype Kingma <yk...@xs4all.nl> on 2003/12/07 23:16:30 UTC
Multilevel indexing (was: Position increment)

On Friday 05 December 2003 18:26, Doug Cutting wrote:
...
>
> Speaking of sentence/paragraph boundaries: Following is a proposal I
> just wrote.  It also includes the PhraseQuery change I alluded to in my
> message yesterday to Erik.  If this proposal is accepted, then I'll have
> funding to develop it ASAP.
>
> Does this sound like a reasonable approach to the problem?
>
> It adds no costs for folks who don't use the new features.  When folks
> do use it, their index gets larger, but sentence- and paragraph- and
> section-type searches are exact and as efficient as ordinary phrase
> searches.
>
> Using position-increment is still possible: one can, e.g., increment by
> 100 for sentence, 1000 for paragraph and 10,000 for section.  This does
> not increase index size nearly as much, but searches are not exact, and
> are slightly slower, since they must specify non-zero slop (slop=0
> queries use a faster algorithm).
>
> Which approach is preferred?  Would it be confusing to support both?

I'd prefer to have support for both. To avoid confusion the tradeoffs
(time/space) could be mentioned in the API docs.
The main advantage is that it allows for configurable distance search
at the speed of current phrase search. For me, that is definitely worth adding.

>
> Note that the relative term position in PhraseQuery feature is not
> dependent on the levels feature, and could be easily added right now.
>
> I hope this is not too confusing...

Not at all. The only thing that I don't understand good enough is how
the search for normal phrases (with and without slop and truncation)
is done, but I guess there is only one  way to find out: the source code
and some experimentation.

> Doug
>
> ------
>
> MATCH WITHIN SENTENCE, PARAGRAPH, SECTION, ETC.
>
> The token class has a field called "type", which is used by analyzers
> to distinguish lexical types, but is currently ignored when indexing.
> We change this so that token types can name "levels".  Each token type
> can be declared to define a unique level.  There is also a default
> level for all undeclared token types.  When indexing a token stream, a
> position counter is maintained for each level.  Tokens at non-default
> levels are not indexed, and simply increment their counter.  Tokens at
> the default level are both indexed and increment the default counter.
>
> Consider the following example.
>
> Types "sent" and "para" are declared as levels.  Types "word" and
> "abbr" are undeclared, and hence default.
>
> The first two lines below are a sample sequence of token texts and
> types.  The next two lines are the values of the counters when
> indexing this sequence of tokens.
>
> Tokens:
>
>    text:    The start    .  The  USA         .  The  end    .
>    type:   word  word sent word abbr para sent word word sent
>
> Counters:
>
>    default:   0      1   1    2   3     3    3    4    5    5
>    sent:      0      0   0    1   1     1    1    2    2    2
>    para:      0      0   0    0   0     0    1    1    1    1
>
> Thus the word "the" appears at default level positions 0, 2, and 4,
> and at "sent" level positions 0, 1, and 2, and at para level positions
> 0, 0, and 1.

I'm trying to let this get through to me: every token would have a 
one or more positions at each level. That's quite general, and
could be very powerful.

>
> The following method is used to declare a level:
>
>    IndexWriter.addLevel(String field, String type);

As the type of level is not a number, am I correct to assume that
the levels are not necessarily hierarchical? Eg. in the example
would it be possible that the sentence nr is not increased 
but the paragraph nr is increased at a token?

I would certainly not mind having the possibities that this non strict
hierarchy opens, eg. addition of more linguistic information
in the index. A level could function as an attribute for a token,
eg. noun, verb etc.. However, this might also require a
proximity search over different levels. 
I'd like to hear some comment from a linguist on this, as it might
help to get better results in searching for longer phrases
in which the level information would be provided by a
document analyzer and a query parser.

>
> Note that this must be performed each time an IndexWriter is
> constructed.  If it is performed when some documents are added but not
> when others are, or if indexes are merged with different positioned
> tokens, then all positions will be zero for levels undeclared when
> indexing.
>
> An index maintains, for each field, the known levels, accessed through:
>
>    public String[] IndexReader.positionedLevels(String field);
>
> The stream of positions at each level is accessed with:
>
>    public TermPositions IndexReader.getPositions(Term term, String level);
>
> Note that the index size increases substantially for each level added.
> Each additional level adds around 50% to the default-only index size.
> Thus an index with two additional levels would require around 200% the
> size of an index with no additional levels.
>
> PhraseQuery's constructor takes a level for the match:
>
>    public PhraseQuery(String level);
>
> Phrase queries are also permitted to explicitly specify the position
> of each term in the phrase.  For example, if two phrase query terms
> are both at position=0, with phrase slop=0, then they must occur at
> exactly the same positions.  And if they're at position=0 and
> position=1 respectively, with phrase slop=0, then they must occur
> adjacently, as an exact phrase.
>
>    public PhraseQuery.add(Term term, int position);
>
> If, for example, sentence boundaries are at level=sent, then a phrase
> query at level=sent with slop=0 and position=0 for all terms would
> require that all query terms are in the same sentence.

Would there be a way to relate the scores of two phrase queries
done at different levels? Eg. for a query that requires two terms to be
in the same sentence, one could use half of the avarage sentence length
as the edit distance in sloppyFreq(), but this would require
storing the nr of sentences, ie. sth like the sentence norm.
I guess this question boils down to whether a norm would be stored for
each level. Also, for non strict hierarchical levels, the norm would
not be a trivial thing.

And how about scoring for interlevel search, if that would be possible?
Interlevel search might also require that all levels for a
document are stored close to each other in the proximity index.
From your earlier remark:
> It adds no costs for folks who don't use the new features.  When folks
> do use it, their index gets larger, but sentence- and paragraph- and
> section-type searches are exact and as efficient as ordinary phrase
> searches.
I got the impression that each level would be stored separately.
However, for levels that are more attribute like (eg. noun/verb)
it might be better to give them the same norm, and store them
close to each other in order to allow interlevel search.

Anyway, I think this multilevel indexing is a great idea even 
for strict hierarchical levels, but these don't need a name,
a level number (or even only the number of levels) will do.

Kind regards,
Ype Kingma


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org