Posted to java-user@lucene.apache.org by Rene Hackl-Sommer <re...@gmx.de> on 2010/03/15 10:03:26 UTC

Increase number of available positions?

Hello,

I am working on a use case that is very demanding regarding the number 
of token positions. For one special field in the index, I need to 
represent different hierarchy levels, like this:

<MyField>
<Level_1>
<Level_2>
<Level_3>

Please note that I need to do this with Lucene, not an XML search engine.

Now, on Level_3 there are hundreds of tokens, Level_2 also has hundreds of 
entries and Level_1 is in there with a low 3-digit figure. For those who 
wish to know: this is an intricate system of chemical entities and some 
of their properties.

I need this information to be searchable in all conceivable ways. What I 
am doing right now is using position increment gaps to separate the Levels 
and searching with SpanQueries. It works like a charm for a setup with 
limited entries. But Integer.MAX_VALUE puts a cap on this approach, of 
course. Would it be feasible to replace the current int-based position 
counting with a long-based one? What issues should I consider?
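
To make the Integer.MAX_VALUE concern concrete, here is a rough back-of-the-envelope sketch in plain Java (no Lucene dependency). Everything in it is an assumption for illustration: the class and method names are hypothetical, the counts are made up to match "hundreds" and "low 3-digit", and the rule that the gap between sibling blocks is as wide as one block is just one plausible way to let a span query's slop distinguish "same block" from "different block".

```java
public class PositionBudgetSketch {

    /**
     * Positions consumed by a nested layout (hypothetical model).
     * fanOut[0] is the number of tokens in an innermost block; each later
     * entry is the number of child blocks per parent block.
     * Assumption: the gap between sibling blocks equals the width of one
     * block, so tokens in different blocks are always farther apart than
     * tokens inside the same block.
     */
    static long requiredPositions(long[] fanOut) {
        long width = fanOut[0];
        for (int level = 1; level < fanOut.length; level++) {
            long children = fanOut[level];
            long gap = width; // assumed gap size, not taken from a real index
            // children blocks plus (children - 1) gaps between them
            width = Math.addExact(Math.multiplyExact(children, width),
                                  Math.multiplyExact(children - 1, gap));
        }
        return width;
    }

    public static void main(String[] args) {
        // Hypothetical counts: tokens per Level_3, Level_3 per Level_2,
        // Level_2 per Level_1, Level_1 per field value.
        long[] fanOut = {500, 500, 150, 50};
        long needed = requiredPositions(fanOut);
        System.out.println("positions needed: " + needed);
        System.out.println("fits in int:      " + (needed <= Integer.MAX_VALUE));
    }
}
```

With these made-up counts the total comes to roughly 14.8 billion positions, well beyond Integer.MAX_VALUE (2,147,483,647), because each extra nesting level multiplies the width of the level below it.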

Thanks,
Rene

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Increase number of available positions?

Posted by Erick Erickson <er...@gmail.com>.

Not quite what I had in mind, more like
level1-1/level2-1/level3-1/Term1 level1-1/level2-1/level3-1/Term2
level1-1/level2-1/level3-2/Term3 level1-1/level2-1/level3-2/Term4

With an increment gap of 100 and an analyzer that splits on slashes, the term
positions would be something like:

term pos   term
0          level1-1
1          level2-1
2          level3-1
3          Term1
104        level1-1
105        level2-1
106        level3-1
107        Term2
208        level1-1
209        level2-1
210        level3-2
211        Term3
312        level1-1
313        level2-1
314        level3-2
315        Term4

As you see, a lot of repetition, but perhaps acceptable...

Or, you could choose an analyzer that didn't break up the terms
(although this would make your index somewhat bigger due to
more unique terms).

term pos   term
0          level1-1/level2-1/level3-1/Term1
101        level1-1/level2-1/level3-1/Term2
202        level1-1/level2-1/level3-2/Term3
303        level1-1/level2-1/level3-2/Term4

Although I don't know if you really need an increment gap here.....
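
The arithmetic behind the two tables above can be sketched in plain Java without any Lucene classes. This is only a model of the numbering, assuming each token advances the position by 1 and the gap is added once between consecutive values of the field; the class name, method name and splitOnSlash flag are hypothetical, and a real analyzer may assign positions differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SlashPositionsSketch {

    /**
     * Returns "position term" strings for a multi-valued field.
     * Assumed model: every token has a position increment of 1, and
     * `gap` extra positions are added between consecutive values.
     */
    static List<String> positions(List<String> values, int gap, boolean splitOnSlash) {
        List<String> out = new ArrayList<>();
        int pos = -1; // so that the very first token lands on position 0
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                pos += gap; // increment gap between field values
            }
            String value = values.get(i);
            // Either break the path into its levels or keep it as one term.
            String[] terms = splitOnSlash ? value.split("/") : new String[] {value};
            for (String term : terms) {
                pos += 1;
                out.add(pos + " " + term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "level1-1/level2-1/level3-1/Term1",
                "level1-1/level2-1/level3-1/Term2",
                "level1-1/level2-1/level3-2/Term3",
                "level1-1/level2-1/level3-2/Term4");
        System.out.println("split on slashes:");
        positions(values, 100, true).forEach(System.out::println);
        System.out.println("whole terms:");
        positions(values, 100, false).forEach(System.out::println);
    }
}
```

Passing true for splitOnSlash corresponds to the first table and false to the second, which makes it easy to try other gap sizes.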

The latter would make gathering all the documents with specific levels
easier, although the former would also work if you didn't need partial
terms (wildcards inside of phrases are new, see JIRA issue
LUCENE-1486, ComplexPhraseQueryParser).

Best
Erick

On Mon, Mar 15, 2010 at 5:09 PM, Rene Hackl-Sommer <re...@gmx.de> wrote:

> Hi Erick,
>
>> What about indexing
>> the triplets with a small increment gap between? That is:
>> ...
>>
>> gets indexed as:
>>
>> level1-1/level2-1/level3-1  +gap 100
>> level1-1/level2-1/level3-2  +gap 100
>> level1-1/level2-2/level3-3  +gap 100
>> level1-1/level2-2/level3-4
>>
>>
>
> If I understand this correctly, the field would look like
> "level1-1/level2-1/level3-1 Term1 Term2 level1-1/level2-1/level3-2 Term3
> Term4 "?
>
> I think the problem here is the same as in the Payloads approach I described
> in my response to Steve's mail. We cannot test for equality at search
> time (please correct me if we actually can do this). So if we have
>
>
> level1-1/level2-1/level3-1
> ...
> level1-1/level2-1/level3-244
> level1-1/level2-2/level3-1
> level1-1/level2-2/level3-105
>
> and I search for T1 and T2 on level3, but want them to be in the same
> level2, this cannot be done satisfactorily.
>
>
>> Or you could think about *documents* being your level1, that is each
>> document has one and only one level1 element but many documents
>> may have the same level1 token. Combining this with your increment
>> gap notion for level2-3 might work for you.
>>
>>
>
> I was thinking about this, yet the trouble is that the issue at hand is
> just one field in an already not quite trivial scenario involving 200+
> fields. If I add say 50 level1-documents per real document, I would still
> need to be able to relate these level1-documents to the real documents to
> which they belong, and, during retrieval, there are use cases where I need
> to look into each of the level1-documents to see if they fulfill certain
> criteria and then, in a further step, ascertain whether I can gather the
> needed level1-documents to fulfill the query on a "MyField"-Level (not
> existent here per se). I feel this might get somewhat unwieldy.
>
>
>> You might also search the list for "hierarchical" or "tree" indexing,
>> this is a variant of such I think.
>>
>>
>
> Thank you, I'll look into this.
>
>
> Cheers
> Rene