Posted to java-user@lucene.apache.org by Rene Hackl-Sommer <re...@gmx.de> on 2010/03/15 10:03:26 UTC

Increase number of available positions?

Hello,

I am working on a use case that is very demanding regarding the number 
of token positions. For one special field in the index, I need to 
represent different hierarchy levels, like this:

<MyField>
<Level_1>
<Level_2>
<Level_3>

Please note that I need to do this with Lucene, not an XML search engine.

Now, on Level_3 there are hundreds of tokens, Level_2 also has hundreds of 
entries and Level_1 is in there with a low 3-digit figure. For those who 
wish to know: this is an intricate system of chemical entities and some 
of their properties.

I need this information to be searchable in all conceivable ways. What I 
am doing right now is use position increment gaps to separate the Levels 
and search with SpanQueries. It works like a charm for a setup with 
limited entries. But Integer.MAX_VALUE poses a cap on the approach, of 
course. Would it be feasible to replace the current integer counting 
system with a long-based system? What issues should I consider?

Thanks,
Rene

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Increase number of available positions?

Posted by Erick Erickson <er...@gmail.com>.
Not quite what I had in mind, more like
level1-1/level2-1/level3-1/Term1 level1-1/level2-1/level3-1/Term2
level1-1/level2-1/level3-2/Term3 level1-1/level2-1/level3-2/Term4

With an increment gap of 100 and an analyzer that splits on slashes, the term
positions would be something like:

pos      term
0        level1-1
1        level2-1
2        level3-1
3        Term1
104      level1-1
105      level2-1
106      level3-1
107      Term2
208      level1-1
209      level2-1
210      level3-2
211      Term3
312      level1-1
313      level2-1
314      level3-2
315      Term4

As you see, a lot of repetition, but perhaps acceptable...
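
The position arithmetic in the table above can be reproduced with a small stand-alone sketch. It is plain Java, not Lucene code: the class name GapPositions and its method are hypothetical, and it assumes that the first token sits at position 0, that every token advances the position by 1, and that the gap is added once between consecutive field values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: mimics how positions advance
// across several values of one field when a position increment gap is set.
final class GapPositions {

    /** Returns the position of every token, splitting each value on '/'. */
    static List<Integer> positions(String[] values, int gap) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;                       // so that the first token lands on 0
        for (int v = 0; v < values.length; v++) {
            if (v > 0) {
                pos += gap;                 // gap applied between field values
            }
            for (String token : values[v].split("/")) {
                pos += 1;                   // normal increment of 1 per token
                out.add(pos);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] values = {
            "level1-1/level2-1/level3-1/Term1",
            "level1-1/level2-1/level3-1/Term2"
        };
        System.out.println(positions(values, 100));
    }
}
```

With four tokens per value and a gap of 100, each value starts 104 positions after the previous one, which is the step seen in the table.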

Or, you could choose an analyzer that didn't break up the terms
(although this would make your index somewhat bigger due to
more unique terms).
pos      term
0        level1-1/level2-1/level3-1/Term1
101      level1-1/level2-1/level3-1/Term2
202      level1-1/level2-1/level3-2/Term3
303      level1-1/level2-1/level3-2/Term4

Although I don't know if you really need an increment gap here.....

The latter would make gathering all the documents with specific levels
easier, although the former would also work if you didn't need partial
terms (wildcards inside of phrases are new; see JIRA issue
LUCENE-1486, ComplexPhraseQueryParser).

Best
Erick

On Mon, Mar 15, 2010 at 5:09 PM, Rene Hackl-Sommer <re...@gmx.de> wrote:

> Hi Erick,
>
>> What about indexing
>> the triplets with a small increment gap between? That is:
>> ...
>>
>> gets indexed as:
>>
>> level1-1/level2-1/level3-1  +gap 100
>> level1-1/level2-1/level3-2  +gap 100
>> level1-1/level2-2/level3-3  +gap 100
>> level1-1/level2-2/level3-4
>>
>>
>
> If I understand this correctly, the field would look like
> "level1-1/level2-1/level3-1 Term1 Term2 level1-1/level2-1/level3-2 Term3
> Term4 "?
>
> I think the problem here is the same as in the Payloads approach I wrote
> of in my response to Steve's mail. We cannot test for equality at search
> time (please correct me if we actually can do this). So if we have
>
>
> level1-1/level2-1/level3-1
> ...
> level1-1/level2-1/level3-244
> level1-1/level2-2/level3-1
> level1-1/level2-2/level3-105
>
> and I search for T1 and T2 on level3, but want them to be in the same
> level2, this cannot be done satisfactorily.
>
>
>  Or you could think about *documents* being your level1, that is each
>> document has one and only one level1 element but many documents
>> may have the same level1 token. Combining this with your increment
>> gap notion for level2-3 might work for you.
>>
>>
>
> I was thinking about this, yet the trouble is that the issue at hand is
> just one field in an already not quite trivial scenario involving 200+
> fields. If I add say 50 level1-documents per real document, I would still
> need to be able to relate these level1-documents to the real documents to
> which they belong, and, during retrieval, there are use cases where I need
> to look into each of the level1-documents to see if they fulfill certain
> criteria and then, in a further step, ascertain whether I can gather the
> needed level1-documents to fulfill the query on a "MyField"-Level (not
> existent here per se). I feel this might get somewhat unwieldy.
>
>
>> You might also search the list for "Hierarchical" or "tree" indexing,
>> this is a variant of such I think.
>>
>>
>
> Thank you, I'll look into this.
>
>
> Cheers
> Rene

Re: Increase number of available positions?

Posted by Erick Erickson <er...@gmail.com>.
Sure. I'd start a new thread though, referencing this one and outlining why
none of the solutions you tried worked.....

Erick

On Tue, Mar 16, 2010 at 4:35 AM, Rene Hackl-Sommer <re...@gmx.de> wrote:

> Hi Guys,
>
> Thanks for the input! I am now going to put in some work to see how things
> fare.
>
> Should I post the question about substituting int with long on lucene-dev
> again, if need arises?
>
> Thanks again,
> Rene
>
>
>
> On 15.03.2010 23:04, Steven A Rowe wrote:
>
>  Hi Rene,
>>
>> Have you seen SpanNotQuery?:
>>
>> <
>> http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/search/spans/SpanNotQuery.html
>> >
>>
>> For a document that looks like:
>>
>> <Level_1 id="1">
>>   <Level_2 id="1">
>>     <Level_3 id="1">T1 T2 T3</Level_3>
>>     <Level_3 id="2">T4 T5 T6</Level_3>
>>     <Level_3 id="3">T7 T8 T9</Level_3>
>>   </Level_2>
>>   <Level_2 id="2">
>>     <Level_3 id="1">T10 T11 T12</Level_3>
>>     <Level_3 id="2">T13 T14 T15</Level_3>
>>     <Level_3 id="3">T16 T17 T18</Level_3>
>>   </Level_2>
>>   ...
>> </Level_1>
>> ...
>>
>> You could generate the following token stream (L_X being a concrete level
>> boundary token):
>>
>> L_1 L_2 L_3 T1  T2  T3  L_3 T4  T5  T6  L_3 T7  T8  T9
>>     L_2 L_3 T10 T11 T12 L_3 T13 T14 T15 L_3 T16 T17 T18
>>     L_2 ...
>> ...
>>
>> A query to find T2 and T8 on the same Level_2 would require you to find a
>> span containing T2 and T8, but not containing L_2.
>>
>> This scheme will generalize to as many levels as you need, and you can use
>> nested span queries to simultaneously provide constraints at multiple
>> levels.  No position increment gap required.
>>
>> Caveat: this scheme is not tested - I could be way off base :).
>>
>> Steve

Re: Increase number of available positions?

Posted by Rene Hackl-Sommer <re...@gmx.de>.
Hi Guys,

Thanks for the input! I am now going to put in some work to see how 
things fare.

Should I post the question about substituting int with long on 
lucene-dev again, if need arises?

Thanks again,
Rene



On 15.03.2010 23:04, Steven A Rowe wrote:
> Hi Rene,
>
> Have you seen SpanNotQuery?:
>
> <http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/search/spans/SpanNotQuery.html>
>
> For a document that looks like:
>
> <Level_1 id="1">
>    <Level_2 id="1">
>      <Level_3 id="1">T1 T2 T3</Level_3>
>      <Level_3 id="2">T4 T5 T6</Level_3>
>      <Level_3 id="3">T7 T8 T9</Level_3>
>    </Level_2>
>    <Level_2 id="2">
>      <Level_3 id="1">T10 T11 T12</Level_3>
>      <Level_3 id="2">T13 T14 T15</Level_3>
>      <Level_3 id="3">T16 T17 T18</Level_3>
>    </Level_2>
>    ...
> </Level_1>
> ...
>
> You could generate the following token stream (L_X being a concrete level boundary token):
>
> L_1 L_2 L_3 T1  T2  T3  L_3 T4  T5  T6  L_3 T7  T8  T9
>      L_2 L_3 T10 T11 T12 L_3 T13 T14 T15 L_3 T16 T17 T18
>      L_2 ...
> ...
>
> A query to find T2 and T8 on the same Level_2 would require you to find a span containing T2 and T8, but not containing L_2.
>
> This scheme will generalize to as many levels as you need, and you can use nested span queries to simultaneously provide constraints at multiple levels.  No position increment gap required.
>
> Caveat: this scheme is not tested - I could be way off base :).
>
> Steve

Re: Increase number of available positions?

Posted by Rene Hackl-Sommer <re...@gmx.de>.
Hi Steve,

> I'm not sure what's wrong with the above (have you tried each of the two nested SpanNot clauses independently?), but here's another thing to try:
>
>    

Your query works. And as it turns out, if I don't commit the same 
embarrassing lower case / upper case inconsistency over and over again, 
the query I constructed works, too. My bad. Thanks for the alternative 
query, brought me right back on track!

Rene



RE: Increase number of available positions?

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Rene,

On 03/17/2010 at 11:17 AM, Rene Hackl-Sommer wrote:
> <SpanNot fieldName="MyField">
> <Include>
> <!-- Gets all the matching spans within L_2 boundaries and includes
> them -->
> <SpanNot>
> <Include>
> <SpanNear slop="2147483647" inOrder="false" >
> <SpanTerm>t293</SpanTerm>
> <SpanTerm>t4979</SpanTerm>
> </SpanNear>
> </Include>
> <Exclude>
> <SpanTerm>L_2</SpanTerm>
> </Exclude>
> </SpanNot>
> </Include>
> <Exclude>
> <!-- Gets all the matching spans from L_3 boundaries and excludes them
> -->
> <SpanNot>
> <Include>
> <SpanNear slop="2147483647" inOrder="false" >
> <SpanTerm>t293</SpanTerm>
> <SpanTerm>t4979</SpanTerm>
> </SpanNear>
> </Include>
> <Exclude>
> <SpanTerm>L_3</SpanTerm>
> </Exclude>
> </SpanNot>
> </Exclude>
> </SpanNot>
>
> Shouldn't this query only leave documents where t293 and t4979 are in
> the same L_2, but not within the same L_3?

I'm not sure what's wrong with the above (have you tried each of the two nested SpanNot clauses independently?), but here's another thing to try:

<SpanNot>
  <Include>
    <SpanOr>
      <SpanNear slop="2147483647" inOrder="true">
        <SpanTerm>t293</SpanTerm>
        <SpanTerm>L_3</SpanTerm>
        <SpanTerm>t4979</SpanTerm>
      </SpanNear>
      <SpanNear slop="2147483647" inOrder="true">
        <SpanTerm>t4979</SpanTerm>
        <SpanTerm>L_3</SpanTerm>
        <SpanTerm>t293</SpanTerm>
      </SpanNear>
    </SpanOr>
  </Include>
  <Exclude>
    <SpanTerm>L_2</SpanTerm>
  </Exclude>
</SpanNot>
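
To make the intent of the query above concrete, here is a small plain-Java sketch of the condition it is meant to express: the two terms are separated by at least one L_3 boundary token but by no L_2 boundary token. It is only a simulation over a token list and does not use the Lucene span classes; the class name CrossLevelMatch and its method are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation, not Lucene code: checks "same L_2, different L_3"
// on a token stream that contains L_2 and L_3 boundary tokens.
final class CrossLevelMatch {

    static boolean sameL2DifferentL3(List<String> tokens, String a, String b) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(a)) continue;
            for (int j = 0; j < tokens.size(); j++) {
                if (!tokens.get(j).equals(b)) continue;
                int lo = Math.min(i, j);
                int hi = Math.max(i, j);
                // The span from one term to the other, in either order.
                List<String> span = tokens.subList(lo, hi + 1);
                if (span.contains("L_3") && !span.contains("L_2")) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> stream = Arrays.asList(
            ("L_1 L_2 L_3 T1 T2 T3 L_3 T4 T5 T6 L_3 T7 T8 T9 "
           + "L_2 L_3 T10 T11 T12 L_3 T13 T14 T15").split(" "));
        System.out.println(sameL2DifferentL3(stream, "T2", "T8"));  // same L_2, different L_3
        System.out.println(sameL2DifferentL3(stream, "T1", "T3"));  // same L_3
        System.out.println(sameL2DifferentL3(stream, "T2", "T13")); // different L_2
    }
}
```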

Steve




Re: Increase number of available positions?

Posted by Rene Hackl-Sommer <re...@gmx.de>.
Hi,

I was looking at SpanNotQuery to see if I could make do without the 
position increment gaps. A search requirement that's causing me some 
trouble to implement is when two terms are supposed to be on the same 
L_2, yet on different L_3's (L_3's are hierarchically below L_2).

With the position increments in place, I can do this:

<SpanNot fieldName="MyField">
<Include>
<SpanNear slop="100000" inOrder="false">
<SpanTerm>t293</SpanTerm>
<SpanTerm>t4979</SpanTerm>
</SpanNear>
</Include>
<Exclude>
<SpanNear slop="1000" inOrder="false">
<SpanTerm>t293</SpanTerm>
<SpanTerm>t4979</SpanTerm>
</SpanNear>
</Exclude>
</SpanNot>

This query returns the expected documents.

I didn't manage to come up with a working solution for the approach 
without posIncGaps. The following, I thought, should work, but for some 
reason it doesn't:

<SpanNot fieldName="MyField">
<Include>
<!-- Gets all the matching spans within L_2 boundaries and includes them -->
<SpanNot>
<Include>
<SpanNear slop="2147483647" inOrder="false" >
<SpanTerm>t293</SpanTerm>
<SpanTerm>t4979</SpanTerm>
</SpanNear>
</Include>
<Exclude>
<SpanTerm>L_2</SpanTerm>
</Exclude>
</SpanNot>
</Include>
<Exclude>
<!-- Gets all the matching spans from L_3 boundaries and excludes them -->
<SpanNot>
<Include>
<SpanNear slop="2147483647" inOrder="false" >
<SpanTerm>t293</SpanTerm>
<SpanTerm>t4979</SpanTerm>
</SpanNear>
</Include>
<Exclude>
<SpanTerm>L_3</SpanTerm>
</Exclude>
</SpanNot>
</Exclude>
</SpanNot>

Shouldn't this query only leave documents where t293 and t4979 are in 
the same L_2, but not within the same L_3? I fiddled about with 
different queries to no avail and I feel the above is the most 
straightforward try. But the query doesn't match any document at all.

Any ideas on how to improve the second query would be greatly appreciated.

Thanks
Rene

> Hi Rene,
>
> Have you seen SpanNotQuery?:
>
> <http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/search/spans/SpanNotQuery.html>
>
> For a document that looks like:
>
> <Level_1 id="1">
>    <Level_2 id="1">
>      <Level_3 id="1">T1 T2 T3</Level_3>
>      <Level_3 id="2">T4 T5 T6</Level_3>
>      <Level_3 id="3">T7 T8 T9</Level_3>
>    </Level_2>
>    <Level_2 id="2">
>      <Level_3 id="1">T10 T11 T12</Level_3>
>      <Level_3 id="2">T13 T14 T15</Level_3>
>      <Level_3 id="3">T16 T17 T18</Level_3>
>    </Level_2>
>    ...
> </Level_1>
> ...
>
> You could generate the following token stream (L_X being a concrete level boundary token):
>
> L_1 L_2 L_3 T1  T2  T3  L_3 T4  T5  T6  L_3 T7  T8  T9
>      L_2 L_3 T10 T11 T12 L_3 T13 T14 T15 L_3 T16 T17 T18
>      L_2 ...
> ...
>
> A query to find T2 and T8 on the same Level_2 would require you to find a span containing T2 and T8, but not containing L_2.
>
> This scheme will generalize to as many levels as you need, and you can use nested span queries to simultaneously provide constraints at multiple levels.  No position increment gap required.
>
> Caveat: this scheme is not tested - I could be way off base :).
>
> Steve


RE: Increase number of available positions?

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Rene,

Have you seen SpanNotQuery?: 

<http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/search/spans/SpanNotQuery.html>

For a document that looks like:

<Level_1 id="1">
  <Level_2 id="1">
    <Level_3 id="1">T1 T2 T3</Level_3>
    <Level_3 id="2">T4 T5 T6</Level_3>
    <Level_3 id="3">T7 T8 T9</Level_3>
  </Level_2>
  <Level_2 id="2">
    <Level_3 id="1">T10 T11 T12</Level_3>
    <Level_3 id="2">T13 T14 T15</Level_3>
    <Level_3 id="3">T16 T17 T18</Level_3>
  </Level_2>
  ...
</Level_1>
...

You could generate the following token stream (L_X being a concrete level boundary token):

L_1 L_2 L_3 T1  T2  T3  L_3 T4  T5  T6  L_3 T7  T8  T9
    L_2 L_3 T10 T11 T12 L_3 T13 T14 T15 L_3 T16 T17 T18
    L_2 ...
...

A query to find T2 and T8 on the same Level_2 would require you to find a span containing T2 and T8, but not containing L_2.

This scheme will generalize to as many levels as you need, and you can use nested span queries to simultaneously provide constraints at multiple levels.  No position increment gap required.

Caveat: this scheme is not tested - I could be way off base :).
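
The boundary-token idea can be illustrated without Lucene. The following is a minimal plain-Java sketch, assuming the token stream shown above; BoundarySpans and sameLevel are hypothetical names, and the check only simulates "a span containing both terms but not the boundary token" rather than calling SpanNotQuery.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of the boundary-token scheme, not Lucene code.
final class BoundarySpans {

    /**
     * True if some occurrence of a and some occurrence of b can be joined
     * by a span that does not contain the given boundary token.
     */
    static boolean sameLevel(List<String> tokens, String a, String b, String boundary) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(a)) continue;
            for (int j = 0; j < tokens.size(); j++) {
                if (!tokens.get(j).equals(b)) continue;
                int lo = Math.min(i, j);
                int hi = Math.max(i, j);
                if (!tokens.subList(lo, hi + 1).contains(boundary)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> stream = Arrays.asList(
            ("L_1 L_2 L_3 T1 T2 T3 L_3 T4 T5 T6 L_3 T7 T8 T9 "
           + "L_2 L_3 T10 T11 T12 L_3 T13 T14 T15 L_3 T16 T17 T18").split(" "));
        System.out.println(sameLevel(stream, "T2", "T8", "L_2"));  // same Level_2
        System.out.println(sameLevel(stream, "T2", "T13", "L_2")); // crosses an L_2 boundary
    }
}
```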

Steve




Re: Increase number of available positions?

Posted by Rene Hackl-Sommer <re...@gmx.de>.
Hi Erick,
> What about indexing
> the triplets with a small increment gap between? That is:
> ...
> gets indexed as:
>
> level1-1/level2-1/level3-1  +gap 100
> level1-1/level2-1/level3-2  +gap 100
> level1-1/level2-2/level3-3  +gap 100
> level1-1/level2-2/level3-4
>    

If I understand this correctly, the field would look like 
"level1-1/level2-1/level3-1 Term1 Term2 level1-1/level2-1/level3-2 Term3 
Term4 "?

I think the problem here is the same as in the Payloads approach I 
wrote of in my response to Steve's mail. We cannot test for equality at 
search time (please correct me if we actually can do this). So if we have

level1-1/level2-1/level3-1
...
level1-1/level2-1/level3-244
level1-1/level2-2/level3-1
level1-1/level2-2/level3-105

and I search for T1 and T2 on level3, but want them to be in the same 
level2, this cannot be done satisfactorily.

> Or you could think about *documents* being your level1, that is each
> document has one and only one level1 element but many documents
> may have the same level1 token. Combining this with your increment
> gap notion for level2-3 might work for you.
>    

I was thinking about this, yet the trouble is that the issue at hand is 
just one field in an already not quite trivial scenario involving 200+ 
fields. If I add say 50 level1-documents per real document, I would 
still need to be able to relate these level1-documents to the real 
documents to which they belong, and, during retrieval, there are use 
cases where I need to look into each of the level1-documents to see if 
they fulfill certain criteria and then, in a further step, ascertain 
whether I can gather the needed level1-documents to fulfill the query on 
a "MyField"-Level (not existent here per se). I feel this might get 
somewhat unwieldy.

> You might also search the list for "Hierarchical" or "tree" indexing,
> this is a variant of such I think.
>    

Thank you, I'll look into this.

Cheers
Rene



Re: Increase number of available positions?

Posted by Erick Erickson <er...@gmail.com>.
I was wondering about Steven's approach too; have you considered it?

I don't know the internals well enough to say whether you could go to a
64-bit quantity for term positions. I suspect it would be *very* involved,
but perhaps people more familiar with the code could comment.....

How big is your corpus? Assuming that, for some reason, you can't
follow Steven's approach, there are other possibilities. It really goes
against the grain for all DB/computer geeks to de-normalize data,
but Lucene handles really large amounts of text. What about indexing
the triplets with a small increment gap between? That is:

level1-1 - level2-1 - level3-1
                    - level3-2
         - level2-2 - level3-3
                    - level3-4


gets indexed as:

level1-1/level2-1/level3-1  +gap 100
level1-1/level2-1/level3-2  +gap 100
level1-1/level2-2/level3-3  +gap 100
level1-1/level2-2/level3-4

with a gap of 100 (or even 10) between. Your index will NOT grow
linearly with the tokens since there will be so many repeats
of the first couple of levels. This also gives you an easier way to
search for, say, all children of level1-1/level2-1 just by using
a prefix query.

Or you could think about *documents* being your level1, that is each
document has one and only one level1 element but many documents
may have the same level1 token. Combining this with your increment
gap notion for level2-3 might work for you.

Do note that Lucene has no requirement that all documents have
the same fields, so you can think about part of your documents
being your "level" documents with different fields than other
documents in your index....

You might also search the list for "Hierarchical" or "tree" indexing,
this is a variant of such I think.

HTH
Erick

On Mon, Mar 15, 2010 at 9:59 AM, Rene Hackl-Sommer <re...@gmx.de> wrote:

>> Is your entire corpus a single document? Because I'm having trouble
>> imagining a single document where this would be a problem, unless
>> your increment gap is huge. The term positions are relative to
>> a single document...
>
> It is getting pretty huge, yes (see below). The term positions are also
> relative to a single field, aren't they?
>
>>> <MyField>
>>> <Level_1>
>>> <Level_2>
>>> <Level_3>
>
> Let me plug in some figures to help clarify. On Level 3 there are hundreds
> of tokens. So to be able to search two or more terms in MyField in the same
> Level_3, I put a position gap of 1000 between all Level_3's. Per Level_2
> there might be hundreds of Level_3 entries. As I want to restrict the search
> to all Level_3 entries of a Level_2, I set the position increment gap for
> Level_2 at 1000x1000 = 1,000,000 (1000 for the Tokens in Level_3 and 1000
> for the Level_3 entries in Level_2).
>
> This done, Level_1 still needs to be accommodated. If you're looking at 500
> Level_2 entries, a gap of 1,000,000x500 is needed per Level_1 entry, to be
> able to search only within each of the Level_1 elements. That way only four
> Level_1 entries can be included before the maximum value is reached.
>
> Queries I am looking to support might look like this in an easy case:
>
> Search in MyField: Terms T1 and T2 on Level_2 and T3, T4, and T5 on
> Level_3, which should both be in the same Level_1.
>
> Sorry if this is confusing, what with all these levels going on. I think
> what it comes down to is whether the integer based position counting might
> be replaced by long. Can this be done at all? Are performance or other
> implications conceivable? Or is the current implementation too deeply wired
> to Lucene core workings to make this a reasonable endeavour?
>
> Cheers
>
> Rene

Re: Increase number of available positions?

Posted by Rene Hackl-Sommer <re...@gmx.de>.
Hi Steve,

> Why can't you use a different field for each of the Level_X's, i.e. MyLevel1Field, MyLevel2Field, MyLevel3Field?
>    

Well, the hierarchical structure needs to be maintained. As hundreds of 
Level_X entities can be found on levels 2 and 3, I need to be able to 
tell for instance which Level_3 entities belong to which common Level_2 
entity. Throwing all Level_3 entities in a field of their own would 
remove this information, as far as I can see.

I was also thinking about adding Payloads at some point, but the main 
caveat here is that the Payload data cannot be tested for equality at 
search time. E.g. if I have a term T1 and add a payload that states this 
term belongs to Level_3:200;Level_2:65;Level_1:44 and I have a term T2 
with Level_3:200;Level_2:66;Level_1:44 I cannot state at search time 
that I would like the number for Level_2 entities to be the same. I 
could say Level_2 has to be 65, but I don't know that beforehand of 
course. Or am I overlooking something here?

> On 03/15/2010 at 9:59 AM, Rene Hackl-Sommer wrote:
>    
>>>> Search in MyField: Terms T1 and T2 on Level_2 and T3,
>>>> T4, and T5 on  Level_3, which should both be in the
>>>> same Level_1.
>>>>          
> I don't understand what you mean by "which should both be in the same Level_1".  Can you give more details?
>
>    

I guess my initial pseudo-XML construct might have been misleading, my 
apologies. To be more precise, it is like this:

<!ELEMENT MyField (Level_1+) >
<!ELEMENT Level_1 (Level_2+) >
<!ELEMENT Level_2 (Level_3+) >
<!ELEMENT Level_3 (Terms+) >
<!ELEMENT Terms (#PCDATA) >

What I am adding to Lucene is a single Field MyField. I preprocess the 
input string so that it looks like "Term1 Term2 endOfLevel_3 Term3 Term4 
endOfLevel_3 Term4 Term5 endOfLevel_3 endOfLevel_2 Term8 Term9 
endOfLevel_3 ...". Note the appearance of endOfLevel_2.

I use a custom Filter to switch the position increment as needed and as 
indicated by the marker tokens. The marker tokens themselves don't get 
indexed.
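
As an illustration of the marker-token idea, here is a plain-Java sketch that assigns positions to a preprocessed string. It is not the actual filter and does not use the Lucene TokenFilter API: the class MarkerGapSketch, the "token@position" output format, and the rule that each marker simply adds its gap to the next token's increment are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the filter from this thread: marker tokens are
// dropped and turned into extra position increments for the next real token.
final class MarkerGapSketch {

    static List<String> assign(String input, int level3Gap, int level2Gap) {
        List<String> out = new ArrayList<>();
        int pos = -1;       // so that the first token lands on position 0
        int pending = 0;    // extra increment collected from marker tokens
        for (String tok : input.trim().split("\\s+")) {
            if (tok.equals("endOfLevel_3")) {
                pending = Math.addExact(pending, level3Gap);
                continue;   // markers themselves are not indexed
            }
            if (tok.equals("endOfLevel_2")) {
                pending = Math.addExact(pending, level2Gap);
                continue;
            }
            // addExact throws ArithmeticException once the int range is exhausted.
            pos = Math.addExact(pos, Math.addExact(1, pending));
            pending = 0;
            out.add(tok + "@" + pos);
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "Term1 Term2 endOfLevel_3 Term3 endOfLevel_3 endOfLevel_2 Term8";
        System.out.println(assign(input, 1000, 1_000_000));
    }
}
```

Using Math.addExact makes the int limit discussed in this thread visible as an exception instead of a silent wrap-around.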

Cheers
Rene




RE: Increase number of available positions?

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Rene,

Why can't you use a different field for each of the Level_X's, i.e. MyLevel1Field, MyLevel2Field, MyLevel3Field?

On 03/15/2010 at 9:59 AM, Rene Hackl-Sommer wrote:
> > > Search in MyField: Terms T1 and T2 on Level_2 and T3,
> > > T4, and T5 on  Level_3, which should both be in the
> > > same Level_1.

I don't understand what you mean by "which should both be in the same Level_1".  Can you give more details?

Steve




Re: Increase number of available positions?

Posted by Rene Hackl-Sommer <re...@gmx.de>.
> Is your entire corpus a single document? Because I'm having trouble
> imagining a single document where this would be a problem, unless
> your increment gap is huge. The term positions are relative to
> a single document...
>    

It is getting pretty huge, yes (see below). The term positions are also 
relative to a single field, aren't they?

>> <MyField>
>> <Level_1>
>> <Level_2>
>> <Level_3>
>>
>>      
Let me plug in some figures to help clarify. On Level 3 there are 
hundreds of tokens. So to be able to search two or more terms in MyField 
in the same Level_3, I put a position gap of 1000 between all Level_3's. 
Per Level_2 there might be hundreds of Level_3 entries. As I want to 
restrict the search to all Level_3 entries of a Level_2, I set the 
position increment gap for Level_2 at 1000x1000 = 1,000,000 (1000 for 
the Tokens in Level_3 and 1000 for the Level_3 entries in Level_2).

This done, Level_1 still needs to be accommodated. If you're looking at 
500 Level_2 entries, a gap of 1,000,000x500 is needed per Level_1 entry, 
to be able to search only within each of the Level_1 elements. That way 
only four Level_1 entries can be included before the maximum value is 
reached.
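
The arithmetic behind "only four Level_1 entries" can be checked with a few lines of Java. This is just a calculation sketch using the figures from this message (1000 positions per Level_3, 1000 Level_3 slots per Level_2, 500 Level_2 entries per Level_1); the class name GapBudget is hypothetical.

```java
// Hypothetical helper: works out how many Level_1 entries fit into the
// int position range when every level reserves a fixed block of positions.
final class GapBudget {

    /** Positions consumed by one Level_1 entry. */
    static long positionsPerLevel1(long level3Gap, long level3PerLevel2, long level2PerLevel1) {
        return level3Gap * level3PerLevel2 * level2PerLevel1;
    }

    /** Number of complete Level_1 blocks below Integer.MAX_VALUE. */
    static long maxLevel1Entries(long positionsPerLevel1) {
        return Integer.MAX_VALUE / positionsPerLevel1;
    }

    public static void main(String[] args) {
        long perLevel1 = positionsPerLevel1(1000, 1000, 500); // 1,000,000 x 500
        System.out.println(perLevel1 + " positions per Level_1 entry");
        System.out.println(maxLevel1Entries(perLevel1) + " Level_1 entries fit");
    }
}
```

Integer.MAX_VALUE is 2,147,483,647, so four blocks of 500,000,000 positions fit and a fifth does not.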

Queries I am looking to support might look like this in an easy case:

Search in MyField: Terms T1 and T2 on Level_2 and T3, T4, and T5 on 
Level_3, which should both be in the same Level_1.

Sorry if this is confusing, what with all these levels going on. I think 
what it comes down to is whether the integer based position counting 
might be replaced by long. Can this be done at all? Are performance or 
other implications conceivable? Or is the current implementation too 
deeply wired to Lucene core workings to make this a reasonable endeavour?

Cheers
Rene



Re: Increase number of available positions?

Posted by Erick Erickson <er...@gmail.com>.
Is your entire corpus a single document? Because I'm having trouble
imagining a single document where this would be a problem, unless
your increment gap is huge. The term positions are relative to
a single document...

You say that your levels have fewer than 1,000 elements each. With
an increment gap of 100, you're only talking a total here of 300,000
as your increment gap "holes", so you've got room for, uhhhhmm, a lot
more tokens per document. If you're  running over that limit, the
increment gap is the least of your problems <G>...

Of course I may be missing the point completely...

Erick

On Mon, Mar 15, 2010 at 5:03 AM, Rene Hackl-Sommer <re...@gmx.de> wrote:

> Hello,
>
> I am working on a use case that is very demanding regarding the number of
> token positions. For one special field in the index, I need to represent
> different hierarchy levels, like this:
>
> <MyField>
> <Level_1>
> <Level_2>
> <Level_3>
>
> Please note that I need to do this with Lucene, not an XML search engine.
>
> Now, on Level_3 there are hundreds of tokens, Level_2 also has hundreds of
> entries and Level_1 is in there with a low 3-digit figure. For those who
> wish to know: this is an intricate system of chemical entities and some
> of their properties.
>
> I need this information to be searchable in all conceivable ways. What I am
> doing right now is use position increment gaps to separate the Levels and
> search with SpanQueries. It works like a charm for a setup with limited
> entries. But Integer.MAX_VALUE poses a cap on the approach, of course. Would
> it be feasible to replace the current integer counting system with a
> long-based system? What issues should I consider?
>
> Thanks,
> Rene