Posted to java-user@lucene.apache.org by Rene Hackl-Sommer <re...@gmx.de> on 2010/03/15 10:03:26 UTC
Increase number of available positions?
Hello,
I am working on a use case that is very demanding regarding the number
of token positions. For one special field in the index, I need to
represent different hierarchy levels, like this:
<MyField>
<Level_1>
<Level_2>
<Level_3>
Please note that I need to do this with Lucene, not an XML search engine.
Now, on Level_3 there are hundreds of tokens, Level_2 also has hundreds of
entries and Level_1 is in there with a low 3-digit figure. For those who
wish to know: this is an intricate system of chemical entities and some
of their properties.
I need this information to be searchable in all conceivable ways. What I
am doing right now is use position increment gaps to separate the Levels
and search with SpanQueries. It works like a charm for a setup with
limited entries. But Integer.MAX_VALUE poses a cap on the approach, of
course. Would it be feasible to replace the current int-based position
counting with a long-based one? What issues should I consider?
Thanks,
Rene
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Increase number of available positions?
Posted by Erick Erickson <er...@gmail.com>.
Not quite what I had in mind, more like
level1-1/level2-1/level3-1/Term1 level1-1/level2-1/level3-1/Term2
level1-1/level2-1/level3-2/Term3 level1-1/level2-1/level3-2/Term4
With an increment gap of 100 and an analyzer that splits on slashes, the
term positions would be something like:
pos   term
  0   level1-1
  1   level2-1
  2   level3-1
  3   Term1
104   level1-1
105   level2-1
106   level3-1
107   Term2
208   level1-1
209   level2-1
210   level3-2
211   Term3
312   level1-1
313   level2-1
314   level3-2
315   Term4
As you see, a lot of repetition, but perhaps acceptable...
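The position arithmetic behind the table can be reproduced with a small stand-alone sketch. It does not use Lucene at all; it only models the assumption that the increment gap is added once between consecutive values of the field and that every token advances the position by one. The class and method names (GapPositions, positions) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: models how positions advance
// across the values of a multi-valued field with a position increment gap.
final class GapPositions {

    // Returns "position term" strings for all tokens of all values.
    static List<String> positions(List<String> values, int gap, boolean splitOnSlash) {
        List<String> out = new ArrayList<>();
        int pos = -1; // position of the last emitted token; -1 means none yet
        for (int i = 0; i < values.size(); i++) {
            String value = values.get(i);
            String[] tokens = splitOnSlash ? value.split("/") : new String[] {value};
            if (i > 0) {
                pos += gap; // assumed: the gap is applied once between two values
            }
            for (String token : tokens) {
                pos++; // every token advances the position by one
                out.add(pos + " " + token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "level1-1/level2-1/level3-1/Term1",
                "level1-1/level2-1/level3-1/Term2",
                "level1-1/level2-1/level3-2/Term3",
                "level1-1/level2-1/level3-2/Term4");
        positions(values, 100, true).forEach(System.out::println);
    }
}
```

Calling positions(values, 100, false) on the same input models an analyzer that keeps each path as a single term.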
Or, you could choose an analyzer that didn't break up the terms
(although this would make your index somewhat bigger due to
more unique terms).
pos   term
  0   level1-1/level2-1/level3-1/Term1
101   level1-1/level2-1/level3-1/Term2
202   level1-1/level2-1/level3-2/Term3
303   level1-1/level2-1/level3-2/Term4
Although I don't know if you really need an increment gap here.....
The latter would make gathering all the documents with specific levels
easier, although the former would also work if you didn't need partial
terms (wildcards inside of phrases are new, see LUCENE-1486,
ComplexPhraseQueryParser).
Best
Erick
On Mon, Mar 15, 2010 at 5:09 PM, Rene Hackl-Sommer <re...@gmx.de> wrote:
> Hi Erick,
> [...]
Re: Increase number of available positions?
Posted by Erick Erickson <er...@gmail.com>.
Sure. I'd start a new thread though, referencing this one and outlining why
none of the solutions you tried worked.....
Erick
On Tue, Mar 16, 2010 at 4:35 AM, Rene Hackl-Sommer <re...@gmx.de> wrote:
> Hi Guys,
>
> Thanks for the input! I am now going to put in some work to see how things
> fare.
>
> Should I post the question about substituting int with long on lucene-dev
> again, if need arises?
>
> Thanks again,
> Rene
>
Re: Increase number of available positions?
Posted by Rene Hackl-Sommer <re...@gmx.de>.
Hi Guys,
Thanks for the input! I am now going to put in some work to see how
things fare.
Should I post the question about substituting int with long on
lucene-dev again, if need arises?
Thanks again,
Rene
On 15.03.2010 at 23:04, Steven A Rowe wrote:
> Hi Rene,
>
> Have you seen SpanNotQuery?:
>
> <http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/search/spans/SpanNotQuery.html>
> [...]
Re: Increase number of available positions?
Posted by Rene Hackl-Sommer <re...@gmx.de>.
Hi Steve,
> I'm not sure what's wrong with the above (have you tried each of the two nested SpanNot clauses independently?), but here's another thing to try:
>
>
Your query works. And as it turns out, if I don't commit the same
embarrassing lower case / upper case inconsistency over and over again,
the query I constructed works, too. My bad. Thanks for the alternative
query, it brought me right back on track!
Rene
RE: Increase number of available positions?
Posted by Steven A Rowe <sa...@syr.edu>.
Hi Rene,
On 03/17/2010 at 11:17 AM, Rene Hackl-Sommer wrote:
> <SpanNot fieldName="MyField">
>   <Include>
>     <!-- Gets all the matching spans within L_2 boundaries and includes them -->
>     <SpanNot>
>       <Include>
>         <SpanNear slop="2147483647" inOrder="false">
>           <SpanTerm>t293</SpanTerm>
>           <SpanTerm>t4979</SpanTerm>
>         </SpanNear>
>       </Include>
>       <Exclude>
>         <SpanTerm>L_2</SpanTerm>
>       </Exclude>
>     </SpanNot>
>   </Include>
>   <Exclude>
>     <!-- Gets all the matching spans from L_3 boundaries and excludes them -->
>     <SpanNot>
>       <Include>
>         <SpanNear slop="2147483647" inOrder="false">
>           <SpanTerm>t293</SpanTerm>
>           <SpanTerm>t4979</SpanTerm>
>         </SpanNear>
>       </Include>
>       <Exclude>
>         <SpanTerm>L_3</SpanTerm>
>       </Exclude>
>     </SpanNot>
>   </Exclude>
> </SpanNot>
>
> Shouldn't this query only leave documents, where t293 and t4979 are in
> the same L_2, but not within the same L_3?
I'm not sure what's wrong with the above (have you tried each of the two nested SpanNot clauses independently?), but here's another thing to try:
<SpanNot>
  <Include>
    <SpanOr>
      <SpanNear slop="2147483647" inOrder="true">
        <SpanTerm>t293</SpanTerm>
        <SpanTerm>L_3</SpanTerm>
        <SpanTerm>t4979</SpanTerm>
      </SpanNear>
      <SpanNear slop="2147483647" inOrder="true">
        <SpanTerm>t4979</SpanTerm>
        <SpanTerm>L_3</SpanTerm>
        <SpanTerm>t293</SpanTerm>
      </SpanNear>
    </SpanOr>
  </Include>
  <Exclude>
    <SpanTerm>L_2</SpanTerm>
  </Exclude>
</SpanNot>
Steve
Re: Increase number of available positions?
Posted by Rene Hackl-Sommer <re...@gmx.de>.
Hi,
I was looking at SpanNotQuery to see if I could make do without the
position increment gaps. A search requirement that's causing me some
trouble to implement is when two terms are supposed to be on the same
L_2, yet on different L_3's (L_3's are hierarchically below L_2).
With the position increments in place, I can do this:
<SpanNot fieldName="MyField">
  <Include>
    <SpanNear slop="100000" inOrder="false">
      <SpanTerm>t293</SpanTerm>
      <SpanTerm>t4979</SpanTerm>
    </SpanNear>
  </Include>
  <Exclude>
    <SpanNear slop="1000" inOrder="false">
      <SpanTerm>t293</SpanTerm>
      <SpanTerm>t4979</SpanTerm>
    </SpanNear>
  </Exclude>
</SpanNot>
This query returns the expected documents.
I didn't manage to come up with a working solution for the approach
without posIncGaps. The following, I thought, should work, but for some
reason it doesn't:
<SpanNot fieldName="MyField">
  <Include>
    <!-- Gets all the matching spans within L_2 boundaries and includes them -->
    <SpanNot>
      <Include>
        <SpanNear slop="2147483647" inOrder="false">
          <SpanTerm>t293</SpanTerm>
          <SpanTerm>t4979</SpanTerm>
        </SpanNear>
      </Include>
      <Exclude>
        <SpanTerm>L_2</SpanTerm>
      </Exclude>
    </SpanNot>
  </Include>
  <Exclude>
    <!-- Gets all the matching spans from L_3 boundaries and excludes them -->
    <SpanNot>
      <Include>
        <SpanNear slop="2147483647" inOrder="false">
          <SpanTerm>t293</SpanTerm>
          <SpanTerm>t4979</SpanTerm>
        </SpanNear>
      </Include>
      <Exclude>
        <SpanTerm>L_3</SpanTerm>
      </Exclude>
    </SpanNot>
  </Exclude>
</SpanNot>
Shouldn't this query only leave documents, where t293 and t4979 are in
the same L_2, but not within the same L_3? I fiddled about with
different queries to no avail and I feel the above is the most
straightforward try. But the query doesn't match any document at all.
Any ideas on how to improve the second query would be greatly appreciated.
Thanks
Rene
RE: Increase number of available positions?
Posted by Steven A Rowe <sa...@syr.edu>.
Hi Rene,
Have you seen SpanNotQuery?:
<http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/search/spans/SpanNotQuery.html>
For a document that looks like:
<Level_1 id="1">
  <Level_2 id="1">
    <Level_3 id="1">T1 T2 T3</Level_3>
    <Level_3 id="2">T4 T5 T6</Level_3>
    <Level_3 id="3">T7 T8 T9</Level_3>
  </Level_2>
  <Level_2 id="2">
    <Level_3 id="1">T10 T11 T12</Level_3>
    <Level_3 id="2">T13 T14 T15</Level_3>
    <Level_3 id="3">T16 T17 T18</Level_3>
  </Level_2>
  ...
</Level_1>
...
You could generate the following token stream (L_X being a concrete level boundary token):
L_1 L_2 L_3 T1 T2 T3 L_3 T4 T5 T6 L_3 T7 T8 T9
L_2 L_3 T10 T11 T12 L_3 T13 T14 T15 L_3 T16 T17 T18
L_2 ...
...
A query to find T2 and T8 on the same Level_2 would require you to find a span containing T2 and T8, but not containing L_2.
This scheme will generalize to as many levels as you need, and you can use nested span queries to simultaneously provide constraints at multiple levels. No position increment gap required.
Caveat: this scheme is not tested - I could be way off base :).
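The logic of the boundary-token scheme can be checked without Lucene by walking the token stream directly. The following is only a plain-Java model of the intended matching (two terms count as being on the same level if no boundary token of that level lies between them); it is not the SpanNotQuery implementation, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone model of the boundary-token idea; no Lucene involved.
final class BoundaryTokens {

    // True if a and b both occur in one stretch of tokens that contains no boundary token.
    static boolean sameLevel(List<String> tokens, String boundary, String a, String b) {
        boolean seenA = false;
        boolean seenB = false;
        for (String t : tokens) {
            if (t.equals(boundary)) {
                seenA = false; // a boundary starts a new segment
                seenB = false;
            } else {
                if (t.equals(a)) seenA = true;
                if (t.equals(b)) seenB = true;
                if (seenA && seenB) return true;
            }
        }
        return false;
    }

    // True if a and b share an outer segment but sit in different inner segments.
    // Assumes every outer boundary is followed by an inner boundary, as in the stream above.
    static boolean sameOuterDifferentInner(List<String> tokens, String outer, String inner,
                                           String a, String b) {
        Set<Integer> innerIdsOfA = new HashSet<>();
        Set<Integer> innerIdsOfB = new HashSet<>();
        int innerId = 0;
        for (String t : tokens) {
            if (t.equals(outer)) {
                innerIdsOfA.clear(); // new outer segment: forget earlier sightings
                innerIdsOfB.clear();
                innerId++;
            } else if (t.equals(inner)) {
                innerId++;
            } else {
                if (t.equals(a)) innerIdsOfA.add(innerId);
                if (t.equals(b)) innerIdsOfB.add(innerId);
                for (int x : innerIdsOfA) {
                    for (int y : innerIdsOfB) {
                        if (x != y) return true;
                    }
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> stream = Arrays.asList(
                ("L_1 L_2 L_3 T1 T2 T3 L_3 T4 T5 T6 L_3 T7 T8 T9 "
                + "L_2 L_3 T10 T11 T12 L_3 T13 T14 T15 L_3 T16 T17 T18").split(" "));
        System.out.println(sameLevel(stream, "L_2", "T2", "T8"));
        System.out.println(sameOuterDifferentInner(stream, "L_2", "L_3", "T2", "T8"));
    }
}
```

The second method corresponds to the harder requirement discussed elsewhere in this thread: same Level_2, but different Level_3.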
Steve
Re: Increase number of available positions?
Posted by Rene Hackl-Sommer <re...@gmx.de>.
Hi Erick,
> What about indexing
> the triplets with a small increment gap between? That is:
> ...
> gets indexed as:
>
> level1-1/level2-1/level3-1 +gap 100
> level1-1/level2-1/level3-2 +gap 100
> level1-1/level2-2/level3-3 +gap 100
> level1-1/level2-2/level3-4
>
If I understand this correctly, the field would look like
"level1-1/level2-1/level3-1 Term1 Term2 level1-1/level2-1/level3-2 Term3
Term4 "?
I think the problem here is the same as in the Payloads approach I
wrote about in my response to Steve's mail. We cannot test for equality at
search time (please correct me if we actually can do this). So if we have
level1-1/level2-1/level3-1
...
level1-1/level2-1/level3-244
level1-1/level2-2/level3-1
level1-1/level2-2/level3-105
and I search for T1 and T2 on level3, but want them to be in the same
level2, this cannot be done satisfactorily.
> Or you could think about *documents* being your level1, that is each
> document has one and only one level1 element but many documents
> may have the same level1 token. Combining this with your increment
> gap notion for level2-3 might work for you.
>
I was thinking about this, yet the trouble is that the issue at hand is
just one field in an already not quite trivial scenario involving 200+
fields. If I add say 50 level1-documents per real document, I would
still need to be able to relate these level1-documents to the real
documents to which they belong, and, during retrieval, there are use
cases where I need to look into each of the level1-documents to see if
they fulfill certain criteria and then, in a further step, ascertain
whether I can gather the needed level1-documents to fulfill the query on
a "MyField"-Level (not existent here per se). I feel this might get
somewhat unwieldy.
> You might also search the list for "hierarchical" or "tree" indexing,
> this is a variant of such I think.
>
Thank you, I'll look into this.
Cheers
Rene
Re: Increase number of available positions?
Posted by Erick Erickson <er...@gmail.com>.
I was wondering about Steven's approach too, have you considered it?
I don't know the internals well enough to say whether you could go to a
64-bit quantity for term positions. I suspect it would be *very* involved,
but perhaps people more familiar with the code could comment.....
How big is your corpus? Assuming that, for some reason, you can't
follow Steven's approach, there are other possibilities. It really goes
against the grain for all DB/computer geeks to de-normalize data,
but Lucene handles really large amounts of text. What about indexing
the triplets with a small increment gap between? That is:
level1-1 - level2-1 - level3-1
                    - level3-2
         - level2-2 - level3-3
                    - level3-4
gets indexed as:
level1-1/level2-1/level3-1 +gap 100
level1-1/level2-1/level3-2 +gap 100
level1-1/level2-2/level3-3 +gap 100
level1-1/level2-2/level3-4
with a gap of 100 (or even 10) between. Your index will NOT grow
linearly with the tokens since there will be so many repeats
of the first couple of levels. This also gives you an easier way to
search for, say, all children of level1-1/level2-1 just by using
a prefix query.
Or you could think about *documents* being your level1, that is each
document has one and only one level1 element but many documents
may have the same level1 token. Combining this with your increment
gap notion for level2-3 might work for you.
Do note that Lucene has no requirement that all documents have
the same fields, so you can think about part of your documents
being your "level" documents with different fields than other
documents in your index....
You might also search the list for "hierarchical" or "tree" indexing,
this is a variant of such I think.
HTH
Erick
On Mon, Mar 15, 2010 at 9:59 AM, Rene Hackl-Sommer <re...@gmx.de> wrote:
> It is getting pretty huge, yes (see below). The term positions are also
> relative to a single field, aren't they?
> [...]
Re: Increase number of available positions?
Posted by Rene Hackl-Sommer <re...@gmx.de>.
Hi Steve,
> Why can't you use a different field for each of the Level_X's, i.e. MyLevel1Field, MyLevel2Field, MyLevel3Field?
>
Well, the hierarchical structure needs to be maintained. As hundreds of
Level_X entities can be found on levels 2 and 3, I need to be able to
tell for instance which Level_3 entities belong to which common Level_2
entity. Throwing all Level_3 entities in a field of their own would
remove this information, as far as I can see.
I was also thinking about adding Payloads at some point, but the main
caveat here is that the Payload data cannot be tested for equality at
search time. E.g. if I have a term T1 and add a payload that states this
term belongs to Level_3:200;Level_2:65;Level_1:44 and I have a term T2
with Level_3:200;Level_2:66;Level_1:44 I cannot state at search time
that I would like the number for Level_2 entities to be the same. I
could say Level_2 has to be 65, but I don't know that beforehand of
course. Or am I overlooking something here?
> On 03/15/2010 at 9:59 AM, Rene Hackl-Sommer wrote:
>
>>>> Search in MyField: Terms T1 and T2 on Level_2 and T3,
>>>> T4, and T5 on Level_3, which should both be in the
>>>> same Level_1.
>>>>
> I don't understand what you mean by "which should both be in the same Level_1". Can you give more details?
>
>
I guess my initial pseudo-XML construct might have been misleading, my
apologies. To be more precise, it is like this:
<!ELEMENT MyField (Level_1+) >
<!ELEMENT Level_1 (Level_2+) >
<!ELEMENT Level_2 (Level_3+) >
<!ELEMENT Level_3 (Terms+) >
<!ELEMENT Terms (#PCDATA) >
What I am adding to Lucene is a single Field MyField. I preprocess the
input string so that it looks like "Term1 Term2 endOfLevel_3 Term3 Term4
endOfLevel_3 Term4 Term5 endOfLevel_3 endOfLevel_2 Term8 Term9
endOfLevel_3 ...". Note the appearance of Level_2.
I use a custom Filter to switch the position increment as needed and as
indicated by the marker tokens. The marker tokens themselves don't get
indexed.
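To make the marker-token idea concrete, here is a plain-Java sketch of how positions could be assigned from such a preprocessed string. It is not the actual filter: the real one would be a Lucene TokenFilter, and how the gaps are combined is not stated above. This sketch assumes gaps of 1000 (Level_3) and 1,000,000 (Level_2), and assumes that when several markers follow each other only the largest gap is applied. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a marker-driven position filter; no Lucene involved.
final class MarkerGapSketch {
    static final int LEVEL3_GAP = 1_000;      // assumed gap between Level_3 entries
    static final int LEVEL2_GAP = 1_000_000;  // assumed gap between Level_2 entries

    // Returns "position term" strings; marker tokens are consumed, not emitted.
    static List<String> assignPositions(String input) {
        List<String> out = new ArrayList<>();
        int pos = -1;        // position of the last emitted term
        int pendingGap = 0;  // extra increment for the next real term
        for (String token : input.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields one empty token
            }
            if (token.equals("endOfLevel_3")) {
                pendingGap = Math.max(pendingGap, LEVEL3_GAP);
            } else if (token.equals("endOfLevel_2")) {
                pendingGap = Math.max(pendingGap, LEVEL2_GAP);
            } else {
                pos += 1 + pendingGap; // normal increment plus any pending gap
                pendingGap = 0;
                out.add(pos + " " + token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "Term1 Term2 endOfLevel_3 Term3 Term4 endOfLevel_3 "
                + "Term4 Term5 endOfLevel_3 endOfLevel_2 Term8 Term9 endOfLevel_3";
        assignPositions(input).forEach(System.out::println);
    }
}
```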
Cheers
Rene
RE: Increase number of available positions?
Posted by Steven A Rowe <sa...@syr.edu>.
Hi Rene,
Why can't you use a different field for each of the Level_X's, i.e. MyLevel1Field, MyLevel2Field, MyLevel3Field?
On 03/15/2010 at 9:59 AM, Rene Hackl-Sommer wrote:
> > > Search in MyField: Terms T1 and T2 on Level_2 and T3,
> > > T4, and T5 on Level_3, which should both be in the
> > > same Level_1.
I don't understand what you mean by "which should both be in the same Level_1". Can you give more details?
Steve
Re: Increase number of available positions?
Posted by Rene Hackl-Sommer <re...@gmx.de>.
> Is your entire corpus a single document? Because I'm having trouble
> imagining a single document where this would be a problem, unless
> your increment gap is huge. The term positions are relative to
> a single document...
>
It is getting pretty huge, yes (see below). The term positions are also
relative to a single field, aren't they?
>> <MyField>
>> <Level_1>
>> <Level_2>
>> <Level_3>
>>
>>
Let me plug in some figures to help clarify. On Level 3 there are
hundreds of tokens. So to be able to search two or more terms in MyField
in the same Level_3, I put a position gap of 1000 between all Level_3's.
Per Level_2 there might be hundreds of Level_3 entries. As I want to
restrict the search to all Level_3 entries of a Level_2, I set the
position increment gap for Level_2 at 1000x1000 = 1,000,000 (1000 for
the Tokens in Level_3 and 1000 for the Level_3 entries in Level_2).
This done, Level_1 still needs to be accommodated. If you're looking at
500 Level_2 entries, a gap of 1,000,000x500 is needed per Level_1 entry,
to be able to search only within each of the Level_1 elements. That way
only four Level_1 entries can be included before the maximum value is
reached.
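The arithmetic can be spelled out in a few lines. This is just a worked calculation with the figures from this message (1000 positions per Level_3, 1000 Level_3 slots per Level_2, 500 Level_2 slots per Level_1); the class and method names are made up.

```java
// Worked calculation only; names are hypothetical.
final class PositionBudget {

    // Positions reserved for a single Level_1 entry, computed in long to avoid overflow.
    static long positionsPerLevel1(long perLevel3, long level3PerLevel2, long level2PerLevel1) {
        return perLevel3 * level3PerLevel2 * level2PerLevel1;
    }

    // Number of complete Level_1 entries that fit below Integer.MAX_VALUE.
    static long maxLevel1Entries(long perLevel3, long level3PerLevel2, long level2PerLevel1) {
        return Integer.MAX_VALUE / positionsPerLevel1(perLevel3, level3PerLevel2, level2PerLevel1);
    }

    public static void main(String[] args) {
        System.out.println(positionsPerLevel1(1000, 1000, 500)); // 500000000
        System.out.println(maxLevel1Entries(1000, 1000, 500));   // 4
    }
}
```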
Queries I am looking to support might look like this in an easy case:
Search in MyField: Terms T1 and T2 on Level_2 and T3, T4, and T5 on
Level_3, which should both be in the same Level_1.
Sorry if this is confusing, what with all these levels going on. I think
what it comes down to is whether the integer based position counting
might be replaced by long. Can this be done at all? Are performance or
other implications conceivable? Or is the current implementation too
deeply wired to Lucene core workings to make this a reasonable endeavour?
Cheers
Rene
Re: Increase number of available positions?
Posted by Erick Erickson <er...@gmail.com>.
Is your entire corpus a single document? Because I'm having trouble
imagining a single document where this would be a problem, unless
your increment gap is huge. The term positions are relative to
a single document...
You say that your levels have fewer than 1,000 elements each. With
an increment gap of 100, you're only talking a total here of 300,000
as your increment gap "holes", so you've got room for, uhhhhmm, a lot
more tokens per document. If you're running over that limit, the
increment gap is the least of your problems <G>...
Of course I may be missing the point completely...
Erick
On Mon, Mar 15, 2010 at 5:03 AM, Rene Hackl-Sommer <re...@gmx.de> wrote:
> Hello,
> [...]