Posted to java-user@lucene.apache.org by Renaud Delbru <re...@deri.org> on 2010/02/09 13:04:44 UTC

Flex & Docs/AndPositionsEnum

Hi Michael,

I have updated my lucene-1458, and I discovered there were big
modifications to the StandardCodec interface.
I updated my own codecs to this new interface, but I have run into a
problem. My codecs create DocsAndPositionsEnum subclasses that give
access to more information than just the doc, freq and position
(I have other information encoded in the Prox file).
To use the additional interface that my classes provide, I was casting
the DocsAndPositionsEnum object returned by
IndexReader#termPositionsEnum() to the correct subclass. While this
approach worked in the previous flex branch, it no longer works with the
latest committed changes. In certain cases,
IndexReader#termPositionsEnum() does not return the DocsAndPositionsEnum
created by the StandardPostingsReader, but a MultiDocsAndPositionsEnum.
However, I can neither subclass MultiDocsAndPositionsEnum nor wrap it in
a decorator, because it is declared as 'private static final' in
DirectoryReader.
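
The failure mode described above can be reproduced in miniature. The following is only a sketch with stand-in classes (PostingsEnumSketch, CustomPostingsEnumSketch, MultiPostingsEnumSketch and ReaderSketch are all hypothetical names, not the real Lucene types): a reader hands back the codec's own enum for a single segment, but a wrapper for several segments, so a downcast to the codec-specific subclass only succeeds in the first case.

```java
// Stand-in types only: none of these are the real Lucene classes.
abstract class PostingsEnumSketch {
    abstract int nextDoc();                  // returns -1 when exhausted
}

// Hypothetical codec-specific enum that exposes extra per-position data.
class CustomPostingsEnumSketch extends PostingsEnumSketch {
    private final int[] docs;
    private int idx = -1;

    CustomPostingsEnumSketch(int... docs) { this.docs = docs; }

    @Override
    int nextDoc() { return ++idx < docs.length ? docs[idx] : -1; }

    int extraInfo() { return 42; }           // placeholder for codec-specific data
}

// Stand-in for an aggregating enum that concatenates per-segment enums.
final class MultiPostingsEnumSketch extends PostingsEnumSketch {
    private final PostingsEnumSketch[] subs;
    private final int[] docBases;
    private int current = 0;

    MultiPostingsEnumSketch(PostingsEnumSketch[] subs, int[] docBases) {
        this.subs = subs;
        this.docBases = docBases;
    }

    @Override
    int nextDoc() {
        while (current < subs.length) {
            int doc = subs[current].nextDoc();
            if (doc != -1) return doc + docBases[current];
            current++;                       // this segment is done, try the next
        }
        return -1;
    }
}

class ReaderSketch {
    // One segment: the codec's own enum comes back. Several: a wrapper does.
    static PostingsEnumSketch termPositionsEnum(int[] docBases,
                                                CustomPostingsEnumSketch... segments) {
        if (segments.length == 1) return segments[0];
        return new MultiPostingsEnumSketch(segments, docBases);
    }

    // True only when a cast to the codec-specific subclass would succeed.
    static boolean isCustom(PostingsEnumSketch e) {
        return e instanceof CustomPostingsEnumSketch;
    }
}
```

With one segment, ReaderSketch.isCustom(...) holds and extraInfo() is reachable through a cast; with two segments the returned object is the wrapper and the same cast would throw a ClassCastException.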

Are these classes (MultiTermEnum, MultiDocsAndPositionsEnum, etc.)
hidden intentionally? Or is there another way to extend
StandardCodec without having to deal with these classes?

Cheers
-- 
Renaud Delbru

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Synonym map

Posted by Simon Willnauer <si...@googlemail.com>.

Maybe I'm missing something, but what is wrong with SynonymTokenFilter in
contrib/wordnet?

simon

On Tue, Feb 9, 2010 at 5:03 PM, Ian Lea <ia...@gmail.com> wrote:
> Lucene in Action second edition has Synonym stuff that I think will
> work with lucene 3.0.
>
> Source code available from http://www.manning.com/hatcher3/
>
>
> --
> Ian.
>
>
> On Tue, Feb 9, 2010 at 2:03 PM, Marc Schwarz <in...@qboad.de> wrote:
>> Hi,
>>
>> I am trying to implement synonyms, but I don't know exactly how to do it
>> (lucene 3.0).
>>
>> Is anybody out there who has some small code snippets or a good link ?
>>
>> Thanks & Greetings,
>> Marc

Re: Synonym map

Posted by Ian Lea <ia...@gmail.com>.

Lucene in Action second edition has Synonym stuff that I think will
work with lucene 3.0.

Source code available from http://www.manning.com/hatcher3/


--
Ian.


On Tue, Feb 9, 2010 at 2:03 PM, Marc Schwarz <in...@qboad.de> wrote:
> Hi,
>
> I am trying to implement synonyms, but I don't know exactly how to do it
> (lucene 3.0).
>
> Is anybody out there who has some small code snippets or a good link ?
>
> Thanks & Greetings,
> Marc

Synonym map

Posted by Marc Schwarz <in...@qboad.de>.

Hi,

I am trying to implement synonyms, but I don't know exactly how to do it
(lucene 3.0).

Is anybody out there who has some small code snippets or a good link ?

Thanks & Greetings,
Marc




Re: Flex & Docs/AndPositionsEnum

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Thu, Feb 11, 2010 at 08:30:14AM -0500, Michael McCandless wrote:
> Oh you're saying we don't know if the underlying enum actually skipped vs
> just scanned?

Yep.

> Isn't the skip data also based on deltas?  

Yes, but that's internal to the skip reader, in both Lucene and Lucy/KS.  When
it comes time to skip, the skip reader's doc id is assigned directly, in both
libraries.  From StandardPostingsReaderImpl.java:

          doc = skipper.getDoc();

Trying to apply the skip reader's doc id information as a delta would get
quite complicated.  (A delta against...  what?)  I'm not sure that's even
possible.

> So even if real skipping happened, Lucy/KS would not "lose" the offset that
> the aggregator had previously added?  Or maybe I'm lost on what the issue is
> here...

It would indeed "lose" the offset, because the skip reader's doc id
information gets assigned directly rather than applied as a delta.

And since the aggregator layer is not aware of when this occurs, it cannot
intervene to re-apply the offset.
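
As a rough illustration of the point above, here is a toy model (all names hypothetical, and not the actual Lucy/KS or Lucene code): the segment-level list keeps one mutable docID that doubles as the base for delta decoding, the aggregator adds its docBase to that value in place, and a skip then overwrites it with an absolute, segment-local value.

```java
// Toy model only; SegPostingList here is a hypothetical stand-in.
class SkipOffsetSketch {

    static class SegPostingList {
        private final int[] deltas = {2, 3, 4};  // segment-local docs 2, 5, 9
        private int idx = -1;
        int docID = 0;                           // mutable, also the delta base

        // Normal iteration: the stored gap is added to the current value.
        void next() { docID += deltas[++idx]; }

        // Skipping: the skip entry's doc id is assigned directly.
        void skipToLast() { idx = 2; docID = 9; }
    }

    // Returns the doc ids seen by the consumer: after the first next(),
    // after the second next(), and after a skip.
    static int[] run(int docBase) {
        SegPostingList seg = new SegPostingList();

        seg.next();
        seg.docID += docBase;        // aggregator adds its offset in place
        int first = seg.docID;       // 2 + docBase

        seg.next();
        int second = seg.docID;      // 5 + docBase: a delta preserves the offset

        seg.skipToLast();
        int afterSkip = seg.docID;   // 9: the offset is gone

        return new int[] {first, second, afterSkip};
    }
}
```

With a docBase of 100, the consumer sees 102 and 105, and then 9 instead of 109 after the skip. Nothing in the aggregator tells it that a skip just happened.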

Having driven down this dead-end, turned around and come back, I've become
persuaded that requiring the segment-level postings iterator to be aware of
its consumer is not a good idea.

> > A generic aggregator wouldn't know that it needed to do that.  The postings
> > codec developer would be forced to write aggregation code in addition to
> > segment-level code.
> 
> Right, if position were not primitive but contained within an opaque
> (to the aggregator) object.  And, you were doing the flat positions
> space.
> 
> I guess... this restriction still seems academic... ie, not a real
> issue in Lucene.  

Not for the standard posting formats that Lucene offers.  But the point of
flex is to provide an extension framework, I thought.

Well, whatever.  It's just another place where Lucy and Lucene will part ways.

Marvin Humphrey

Re: Flex & Docs/AndPositionsEnum

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Wed, Feb 10, 2010 at 2:42 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Wed, Feb 10, 2010 at 12:33:27PM -0500, Michael McCandless wrote:
>
>> In Lucene, skipping is done through the aggregator,
>
> I had a look at MultiDocsEnum in the flex branch.  It doesn't know when
> a sub-enum is reading skip data.

I'm confused -- the MultiDocsEnum's advance method impl is the only
place where we invoke advance on the sub readers.  Oh you're saying we
don't know if the underlying enum actually skipped vs just scanned?

Isn't the skip data also based on deltas?  So even if real skipping
happened, Lucy/KS would not "lose" the offset that the aggregator had
previously added?  Or maybe I'm lost on what the issue is here...

>> > I suppose another possibility would have been to have the aggregator
>> > keep its own Posting and copy all data over from the
>> > SegPostingList's Posting on each iteration then add its offset.
>>
>> I think this is what Lucene does (?).  EG the aggregator holds its own
>> "int doc" which it must copy to (adding the offset) from the
>> underlying sub enum.
>
> That's fine for a *primitive* type.  Modifying an int returned by a sub-enum
> doesn't affect the sub-enum.  :)
>
> The problem arises when there's an opaque *object* conveying data to the
> consumer.  The aggregator knows everything there is to know about an int, but
> it doesn't know what it needs to do to prepare an opaque object owned by the
> sub-enum for consumption at the aggregate level.

OK.

>> > However, that would have been a lot less efficient, and it still
>> > wouldn't have worked for the "flat positions space" example because
>> > the generic aggregator would not have known about the needs of the
>> > specific codec.
>>
>> But aggregator could also add the positions offset on each
>> nextPosition() call, in Lucene.  Like that use case could be made to
>> work, if Lucene had used a flat position space.
>
> A generic aggregator wouldn't know that it needed to do that.  The postings
> codec developer would be forced to write aggregation code in addition to
> segment-level code.

Right, if position were not primitive but contained within an opaque
(to the aggregator) object.  And, you were doing the flat positions
space.

I guess... this restriction still seems academic... ie, not a real
issue in Lucene.  We use primitives in Lucene for doc/position, which
we can remap as needed.  We then require that opaque stuff (using
attributes) "survive", unchanged, when passed through the aggregator.
Either that, or, you enum segment by segment in the code.  I don't [yet]
see this as an issue for Lucene...

Mike

Re: Flex & Docs/AndPositionsEnum

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Wed, Feb 10, 2010 at 12:33:27PM -0500, Michael McCandless wrote:

> In Lucene, skipping is done through the aggregator,

I had a look at MultiDocsEnum in the flex branch.  It doesn't know when
a sub-enum is reading skip data.

> > I suppose another possibility would have been to have the aggregator
> > keep its own Posting and copy all data over from the
> > SegPostingList's Posting on each iteration then add its offset.
> 
> I think this is what Lucene does (?).  EG the aggregator holds its own
> "int doc" which it must copy to (adding the offset) from the
> underlying sub enum.

That's fine for a *primitive* type.  Modifying an int returned by a sub-enum
doesn't affect the sub-enum.  :)

The problem arises when there's an opaque *object* conveying data to the
consumer.  The aggregator knows everything there is to know about an int, but
it doesn't know what it needs to do to prepare an opaque object owned by the
sub-enum for consumption at the aggregate level.
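
The difference can be sketched in a few lines of Java. The classes below are hypothetical stand-ins, not KS or Lucene types: remapping a returned int works on a copy, whereas remapping a field of a shared Posting object changes the state the sub-enum itself relies on.

```java
// Hypothetical stand-ins used only to contrast the two cases.
class PrimitiveVsObjectSketch {

    static class Posting { int docID; }        // opaque carrier owned by the sub-enum

    static class SubEnum {
        final Posting posting = new Posting();

        int doc() { return posting.docID; }    // hands out a copy of the int
        Posting posting() { return posting; }  // hands out a shared reference
    }

    // Aggregating a primitive: the sub-enum's state is untouched.
    static int remapPrimitive(SubEnum sub, int docBase) {
        return sub.doc() + docBase;
    }

    // Aggregating through the object: the sub-enum's own state is modified.
    static Posting remapObject(SubEnum sub, int docBase) {
        Posting p = sub.posting();
        p.docID += docBase;
        return p;
    }
}
```

In the object case the aggregator also has to know which fields need remapping, which is exactly the codec-specific knowledge a generic aggregator lacks.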

> > However, that would have been a lot less efficient, and it still
> > wouldn't have worked for the "flat positions space" example because
> > the generic aggregator would not have known about the needs of the
> > specific codec.
> 
> But aggregator could also add the positions offset on each
> nextPosition() call, in Lucene.  Like that use case could be made to
> work, if Lucene had used a flat position space.

A generic aggregator wouldn't know that it needed to do that.  The postings
codec developer would be forced to write aggregation code in addition to
segment-level code.

Marvin Humphrey

Re: Flex & Docs/AndPositionsEnum

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Wed, Feb 10, 2010 at 8:27 AM, Marvin Humphrey <ma...@rectangular.com> wrote:

>> But why didn't you have the Multi*Enums layer add the offset (so
>> that the codec need not know who's consuming it)?  Performance?
>
> That would have involved something like this within the aggregator:
>
>    posting.setDocID(posting.getDocID() + docBase);
>
> The problem is that that's the docID the SegPostingList is using for
> its deltas.  If the SegPostingList skips during a call to advance(),
> it needs to reset that docID to what the skip data says -- but
> if the aggregator layer doesn't tell it that it needs to account for
> a docBase, the new docID will lose the offset.  Can't solve that
> problem at the aggregator level either -- the aggregator doesn't
> know when skipping is occurring, so it can't intervene on an
> as-needed basis.

In Lucene, skipping is done through the aggregator, so it knows that
it's skipping, and in fact skips whole segments at a time until it
gets to the segment that may contain the doc.
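
A minimal sketch of that idea, using plain arrays instead of real enums (MultiAdvanceSketch and its methods are hypothetical, and the inner loop merely stands in for a call to the sub-enum's advance): the aggregator first jumps to the segment whose doc id range may contain the target, then translates the target into that segment's local doc id space and adds the doc base back on the way out.

```java
// Hypothetical model: segmentDocs[i] holds the sorted, segment-local doc ids
// of segment i, and docBases[i] is the first global doc id of that segment.
class MultiAdvanceSketch {

    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Index of the last segment whose doc base is <= target.
    static int segmentFor(int target, int[] docBases) {
        int i = 0;
        while (i + 1 < docBases.length && docBases[i + 1] <= target) i++;
        return i;
    }

    // First global doc id >= target, or NO_MORE_DOCS.
    static int advance(int[][] segmentDocs, int[] docBases, int target) {
        // Whole segments before segmentFor(...) are skipped without being read.
        for (int seg = segmentFor(target, docBases); seg < segmentDocs.length; seg++) {
            int localTarget = Math.max(0, target - docBases[seg]);
            for (int doc : segmentDocs[seg]) {   // stands in for sub.advance(localTarget)
                if (doc >= localTarget) return doc + docBases[seg];
            }
        }
        return NO_MORE_DOCS;
    }
}
```

Because the offset is applied to the value returned by the sub-enum, the sub-enum never needs to know about the doc base, whether it skipped or scanned internally.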

> The fix was to make SegPostingList aware of a docBase, so that on
> skipping it could add it to the docID in the skip data and land at
> the right docID from the perspective of the consumer.  Messy.

OK

> I suppose another possibility would have been to have the aggregator
> keep its own Posting and copy all data over from the
> SegPostingList's Posting on each iteration then add its offset.

I think this is what Lucene does (?).  EG the aggregator holds its own
"int doc" which it must copy to (adding the offset) from the
underlying sub enum.

> However, that would have been a lot less efficient, and it still
> wouldn't have worked for the "flat positions space" example because
> the generic aggregator would not have known about the needs of the
> specific codec.

But aggregator could also add the positions offset on each
nextPosition() call, in Lucene.  Like that use case could be made to
work, if Lucene had used a flat position space.

>> > That example may not be a deal breaker for you, but I'm not
>> > willing to guarantee that Lucy will always return primitives from
>> > these enums, now and forever, one per method call.
>>
>> But it'd be a major API change down the road to change this, for
>> Lucy/KS?
>
> I suppose so.  It's either foreclose on the possibility of aggregating (Lucy),
> or foreclose on the possibility of using properties that cannot be aggregated
> (Lucene).

Right, though... if this even happens in practice for some future app,
that app can choose to avoid Multi*Enum.  Lucene internally doesn't
use Multi*Enum (except during merging, which your codec can
override, as of flex).

>> Also, this is why we're adding Attribute* to all the postings enums,
>> with flex -- any codec & consumer can use their own private
>> attributes.  The attrs pass through Multi*Enum.
>
> Hmm.  Does that mean that the consumer needs to refresh the attributes with
> each iteration?  Because what happens when you switch sub-enums within the
> Multi*Enum?  Don't those attributes go stale, as they belong to a sub-enum
> that has finished?

Switching sub-enums is indeed tricky (we're iterating on this in
LUCENE-2154).  Our current plan is to pass an attr source (maps attr
interface to an actual instance that implements it) to each sub-enum,
meaning, all codecs being aggregated must be able to use the same attr
impl.

So the consumer gets a single instance of TupleAttribute and steps through
the enum, calling TupleAttribute.get() each time, regardless of
whether it's an aggregated or non-aggregated enum.
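
The plan described above might look roughly like the following sketch. Everything here is an assumption made for illustration: TupleAttribute is modelled as a simple int holder, and SubEnum and MultiEnum are hypothetical stand-ins rather than the classes being worked on in LUCENE-2154. The key point is that every sub-enum writes into the same attribute instance, so the consumer's reference never goes stale when the aggregator switches segments.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins only, not the real flex API.
class SharedAttributeSketch {

    static class TupleAttribute {
        private int value;
        int get() { return value; }
        void set(int v) { value = v; }
    }

    // Each sub-enum is handed the attribute instance it must populate.
    static class SubEnum {
        private final TupleAttribute attr;
        private final int[] tuples;
        private int idx = -1;

        SubEnum(TupleAttribute shared, int... tuples) {
            this.attr = shared;
            this.tuples = tuples;
        }

        boolean next() {
            if (++idx >= tuples.length) return false;
            attr.set(tuples[idx]);               // write into the shared instance
            return true;
        }
    }

    static class MultiEnum {
        private final SubEnum[] subs;
        private int current = 0;

        MultiEnum(SubEnum... subs) { this.subs = subs; }

        boolean next() {
            while (current < subs.length) {
                if (subs[current].next()) return true;
                current++;                       // switch to the next sub-enum
            }
            return false;
        }
    }

    // The consumer holds one attribute reference for the whole iteration.
    static List<Integer> collect() {
        TupleAttribute attr = new TupleAttribute();
        MultiEnum multi = new MultiEnum(new SubEnum(attr, 1, 2), new SubEnum(attr, 3));
        List<Integer> seen = new ArrayList<>();
        while (multi.next()) seen.add(attr.get());
        return seen;
    }
}
```

The cost of this design, as noted above, is that all codecs being aggregated must be able to work with the same attribute implementation.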

Mike

Re: Flex & Docs/AndPositionsEnum

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Wed, Feb 10, 2010 at 06:58:01AM -0500, Michael McCandless wrote:
> But why didn't you have the Multi*Enums layer add the offset (so that
> the codec need not know who's consuming it)?  Performance?

That would have involved something like this within the aggregator:

    posting.setDocID(posting.getDocID() + docBase);

The problem is that that's the docID the SegPostingList is using for its
deltas.  If the SegPostingList skips during a call to advance(), it needs to
reset that docID to what the skip data says -- but if the aggregator layer
doesn't tell it that it needs to account for a docBase, the new docID will
lose the offset.  Can't solve that problem at the aggregator level either --
the aggregator doesn't know when skipping is occurring, so it can't intervene
on an as-needed basis. 

The fix was to make SegPostingList aware of a docBase, so that on skipping it
could add it to the docID in the skip data and land at the right docID from
the perspective of the consumer.  Messy.

I suppose another possibility would have been to have the aggregator keep its
own Posting and copy all data over from the SegPostingList's Posting on each
iteration then add its offset.  However, that would have been a lot less
efficient, and it still wouldn't have worked for the "flat positions space"
example because the generic aggregator would not have known about the needs of
the specific codec.

> > That example may not be a deal breaker for you, but I'm not willing
> > to guarantee that Lucy will always return primitives from these
> > enums, now and forever, one per method call.
> 
> But it'd be a major API change down the road to change this, for
> Lucy/KS?  

I suppose so.  It's either foreclose on the possibility of aggregating (Lucy),
or foreclose on the possibility of using properties that cannot be aggregated
(Lucene).

> Also, this is why we're adding Attribute* to all the postings enums,
> with flex -- any codec & consumer can use their own private
> attributes.  The attrs pass through Multi*Enum.

Hmm.  Does that mean that the consumer needs to refresh the attributes with
each iteration?  Because what happens when you switch sub-enums within the
Multi*Enum?  Don't those attributes go stale, as they belong to a sub-enum
that has finished?

Marvin Humphrey

Re: Flex & Docs/AndPositionsEnum

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Tue, Feb 9, 2010 at 4:44 PM, Marvin Humphrey <ma...@rectangular.com> wrote:

>> Interesting... and segment merging just does its own private
>> concatenation/mapping-around-deletes of the doc/positions?
>
> I think the answer is yes, but I'm not sure I understand the
> question completely since I'm not sure why you'd ask that in this
> context.

Segment merging is one place that "legitimately" needs to append
docs/positions enum of multiple sub readers... but obviously it can
just do this itself (and it must, since it renumbers the docIDs).

>> what's a "flat positions space"?
>
> It's something Google once used.  Instead of positions starting with
> 0 at each document, they just keep going.
>
>  doc 1:  "Three Blind Mice"           - positions 0, 1, 2
>  doc 2:  "Peter Peter Pumpkin Eater"  - positions 3, 4, 5, 6
>
>> And we don't return "objects or aggregates" with Multi*Enum now...
>
> Yeah, this is different.  In KS right now, we use a generic
> PostingList, which conveys different information depending on what
> class of Posting it contains.

OK

>> In flex right now the codec is unaware that it's being "consumed" by
>> a Multi*Enum.
>
> Right, but in KinoSearch's case PostingList had to be aware of that
> because the Posting object could be consumed at either the segment
> level or the index level -- so it needed a setDocBase(offset) method
> which adjusted the doc num in the Posting.  It was messy.
>
> The change I made was to eliminate PolyPostingList and
> PolyPostingListReader, which made it possible to remove the
> setDocBase() method from SegPostingList.

But why didn't you have the Multi*Enums layer add the offset (so that
the codec need not know who's consuming it)?  Performance?

>> It still returns primitives.  If instead we returned an int[] for
>> positions (hmm -- may be a good reason to make positions be an
>> Attribute, Uwe), I think it would still be OK?
>
> In the flat positions space example, it would be necessary to add an
> offset to each of the positions in that array.  Each segment would
> have a "positions max" analogous to maxDoc(); these would be summed
> to obtain the positions offset the same way we add up maxDoc() now
> to obtain the doc id offset.

OK, but [so far] we don't have that problem with the flex APIs -- the
codec is not aware that there's a multi enum layer consuming it.

> That example may not be a deal breaker for you, but I'm not willing
> to guarantee that Lucy will always return primitives from these
> enums, now and forever, one per method call.

But it'd be a major API change down the road to change this, for
Lucy/KS?  Ie this example seems not to apply to Lucene, and even for
KS/Lucy seems contrived -- neither Lucene nor KS/Lucy would/could up
and make such a major API change to the enums, once "committed".

Also, this is why we're adding Attribute* to all the postings enums,
with flex -- any codec & consumer can use their own private
attributes.  The attrs pass through Multi*Enum.

>> Still torn... I think it's convenience vs performance.
>
> But convenience for the posting format plugin developer matters too,
> right?

Right, but the existence of Multi*Enums isn't affecting the codec dev
(so far, I think).

> Are you confident that a generic aggregator can support all possible
> codecs, or will plugin developers be forced to ensure that
> aggregation works because you've guaranteed to users like Renaud
> that it will?

Well... pretty confident.  So far, at least?  We have an existence
proof :) The codec API really should not (and, should not have to)
bake in details of who's consuming it.

Mike

Re: Flex & Docs/AndPositionsEnum

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Wed, Feb 10, 2010 at 9:47 AM, Renaud Delbru <re...@deri.org> wrote:
> On 10/02/10 13:15, Uwe Schindler wrote:
>>>
>>> Could you provide pointers to search code that uses the segment-level
>>> enum?
>>> As I explained in my last answer to Michael, the TermScorer uses the
>>> DocsEnum interface, and therefore does not know whether it is handling
>>> a segment-level enum or a Multi*Enum. Which searches (or query
>>> operators) in Lucene use segment-level enums?
>>>
>>
>> All of them, only rewrites are currently done on the top-level reader.
>> IndexSearcher since 2.9 creates Scorers separately for each segment and
>> merges the results in its collector. Because of that we have a modified
>> Collector interface that has setNextReader() methods and so on.
>>
>
> Ok, so for example, in TermQuery$TermWeight#scorer(reader, scoreDocsInOrder,
> topScorer), the reader passed as parameter is one of the sub-readers? Is
> that right?

Right, it will be a SegmentReader.

But, you're right -- the scorer method will also accept a
Multi/DirectoryReader, and iterate a Multi*Enum in that case.  It's
just less performant, so, Lucene doesn't do that when it creates
scorers.

Mike

Re: Flex & Docs/AndPositionsEnum

Posted by Renaud Delbru <re...@deri.org>.

On 10/02/10 13:15, Uwe Schindler wrote:
>> Could you provide pointers to search code that uses the segment-level
>> enum?
>> As I explained in my last answer to Michael, the TermScorer uses the
>> DocsEnum interface, and therefore does not know whether it is handling
>> a segment-level enum or a Multi*Enum. Which searches (or query
>> operators) in Lucene use segment-level enums?
>>
> All of them, only rewrites are currently done on the top-level reader. IndexSearcher since 2.9 creates Scorers separately for each segment and merges the results in its collector. Because of that we have a modified Collector interface that has setNextReader() methods and so on.
>
Ok, so for example, in TermQuery$TermWeight#scorer(reader,
scoreDocsInOrder, topScorer), the reader passed as parameter is one of
the sub-readers? Is that right?

If this is the case, I now understand why Michael was saying that the
way I am testing the postings (using termPositionsEnum on the top-level
reader) was not really the proper way to test it, and that the correct
way would instead be to use a TermQuery directly.

Thanks for the clarification.
-- 
Renaud Delbru

RE: Flex & Docs/AndPositionsEnum

Posted by Uwe Schindler <uw...@thetaphi.de>.

> Could you provide pointers to search code that uses the segment-level
> enum?
> As I explained in my last answer to Michael, the TermScorer uses the
> DocsEnum interface, and therefore does not know whether it is handling
> a segment-level enum or a Multi*Enum. Which searches (or query
> operators) in Lucene use segment-level enums?

All of them, only rewrites are currently done on the top-level reader. IndexSearcher since 2.9 creates Scorers separately for each segment and merges the results in its collector. Because of that we have a modified Collector interface that has setNextReader() methods and so on.

So you can assume that every Scorer uses a SegmentReader, but legacy code may behave differently (for example, if somebody instantiates a TermScorer and passes the top-level reader to it). Also, Solr is not yet completely free of global readers (as far as I know).
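
The per-segment flow described above can be sketched with stand-in types. This is a simplified model, not the real IndexSearcher or Collector: the hypothetical HitCollector below only mimics the idea of a setNextReader-style callback that tells the collector the doc base of the segment whose hits are about to arrive.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of per-segment search; all names are hypothetical.
class PerSegmentSearchSketch {

    static class HitCollector {
        private int docBase;
        final List<Integer> hits = new ArrayList<>();

        // Called once before the hits of each segment are delivered.
        void setNextReader(int docBase) { this.docBase = docBase; }

        // Receives segment-local doc ids and rebases them to global ids.
        void collect(int segmentLocalDoc) { hits.add(docBase + segmentLocalDoc); }
    }

    // matchesPerSegment[i]: segment-local doc ids matched in segment i.
    // maxDocPerSegment[i]:  number of doc ids allocated in segment i.
    static List<Integer> search(int[][] matchesPerSegment, int[] maxDocPerSegment) {
        HitCollector collector = new HitCollector();
        int docBase = 0;
        for (int seg = 0; seg < matchesPerSegment.length; seg++) {
            collector.setNextReader(docBase);    // one "scorer" per segment
            for (int doc : matchesPerSegment[seg]) collector.collect(doc);
            docBase += maxDocPerSegment[seg];    // next segment starts here
        }
        return collector.hits;
    }
}
```

In this arrangement the scoring side only ever sees segment-local enums, which is why a codec-specific enum is not hidden behind a Multi*Enum during normal searching.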


Re: Flex & Docs/AndPositionsEnum

Posted by Renaud Delbru <re...@deri.org>.

On 10/02/10 09:47, Uwe Schindler wrote:
> Positions as attributes would be good. For positions we need a new Attribute (not PositionIncrement), but e.g. for offsets and payloads we can use the standard attributes from the analysis, which is really cool. This would also make it possible to add all custom attributes from the analysis phase to the posting list and make them visible in the TermDocs enum. In my opinion, there should be no DocsEnum, DocsAndPositionsEnum and so on enums, just one class, which differs only in the attributes it provides. So if you want the payloads, ask for a standard DocsEnum and pass the requested attribute classes as parameters:
> 	IndexReader.termDocsEnum(Bits skipDocs, String field, BytesRef term, Class<? extends Attribute>... atts)
>
> If somebody wants offsets and payloads:
> 	reader.termDocsEnum(skipDocs, "field", term, OffsetAttribute.class, PayloadAttribute.class);
>    
I kind of like this idea. This interface for iterating over the postings
looks more flexible, and imho it will be easy to use with any
"home-brewed" codec.
Read optimisations based on the user's needs, such as the current
termDocsEnum and termPositionsEnum (where the first reads only the freq
file and the second also reads the prox file), will be done under
the hood by the respective PostingReader. Given the set of Attribute
classes received, the PostingReader knows what it needs to read and what
it does not. This also simplifies the interface for the user, who no
longer has to take care of choosing the right enum.
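
To make the idea concrete, here is a small sketch of how a postings reader might decide what to open from the requested attribute classes. It is purely illustrative: the Attribute and PositionAttribute interfaces and the filesToRead method are hypothetical and are not part of the proposed API; only the freq/prox split is taken from the discussion above.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of attribute-driven read decisions.
class AttributeDrivenReaderSketch {

    interface Attribute {}
    interface PositionAttribute extends Attribute {}   // stand-in for a positions attribute

    // Decide which postings files must be read for the requested attributes.
    @SafeVarargs
    static Set<String> filesToRead(Class<? extends Attribute>... atts) {
        Set<String> files = new TreeSet<>();
        files.add("freq");                             // docs and freqs are always needed
        for (Class<? extends Attribute> att : atts) {
            if (att == PositionAttribute.class) {
                files.add("prox");                     // positions live in the prox file
            }
        }
        return files;
    }
}
```

A caller that passes no attribute classes gets only the freq file; adding PositionAttribute.class brings in the prox file, without the caller having to pick a different enum type.
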
> I am not sure if this is very good in Lucene as it would break lots of apps. E.g. simple autocompletes use a PrefixTerm(s)Enums, but must use the top-level reader or they have to emulate merging of all TermsEnums themselves. A second problem (currently) is rewrites (e.g. Fuzzy) to BooleanQuery for MTQs. They operate on the top level reader.
>
> So I propose "simple" and not so performant Enums for MultiReaders. In my opinion, it would also be possible without ProxyAttributes, if we simply copy them around. It’s a performance problem, but if somebody needs speed, segment-level enums should be used (and search does this by the way).
>    
Could you provide pointers to search code that uses the segment-level
enum?
As I explained in my last answer to Michael, the TermScorer uses the
DocsEnum interface, and therefore does not know whether it is handling a
segment-level enum or a Multi*Enum. Which searches (or query operators)
in Lucene use segment-level enums?

Cheers
-- 
Renaud Delbru

RE: Flex & Docs/AndPositionsEnum

Posted by Uwe Schindler <uw...@thetaphi.de>.

> > And we don't return "objects or aggregates" with Multi*Enum now...
> 
> Yeah, this is different.  In KS right now, we use a generic PostingList,
> which conveys different information depending on what class of Posting it
> contains.
> 
> > In flex right now the codec is unaware that it's being "consumed" by a
> > Multi*Enum.
> 
> Right, but in KinoSearch's case PostingList had to be aware of that because
> the Posting object could be consumed at either the segment level or the
> index level -- so it needed a setDocBase(offset) method which adjusted the
> doc num in the Posting.  It was messy.

The doc base adaptation is done in the MultiDocsEnum in Lucene.

> The change I made was to eliminate PolyPostingList and PolyPostingListReader,
> which made it possible to remove the setDocBase() method from SegPostingList.
> 
> > It still returns primitives.  If instead we returned an int[] for positions
> > (hmm -- may be a good reason to make positions be an Attribute, Uwe), I
> > think it would still be OK?

Positions as attributes would be good. For positions we need a new Attribute (not PositionIncrement), but e.g. for offsets and payloads we can use the standard attributes from the analysis, which is really cool. This would also make it possible to add all custom attributes from the analysis phase to the posting list and make them visible in the TermDocs enum. In my opinion, there should be no DocsEnum, DocsAndPositionsEnum and so on enums, just one class, which differs only in the attributes it provides. So if you want the payloads, ask for a standard DocsEnum and pass the requested attribute classes as parameters:
	IndexReader.termDocsEnum(Bits skipDocs, String field, BytesRef term, Class<? extends Attribute>... atts)

If somebody wants offsets and payloads:
	reader.termDocsEnum(skipDocs, "field", term, OffsetAttribute.class, PayloadAttribute.class);

But before we can implement this for MultiEnums we need the Proxy attributes or we need to copy them around (and the MultiEnums get their own AttributeSource). For this to work I will add an AttributeSource.copyTo(AttributeSource), which is on my to-do list but still missing. For some TokenStreams this method may also be convenient (e.g. concatenating TokenStreams).

On the other hand: with Proxy attributes, concatenating TokenStreams is easy (and very performant!), too.

> > You should (when possible/reasonable) instead use
> > ReaderUtil.gatherSubReaders, then iterate through those sub readers
> > asking each for its flex fields.
> >
> > But if this is only for testing purposes, and Multi*Enum is more
> > convenient (and, once attrs work correctly), then Multi*Enum is
> > perfectly fine.
> 
> Mike, FWIW, I've removed the ability to iterate over posting data at
> anything other than the segment level from KS.  There's still a
> priority-queue-based aggregator for iterating over all terms in a
> multi-segment index, but not for anything lower.

I am not sure if this is very good in Lucene as it would break lots of apps. E.g. simple autocompletes use a PrefixTerm(s)Enums, but must use the top-level reader or they have to emulate merging of all TermsEnums themselves. A second problem (currently) is rewrites (e.g. Fuzzy) to BooleanQuery for MTQs. They operate on the top level reader.

So I propose "simple" and not so performant Enums for MultiReaders. In my opinion, it would also be possible without ProxyAttributes, if we simply copy them around. It’s a performance problem, but if somebody needs speed, segment-level enums should be used (and search does this by the way).

Uwe



Re: Flex & Docs/AndPositionsEnum

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Tue, Feb 09, 2010 at 03:47:19PM -0500, Michael McCandless wrote:

> Interesting... and segment merging just does its own private
> concatenation/mapping-around-deletes of the doc/positions?

I think the answer is yes, but I'm not sure I understand the question
completely since I'm not sure why you'd ask that in this context.

> what's a "flat positions space"?  

It's something Google once used.  Instead of positions starting with 0 at each
document, they just keep going.

  doc 1:  "Three Blind Mice"           - positions 0, 1, 2
  doc 2:  "Peter Peter Pumpkin Eater"  - positions 3, 4, 5, 6

> And we don't return "objects or aggregates" with Multi*Enum now...

Yeah, this is different.  In KS right now, we use a generic PostingList, which
conveys different information depending on what class of Posting it contains.

> In flex right now the codec is unaware that it's being "consumed" by a
> Multi*Enum.  

Right, but in KinoSearch's case PostingList had to be aware of that because
the Posting object could be consumed at either the segment level or the index
level -- so it needed a setDocBase(offset) method which adjusted the doc num in
the Posting.  It was messy.

The change I made was to eliminate PolyPostingList and PolyPostingListReader,
which made it possible to remove the setDocBase() method from SegPostingList.

> It still returns primitives.  If instead we returned an int[] for positions
> (hmm -- may be a good reason to make positions be an Attribute, Uwe), I
> think it would still be OK?

In the flat positions space example, it would be necessary to add an offset to
each of the positions in that array.  Each segment would have a "positions
max" analogous to maxDoc(); these would be summed to obtain the positions
offset the same way we add up maxDoc() now to obtain the doc id offset.
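
The offset arithmetic described here can be sketched in a few lines. This is an illustration under assumptions, not KinoSearch or Lucene code: `FlatSpace`, `bases` and `rebase` are hypothetical names, and each segment is assumed to expose a single "positions max" (or maxDoc) value.

```java
class FlatSpace {
  // Prefix sums: base[i] is the sum of the sizes of all segments before segment i.
  // The same computation works for maxDoc() values and for per-segment "positions max".
  static int[] bases(int[] perSegmentMax) {
    int[] base = new int[perSegmentMax.length];
    int sum = 0;
    for (int i = 0; i < perSegmentMax.length; i++) {
      base[i] = sum;
      sum += perSegmentMax[i];
    }
    return base;
  }

  // Rebases a segment-local array of positions into the index-wide flat space.
  static int[] rebase(int[] localPositions, int positionsBase) {
    int[] out = new int[localPositions.length];
    for (int i = 0; i < out.length; i++) {
      out[i] = localPositions[i] + positionsBase;
    }
    return out;
  }
}
```

If the two example documents above lived in two one-document segments with positions max 3 and 4, the second segment's base would be 3, and its local positions 0..3 would map to 3..6. The point of the objection is that an aggregator returning arrays has to touch every element like this, whereas a primitive doc id needs only one addition.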

That example may not be a deal breaker for you, but I'm not willing to
guarantee that Lucy will always return primitives from these enums, now and
forever, one per method call.

> Still torn... I think it's convenience vs performance.  

But convenience for the posting format plugin developer matters too, right?
Are you confident that a generic aggregator can support all possible codecs,
or will plugin developers be forced to ensure that aggregation works because
you've guaranteed to users like Renaud that it will?

Marvin Humphrey


Re: Flex & Docs/AndPositionsEnum

Posted by Renaud Delbru <re...@deri.org>.

Hi Michael,

On 09/02/10 20:47, Michael McCandless wrote:
> But, then, it's very convenient when you need it and don't care about
> performance.  EG in Renaud's usage, a test case that is trying to
> assert that all indexed docs look right, why should you be forced to
> operate per segment?  He shouldn't have to bother with the details of
> which field/term/doc was indexed into which segment.
>
> Or, I guess we could argue that this test really should create a
> TermQuery and walk the matching docs... instead of using the low level
> flex enum APIs.  Because searching impl already knows how to step
> through the segments.
>    
In fact, I do care about performance, but I was using 
IndexReader.termPositionsEnum to mimic the implementation of the 
different query scorers (e.g., TermScorer).
I have already reimplemented many of the original Lucene scorers to use 
my particular index structure. From what I have seen, the main low-level 
scorers (e.g., TermScorer, PhraseScorer) use the DocsEnum 
interface, and not a segment-level enum. From what I understand, these 
scorers are not aware of whether they are using a segment-level enum or a 
Multi*Enum. So is there a loss of performance in this case? Or am I 
missing something?

I'll try to clarify my usage of the Flex API; maybe it will highlight 
certain aspects for you.
In the ideal world, what I would like to do is the following:
1) write my own codec,
2) register my codec in the IndexWriter, and tell it to use this codec 
for one or more fields (similar to the PerFieldCodecWrapper),
3) write query operators that are compatible with my codec,
4) at search time, use these query operators with the fields that use my 
codec.

If, by mistake, I use query operators that are not compatible 
with a field (and its related codec), an exception is thrown telling me 
that I cannot use these query operators with this field.
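
The compatibility check wished for here could look roughly like the following. It is a hypothetical sketch only: `FieldCodecRegistry` and its methods are invented for illustration and do not correspond to PerFieldCodecWrapper or any other flex class, and codecs are identified by plain name strings as a simplifying assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-field codec registry, identified by codec name.
class FieldCodecRegistry {
  private final Map<String, String> codecByField = new HashMap<>();

  void register(String field, String codecName) {
    codecByField.put(field, codecName);
  }

  // Throws if the field was not indexed with the codec the query operator expects.
  void checkCompatible(String field, String requiredCodec) {
    String actual = codecByField.get(field);
    if (!requiredCodec.equals(actual)) {
      throw new UnsupportedOperationException(
          "Field '" + field + "' uses codec " + actual
              + ", but this query operator requires " + requiredCodec);
    }
  }
}
```

A codec-specific query operator would call `checkCompatible` once when it is bound to a field, so the failure surfaces before any postings are read.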

So, in my current use case, I don't think it is necessary to be aware of 
the fact that I am manipulating multiple segments or only one segment. 
I think this should be hidden.

But what you were suggesting is to create my own "MultiReader" that is 
optimised for my codec. Is that right? A MultiReader that just iterates 
over the subreaders, checks if they are using my codec (and therefore 
associated fields), and uses them to iterate over my own postings?
-- 
Renaud Delbru


Re: Flex & Docs/AndPositionsEnum

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Tue, Feb 9, 2010 at 1:12 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Tue, Feb 09, 2010 at 11:51:31AM -0500, Michael McCandless wrote:
>
>> You should (when possible/reasonable) instead use
>> ReaderUtil.gatherSubReaders, then iterate through those sub readers
>> asking each for its flex fields.
>
>> But if this is only for testing purposes, and Multi*Enum is more
>> convenient (and, once attrs work correctly), then Multi*Enum is
>> perfectly fine.
>
> Mike, FWIW, I've removed the ability to iterate over posting data at
> anything other than the segment level from KS.  There's still a
> priority-queue-based aggregator for iterating over all terms in a
> multi-segment index, but not for anything lower.

Interesting... and segment merging just does its own private
concatenation/mapping-around-deletes of the doc/positions?

I'm torn on the Multi*Enum.... it's easy to get one "by accident"
(because you're interacting with multi reader) and as a result take a
silent performance hit.  And often the caller can easily change to
operate per segment instead.

But, then, it's very convenient when you need it and don't care about
performance.  EG in Renaud's usage, a test case that is trying to
assert that all indexed docs look right, why should you be forced to
operate per segment?  He shouldn't have to bother with the details of
which field/term/doc was indexed into which segment.

Or, I guess we could argue that this test really should create a
TermQuery and walk the matching docs... instead of using the low level
flex enum APIs.  Because searching impl already knows how to step
through the segments.

Anyway, my current patch on LUCENE-2111 reflects my torn-ness: it
makes it just a bit harder to get Multi*Enum on a multi-reader.  If
you call MultiReader.fields(), it throws
UnsupportedOperationException, and you must instead use
MultiFields.getXXXEnum to explicitly create the enum.

> Forcing pluggable index formats to support the extra level of indirection
> necessary for iterating postings from a high level both introduces
> inefficiency and constrains their development.  Consider what would happen if
> we tried indexed terms within a flat positions space and returned an array of
> positions instead of one position at a time.  The instant you return objects
> or aggregates rather than primitives, you force support for offsets down into
> the low-level decoder.

I don't understand this example -- can you give more detail?  Eg,
what's a "flat positions space"?  And "force support for offsets".
And we don't return "objects or aggregates" with Multi*Enum now...

In flex right now the codec is unaware that it's being "consumed" by a
Multi*Enum.  It still returns primitives.  If instead we returned an
int[] for positions (hmm -- may be a good reason to make positions be
an Attribute, Uwe), I think it would still be OK?

> It's not really necessary to iterate aggregated postings across multiple
> segments, so IMO it's best to shunt users like Renaud towards the segment
> level.

Still torn... I think it's convenience vs performance.  But I
want convenience to be an explicit choice.  We shouldn't default our
APIs to a silent perf hit...

Mike


Re: Flex & Docs/AndPositionsEnum

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Tue, Feb 09, 2010 at 11:51:31AM -0500, Michael McCandless wrote:

> You should (when possible/reasonable) instead use
> ReaderUtil.gatherSubReaders, then iterate through those sub readers
> asking each for its flex fields.
> 
> But if this is only for testing purposes, and Multi*Enum is more
> convenient (and, once attrs work correctly), then Multi*Enum is
> perfectly fine.

Mike, FWIW, I've removed the ability to iterate over posting data at anything
other than the segment level from KS.  There's still a priority-queue-based
aggregator for iterating over all terms in a multi-segment index, but not for
anything lower.  
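
For readers unfamiliar with the term-level aggregator mentioned here, a priority-queue merge over per-segment sorted term lists can be sketched as below. This is a generic illustration, not KinoSearch code: `TermMerge`, `Cursor` and `mergeTerms` are hypothetical names, and terms are plain strings rather than term dictionary entries.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class TermMerge {
  // One cursor per segment: the current term plus the rest of that segment's sorted terms.
  private static final class Cursor {
    String term;
    final Iterator<String> rest;

    Cursor(Iterator<String> it) {
      this.rest = it;
      this.term = it.next();
    }
  }

  // Merges per-segment sorted term lists into one sorted list of distinct terms.
  static List<String> mergeTerms(List<List<String>> segments) {
    PriorityQueue<Cursor> pq = new PriorityQueue<>((a, b) -> a.term.compareTo(b.term));
    for (List<String> seg : segments) {
      if (!seg.isEmpty()) {
        pq.add(new Cursor(seg.iterator()));
      }
    }
    List<String> out = new ArrayList<>();
    while (!pq.isEmpty()) {
      Cursor top = pq.poll();
      // Emit each term once, even if several segments contain it.
      if (out.isEmpty() || !out.get(out.size() - 1).equals(top.term)) {
        out.add(top.term);
      }
      if (top.rest.hasNext()) {
        top.term = top.rest.next();
        pq.add(top);
      }
    }
    return out;
  }
}
```

Merging terms this way only compares keys; it never has to reinterpret codec-specific posting data, which is why it can stay generic while posting iteration is kept at the segment level.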

Forcing pluggable index formats to support the extra level of indirection
necessary for iterating postings from a high level both introduces
inefficiency and constrains their development.  Consider what would happen if
we tried to index terms within a flat positions space and returned an array of
positions instead of one position at a time.  The instant you return objects
or aggregates rather than primitives, you force support for offsets down into
the low-level decoder.

It's not really necessary to iterate aggregated postings across multiple
segments, so IMO it's best to shunt users like Renaud towards the segment
level.

Marvin Humphrey


Re: Flex & Docs/AndPositionsEnum

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Tue, Feb 9, 2010 at 11:35 AM, Renaud Delbru <re...@deri.org> wrote:

>> This particular patch doesn't change the Codecs API -- it "only"
>> factors out the Multi* APIs from MultiReader.  Likely you won't need
>> to change your codec... but try applying the patch and see :)
>>
>
> Ok, good news ;o).

Flex is still in flux, though :)

>> However: if you consume the flex API directly, on top of multi readers
>> (something you shouldn't do, for performance reasons), you will have
>> to use MultiField's static methods to get the enums.
>
> In my previous example (registering my codec in IndexWriter, and then use
> IndexReader), do I consume the flex API directly on top of the multi-readers
> directly ? If yes, how to avoid that ?

You should (when possible/reasonable) instead use
ReaderUtil.gatherSubReaders, then iterate through those sub readers
asking each for its flex fields.

But if this is only for testing purposes, and Multi*Enum is more
convenient (and, once attrs work correctly), then Multi*Enum is
perfectly fine.
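
The per-segment approach recommended above amounts to walking the sub readers in order and adding a running doc base. The sketch below models only that arithmetic; it does not call ReaderUtil.gatherSubReaders or any flex API, and `ToySegment` and `PerSegmentWalk` are hypothetical stand-ins for a segment reader and the consuming code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a segment reader: maxDoc plus the local doc ids matching one term.
class ToySegment {
  final int maxDoc;
  final int[] docsForTerm;

  ToySegment(int maxDoc, int... docsForTerm) {
    this.maxDoc = maxDoc;
    this.docsForTerm = docsForTerm;
  }
}

class PerSegmentWalk {
  // Visits each segment in order and maps segment-local doc ids to index-wide ids.
  static List<Integer> globalDocs(List<ToySegment> segments) {
    List<Integer> out = new ArrayList<>();
    int docBase = 0;
    for (ToySegment seg : segments) {
      for (int local : seg.docsForTerm) {
        out.add(docBase + local);
      }
      docBase += seg.maxDoc;  // the next segment starts after this one's doc id range
    }
    return out;
  }
}
```

Because the caller holds each segment's enum directly, it can still see whatever codec-specific subclass or attributes that segment provides, which is what gets lost behind a Multi*Enum.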

Mike


Re: Flex & Docs/AndPositionsEnum

Posted by Renaud Delbru <re...@deri.org>.

On 09/02/10 16:04, Michael McCandless wrote:
> On Tue, Feb 9, 2010 at 9:08 AM, Renaud Delbru<re...@deri.org>  wrote:
>    
>> So, does it mean that the codec interface is likely to change ? Do I need to
>> be prepared to change again all my code ;o) ?
>>      
> This particular patch doesn't change the Codecs API -- it "only"
> factors out the Multi* APIs from MultiReader.  Likely you won't need
> to change your codec... but try applying the patch and see :)
>    
Ok, good news ;o).
> However: if you consume the flex API directly, on top of multi readers
> (something you shouldn't do, for performance reasons), you will have
> to use MultiField's static methods to get the enums.
>    
In my previous example (registering my codec in IndexWriter, and then 
using IndexReader), am I consuming the flex API directly on top of the 
multi-readers? If yes, how can I avoid that?

Cheers
-- 
Renaud Delbru


Re: Flex & Docs/AndPositionsEnum

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Tue, Feb 9, 2010 at 9:08 AM, Renaud Delbru <re...@deri.org> wrote:
> Hi Michael,
>
> On 09/02/10 13:35, Michael McCandless wrote:
>>
>> It's great that you're testing the flex APIs... things are still "in
>> flux" as you've seen.  There's another big patch pending on
>> LUCENE-2111...
>>
>
> So, does it mean that the codec interface is likely to change ? Do I need to
> be prepared to change again all my code ;o) ?

This particular patch doesn't change the Codecs API -- it "only"
factors out the Multi* APIs from MultiReader.  Likely you won't need
to change your codec... but try applying the patch and see :)

However: if you consume the flex API directly, on top of multi readers
(something you shouldn't do, for performance reasons), you will have
to use the static methods of MultiFields to get the enums.

>> Out of curiosity... in what circumstances do you see a Multi*Enum
>> appearing?
>>
>> Lucene's core always searches "by segment".  Are you doing something
>> external (directly using the flex APIs against a
>> Multi/DirectoryReader)?
>
> I am using the flex API with the high level Lucene interface (IndexWriter
> and IndexReader).
> I am creating a RamDirectory, register my codec into the IndexWriter, and
> index 64 documents. Then, I use the IndexReader.termPositionsEnum to get my
> own DocsAndPositionsEnum in order to check if all the information that have
> been stored in the new index data structure are correctly retrieved.
> In that case, I got the previous errors (a MultiDocsAndPositionsEnum is
> returned). However, when I am indexing only one or two documents, the
> original DocsAndPositionsEnum is returned.

Got it, so you're directly consuming the flex API in your test.
Whenever the index has > 1 segment, you'll get a multi enum.
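
Why the cast works for one segment but fails for several can be shown with a toy model. None of these classes are Lucene's: `ToyEnum`, `ToyCodecEnum`, `ToyMultiEnum` and `ToyReader` are hypothetical, and the multi enum simply chains its sub-enums without rebasing doc ids.

```java
// Hypothetical minimal enum contract: returns -1 when exhausted.
interface ToyEnum {
  int nextDoc();
}

// Codec-specific enum exposing extra data beyond the base contract.
class ToyCodecEnum implements ToyEnum {
  private final int[] docs;
  private int i = -1;

  ToyCodecEnum(int... docs) {
    this.docs = docs;
  }

  public int nextDoc() {
    return ++i < docs.length ? docs[i] : -1;
  }

  String extra() {
    return "codec-specific data";
  }
}

// Wrapper over several segment enums; it is a ToyEnum but NOT a ToyCodecEnum.
class ToyMultiEnum implements ToyEnum {
  private final ToyEnum[] subs;
  private int cur = 0;

  ToyMultiEnum(ToyEnum... subs) {
    this.subs = subs;
  }

  public int nextDoc() {
    while (cur < subs.length) {
      int d = subs[cur].nextDoc();
      if (d != -1) {
        return d;  // doc id rebasing omitted for brevity
      }
      cur++;
    }
    return -1;
  }
}

class ToyReader {
  // One segment: hand back the codec's own enum. More than one: wrap them.
  static ToyEnum termEnum(ToyEnum... perSegment) {
    return perSegment.length == 1 ? perSegment[0] : new ToyMultiEnum(perSegment);
  }
}
```

With a single segment, `(ToyCodecEnum) ToyReader.termEnum(...)` succeeds; with two or more it throws ClassCastException, which mirrors the behaviour reported with 64 documents versus one or two.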

Mike


Re: Flex & Docs/AndPositionsEnum

Posted by Renaud Delbru <re...@deri.org>.

Hi Michael,

On 09/02/10 13:35, Michael McCandless wrote:
> It's great that you're testing the flex APIs... things are still "in
> flux" as you've seen.  There's another big patch pending on
> LUCENE-2111...
>    
So, does this mean that the codec interface is likely to change? Do I 
need to be prepared to change all my code again ;o) ?
> Out of curiosity... in what circumstances do you see a Multi*Enum appearing?
>
> Lucene's core always searches "by segment".  Are you doing something
> external (directly using the flex APIs against a
> Multi/DirectoryReader)?
>    
I am using the flex API with the high-level Lucene interface 
(IndexWriter and IndexReader).
I create a RAMDirectory, register my codec with the IndexWriter, 
and index 64 documents. Then, I use IndexReader.termPositionsEnum to 
get my own DocsAndPositionsEnum in order to check that all the information 
stored in the new index data structure is correctly 
retrieved.
In that case, I get the previous errors (a MultiDocsAndPositionsEnum is 
returned). However, when I index only one or two documents, the 
original DocsAndPositionsEnum is returned.

Hope that helps,
cheers
-- 
Renaud Delbru


Re: Flex & Docs/AndPositionsEnum

Posted by Michael McCandless <lu...@mikemccandless.com>.

Renaud,

It's great that you're testing the flex APIs... things are still "in
flux" as you've seen.  There's another big patch pending on
LUCENE-2111...

Out of curiosity... in what circumstances do you see a Multi*Enum appearing?

Lucene's core always searches "by segment".  Are you doing something
external (directly using the flex APIs against a
Multi/DirectoryReader)?

Mike

On Tue, Feb 9, 2010 at 8:04 AM, Uwe Schindler <uw...@thetaphi.de> wrote:
> Hi Renaud,
>
>
>> On 09/02/10 12:16, Uwe Schindler wrote:
>> > In flex the correct way to add additional posting data to these
>> classes would be the usage of custom attributes, registered in the
>> attributes() AttributeSource.
>> >
>> Ok, I have changed my codes to use the AttributeSource interface.
>> > Due to some limitations, there is currently no working support in
>> MultiReaders to have a "view" on the underlying Enums, but we are
>> working on that.
>> >
>> But, I have still the same problem, it seems that
>> MultiDocsAndPositionsEnum does not have access to the underlying
>> attributes added to my DocsAndPositionsEnum subclass. I got the
>> following exception (IllegalArgumentException):
>> "This AttributeSource does not have the attribute
>> 'org.sindice.siren.analysis.attributes.TupleAttribute'."
>>
>> Is this related to your previous comment, i.e., that MultiReaders do
>> not
>> have a view on the underlying Enums ?
>
> Exactly, MultiEnums have their own attributes at the moment, there is no "Proxy" view on it. For this to work, proxy AttributeImpls are needed and there is no support at the moment.
>
> See https://issues.apache.org/jira/browse/LUCENE-2154
>
> The problem behind is that when a consumer gets/adds an Attribute, all subreaders  must use the same attribute or the MultiReader/DirectoryReader must proxy the attributes. For this to work we need dynamic proxies or you also have to implement ProxyImpls: Attribute, AttributeImpl, AttributeProxyImpl.
>
> We have no progress for that at the moment, so I am sorry, we have no working support for attributes in MultiReaders (which all DirectoryReaders are, because a index could consist of more than one segment).
>
>> > In general what you do (if it works in future):
>> > Define an interface for your extensions based on the Attribute
>> interface and also provide the implementation class. Then call
>> YourEnums.attributes().addAttribute(YourInterface.class) in the ctor of
>> your enum, store a local reference to the attribute and fill this on
>> iteration. Any consumer of this Enum can check using
>> TermPositions.attributes().hasAttribute/getAttribute/addAttribute for
>> the existence of the the same and then read the attributes during
>> iteration. There is no need to change the Enum class API at all.
>> >
>> Ok, it works like a charm except the problem related to MultiReaders.
>
> See above.
>
> But attributes are the way to go for this extended posting/prox lists.
>
> Uwe
>
>
>
>


RE: Flex & Docs/AndPositionsEnum

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi Renaud,

 
> On 09/02/10 12:16, Uwe Schindler wrote:
> > In flex the correct way to add additional posting data to these
> classes would be the usage of custom attributes, registered in the
> attributes() AttributeSource.
> >
> Ok, I have changed my codes to use the AttributeSource interface.
> > Due to some limitations, there is currently no working support in
> MultiReaders to have a "view" on the underlying Enums, but we are
> working on that.
> >
> But, I have still the same problem, it seems that
> MultiDocsAndPositionsEnum does not have access to the underlying
> attributes added to my DocsAndPositionsEnum subclass. I got the
> following exception (IllegalArgumentException):
> "This AttributeSource does not have the attribute
> 'org.sindice.siren.analysis.attributes.TupleAttribute'."
> 
> Is this related to your previous comment, i.e., that MultiReaders do
> not
> have a view on the underlying Enums ?

Exactly, MultiEnums have their own attributes at the moment; there is no "Proxy" view on the underlying ones. For this to work, proxy AttributeImpls are needed, and there is no support for them at the moment.

See https://issues.apache.org/jira/browse/LUCENE-2154

The underlying problem is that when a consumer gets/adds an Attribute, all subreaders must use the same attribute, or the MultiReader/DirectoryReader must proxy the attributes. For this to work we need dynamic proxies, or you also have to implement ProxyImpls: Attribute, AttributeImpl, AttributeProxyImpl.
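
To make the "dynamic proxies" option concrete, here is a minimal sketch using the JDK's java.lang.reflect.Proxy. Only the Proxy and InvocationHandler calls are real APIs; `PayloadLikeAttr`, `CurrentHolder` and `AttrProxies` are hypothetical names, and nothing here reflects how LUCENE-2154 was or would be implemented. The consumer keeps one proxy instance, while the multi enum switches the delegate whenever it moves to the next sub-enum.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

// Hypothetical attribute interface exposed to consumers.
interface PayloadLikeAttr {
  int value();
}

// Mutable slot holding the attribute of whichever sub-enum is currently active.
class CurrentHolder<T> {
  T current;
}

class AttrProxies {
  // Builds a proxy whose calls are forwarded to holder.current at call time.
  static <T> T proxyFor(Class<T> iface, CurrentHolder<T> holder) {
    InvocationHandler handler = (proxy, method, args) -> {
      if (holder.current == null) {
        throw new IllegalStateException("no current sub-enum");
      }
      try {
        return method.invoke(holder.current, args);
      } catch (InvocationTargetException e) {
        throw e.getCause();  // rethrow what the delegate itself threw
      }
    };
    Object p = Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] {iface}, handler);
    return iface.cast(p);
  }
}
```

The reflective call on every access is the price of not writing a hand-coded AttributeProxyImpl per attribute; a hand-written proxy class would do the same forwarding with a direct method call.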

We have made no progress on that at the moment, so I am sorry, we have no working support for attributes in MultiReaders (which all DirectoryReaders are, because an index could consist of more than one segment).

> > In general what you do (if it works in future):
> > Define an interface for your extensions based on the Attribute
> interface and also provide the implementation class. Then call
> YourEnums.attributes().addAttribute(YourInterface.class) in the ctor of
> your enum, store a local reference to the attribute and fill this on
> iteration. Any consumer of this Enum can check using
> TermPositions.attributes().hasAttribute/getAttribute/addAttribute for
> the existence of the the same and then read the attributes during
> iteration. There is no need to change the Enum class API at all.
> >
> Ok, it works like a charm except the problem related to MultiReaders.

See above.

But attributes are the way to go for these extended posting/prox lists.

Uwe



Re: Flex & Docs/AndPositionsEnum

Posted by Renaud Delbru <re...@deri.org>.

Hi Uwe,

On 09/02/10 12:16, Uwe Schindler wrote:
> In flex the correct way to add additional posting data to these classes would be the usage of custom attributes, registered in the attributes() AttributeSource.
>    
Ok, I have changed my code to use the AttributeSource interface.
> Due to some limitations, there is currently no working support in MultiReaders to have a "view" on the underlying Enums, but we are working on that.
>    
But I still have the same problem: it seems that 
MultiDocsAndPositionsEnum does not have access to the underlying 
attributes added to my DocsAndPositionsEnum subclass. I got the 
following exception (IllegalArgumentException):
"This AttributeSource does not have the attribute 
'org.sindice.siren.analysis.attributes.TupleAttribute'."

Is this related to your previous comment, i.e., that MultiReaders do not 
have a view on the underlying Enums ?
> In general what you do (if it works in future):
> Define an interface for your extensions based on the Attribute interface and also provide the implementation class. Then call YourEnums.attributes().addAttribute(YourInterface.class) in the ctor of your enum, store a local reference to the attribute and fill this on iteration. Any consumer of this Enum can check using TermPositions.attributes().hasAttribute/getAttribute/addAttribute for the existence of the the same and then read the attributes during iteration. There is no need to change the Enum class API at all.
>    
Ok, it works like a charm except the problem related to MultiReaders.

Thanks
-- 
Renaud Delbru


RE: Flex & Docs/AndPositionsEnum

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi Renaud,

In flex, the correct way to add additional posting data to these classes is to use custom attributes, registered in the attributes() AttributeSource.

Due to some limitations, there is currently no working support in MultiReaders to have a "view" on the underlying Enums, but we are working on that.

In general what you do (if it works in future):
Define an interface for your extensions based on the Attribute interface and also provide the implementation class. Then call YourEnums.attributes().addAttribute(YourInterface.class) in the ctor of your enum, store a local reference to the attribute and fill it on iteration. Any consumer of this Enum can check for the existence of the attribute using TermPositions.attributes().hasAttribute/getAttribute/addAttribute and then read it during iteration. There is no need to change the Enum class API at all.

It works the same way as TokenStreams have since 2.9/3.0.
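
The pattern described above can be sketched without Lucene at all. The following is a minimal, self-contained model and not the real API: `ToyAttribute`, `ToyAttributeSource`, `TupleAttr`, `TupleAttrImpl` and `ToyPositionsEnum` are hypothetical names, the implementation instance is passed in explicitly instead of being created by a factory, and the class-keyed map only imitates how an AttributeSource hands out one shared instance per attribute interface.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the Attribute marker interface.
interface ToyAttribute {}

// The extension interface a codec author would define.
interface TupleAttr extends ToyAttribute {
  int tuple();
  void setTuple(int tuple);
}

// Its implementation class.
class TupleAttrImpl implements TupleAttr {
  private int tuple;
  public int tuple() { return tuple; }
  public void setTuple(int tuple) { this.tuple = tuple; }
}

// Toy attribute source: one shared instance per attribute interface.
class ToyAttributeSource {
  private final Map<Class<? extends ToyAttribute>, ToyAttribute> attrs = new HashMap<>();

  // Registers the attribute, or returns the instance that is already registered.
  <T extends ToyAttribute> T addAttribute(Class<T> type, T impl) {
    return type.cast(attrs.computeIfAbsent(type, k -> impl));
  }

  boolean hasAttribute(Class<? extends ToyAttribute> type) {
    return attrs.containsKey(type);
  }

  <T extends ToyAttribute> T getAttribute(Class<T> type) {
    ToyAttribute a = attrs.get(type);
    if (a == null) {
      throw new IllegalArgumentException(
          "This AttributeSource does not have the attribute '" + type.getName() + "'.");
    }
    return type.cast(a);
  }
}

// Toy positions enum; extra per-position data travels in the attribute.
class ToyPositionsEnum {
  private final ToyAttributeSource attributes = new ToyAttributeSource();
  private final TupleAttr tupleAttr;  // local reference, filled on iteration
  private final int[] positions;
  private final int[] tuples;
  private int idx = -1;

  ToyPositionsEnum(int[] positions, int[] tuples) {
    this.positions = positions;
    this.tuples = tuples;
    this.tupleAttr = attributes.addAttribute(TupleAttr.class, new TupleAttrImpl());
  }

  ToyAttributeSource attributes() { return attributes; }

  // Returns the next position, or -1 when exhausted; updates the attribute as a side effect.
  int nextPosition() {
    if (++idx >= positions.length) {
      return -1;
    }
    tupleAttr.setTuple(tuples[idx]);
    return positions[idx];
  }
}
```

A consumer fetches the attribute once with `getAttribute(TupleAttr.class)` and then reads it after every `nextPosition()` call; the enum's own method signatures never change, which is the point of the design.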

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Renaud Delbru [mailto:renaud.delbru@deri.org]
> Sent: Tuesday, February 09, 2010 1:05 PM
> To: java-user
> Cc: Michael McCandless
> Subject: Flex & Docs/AndPositionsEnum
> 
> Hi Michael,
> 
> I have updated my lucene-1458, and I discovered there was big
> modifications in the StandardCodec interface.
> I updated my own codecs to this new interface, but I encounter a
> problem. My codecs are creating DocsAndPositionsEnum subclasses that
> allow to access more information than simply the doc, freq and position
> (I have other information encoded into the Prox file).
> In the code, to be able to manipulate the additional interface that my
> classes provide, I was casting the DocsAndPositionsEnum object returned
> by IndexReader#termPositionsEnum() into the correct subclass. While
> this
> approach was working in the previous flewx branch, this does not work
> anymore with the last committed changes. In certain cases, the
> IndexReader#termPositionsEnum() does not return the
> DocsAndPositionsEnum
> created by the StandardPostingsReader, but a MultiDocsAndPositionsEnum.
> However, I am not able either to subclass the MultiDocsAndPositionsEnum
> or to wrap it into a decorator because it is declared as 'private
> static
> final' in DirectoryReader.
> 
> Are these classes (MultiTermEnum, MultiDocsAndPositionsEnum, etc.)
> hidden in a voluntary manner ? Or is there is another way to extends
> StandardCodec without having to deal with these classes ?
> 
> Cheers
> --
> Renaud Delbru
> 


