Posted to dev@lucene.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2008/04/09 22:28:47 UTC

Re: Flexible indexing design

On Apr 9, 2008, at 6:35 AM, Michael Busch wrote:

> We also need to come up with a good solution for the dictionary,  
> because a term with frq/prx postings needs to store two (or three  
> for skiplist) file pointers in the dictionary, whereas e. g. a  
> "binary" posting list only needs one pointer.

This is something I'm working on as well, and I hope we can solve a  
couple of design problems I've been turning over in my mind for some  
time.

In KS, the information Lucene stores in the frq/prx files is carried  
in one postings file per field, as discussed previously.  However, I  
made the additional change of breaking out skip data into a separate  
file (shared across all fields).  Isolating skip data sacrifices some  
locality of reference, but buys substantial gains in simplicity and  
compartmentalization.  Individual Posting subclasses, each of which  
defines a file format, don't have to know about skip algorithms at  
all.  :)  Further, improvements in the skip algorithm only require  
changes to the .skip file, and falling back to PostingList_Next still  
works if the .skip file becomes corrupted since .skip carries only  
optimization info and no real data.

For reasons I won't go into here, KS doesn't need to put a field  
number in its TermInfo, but it does need doc freq, plus file
positions for the postings file, the skip file, and the primary  
Lexicon file.  (Lexicon is the KS term dictionary class, akin to  
Lucene's TermEnum.)

   struct kino_TermInfo {
       kino_VirtualTable* _;
       kino_ref_t ref;
       chy_i32_t doc_freq;      /* number of documents containing the term */
       chy_u64_t post_filepos;  /* file position in the postings file */
       chy_u64_t skip_filepos;  /* file position in the skip file */
       chy_u64_t lex_filepos;   /* file position in the primary Lexicon file */
   };

There are two problems.

First is that I'd like to extend indexing with arbitrary subclasses of  
SegDataWriter, and I'd like these classes to be able to put their own  
file position bookmarks (or possibly other data) into TermInfo.   
Making TermInfo hash-based would probably do it, but there would be  
nasty performance and memory penalties since TermInfo objects are  
numerous.

So, what's the best way to allow multiple, unrelated classes to extend  
TermInfo and the term dictionary file format?  Is it to break up  
TermInfo information horizontally rather than vertically, so that  
instead of a single array of TermInfo objects, we have a flexible  
stack of arrays of 64-bit integers representing file positions?
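To make the "horizontal" layout concrete, here's a rough Java sketch (all
names are hypothetical -- this isn't existing KS or Lucene code): doc freq
plus one 64-bit file-position column per writer component, all indexed by
term ordinal, so a new SegDataWriter-style plugin registers another column
instead of growing a per-term hash.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Hypothetical "horizontal" term metadata: parallel arrays keyed by
     *  term ordinal, one file-position column per writer component. */
    class TermInfoColumns {
        private final int[] docFreqs;                    // doc freq per term
        private final Map<String, long[]> filePositions; // e.g. "post", "skip", "lex"

        TermInfoColumns(int numTerms, List<String> components) {
            docFreqs = new int[numTerms];
            filePositions = new HashMap<String, long[]>();
            for (String name : components) {
                filePositions.put(name, new long[numTerms]);
            }
        }

        void setDocFreq(int termOrd, int docFreq) {
            docFreqs[termOrd] = docFreq;
        }

        void setFilePos(String component, int termOrd, long filePos) {
            filePositions.get(component)[termOrd] = filePos;
        }

        long getFilePos(String component, int termOrd) {
            return filePositions.get(component)[termOrd];
        }
    }

The per-term cost stays at a handful of primitive slots, and an extension
that needs an extra bookmark pays for one more long[] rather than a hash
entry in every TermInfo object.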

The second problem is how to share a term dictionary over a cluster.   
It would be nice to be able to plug modules into IndexReader that  
represent clusters of machines but that are dedicated to specific  
tasks: one cluster could be dedicated to fetching full documents and  
applying highlighting; another cluster could be dedicated to scanning  
through postings and finding/scoring hits; a third cluster could store  
the entire term dictionary in RAM.

A centralized term dictionary held in RAM would be particularly handy  
for sorting purposes.  The problem is that the file pointers of a term  
dictionary are specific to indexes on individual machines.  A shared  
dictionary in RAM would have to contain pointers for *all* clients,  
which isn't really workable.

So, just how do you go about assembling task specific clusters?  The  
stored documents cluster is easy, but the term dictionary and the  
postings are hard.

> For example, we should think about the Field APIs. Since we don't  
> have global field semantics in Lucene I wonder how to handle  
> conflict cases, e. g. when a document specifies a different posting  
> list format than a previous one for the same field. The easiest way  
> would be to not allow it and throw an exception. But this is kind of  
> against Lucene's way of dealing with fields currently. But I'm  
> scared of the complicated code to handle conflicts of all the  
> possible combinations of posting list formats.

Yeah. Lucene's field definition conflict-resolution code is gnarly  
already. :(

> KinoSearch doesn't have to worry about this, because it has a static  
> schema (I think?), but isn't as flexible as Lucene.

Earlier versions of KS did not allow the addition of new fields on the  
fly, but this has been changed.  You can now add fields to an existing  
Schema object like so:

     for my $doc (@docs) {
         # Dynamically define any new fields as 'text'.
         for my $field ( keys %$doc ) {
             $schema->add_field( $field => 'text' );
         }
         $invindexer->add_doc($doc);
     }

See the attached sample app for that snippet in context.

Here are some current differences between KS and Lucene:

   * KS doesn't yet purge *old* dynamic field definitions which have
     become obsolete.  However, that should be possible to add later,
     as a sweep triggered during full optimization.
   * You can't change the definition of an existing field.
   * Documents are hash-based, so you can't have multiple fields with
     the same name within one document object.  However, I consider
     that capability a misfeature of Lucene.

In summary, I don't think that global field semantics meaningfully  
restrict flexibility for the vast majority of users.

The primary distinction is/was philosophical.   IIRC, Doug didn't want  
to force people to think about index design in advance, so the Field/ 
Document API was optimized for newbies.  In contrast, KS wants you to  
give it a Schema before indexing commences.

It's still true that full-power KS forces you to think about index  
design up-front.  However, there's now a KinoSearch::Simple API  
targeted at newbies which hides the Schema API and handles field  
definition automatically -- so Doug's ease-of-use design goal has been  
achieved via different means.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Flexible indexing design

Posted by Michael McCandless <lu...@mikemccandless.com>.
Marvin Humphrey <ma...@rectangular.com> wrote:

> > Container is only aware of the single inStream, while codec can still
> > thinks it's operating on 3 even if it's really 1 or 2.
> >
>
> I don't understand.  If you have three streams, all of them are going to
> have to get skipped, right?

For the "all data in one stream" (KS's dev branch) approach, I'm
picturing (and I'm not really "sure" on any of this until we get our
feet real wet here) that the container is told "you really only have 1
stream" even though the codec thinks it's got 3 streams.  And the
container holds onto that stream.

I agree container alone interacts with skip data.

But I also thought container alone would do "big skips", and then use the
codec for doc-by-doc skipping only once it's close to the target.

If instead you do the "one file per type" approach (current released
KS & Lucene), then the container would know it has 3 real streams and
would seek all 3 on a big skip.
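A rough Java sketch of that division of labor (SkipIndex, SkipEntry,
PostingCodec and PostingContainer are all hypothetical, not the current
Lucene API; only IndexInput is real): the container owns the skip data
and the real streams, seeks them on a big skip, tells the codec it was
moved, and then lets the codec step doc-by-doc the rest of the way.

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;

    interface SkipEntry {
        int docID();                    // first doc at or after the skip point
        long filePointer(int stream);   // file position for each real stream
    }

    interface SkipIndex {
        SkipEntry closestBefore(int target);  // nearest skip point not past target
    }

    interface PostingCodec {
        void notifySkipped(int baseDocID);  // reset internal state after a big skip
        int next() throws IOException;      // decode next record, return its docID
    }

    class PostingContainer {
        private final SkipIndex skipIndex;
        private final IndexInput[] streams;  // 1, 2 or 3 real files
        private final PostingCodec codec;
        private int docID = -1;

        PostingContainer(SkipIndex skipIndex, IndexInput[] streams, PostingCodec codec) {
            this.skipIndex = skipIndex;
            this.streams = streams;
            this.codec = codec;
        }

        /** Big skip via skip data, then fine-grained stepping via the codec. */
        int skipTo(int target) throws IOException {
            SkipEntry entry = skipIndex.closestBefore(target);
            if (entry != null && entry.docID() > docID) {
                for (int i = 0; i < streams.length; i++) {
                    streams[i].seek(entry.filePointer(i));  // seek every real stream
                }
                docID = entry.docID();
                codec.notifySkipped(docID);
            }
            while (docID < target) {
                docID = codec.next();  // doc-by-doc until at or past the target
            }
            return docID;
        }
    }

With the "all data in one stream" approach the streams array just has
length 1 and the rest is unchanged.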

> > And even so one could plug in their own (single stream) codec if need be?
> >
>
> Sure.
>
> The question is how we set up PostingList's iterator.  In KS right now,
> SegPList_Next() calls Posting's Read_Record() method, which takes an
> InStream as an argument -- Posting doesn't maintain any streams internally.
>
> As soon as the codec starts reading from an indeterminate number of
> streams, though, having the container pass them in for each read op won't
> work any more.  The codec will have to be responsible for all data streams.

Agreed.

> > > The only penalty for having a TermPositions object read
> > > positions in bulk like that is memory footprint (since each TermPositions
> > > object needs a growable array of 32-bit integers to store the bulk
> > > positions).
> > >
> >
> > This is a tiny memory cost right?  A TermPositions instance gets
> > re-used each time you next() to the next term.
> >
>
> With large documents, the buffer size requirements can get pretty big --
> consider how often the term "the" might appear in a 1000-page novel.  Lucene
> and its relatives don't work very well with novel-sized documents anyway,
> though, so for better or worse, I blew it off.

OK good point.  There are users now and then, for better or worse, who
do seem to index massive documents.

> > I think you also pay an added up-front (small) decoding cost in cases
> > where the consumer will only look at a subset of the positions before
> > next()'ing.  Eg a phrase search involving a rare term and a frequent
> > term.
> >
>
> Yes, you're right about that.  The KS phrase scorer has less function call
> overhead, though -- it's a more limited design, with no 'slop' support, and
> it operates using pointer math on the arrays of positions directly rather
> than having to make a lot of accessor calls.
>
> My guess is that it's a wash, algorithm-wise.  It seems likely that file
> format would have a significant effect, but I would be surprised to see
> phrase scorer performance in KS and Lucene diverge all that much as a result
> of the algorithmic implementation.
>
> That's why I asserted that the main motivation for bulk-reading positions
> in KS was simplicity, rather than performance or something else.

OK that seems like a likely hypothesis, though somehow we should
verify it in practice.  Performance of phrase searching (and its
relatives) is important.

> > But the good news is if this framework is pluggable, one can insert
> > their own codec to not do the up-front decoding of all positions per
> > term X doc.
> >
>
> Yes, that would work.  You would need different Scorer implementations to
> deal with codecs which read the same file format yet are implemented
> differently... but that's a feature. :)

Yeah.  This is very much a way-expert use case.

> > > We'd need both PostingBuffer and Posting subclasses.
> > >
> >
> > Yes.
> >
>
> OK, I think we're almost at the point where we can hack up prototypes for
> Lucene implementations of PostingBuffer and PostingList that read the
> current file format.  I think seeing those would clarify things.
>
> Ironically, I'm not sure exactly how Posting fits into the equation at
> read-time anymore, but I think that will work itself out as we get back to
> indexing.

Agreed, it's time to get our feet wet.  Though, I'm working "top
down", so my first focus is on modularizing DocumentsWriter such that
new features like LUCENE-1231 (column-stride fields) are more or less
"just" a plugin into indexing.

Mike



Re: Flexible indexing design

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Apr 27, 2008, at 3:28 AM, Michael McCandless wrote:

> Actually, I was picturing that the container does the seeking itself
> (using skip data), to get "close" to the right point, and then it uses
> the codec to step through single docs at a time until it's at or
> beyond the right one.

I believe it makes sense to have the skip stream be the sole  
responsibility of the container.

In any case, the skip stream would be the *only* stream the
container knows about.  All other streams would be internal to the
codec: frq, prx, etc.

> Ie, no SkipDatum is created & passed to the codec, though likely
> container should notify codec it just got skipped in case it has
> internal state that cares.

But you'll have to seek frq, prx, etc, though -- right?  Say you've  
got a large PostingList currently located near its top and you want  
to skip to halfway through.  How do you do that without skipping the
underlying streams, which the codec knows about but the container  
doesn't?

If you don't skip the internal state of the codec including the  
locations of its InStreams, you just have to keep calling next() on it  
over and over -- an un-optimized skipTo() implementation.  That can't  
be what you want; presumably we're miscommunicating somehow.

> Container is only aware of the single inStream, while codec can still
> thinks it's operating on 3 even if it's really 1 or 2.

I don't understand.  If you have three streams, all of them are going  
to have to get skipped, right?

>> The downside is that the codec object itself suddenly has to get a  
>> lot
>> bigger to hold all the instreams.
>
> But not many instances of the codec are created at once, right?

Yes.

That's not how I originally implemented Posting.  It was a small  
class, and PostingList originally had a Bulk_Read() method that read 1  
kB worth of Posting objects into a ByteBuf, stacked end to end.  But  
we agree that the codec class will need to be larger.

> And even so one could plug in their own (single stream) codec if  
> need be?

Sure.

The question is how we set up PostingList's iterator.  In KS right  
now, SegPList_Next() calls Posting's Read_Record() method, which takes  
an InStream as an argument -- Posting doesn't maintain any streams  
internally.

As soon as the codec starts reading from an indeterminate number of  
streams, though, having the container pass them in for each read op  
won't work any more.  The codec will have to be responsible for all  
data streams.

>> The only penalty for having a TermPositions object read
>> positions in bulk like that is memory footprint (since each  
>> TermPositions
>> object needs a growable array of 32-bit integers to store the bulk
>> positions).
>
> This is a tiny memory cost right?  A TermPositions instance gets
> re-used each time you next() to the next term.

With large documents, the buffer size requirements can get pretty big  
-- consider how often the term "the" might appear in a 1000-page  
novel.  Lucene and its relatives don't work very well with novel-sized  
documents anyway, though, so for better or worse, I blew it off.
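For illustration, a minimal Java sketch of the bulk style (BulkPositions
is hypothetical, not the KS or Lucene API; IndexInput and readVInt are
real): all positions for the current term/doc are decoded up front into a
reusable, growable int[], which is why the buffer has to be able to reach
freq-sized even for the most frequent term in a huge document.

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;

    /** Hypothetical bulk-positions buffer, refilled once per doc. */
    class BulkPositions {
        private int[] positions = new int[32];
        private int count;

        /** Decode freq delta-encoded VInt positions from the prx stream. */
        void readAll(IndexInput prx, int freq) throws IOException {
            if (freq > positions.length) {
                positions = new int[freq];  // grow; "the" in a novel makes this large
            }
            int pos = 0;
            for (int i = 0; i < freq; i++) {
                pos += prx.readVInt();      // positions are stored as deltas
                positions[i] = pos;
            }
            count = freq;
        }

        int[] positions() { return positions; }
        int count() { return count; }
    }

A phrase scorer can then walk positions[] with plain array indexing (or
pointer math in C) instead of paying a nextPosition() call per occurrence.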

> I think you also pay an added up-front (small) decoding cost in cases
> where the consumer will only look at a subset of the positions before
> next()'ing.  Eg a phrase search involving a rare term and a frequent
> term.

Yes, you're right about that.  The KS phrase scorer has less function  
call overhead, though -- it's a more limited design, with no 'slop'  
support, and it operates using pointer math on the arrays of positions  
directly rather than having to make a lot of accessor calls.

My guess is that it's a wash, algorithm-wise.  It seems likely that  
file format would have a significant effect, but I would be surprised  
to see phrase scorer performance in KS and Lucene diverge all that  
much as a result of the algorithmic implementation.

That's why I asserted that the main motivation for bulk-reading  
positions in KS was simplicity, rather than performance or something  
else.

> But the good news is if this framework is pluggable, one can insert
> their own codec to not do the up-front decoding of all positions per
> term X doc.

Yes, that would work.  You would need different Scorer implementations  
to deal with codecs which read the same file format yet are  
implemented differently... but that's a feature. :)

>> We'd need both PostingBuffer and Posting subclasses.
>
> Yes.

OK, I think we're almost at the point where we can hack up prototypes  
for Lucene implementations of PostingBuffer and PostingList that read  
the current file format.  I think seeing those would clarify things.

Ironically, I'm not sure exactly how Posting fits into the equation at  
read-time anymore, but I think that will work itself out as we get  
back to indexing.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Flexible indexing design

Posted by Michael McCandless <lu...@mikemccandless.com>.
Marvin Humphrey <ma...@rectangular.com> wrote:

> > > Seeking might get a little weird, I suppose.
> >
> > Maybe not?: if the container is only aware of the single InStream, and
> > say it's "indexed" with a multi-skip index, then when you ask
> > container to seek, it forwards the request to multi-skip which jumps
> > as close as it can, and then it asks the codec to skip docs until it's
> > at the requested docID.  Ie, the container can still be given a single
> > InStream, even though the codec "thinks" it's working with 3.
>
>  So, if I follow you...
>
> 1) When the requested doc number is far enough away to trigger skipping,
> segmentPostingList.skipTo() would operate on the skip stream and generate an
> opaque SkipDatum object (or something like that), which it would pass to the
> codec object.
>
> 2) The codec object would use the contents of the SkipDatum object -- an
> array of file pointers large enough to accommodate all of the InStreams,
> plus payload-type information as applicable -- to update itself into a
> consistent state just shy of or at the target doc.
>
> 3) The codec object will iterate through docs until it's at or past the
> target.
>
> Here's what's weird.  Say the codec is only operating on 1 stream but it
> thinks it's operating on 3.  Upon receiving the SkipDatum with 3 file
> pointers, it will seek the same InStream 3 times.
>
> That will only work if the file pointers are all the same -- which would be
> strange because then the second and third file pointers won't actually be
> marking the real data.
>
> Or perhaps the SkipDatum object will only contain one file pointer per
> "real" stream and the codec object will seek until it runs out of pointers
> in the stack?  i.e. It has three instreams, it receives a SkipDatum with
> only one file pointer, it seeks the first stream, then bails out and
> returns, ignoring the others.

Actually, I was picturing that the container does the seeking itself
(using skip data), to get "close" to the right point, and then it uses
the codec to step through single docs at a time until it's at or
beyond the right one.

Ie, no SkipDatum is created & passed to the codec, though likely
container should notify codec it just got skipped in case it has
internal state that cares.

Container is only aware of the single inStream, while codec can still
thinks it's operating on 3 even if it's really 1 or 2.

> > > I think a variant of that read op could be created against various versions
> > > of the Lucene file format going back, making it possible to isolate and
> > > archive obsolete codecs and clean up the container classes.
> >
> > I like this approach.  It would allow us to decouple the codec from
> > how many (1, 2, 3) and which files are actually storing the data.
>
> The downside is that the codec object itself suddenly has to get a lot
> bigger to hold all the instreams.

But not many instances of the codec are created at once, right?  And
even so one could plug in their own (single stream) codec if need be?

> > > > When reading an index, the
> > > > Posting/PostingList should be more like TermBuffer than Term.
> > > >
> > > >
> > >
> > > Yes.
> > >
> > > Now's a good time to remark on a difference between KS and Lucene when
> > > reading the equivalent of TermPositions: In KS, all positions are read in
> > > one grab during ScorePost_Read_Record -- there's no nextPosition() method.
> > > That was a tradeoff I made primarily for simplicity's sake, since it meant
> > > that PhraseScorer could be implemented with integer arrays and pointer math.
> > > Reduced CPU overhead was another theoretical benefit, but I've never
> > > benchmarked it.
> > >
> > > If you didn't want to do that but you still wanted to implement a
> > > PhraseScorer based around PostingList objects rather than TermPositions
> > > objects, you have a bit of a quandary: PostingList only advances in
> > > doc-sized increments, because it doesn't have a nextPosition() method.  So,
> > > nextPosition() would have to be implemented by ScorePosting:
> > >
> > >  // Iterate over docs and positions.
> > >  while (postingList.next()) {
> > >   ScorePosting posting = postingList.getPosting();
> > >   while (posting.nextPosition()) {
> > >     int position = posting.GetPosition();
> > >     ...
> > >   }
> > >  }
> > >
> > > If we did that, it would require a change in the behavior for
> > > ScorePost_Next(), above.
> >
> > OK, I guess this is just the follow-through by KS on the same
> > philosophy above (precluding differentiating term-only vs
> > terms-and-positions queries).
>
> Not exactly.  While the devel branch of KS uses PostingList and the unified
> postings format, the maint branch uses a fairly close variant on the Lucene
> file format and a modified version of the TermDocs family.  The equivalent
> of TermPositions in KS maint *also* reads positions in bulk, just like KS
> devel does.  The only penalty for having a TermPositions object read
> positions in bulk like that is memory footprint (since each TermPositions
> object needs a growable array of 32-bit integers to store the bulk
> positions).

This is a tiny memory cost right?  A TermPositions instance gets
re-used each time you next() to the next term.

I think you also pay an added up-front (small) decoding cost in cases
where the consumer will only look at a subset of the positions before
next()'ing.  Eg a phrase search involving a rare term and a frequent
term.

But the good news is if this framework is pluggable, one can insert
their own codec to not do the up-front decoding of all positions per
term X doc.

> Reading positions in bulk or not is a decision that can be made
> independently of either the decision to go with a unified postings format or
> the decision to implement PostingList.  I provided the information on
> KinoSearch's behavior because I thought the code samples for PostingList I'd
> supplied begged the question, "How do you iterate through positions if
> PostingList doesn't know about them?"

OK.

> > I would think there is a non-trivial
> > performance cost for term-only queries in KS?
> >
>
> I'm not sure what the size of the effect is.  KS has its old indexing
> benchmarking app, but it doesn't have a search benchmarking app and I don't
> plan to write one.  I figure it's more profitable to finish the C porting of
> KS, write a Java binding and try to hook into the Lucene contrib
> benchmarking code than it is to either port the whole thing or write one
> from scratch.

Got it.

> The unified postings format seems likely to suffer some penalty on simple
> term queries and also likely to make up some of that ground on phrase
> queries.  We originally thought motivated users could compensate by spec'ing
> a simpler format for some fields, but now that I've actually implemented the
> unified format, I just don't see that approach as practical.
>
> So now, I've swung back to favoring the divide-by-data-type approach, but I
> want to keep the codec/container roles.
>
> Read-time isn't a problem.  We can outfit the codec object with multiple
> InStreams.  PostingList can continue to be a container which advances
> doc-at-a-time (regardless of whether we read positions in bulk a la
> KinoSearch or defer that till later a la Lucene).
>
> However, if we add all that stuff, the codec object gets a lot bigger.
> That means Posting, as I've envisioned and implemented it, is no longer
> suitable.  We'd need both PostingBuffer and Posting subclasses.

Yes.

Mike



Re: Flexible indexing design

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Apr 24, 2008, at 4:47 AM, Michael McCandless wrote:
>> Seeking might get a little weird, I suppose.
>
> Maybe not?: if the container is only aware of the single InStream, and
> say it's "indexed" with a multi-skip index, then when you ask
> container to seek, it forwards the request to multi-skip which jumps
> as close as it can, and then it asks the codec to skip docs until it's
> at the requested docID.  Ie, the container can still be given a single
> InStream, even though the codec "thinks" it's working with 3.

So, if I follow you...

1) When the requested doc number is far enough away to trigger  
skipping, segmentPostingList.skipTo() would operate on the skip stream  
and generate an opaque SkipDatum object (or something like that),  
which it would pass to the codec object.

2) The codec object would use the contents of the SkipDatum object --
an array of file pointers large enough to accommodate all of the  
InStreams, plus payload-type information as applicable -- to update  
itself into a consistent state just shy of or at the target doc.

3) The codec object will iterate through docs until it's at or past  
the target.

Here's what's weird.  Say the codec is only operating on 1 stream but  
it thinks it's operating on 3.  Upon receiving the SkipDatum with 3  
file pointers, it will seek the same InStream 3 times.

That will only work if the file pointers are all the same -- which  
would be strange because then the second and third file pointers won't  
actually be marking the real data.

Or perhaps the SkipDatum object will only contain one file pointer per  
"real" stream and the codec object will seek until it runs out of  
pointers in the stack?  i.e. It has three instreams, it receives a  
SkipDatum with only one file pointer, it seeks the first stream, then  
bails out and returns, ignoring the others.

>> I think a variant of that read op could be created against various  
>> versions
>> of the Lucene file format going back, making it possible to isolate  
>> and
>> archive obsolete codecs and clean up the container classes.
>
> I like this approach.  It would allow us to decouple the codec from
> how many (1, 2, 3) and which files are actually storing the data.

The downside is that the codec object itself suddenly has to get a lot  
bigger to hold all the instreams.

>>> When reading an index, the
>>> Posting/PostingList should be more like TermBuffer than Term.
>>>
>>
>> Yes.
>>
>> Now's a good time to remark on a difference between KS and Lucene  
>> when
>> reading the equivalent of TermPositions: In KS, all positions are  
>> read in
>> one grab during ScorePost_Read_Record -- there's no nextPosition()  
>> method.
>> That was a tradeoff I made primarily for simplicity's sake, since  
>> it meant
>> that PhraseScorer could be implemented with integer arrays and  
>> pointer math.
>> Reduced CPU overhead was another theoretical benefit, but I've never
>> benchmarked it.
>>
>> If you didn't want to do that but you still wanted to implement a
>> PhraseScorer based around PostingList objects rather than  
>> TermPositions
>> objects, you have a bit of a quandary: PostingList only advances in
>> doc-sized increments, because it doesn't have a nextPosition()  
>> method.  So,
>> nextPosition() would have to be implemented by ScorePosting:
>>
>>  // Iterate over docs and positions.
>>  while (postingList.next()) {
>>    ScorePosting posting = postingList.getPosting();
>>    while (posting.nextPosition()) {
>>      int position = posting.GetPosition();
>>      ...
>>    }
>>  }
>>
>> If we did that, it would require a change in the behavior for
>> ScorePost_Next(), above.
>
> OK, I guess this is just the follow-through by KS on the same
> philosophy above (precluding differentiating term-only vs
> terms-and-positions queries).

Not exactly.  While the devel branch of KS uses PostingList and the  
unified postings format, the maint branch uses a fairly close variant  
on the Lucene file format and a modified version of the TermDocs  
family.  The equivalent of TermPositions in KS maint *also* reads  
positions in bulk, just like KS devel does.  The only penalty for  
having a TermPositions object read positions in bulk like that is  
memory footprint (since each TermPositions object needs a growable  
array of 32-bit integers to store the bulk positions).

Reading positions in bulk or not is a decision that can be made  
independently of either the decision to go with a unified postings  
format or the decision to implement PostingList.  I provided the  
information on KinoSearch's behavior because I thought the code  
samples for PostingList I'd supplied begged the question, "How do you  
iterate through positions if PostingList doesn't know about them?"

> I would think there is a non-trivial
> performance cost for term-only queries in KS?

I'm not sure what the size of the effect is.  KS has its old indexing  
benchmarking app, but it doesn't have a search benchmarking app and I  
don't plan to write one.  I figure it's more profitable to finish the  
C porting of KS, write a Java binding and try to hook into the Lucene  
contrib benchmarking code than it is to either port the whole thing or  
write one from scratch.

The unified postings format seems likely to suffer some penalty on  
simple term queries and also likely to make up some of that ground on  
phrase queries.  We originally thought motivated users could  
compensate by spec'ing a simpler format for some fields, but now that  
I've actually implemented the unified format, I just don't see that  
approach as practical.

So now, I've swung back to favoring the divide-by-data-type approach,  
but I want to keep the codec/container roles.

Read-time isn't a problem.  We can outfit the codec object with  
multiple InStreams.  PostingList can continue to be a container which  
advances doc-at-a-time (regardless of whether we read positions in  
bulk a la KinoSearch or defer that till later a la Lucene).

However, if we add all that stuff, the codec object gets a lot  
bigger.  That means Posting, as I've envisioned and implemented it, is  
no longer suitable.  We'd need both PostingBuffer and Posting  
subclasses.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Flexible indexing design

Posted by Michael McCandless <lu...@mikemccandless.com>.
Marvin Humphrey <ma...@rectangular.com> wrote:
>
>  On Apr 17, 2008, at 11:57 AM, Michael McCandless wrote:
>
>
> > If I have a pluggable indexer,
> > then on the querying side I need something (I'm not sure what/how)
> > that knows how to create the right demuxer (container) and codec
> > (decoder) to interact with whatever my indexing plugins wrote.
> >
> > So I don't think it's "out of bounds" to have *Query classes that know
> > to use the non-prx variant of a field when positions aren't needed.
> > Though I agree it makes things more complex.
> >
>
> Perhaps switching up files based on Query type isn't "out of bounds", but
> you get hit with a double whammy. First, there's the added complexity --
> which gets extra nasty when you consider that we wouldn't want this
> optimization to be on by default.  Second, you wind up with duplicated
> information in the index -- resulting in greater space requirements and
> increased indexing time.
>
> That's an awful lot of inconvenience for the sake of an optimization. I
> wouldn't bother.
>
> My initial thought was that the unified postings file format that KS is
> using now offers so many benefits that KS should go with it anyway, despite
> the increased costs for simple TermQueries.   But maybe we can nail the
> design for the divide-by-data-type format right now.

I think it depends on the performance difference.  At least, we should
not preclude this approach in our design, even if we don't make it
easy to do "out of the box".

> > > > This is like "column stride" vs "row stride" serialization
> > > > of a matrix.
> > > >
> > > > Relatively soon, though, we will all be on SSDs, so maybe this
> > > > locality argument becomes far less important ;)
> > >
> > > Yes, I've thought about that.  It defeats the phrase-query locality
> > > argument for unified postings files and recommends breaking things up
> > > logically by type of data into frq/prx/payload/whatever.
> > >
> > > Would it be possible to design a Posting plugin class that reads from
> > > multiple files?  I'm pretty sure the answer is yes.  It messes up the
> > > single-stream readRecord() method I've been detailing and might force
> > > Posting to maintain state.  But if Postings are scarce TermBuffer-style
> > > objects where values mutate, rather than populous Term-style objects where
> > > you need a new instance for each set of values, then it doesn't matter if
> > > they're large.
> > >
> > > If that could be done, I think it would be possible to retrofit the
> > > Posting/PostingList concept into Lucene without a file format change.  FWIW.
> >
> > I think this is possible.
>
> We should aggressively explore this concept.  By my tally, breaking up
> posting content into multiple files by data type has the most benefits and
> the fewest drawbacks.
>
>  Brainstorming...
>
> In devel branch KS, ScorePost_Read_Record reads three blocks of data from
> an InStream:
>
>   doc_num/freq
>   boost/norm byte
>   positions
>
> Let's say that instead of ScorePost_Read_Record operating on a passed-in
> InStream, we give ScorePosting a ScorePost_Next() method, and have it
> operate on three internally maintained InStreams:
>
>     void
>     ScorePost_next(ScorePosting *self)
>     {
>         u32_t  num_prox;
>         u32_t  position = 0;
>         u32_t *positions;
>         const u32_t doc_code = InStream_Read_C32(self->frq_in);
>         const u32_t doc_delta = doc_code >> 1;
>
>         /* Apply delta doc and retrieve freq. */
>         self->doc_num  += doc_delta;
>         if (doc_code & 1)
>             self->freq = 1;
>         else
>             self->freq = InStream_Read_C32(self->frq_in);
>
>         /* Decode boost/norm byte. */
>         self->impact = self->sim->norm_decoder[
>                 InStream_Read_U8(self->norms_in)
>             ];
>
>         /* Read positions. */
>         num_prox = self->freq;
>         if (num_prox > self->prox_cap) {
>             self->prox = REALLOCATE(self->prox, num_prox, u32_t);
>         }
>         positions = self->prox;
>         while (num_prox--) {
>             position += InStream_Read_C32(self->prx_in);
>             *positions++ = position;
>         }
>     }
>
> So, there we have a single operation, in a single class, defining a codec
> -- achieving my main goal, even while reading from multiple files.
>
> But check this out: we could feed the ScorePosting constructor the same
> InStream object three times, and it wouldn't know the difference.  So the
> same read operation can be used against both a multi-file format and a
> single file format.
>
>  Seeking might get a little weird, I suppose.

Maybe not?: if the container is only aware of the single InStream, and
say it's "indexed" with a multi-skip index, then when you ask
container to seek, it forwards the request to multi-skip which jumps
as close as it can, and then it asks the codec to skip docs until it's
at the requested docID.  Ie, the container can still be given a single
InStream, even though the codec "thinks" it's working with 3.

> I think a variant of that read op could be created against various versions
> of the Lucene file format going back, making it possible to isolate and
> archive obsolete codecs and clean up the container classes.

I like this approach.  It would allow us to decouple the codec from
how many (1, 2, 3) and which files are actually storing the data.

> > When reading an index, the
> > Posting/PostingList should be more like TermBuffer than Term.
> >
>
>  Yes.
>
> Now's a good time to remark on a difference between KS and Lucene when
> reading the equivalent of TermPositions: In KS, all positions are read in
> one grab during ScorePost_Read_Record -- there's no nextPosition() method.
> That was a tradeoff I made primarily for simplicity's sake, since it meant
> that PhraseScorer could be implemented with integer arrays and pointer math.
> Reduced CPU overhead was another theoretical benefit, but I've never
> benchmarked it.
>
> If you didn't want to do that but you still wanted to implement a
> PhraseScorer based around PostingList objects rather than TermPositions
> objects, you have a bit of a quandary: PostingList only advances in
> doc-sized increments, because it doesn't have a nextPosition() method.  So,
> nextPosition() would have to be implemented by ScorePosting:
>
>   // Iterate over docs and positions.
>   while (postingList.next()) {
>     ScorePosting posting = postingList.getPosting();
> >     while (posting.nextPosition()) {
>       int position = posting.GetPosition();
>       ...
>     }
>   }
>
> If we did that, it would require a change in the behavior for
> ScorePost_Next(), above.

OK, I guess this is just the follow-through by KS on the same
philosophy above (precluding differentiating term-only vs
terms-and-positions queries).  I would think there is a non-trivial
performance cost for term-only queries in KS?

> > Thinking about a codec reading multiple files... I would also love
> > this: a codec that can write & read layered updates to the inverted
> > files.
>
> IMO, it should not be the codec object that performs actions like this --
> it should be a container class that the user knows nothing about.

Maybe this is something in between codec & container, because it'd
have to "understand" that streams/codecs it's managing iterate
through docIDs, favoring newer streams that have the same docID,
having older streams skip over their byte blob for a given docID if
they've been obsoleted, etc.

Still, I agree it's more like a container than a codec.  Clearly you'd
want to re-use this ability across multiple kinds of files
(non-sparse, fixed-size like norms; sparse, variable-size like the
int[] example below).

Actually, even the multi-level skipping container (used in the
sparse, variable-size int[] example below) needs to be aware of
docIDs.

Mike



Re: Flexible indexing design

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Apr 17, 2008, at 11:57 AM, Michael McCandless wrote:

> If I have a pluggable indexer,
> then on the querying side I need something (I'm not sure what/how)
> that knows how to create the right demuxer (container) and codec
> (decoder) to interact with whatever my indexing plugins wrote.
>
> So I don't think it's "out of bounds" to have *Query classes that know
> to use the non-prx variant of a field when positions aren't needed.
> Though I agree it makes things more complex.

Perhaps switching up files based on Query type isn't "out of bounds",  
but you get hit with a double whammy. First, there's the added  
complexity -- which gets extra nasty when you consider that we  
wouldn't want this optimization to be on by default.  Second, you wind  
up with duplicated information in the index -- resulting in greater  
space requirements and increased indexing time.

That's an awful lot of inconvenience for the sake of an optimization.  
I wouldn't bother.

My initial thought was that the unified postings file format that KS  
is using now offers so many benefits that KS should go with it anyway,  
despite the increased costs for simple TermQueries.   But maybe we can  
nail the design for the divide-by-data-type format right now.

>>> This is like "column stride" vs "row stride" serialization
>>> of a matrix.
>>>
>>> Relatively soon, though, we will all be on SSDs, so maybe this
>>> locality argument becomes far less important ;)
>>>
>>
>> Yes, I've thought about that.  It defeats the phrase-query locality
>> argument for unified postings files and recommends breaking things up
>> logically by type of data into frq/prx/payload/whatever.
>>
>> Would it be possible to design a Posting plugin class that reads from
>> multiple files?  I'm pretty sure the answer is yes.  It messes up the
>> single-stream readRecord() method I've been detailing and might force
>> Posting to maintain state.  But if Postings are scarce TermBuffer- 
>> style
>> objects where values mutate, rather than populous Term-style  
>> objects where
>> you need a new instance for each set of values, then it doesn't  
>> matter if
>> they're large.
>>
>> If that could be done, I think it would be possible to retrofit the
>> Posting/PostingList concept into Lucene without a file format  
>> change.  FWIW.
>
> I think this is possible.

We should aggressively explore this concept.  By my tally, breaking up  
posting content into multiple files by data type has the most benefits  
and the fewest drawbacks.

Brainstorming...

In devel branch KS, ScorePost_Read_Record reads three blocks of data  
from an InStream:

   doc_num/freq
   boost/norm byte
   positions

Let's say that instead of ScorePost_Read_Record operating on a passed- 
in InStream, we give ScorePosting a ScorePost_Next() method, and have  
it operate on three internally maintained InStreams:

     void
     ScorePost_next(ScorePosting *self)
     {
         u32_t  num_prox;
         u32_t  position = 0;
         u32_t *positions;
         const u32_t doc_code = InStream_Read_C32(self->frq_in);
         const u32_t doc_delta = doc_code >> 1;

         /* Apply delta doc and retrieve freq. */
         self->doc_num  += doc_delta;
         if (doc_code & 1)
             self->freq = 1;
         else
             self->freq = InStream_Read_C32(self->frq_in);

         /* Decode boost/norm byte. */
         self->impact = self->sim->norm_decoder[
                 InStream_Read_U8(self->norms_in)
             ];

         /* Read positions. */
         num_prox = self->freq;
         if (num_prox > self->prox_cap) {
             self->prox = REALLOCATE(self->prox, num_prox, u32_t);
         }
         positions = self->prox;
         while (num_prox--) {
             position += InStream_Read_C32(self->prx_in);
             *positions++ = position;
         }
     }

So, there we have a single operation, in a single class, defining a  
codec -- achieving my main goal, even while reading from multiple files.

But check this out: we could feed the ScorePosting constructor the  
same InStream object three times, and it wouldn't know the  
difference.  So the same read operation can be used against both a  
multi-file format and a single file format.

Seeking might get a little weird, I suppose.

I think a variant of that read op could be created against various  
versions of the Lucene file format going back, making it possible to  
isolate and archive obsolete codecs and clean up the container classes.

> When reading an index, the
> Posting/PostingList should be more like TermBuffer than Term.

Yes.

Now's a good time to remark on a difference between KS and Lucene when  
reading the equivalent of TermPositions: In KS, all positions are read  
in one grab during ScorePost_Read_Record -- there's no nextPosition()  
method.  That was a tradeoff I made primarily for simplicity's sake,  
since it meant that PhraseScorer could be implemented with integer  
arrays and pointer math.  Reduced CPU overhead was another theoretical  
benefit, but I've never benchmarked it.

If you didn't want to do that but you still wanted to implement a  
PhraseScorer based around PostingList objects rather than  
TermPositions objects, you have a bit of a quandary: PostingList only  
advances in doc-sized increments, because it doesn't have a  
nextPosition() method.  So, nextPosition() would have to be  
implemented by ScorePosting:

   // Iterate over docs and positions.
   while (postingList.next()) {
     ScorePosting posting = postingList.getPosting();
      while (posting.nextPosition()) {
       int position = posting.GetPosition();
       ...
     }
   }

If we did that, it would require a change in the behavior for  
ScorePost_Next(), above.

> Thinking about a codec reading multiple files... I would also love
> this: a codec that can write & read layered updates to the inverted
> files.

IMO, it should not be the codec object that performs actions like this  
-- it should be a container class that the user knows nothing about.

> I agree, Lucene needs a stronger separation of "container" from
> "codec".  If someone just wants to plugin a new codec they should be
> able to cleanly plug into an existing container and "only" provide the
> codec.
>
> EG, say I want to store say an arbitrary array of ints per document.
> Not all documents have an array, and when they do, the array length
> varies.
>
> To do this, I'd like to re-use a container that knows how to store a
> byte blob per document, sparsely, and say indexed with a multi-level
> skip list.  That container is exactly the container we now use for
> storing frq/prx data, under each term.  So, ideally, I can easily
> re-use that container and just provide a codec (encoder & decoder)
> that maps from an int[] to a byte blob, and back.
>
> We need to factor things so that this container can be easily shared
> and is entirely decoupled from the codecs it's using.

Perfect.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Flexible indexing design

Posted by Michael McCandless <lu...@mikemccandless.com>.
Marvin Humphrey <ma...@rectangular.com> wrote:

>  On Apr 13, 2008, at 2:35 AM, Michael McCandless wrote:
>
>
> > I think the major difference is locality?  In a compound file, you
> > have to seek "far away" to reach the prx & skip data (if they are
> > separate).
>
> There's another item worth mentioning, something that Doug, Grant and I
> discussed when this flexible indexing talk started way back when.  When you
> unify frq/prx data into a single file, phrase queries and the like benefit
> from improved locality, but simple term queries are impeded because needless
> positional data must be plowed through.

Good point.

> We dismissed that cost with the assertion that you could specify a
> match-only field for simple queries if that was important to you, but IME
> that doesn't seem to be very practical.  It's hard for the internals to know
> that they should prefer one field over another based on the type of query,
> and hard to manually override everywhere.

The thing is, we have to build this out anyway, right?  Ie, *Query
must know something about the index.  If I have a pluggable indexer,
then on the querying side I need something (I'm not sure what/how)
that knows how to create the right demuxer (container) and codec
(decoder) to interact with whatever my indexing plugins wrote.

So I don't think it's "out of bounds" to have *Query classes that know
to use the non-prx variant of a field when positions aren't needed.
Though I agree it makes things more complex.

> > This is like "column stride" vs "row stride" serialization
> > of a matrix.
> >
> > Relatively soon, though, we will all be on SSDs, so maybe this
> > locality argument becomes far less important ;)
> >
>
> Yes, I've thought about that.  It defeats the phrase-query locality
> argument for unified postings files and recommends breaking things up
> logically by type of data into frq/prx/payload/whatever.
>
> Would it be possible to design a Posting plugin class that reads from
> multiple files?  I'm pretty sure the answer is yes.  It messes up the
> single-stream readRecord() method I've been detailing and might force
> Posting to maintain state.  But if Postings are scarce TermBuffer-style
> objects where values mutate, rather than populous Term-style objects where
> you need a new instance for each set of values, then it doesn't matter if
> they're large.
>
> If that could be done, I think it would be possible to retrofit the
> Posting/PostingList concept into Lucene without a file format change.  FWIW.

I think this is possible.  When reading an index, the
Posting/PostingList should be more like TermBuffer than Term.

Thinking about a codec reading multiple files... I would also love
this: a codec that can write & read layered updates to the inverted
files.  EG say I want to update field X of a bunch of documents.  Why
not do this, but write an "incremental" update (with a generation
number so we remain write-once) to the index files.

This way I have a massive original _x.frq file, and then a smallish
_x_1.frq update.

Then when reading, I dynamically choose _x_1.frq when it has a given
docID, else I fallback to _x.frq.  We could write many such updates to
a segment over time.  A partial optimize could coalesce all of them
and write a new single .frq (to the next generation).

We could do something similar with per-document stores.  EG why
rewrite norms if only a few changed?  We could instead store a
sparsely encoded _x_1.nrm "update".  Likewise for stored fields, term
vectors.

This would finally give us incremental updates of just reindexing one
field of a document.
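A rough Java sketch of the read side of that idea (everything here is
hypothetical -- no such layered reader exists in Lucene): generation 0 is
the base file (_x.frq), later generations are the sparse updates
(_x_1.frq, ...), and the newest generation that has an entry for a docID
wins.

    import java.util.List;

    /** Hypothetical layered reader over a base file plus sparse updates. */
    class LayeredReader {
        interface Layer {
            boolean hasDoc(int docID);  // does this generation carry the doc?
            byte[] read(int docID);     // the encoded record for that doc
        }

        private final List<Layer> generations;  // oldest first

        LayeredReader(List<Layer> generations) {
            this.generations = generations;
        }

        /** Newest generation wins; older layers are shadowed for that doc. */
        byte[] read(int docID) {
            for (int i = generations.size() - 1; i >= 0; i--) {
                Layer layer = generations.get(i);
                if (layer.hasDoc(docID)) {
                    return layer.read(docID);
                }
            }
            return null;  // no generation has an entry for this doc
        }
    }

A partial optimize would then collapse the layers back into a single
next-generation file, exactly as described above.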

> > I would think running out of file descriptors is common problem otherwise.
> >
>
> The default per-process limit for file descriptors on OS X is 256.  Under
> the Lucene non-compound file format, you're guaranteed to run out of file
> descriptors eventually under normal usage.  If KS allowed a non-compound
> format, you'd also be guaranteed to run out of file descriptors, just
> sooner.  Since not failing at all is the only acceptable outcome, there's
> not much practical difference.

It's a performance/risk tradeoff.  Though I wonder what the practical
performance difference is at search time between CFS and non-CFS for
Lucene.

> I think there's more to be gained from tweaking out the VFS than in
> accommodating a non-compound format.  Saddling users with file descriptor
> constraint worries and having to invoke ulimit all the time sucks.

I think making it an option is appropriate, if the performance
difference is there, so long as the default is still CFS.  Simple
things should be simple; complex things should be possible ;)

> > > My conclusion was that it was better to exploit the benefits of bounded,
> > > single-purpose streams and simple file formats whenever possible.
> > >
> > > There's also a middle way, where each *format* gets its own file.  Then you
> > > wind up with fewer files, but you have to track field number state.
> > >
> > > The nice thing is that packet-scoped plugins can be compatible with ALL of
> > > these configurations:
> > >
> >
> > Right.  This way users can pick & choose how to put things in the
> > index (with "healthy" defaults, of course).
> >
>
>
> Well, IMO, we don't want the users to have to care about the container
> classes.
>
> Under the TermDocs/TermPositions model, every time you add new data, you
> need to subclass the containers.  Under the PostingList model, you don't --
> Posting plugs in.
>
> For KS at least, the primary goal is to make Posting public and as easy to
> subclass as possible -- because a public Posting plugin class seems to me to
> be the easiest way to add custom flexible indexing features like like text
> payloads, or arbitrary integer values used by custom function queries, or
> other schemes not yet considered.

I agree, Lucene needs a stronger separation of "container" from
"codec".  If someone just wants to plugin a new codec they should be
able to cleanly plug into an existing container and "only" provide the
codec.

EG, say I want to store an arbitrary array of ints per document.
Not all documents have an array, and when they do, the array length
varies.

To do this, I'd like to re-use a container that knows how to store a
byte blob per document, sparsely, and say indexed with a multi-level
skip list.  That container is exactly the container we now use for
storing frq/prx data, under each term.  So, ideally, I can easily
re-use that container and just provide a codec (encoder & decoder)
that maps from an int[] to a byte blob, and back.

We need to factor things so that this container can be easily shared
and is entirely decoupled from the codecs it's using.
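To make the split concrete, here's a minimal Java sketch of just the codec
half for that int[] example (IntArrayCodec is hypothetical): it maps an
int[] to an opaque byte blob and back, and knows nothing about sparseness,
skip lists, or which file the blob ends up in -- all of that stays in the
shared container.

    import java.nio.ByteBuffer;

    /** Hypothetical pluggable codec: int[] <-> opaque byte blob. */
    class IntArrayCodec {
        /** Encode: 4 bytes for the length, then 4 bytes per value. */
        byte[] encode(int[] values) {
            ByteBuffer buf = ByteBuffer.allocate(4 + 4 * values.length);
            buf.putInt(values.length);
            for (int v : values) {
                buf.putInt(v);
            }
            return buf.array();
        }

        /** Decode: read the length, then the values. */
        int[] decode(byte[] blob) {
            ByteBuffer buf = ByteBuffer.wrap(blob);
            int[] values = new int[buf.getInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = buf.getInt();
            }
            return values;
        }
    }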

Mike



Re: Flexible indexing design

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Apr 13, 2008, at 2:35 AM, Michael McCandless wrote:

> I think the major difference is locality?  In a compound file, you
> have to seek "far away" to reach the prx & skip data (if they are
> separate).

There's another item worth mentioning, something that Doug, Grant and  
I discussed when this flexible indexing talk started way back when.   
When you unify frq/prx data into a single file, phrase queries and the  
like benefit from improved locality, but simple term queries are  
impeded because needless positional data must be plowed through.

We dismissed that cost with the assertion that you could specify a  
match-only field for simple queries if that was important to you, but  
IME that doesn't seem to be very practical.  It's hard for the  
internals to know that they should prefer one field over another based  
on the type of query, and hard to manually override everywhere.

> This is like "column stride" vs "row stride" serialization
> of a matrix.
>
> Relatively soon, though, we will all be on SSDs, so maybe this
> locality argument becomes far less important ;)

Yes, I've thought about that.  It defeats the phrase-query locality  
argument for unified postings files and recommends breaking things up  
logically by type of data into frq/prx/payload/whatever.

Would it be possible to design a Posting plugin class that reads from  
multiple files?  I'm pretty sure the answer is yes.  It messes up the  
single-stream readRecord() method I've been detailing and might force  
Posting to maintain state.  But if Postings are scarce TermBuffer- 
style objects where values mutate, rather than populous Term-style  
objects where you need a new instance for each set of values, then it  
doesn't matter if they're large.

If that could be done, I think it would be possible to retrofit the  
Posting/PostingList concept into Lucene without a file format change.   
FWIW.
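A small Java sketch of the "scarce, mutable" style in question
(PostingBuffer here is hypothetical, analogous to TermBuffer rather than
Term): one instance lives per PostingList and its fields are simply
overwritten on every advance, so it hardly matters how big the object is.

    /** Hypothetical TermBuffer-style posting: one reusable instance whose
     *  values mutate as the enclosing PostingList advances. */
    class PostingBuffer {
        int docID;
        int freq;
        int[] positions = new int[0];

        /** Overwrite with the next record's values; allocates only when
         *  the positions array has to grow. */
        void reset(int docID, int freq, int[] src, int numPositions) {
            this.docID = docID;
            this.freq = freq;
            if (positions.length < numPositions) {
                positions = new int[numPositions];
            }
            System.arraycopy(src, 0, positions, 0, numPositions);
        }
    }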

> Does KS allow non-compound format?

No, it doesn't.

> I would think running out of file descriptors is common problem  
> otherwise.

The default per-process limit for file descriptors on OS X is 256.   
Under the Lucene non-compound file format, you're guaranteed to run  
out of file descriptors eventually under normal usage.  If KS allowed  
a non-compound format, you'd also be guaranteed to run out of file  
descriptors, just sooner.  Since not failing at all is the only  
acceptable outcome, there's not much practical difference.

I think there's more to be gained from tweaking out the VFS than in  
accommodating a non-compound format.  Saddling users with file  
descriptor constraint worries and having to invoke ulimit all the time  
sucks.

>> My conclusion was that it was better to exploit the benefits of  
>> bounded,
>> single-purpose streams and simple file formats whenever possible.
>>
>> There's also a middle way, where each *format* gets its own file.   
>> Then you
>> wind up with fewer files, but you have to track field number state.
>>
>> The nice thing is that packet-scoped plugins can be compatible with  
>> ALL of
>> these configurations:
>
> Right.  This way users can pick & choose how to put things in the
> index (with "healthy" defaults, of course).


Well, IMO, we don't want the users to have to care about the container  
classes.

Under the TermDocs/TermPositions model, every time you add new data,  
you need to subclass the containers.  Under the PostingList model, you  
don't -- Posting plugs in.

For KS at least, the primary goal is to make Posting public and as  
easy to subclass as possible -- because a public Posting plugin class  
seems to me to be the easiest way to add custom flexible indexing  
features like text payloads, or arbitrary integer values used by
custom function queries, or other schemes not yet considered.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Flexible indexing design

Posted by Michael McCandless <lu...@mikemccandless.com>.
Marvin Humphrey <ma...@rectangular.com> wrote:
>
>  On Apr 10, 2008, at 3:10 AM, Michael McCandless wrote:
>
>
> > Can't you compartmentalize while still serializing skip data into the
> > single frq/prx file?
> >
>
> Yes, that's possible.
>
> The way KS is set up right now, PostingList objects maintain i/o state, and
> Posting's Read_Record() method just deals with whatever instream gets passed
> to it.  If the PostingList were to sneak in the reading of a skip packet,
> the Posting would be none the wiser.

Got it.

> > This is analogous to how videos are encoded.  EG the AVI file format
> > is a "container" format, and it contains packets of video and packets
> > of audio, interleaved at the right rate so a player can play both in
> > sync.  The "container" has no idea how to decode the audio and video
> > packets.  Separate codecs do that.
> >
> > Taking this back to Lucene, there's a container format that, using
> > TermInfo, knows where the frq/prx data (packet) is and where the skip
> > data (packet) is.  And it calls on separate decoders to decode each.
> >
>
> This is an intriguing proposal.  :)
>
> The dev branch of KS currently uses oodles of per-segment files for the
> lexicon and the postings:
>
>   * One postings file per field per segment.      [SEGNAME-FIELDNUM.p]
>   * One lexicon file per field per segment.       [SEGNAME-FIELDNUM.lex]
>   * One lexicon index file per field per segment. [SEGNAME-FIELDNUM.lexx]
>
> Having so many files is something of a drawback, but it means that each
> individual file can be very specialized, and that yields numerous benefits:
>
>   * Each file has a simple format.
>   * File Format spec easier to write and understand.
>   * Formats are pluggable.
>       o Easy to deprecate.
>       o Easy to isolate within a single class.
>   * PostingList objects are always single-field.
>       o Simplified internals.
>           * No field numbers to track.
>           * Repeat one read operation to scan the whole file.
>       o Pluggable using subclasses of Posting.
>       o Fewer subclasses (e.g. SegmentTermPositions is not needed).
>   * Lexicon objects are always single-field.
>       o Simplified internals.
>           * No field numbers to track.
>           * Repeat one read operation to scan the whole file.
>       o Possible to extend with custom per-field sorting at index-time.
>       o Easier to extend to non-text terms.
>           * Comparison ops guaranteed to see like objects.
>   * Stream-related errors are comparatively easy to track down.
>
> Some of these benefits are preserved when reading from a single stream.
> However, there are some downsides:
>
>   * Container classes like PostingList more complex.
>       o No longer single-field.
>       o Harder to detect overruns that would have been EOF errors.
>       o Easier to lose stream sync.
>       o Periodic sampling for index records more complex.
>           * Tricky to prevent inappropriate compareTo ops at boundaries.
>   * Harder to troubleshoot.
>       o Glitch in one plugin can manifest as an error somewhere else.
>       o Hexdump nearly impossible to interpret.
>       o Mentally taxing to follow like packets in an interleaved stream.
>   * File corruption harder to recover from.
>       o Only as reliable as the weakest plugin.
>
>  Benefits of the single stream include:
>
>   * Fewer hard disk seeks.
>   * Far fewer files.
>
> If you're using Lucene's non-compound file format, having far fewer files
> could be a significant benefit depending on the OS.  But here's the thing:
>
> If you're using a custom virtual file system a la Lucene's compound files,
> what's the difference between divvying up data using filenames within the
> CompoundFileReader object, and divvying up data downstream in some other
> object using some ad hoc mechanism?

I think the major difference is locality?  In a compound file, you
have to seek "far away" to reach the prx & skip data (if they are
separate).  This is like "column stride" vs "row stride" serialization
of a matrix.

Relatively soon, though, we will all be on SSDs, so maybe this
locality argument becomes far less important ;)

Does KS allow non-compound format?  I would think running out of
file descriptors is a common problem otherwise.  Though, I think your
fibonacci merge policy is more "aggressive" than Lucene's
LogMergePolicy (ie, fewer segments for the same # docs).

> My conclusion was that it was better to exploit the benefits of bounded,
> single-purpose streams and simple file formats whenever possible.
>
> There's also a middle way, where each *format* gets its own file.  Then you
> wind up with fewer files, but you have to track field number state.
>
> The nice thing is that packet-scoped plugins can be compatible with ALL of
> these configurations:

Right.  This way users can pick & choose how to put things in the
index (with "healthy" defaults, of course).

Mike



Re: Flexible indexing design

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Apr 10, 2008, at 3:10 AM, Michael McCandless wrote:

> Can't you compartmentalize while still serializing skip data into the
> single frq/prx file?

Yes, that's possible.

The way KS is set up right now, PostingList objects maintain i/o  
state, and Posting's Read_Record() method just deals with whatever  
instream gets passed to it.  If the PostingList were to sneak in the  
reading of a skip packet, the Posting would be none the wiser.
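
A minimal sketch of that "sneaking in", assuming a length-prefixed
skip packet every skipInterval docs -- the Posting interface and the
packet framing are hypothetical; only IndexInput is an existing class:

   import java.io.IOException;
   import org.apache.lucene.store.IndexInput;

   // Hypothetical decoder plugin, standing in for a KS-style Posting.
   interface Posting {
       void readRecord(IndexInput in) throws IOException;
   }

   class SegPostingList {
       private final IndexInput postings;   // single interleaved stream
       private final Posting posting;       // pluggable decoder
       private final int skipInterval;
       private int docsRead;

       SegPostingList(IndexInput postings, Posting posting, int skipInterval) {
           this.postings = postings;
           this.posting = posting;
           this.skipInterval = skipInterval;
       }

       boolean next(int docFreq) throws IOException {
           if (docsRead >= docFreq) return false;
           if (docsRead > 0 && docsRead % skipInterval == 0) {
               skipSkipPacket();              // consume interleaved skip data
           }
           posting.readRecord(postings);      // decode the real packet
           docsRead++;
           return true;
       }

       private void skipSkipPacket() throws IOException {
           long len = postings.readVInt();    // assumed packet length header
           postings.seek(postings.getFilePointer() + len);
       }
   }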

> This is analogous to how videos are encoded.  EG the AVI file format
> is a "container" format, and it contains packets of video and packets
> of audio, interleaved at the right rate so a player can play both in
> sync.  The "container" has no idea how to decode the audio and video
> packets.  Separate codecs do that.
>
> Taking this back to Lucene, there's a container format that, using
> TermInfo, knows where the frq/prx data (packet) is and where the skip
> data (packet) is.  And it calls on separate decoders to decode each.

This is an intriguing proposal.  :)

The dev branch of KS currently uses oodles of per-segment files for  
the lexicon and the postings:

   * One postings file per field per segment.      [SEGNAME-FIELDNUM.p]
   * One lexicon file per field per segment.       [SEGNAME-FIELDNUM.lex]
   * One lexicon index file per field per segment. [SEGNAME-FIELDNUM.lexx]

Having so many files is something of a drawback, but it means that  
each individual file can be very specialized, and that yields numerous  
benefits:

   * Each file has a simple format.
   * File Format spec easier to write and understand.
   * Formats are pluggable.
       o Easy to deprecate.
       o Easy to isolate within a single class.
   * PostingList objects are always single-field.
       o Simplified internals.
           * No field numbers to track.
           * Repeat one read operation to scan the whole file.
       o Pluggable using subclasses of Posting.
       o Fewer subclasses (e.g. SegmentTermPositions is not needed).
   * Lexicon objects are always single-field.
       o Simplified internals.
           * No field numbers to track.
           * Repeat one read operation to scan the whole file.
       o Possible to extend with custom per-field sorting at index-time.
       o Easier to extend to non-text terms.
           * Comparison ops guaranteed to see like objects.
   * Stream-related errors are comparatively easy to track down.

Some of these benefits are preserved when reading from a single  
stream.  However, there are some downsides:

   * Container classes like PostingList more complex.
       o No longer single-field.
       o Harder to detect overruns that would have been EOF errors.
       o Easier to lose stream sync.
       o Periodic sampling for index records more complex.
           * Tricky to prevent inappropriate compareTo ops at  
boundaries.
   * Harder to troubleshoot.
       o Glitch in one plugin can manifest as an error somewhere else.
       o Hexdump nearly impossible to interpret.
       o Mentally taxing to follow like packets in an interleaved  
stream.
   * File corruption harder to recover from.
       o Only as reliable as the weakest plugin.

Benefits of the single stream include:

   * Fewer hard disk seeks.
   * Far fewer files.

If you're using Lucene's non-compound file format, having far fewer  
files could be a significant benefit depending on the OS.  But here's  
the thing:

If you're using a custom virtual file system a la Lucene's compound  
files, what's the difference between divvying up data using filenames  
within the CompoundFileReader object, and divvying up data downstream  
in some other object using some ad hoc mechanism?

My conclusion was that it was better to exploit the benefits of  
bounded, single-purpose streams and simple file formats whenever  
possible.

There's also a middle way, where each *format* gets its own file.   
Then you wind up with fewer files, but you have to track field number  
state.

The nice thing is that packet-scoped plugins can be compatible with  
ALL of these configurations:

> This way we can decouple the question of "how many files do I store my
> things in" from "how is each thing encoded/decoded".  Maybe I want
> frq/prx/skip all in one file, or maybe I want them in 3 different  
> files.

Well said.

>> The second problem is how to share a term dictionary over a  
>> cluster.  It
>> would be nice to be able to plug modules into IndexReader that  
>> represent
>> clusters of machines but that are dedicated to specific tasks: one  
>> cluster
>> could be dedicated to fetching full documents and applying  
>> highlighting;
>> another cluster could be dedicated to scanning through postings and
>> finding/scoring hits; a third cluster could store the entire term  
>> dictionary
>> in RAM.
>>
>> A centralized term dictionary held in RAM would be particularly  
>> handy for
>> sorting purposes.  The problem is that the file pointers of a term
>> dictionary are specific to indexes on individual machines.  A shared
>> dictionary in RAM would have to contain pointers for *all* clients,  
>> which
>> isn't really workable.
>>
>> So, just how do you go about assembling task specific clusters?   
>> The stored
>> documents cluster is easy, but the term dictionary and the postings  
>> are
>> hard.
>
> Phew!  This is way beyond what I'm trying to solve now :)

Hmm.  It doesn't look that difficult from my perspective.  The problem  
seems reasonably well isolated and contained.  But I've worked hard to  
make KS modular, so perhaps there's less distance left to travel.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Flexible indexing design

Posted by Michael McCandless <lu...@mikemccandless.com>.
Marvin Humphrey <ma...@rectangular.com> wrote:
> On Apr 9, 2008, at 6:35 AM, Michael Busch wrote:
>
>
> > We also need to come up with a good solution for the dictionary, because a
> term with frq/prx postings needs to store two (or three for skiplist) file
> pointers in the dictionary, whereas e. g. a "binary" posting list only needs
> one pointer.
> >
>
>  This is something I'm working on as well, and I hope we can solve a couple
> of design problems I've been turning over in my mind for some time.
>
>  In KS, the information Lucene stores in the frq/prx files is carried in one
> postings file per field, as discussed previously.  However, I made the
> additional change of breaking out skip data into a separate file (shared
> across all fields).  Isolating skip data sacrifices some locality of
> reference, but buys substantial gains in simplicity and
> compartmentalization.  Individual Posting subclasses, each of which defines
> a file format, don't have to know about skip algorithms at all.  :)
> Further, improvements in the skip algorithm only require changes to the
> .skip file, and falling back to PostingList_Next still works if the .skip
> file becomes corrupted since .skip carries only optimization info and no
> real data.

Can't you compartmentalize while still serializing skip data into the
single frq/prx file?

This is analogous to how videos are encoded.  EG the AVI file format
is a "container" format, and it contains packets of video and packets
of audio, interleaved at the right rate so a player can play both in
sync.  The "container" has no idea how to decode the audio and video
packets.  Separate codecs do that.

Taking this back to Lucene, there's a container format that, using
TermInfo, knows where the frq/prx data (packet) is and where the skip
data (packet) is.  And it calls on separate decoders to decode each.

This way we can decouple the question of "how many files do I store my
things in" from "how is each thing encoded/decoded".  Maybe I want
frq/prx/skip all in one file, or maybe I want them in 3 different files.
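
A rough sketch of that decoupling -- hypothetical interfaces, not a
concrete file-format proposal; only IndexInput is an existing class:

   import java.io.IOException;
   import java.util.Map;
   import org.apache.lucene.store.IndexInput;

   // Codec plugin: knows how to decode one kind of packet
   // (frq, prx, skip, payload, ...) but not where that packet lives.
   interface PacketDecoder {
       void decode(IndexInput in, long filePointer) throws IOException;
   }

   // Container: the TermInfo-driven layer that knows where each packet
   // starts and hands it to the matching decoder.  Whether the pointers
   // land in one file or three is a property of the maps, not the codecs.
   class TermPacketReader {
       private final Map<String, IndexInput> streams;      // "frq" -> file
       private final Map<String, PacketDecoder> decoders;  // "frq" -> codec

       TermPacketReader(Map<String, IndexInput> streams,
                        Map<String, PacketDecoder> decoders) {
           this.streams = streams;
           this.decoders = decoders;
       }

       // filePointers: per-term offsets pulled from the term dictionary.
       void readTerm(Map<String, Long> filePointers) throws IOException {
           for (Map.Entry<String, Long> e : filePointers.entrySet()) {
               decoders.get(e.getKey()).decode(streams.get(e.getKey()),
                                               e.getValue());
           }
       }
   }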

>  For reasons I won't go into here, KS doesn't need to put a field number in
> its TermInfo, but it does need doc freq, plus file positions for the
> postings file, the skip file, and the primary Lexicon file.  (Lexicon is the
> KS term dictionary class, akin to Lucene's TermEnum.)
>
>   struct kino_TermInfo {
>       kino_VirtualTable* _;
>       kino_ref_t ref;
>       chy_i32_t doc_freq;
>       chy_u64_t post_filepos;
>       chy_u64_t skip_filepos;
>       chy_u64_t lex_filepos;
>   };
>
>  There are two problems.
>
>  First is that I'd like to extend indexing with arbitrary subclasses of
> SegDataWriter, and I'd like these classes to be able to put their own file
> position bookmarks (or possibly other data) into TermInfo.  Making TermInfo
> hash-based would probably do it, but there would be nasty performance and
> memory penalties since TermInfo objects are numerous.
>
>  So, what's the best way to allow multiple, unrelated classes to extend
> TermInfo and the term dictionary file format?  Is it to break up TermInfo
> information horizontally rather than vertically, so that instead of a single
> array of TermInfo objects, we have a flexible stack of arrays of 64-bit
> integers representing file positions?

I think for starters TermInfo must encode N absolute offsets into
separate files and/or relative offsets into the same file or some
combination.

If we want to go even further, such that separate plugins can insert
arbitrary fields into the TermInfo, I'm not sure how best to do
that... memory cost is important for TermInfo since there are so many
of these created.
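
One way to keep that per-term cost down is the "horizontal" layout
Marvin described: rather than N fields on every TermInfo object, keep
one long[] of file pointers per component, indexed by term ordinal.
A sketch, with hypothetical names:

   import java.util.LinkedHashMap;
   import java.util.Map;

   // Hypothetical "horizontal" term dictionary: each writer plugin
   // registers one more array (e.g. "post", "skip", "lex") instead of
   // adding a field to a shared TermInfo class, so the per-term cost
   // is 8 bytes per component with no extra object headers.
   class TermPointerTable {
       private final int numTerms;
       private final Map<String, long[]> pointers = new LinkedHashMap<>();

       TermPointerTable(int numTerms) {
           this.numTerms = numTerms;
       }

       void addComponent(String name) {
           pointers.put(name, new long[numTerms]);
       }

       void set(String name, int termOrd, long filePointer) {
           pointers.get(name)[termOrd] = filePointer;
       }

       long get(String name, int termOrd) {
           return pointers.get(name)[termOrd];
       }
   }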

>  The second problem is how to share a term dictionary over a cluster.  It
> would be nice to be able to plug modules into IndexReader that represent
> clusters of machines but that are dedicated to specific tasks: one cluster
> could be dedicated to fetching full documents and applying highlighting;
> another cluster could be dedicated to scanning through postings and
> finding/scoring hits; a third cluster could store the entire term dictionary
> in RAM.
>
>  A centralized term dictionary held in RAM would be particularly handy for
> sorting purposes.  The problem is that the file pointers of a term
> dictionary are specific to indexes on individual machines.  A shared
> dictionary in RAM would have to contain pointers for *all* clients, which
> isn't really workable.
>
>  So, just how do you go about assembling task specific clusters?  The stored
> documents cluster is easy, but the term dictionary and the postings are
> hard.

Phew!  This is way beyond what I'm trying to solve now :)

> > For example, we should think about the Field APIs. Since we don't have
> global field semantics in Lucene I wonder how to handle conflict cases, e.
> g. when a document specifies a different posting list format than a previous
> one for the same field. The easiest way would be to not allow it and throw
> an exception. But this is kind of against Lucene's way of dealing with
> fields currently. But I'm scared of the complicated code to handle conflicts
> of all the possible combinations of posting list formats.
> >
>
>  Yeah. Lucene's field definition conflict-resolution code is gnarly already.
> :(

Yes.  But if each plugin handles its own merging I think this is
manageable.
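
For instance -- purely a sketch, hypothetical interface -- each writer
plugin could own its files and its merge step:

   import java.io.IOException;

   // Hypothetical per-component writer plugin: it owns its own files,
   // its own term-dictionary pointers, and its own merge logic, so
   // per-field format conflicts never have to meet in shared code.
   interface PostingsComponentWriter {
       // Append whatever this component stores for one field of one doc.
       void addField(int docId, String field, Object value) throws IOException;

       // Fold this component's data from the source segments into the
       // new segment; other components merge themselves independently.
       void merge(String[] sourceSegments, String destSegment) throws IOException;

       void finish() throws IOException;
   }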

> > KinoSearch doesn't have to worry about this, because it has a static
> schema (I think?), but isn't as flexible as Lucene.
> >
>
>  Earlier versions of KS did not allow the addition of new fields on the fly,
> but this has been changed.  You can now add fields to an existing Schema
> object like so:
>
>     for my $doc (@docs) {
>         # Dynamically define any new fields as 'text'.
>         for my $field ( keys %$doc ) {
>             $schema->add_field( $field => 'text' );
>         }
>         $invindexer->add_doc($doc);
>     }
>
>  See the attached sample app for that snippet in context.
>
>  Here are some current differences between KS and Lucene:
>
>   * KS doesn't yet purge *old* dynamic field definitions which have
>     become obsolete.  However, that should be possible to add later,
>     as a sweep triggered during full optimization.
>   * You can't change the definition of an existing field.
>   * Documents are hash-based, so you can't have multiple fields with
>     the same name within one document object.  However, I consider
>     that capability a misfeature of Lucene.
>
>  In summary, I don't think that global field semantics meaningfully restrict
> flexibility for the vast majority of users.
>
>  The primary distinction is/was philosophical.   IIRC, Doug didn't want to
> force people to think about index design in advance, so the Field/Document
> API was optimized for newbies.  In contrast, KS wants you to give it a
> Schema before indexing commences.
>
>  It's still true that full-power KS forces you to think about index design
> up-front.  However, there's now a KinoSearch::Simple API targeted at newbies
> which hides the Schema API and handles field definition automatically -- so
> Doug's ease-of-use design goal has been achieved via different means.

Fixed schema certainly simplifies many things, but I don't think this
is something we can change about Lucene at this point.

Mike
