Posted to dev@lucene.apache.org by Michael Busch <bu...@gmail.com> on 2007/10/20 00:53:47 UTC

Per-document Payloads (was: Re: lucene indexing and merge process)

John Wang wrote:
> 
>      I can try to get some numbers for loading an int[] array vs
> FieldCache.getInts().

I've had a similar performance problem when I used the FieldCache. The
loading is apparently so slow because each value is stored as a term in
the dictionary. To fill the cache it is necessary to iterate over all
terms for the field in the dictionary, and for each term its posting list
is opened to check which documents have that value.

If you store unique docIds, then no two documents share the same value.
That means that each value gets its own entry in the dictionary, and
loading each value requires two random I/O seeks (one for the term
lookup plus one to open the posting list).

In my app, filling the cache like that took several minutes for a big
index.
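
For reference, the per-term loading loop looks roughly like this with the
current API (a simplified sketch, not the exact FieldCache code):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

// Simplified sketch of FieldCache-style loading: one dictionary entry
// (and one postings lookup) per distinct value of the field.
int[] loadInts(IndexReader reader, String field) throws IOException {
  int[] values = new int[reader.maxDoc()];
  TermEnum termEnum = reader.terms(new Term(field, ""));
  TermDocs termDocs = reader.termDocs();
  try {
    do {
      Term term = termEnum.term();
      if (term == null || !term.field().equals(field)) break;
      int value = Integer.parseInt(term.text());
      termDocs.seek(termEnum);
      while (termDocs.next()) {
        values[termDocs.doc()] = value;   // every doc that has this term
      }
    } while (termEnum.next());
  } finally {
    termDocs.close();
    termEnum.close();
  }
  return values;
}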

To speed things up I did essentially what Ning suggested. Now I store
the values as payloads in the posting list of an artificial term. To
fill my cache it's only necessary to perform a couple of I/O seeks for
opening the posting list of the specific term, then it is just a
sequential scan to load all values. With this approach the time for
filling the cache went down from minutes to seconds!
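
Concretely, the reading side of that trick is just this with the Lucene 2.2
payload API (a minimal sketch, assuming the UIDs were indexed as 4-byte
payloads on an artificial term such as "_ID:_UID"):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

// One seek to the artificial term, then a sequential scan of its posting
// list, reading the 4-byte UID payload of each document.
int[] loadUids(IndexReader reader) throws IOException {
  int[] uids = new int[reader.maxDoc()];
  TermPositions tp = reader.termPositions(new Term("_ID", "_UID"));
  byte[] buf = new byte[4];
  try {
    while (tp.next()) {
      tp.nextPosition();               // payloads are attached to positions
      byte[] b = tp.getPayload(buf, 0);
      uids[tp.doc()] = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
                     | ((b[2] & 0xFF) << 8)  |  (b[3] & 0xFF);
    }
  } finally {
    tp.close();
  }
  return uids;
}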

Now this approach is already much better than the current field cache
implementation, but it still can be improved. In fact, we already have a
mechanism for doing that: the norms. Norms are stored with a fixed size,
which means both random access and sequential scan are optimal. Norms
are also cached in memory, and filling that cache is much faster
compared to the current FieldCache approach.
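
(For comparison, norms today are exposed as one non-sparse byte per
document; reading a per-doc value is just an array lookup:)

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Similarity;

// Norms: a single byte[maxDoc()] array, cached by the reader.
float normFor(IndexReader reader, String field, int docId) throws IOException {
  byte[] norms = reader.norms(field);          // loaded/cached once
  return Similarity.decodeNorm(norms[docId]);  // decode the per-doc byte
}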

I was therefore thinking about adding per-document payloads to Lucene
(we can also call it document-metadata). The API could look like this:

Document d = new Document();
byte[] uidValue = ...
d.addMetadata("uid", uidValue);

And on the retrieval side all values could either be loaded into the
field cache, or, if the index is too big, a new API can be used:

IndexReader reader = IndexReader.open(...);
DocumentMetadataIterator it = reader.metadataIterator("uid");

where DocumentMetadataIterator is an interface similar to TermDocs:

interface DocumentMetadataIterator {
  void seek(String name);
  boolean next();
  boolean skipTo(int doc);

  int doc();
  byte[] getMetadata();
}
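
Filling a cache with the proposed iterator might then look like this
(purely hypothetical, since none of this API exists yet; the 4-byte UID
decoding is just for illustration):

// Hypothetical usage of the proposed per-document payload API.
DocumentMetadataIterator it = reader.metadataIterator("uid");
int[] uids = new int[reader.maxDoc()];
while (it.next()) {
  byte[] v = it.getMetadata();
  uids[it.doc()] = ((v[0] & 0xFF) << 24) | ((v[1] & 0xFF) << 16)
                 | ((v[2] & 0xFF) << 8)  |  (v[3] & 0xFF);
}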

The next question would be how to store the per-doc payloads (PDP). If
all values have the same length (as the unique docIds), then we should
store them as efficiently as possible, like the norms. However, we still
want to offer the flexibility of having variable-length values. For this
case we could use a new data structure similar to our posting list.

PDPList               --> FixedLengthPDPList | <VariableLengthPDPList, SkipList>
FixedLengthPDPList    --> <Payload>^SegSize
VariableLengthPDPList --> <DocDelta, PayloadLength?, Payload>
Payload               --> Byte^PayloadLength
PayloadLength         --> VInt
SkipList              --> see frq.file

Because we don't have global field semantics, Lucene should automatically
pick the "right" data structure. This could work like this: when the
DocumentsWriter writes a segment it checks whether all values of a PDP
have the same length. If yes, it stores them as a FixedLengthPDPList; if
not, as a VariableLengthPDPList.
When the SegmentMerger merges two or more segments it checks if all
segments have a FixedLengthPDPList with the same length for a PDP. If
not, it writes a VariableLengthPDPList to the new segment.
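
In pseudo-code the selection could be as simple as the following (an
illustrative sketch; the write* helpers are made up, not existing Lucene
code):

// Fall back to the variable-length format as soon as a value is missing
// or two values of the PDP differ in length within the segment.
boolean fixedLength = true;
int length = -1;
for (int doc = 0; doc < numDocs; doc++) {
  byte[] value = pdpValue(doc);           // the PDP value of this doc
  if (value == null) { fixedLength = false; break; }
  if (length == -1) length = value.length;
  else if (value.length != length) { fixedLength = false; break; }
}
if (fixedLength) writeFixedLengthPDPList(length);
else writeVariableLengthPDPList();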

I think this would be a nice new feature for Lucene. We could then have
user-defined and Lucene-specific PDPs. For example, norms would be in
the latter category (this way we would get rid of the special code for
norms, as they could be handled as PDPs). It would also be easy to add
new features in the future, like splitting the norms into two values: a
norm and a boost value.

OK, lots of thoughts, I'm sure I'll get lots of comments too ... ;)

- Michael

> 
> Thanks
> 
> -John
> 
> On 10/19/07, Michael McCandless <lu...@mikemccandless.com> wrote:
>>
>> It seems like there are (at least) two angles here for getting better
>> performance from FieldCache:
>>
>>   1) Be incremental: with reopen() we should only have to update a
>>      subset of the array in the FieldCache, according to the changed
>>      segments.  This is what Hoss is working on and Mark was referring
>>      to and I think it's very important!
>>
>>   2) Parsing is slow (?): I'm guessing one of the reasons that John
>>      added the _X.udt file was because it's much faster to load an
>>      array of already-parsed ints than to ask FieldCache to populate
>>      itself.
>>
>> Even if we do #1, I think #2 could be a big win (in addition)?  John
>> do you have any numbers of how much faster it is to load the array of
>> ints from the _X.udt file vs having FieldCache populate itself?
>>
>> Also on the original question of "can we open up SegmentReader,
>> FieldsWriter, etc.", I think that's a good idea?  At least we can make
>> things protected instead of private/final?
>>
>> Mike
>>
>> "Ning Li" <ni...@gmail.com> wrote:
>>> I see what you mean by 2) now. What Mark said should work for you when
>>> it's done.
>>>
>>> Cheers,
>>> Ning
>>>
>>> On 10/18/07, John Wang <jo...@gmail.com> wrote:
>>>> Hi Ning:
>>>>     That is essentially what field cache does. Doing this for each
>> docid in
>>>> the result set will be slow if the result set is large. But loading it
>> in
>>>> memory when opening index can also be slow if the index is large and
>> updates
>>>> often.
>>>>
>>>> Thanks
>>>>
>>>> -John
>>>>
>>>> On 10/18/07, Ning Li <ni...@gmail.com> wrote:
>>>>> Make all documents have a term, say "ID:UID", and for each document,
>>>>> store its UID in the term's payload. You can read off this posting
>>>>> list to create your array. Will this work for you, John?
>>>>>
>>>>> Cheers,
>>>>> Ning
>>>>>
>>>>>
>>>>> On 10/18/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>>>>>> Forwarding this to java-dev per request.  Seems like the best
>> place
>>>>>> to discuss this topic.
>>>>>>
>>>>>>         Erik
>>>>>>
>>>>>>
>>>>>> Begin forwarded message:
>>>>>>
>>>>>>> From: "John Wang" <jo...@gmail.com>
>>>>>>> Date: October 17, 2007 5:43:29 PM EDT
>>>>>>> To: erik@ehatchersolutions.com
>>>>>>> Subject: lucene indexing and merge process
>>>>>>>
>>>>>>> Hi Erik:
>>>>>>>
>>>>>>>     We are revamping our search system here at LinkedIn. And we
>>>>>>> are using Lucene.
>>>>>>>
>>>>>>>     One issue we ran across is that we store a UID in Lucene
>>>>>>> which we map to the DB storage. So given a docid, to look up its
>>>>>>> UID, we have the following solutions:
>>>>>>>
>>>>>>> 1) Index it as a Stored field and get it from reader.document()
>>>>>>> (very slow if recall is large)
>>>>>>> 2) Load/Warmup the FieldCache (for a large corpus, loading up the
>>>>>>> indexreader can be slow)
>>>>>>> 3) Construct it using the FieldCache and persist it on disk every
>>>>>>> time the index changes (not suitable for real-time indexing, e.g.
>>>>>>> this process will degrade as the # of documents gets large)
>>>>>>>
>>>>>>>     None of the above solutions turn out to be adequate for our
>>>>>>> requirements.
>>>>>>>
>>>>>>>      What we end up doing is modifying the Lucene code by changing
>>>>>>> the SegmentReader, DocumentWriter, and FieldWriter classes, taking
>>>>>>> advantage of the Lucene segment/merge process. E.g.:
>>>>>>>
>>>>>>>      For each segment, we store a .udt file, which is an int[]
>>>>>>> array (by changing the FieldWriter class)
>>>>>>>
>>>>>>>      And SegmentReader will load the .udt file into an array.
>>>>>>>
>>>>>>>      And merge happens seamlessly.
>>>>>>>
>>>>>>>      Because of the tight encapsulation around these classes, e.g.
>>>>>>> private and final methods, it is very difficult to extend Lucene
>>>>>>> while avoiding a branch into our own version. Is there a way we
>>>>>>> can open up and make these classes extensible? We'd be happy to
>>>>>>> contribute what we have done.
>>>>>>>
>>>>>>>      I guess to tackle the problem from a different angle: is
>>>>>>> there a way to incorporate FieldCache into the segments (it is
>>>>>>> strictly in memory now), and build disk versions while indexing.
>>>>>>>
>>>>>>>
>>>>>>>      Hope I am making sense.
>>>>>>>
>>>>>>>     I did not send this out to the mailing list because I wasn't
>>>>>>> sure if this is a dev question or a user question; feel free to
>>>>>>> either forward it to the right mailing list or let me know and I
>>>>>>> can forward it.
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> -John
>>>>>>>




Re: Per-document Payloads (was: Re: lucene indexing and merge process)

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
On Saturday 20 October 2007, Michael Busch wrote:
> [... full proposal snipped; see the first message in this thread ...]
>
> OK, lots of thoughts, I'm sure I'll get lots of comments too ... ;)

Lots of thoughts, and that makes me think of LUCENE-662 ;)

Nicolas



Re: Per-document Payloads

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Oct 19, 2007, at 3:53 PM, Michael Busch wrote:

> The next question would be how to store the per-doc payloads (PDP). If
> all values have the same length (as the unique docIds), then we should
> store them as efficiently as possible, like the norms. However, we still
> want to offer the flexibility of having variable-length values. For this
> case we could use a new data structure similar to our posting list.
>
> PDPList               --> FixedLengthPDPList | <VariableLengthPDPList, SkipList>
> FixedLengthPDPList    --> <Payload>^SegSize
> VariableLengthPDPList --> <DocDelta, PayloadLength?, Payload>
> Payload               --> Byte^PayloadLength
> PayloadLength         --> VInt
> SkipList              --> see frq.file

There's another approach, which has the following advantages:

   * Simpler.
   * Pluggable.
   * More future proof.
   * More closely models IR Theory.
   * Easier for other implementations to deal with.
   * Breaks the tight binding between Lucene and its file format.

Start with a Posting base class.

   public class Posting {
     private int docNum;
     private int lastDocNum = 0;

     public int getDocNum() { return docNum; }

     public void read(IndexInput inStream) throws IOException {
       docNum += inStream.readVInt();   // doc nums are delta-encoded
     }

     public void write(IndexOutput outStream) throws IOException {
       outStream.writeVInt(docNum - lastDocNum);
       lastDocNum = docNum;             // remember this doc for the next delta
     }
   }

Then, PostingList (subclassed by SegPostingList and MultiPostingList,  
naturally).

   public abstract class PostingList {
      public abstract Posting getPosting();
      public abstract boolean next() throws IOException;
      public abstract boolean skipTo(int target) throws IOException;
   }

Each field gets its own "postings" file within the segment, named
_SEGNUM_FIELDNUM.p, where SEGNUM and FIELDNUM are encoded using base 36.
Each of these files is a solid stack of serialized Postings.

Posting subclasses like ScorePosting, PayloadPosting, etc, implement  
their own read() and write() methods.  Thus, Posting subclasses  
wholly define their own file format -- instead of the current,  
brittle design, where read/write code is dispersed over multiple  
classes.  If some Posting types become obsolete, they can be  
deprecated, but PostingList and its subclasses won't require the  
addition of crufty special case code to stay back-compatible.
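
To make the dispatch concrete, a payload-carrying subclass under this
scheme might look roughly like the following (a hypothetical sketch, not
taken from the working implementation mentioned below):

   import java.io.IOException;
   import org.apache.lucene.store.IndexInput;
   import org.apache.lucene.store.IndexOutput;

   // One possible Posting subclass: a doc delta followed by a
   // VInt-length-prefixed payload, fully defined by its own read/write.
   public class PayloadPosting extends Posting {
     private byte[] payload = new byte[0];

     public byte[] getPayload() { return payload; }

     public void read(IndexInput inStream) throws IOException {
       super.read(inStream);                   // doc delta
       int length = inStream.readVInt();
       payload = new byte[length];
       inStream.readBytes(payload, 0, length);
     }

     public void write(IndexOutput outStream) throws IOException {
       super.write(outStream);                 // doc delta
       outStream.writeVInt(payload.length);
       outStream.writeBytes(payload, payload.length);
     }
   }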

There's more (I've written a working implementation), but that's the  
gist.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Per-document Payloads

Posted by Ning Li <ni...@gmail.com>.
> That may be a little too seamless.  We want the user to have specific
> control over which fields are efficiently stored separately since they
> will know how that field will be used.

Maybe let users decide field families, like the column families in BigTable?



Re: Per-document Payloads

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
On Monday 29 October 2007, Michael McCandless wrote:
> "Michael Busch" <bu...@gmail.com> wrote:
> > Michael McCandless wrote:
> > > Michael, are you thinking that the storage would/could be non-sparse
> > > (like norms), and loaded/cached once in memory, especially for fixed
> > > size fields?  EG a big array of ints of length maxDocID?  In John's
> > > original case, every doc has this UID int field; I think this is
> > > fairly common.
> >
> > Yes I agree, this is a common use case. In my first mail in this thread
> > I suggested to have a flexible format. Non-sparse, like norms, in case
> > every document has one value and all values have the same fixed size.
> > Sparse and with a skip list if one or both conditions are false.
> >
> > The DocumentsWriter would have to check whether both conditions are
> > true, in  which case it would store the values non-sparse. The
> > SegmentMerger would only write the non-sparse format for the new segment
> > if all of the source segments also had the non-sparse format with the
> > same value size.
> >
> > This would provide the most flexibility for the users I think.
>
> OK, got it.  So in the case where I always put a field "UID" on every
> document, always a 4-byte binary field, then Lucene will "magically"
> store this as non-sparse column-stride field for every segment.
>
> But I still have to mark the Field as "column-stride storage" right?

It depends on how the API should look. Either Lucene supports every different
format, so you explicitly bind fields to a format, or you open up the API so
that it is the Lucene user who chooses how to store the data.

As said earlier in the thread, some work has already been done toward the
second option: LUCENE-662
 
Nicolas



Re: Per-document Payloads

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Michael Busch" <bu...@gmail.com> wrote:
> Michael McCandless wrote:
> > 
> > Michael, are you thinking that the storage would/could be non-sparse
> > (like norms), and loaded/cached once in memory, especially for fixed
> > size fields?  EG a big array of ints of length maxDocID?  In John's
> > original case, every doc has this UID int field; I think this is
> > fairly common.
> >
> 
> Yes I agree, this is a common use case. In my first mail in this thread
> I suggested to have a flexible format. Non-sparse, like norms, in case
> every document has one value and all values have the same fixed size.
> Sparse and with a skip list if one or both conditions are false.
> 
> The DocumentsWriter would have to check whether both conditions are
> true, in  which case it would store the values non-sparse. The
> SegmentMerger would only write the non-sparse format for the new segment
> if all of the source segments also had the non-sparse format with the
> same value size.
>
> This would provide the most flexibility for the users I think.

OK, got it.  So in the case where I always put a field "UID" on every
document, always a 4-byte binary field, then Lucene will "magically"
store this as non-sparse column-stride field for every segment.

But I still have to mark the Field as "column-stride storage" right?

Even if some docs do not have the field, it is still beneficial to
store it non-sparse up until a point.  EG the logic in
BitVector.isSparse() is doing a similar calculation.  This is only
possible when the field, when set on the document, is always the same
length in bytes.

Maybe we should also allow users to explicitly state that they wish
for this field to be stored in this way (sparse or non-sparse) rather
than having Lucene choose?

New question: how would we handle a "boolean" type column-stride
stored field?  It seems like we should always use BitVector since it
already handles the sparse/non-sparse storage decision "under the
hood"?

> > I think many apps have no trouble loading the array-of-ints entirely
> > into RAM, either because there are not that many docs or because
> > throwing RAM at the problem is fine (eg on a 64-bit JVM).
> > 
> > From John's tests, the "load int[] directly from disk" took 186 msec
> > vs the payload approach (using today's payloads API) took 430 msec.
> > 
> > This is a sizable performance difference (2.3 X faster) and for
> > interactive indexing apps, where minimizing cost of re-opening readers
> > is critical, this is significant.  Especially combining this with the
> > ideas from LUCENE-831 (incrementally updating the FieldCache; maybe
> > distributing the FieldCache down into sub-readers) should make
> > re-opening + re-warming much faster than today.
> > 
> 
> Yes definitely. I was planning to add a FieldCache implementation that
> uses these per-doc payloads - it's one of the most obvious use-cases.
> However, I think providing an iterator in addition, like TermDocs, makes
> sense too. People might have very big indexes, store longer values than
> 4 Bytes Ints, or use more than one per-doc payload. In some tests I
> found  out that the performance is still often acceptable, even if the
> values are not cached. (It's like having one AND-term more in the query,
> as one more "posting list" has to be processed).
>
> > If so, wouldn't this API just fit under FieldCache?  Ie "getInts(...)"
> > would look at FieldInfo, determine that this field is stored
> > column-stride, and load it as one big int array?
> > 
> 
> So I think a TermDocs-like iterator plus a new FieldCache implementation
> would make sense?

OK, I agree, we should have an iterator API as well so that you can
process this posting list "document at a time" just like all other
terms in the query.

> We could further make these fields updateable, like norms?

Agreed, though how would the API work (if indeed we are just adding
"column-stride[-non]-sparse" options to Field)?  Because if the Field
is also indexed, we can't update that.  I think I can see why you
wanted to make a new API here :)

Mike



Re: Per-document Payloads

Posted by Michael Busch <bu...@gmail.com>.
Michael McCandless wrote:
> 
> Michael, are you thinking that the storage would/could be non-sparse
> (like norms), and loaded/cached once in memory, especially for fixed
> size fields?  EG a big array of ints of length maxDocID?  In John's
> original case, every doc has this UID int field; I think this is
> fairly common.
>

Yes I agree, this is a common use case. In my first mail in this thread
I suggested to have a flexible format. Non-sparse, like norms, in case
every document has one value and all values have the same fixed size.
Sparse and with a skip list if one or both conditions are false.

The DocumentsWriter would have to check whether both conditions are
true, in  which case it would store the values non-sparse. The
SegmentMerger would only write the non-sparse format for the new segment
if all of the source segments also had the non-sparse format with the
same value size.

This would provide the most flexibility for the users I think.

> I think many apps have no trouble loading the array-of-ints entirely
> into RAM, either because there are not that many docs or because
> throwing RAM at the problem is fine (eg on a 64-bit JVM).
> 
> From John's tests, the "load int[] directly from disk" took 186 msec
> vs the payload approach (using today's payloads API) took 430 msec.
> 
> This is a sizable performance difference (2.3 X faster) and for
> interactive indexing apps, where minimizing cost of re-opening readers
> is critical, this is significant.  Especially combining this with the
> ideas from LUCENE-831 (incrementally updating the FieldCache; maybe
> distributing the FieldCache down into sub-readers) should make
> re-opening + re-warming much faster than today.
> 

Yes definitely. I was planning to add a FieldCache implementation that
uses these per-doc payloads - it's one of the most obvious use-cases.
However, I think providing an iterator in addition, like TermDocs, makes
sense too. People might have very big indexes, store values longer than
4-byte ints, or use more than one per-doc payload. In some tests I found
that the performance is often still acceptable, even if the values are
not cached. (It's like having one more AND term in the query, as one more
"posting list" has to be processed.)

> If so, wouldn't this API just fit under FieldCache?  Ie "getInts(...)"
> would look at FieldInfo, determine that this field is stored
> column-stride, and load it as one big int array?
> 

So I think a TermDocs-like iterator plus a new FieldCache implementation
would make sense?

We could further make these fields updateable, like norms?

-Michael



Re: Per-document Payloads

Posted by Michael McCandless <lu...@mikemccandless.com>.
> Michael Busch wrote:
>
> > Doug Cutting wrote:
> > 
> > If this is really required, perhaps it ought to appear as an
> > attribute for stored fields, indicating that the field should be
> > stored in a separate "column store".  This would permit efficient
> > enumeration of values of just that field.
> 
> Yes I was thinking about this too. I'm just not sure if this is
> confusing for the users, because it will be conceptually different
> how to retrieve "normal" stored fields vs. "column-stored"
> fields. The former via getDocument() (multiple field values at a
> time), but the latter via an Iterator similar to TermDocs (one value
> at a time).  Do you think this would be confusing? Or do you have
> other ideas for the retrieval API?

Michael, are you thinking that the storage would/could be non-sparse
(like norms), and loaded/cached once in memory, especially for fixed
size fields?  EG a big array of ints of length maxDocID?  In John's
original case, every doc has this UID int field; I think this is
fairly common.

I think many apps have no trouble loading the array-of-ints entirely
into RAM, either because there are not that many docs or because
throwing RAM at the problem is fine (eg on a 64-bit JVM).

From John's tests, the "load int[] directly from disk" took 186 msec
vs the payload approach (using today's payloads API) took 430 msec.

This is a sizable performance difference (2.3 X faster) and for
interactive indexing apps, where minimizing cost of re-opening readers
is critical, this is significant.  Especially combining this with the
ideas from LUCENE-831 (incrementally updating the FieldCache; maybe
distributing the FieldCache down into sub-readers) should make
re-opening + re-warming much faster than today.

If so, wouldn't this API just fit under FieldCache?  Ie "getInts(...)"
would look at FieldInfo, determine that this field is stored
column-stride, and load it as one big int array?

Mike



Re: Per-document Payloads

Posted by Michael Busch <bu...@gmail.com>.
Doug Cutting wrote:
> 
> If this is really required, perhaps it ought to appear as an attribute
> for stored fields, indicating that the field should be stored in a
> separate "column store".  This would permit efficient enumeration of
> values of just that field.
> 

Yes I was thinking about this too. I'm just not sure if this is
confusing for the users, because it will be conceptually different how
to retrieve "normal" stored fields vs. "column-stored" fields. The
former via getDocument() (multiple field values at a time), but the
latter via an Iterator similar to TermDocs (one value at a time).
Do you think this would be confusing? Or do you have other ideas for the
retrieval API?

-Michael



Re: Per-document Payloads

Posted by Doug Cutting <cu...@apache.org>.
Michael Busch wrote:
> If you store unique docIds, then there are no two documents that share
> the same value. That means, that each value gets its own entry in the
> dictionary and to load each value it is necessary to perform two random
> I/O seeks (one for term lookup + one to open the posting list).

Except they shouldn't be random seeks, but rather sequential accesses, 
since the term list is accessed in order, and the postings are processed 
in order, no?  It would be interesting to profile this.  We should make 
sure that we've well-optimized the case where we seek for the next term, 
seek to the current position in a file, etc.  Profiling should show if 
we've missed obvious optimizations for this case.

> I was therefore thinking about adding per-document payloads to Lucene

If this is really required, perhaps it ought to appear as an attribute 
for stored fields, indicating that the field should be stored in a 
separate "column store".  This would permit efficient enumeration of 
values of just that field.

Doug



Re: Per-document Payloads

Posted by John Wang <jo...@gmail.com>.
Hi Michael:
    After removing isDeleted(), the index loads in 430 ms.

Thanks

-john

On 10/21/07, Michael Busch <bu...@gmail.com> wrote:
>
> John Wang wrote:
>
> >
> > Since all three methods loads docids into an int[], the lookup time is
> the
> > same for all three methods, what's
> > different are the load times:
> >
> > 1) 16.5 seconds,      43 MB
> > 2) 590 milliseconds     32.5 MB
> > 3) 186 milliseconds  26MB
>
> Good analysis! Thanks for sharing the results...
>
> >
> > I think the payload method is good enough so we don't need to diverge
> from
> > the lucene code base.
>
> Actually, I noticed that in my program in getCachedIDs() you can remove
> the check
>   if (!reader.isDeleted(tp.doc())) {
>
> This should improve the performance further (not sure how much though),
> because the synchronized isDeleted() call is quite expensive and not
> necessary.
>
> If you want to reduce the index size, you might want to try to encode
> the Integers more efficiently, e. g. as VInts (depending on the values
> of your UIDs).
>
> > However, I feel that being able to customize the
> > indexing process and store our own file is still more efficient both in
> load
> > time and index size.
> >
>
> Yes, the current payload implementation is not optimized for this use
> case, it can be improved with a per-doc approach like the one I suggested.
>
> -Michael
>
>
> > Thanks
> >
> > -John
> >

Re: Per-document Payloads (was: Re: lucene indexing and merge process)

Posted by Grant Ingersoll <gs...@apache.org>.
https://issues.apache.org/jira/browse/LUCENE-510 is related, then, I presume.

On Oct 20, 2007, at 11:09 AM, Yonik Seeley wrote:

> On 10/20/07, Yonik Seeley <yo...@apache.org> wrote:
>> What about switching from char
>> counts to byte counts for indexed (String) fields that are stored
>> separately?
>
> In fact, what about switching to byte counts for all stored fields?
> It should be much easier than the full-blown byte-counts for the term
> index since it only involves stored fields.  It should make skipping
> fields (lazy field loading) much faster too.
>
> -Yonik
>





Re: Per-document Payloads (was: Re: lucene indexing and merge process)

Posted by Yonik Seeley <yo...@apache.org>.
On 10/20/07, Yonik Seeley <yo...@apache.org> wrote:
> What about switching from char
> counts to byte counts for indexed (String) fields that are stored
> separately?

In fact, what about switching to byte counts for all stored fields?
It should be much easier than the full-blown byte-counts for the term
index since it only involves stored fields.  It should make skipping
fields (lazy field loading) much faster too.

-Yonik



Re: Per-document Payloads (was: Re: lucene indexing and merge process)

Posted by Yonik Seeley <yo...@apache.org>.
On 10/20/07, Grant Ingersoll <gs...@apache.org> wrote:
> I would think the typical use case would be you want all the
> "small" fields to be returned w/ the document and the large fields to
> be lazily loaded.  I think it should be seamless to the user.

That may be a little too seamless.  We want the user to have specific
control over which fields are efficiently stored separately since they
will know how that field will be used.

-Yonik



Re: Per-document Payloads (was: Re: lucene indexing and merge process)

Posted by Grant Ingersoll <gs...@apache.org>.
On Oct 20, 2007, at 10:51 AM, Yonik Seeley wrote:

> On 10/20/07, Grant Ingersoll <gs...@apache.org> wrote:
>> I think one of the questions that will come up from users is when
>> should I use addMetadata and when should I use addField?  Why make
>> the distinction to the user?  Fields have always represented
>> metadata, all your doing is optimizing the internal storage of them.
>> So from an interface side of things, I would just make it a new Field
>> type.
>
> Same thing occured to me...
> Fieldable.isStoredSeparately()?
>
> I wouldn't mind this byte[] access to any type of field stored
> separately (non binary fields too).  What about switching from char
> counts to byte counts for indexed (String) fields that are stored
> separately?
>
> I guess fields that were stored separately would not be returned
> unless asked for by name?

Right, I would think the typical use case would be that you want all the
"small" fields to be returned w/ the document and the large fields to
be lazily loaded.  I think it should be seamless to the user.
Perhaps we could have a threshold value upon indexing, such that all
fields below it are determined to be small and all above it are large;
then at retrieval time we just compare the byte count to the
threshold and lazy load the large fields.

Just a thought.  There are probably several ways this could be handled.

-Grant



Re: Per-document Payloads (was: Re: lucene indexing and merge process)

Posted by Yonik Seeley <yo...@apache.org>.
On 10/20/07, Grant Ingersoll <gs...@apache.org> wrote:
> I think one of the questions that will come up from users is when
> should I use addMetadata and when should I use addField?  Why make
> the distinction to the user?  Fields have always represented
> metadata, all your doing is optimizing the internal storage of them.
> So from an interface side of things, I would just make it a new Field
> type.

Same thing occurred to me...
Fieldable.isStoredSeparately()?

I wouldn't mind this byte[] access to any type of field stored
separately (non binary fields too).  What about switching from char
counts to byte counts for indexed (String) fields that are stored
separately?

I guess fields that were stored separately would not be returned
unless asked for by name?

-Yonik



Re: Per-document Payloads

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Oct 20, 2007, at 12:49 PM, Michael Busch wrote:

> In fact, what I'm proposing is a new kind of posting list.

http://www.rectangular.com/pipermail/kinosearch/2007-July/001096.html

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Per-document Payloads

Posted by Michael Busch <bu...@gmail.com>.
Grant Ingersoll wrote:
> 
> Some randomly pieced together thoughts (I may not even be fully awake
> yet :-)  so feel free to tell me I'm not understanding this correctly)
> 
> My first thought was how is this different from just having a binary
> field, but if I understand correctly it is to be stored in a separate file?
> 
> Now you are proposing a faster storage mechanism for them, essentially,
> since they are to be stored separately from the Documents themselves?  
> But the other key is they are all stored next to each other, right, so
> the scan is a lot faster?
> 

Yes, scanning and skipping would be much faster, comparable to a posting
list. In fact, what I'm proposing is a new kind of posting list. Since
you mentioned the magic term "flexible indexing" already ;), let's take
a look at http://wiki.apache.org/lucene-java/FlexibleIndexing. Here 4
kinds of posting lists are proposed:

a. <doc>+

b. <doc, boost>+

c. <doc, freq, <position>+ >+

d. <doc, freq, <position, boost>+ >+

Today, we have c. and d. already. c. is the original Lucene format, and
d. can be achieved by storing the boost as a payload.

The new format I'm proposing actually covers a. and b. If you don't
store a payload it's basically a binary posting list without freq and
positions (a.). If you store the boost as a payload, then you have b.
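
For example, b. could be realized by attaching a per-document boost as a
1-byte payload, reusing the norm encoding (a sketch under that assumption,
not existing Lucene code):

import org.apache.lucene.analysis.Token;
import org.apache.lucene.index.Payload;
import org.apache.lucene.search.Similarity;

// The posting list of this artificial term then effectively stores
// <doc, boost>+ : one posting per document, with the boost in the payload.
Token boostToken(float boost) {
  Token t = new Token("_BOOST", 0, 0);
  t.setPayload(new Payload(new byte[] { Similarity.encodeNorm(boost) }));
  return t;
}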


> I think one of the questions that will come up from users is when should
> I use addMetadata and when should I use addField?  Why make the
> distinction to the user?  Fields have always represented metadata, all

I'd like to make a distinction because IMO these are two different use
cases. Not necessarily in terms of functionality, but in terms of
performance. You are right, you can store everything today as stored
fields, but if you want to use e. g. a stored value for scoring, then
performance is terrible. This is simply the nature of the store - it is
optimized for returning all stored fields for a document. Even a
FieldSelector doesn't help you too much, unless the docs contain very
big fields that you don't want to return. The reason is that two random
I/Os are necessary to find the stored fields of a document. Then only
sequential I/O has to be performed. And the overhead of loading e. g.
10KB instead of 2KB is not big, much less than two random I/Os, I believe.

Payloads are also much better in terms of cache utilization. Since they
are stored next to each other, and if accessed frequently (in every
search), then it's very likely that big portions of that posting list
will be in the cache.

So the answer to the question when to use a stored field and when to use
a payload should be: use payloads when you access the data during query
evaluation/scoring, use stored fields when you need the data to
construct a search result from a hit.

> fields, right?  Perhaps in this way, if users were willing to commit to
> fixed length fields for the first level, we could also make field
> updating of these types of fields possible w/o having to reindex?????
> 

Yes I was thinking the same. Just like norms.





Re: Per-document Payloads (was: Re: lucene indexing and merge process)

Posted by Grant Ingersoll <gs...@apache.org>.
On Oct 19, 2007, at 6:53 PM, Michael Busch wrote:

> [... Michael Busch's full proposal snipped; see the first message in
> this thread ...]

Some randomly pieced together thoughts (I may not even be fully awake  
yet :-)  so feel free to tell me I'm not understanding this correctly)

My first thought was how is this different from just having a binary  
field, but if I understand correctly it is to be stored in a separate  
file?

Now you are proposing a faster storage mechanism for them,  
essentially, since they are to be stored separately from the  
Documents themselves?   But the other key is they are all stored next  
to each other, right, so the scan is a lot faster?

I think one of the questions that will come up from users is when
should I use addMetadata and when should I use addField?  Why make
the distinction to the user?  Fields have always represented
metadata; all you're doing is optimizing the internal storage of them.
So from an interface side of things, I would just make it a new Field
type.  Essentially what we are doing is creating a two-level document
store, right?  The first level contains all of the small metadata that is
likely to be accessed on every hit, the second level contains all of the
non-essential fields, right?  Perhaps in this way, if users were
willing to commit to fixed-length fields for the first level, we
could also make field updating of these types of fields possible w/o
having to reindex?????

Btw, I've thought ever since we added payloads that we should find a  
way to hook in scoring on the binary fields and I would presume  
people would eventually want scoring of metadata too, just like the  
FunctionQuery stuff does.

And yes, to Nicolas' point, it starts to sound like flexible
indexing.  :-)  Which I still would like to get to sometime in my  
lifetime...


Cheers,
Grant



Re: Per-document Payloads

Posted by Michael Busch <bu...@gmail.com>.
John Wang wrote:

> 
> Since all three methods loads docids into an int[], the lookup time is the
> same for all three methods, what's
> different are the load times:
> 
> 1) 16.5 seconds,      43 MB
> 2) 590 milliseconds     32.5 MB
> 3) 186 milliseconds  26MB

Good analysis! Thanks for sharing the results...

> 
> I think the payload method is good enough so we don't need to diverge from
> the lucene code base. 

Actually, I noticed that in my program in getCachedIDs() you can remove
the check
  if (!reader.isDeleted(tp.doc())) {

This should improve the performance further (not sure how much though),
because the synchronized isDeleted() call is quite expensive and not
necessary.

If you want to reduce the index size, you might want to try to encode
the Integers more efficiently, e. g. as VInts (depending on the values
of your UIDs).
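
For example, a VInt encoding of a non-negative UID into the payload bytes
could look like this (a sketch mirroring Lucene's VInt layout):

// 7 bits per byte, high bit set on all but the last byte (same layout as
// IndexOutput.writeVInt), so small UIDs take 1-2 bytes instead of 4.
static byte[] toVIntBytes(int value) {
  byte[] tmp = new byte[5];
  int i = 0;
  while ((value & ~0x7F) != 0) {
    tmp[i++] = (byte) ((value & 0x7F) | 0x80);
    value >>>= 7;
  }
  tmp[i++] = (byte) value;
  byte[] out = new byte[i];
  System.arraycopy(tmp, 0, out, 0, i);
  return out;
}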

> However, I feel that being able to customize the
> indexing process and store our own file is still more efficient both in load
> time and index size.
> 

Yes, the current payload implementation is not optimized for this use
case; it can be improved with a per-doc approach like the one I suggested.

-Michael


> Thanks
> 
> -John
> 




Re: Per-document Payloads

Posted by John Wang <jo...@gmail.com>.
Hi Michael:

    I took your program and benchmarked it against my setup; here are some
numbers comparing it to the other options:

Setup: 2M docs with only the id, indexed in various ways for each method.

Randomly selected 50000 of the docids and did a lookup for each.

Comparing 3 methods:

1) load int[] from field cache
    new Field(fldname, uid, Field.Store.NO, Field.Index.TOKENIZED);    Lucene 2.2

2) using payload (your impl)            Lucene 2.2

3) using our custom-made Lucene, described in my earlier email:
   no fields added, just doc.setUserID(uid);  // since we are adding our own
   segment file (custom Lucene 2.0)

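For anyone following along, loading 1) and 2) looks roughly like this
(the field name "uid" is just a placeholder, and 2) uses Michael's
PerDocPayloadReaderWriter class from earlier in this thread):

// 1) field cache
int[] uids = FieldCache.DEFAULT.getInts(reader, "uid");

// 2) payload-based cache
PerDocPayloadReaderWriter pdp = new PerDocPayloadReaderWriter();
int[] uids2 = pdp.getCachedIds(reader, null, 0);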

Since all three methods load the docids into an int[], the lookup time is
the same for all three; what differs is the load time and the index size:

1) 16.5 seconds       43 MB
2) 590 milliseconds   32.5 MB
3) 186 milliseconds   26 MB

I think the payload method is good enough that we don't need to diverge from
the Lucene code base. However, I feel that being able to customize the
indexing process and store our own file is still more efficient in both load
time and index size.

Thanks

-John


On 10/20/07, Michael Busch <bu...@gmail.com> wrote:
>
> John Wang wrote:
> > Hi Michael:
> >      Thanks for the info.
> >
> >      I haven't played with payloads. Can you give me an example or point
> me
> > to how it is used to solve this problem?
> >
>
> Hi John,
>
> I (quickly) put together a class that is able to store UIDs as payloads.
> I believe the type of your UIDs is Integer?
>
> To add the UID to a document use this method:
>   /**
>    *  Adds a unique docId (UID) to a document as a payload.
>    */
>   public void addUID(Document doc, int uid);
>
> You can either load the UID from the disk or use a cache:
>
> /** Returns the UID for the passed-in docId.
>   *  If you use this method in a Hitcollector and your Query
>   *  contains OR-terms, then try to set
>   *  BooleanQuery.setAllowDocsOutOfOrder(false) to improve performance.
>   */
>   public int getUID(IndexReader reader, int docId) throws IOException;
>
>   /** Fills the passed-in array with UID-values. If the given array is
>    *  null or too small, then a new array is created with
>    *  cache.length = reader.maxDoc()
>    */
>   public int[] getCachedIds(IndexReader reader, int[] cache, int offset)
>    throws IOException;
>
> I put a little test program in main and it seems to work fine. However,
> it's not thoroughly tested yet...
>
> You might want to try it without using the cache first. The performance
> might be good enough for your needs. If not, try using the cache, it
> should fill up much faster compared to the FieldCache.
>
> Another comment: If you're using Lucene 2.2, you need to replace the
> lines "tp.seek(UID_TERM);" (see comments in the code below). Or use the
> latest trunk version, it has a fix for this bug.
>
> Let me know please if this improves your performance! Have fun...
> - Michael
>
> And here is the code:
>
> import java.io.IOException;
>
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.Field.Index;
> import org.apache.lucene.document.Field.Store;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.Payload;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.TermPositions;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
>
> public class PerDocPayloadReaderWriter {
>   public static final Term UID_TERM = new Term("_ID", "_UID");
>   private SinglePayloadTokenStream singlePayloadTokenStream = new
> SinglePayloadTokenStream();
>   private TermPositions tp;
>   private byte[] payloadBuffer = new byte[4];
>
>   public static void main(String args[]) throws IOException {
>     PerDocPayloadReaderWriter pdp = new PerDocPayloadReaderWriter();
>     Directory dir = new RAMDirectory();
>     IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer());
>     for (int i = 0; i < 10; i++) {
>       Document d = new Document();
>       // create dummy doc
>       d.add(new Field("test", "This is a test.", Store.NO,
> Index.TOKENIZED));
>       pdp.addUID(d, 100 + i);
>
>       writer.addDocument(d);
>     }
>     writer.close();
>
>     IndexReader reader = IndexReader.open(dir);
>     int[] uids = pdp.getCachedIds(reader, null, 0);
>
>     System.out.println("Caching:");
>     System.out.println("ID --> UID");
>
>     for (int i = 0; i < uids.length; i++) {
>       System.out.println(i + "  --> " + uids[i]);
>     }
>
>     System.out.println("\nDirect access:");
>     System.out.println("ID --> UID");
>
>     for (int i = 0; i < uids.length; i++) {
>       System.out.println(i + "  --> " + pdp.getUID(reader, i));
>     }
>     reader.close();
>   }
>
>
>   /** Fills the passed-in array with UID-values. If the given array is
> null or too small, then
>    * a new array is created with cache.length = reader.maxDoc()
>    */
>   public int[] getCachedIds(IndexReader reader, int[] cache, int offset)
> throws IOException {
>     int maxDoc = reader.maxDoc();
>     if (cache == null || cache.length - offset < maxDoc) {
>       cache = new int[maxDoc];
>       offset = 0;
>     }
>
>     if (tp == null) {
>       tp = reader.termPositions(UID_TERM);
>     } else {
>       // if using Lucene 2.2 replace the following line with
>       // tp = reader.termPositions(UID_TERM);
>       tp.seek(UID_TERM);
>     }
>
>     while (tp.next()) {
>       assert tp.doc() < maxDoc;
>       if (!reader.isDeleted(tp.doc())) {
>         tp.nextPosition();
>         tp.getPayload(payloadBuffer, 0);
>         cache[tp.doc() + offset] = bytesToInt(payloadBuffer);
>       }
>     }
>
>     return cache;
>   }
>
>   /** Returns the UID for the passed-in docId.
>    *  If you use this method in a Hitcollector and your Query contains
> OR-terms,
>    *  then try to set BooleanQuery.setAllowDocsOutOfOrder(false) to
> improve performance.
>    */
>   public int getUID(IndexReader reader, int docId) throws IOException {
>     if (tp == null) {
>       tp = reader.termPositions(UID_TERM);
>     } else if (tp.doc()> docId) {
>       // if using Lucene 2.2 replace the following line with
>       // tp = reader.termPositions(UID_TERM);
>       tp.seek(UID_TERM);
>     }
>
>     tp.skipTo(docId);
>     tp.nextPosition();
>     tp.getPayload(payloadBuffer, 0);
>
>     return bytesToInt(payloadBuffer);
>   }
>
>   /**
>    * Adds a unique docId (UID) to a document as a payload.
>    */
>   public void addUID(Document doc, int uid) {
>     singlePayloadTokenStream.setUID(uid);
>     doc.add(new Field(UID_TERM.field(), singlePayloadTokenStream));
>   }
>
>   private int bytesToInt(byte[] bytes) {
>     return ((bytes[3] & 0xFF) << 24) | ((bytes[2] & 0xFF) << 16)
>     | ((bytes[1] & 0xFF) <<  8) |  (bytes[0] & 0xFF);
>
>   }
>
>   private static class SinglePayloadTokenStream extends TokenStream {
>     private Token token = new Token(UID_TERM.text(), 0, 0);
>     private byte[] buffer = new byte[4];
>     private boolean returnToken = false;
>
>     void setUID(int uid) {
>       buffer[0] = (byte) (uid);
>       buffer[1] = (byte) (uid >> 8);
>       buffer[2] = (byte) (uid >> 16);
>       buffer[3] = (byte) (uid >> 24);
>       token.setPayload(new Payload(buffer));
>       returnToken = true;
>     }
>
>     public Token next() throws IOException {
>       if (returnToken) {
>         returnToken = false;
>         return token;
>       } else {
>         return null;
>       }
>     }
>   }
> }
>
>
>
> > Thanks
> >
> > -John
> >
> > On 10/19/07, Michael Busch <bu...@gmail.com> wrote:
> >> John Wang wrote:
> >>>      I can tried to get some numbers for leading an int[] array vs
> >>> FieldCache.getInts().
> >> I've had a similar performance problem when I used the FieldCache. The
> >> loading performance is apparently so slow, because each value is stored
> >> as a term in the dictionary. For loading the cache it is necessary to
> >> iterate over all terms for the field in the dictionary. And for each
> >> term it's posting list is opened to check which documents have that
> value.
> >>
> >> If you store unique docIds, then there are no two documents that share
> >> the same value. That means, that each value gets its own entry in the
> >> dictionary and to load each value it is necessary to perform two random
> >> I/O seeks (one for term lookup + one to open the posting list).
> >>
> >> In my app it took for a big index several minutes to fill the cache
> like
> >> that.
> >>
> >> To speed things up I did essentially what Ning suggested. Now I store
> >> the values as payloads in the posting list of an artificial term. To
> >> fill my cache it's only necessary to perform a couple of I/O seeks for
> >> opening the posting list of the specific term, then it is just a
> >> sequential scan to load all values. With this approach the time for
> >> filling the cache went down from minutes to seconds!
> >>
> >> Now this approach is already much better than the current field cache
> >> implementation, but it still can be improved. In fact, we already have
> a
> >> mechanism for doing that: the norms. Norms are stored with a fixed
> size,
> >> which means both random access and sequential scan are optimal. Norms
> >> are also cached in memory, and filling that cache is much faster
> >> compared to the current FieldCache approach.
> >>
> >> I was therefore thinking about adding per-document payloads to Lucene
> >> (we can also call it document-metadata). The API could look like this:
> >>
> >> Document d = new Document();
> >> byte[] uidValue = ...
> >> d.addMetadata("uid", uidValue);
> >>
> >> And on the retrieval side all values could either be loaded into the
> >> field cache, or, if the index is too big, a new API can be used:
> >>
> >> IndexReader reader = IndexReader.open(...);
> >> DocumentMetadataIterator it = reader.metadataIterator("uid");
> >>
> >> where DocumentMetadataIterator is an interface similar to TermDocs:
> >>
> >> interface DocumentMetadataIterator {
> >>   void seek(String name);
> >>   boolean next();
> >>   boolean skipTo(int doc);
> >>
> >>   int doc();
> >>   byte[] getMetadata();
> >> }
> >>
> >> The next question would be how to store the per-doc payloads (PDP). If
> >> all values have the same length (as the unique docIds), then we should
> >> store them as efficiently as possible, like the norms. However, we
> still
> >> want to offer the flexibility of having variable-length values. For
> this
> >> case we could use a new data structure similar to our posting list.
> >>
> >> PDPList               --> FixedLengthPDPList | <VariableLengthPDPList,
> >> SkipList>
> >> FixedLengthPDPList    --> <Payload>^SegSize
> >> VariableLengthPDPList --> <DocDelta, PayloadLength?, Payload>
> >> Payload               --> Byte^PayloadLength
> >> PayloadLength         --> VInt
> >> SkipList              --> see frq.file
> >>
> >> Because we don't have global field semantics Lucene should
> automatically
> >> pick the "right" data structure. This could work like this: When the
> >> DocumentsWriter writes a segment it checks whether all values of a PDP
> >> have the same length. If yes, it stores them as FixedLengthPDPList, if
> >> not, then as VariableLengthPDPList.
> >> When the SegmentMerger merges two or more segments it checks if all
> >> segments have a FixedLengthPDPList with the same length for a PDP. If
> >> not, it writes a VariableLengthPDPList to the new segment.
> >>
> >> I think this would be a nice new feature for Lucene. We could then have
> >> user-defined and Lucene-specific PDPs. For example, norms would be in
> >> the latter category (this way we would get rid of the special code for
> >> norms, as they could be handled as PDPs). It would also be easy to add
> >> new features in the future, like splitting the norms into two values: a
> >> norm and a boost value.
> >>
> >> OK lot's of thoughts, I'm sure I'll get lot's of comments too ... ;)
> >>
> >> - Michael
> >>
> >>> Thanks
> >>>
> >>> -John
> >>>
> >>> On 10/19/07, Michael McCandless <lu...@mikemccandless.com> wrote:
> >>>> It seems like there are (at least) two angles here for getting better
> >>>> performance from FieldCache:
> >>>>
> >>>>   1) Be incremental: with reopen() we should only have to update a
> >>>>      subset of the array in the FieldCache, according to the changed
> >>>>      segments.  This is what Hoss is working on and Mark was
> referring
> >>>>      to and I think it's very important!
> >>>>
> >>>>   2) Parsing is slow (?): I'm guessing one of the reasons that John
> >>>>      added the _X.udt file was because it's much faster to load an
> >>>>      array of already-parsed ints than to ask FieldCache to populate
> >>>>      itself.
> >>>>
> >>>> Even if we do #1, I think #2 could be a big win (in addition)?  John
> >>>> do you have any numbers of how much faster it is to load the array of
> >>>> ints from the _X.udt file vs having FieldCache populate itself?
> >>>>
> >>>> Also on the original question of "can we open up SegmentReader,
> >>>> FieldsWriter, etc.", I think that's a good idea?  At least we can
> make
> >>>> things protected instead of private/final?
> >>>>
> >>>> Mike
> >>>>
> >>>> "Ning Li" <ni...@gmail.com> wrote:
> >>>>> I see what you mean by 2) now. What Mark said should work for you
> when
> >>>>> it's done.
> >>>>>
> >>>>> Cheers,
> >>>>> Ning
> >>>>>
> >>>>> On 10/18/07, John Wang <jo...@gmail.com> wrote:
> >>>>>> Hi Ning:
> >>>>>>     That is essentially what field cache does. Doing this for each
> >>>> docid in
> >>>>>> the result set will be slow if the result set is large. But loading
> >> it
> >>>> in
> >>>>>> memory when opening index can also be slow if the index is large
> and
> >>>> updates
> >>>>>> often.
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>> -John
> >>>>>>
> >>>>>> On 10/18/07, Ning Li <ni...@gmail.com> wrote:
> >>>>>>> Make all documents have a term, say "ID:UID", and for each
> document,
> >>>>>>> store its UID in the term's payload. You can read off this posting
> >>>>>>> list to create your array. Will this work for you, John?
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Ning
> >>>>>>>
> >>>>>>>
> >>>>>>> On 10/18/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> >>>>>>>> Forwarding this to java-dev per request.  Seems like the best
> >>>> place
> >>>>>>>> to discuss this topic.
> >>>>>>>>
> >>>>>>>>         Erik
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Begin forwarded message:
> >>>>>>>>
> >>>>>>>>> From: "John Wang" <jo...@gmail.com>
> >>>>>>>>> Date: October 17, 2007 5:43:29 PM EDT
> >>>>>>>>> To: erik@ehatchersolutions.com
> >>>>>>>>> Subject: lucene indexing and merge process
> >>>>>>>>>
> >>>>>>>>> Hi Erik:
> >>>>>>>>>
> >>>>>>>>>     We are revamping our search system here at LinekdIn. And we
> >>>> are
> >>>>>>>>> using Lucene.
> >>>>>>>>>
> >>>>>>>>>     One issue we ran across is that we store an UID in Lucene
> >>>> which
> >>>>>>>>> we map to the DB storage. So given a docid, to lookup its UID,
> >>>> we
> >>>>>>>>> have the following solutions:
> >>>>>>>>>
> >>>>>>>>> 1) Index it as a Stored field and get it from reader.document
> (very
> >>>>>>>>> slow if recall is large)
> >>>>>>>>> 2) Load/Warmup the FieldCache (for large corpus, loading up the
> >>>>>>>>> indexreader can be slow)
> >>>>>>>>> 3) construct it using the FieldCache and persist it on disk
> >>>>>>>>> everytime the index changes. (not suitable for real time
> >>>> indexing,
> >>>>>>>>> e.g. this process will degrade as # of documents get large)
> >>>>>>>>>
> >>>>>>>>>     None of the above solutions turn out to be adequate for our
> >>>>>>>>> requirements.
> >>>>>>>>>
> >>>>>>>>>      What we end up doing is to modify Lucene code by changing
> >>>>>>>>> SegmentReader,DocumentWriter,and FieldWriter classes by taking
> >>>>>>>>> advantage of the Lucene Segment/merge process. E.g:
> >>>>>>>>>
> >>>>>>>>>      For each segment, we store a .udt file, which is an int[]
> >>>>>>>>> array, (by changing the FieldWriter class)
> >>>>>>>>>
> >>>>>>>>>      And SegmentReader will load the .udt file into an array.
> >>>>>>>>>
> >>>>>>>>>      And merge happens seemlessly.
> >>>>>>>>>
> >>>>>>>>>      Because the tight encapsulation around these classes, e.g.
> >>>>>>>>> private and final methods, it is very difficult to extend Lucene
> >>>>>>>>> while avoiding branch into our own version. Is there a way we
> >>>> can
> >>>>>>>>> open up and make these classes extensible? We'd be happy to
> >>>>>>>>> contribute what we have done.
> >>>>>>>>>
> >>>>>>>>>      I guess to tackle the problem from a different angle: is
> >>>> there
> >>>>>>>>> a way to incorporate FieldCache into the segments (it is
> >>>> strictly
> >>>>>>>>> in memory now), and build disk versions while indexing.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>      Hope I am making sense.
> >>>>>>>>>
> >>>>>>>>>     I did not send this out to the mailing list because I wasn't
> >>>>>>>>> sure if this is a dev question or an user question, feel free to
> >>>>>>>>> either forward it to the right mailing list or let me know and I
> >>>>>>>>> can forward it.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>>
> >>>>>>>>> -John
> >>>>>>>>>
> >>>>>>>>
> >>>> ---------------------------------------------------------------------
> >>>>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>>>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>>>>>>
> >>>>>>>>
> >>>> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>>>>>
> >>>>>>>
> >>>>>
> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>>
> >>>>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >>
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: Per-document Payloads

Posted by Michael Busch <bu...@gmail.com>.
John Wang wrote:
> Hi Michael:
>      Thanks for the info.
> 
>      I haven't played with payloads. Can you give me an example or point me
> to how it is used to solve this problem?
> 

Hi John,

I (quickly) put together a class that is able to store UIDs as payloads.
I believe the type of your UIDs is Integer?

To add the UID to a document use this method:
  /**
   *  Adds a unique docId (UID) to a document as a payload.
   */
  public void addUID(Document doc, int uid);

You can either load the UID from the disk or use a cache:

 /** Returns the UID for the passed-in docId.
  *  If you use this method in a HitCollector and your Query
  *  contains OR-terms, then try to set
  *  BooleanQuery.setAllowDocsOutOfOrder(false) to improve performance.
  */
  public int getUID(IndexReader reader, int docId) throws IOException;

  /** Fills the passed-in array with UID-values. If the given array is
   *  null or too small, then a new array is created with
   *  cache.length = reader.maxDoc()
   */
  public int[] getCachedIds(IndexReader reader, int[] cache, int offset)
   throws IOException;

I put a little test program in main and it seems to work fine. However,
it's not thoroughly tested yet...

You might want to try it without using the cache first. The performance
might be good enough for your needs. If not, try using the cache; it
should fill up much faster than the FieldCache.

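For example, the no-cache path from inside a HitCollector would look
roughly like this (searcher, query and reader are assumed to exist and
to be final; just a sketch):

final PerDocPayloadReaderWriter pdp = new PerDocPayloadReaderWriter();
searcher.search(query, new HitCollector() {
  public void collect(int doc, float score) {
    try {
      int uid = pdp.getUID(reader, doc);  // per-doc lookup, no cache
      // ... map the Lucene docId to your DB UID here ...
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
});
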
Another comment: If you're using Lucene 2.2, you need to replace the
lines "tp.seek(UID_TERM);" (see comments in the code below). Or use the
latest trunk version, which has a fix for this bug.

Please let me know if this improves your performance! Have fun...
- Michael

And here is the code:

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Payload;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

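// Writes each document's UID as a 4-byte little-endian payload on the
// artificial term _ID:_UID; getUID() reads a single document's UID
// directly, getCachedIds() scans the whole posting list into an int[].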
public class PerDocPayloadReaderWriter {
  public static final Term UID_TERM = new Term("_ID", "_UID");
  private SinglePayloadTokenStream singlePayloadTokenStream =
      new SinglePayloadTokenStream();
  private TermPositions tp;
  private byte[] payloadBuffer = new byte[4];

  public static void main(String args[]) throws IOException {
    PerDocPayloadReaderWriter pdp = new PerDocPayloadReaderWriter();
    Directory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer());
    for (int i = 0; i < 10; i++) {
      Document d = new Document();
      // create dummy doc
      d.add(new Field("test", "This is a test.", Store.NO,
          Index.TOKENIZED));
      pdp.addUID(d, 100 + i);

      writer.addDocument(d);
    }
    writer.close();

    IndexReader reader = IndexReader.open(dir);
    int[] uids = pdp.getCachedIds(reader, null, 0);

    System.out.println("Caching:");
    System.out.println("ID --> UID");

    for (int i = 0; i < uids.length; i++) {
      System.out.println(i + "  --> " + uids[i]);
    }

    System.out.println("\nDirect access:");
    System.out.println("ID --> UID");

    for (int i = 0; i < uids.length; i++) {
      System.out.println(i + "  --> " + pdp.getUID(reader, i));
    }
    reader.close();
  }


  /** Fills the passed-in array with UID-values. If the given array is
   *  null or too small, then a new array is created with
   *  cache.length = reader.maxDoc()
   */
  public int[] getCachedIds(IndexReader reader, int[] cache, int offset)
      throws IOException {
    int maxDoc = reader.maxDoc();
    if (cache == null || cache.length - offset < maxDoc) {
      cache = new int[maxDoc];
      offset = 0;
    }

    if (tp == null) {
      tp = reader.termPositions(UID_TERM);
    } else {
      // if using Lucene 2.2 replace the following line with
      // tp = reader.termPositions(UID_TERM);
      tp.seek(UID_TERM);
    }

    while (tp.next()) {
      assert tp.doc() < maxDoc;
      if (!reader.isDeleted(tp.doc())) {
        tp.nextPosition();
        tp.getPayload(payloadBuffer, 0);
        cache[tp.doc() + offset] = bytesToInt(payloadBuffer);
      }
    }

    return cache;
  }

  /** Returns the UID for the passed-in docId.
   *  If you use this method in a HitCollector and your Query contains
   *  OR-terms, then try to set BooleanQuery.setAllowDocsOutOfOrder(false)
   *  to improve performance.
   */
  public int getUID(IndexReader reader, int docId) throws IOException {
    if (tp == null) {
      tp = reader.termPositions(UID_TERM);
    } else if (tp.doc() > docId) {
      // if using Lucene 2.2 replace the following line with
      // tp = reader.termPositions(UID_TERM);
      tp.seek(UID_TERM);
    }

    tp.skipTo(docId);
    tp.nextPosition();
    tp.getPayload(payloadBuffer, 0);

    return bytesToInt(payloadBuffer);
  }

  /**
   * Adds a unique docId (UID) to a document as a payload.
   */
  public void addUID(Document doc, int uid) {
    singlePayloadTokenStream.setUID(uid);
    doc.add(new Field(UID_TERM.field(), singlePayloadTokenStream));
  }

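  // Decodes the 4 little-endian bytes written by SinglePayloadTokenStream.setUID().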
  private int bytesToInt(byte[] bytes) {
    return ((bytes[3] & 0xFF) << 24) | ((bytes[2] & 0xFF) << 16)
    | ((bytes[1] & 0xFF) <<  8) |  (bytes[0] & 0xFF);

  }

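  // TokenStream that emits exactly one token ("_UID") per document,
  // carrying the UID bytes as its payload; returnToken ensures next()
  // returns the token only once after each setUID() call.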
  private static class SinglePayloadTokenStream extends TokenStream {
    private Token token = new Token(UID_TERM.text(), 0, 0);
    private byte[] buffer = new byte[4];
    private boolean returnToken = false;

    void setUID(int uid) {
      buffer[0] = (byte) (uid);
      buffer[1] = (byte) (uid >> 8);
      buffer[2] = (byte) (uid >> 16);
      buffer[3] = (byte) (uid >> 24);
      token.setPayload(new Payload(buffer));
      returnToken = true;
    }

    public Token next() throws IOException {
      if (returnToken) {
        returnToken = false;
        return token;
      } else {
        return null;
      }
    }
  }
}



> Thanks
> 
> -John
> 
> On 10/19/07, Michael Busch <bu...@gmail.com> wrote:
>> John Wang wrote:
>>>      I can tried to get some numbers for leading an int[] array vs
>>> FieldCache.getInts().
>> I've had a similar performance problem when I used the FieldCache. The
>> loading performance is apparently so slow, because each value is stored
>> as a term in the dictionary. For loading the cache it is necessary to
>> iterate over all terms for the field in the dictionary. And for each
>> term it's posting list is opened to check which documents have that value.
>>
>> If you store unique docIds, then there are no two documents that share
>> the same value. That means, that each value gets its own entry in the
>> dictionary and to load each value it is necessary to perform two random
>> I/O seeks (one for term lookup + one to open the posting list).
>>
>> In my app it took for a big index several minutes to fill the cache like
>> that.
>>
>> To speed things up I did essentially what Ning suggested. Now I store
>> the values as payloads in the posting list of an artificial term. To
>> fill my cache it's only necessary to perform a couple of I/O seeks for
>> opening the posting list of the specific term, then it is just a
>> sequential scan to load all values. With this approach the time for
>> filling the cache went down from minutes to seconds!
>>
>> Now this approach is already much better than the current field cache
>> implementation, but it still can be improved. In fact, we already have a
>> mechanism for doing that: the norms. Norms are stored with a fixed size,
>> which means both random access and sequential scan are optimal. Norms
>> are also cached in memory, and filling that cache is much faster
>> compared to the current FieldCache approach.
>>
>> I was therefore thinking about adding per-document payloads to Lucene
>> (we can also call it document-metadata). The API could look like this:
>>
>> Document d = new Document();
>> byte[] uidValue = ...
>> d.addMetadata("uid", uidValue);
>>
>> And on the retrieval side all values could either be loaded into the
>> field cache, or, if the index is too big, a new API can be used:
>>
>> IndexReader reader = IndexReader.open(...);
>> DocumentMetadataIterator it = reader.metadataIterator("uid");
>>
>> where DocumentMetadataIterator is an interface similar to TermDocs:
>>
>> interface DocumentMetadataIterator {
>>   void seek(String name);
>>   boolean next();
>>   boolean skipTo(int doc);
>>
>>   int doc();
>>   byte[] getMetadata();
>> }
>>
>> The next question would be how to store the per-doc payloads (PDP). If
>> all values have the same length (as the unique docIds), then we should
>> store them as efficiently as possible, like the norms. However, we still
>> want to offer the flexibility of having variable-length values. For this
>> case we could use a new data structure similar to our posting list.
>>
>> PDPList               --> FixedLengthPDPList | <VariableLengthPDPList,
>> SkipList>
>> FixedLengthPDPList    --> <Payload>^SegSize
>> VariableLengthPDPList --> <DocDelta, PayloadLength?, Payload>
>> Payload               --> Byte^PayloadLength
>> PayloadLength         --> VInt
>> SkipList              --> see frq.file
>>
>> Because we don't have global field semantics Lucene should automatically
>> pick the "right" data structure. This could work like this: When the
>> DocumentsWriter writes a segment it checks whether all values of a PDP
>> have the same length. If yes, it stores them as FixedLengthPDPList, if
>> not, then as VariableLengthPDPList.
>> When the SegmentMerger merges two or more segments it checks if all
>> segments have a FixedLengthPDPList with the same length for a PDP. If
>> not, it writes a VariableLengthPDPList to the new segment.
>>
>> I think this would be a nice new feature for Lucene. We could then have
>> user-defined and Lucene-specific PDPs. For example, norms would be in
>> the latter category (this way we would get rid of the special code for
>> norms, as they could be handled as PDPs). It would also be easy to add
>> new features in the future, like splitting the norms into two values: a
>> norm and a boost value.
>>
>> OK lot's of thoughts, I'm sure I'll get lot's of comments too ... ;)
>>
>> - Michael
>>
>>> Thanks
>>>
>>> -John
>>>
>>> On 10/19/07, Michael McCandless <lu...@mikemccandless.com> wrote:
>>>> It seems like there are (at least) two angles here for getting better
>>>> performance from FieldCache:
>>>>
>>>>   1) Be incremental: with reopen() we should only have to update a
>>>>      subset of the array in the FieldCache, according to the changed
>>>>      segments.  This is what Hoss is working on and Mark was referring
>>>>      to and I think it's very important!
>>>>
>>>>   2) Parsing is slow (?): I'm guessing one of the reasons that John
>>>>      added the _X.udt file was because it's much faster to load an
>>>>      array of already-parsed ints than to ask FieldCache to populate
>>>>      itself.
>>>>
>>>> Even if we do #1, I think #2 could be a big win (in addition)?  John
>>>> do you have any numbers of how much faster it is to load the array of
>>>> ints from the _X.udt file vs having FieldCache populate itself?
>>>>
>>>> Also on the original question of "can we open up SegmentReader,
>>>> FieldsWriter, etc.", I think that's a good idea?  At least we can make
>>>> things protected instead of private/final?
>>>>
>>>> Mike
>>>>
>>>> "Ning Li" <ni...@gmail.com> wrote:
>>>>> I see what you mean by 2) now. What Mark said should work for you when
>>>>> it's done.
>>>>>
>>>>> Cheers,
>>>>> Ning
>>>>>
>>>>> On 10/18/07, John Wang <jo...@gmail.com> wrote:
>>>>>> Hi Ning:
>>>>>>     That is essentially what field cache does. Doing this for each
>>>> docid in
>>>>>> the result set will be slow if the result set is large. But loading
>> it
>>>> in
>>>>>> memory when opening index can also be slow if the index is large and
>>>> updates
>>>>>> often.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> -John
>>>>>>
>>>>>> On 10/18/07, Ning Li <ni...@gmail.com> wrote:
>>>>>>> Make all documents have a term, say "ID:UID", and for each document,
>>>>>>> store its UID in the term's payload. You can read off this posting
>>>>>>> list to create your array. Will this work for you, John?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Ning
>>>>>>>
>>>>>>>
>>>>>>> On 10/18/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>>>>>>>> Forwarding this to java-dev per request.  Seems like the best
>>>> place
>>>>>>>> to discuss this topic.
>>>>>>>>
>>>>>>>>         Erik
>>>>>>>>
>>>>>>>>
>>>>>>>> Begin forwarded message:
>>>>>>>>
>>>>>>>>> From: "John Wang" <jo...@gmail.com>
>>>>>>>>> Date: October 17, 2007 5:43:29 PM EDT
>>>>>>>>> To: erik@ehatchersolutions.com
>>>>>>>>> Subject: lucene indexing and merge process
>>>>>>>>>
>>>>>>>>> Hi Erik:
>>>>>>>>>
>>>>>>>>>     We are revamping our search system here at LinekdIn. And we
>>>> are
>>>>>>>>> using Lucene.
>>>>>>>>>
>>>>>>>>>     One issue we ran across is that we store an UID in Lucene
>>>> which
>>>>>>>>> we map to the DB storage. So given a docid, to lookup its UID,
>>>> we
>>>>>>>>> have the following solutions:
>>>>>>>>>
>>>>>>>>> 1) Index it as a Stored field and get it from reader.document(very
>>>>>>>>> slow if recall is large)
>>>>>>>>> 2) Load/Warmup the FieldCache (for large corpus, loading up the
>>>>>>>>> indexreader can be slow)
>>>>>>>>> 3) construct it using the FieldCache and persist it on disk
>>>>>>>>> everytime the index changes. (not suitable for real time
>>>> indexing,
>>>>>>>>> e.g. this process will degrade as # of documents get large)
>>>>>>>>>
>>>>>>>>>     None of the above solutions turn out to be adequate for our
>>>>>>>>> requirements.
>>>>>>>>>
>>>>>>>>>      What we end up doing is to modify Lucene code by changing
>>>>>>>>> SegmentReader,DocumentWriter,and FieldWriter classes by taking
>>>>>>>>> advantage of the Lucene Segment/merge process. E.g:
>>>>>>>>>
>>>>>>>>>      For each segment, we store a .udt file, which is an int[]
>>>>>>>>> array, (by changing the FieldWriter class)
>>>>>>>>>
>>>>>>>>>      And SegmentReader will load the .udt file into an array.
>>>>>>>>>
>>>>>>>>>      And merge happens seemlessly.
>>>>>>>>>
>>>>>>>>>      Because the tight encapsulation around these classes, e.g.
>>>>>>>>> private and final methods, it is very difficult to extend Lucene
>>>>>>>>> while avoiding branch into our own version. Is there a way we
>>>> can
>>>>>>>>> open up and make these classes extensible? We'd be happy to
>>>>>>>>> contribute what we have done.
>>>>>>>>>
>>>>>>>>>      I guess to tackle the problem from a different angle: is
>>>> there
>>>>>>>>> a way to incorporate FieldCache into the segments (it is
>>>> strictly
>>>>>>>>> in memory now), and build disk versions while indexing.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>      Hope I am making sense.
>>>>>>>>>
>>>>>>>>>     I did not send this out to the mailing list because I wasn't
>>>>>>>>> sure if this is a dev question or an user question, feel free to
>>>>>>>>> either forward it to the right mailing list or let me know and I
>>>>>>>>> can forward it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> -John
>>>>>>>>>
>>>>>>>>
>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>>>>
>>>>>>>>
>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>
>>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Per-document Payloads (was: Re: lucene indexing and merge process)

Posted by John Wang <jo...@gmail.com>.
Hi Michael:
     Thanks for the info.

     I haven't played with payloads. Can you give me an example or point me
to how they can be used to solve this problem?

Thanks

-John

On 10/19/07, Michael Busch <bu...@gmail.com> wrote:
>
> John Wang wrote:
> >
> >      I can tried to get some numbers for leading an int[] array vs
> > FieldCache.getInts().
>
> I've had a similar performance problem when I used the FieldCache. The
> loading performance is apparently so slow, because each value is stored
> as a term in the dictionary. For loading the cache it is necessary to
> iterate over all terms for the field in the dictionary. And for each
> term it's posting list is opened to check which documents have that value.
>
> If you store unique docIds, then there are no two documents that share
> the same value. That means, that each value gets its own entry in the
> dictionary and to load each value it is necessary to perform two random
> I/O seeks (one for term lookup + one to open the posting list).
>
> In my app it took for a big index several minutes to fill the cache like
> that.
>
> To speed things up I did essentially what Ning suggested. Now I store
> the values as payloads in the posting list of an artificial term. To
> fill my cache it's only necessary to perform a couple of I/O seeks for
> opening the posting list of the specific term, then it is just a
> sequential scan to load all values. With this approach the time for
> filling the cache went down from minutes to seconds!
>
> Now this approach is already much better than the current field cache
> implementation, but it still can be improved. In fact, we already have a
> mechanism for doing that: the norms. Norms are stored with a fixed size,
> which means both random access and sequential scan are optimal. Norms
> are also cached in memory, and filling that cache is much faster
> compared to the current FieldCache approach.
>
> I was therefore thinking about adding per-document payloads to Lucene
> (we can also call it document-metadata). The API could look like this:
>
> Document d = new Document();
> byte[] uidValue = ...
> d.addMetadata("uid", uidValue);
>
> And on the retrieval side all values could either be loaded into the
> field cache, or, if the index is too big, a new API can be used:
>
> IndexReader reader = IndexReader.open(...);
> DocumentMetadataIterator it = reader.metadataIterator("uid");
>
> where DocumentMetadataIterator is an interface similar to TermDocs:
>
> interface DocumentMetadataIterator {
>   void seek(String name);
>   boolean next();
>   boolean skipTo(int doc);
>
>   int doc();
>   byte[] getMetadata();
> }
>
> The next question would be how to store the per-doc payloads (PDP). If
> all values have the same length (as the unique docIds), then we should
> store them as efficiently as possible, like the norms. However, we still
> want to offer the flexibility of having variable-length values. For this
> case we could use a new data structure similar to our posting list.
>
> PDPList               --> FixedLengthPDPList | <VariableLengthPDPList,
> SkipList>
> FixedLengthPDPList    --> <Payload>^SegSize
> VariableLengthPDPList --> <DocDelta, PayloadLength?, Payload>
> Payload               --> Byte^PayloadLength
> PayloadLength         --> VInt
> SkipList              --> see frq.file
>
> Because we don't have global field semantics Lucene should automatically
> pick the "right" data structure. This could work like this: When the
> DocumentsWriter writes a segment it checks whether all values of a PDP
> have the same length. If yes, it stores them as FixedLengthPDPList, if
> not, then as VariableLengthPDPList.
> When the SegmentMerger merges two or more segments it checks if all
> segments have a FixedLengthPDPList with the same length for a PDP. If
> not, it writes a VariableLengthPDPList to the new segment.
>
> I think this would be a nice new feature for Lucene. We could then have
> user-defined and Lucene-specific PDPs. For example, norms would be in
> the latter category (this way we would get rid of the special code for
> norms, as they could be handled as PDPs). It would also be easy to add
> new features in the future, like splitting the norms into two values: a
> norm and a boost value.
>
> OK lot's of thoughts, I'm sure I'll get lot's of comments too ... ;)
>
> - Michael
>
> >
> > Thanks
> >
> > -John
> >
> > On 10/19/07, Michael McCandless <lu...@mikemccandless.com> wrote:
> >>
> >> It seems like there are (at least) two angles here for getting better
> >> performance from FieldCache:
> >>
> >>   1) Be incremental: with reopen() we should only have to update a
> >>      subset of the array in the FieldCache, according to the changed
> >>      segments.  This is what Hoss is working on and Mark was referring
> >>      to and I think it's very important!
> >>
> >>   2) Parsing is slow (?): I'm guessing one of the reasons that John
> >>      added the _X.udt file was because it's much faster to load an
> >>      array of already-parsed ints than to ask FieldCache to populate
> >>      itself.
> >>
> >> Even if we do #1, I think #2 could be a big win (in addition)?  John
> >> do you have any numbers of how much faster it is to load the array of
> >> ints from the _X.udt file vs having FieldCache populate itself?
> >>
> >> Also on the original question of "can we open up SegmentReader,
> >> FieldsWriter, etc.", I think that's a good idea?  At least we can make
> >> things protected instead of private/final?
> >>
> >> Mike
> >>
> >> "Ning Li" <ni...@gmail.com> wrote:
> >>> I see what you mean by 2) now. What Mark said should work for you when
> >>> it's done.
> >>>
> >>> Cheers,
> >>> Ning
> >>>
> >>> On 10/18/07, John Wang <jo...@gmail.com> wrote:
> >>>> Hi Ning:
> >>>>     That is essentially what field cache does. Doing this for each
> >> docid in
> >>>> the result set will be slow if the result set is large. But loading
> it
> >> in
> >>>> memory when opening index can also be slow if the index is large and
> >> updates
> >>>> often.
> >>>>
> >>>> Thanks
> >>>>
> >>>> -John
> >>>>
> >>>> On 10/18/07, Ning Li <ni...@gmail.com> wrote:
> >>>>> Make all documents have a term, say "ID:UID", and for each document,
> >>>>> store its UID in the term's payload. You can read off this posting
> >>>>> list to create your array. Will this work for you, John?
> >>>>>
> >>>>> Cheers,
> >>>>> Ning
> >>>>>
> >>>>>
> >>>>> On 10/18/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> >>>>>> Forwarding this to java-dev per request.  Seems like the best
> >> place
> >>>>>> to discuss this topic.
> >>>>>>
> >>>>>>         Erik
> >>>>>>
> >>>>>>
> >>>>>> Begin forwarded message:
> >>>>>>
> >>>>>>> From: "John Wang" <jo...@gmail.com>
> >>>>>>> Date: October 17, 2007 5:43:29 PM EDT
> >>>>>>> To: erik@ehatchersolutions.com
> >>>>>>> Subject: lucene indexing and merge process
> >>>>>>>
> >>>>>>> Hi Erik:
> >>>>>>>
> >>>>>>>     We are revamping our search system here at LinekdIn. And we
> >> are
> >>>>>>> using Lucene.
> >>>>>>>
> >>>>>>>     One issue we ran across is that we store an UID in Lucene
> >> which
> >>>>>>> we map to the DB storage. So given a docid, to lookup its UID,
> >> we
> >>>>>>> have the following solutions:
> >>>>>>>
> >>>>>>> 1) Index it as a Stored field and get it from reader.document(very
> >>>>>>> slow if recall is large)
> >>>>>>> 2) Load/Warmup the FieldCache (for large corpus, loading up the
> >>>>>>> indexreader can be slow)
> >>>>>>> 3) construct it using the FieldCache and persist it on disk
> >>>>>>> everytime the index changes. (not suitable for real time
> >> indexing,
> >>>>>>> e.g. this process will degrade as # of documents get large)
> >>>>>>>
> >>>>>>>     None of the above solutions turn out to be adequate for our
> >>>>>>> requirements.
> >>>>>>>
> >>>>>>>      What we end up doing is to modify Lucene code by changing
> >>>>>>> SegmentReader,DocumentWriter,and FieldWriter classes by taking
> >>>>>>> advantage of the Lucene Segment/merge process. E.g:
> >>>>>>>
> >>>>>>>      For each segment, we store a .udt file, which is an int[]
> >>>>>>> array, (by changing the FieldWriter class)
> >>>>>>>
> >>>>>>>      And SegmentReader will load the .udt file into an array.
> >>>>>>>
> >>>>>>>      And merge happens seemlessly.
> >>>>>>>
> >>>>>>>      Because the tight encapsulation around these classes, e.g.
> >>>>>>> private and final methods, it is very difficult to extend Lucene
> >>>>>>> while avoiding branch into our own version. Is there a way we
> >> can
> >>>>>>> open up and make these classes extensible? We'd be happy to
> >>>>>>> contribute what we have done.
> >>>>>>>
> >>>>>>>      I guess to tackle the problem from a different angle: is
> >> there
> >>>>>>> a way to incorporate FieldCache into the segments (it is
> >> strictly
> >>>>>>> in memory now), and build disk versions while indexing.
> >>>>>>>
> >>>>>>>
> >>>>>>>      Hope I am making sense.
> >>>>>>>
> >>>>>>>     I did not send this out to the mailing list because I wasn't
> >>>>>>> sure if this is a dev question or an user question, feel free to
> >>>>>>> either forward it to the right mailing list or let me know and I
> >>>>>>> can forward it.
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>>
> >>>>>>> -John
> >>>>>>>
> >>>>>>
> >>>>>>
> >> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>>>>
> >>>>>>
> >>>>>
> >> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>>>
> >>>>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >>
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>