Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2009/12/21 04:08:51 UTC

SortWriter memory costs

On Sat, Dec 19, 2009 at 12:36:37PM -0800, Nathan Kurz wrote:
> > Over on the Lucene java-user list, we're discussing how to rein in
> > SortWriter's memory costs:
> >
> >  http://markmail.org/message/zcgb6ak24di42rev
> 
> I'm not sure that I understand your approach here.  Could you offer
> some context?  

Sure, the easiest way to do that is to explain how SortWriter works now and
why it consumes too much memory.  I'll use JavaScript syntax for the
pseudocode.[1]

Assume each document in your index has a field, "category", which has been
marked as "sortable".  Assume that "category" has been assigned the field
number 2 for this segment.

As docs get added, SortWriter keeps track of each doc's value for the field
"category" in both a hash and an array.  

The hash stores unique values:

  var category = doc.fetch('category');
  sort_writer.unique_values['category'][category] = -1;

The array records the value for each doc_id:

  var unique_val = sort_writer.unique_values['category'].find_key(category);
  sort_writer.doc_values['category'][doc_id] = unique_val;

Since the "doc_values" array only stores pointers to keys in the
"unique_values" hash, it doesn't grow very fast -- just one pointer per doc.

Our problem is the unique_values hash.  For fields with low cardinality -- for
instance if our "category" field has only a few possible values -- it never
gets very big.  But for fields with high cardinality -- such as a "product_id"
field where every value is unique -- it gets huge and consumes a ton of RAM.

That's what we need to solve.

When we finish off the segment, the present SortWriter writes three files for
each sortable text field: ords, offsets, and UTF-8 character data:

   seg_1/sort-2.ord  // <--- The "2" is the field number.
   seg_1/sort-2.ix
   seg_1/sort-2.dat

First, SortWriter creates an array of unique sorted values by sorting the keys
in the unique_values hash.  

  var sorted = sort_writer.unique_values['category'].keys();
  sorted.sort();

The sort positions are then saved as values in the unique_values hash.

  for (var ord = 0; ord < sorted.length; ord++) {
    var key = sorted[ord];
    sort_writer.unique_values['category'][key] = ord;
  }

The sorted values are written out to disk as two files: offsets, and character
data.

  var ix_out  = sort_writer.folder.open_out('seg_1/sort-2.ix');
  var dat_out = sort_writer.folder.open_out('seg_1/sort-2.dat');
  for (var i = 0; i < sorted.length; i++) {
    var unique_val = sorted[i];
    ix_out.write_i64(dat_out.tell());
    dat_out.write_bytes(unique_val.get_ptr8(), unique_val.get_size());
  }
  ix_out.close();
  dat_out.close();

Next, we create the ordinals array by iterating over the doc_values array and
looking up each ord via the unique_values hash.

  for (var doc_id = 1; doc_id <= doc_max; doc_id++) {
    var key = sort_writer.doc_values['category'][doc_id];
    ords[doc_id] = sort_writer.unique_values['category'][key];
  }

The ords file is simply this ordinals array blasted out to disk after
compressing it to a smaller bit width appropriate to its cardinality. (2
unique values = one bit per doc, 13 unique values = four bits per doc, etc).

  var outstream = sort_writer.folder.open_out('seg_1/sort-2.ord');
  outstream.write_bytes(compressed_ords, compressed_ords_len);
  outstream.close();
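
For concreteness, the bit-width calculation and packing might look
something like the sketch below -- this is illustrative, not the actual
implementation, with "compressed_ords" above being the returned byte
array.

  function bits_per_value(cardinality) {
    var bits = 1;
    while ((1 << bits) < cardinality) { bits++; }
    return bits;  // 2 uniques => 1 bit, 13 uniques => 4 bits, etc.
  }

  function compress_ords(ords, doc_max, cardinality) {
    var bits      = bits_per_value(cardinality);
    var num_bytes = Math.ceil(((doc_max + 1) * bits) / 8);
    var packed    = new Array(num_bytes);
    for (var b = 0; b < num_bytes; b++) { packed[b] = 0; }
    for (var doc_id = 1; doc_id <= doc_max; doc_id++) {
      var bit_pos = doc_id * bits;  // each doc gets a fixed-width slot
      for (var j = 0; j < bits; j++) {
        if (ords[doc_id] & (1 << j)) {
          packed[(bit_pos + j) >> 3] |= 1 << ((bit_pos + j) & 7);
        }
      }
    }
    return packed;
  }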

Right now, SortWriter is the only DataWriter component that has a problem with
RAM footprint -- all the other components (LexWriter, DocWriter,
PostingsWriter, etc) do a good job of keeping RAM usage under control and
scale up beyond the practical size for an index on one machine.

> The plan is to store one giant file of all the field values, 

Yes.  Instead of using a hash to perform the uniquing, we'd use a file.

Think of it this way: we're dispensing with the unique_values hash and using
only the doc_values array.  At segment finish time, we use that doc_values
array to create a sorted list of doc_ids.

  var sorted_doc_ids = [];  // doc_ids are 1-based
  for (var i = 1; i <= doc_max; i++) { sorted_doc_ids.push(i); }
  sorted_doc_ids.sort(compare_doc_ids_by_doc_values);

However, instead of actually storing CharBuf objects in the doc_values array,
we're going to save memory by storing file offsets pointing at a temporary
file where we've *serialized* those values.  During indexing, that costs us
one 64-bit integer per doc per sortable field, which will scale up OK.
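
At indexing time, the per-doc bookkeeping might look something like this
-- a sketch only, where the length prefix and the add_doc signature are
assumptions rather than settled API:

  var temp_out = sort_writer.folder.open_out('seg_1/sort_cache_writer_temp');

  function add_doc(doc, doc_id) {
    var value = doc.fetch('category');
    // Remember where this doc's value starts in the temp file...
    sort_writer.doc_values['category'][doc_id] = temp_out.tell();
    // ... then serialize the value, with a length prefix so that
    // read_value_into_charbuf knows how many bytes to pull back out.
    temp_out.write_i64(value.get_size());
    temp_out.write_bytes(value.get_ptr8(), value.get_size());
  }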

To perform comparisons during sorting, we deserialize values from the
temporary file two at a time.

  var instream = sort_writer.folder.open_in('seg_1/sort_cache_writer_temp');
  var value_a  = CharBuf.new();
  var value_b  = CharBuf.new();

  function compare_doc_ids_by_doc_values(doc_a, doc_b) {
    var position_a = sort_writer.doc_values['category'][doc_a];
    var position_b = sort_writer.doc_values['category'][doc_b];
    instream.seek(position_a);
    read_value_into_charbuf(instream, value_a);
    instream.seek(position_b);
    read_value_into_charbuf(instream, value_b);
    return value_a.compare_to(value_b);
  }

We can derive the ords array using the sorted_doc_ids array and the temp file:

  var ord        = 0;
  var ords       = new Array(doc_max + 1);
  var last_value = new CharBuf();
  var this_value = new CharBuf();
  for (var i = 0; i < sorted_doc_ids.length; i++) {
    var doc_id        = sorted_doc_ids[i];
    var file_position = sort_writer.doc_values['category'][doc_id];
    instream.seek(file_position);
    read_value_into_charbuf(instream, this_value);
    // Bump the ordinal each time we cross a boundary between distinct values.
    if (i > 0 && !this_value.equals(last_value)) {
      ord++;
    }
    ords[doc_id] = ord;
    // Swap the two buffers rather than aliasing them, so the next read
    // doesn't clobber last_value.
    var swap   = last_value;
    last_value = this_value;
    this_value = swap;
  }

This proposed algorithm is similar to the one that Toke Eskildsen described on
java-user:

    http://markmail.org/message/tjgh3p6ugjgh5go6

However, it has been adapted for some of Lucy's constraints, e.g. no
read/write files.

> I'd wonder how you'd merge results between shards or sort the results
> of text query.   

At search time, we use the mmap'd ords array to perform all sort comparisons
within a segment -- which is lightning fast.  :)  For comparison across
segments, we use actual values, which is slower but not that expensive in the
grand scheme of things, since it scales with the number of segments rather
than the number of documents.
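
Loosely, a cross-segment comparison falls back to real values -- method
names invented for illustration:

  function compare_across_segments(seg_a, doc_a, seg_b, doc_b) {
    // Each segment resolves its ord to the actual value via .ix/.dat.
    var val_a = seg_a.sort_cache.fetch_value(seg_a.ords[doc_a]);
    var val_b = seg_b.sort_cache.fetch_value(seg_b.ords[doc_b]);
    return val_a.compare_to(val_b);
  }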

> What's the main use case you are targetting?

All queries with a SortSpec that sorts on a field and all RangeQueries use our
mmap'd sort caches.

Marvin Humphrey

[1] C isn't expressive enough for code samples, especially since it lacks
    syntax for hash tables and arrays.  Perl, Ruby, and Python all suffice,
    but I don't want to favor one binding language over another.  (You know
    Perl, Mike knows Python, who gets excluded from the conversation blah blah
    blah.)  JavaScript, which we *can't* write bindings for because it lacks
    IO, is sufficiently expressive, spawned the JSON we use in our metadata
    files, and is popular enough that most people should grok it.


Re: SortWriter memory costs

Posted by Nathan Kurz <na...@verse.com>.
On Sun, Dec 27, 2009 at 4:45 PM, Nathan Kurz <na...@verse.com> wrote:
> This would be hard, though, so I'd suggest just sticking with your
> current approach and making your hash slightly more efficient.  Then
> think about how you can handle non-text fields directly --- instead of
> ords and offsets, handle dates, times, ints, floats, ratings etc.
> directly.  You can keep using an ord file for text, but treat this as
> an optimization rather than a requirement.    And think about how to
> make the intersegment merges efficient.

Simplifying, after rereading my poorly organized response:

I think numeric types are likely the main use for sorting.  While
converting strings to ords does allow them to be compared as
ints, this seems like it should be the special case rather than the
one determining the architecture.  I'd love to see a model of how
efficient date sorting based on UNIX timestamps would work first, and
then figure out how to add in free text afterward.  It might turn out
that ords are great, but I think that a direct sorting model might be
simpler and more general.

Nathan Kurz
nate@verse.com

Re: SortWriter memory costs

Posted by Nathan Kurz <na...@verse.com>.
On Sun, Dec 20, 2009 at 7:08 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Sat, Dec 19, 2009 at 12:36:37PM -0800, Nathan Kurz wrote:
>> > Over on the Lucene java-user list, we're discussing how to rein in
>> > SortWriter's memory costs:
>> >
>> >  http://markmail.org/message/zcgb6ak24di42rev
>>
>> I'm not sure that I understand your approach here.  Could you offer
>> some context?
>
> Sure, the easiest way to do that is to explain how SortWriter works now and
> why it consumes too much memory.  I'll use JavaScript syntax for the
> pseudocode.[1]

Thanks for the detailed description.  I'm more confident in my
understanding now, although I'm sure there are still gaps.  The
Javascript is a fine approach, although I think pseudo-C would be fine
too.

> Our problem is the unique_values hash.  For fields with low cardinality -- for
> instance if our "category" field has only a few possible values -- it never
> gets very big.  But for fields with high cardinality -- such as a "product_id"
> field where every value is unique -- it gets huge and consumes a ton of RAM.
>
> That's what we need to solve.

Patient -- Doctor, every time I poke myself in the eye with this stick
I get a sharp pain.
Doctor -- Well then stop poking yourself in the eye with a stick!

Patient -- Every time I try to sort by a text field on a gigantic
index for what should be a numeric field, my memory usage blows up.
Doctor -- Well, I suppose we could wrap the end of the stick with this
dirty rag to see if it blunts the pain...

My dominant impression is that converting text fields to ordinals to
allow easier sorting is an ugly hack that should not be encouraged.
Yes, it works if you're willing to tolerate falling back on alternate
solutions for sharding and segments, able to live with inefficient
joins with text-matching, and OK with making your indexes essentially
unmodifiable.  But isn't there a better way?

But let's come back to that later...

> When we finish off the segment, the present SortWriter writes three files for
> each sortable text field: ords, offsets, and UTF-8 character data:
>
>   seg_1/sort-2.ord  // <--- The "2" is the field number.
>   seg_1/sort-2.ix
>   seg_1/sort-2.dat

To check my understanding:

'ord' is for ordinal, and conceptually is the same as the term-id in a
sorted lexicon.  The '.ord' file is an array of ordinals indexed by
doc-id (although bit-compressed). Conceptually, looking up ord[doc-id]
would give you the ordinal of the value of 'category' for doc-id.

'ix' is for index, and 'dat' is for data.  The '.ix' file is an array
of offsets into the '.dat' file, so that to look up the text
representation of ordinal 'o', one would start reading at dat[o].
Putting the two together, one looks up the text representation of
'category' for doc-id using dat[ord[doc-id]].
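
In pseudo-code, just to check my reading of the file layout (accessor
names made up):

  var o = ord[doc_id];                  // bit-unpacked from sort-2.ord
  ix.seek(o * 8);                       // fixed-width 8-byte offsets in sort-2.ix
  var start = ix.read_i64();
  var end   = ix.read_i64();            // the next entry bounds this value
  dat.seek(start);
  var value = dat.read_bytes(end - start);  // UTF-8 bytes from sort-2.dat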

> First, SortWriter creates an array of unique sorted values by sorting the keys
> in the unique_values hash.

It seems like the main problem would be the overhead of the hash.
Are you trying to use host language hashes, or are you doing something
custom?  My first thought would be to write a simple custom hash that
stores offsets into '.dat' rather than values.   If a string isn't in
the hash, append it to the .dat file (lexicon) and save the offset.
Common strings will thus tend to appear at the start of the file, rare
ones at the end.   Given the fixed width of the offsets, you should be
able to get the overhead of the hash down to something very small.
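
Rough sketch of what I mean, with invented names; the collision check
would presumably re-read the candidate string from the .dat file (or a
cached copy) rather than from a value stored in the hash:

  function intern(value) {
    var slot = offset_hash.lookup(value);   // hash of the string -> slot
    if (slot.dat_offset == -1) {            // never seen before:
      slot.dat_offset = dat_out.tell();     //   append it to the lexicon
      dat_out.write_bytes(value.get_ptr8(), value.get_size());
    }
    return slot.dat_offset;                 // fixed-width value to store per doc
  }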

> Right now, SortWriter is the only DataWriter component that has a problem with
> RAM footprint -- all the other components (LexWriter, DocWriter,
> PostingsWriter, etc) do a good job of keeping RAM usage under control and
> scale up beyond the practical size for an index on one machine.

This is all per-segment, right?  How many docs are you planning to
target per segment?
How many do you have to get to before the RAM usage is out of control?

100 million seems like a lot, to me.  100 million docs with a plain
text Unique_ID field and each entry in the hash taking up 100 bytes
(worst case currently?) would be 10GB, which is right at the boundary
of being in control.  If you can get that overhead down to 10 bytes
(1 GB), it seems like you'd be firmly in control again, even if you
have multiple such fields.  And if not, you could always do 1 million
docs per segment and merge them later.

> Yes.  Instead of using a hash to perform the uniquing, we'd use a file.
>
> Think of it this way: we're dispensing with the unique_values hash and using
> only the doc_values array.  At segment finish time, we use that doc_values
> array to create a sorted list of doc_ids.

OK, so you don't worry about duplicating strings in the data file ---
you just write out everything and save the offset.  Then when you are
done, you sort the index file by the values of the stored strings.

This seems like a small win in the case of 'Product ID', and a major
loss for a field like 'Gender' that has only a limited number of
values.  Do you really want to indirectly sort 100,000,000 strings
that have only two (OK, 7 for the Bay Area) distinct values?

The windowing function Toke alludes to would solve most of this, but
once you're doing this it doesn't seem that much harder to go all the
way and do a low-overhead hash.

My overall impression is that this approach would work, but that there
are some tradeoffs and it doesn't really buy you that much overall
efficiency.  Either you fit in RAM or you don't --- once you don't,
even an SSD isn't going to keep performance from diving off the cliff.

>> I'd wonder how you'd merge results between shards or sort the results
>> of text query.
>
> At search time, we use the mmap'd ords array to perform all sort comparisons
> within a segment -- which is lightning fast.  :)  For comparison across
> segments, we use actual values, which is slower but not that expensive in the
> grand scheme of things, since it scales with the number of segments rather
> than the number of documents.

I'm still not quite understanding your approach.  I just found this
email with GMail by searching for 'marvin', so let's use that as a
base case for ordering.  Making up syntax, I enter 'marvin' in the box
and Google translates it to something like this: "MATCH marvin IN
all-fields ORDER BY date LIMIT 10".

There are several ways to handle this query.  Which one are you aiming for?

1) Incrementally find hits for 'marvin', lookup ord[doc-id] at the
time of each hit, score as 1/ord, and keep the top 10.

2) Find all hits for marvin, sort them all using the mmapped ord
array, return the top 10.

3) Find all hits for marvin, then use this to filter an ord sorted
list of doc-ids, stopping once one has found 10 matches.

It sounds like you are aiming for 2), but this seems like it would be
unwieldy when searching for a common term:  it seems expensive to sort
a list that big.  I'm also not sure how to break ties by using the
score for the text term.   I'm most comfortable with 3) because of
its flexibility, but can see the advantage of 1) for efficiency on
rare terms.

If you were to do something closer to 2), I'd be tempted to just look
up offset[doc-id] and use the string directly to do the sorting.
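
For concreteness, here's roughly how I picture 1) -- collector API
invented, ascending sort for simplicity:

  var top10 = [];                            // kept sorted, at most 10 entries
  function collect(doc_id) {                 // called once per hit for 'marvin'
    var key = ord[doc_id];                   // one mmap'd array lookup per hit
    if (top10.length < 10 || key < top10[top10.length - 1].key) {
      top10.push({ doc_id: doc_id, key: key });
      top10.sort(function(a, b) { return a.key - b.key; });
      if (top10.length > 10) { top10.pop(); }
    }
  }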

>> What's the main use case you are targetting?
>
> All queries with a SortSpec that sorts on a field and all RangeQueries use our
> mmap'd sort caches.

The explanation up to here is wonderfully clear, but this seems a little
circular as a use case.   Is the Gmail example a reasonable one?   Is
there a better one?

For me, I think a useful case to think about would involve both full
text search and a range query, with results ordered by the value used
in the range query and ties broken by strength of text match.
Something like "MATCH word IN title WHERE rating > 3 ORDER BY rating
DESCENDING".

In almost all the non-contrived cases I can come up with, the ranges
are dates or numerics, not text fields.  I can come up with some cases
where I'd want to order alphabetically, but not that many.

Which brings me back to my opening rant...

Ordinals serve as a minimal monotonic perfect hash for the input
strings.    The actual value of the ordinal is not that useful ---
more commonly, you just want a fast way of comparing two strings, and
comparing two ordinals turns out to be faster.   If the lexicon itself
is ordered, the offsets into the lexicon (data file), while not
minimal, will compare the same.

Since the cases that interest me most of those where the index can be
modified without being rebuilt, I'd rather not require that the
lexicon itself be ordered, since it's awfully convenient just to add a
new string by appending.   I'm not even sure you need to have the
ordinals precalculated at all --- an ordered list of (doc-id, offset)
pairs seems like it would be sufficient.

To be editable, you'd probably need to keep this in a BTree variant,
or some other page oriented ordered data structure.  The nice part of
this is that it would be entirely pay-as-you-go:  if you add to this
file directly, you'd have minimal memory usage at all steps of the
process, and once you've added the last document, there's no giant
sorting step that needs to happen.

This would be hard, though, so I'd suggest just sticking with your
current approach and making your hash slightly more efficient.  Then
think about how you can handle non-text fields directly --- instead of
ords and offsets, handle dates, times, ints, floats, ratings etc.
directly.  You can keep using an ord file for text, but treat this as
an optimization rather than a requirement.    And think about how to
make the intersegment merges efficient.

One thing I'm becoming more confident of is that handling of segments
and shards should be identical.  Multi-segment merges should not
require direct access to in-memory data structures, rather all data
necessary for aggregation should be presented along with the results.
 The results returned from a single segment, from a shard, or from the
whole shabang should be interchangeable.  If you can do this
efficiently, I think you are pretty much guaranteed to scale to
many-cores as well as many-machines.

Sorry if the ranting was too incoherent,

Nathan Kurz
nate@verse.com