Posted to java-user@lucene.apache.org by Ganesh <em...@yahoo.co.in> on 2009/12/17 07:48:02 UTC

External sort

Hello all,

We are facing serious issues related to sorting in our production environment. I know this issue has been discussed in this group before. I am using v2.9.1 with multiple shards, and I need to sort only by date-time. Sorting consumes 50% of our RAM.

Has anyone attempted to use external sorting?

Is there any way to index the records in sorted order? That is, sort and merge the records at indexing time and pull them back sorted by doc ID?

Any other ideas? I need to sort with less memory; performance is not the concern.

Regards
Ganesh


 
Send instant messages to your online friends http://in.messenger.yahoo.com 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: External sort

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Dec 17, 2009 at 09:33:11AM -0800, Marvin Humphrey wrote:
> On Thu, Dec 17, 2009 at 05:03:11PM +0100, Toke Eskildsen wrote:
> 
> > A third alternative would be to count the number of unique datetime-values
> > and make a compressed representation, but that would make the creation of
> > the order-array more complex.
> 
> This is what we're doing in Lucy/KinoSearch, though we prepare the ordinal
> arrays at index time and mmap them at search time.
> 
> The provisional implementation has a problem, though.  We use a hash set to
> perform uniquing, but when there are a lot of unique values, the memory
> requirements of the SortWriter go very high.
> 
> Assume that we're sorting string data rather than date time values (so the
> memory occupied by all those unique values is significant).  What algorithms
> would you consider for performing the uniquing?

I have received private replies which have helped to shape the following design:

Instead of storing an object in memory for each unique value, we'll serialize
values to a file and store 64-bit file pointer offsets.  The memory cost as we
add documents will be...

   sizeof(int64_t) * number_of_documents_in_segment * number_of_sortable_fields
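In Java terms that cost is easy to sanity-check (a back-of-the-envelope sketch; the 10-million-document segment and the two sortable fields are example numbers, not figures from this thread):

```java
public class SortWriterMemory {
    // sizeof(int64_t) * number_of_documents_in_segment * number_of_sortable_fields
    static long offsetArrayBytes(long docsInSegment, int sortableFields) {
        return 8L * docsInSegment * sortableFields;
    }

    public static void main(String[] args) {
        // e.g. a 10 million document segment with two sortable fields
        long bytes = offsetArrayBytes(10_000_000L, 2);
        System.out.println(bytes + " bytes = " + bytes / (1024 * 1024) + " MB"); // 160000000 bytes = 152 MB
    }
}
```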

When it comes time to finish the segment, we'll loop through the fields one at
a time.  On each loop, we'll create a sorted array of document numbers using
the values we find at the file pointers.  There will be an awful lot of file
seeking and deserialization costs during the sort to achieve the comparisons,
but that's the price we pay for a scalable and general algorithm.

There will be only one file in which we store all serialized values for all
fields.  We'd probably get better locality with one file per field, but we
have to be conservative about file descriptors.

As an optimization for fields with low cardinality, we'll keep a 32-element
LRU[1] per field, in which we cache unique values.  That way, we won't spend
a lot of I/O writing, e.g., "yes" and "no" over and over.

Once we have the sorted array of document numbers, we can use it to create the
compressed ords array and the final stack of unique values.  To perform the
final uniquing, we'll have to compare each doc's value to the last and
determine whether the value has actually changed.  That means yet more seeking
and deserialization costs we don't have with the pure hash set uniquing we use
now -- but it can't be helped since we will often have multiple copies of the
same value serialized at different places within the temp values file.

Thoughts, improvements?  Hopefully this gives the original poster some ideas
about how you might go about sorting externally within Lucene as well...

Marvin Humphrey

[1] For now it will actually be a faked up LRU: a hash table that we empty
every time it grows to 32 elements.  If and when we get around to writing a
real LRU class for Lucy, we'll use that instead.
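For readers following along, the whole scheme can be sketched in plain Java. This is a loose illustration, not the Lucy/KinoSearch code: the class name is made up, an in-memory ByteArrayOutputStream stands in for the temp values file, and the "fake LRU" is the emptied-at-32 hash table from the footnote.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExternalUniquing {
    private final ByteArrayOutputStream tempFile = new ByteArrayOutputStream(); // stand-in for the temp values file
    private final List<Long> offsets = new ArrayList<>();      // one 64-bit offset per document
    private final Map<String, Long> recent = new HashMap<>();  // "fake LRU": emptied at 32 entries

    // Record the sort value for the next document.  The value is serialized
    // to the temp "file" unless the 32-element cache has seen it recently.
    public void addDoc(String value) {
        Long offset = recent.get(value);
        if (offset == null) {
            offset = (long) tempFile.size();
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            tempFile.write(bytes, 0, bytes.length);
            tempFile.write(0); // NUL terminator, so deref() can find the end
            if (recent.size() >= 32) recent.clear();
            recent.put(value, offset);
        }
        offsets.add(offset);
    }

    // "Seek" into the temp file and deserialize the value at the given offset.
    private String deref(long offset) {
        byte[] data = tempFile.toByteArray();
        int start = (int) offset, end = start;
        while (data[end] != 0) end++;
        return new String(data, start, end - start, StandardCharsets.UTF_8);
    }

    // Finish the segment: sort document numbers by their values, then assign
    // ordinals by checking whether each doc's value differs from the previous.
    public int[] ords() {
        Integer[] docs = new Integer[offsets.size()];
        for (int i = 0; i < docs.length; i++) docs[i] = i;
        Arrays.sort(docs, Comparator.comparing((Integer d) -> deref(offsets.get(d))));
        int[] ords = new int[docs.length];
        int ord = 0;
        for (int i = 1; i < docs.length; i++) {
            if (!deref(offsets.get(docs[i])).equals(deref(offsets.get(docs[i - 1])))) ord++;
            ords[docs[i]] = ord;
        }
        return ords;
    }
}
```

Sorting by dereferenced values does all the seeking the mail describes, and the adjacent-value comparison at the end performs the final uniquing.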




RE: External sort

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
(reposted to the list as I originally sent my reply to only Marvin by mistake)

From: Marvin Humphrey [marvin@rectangular.com]
> We use a hash set to perform uniquing, but when there are a lot of unique
> values, the memory requirements of the SortWriter go very high.
>
> Assume that we're sorting string data rather than date time values (so the
> memory occupied by all those unique values is significant).  What algorithms
> would you consider for performing the uniquing?

We do have a structure comparable to the above: an ordered list of
Strings, where updates to a Lucene index require updates to the list. We do
this by having a long[] with collator-sorted offsets into a file with all the Strings
in insertion order.

When a new String is considered, existence is checked in O(log(n)) by
binary search. If it is new, it is appended to the file and the offset inserted
into the offset-array (arrayCopy). That is only usable for a relatively small
number of inserts, due to the arrayCopy. SSDs are very helpful for
String lookup performance here.
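A stripped-down illustration of that insert path (hypothetical names; the on-disk Strings file is simulated with an in-memory StringBuilder, and the collator is replaced by plain String.compareTo):

```java
import java.util.ArrayList;
import java.util.List;

public class SortedOffsetList {
    private final StringBuilder file = new StringBuilder();        // stand-in for the on-disk Strings file
    private final List<Integer> sortedOffsets = new ArrayList<>(); // offsets, sorted by the Strings they point at

    private String deref(int offset) {
        return file.substring(offset, file.indexOf("\n", offset));
    }

    // O(log n) existence check by binary search over the sorted offsets.
    public boolean insertIfAbsent(String s) {
        int lo = 0, hi = sortedOffsets.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = deref(sortedOffsets.get(mid)).compareTo(s);
            if (cmp == 0) return false;             // already present
            if (cmp < 0) lo = mid + 1; else hi = mid;
        }
        // Append the new String to the "file" and insert its offset;
        // the List insert is the arrayCopy mentioned above - O(n) per insert.
        int offset = file.length();
        file.append(s).append('\n');
        sortedOffsets.add(lo, offset);
        return true;
    }

    public int size() { return sortedOffsets.size(); }
}
```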

For building the initial list, we use the terms for a given field, so we know that
they are unique: We append them all to a file and keep track of the offsets,
then sort the offset-list based on what they're pointing at. By using caching
and mergesort (heapsort _kills_ performance in this scenario, as there is no
locality), it performs fairly well.
(I'm lying a little about our setup, for the sake of brevity.)
Memory usage excl. caching is 8 bytes * #Strings + housekeeping.

By adding another level of indirection, storing the offsets in a file and
sorting a list of pointers to the offsets (argh), memory requirements can be
dropped to 4 bytes * #Strings. That doubles the number of seeks, so I
would only recommend it with SSDs and in tight spots.

We do have some code for duplicate reduction (long story): when the list
is sorted, step through the offsets and compare the Strings for the entries
at positions x and x+1. If the Strings are equal, set offset[x+1] = offset[x].
When the iteration has finished, the offsets only point to unique Strings,
and the Strings-file contains some non-referenced entries, which can be
cleaned up by writing a new Strings-file.
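A sketch of that duplicate-reduction pass (the Strings file is simulated by an in-memory array, so an "offset" here is just an index; the assumption is that the duplicate entry is made to point at the same place as its predecessor):

```java
import java.util.Arrays;
import java.util.Comparator;

public class DuplicateReduction {
    // strings[i] plays the role of "the String at offset i in the file".
    public static int[] dedup(String[] strings) {
        Integer[] offsets = new Integer[strings.length];
        for (int i = 0; i < offsets.length; i++) offsets[i] = i;
        // Sort the offsets by the Strings they point at (a stable mergesort).
        Arrays.sort(offsets, Comparator.comparing((Integer o) -> strings[o]));
        // Collapse duplicates: if entries x and x+1 point at equal Strings,
        // make x+1 point at the same place as x.
        for (int x = 0; x < offsets.length - 1; x++) {
            if (strings[offsets[x]].equals(strings[offsets[x + 1]])) {
                offsets[x + 1] = offsets[x];
            }
        }
        int[] result = new int[offsets.length];
        for (int i = 0; i < result.length; i++) result[i] = offsets[i];
        return result;
    }
}
```

After the pass, any string not referenced by an offset is garbage and can be dropped by rewriting the file.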


Somewhat related, I'm experimenting with a sliding-window approach to
sorting Lucene terms, which might be usable in your scenario. It is a fairly
clean trade-off between memory and processing time. It is outlined in the
thread "heap memory issues when sorting by a string field" and it should
be fairly straight-forward to adapt the algorithm to remove duplicates.
The main requirement is that it is possible to iterate over the Strings
multiple times.


In general, I find that the memory-killer in these cases tends to be all the
wrapping and not the data itself. A String has 38+ bytes of
overhead, and if you're mainly using ASCII, Java's 2-byte char sure
means a lot of always-0 bits in memory. I don't know if it is feasible, but
making something like
class LittleString {
  private static final java.nio.charset.Charset UTF8 =
      java.nio.charset.Charset.forName("UTF-8");
  private final byte[] utf8;
  LittleString(String s) {
    utf8 = s.getBytes(UTF8); // no checked exception, unlike getBytes("utf-8")
  }
  public int hashCode() {
    return java.util.Arrays.hashCode(utf8); // generated from the utf8 bytes
  }
  public String toString() {
    return new String(utf8, UTF8);
  }
  public boolean equals(Object o) {
    if (!(o instanceof LittleString)) return false;
    return java.util.Arrays.equals(utf8, ((LittleString) o).utf8);
  }
}
and storing that in your HashSet instead of the Strings directly ought
to cut a fair amount of your memory usage. I don't know how much
it would cost in performance though and the HashSet structure itself
isn't free.
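A quick sanity check of the idea: dropping such a wrapper into a HashSet does collapse equal Strings into one entry, since equals/hashCode work on the UTF-8 bytes (a minimal, self-contained copy of the class above):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LittleStringDemo {
    // Minimal copy of the wrapper from the mail: UTF-8 bytes instead of chars.
    static final class LittleString {
        final byte[] utf8;
        LittleString(String s) { utf8 = s.getBytes(StandardCharsets.UTF_8); }
        public int hashCode() { return Arrays.hashCode(utf8); }
        public boolean equals(Object o) {
            return o instanceof LittleString && Arrays.equals(utf8, ((LittleString) o).utf8);
        }
        public String toString() { return new String(utf8, StandardCharsets.UTF_8); }
    }

    public static void main(String[] args) {
        Set<LittleString> unique = new HashSet<>();
        for (String s : new String[] {"yes", "no", "yes", "maybe"}) {
            unique.add(new LittleString(s));
        }
        System.out.println(unique.size()); // prints 3: the two "yes" values collapse
    }
}
```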

Regards,
Toke Eskildsen


Re: External sort

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Dec 17, 2009 at 05:03:11PM +0100, Toke Eskildsen wrote:

> A third alternative would be to count the number of unique datetime-values
> and make a compressed representation, but that would make the creation of
> the order-array more complex.

This is what we're doing in Lucy/KinoSearch, though we prepare the ordinal
arrays at index time and mmap them at search time.

The provisional implementation has a problem, though.  We use a hash set to
perform uniquing, but when there are a lot of unique values, the memory
requirements of the SortWriter go very high.

Assume that we're sorting string data rather than date time values (so the
memory occupied by all those unique values is significant).  What algorithms
would you consider for performing the uniquing?

Marvin Humphrey




Re: External sort

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Fri, 2009-12-18 at 12:47 +0100, Ganesh wrote:
> I am using Integer for datetime. As the data grows, I am hitting the 
> upper limit.

Could you give us some numbers? Document count, index size in GB, amount
of RAM available for Lucene?

> One option most of them in the group discussed about using EHcache. 
> 
> Let consider the below data get indexed. unique_id is the id 
> generated for every record. unique_id,  field1,  field2,  date_time
> 
> In Ehcache, Consider I am storing
> unique_id, date_time
> 
> How could i merge the results from Lucene and Ehcache? Do I need to 
> fetch all the search results and compare it against the EHcache
> results and decide (using FieldComparatorSource).

As you are storing the date_time in the index, you don't win anything by
caching the values externally: Reading the unique_id needed for lookup
in the Ehcache takes just as long as reading the date_time directly.

> (future thought / research) One more thought, Is there any way to 
> write the index in sorted order, May be while merging. Assign docid
> by sorting the selected field.

You cannot control the docID that way, but Lucene keeps documents in
index order, so you could do this by sorting your data before index
build.

You're touching on a recurring theme here, as coordination with external
data-handling could be done very efficiently if Lucene provided a
persistent ID as a first-class attribute for documents. The problem is
that it would require a lot of changes to the API, and that it would mean
additional non-optional overhead for all Lucene users. I haven't kept
track of enough threads on the developer list, so a better solution to
the problem might have been found.

Regards,
Toke Eskildsen




Re: External sort

Posted by Ganesh <em...@yahoo.co.in>.
Thanks for all your ideas. I was expecting the sorting-related fix in 3.0, but it would be great if it gets into 3.1.

I am using an Integer for the datetime. As the data grows, I am hitting the upper limit. As my application is part of a product used in different environments, we cannot ask all customers to add memory. We need to run with less and find some other way out.

One option many in the group discussed is using Ehcache.

Consider the data below being indexed; unique_id is the id generated for every record:
unique_id,  field1,  field2,  date_time

In Ehcache, consider I am storing:
unique_id, date_time

How could I merge the results from Lucene and Ehcache? Do I need to fetch all the search results, compare them against the Ehcache results, and decide (using FieldComparatorSource)? Could someone help me with how to go about this? Consider that the data is mostly static; there will be updates and modifications, but the unique_id and date_time are not going to change. I cannot use the docid from Lucene, as it changes after updates.

(future thought / research) One more thought: is there any way to write the index in sorted order, maybe while merging, assigning docids by sorting the selected field? This way we could achieve the sorting with zero RAM utilization. Usually the sorted field is fixed for a given application. Just curious about these things.

Regards
Ganesh

----- Original Message ----- 
From: "Toke Eskildsen" <te...@statsbiblioteket.dk>
To: <ja...@lucene.apache.org>; "Toke Eskildsen" <te...@statsbiblioteket.dk>
Sent: Thursday, December 17, 2009 9:33 PM
Subject: RE: External sort


Sigh... Forest for the trees, Toke.

The date time represented as an integer _is_ the order and at the same time a global external representation. No need to jump through all the hoops I made. I had my mind full of a more general solution to the memory problem with sort.


Instead of the order-array being a long[] with order and datetime, it should be just an int[] with the datetime. The FieldCacheImpl does this for INT-sorts, so there's no need for extra code if you just store the datetime as an integer (something like Integer.toString(datetimeAsInt) for the field-value) and use SortField(fieldname, SortField.INT) to sort with.

If you cannot store the datetime as an integer (e.g. if you don't control the indexing), you can use the FieldCacheImpl with a custom int-parser that translates your datetime representation to an int.

The internal representation takes 4 bytes/document. If you need to go lower than that, I'd say you have a very uncommon setup.

It can be done by making custom code and storing the order-array on disk, but access-speed would suffer tremendously for searches with millions of hits. An alternative would be to reduce the granularity of the datetime and use SortField.SHORT or SortField.BYTE. A third alternative would be to count the number of unique datetime-values and make a compressed representation, but that would make the creation of the order-array more complex.




RE: External sort

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Sigh... Forest for the trees, Toke.

The date time represented as an integer _is_ the order and at the same time a global external representation. No need to jump through all the hoops I made. I had my mind full of a more general solution to the memory problem with sort.


Instead of the order-array being a long[] with order and datetime, it should be just an int[] with the datetime. The FieldCacheImpl does this for INT-sorts, so there's no need for extra code if you just store the datetime as an integer (something like Integer.toString(datetimeAsInt) for the field-value) and use SortField(fieldname, SortField.INT) to sort with.

If you cannot store the datetime as an integer (e.g. if you don't control the indexing), you can use the FieldCacheImpl with a custom int-parser that translates your datetime representation to an int.

The internal representation takes 4 bytes/document. If you need to go lower than that, I'd say you have a very uncommon setup.

It can be done by making custom code and storing the order-array on disk, but access-speed would suffer tremendously for searches with millions of hits. An alternative would be to reduce the granularity of the datetime and use SortField.SHORT or SortField.BYTE. A third alternative would be to count the number of unique datetime-values and make a compressed representation, but that would make the creation of the order-array more complex.
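The 32-bit claim is easy to verify outside Lucene. The sketch below only shows the representation trick; in Lucene itself the sorting would be handled by SortField(fieldname, SortField.INT), optionally with the custom int-parser mentioned above:

```java
public class DateTimeAsInt {
    // Pack a millisecond timestamp into 32 bits as seconds since epoch;
    // this works for dates up to 2038-01-19 (Integer.MAX_VALUE seconds).
    static int toSortableInt(long epochMillis) {
        long seconds = epochMillis / 1000L;
        if (seconds > Integer.MAX_VALUE) throw new IllegalArgumentException("past 2038");
        return (int) seconds;
    }

    public static void main(String[] args) {
        int[] dates = {
            toSortableInt(1261036082000L),  // 2009-12-17 07:48:02 UTC, the original post
            toSortableInt(0L),              // the epoch
            toSortableInt(2147483647000L),  // the 2038 limit
        };
        java.util.Arrays.sort(dates);       // plain int sorting is all that is needed
        System.out.println(java.util.Arrays.toString(dates));
    }
}
```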


Re: External sort

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Thu, 2009-12-17 at 15:34 +0100, Ganesh wrote:
> Thanks Toke. I worried to use  long[] inverted = new long[reader.maxDoc]; 
> as the memory consumption will be high for millions of document.

Well, how many documents do you have? 10 million? That's just 160MB of
overhead, of which 80MB are temporary, used only on the first search.

> Any idea of building external sort cache?  

You could dump the order-array to disk (a huge performance hit on
conventional hard disks), but it's hard to avoid the temporary
inverse-array upon first search. Of course, you could generate it at
index build time and thus have a memory hit of virtually 0.
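Keeping the order-array on disk can also go through memory-mapping, which is what Marvin describes doing in Lucy/KinoSearch. A standalone NIO sketch, not Lucene code (names are made up):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedOrderArray {
    // Write the order-array once, at index build time.
    static void write(Path path, long[] order) {
        ByteBuffer buf = ByteBuffer.allocate(order.length * 8);
        buf.asLongBuffer().put(order);
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(buf);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Memory-map it at search time: near-zero heap cost, the OS pages it in on demand.
    static LongBuffer map(Path path) {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asLongBuffer();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("order", ".bin");
        write(tmp, new long[] {42L, 7L, 99L});
        System.out.println(map(tmp).get(2)); // reads straight from the mapped file
        Files.deleteIfExists(tmp);
    }
}
```

Random access into the mapped buffer is cheap once the pages are warm, but, as noted, cold access on a conventional hard disk pays a seek per lookup.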




Re: External sort

Posted by Ganesh <em...@yahoo.co.in>.
Thanks Toke. I am worried about using long[] inverted = new long[reader.maxDoc()]; as the memory consumption will be high for millions of documents.

Any ideas on building an external sort cache?

Regards
Ganesh
 
----- Original Message ----- 
From: "Toke Eskildsen" <te...@statsbiblioteket.dk>
To: <ja...@lucene.apache.org>
Sent: Thursday, December 17, 2009 6:07 PM
Subject: Re: External sort


> On Thu, 2009-12-17 at 07:48 +0100, Ganesh wrote:
>> I am using v2.9.1 I am having multiple shards and i need to do only 
>> date time sorting. Sorting consumes 50% of RAM.
> 
> I'm guessing that your date-times are representable in 32 bits (signed
> seconds since epoch or such - that'll work until 2038)? If so, it should
> be possible to do very efficient sorting, both memory- and
> performance-wise.
> 
> Make your own sorter by implementing SortComparatorSource.
> 
> The SortComparatorSource returns a ScoreDocComparator which contains an
> array of longs, in which the first 32 bits designates the order of the
> document at the given index (docID) and the last 32 bits holds the date
> time.
> 
> The ScoreDocComparator's methods are trivial:
>  public int compare (ScoreDoc i, ScoreDoc j) {
>    // order sits in the high 32 bits, so comparing the packed longs compares by order;
>    // don't return the long difference directly - it overflows int
>    return order[i.doc] < order[j.doc] ? -1 : order[i.doc] > order[j.doc] ? 1 : 0;
>  }
>  public Comparable sortValue (ScoreDoc i) {
>    // the mask must be the long literal 0xFFFFFFFFL; the int literal sign-extends to -1
>    return Integer.valueOf((int)(order[i.doc] & 0xFFFFFFFFL));
>  }
>  public int sortType(){
>    return SortField.CUSTOM;
>  }
> 
> Now, for generating the order-array, we do something like this in the
> SortComparatorSource:
> 
>  TermDocs termDocs = reader.termDocs();
>  // inverted[docID] == datetime | docID
>  long[] inverted = new long[reader.maxDoc()];
>  TermEnum termEnum = reader.terms(new Term(fieldname, ""));
> 
>  do {
>    Term term = termEnum.term();
>    if (term == null || !fieldname.equals(term.field())) {
>      break;
>    }
>    long dateTime = (long)stringDateTimeToInt(term.text()) << 32;
>    termDocs.seek(termEnum);
>    while (termDocs.next()) {
>      inverted[termDocs.doc()] = dateTime | termDocs.doc();
>    }
>  } while (termEnum.next());
>  termEnum.close();
> 
>  // inverted[order] == datetime | docID
>  Arrays.sort(inverted); // works while the packed values are non-negative (datetimes 1970+)
> 
>  // order[docID] == order | datetime
>  long[] order = new long[inverted.length];
>  for (int o = 0 ; o < inverted.length ; o++) {
>    int docID = (int)(inverted[o] & 0xFFFFFFFFL);
>    order[docID] = ((long)o << 32) | (inverted[o] >>> 32);
>  }
> 
> It would be nice to avoid the extra array needed for reordering, but I'm
> fresh out of ideas. Still, the memory-overhead is just
>  8 bytes (long) * 2 (arrays) * maxDoc
> and performance should be high, as Arrays.sort(long[]) is fast and
> everything runs without taxing the garbage collector.
> 
> 
> Caveat lector: I haven't implemented the stuff above, so it's just an
> idea written in not-so-pseudo code.
> 
> Regards,
> Toke Eskildsen
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>



Re: External sort

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Thu, 2009-12-17 at 07:48 +0100, Ganesh wrote:
> I am using v2.9.1 I am having multiple shards and i need to do only 
> date time sorting. Sorting consumes 50% of RAM.

I'm guessing that your date-times are representable in 32 bits (signed
seconds since epoch or such - that'll work until 2038)? If so, it should
be possible to do very efficient sorting, both memory- and
performance-wise.

Make your own sorter by implementing SortComparatorSource.

The SortComparatorSource returns a ScoreDocComparator which contains an
array of longs, in which the first 32 bits designates the order of the
document at the given index (docID) and the last 32 bits holds the date
time.

The ScoreDocComparator's methods are trivial:
  public int compare (ScoreDoc i, ScoreDoc j) {
    // order sits in the high 32 bits, so comparing the packed longs compares by order;
    // don't return the long difference directly - it overflows int
    return order[i.doc] < order[j.doc] ? -1 : order[i.doc] > order[j.doc] ? 1 : 0;
  }
  public Comparable sortValue (ScoreDoc i) {
    // the mask must be the long literal 0xFFFFFFFFL; the int literal sign-extends to -1
    return Integer.valueOf((int)(order[i.doc] & 0xFFFFFFFFL));
  }
  public int sortType(){
    return SortField.CUSTOM;
  }

Now, for generating the order-array, we do something like this in the
SortComparatorSource:

  TermDocs termDocs = reader.termDocs();
  // inverted[docID] == datetime | docID
  long[] inverted = new long[reader.maxDoc()];
  TermEnum termEnum = reader.terms(new Term(fieldname, ""));

  do {
    Term term = termEnum.term();
    if (term == null || !fieldname.equals(term.field())) {
      break;
    }
    long dateTime = (long)stringDateTimeToInt(term.text()) << 32;
    termDocs.seek(termEnum);
    while (termDocs.next()) {
      inverted[termDocs.doc()] = dateTime | termDocs.doc();
    }
  } while (termEnum.next());
  termEnum.close();

  // inverted[order] == datetime | docID
  Arrays.sort(inverted); // works while the packed values are non-negative (datetimes 1970+)

  // order[docID] == order | datetime
  long[] order = new long[inverted.length];
  for (int o = 0 ; o < inverted.length ; o++) {
    int docID = (int)(inverted[o] & 0xFFFFFFFFL);
    order[docID] = ((long)o << 32) | (inverted[o] >>> 32);
  }

It would be nice to avoid the extra array needed for reordering, but I'm
fresh out of ideas. Still, the memory-overhead is just
  8 bytes (long) * 2 (arrays) * maxDoc
and performance should be high, as Arrays.sort(long[]) is fast and
everything runs without taxing the garbage collector.


Caveat lector: I haven't implemented the stuff above, so it's just an
idea written in not-so-pseudo code.
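The bit-packing at the heart of the idea can at least be verified in isolation (self-contained, with made-up datetime values standing in for the term data):

```java
import java.util.Arrays;

public class PackedOrderDemo {
    // Build order[docID] == (order << 32) | datetime from a docID -> datetime mapping.
    static long[] buildOrder(int[] dateTimeByDoc) {
        // inverted[docID] == datetime | docID
        long[] inverted = new long[dateTimeByDoc.length];
        for (int doc = 0; doc < inverted.length; doc++) {
            inverted[doc] = ((long) dateTimeByDoc[doc] << 32) | doc;
        }
        // After sorting: inverted[order] == datetime | docID.
        Arrays.sort(inverted); // valid while the datetimes are non-negative
        long[] order = new long[inverted.length];
        for (int o = 0; o < inverted.length; o++) {
            int docID = (int) (inverted[o] & 0xFFFFFFFFL);
            order[docID] = ((long) o << 32) | (inverted[o] >>> 32);
        }
        return order;
    }

    public static void main(String[] args) {
        long[] order = buildOrder(new int[] {500, 100, 300}); // doc 1 is the oldest
        for (long packed : order) {
            System.out.println("order=" + (packed >>> 32)
                + " datetime=" + (packed & 0xFFFFFFFFL));
        }
    }
}
```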

Regards,
Toke Eskildsen





Re: External sort

Posted by Michael McCandless <lu...@mikemccandless.com>.
There was a recent thread on neat algos that can compute the global
order in multiple passes to keep RAM usage low.

Mike

On Thu, Dec 17, 2009 at 1:48 AM, Ganesh <em...@yahoo.co.in> wrote:
> Hello all,
>
> We are facing serious issues related to Sorting in production environment. I know this issues has been discussed in this group. I am using v2.9.1 I am having multiple shards and i need to do only date time sorting. Sorting consumes 50% of RAM.
>
> Is anyone has attempted to use external sorting?
>
> Is there any way to index the records in sorted order. Sort and Merge the records at indexing time and pull the records sorted by doc id?
>
> Any other ideas.... I need to do sort with less memory. Performance is not the concern.
>
> Regards
> Ganesh
>
>
>
> Send instant messages to your online friends http://in.messenger.yahoo.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
