Posted to dev@lucene.apache.org by Michael McCandless <lu...@mikemccandless.com> on 2007/04/05 17:54:58 UTC

Re: Eliminate postings hash (was Re: improve how IndexWriter uses RAM...)

"Marvin Humphrey" <ma...@rectangular.com> wrote:
> 
> On Apr 5, 2007, at 3:58 AM, Michael McCandless wrote:
> 
> > The one thing that still baffles me is: I can't get a persistent
> > Posting hash to be any faster.
> 
> Don't use a hash, then.  :)
> 
> KS doesn't.
> 
>    * Give Token a "position" member.
>    * After you've accumulated all the Tokens, calculate the
>      position for each token from its position increment.
>    * Arrange the postings in an array sorted by position.
>    * Count the number of postings in a row with identical
>      text to get freq.
> 
> Relevant code from KinoSearch::Analysis::TokenBatch below.

OOOOOH!  I like that approach!

So you basically do not "de-dup" by field+term on your first pass
through the tokens in the doc (which is "roughly" what that hash
does).  Instead, append all tokens in an array, then sort first by
field+text and second by position?  This is done for each document
right?

This seems like it could be a major win!
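To make sure I follow, here's a minimal sketch of the sort-then-count-runs
idea as I understand it -- names are invented and this is not the KS code
(e.g. I'm using a plain string for the term text):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token record: term text plus the absolute position
 * computed from the position increments. */
typedef struct { const char *text; int pos; } Tok;

/* Sort by text first, then by position. */
static int tok_cmp(const void *va, const void *vb) {
    const Tok *a = va;
    const Tok *b = vb;
    int c = strcmp(a->text, b->text);
    return c != 0 ? c : (a->pos - b->pos);
}

/* After sorting, the freq for a term is just the length of the run
 * of tokens with identical text.  Returns the freq of the run
 * starting at index i; the caller advances i by the return value. */
static int run_freq(const Tok *toks, int n, int i) {
    int j = i + 1;
    while (j < n && strcmp(toks[j].text, toks[i].text) == 0)
        j++;
    return j - i;
}
```

So after `qsort(toks, n, sizeof(Tok), tok_cmp)`, one linear walk over the
runs yields each unique term, its freq, and its positions in order.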

Did you ever compare this approach against a hash (or some other
de-dup data structure, a letter trie or something)?

I guess it depends on how many total terms you have in the doc vs how
many unique terms you have in the doc.  Qsort is O(N log N), and the
comparison of two terms is somewhat costly.  With de-duping, you pay a
hash lookup/insert cost (plus the cost of managing little int buffers
to hold positions/offsets/etc.) per term, but then only qsort on the
number of unique terms.

Whereas with your approach you don't pay any de-duping cost, but your
qsort is over the total # of terms, not the # of unique terms.  I bet your
approach is quite a bit faster for small to medium sized docs (by
far the "norm") but not faster for very large docs?  Or maybe it's
even faster for very large docs because qsort is so darned fast :)

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Eliminate postings hash (was Re: improve how IndexWriter uses RAM...)

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Marvin Humphrey" <ma...@rectangular.com> wrote:
> 
> On Apr 5, 2007, at 8:54 AM, Michael McCandless wrote:
> 
> > So you basically do not "de-dup" by field+term on your first pass
> > through the tokens in the doc (which is "roughly" what that hash
> > does).  Instead, append all tokens in an array, then sort first by
> > field+text and second by position?  This is done for each document
> > right?
> 
> Almost.  The sorting is done per-field.
>
> Token doesn't have a "field", so comparison is cheaper than you're  
> thinking.

Got it.  I've done exactly that (process one field's tokens at a time)
with LUCENE-843 as well.

> > Did you ever compare this approach against a hash (or some other
> > de-dup data structure, a letter trie or something)?
> 
> KS used to use hashing, though it wasn't directly analogous to how  
> Lucene does things.  I've only tried these two techniques.  This was  
> faster by about 30%, but the difference is not all in the de-duping.

OK.  30% is very nice :)

> KS is concerned with preparing serialized postings to feed into an  
> external sorter.  In the hashing stratagem, every position added to a  
> term_text => serialized_posting pair in the hash requires a string  
> concatenation onto the end of serialized_posting, and thus a call to  
> realloc().
>
> Besides switching out hashing overhead for qsort overhead, the  
> sorting technique also allows KS to know up front how many positions  
> are associated with each posting, so the memory for the serialized  
> string only has to be allocated once.

Yeah those realloc()'s are costly (Lucene trunk has them too).  In
LUCENE-843 I found a way to share large int[] arrays so that a given
posting uses slices into the shared arrays instead of doing reallocs.
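A rough sketch of that shared-array idea (invented names, not the actual
LUCENE-843 code): keep one big int buffer and hand out slices of it, so
no posting ever owns a tiny realloc'd array of its own:

```c
#include <assert.h>
#include <stdlib.h>

/* One big shared int buffer; postings get slices of it instead of
 * each realloc'ing a private array.  (Hypothetical sketch.) */
typedef struct {
    int *buf;
    int  cap;
    int  used;
} IntPool;

static void pool_init(IntPool *p, int cap) {
    p->buf  = malloc((size_t)cap * sizeof(int));
    p->cap  = cap;
    p->used = 0;
}

/* Hand out the next slice of `len` ints, or NULL if the pool is
 * exhausted (a real implementation would chain to a new buffer). */
static int *pool_slice(IntPool *p, int len) {
    if (p->used + len > p->cap) return NULL;
    int *slice = p->buf + p->used;
    p->used += len;
    return slice;
}
```

The win is that growth cost is amortized across all postings at once,
and freeing everything after a flush is a single free of the pool.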

I think I'm doing something similar to feeding an external sorter: I'm
just using the same approach as Lucene's segment merging of the
postings, optimized somewhat to handle a very large number of segments
at once (for the merging of the "level 0" single document segments).
I use this same merger to merge level N RAM segments to level N+1 RAM
segments, to merge RAM segments into a single flushed segment, to
merge flushed segments into a single flushed segment, and then finally
to merge flushed and RAM segments into the "real" Lucene segment at
the end.
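As a toy illustration of the many-way merge (not the real merger, which
is optimized for very many segments and would use a priority queue
rather than a linear scan of the heads):

```c
#include <assert.h>
#include <string.h>

/* Toy N-way merge over sorted arrays of term strings: repeatedly
 * take the smallest head across all segments. */
typedef struct { const char **terms; int len; int pos; } Seg;

/* Writes the merged sequence into `out`; returns its length. */
static int merge_segs(Seg *segs, int nsegs, const char **out) {
    int n = 0;
    for (;;) {
        int best = -1;
        for (int i = 0; i < nsegs; i++) {
            if (segs[i].pos >= segs[i].len) continue;  /* exhausted */
            if (best < 0 || strcmp(segs[i].terms[segs[i].pos],
                                   segs[best].terms[segs[best].pos]) < 0)
                best = i;
        }
        if (best < 0) return n;  /* all segments exhausted */
        out[n++] = segs[best].terms[segs[best].pos++];
    }
}
```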

I think it differs from an external sorter in that I manage explicitly
when to flush a "run" to disk (autoCommit=false case) or to flush it
to a Lucene segment (autoCommit=true case) rather than letting the
sorter API decide.

Mike



Re: Eliminate postings hash (was Re: improve how IndexWriter uses RAM...)

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Apr 5, 2007, at 8:54 AM, Michael McCandless wrote:

> So you basically do not "de-dup" by field+term on your first pass
> through the tokens in the doc (which is "roughly" what that hash
> does).  Instead, append all tokens in an array, then sort first by
> field+text and second by position?  This is done for each document
> right?

Almost.  The sorting is done per-field.

Token doesn't have a "field", so comparison is cheaper than you're  
thinking.

int
Token_compare(const void *va, const void *vb)
{
    Token *const a = *((Token**)va);
    Token *const b = *((Token**)vb);

    size_t min_len = a->len < b->len ? a->len : b->len;

    int comparison = memcmp(a->text, b->text, min_len);

    if (comparison == 0) {
        if (a->len != b->len) {
            comparison = a->len < b->len ? -1 : 1;
        }
        else {
            comparison = a->pos < b->pos ? -1 : 1;
        }
    }

    return comparison;
}
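For completeness, a self-contained version with a minimal Token layout
inferred from the comparator (the real KS struct has more members) and
the qsort call over an array of Token pointers:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Minimal Token layout inferred from the comparator; the real
 * KinoSearch struct carries more fields. */
typedef struct Token {
    const char *text;
    size_t      len;
    int         pos;
} Token;

/* Same logic as the comparator above: text, then length, then
 * position breaks the tie. */
int
Token_compare(const void *va, const void *vb)
{
    Token *const a = *((Token**)va);
    Token *const b = *((Token**)vb);
    size_t min_len = a->len < b->len ? a->len : b->len;
    int comparison = memcmp(a->text, b->text, min_len);
    if (comparison == 0) {
        if (a->len != b->len)
            comparison = a->len < b->len ? -1 : 1;
        else
            comparison = a->pos < b->pos ? -1 : 1;
    }
    return comparison;
}
```

Since equal-text tokens sort by position, each term's positions come
out already in order, ready to serialize.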

> Did you ever compare this approach against a hash (or some other
> de-dup data structure, a letter trie or something)?

KS used to use hashing, though it wasn't directly analogous to how  
Lucene does things.  I've only tried these two techniques.  This was  
faster by about 30%, but the difference is not all in the de-duping.

KS is concerned with preparing serialized postings to feed into an  
external sorter.  In the hashing stratagem, every position added to a  
term_text => serialized_posting pair in the hash requires a string  
concatenation onto the end of serialized_posting, and thus a call to  
realloc().

Besides switching out hashing overhead for qsort overhead, the  
sorting technique also allows KS to know up front how many positions  
are associated with each posting, so the memory for the serialized  
string only has to be allocated once.
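Roughly, with an invented serialization layout (not KS's actual
format), the "allocate once" point looks like this: after sorting, the
freq of a run is known before any bytes are written, so the buffer can
be sized exactly up front:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy serialized posting: term bytes, then freq, then the position
 * list, each int copied raw.  Layout is invented for the sketch. */
static char *serialize_posting(const char *term, size_t term_len,
                               const int *positions, int freq,
                               size_t *size_out) {
    /* Exact size is known up front: no realloc per position. */
    size_t size = term_len + sizeof(int) * (size_t)(1 + freq);
    char *buf = malloc(size);
    char *p = buf;
    memcpy(p, term, term_len);                       p += term_len;
    memcpy(p, &freq, sizeof(int));                   p += sizeof(int);
    memcpy(p, positions, sizeof(int) * (size_t)freq);
    *size_out = size;
    return buf;
}
```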

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


