Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2006/06/25 23:18:13 UTC

TokenBatch

Greets,

Lucene's TokenStream concept does not translate very well from Java  
to our target languages, primarily because calling next() once for  
each analyzer-token pairing creates a lot of method-call overhead.   
Also, the analysis process creates a *lot* of Token objects; in  
Plucene these were hash-based objects, relatively expensive to create  
and destroy.

The solution adopted by KinoSearch is the TokenBatch.  Instead of  
passing tokens one by one through an analysis chain, they are grouped  
together and passed en masse from analyzer to analyzer.  Analyzers do  
not supply a next() method; they supply an analyze() method, which  
both accepts and returns a TokenBatch object.  Analysis chains are  
set up using a PolyAnalyzer object, which is an ordered array of  
Analyzers, each of which will be called upon to analyze() a  
TokenBatch in turn.
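
Roughly, the chain might look something like this in C (a sketch only;
the signatures and the lucy_PolyAnalyzer_analyze name are made up, not
settled API):

#include <stddef.h>

typedef struct lucy_Analyzer lucy_Analyzer;

struct lucy_Analyzer {
     /* Each Analyzer consumes a TokenBatch and returns one, possibly
      * the same one, possibly a replacement. */
     struct lucy_TokenBatch* (*analyze)(lucy_Analyzer *self,
                                        struct lucy_TokenBatch *batch);
};

/* A PolyAnalyzer is an ordered array of Analyzers; it runs the batch
 * through each of them in turn. */
struct lucy_TokenBatch*
lucy_PolyAnalyzer_analyze(lucy_Analyzer **analyzers, size_t num_analyzers,
                          struct lucy_TokenBatch *batch)
{
     size_t i;
     for (i = 0; i < num_analyzers; i++) {
          batch = analyzers[i]->analyze(analyzers[i], batch);
     }
     return batch;
}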

Implementing TokenBatch as a doubly-linked list of Token structs  
seems to work pretty well.

typedef struct lucy_Token {
     char              *text;          /* token text */
     lucy_i32_t         len;           /* length of text */
     lucy_i32_t         start_offset;  /* start offset in the source text */
     lucy_i32_t         end_offset;    /* end offset in the source text */
     lucy_i32_t         pos_inc;       /* position increment */
     struct lucy_Token *next;          /* next Token in the batch */
     struct lucy_Token *prev;          /* previous Token in the batch */
} lucy_Token;

typedef struct lucy_TokenBatch {
     lucy_Token  *first;        /* head of the doubly-linked list */
     lucy_Token  *last;         /* tail of the doubly-linked list */
     lucy_Token  *current;      /* cursor used during iteration */
     lucy_i32_t   size;         /* number of Tokens in the batch */
     lucy_bool_t  initialized;  /* has iteration begun? */
} lucy_TokenBatch;

(We might define token->text as either an 8 or 16 bit type depending  
on how the target platform handles strings, but that's a topic for  
another day.)
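
For illustration, filling a batch is just tail-splicing onto the list;
a rough sketch, assuming the structs above (lucy_Token_new and
lucy_TokenBatch_append are made-up names):

#include <stdlib.h>
#include <string.h>

static lucy_Token*
lucy_Token_new(const char *text, lucy_i32_t len,
               lucy_i32_t start_offset, lucy_i32_t end_offset)
{
     lucy_Token *token   = calloc(1, sizeof(lucy_Token));
     token->text         = malloc(len + 1);
     memcpy(token->text, text, len);
     token->text[len]    = '\0';
     token->len          = len;
     token->start_offset = start_offset;
     token->end_offset   = end_offset;
     token->pos_inc      = 1;
     return token;       /* next and prev start out NULL via calloc */
}

static void
lucy_TokenBatch_append(lucy_TokenBatch *batch, lucy_Token *token)
{
     if (batch->last == NULL) {          /* first token in the batch */
          batch->first = token;
     }
     else {                              /* splice onto the tail */
          token->prev       = batch->last;
          batch->last->next = token;
     }
     batch->last = token;
     batch->size++;
}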

The tricky part here is how to expose TokenBatch and its constituent  
tokens as a native API, which is necessary for subclassing Analyzer.

Core Analyzer subclasses distributed with Lucy might peek into the  
guts of Token structs.  However, it would be a bad idea to make *any*  
C struct's makeup part of Lucy's public API.  The only supported way  
to get at a struct's members should be through methods.

The obvious way to affect individual Tokens would be to spin them off  
as native objects when iterating over the TokenBatch object.   
Illustrated in Java:

   while ((token = batch.next()) != null) {
     token.setText( lowerCase(token.getText()) );
   }

However, that introduces a garbage collection issue.  Who's  
responsible for destroying the individual Token structs?  We can't  
have them destroyed twice, once when the TokenBatch gets destroyed,  
and once when the spun-off token gets destroyed.  I see two  
possibilities, neither of which is attractive.  We might implement  
our own reference-counting scheme, which would be messy and  
complicated.  Or we could start off by creating each token as a  
native object and defer to native GC, which is messy, complicated,  
and expensive to boot.

The least worst solution I see is to avoid exposing the individual  
tokens via native API, and allow access to them one at a time via  
methods against the TokenBatch object.  next() would return a boolean  
indicating whether the TokenBatch was currently located at a valid  
Token position, instead of a native Token object.

   while (batch.next()) {
     batch.setText( lowerCase(batch.getText()) );
   }
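
Under the hood, next() would just advance the batch's current pointer,
using the initialized flag to tell the first call apart, and the
accessors would operate on whichever Token the cursor points at.  A
rough sketch, again with made-up function names:

#include <stdlib.h>
#include <string.h>

lucy_bool_t
lucy_TokenBatch_next(lucy_TokenBatch *batch)
{
     if (!batch->initialized) {
          batch->current     = batch->first;  /* first call: start at head */
          batch->initialized = 1;
     }
     else if (batch->current != NULL) {
          batch->current = batch->current->next;
     }
     return batch->current != NULL;
}

const char*
lucy_TokenBatch_get_text(lucy_TokenBatch *batch)
{
     return batch->current->text;
}

void
lucy_TokenBatch_set_text(lucy_TokenBatch *batch, const char *text,
                         lucy_i32_t len)
{
     lucy_Token *token = batch->current;
     free(token->text);
     token->text = malloc(len + 1);
     memcpy(token->text, text, len);
     token->text[len] = '\0';
     token->len       = len;
}

Since the Tokens are never handed out as native objects, the TokenBatch
stays their sole owner and can free the whole list when it is destroyed.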

This is perhaps a little less intuitive than iterating over a faux
array of Tokens, but it's similar to how Lucene's TermDocs, TermEnum
and Scorer classes all work.

It's also the fastest scheme I've come up with.  Without making  
TokenBatch a native array of Token objects...

   for my $token (@tokens) {
     $token->set_text( $token->get_text );
   }

... there's no way to avoid the method-call overhead of next().  Any
user-defined subclass of Analyzer implemented via the native API is
therefore going to be a little slower than the core Analyzer classes,
which are allowed to access struct members directly, but that's the
way it goes.  If we supply a good set of flexible Analyzer classes,
hopefully it won't often be necessary for people to write custom
Analyzers, and their indexing apps will stay fast enough.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: TokenBatch

Posted by David Balmain <db...@gmail.com>.
On 6/26/06, Marvin Humphrey <ma...@rectangular.com> wrote:
> Greets,
>
> Lucene's TokenStream concept does not translate very well from Java
> to our target languages, primarily because calling next() once for
> each analyzer-token pairing creates a lot of method-call overhead.
> Also, the analysis process creates a *lot* of Token objects; in
> Plucene these were hash-based objects, relatively expensive to create
> and destroy.
>
> The solution adopted by KinoSearch is the TokenBatch.  Instead of
> passing tokens one by one through an analysis chain, they are grouped
> together and passed en masse from analyzer to analyzer.  Analyzers do
> not supply a next() method; they supply an analyze() method, which
> both accepts and returns a TokenBatch object.  Analysis chains are
> set up using a PolyAnalyzer object, which is an ordered array of
> Analyzers, each of which will be called upon to analyze() a
> TokenBatch in turn.

FYI, Ferret's Analyzers each have a single Token object with space for
MAX_WORD_LENGTH characters. TokenFilters just use the Token owned by
the Tokenizer that they filter. This way a new token doesn't need to
be allocated for each term. I'm not sure how much time is to be gained
by having a batch of tokens. I'll have to do some experiments.
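
Roughly, in C, that looks something like this (a sketch only; the
struct layout, field names, and the 255 limit are illustrative):

#define MAX_WORD_LENGTH 255   /* illustrative value */

typedef struct Token {
     char  text[MAX_WORD_LENGTH + 1];  /* fixed buffer, refilled per term */
     int   len;
     int   start;
     int   end;
     int   pos_inc;
} Token;

/* The Tokenizer owns this one Token and overwrites it for each term;
 * downstream TokenFilters modify the same Token in place, so nothing
 * new is allocated per term. */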

Cheers,
Dave