Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2006/06/25 23:18:13 UTC
TokenBatch
Greets,
Lucene's TokenStream concept does not translate very well from Java
to our target languages, primarily because calling next() once for
each analyzer-token pairing creates a lot of method-call overhead.
Also, the analysis process creates a *lot* of Token objects; in
Plucene these were hash-based objects, relatively expensive to create
and destroy.
The solution adopted by KinoSearch is the TokenBatch. Instead of
passing tokens one by one through an analysis chain, they are grouped
together and passed en masse from analyzer to analyzer. Analyzers do
not supply a next() method; they supply an analyze() method, which
both accepts and returns a TokenBatch object. Analysis chains are
set up using a PolyAnalyzer object, which is an ordered array of
Analyzers, each of which will be called upon to analyze() a
TokenBatch in turn.
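A minimal sketch of what that chain might look like in C (the names here, such as PolyAnalyzer_analyze and the function-pointer layout, are illustrative assumptions on my part, not Lucy's actual API):

```c
#include <stddef.h>

/* Illustrative sketch only.  Each Analyzer consumes a TokenBatch and
 * returns a (possibly new) TokenBatch via its analyze() method. */
typedef struct TokenBatch TokenBatch;
typedef struct Analyzer   Analyzer;

struct Analyzer {
    TokenBatch* (*analyze)(Analyzer *self, TokenBatch *batch);
};

typedef struct PolyAnalyzer {
    Analyzer **analyzers;      /* ordered array of sub-analyzers */
    size_t     num_analyzers;
} PolyAnalyzer;

/* Pass the batch through each sub-analyzer in turn. */
TokenBatch*
PolyAnalyzer_analyze(PolyAnalyzer *self, TokenBatch *batch)
{
    for (size_t i = 0; i < self->num_analyzers; i++) {
        Analyzer *child = self->analyzers[i];
        batch = child->analyze(child, batch);
    }
    return batch;
}
```

Since each analyze() call handles a whole batch, the per-token method-call overhead disappears from the dispatch layer entirely.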
Implementing TokenBatch as a doubly-linked list of Token structs
seems to work pretty well.
typedef struct lucy_Token {
    char       *text;
    lucy_i32_t  len;
    lucy_i32_t  start_offset;
    lucy_i32_t  end_offset;
    lucy_i32_t  pos_inc;
    struct lucy_Token *next;
    struct lucy_Token *prev;
} lucy_Token;

typedef struct lucy_TokenBatch {
    lucy_Token *first;
    lucy_Token *last;
    lucy_Token *current;
    lucy_i32_t  size;
    lucy_bool_t initialized;
} lucy_TokenBatch;
(We might define token->text as either an 8 or 16 bit type depending
on how the target platform handles strings, but that's a topic for
another day.)
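For illustration, appending a Token to the tail of such a batch might look like the following (TokenBatch_append is a hypothetical name, and the typedefs are stand-ins so the sketch compiles on its own):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Stand-ins for Lucy's integer typedefs. */
typedef int32_t lucy_i32_t;
typedef bool    lucy_bool_t;

typedef struct lucy_Token {
    char       *text;
    lucy_i32_t  len;
    lucy_i32_t  start_offset;
    lucy_i32_t  end_offset;
    lucy_i32_t  pos_inc;
    struct lucy_Token *next;
    struct lucy_Token *prev;
} lucy_Token;

typedef struct lucy_TokenBatch {
    lucy_Token *first;
    lucy_Token *last;
    lucy_Token *current;
    lucy_i32_t  size;
    lucy_bool_t initialized;
} lucy_TokenBatch;

/* Hypothetical append: link a Token onto the tail of the batch.
 * O(1), thanks to the `last` pointer. */
void
TokenBatch_append(lucy_TokenBatch *batch, lucy_Token *token)
{
    token->next = NULL;
    token->prev = batch->last;
    if (batch->last != NULL) { batch->last->next = token; }
    else                     { batch->first      = token; }
    batch->last = token;
    batch->size++;
}
```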
The tricky part here is how to expose TokenBatch and its constituent
tokens as a native API, which is necessary for subclassing Analyzer.
Core Analyzer subclasses distributed with Lucy might peek into the
guts of Token structs. However, it would be a bad idea to make *any*
C struct's makeup part of Lucy's public API. The only supported way
to get at a struct's members should be through methods.
The obvious way to affect individual Tokens would be to spin them off
as native objects when iterating over the TokenBatch object.
Illustrated in Java:
while ((token = batch.next()) != null) {
    token.setText( lowerCase(token.getText()) );
}
However, that introduces a garbage collection issue. Who's
responsible for destroying the individual Token structs? We can't
have them destroyed twice, once when the TokenBatch gets destroyed,
and once when the spun-off token gets destroyed. I see two
possibilities, neither of which is attractive. We might implement
our own reference-counting scheme, which would be messy and
complicated. Or we could start off by creating each token as a
native object and defer to native GC, which is messy, complicated,
and expensive to boot.
The least bad solution I see is to avoid exposing the individual
tokens via native API, and allow access to them one at a time via
methods against the TokenBatch object. next() would return a boolean
indicating whether the TokenBatch was currently located at a valid
Token position, instead of a native Token object.
while (batch.next()) {
    batch.setText( lowerCase(batch.getText()) );
}
This is perhaps a little less intuitive than iterating over a faux-
array of Tokens, but it's similar to how Lucene's TermDocs, TermEnum
and Scorer classes all work.
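A sketch of how that cursor-style API might be implemented in C, using the structs above (the function names, the Token_new helper, and the use of the `initialized` flag to mark a started cursor are all assumptions on my part; offset fields are omitted for brevity):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef int32_t lucy_i32_t;
typedef bool    lucy_bool_t;

typedef struct lucy_Token {
    char       *text;
    lucy_i32_t  len;
    struct lucy_Token *next;
    struct lucy_Token *prev;
} lucy_Token;

typedef struct lucy_TokenBatch {
    lucy_Token *first;
    lucy_Token *last;
    lucy_Token *current;
    lucy_i32_t  size;
    lucy_bool_t initialized;
} lucy_TokenBatch;

/* Portable strdup stand-in. */
static char*
copy_string(const char *src)
{
    size_t size = strlen(src) + 1;
    char  *dst  = (char*)malloc(size);
    memcpy(dst, src, size);
    return dst;
}

/* Allocate a Token owning a copy of `text`. */
lucy_Token*
Token_new(const char *text)
{
    lucy_Token *token = (lucy_Token*)calloc(1, sizeof(lucy_Token));
    token->text = copy_string(text);
    token->len  = (lucy_i32_t)strlen(text);
    return token;
}

/* Link a Token onto the tail of the batch. */
void
TokenBatch_append(lucy_TokenBatch *batch, lucy_Token *token)
{
    token->prev = batch->last;
    if (batch->last) { batch->last->next = token; }
    else             { batch->first      = token; }
    batch->last = token;
    batch->size++;
}

/* Advance the cursor; true while positioned on a valid Token. */
bool
TokenBatch_next(lucy_TokenBatch *batch)
{
    if (!batch->initialized) {
        batch->current     = batch->first;  /* first call starts at head */
        batch->initialized = true;
    }
    else if (batch->current != NULL) {
        batch->current = batch->current->next;
    }
    return batch->current != NULL;
}

const char*
TokenBatch_get_text(lucy_TokenBatch *batch)
{
    return batch->current->text;
}

/* Replace the current Token's text with a copy of `text`. */
void
TokenBatch_set_text(lucy_TokenBatch *batch, const char *text)
{
    free(batch->current->text);
    batch->current->text = copy_string(text);
    batch->current->len  = (lucy_i32_t)strlen(text);
}
```

In this sketch, once the cursor runs off the end, next() keeps returning false until something resets the cursor, which matches the exhausted-iterator feel of Lucene's enumeration classes.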
It's also the fastest scheme I've come up with. Without making
TokenBatch a native array of Token objects...
for my $token (@tokens) {
    $token->set_text( $token->get_text );
}
... there's no way to avoid the method-call overhead of next(). Any
user-defined Analyzer subclass implemented via the native API is
therefore going to be a little slower than the core Analyzer classes,
which are allowed to access struct members directly, but that's the
way it goes. If we supply a good set of flexible Analyzer classes,
hopefully few people will need to write custom Analyzers and their
indexing apps will stay fast enough.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: TokenBatch
Posted by David Balmain <db...@gmail.com>.
On 6/26/06, Marvin Humphrey <ma...@rectangular.com> wrote:
> Lucene's TokenStream concept does not translate very well from Java
> to our target languages, primarily because calling next() once for
> each analyzer-token pairing creates a lot of method-call overhead.
> Also, the analysis process creates a *lot* of Token objects; in
> Plucene these were hash-based objects, relatively expensive to create
> and destroy.
> [...]
FYI, Ferret's Analyzers each have a single Token object with space for
MAX_WORD_LENGTH characters. TokenFilters just use the Token owned by
the Tokenizer that they filter. This way a new token doesn't need to
be allocated for each term. I'm not sure how much time is to be gained
by having a batch of tokens. I'll have to do some experiments.
Cheers,
Dave