Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2011/03/08 18:36:03 UTC

[lucy-dev] RegexTokenizer

Greets,

Right now, Lucy only has one tokenizer-style Analyzer subclass:
Lucy::Analysis::Tokenizer, which is regex based.  

At some point, I expect we will have other tokenizer classes which don't use a
regex engine, so I think it would be best to reserve the name "Tokenizer" for
future use and rename the current Tokenizer to "RegexTokenizer".

Another possibility would be "PerlRegexTokenizer", embedding the regex dialect
that will be used to interpret the supplied pattern in the class name.
However, the exact behavior of the regular expression engine is not consistent
across different versions of Perl.  In general, it's not going to be possible
to translate a pattern between different regex engines.  If we try to specify
the regex dialect precisely so that the tokenization behavior is fully defined
by the serialized analyzer within the schema file, the only remedy on mismatch
will be to throw an exception and refuse to read the index.

Therefore, I think we should just have a single class named "RegexTokenizer"
which is defined as deferring to the host language's regex engine.  Managing
portability across different host languages or different versions of the host
language will be left to the user.
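
To make that concrete, here's a rough sketch of what the renamed class might
look like in use from Perl, assuming the constructor keeps the current
Tokenizer's pattern argument (nothing has been renamed yet, so treat this as
illustrative only):

    use Lucy::Analysis::RegexTokenizer;
    use Lucy::Analysis::CaseFolder;
    use Lucy::Analysis::PolyAnalyzer;

    # The supplied pattern is compiled by the host's regex engine (Perl's,
    # in this binding).
    my $tokenizer = Lucy::Analysis::RegexTokenizer->new( pattern => '\w+' );
    my $analyzer  = Lucy::Analysis::PolyAnalyzer->new(
        analyzers => [ Lucy::Analysis::CaseFolder->new, $tokenizer ],
    );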

Marvin Humphrey


Re: [lucy-dev] RegexTokenizer

Posted by "Andrew S. Townley" <as...@atownley.org>.
On 8 Mar 2011, at 8:23 PM, Marvin Humphrey wrote:

> On Tue, Mar 08, 2011 at 07:35:59PM +0000, Andrew S. Townley wrote:
>> Based on your answers though, it still seems like this should be possible
>> using a C++ as C host implementation strategy--convoluted as it may sound.
> 
> You may be interested to know that Lucy has to compile under C++ because we
> run MSVC in C++ mode in order to get support for mixed declarations and code.
> Our dialect of C is the intersection of C++ and C99.

Interesting, but hardly surprising for that environment.  I remember trying to get MSVC to digest pre-release versions of the STL and standard C++ library back in 95/96, and it was a particularly schizophrenic beast in terms of language feature support back then.  I think I last used it to compile anything in C++ around '98/99, so I wouldn't have much faith in it keeping up with anything close to cutting-edge features.

This time around, I'm hoping to successfully avoid it as much as possible for as long as possible! :)  Good to know that it's regularly validated with C++ compilation, though.

Cheers,

ast
--
Andrew S. Townley <as...@atownley.org>
http://atownley.org


Re: [lucy-dev] RegexTokenizer

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Mar 08, 2011 at 07:35:59PM +0000, Andrew S. Townley wrote:
> Based on your answers though, it still seems like this should be possible
> using a C++ as C host implementation strategy--convoluted as it may sound.

You may be interested to know that Lucy has to compile under C++ because we
run MSVC in C++ mode in order to get support for mixed declarations and code.
Our dialect of C is the intersection of C++ and C99.

Marvin Humphrey


Re: [lucy-dev] RegexTokenizer

Posted by "Andrew S. Townley" <as...@atownley.org>.
On 8 Mar 2011, at 7:24 PM, Marvin Humphrey wrote:

> On Tue, Mar 08, 2011 at 05:50:34PM +0000, Andrew S. Townley wrote:
> 
>> If you wanted to lock it in across host languages, then you could always
>> implement this in C using the library of your choice due to the
>> architecture, right?
> 
> Yes, most likely using PCRE.  I think that would make sense to implement as an
> extension, distributed separately.  Bundling PCRE with core Lucy would provide
> very little benefit at a large cost, though.  Every host provides a regex
> engine that users are already familiar with, and I expect that few users will
> require indexes to work across multiple hosts.

Unless you're doing something crazy like I plan to do (eventually, down the line): make a common C++ codebase the Lucy client, then expose that C++ codebase's API in multiple host languages accessing the same underlying store infrastructure. ;)

In the current architecture I have with Ferret, even though everything's Ruby, nobody ever touches Ferret directly.  The future architecture will have similar characteristics, but the implementation languages and components will be different.

In this case, however, the regex support will most likely be implemented via Boost's regex engine(s).  That's one of the other reasons I want match offset information to be available from the search library directly instead of depending on anything from the (ultimate) host language.

Based on your answers though, it still seems like this should be possible using a C++ as C host implementation strategy--convoluted as it may sound.

Cheers,

ast
--
Andrew S. Townley <as...@atownley.org>
http://atownley.org


Re: [lucy-dev] RegexTokenizer

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Mar 08, 2011 at 05:50:34PM +0000, Andrew S. Townley wrote:
> Tokenizer for the interface and RegexTokenizer for platform-specific regexes
> (which, in fairness, is kinda what people would expect anyway).

Yes, that's the idea.  :)

> Many things support Perl5 regexes to varying degrees, so you'd likely not
> have too much trouble from a portability perspective.  

That's true, but I think it makes sense to endorse the full use of the host
language's regex engine if that's possible.  (It will be a little tricky to
make the analysis chain work with different host string encodings.)

> If you wanted to lock it in across host languages, then you could always
> implement this in C using the library of your choice due to the
> architecture, right?

Yes, most likely using PCRE.  I think that would make sense to implement as an
extension, distributed separately.  Bundling PCRE with core Lucy would provide
very little benefit at a large cost, though.  Every host provides a regex
engine that users are already familiar with, and I expect that few users will
require indexes to work across multiple hosts.

Marvin Humphrey


Re: [lucy-dev] RegexTokenizer

Posted by "Andrew S. Townley" <as...@atownley.org>.
On 8 Mar 2011, at 5:36 PM, Marvin Humphrey wrote:

> Greets,
> 
> Right now, Lucy only has one tokenizer-style Analyzer subclass:
> Lucy::Analysis::Tokenizer, which is regex based.  
> 
> At some point, I expect we will have other tokenizer classes which don't use a
> regex engine, so I think it would be best to reserve the name "Tokenizer" for
> future use and rename the current Tokenizer to "RegexTokenizer".
> 
> Another possibility would be "PerlRegexTokenizer", embedding the regex dialect
> that will be used to interpret the supplied pattern in the class name.
> However, the exact behavior of the regular expression engine is not consistent
> across different versions of Perl.  In general, it's not going to be possible
> to translate a pattern between different regex engines.  If we try to specify
> the regex dialect precisely so that the tokenization behavior is fully defined
> by the serialized analyzer within the schema file, the only remedy on mismatch
> will be to throw an exception and refuse to read the index.
> 
> Therefore, I think we should just have a single class named "RegexTokenizer"
> which is defined as deferring to the host language's regex engine.  Managing
> portability across different host languages or different versions of the host
> language will be left to the user.
> 
> Marvin Humphrey

Sounds like a reasonable approach.  Tokenizer for the interface and RegexTokenizer for platform-specific regexes (which, in fairness, is kinda what people would expect anyway).

Many things support Perl5 regexes to varying degrees, so you'd likely not have too much trouble from a portability perspective.  If you wanted to lock it in across host languages, then you could always implement this in C using the library of your choice due to the architecture, right?

Cheers,

ast
--
Andrew S. Townley <as...@atownley.org>
http://atownley.org


Re: [lucy-dev] RegexTokenizer

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Mar 08, 2011 at 11:36:43AM -0800, Nathan Kurz wrote:
> Once each index becomes specific to each host language, wouldn't you lose
> the ability to create the index in one language and access it from another?   

Indexes are specific to the host language right now, since Tokenizer uses
Perl's regex engine and CaseFolder uses Perl's lowercasing (which is imperfect
in its implementation of the Unicode case-folding algorithm).  I'm not
personally planning to work on that prior to 0.1.0.
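
To illustrate the CaseFolder caveat: Perl's lc() is a lowercasing operation,
not the Unicode case-folding algorithm, so some strings which case folding
would collapse to the same token don't collapse under lc().  A small example
(ordinary Perl, not Lucy code):

    use utf8;
    use strict;
    use warnings;

    # U+00DF (sharp s) lowercases to itself, while Unicode case folding maps
    # it to "ss".
    my $from_upper = lc("STRASSE");   # "strasse"
    my $from_sharp = lc("straße");    # "straße", not "strasse"
    # So "STRASSE" and "straße" don't end up as the same token under lc().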

> While there is some advantage to having all the tokenizing be host native, I
> think there is greater value in being able to create the index with a
> good text processing language (Perl in my case) while being able to perform
> the searches from a compiled language (likely C).

I agree that such cross-host-language flexibility would be a nice option.
I also think it's important that we not lard up core with mandatory
dependencies.  Rather than add PCRE, I'd prefer to focus on extracting
Snowball!  A C application should be able to link in only the Lucy modules it
needs.

> I'd suggest instead that RegexTokenizer be host-independent and use
> something like PCRE.  While this might make for a few odd corner cases, I
> think it will work better in multilingual projects.   

Well, so long as a "PCRETokenizer" is available as a module, those who require
cross-host-language compatibility can get what they need.  So the main
question is whether we should *stop* providing an analyzer which uses the host
regex engine.
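
If such an extension existed, swapping it in would just mean constructing a
different Analyzer; nothing else in the chain would change.  A sketch, with a
purely hypothetical class name:

    # Hypothetical extension class, not part of core Lucy.
    use LucyX::Analysis::PCRETokenizer;

    # Same pattern-style constructor, but the pattern would be compiled by
    # PCRE rather than by the host's regex engine, giving identical behavior
    # across host languages.
    my $tokenizer = LucyX::Analysis::PCRETokenizer->new( pattern => '\w+' );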

I'd actually prefer to pull *all* of the Analyzers out of core.  That's what
Lucene has done, with Robert Muir doing most of the work to put everything
into a "modules" directory.  

But that's a larger discussion and more than I want to take on prior to 0.1.0.
Right now, reserving the name "Tokenizer" is my priority.

> do you view the (future) C API as distinct from Lucy Core?

That's the way the design looks at the moment.  Not all the functions declared
in the header files within trunk/core/ have bodies defined within trunk/core/
-- some of the implementations are within trunk/perl/xs/ and we would need
analogous implementations within trunk/c/.

The design isn't set in stone, though.  The port to C isn't finished, and I
expect that we'll need to make adjustments as we add other bindings.

> >  If we try to specify the regex dialect precisely so that the tokenization
> >  behavior is fully defined by the serialized analyzer within the schema
> >  file, the only remedy on mismatch will be to throw an exception and
> >  refuse to read the index.
> 
> I'm not getting this.  Is there a failure other than not finding the token
> you search for?  

I'm guessing that there are regexes which are legal in one host but syntax
errors in another... but silent failure to match is indeed my main concern.

If we specify that "PerlRegexTokenizer" has the behavior of the regex engine
in Perl 5.10.1, what happens when we load Lucy in Perl 5.12.2 or 5.8.9?
Should we attempt to translate and provide full feature-compatibility and
bug-compatibility?  No way that would work.
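
For a concrete example of the version problem: named captures only showed up
in Perl 5.10, so a pattern that is perfectly legal under 5.10.1 is a syntax
error under 5.8.9.  A quick illustrative check (plain Perl, not Lucy code):

    # Legal under Perl >= 5.10, a syntax error under 5.8.x, because named
    # captures were only added in 5.10.
    my $pattern = '(?<word>\w+)';
    my $re = eval { qr/$pattern/ };
    warn "host regex engine rejected the pattern: $@" unless $re;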

This same problem affects Java Lucene when you change your JVM and the new one
has a different version of Unicode.

Marvin Humphrey


Re: [lucy-dev] RegexTokenizer

Posted by Nathan Kurz <na...@verse.com>.
On Tue, Mar 8, 2011 at 9:36 AM, Marvin Humphrey <ma...@rectangular.com> wrote:
> Therefore, I think we should just have a single class named "RegexTokenizer"
> which is defined as deferring to the host language's regex engine.  Managing
> portability across different host languages or different versions of the host
> language will be left to the user.

Maybe I'm misunderstanding, but I'd suggest thinking really carefully
before doing this.

I think one of the strengths of Lucy's host-core split is that the
core remains language agnostic.  Once each index becomes specific to
each host language, wouldn't you lose the ability to create the index
in one language and access it from another?   While there is some
advantage to having all the tokenizing be host native, I think there
is greater value in being able to create the index with a good text
processing language (Perl in my case) while being able to perform the
searches from a compiled language (likely C).

I'd suggest instead that RegexTokenizer be host-independent and use
something like PCRE.  While this might make for a few odd corner
cases, I think it will work better in multilingual projects.   Make it
easy to switch to a different tokenizer, but provide something built
in that can be used standalone.  But maybe this is a philosophical
rather than practical problem:  do you view the (future) C API as
distinct from Lucy Core?  If one wanted to wrap the core up to act as
a freestanding HTTP or 0mq server, what would the "host language" be?

>  If we try to specify
> the regex dialect precisely so that the tokenization behavior is fully defined
> by the serialized analyzer within the schema file, the only remedy on mismatch
> will be to throw an exception and refuse to read the index.

I'm not getting this.  Is there a failure other than not finding the token
you search for?  I think I can envision cases where you might
consciously want two different tokenizers working on the same index:
stemming one and not the other, or maybe even indexing bi-grams as a
means of boosting ad hoc phrase queries.
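
Something like the following, using today's class names (and assuming I have
the constructor arguments right):

    use Lucy::Plan::Schema;
    use Lucy::Plan::FullTextType;
    use Lucy::Analysis::PolyAnalyzer;
    use Lucy::Analysis::Tokenizer;

    my $schema = Lucy::Plan::Schema->new;

    # Tokenized, case-folded, and stemmed.
    my $stemmed = Lucy::Plan::FullTextType->new(
        analyzer => Lucy::Analysis::PolyAnalyzer->new( language => 'en' ),
    );

    # Tokenized only, no stemming.
    my $exact = Lucy::Plan::FullTextType->new(
        analyzer => Lucy::Analysis::Tokenizer->new( pattern => '\w+' ),
    );

    $schema->spec_field( name => 'content',       type => $stemmed );
    $schema->spec_field( name => 'content_exact', type => $exact );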

Nathan Kurz
nate@verse.com

Re: [lucy-dev] RegexTokenizer

Posted by "Andrew S. Townley" <as...@atownley.org>.
On 8 Mar 2011, at 5:40 PM, David E. Wheeler wrote:

> On Mar 8, 2011, at 9:36 AM, Marvin Humphrey wrote:
> 
>> Greets,
>> 
>> Right now, Lucy only has one tokenizer-style Analyzer subclass:
>> Lucy::Analysis::Tokenizer, which is regex based.  
>> 
>> At some point, I expect we will have other tokenizer classes which don't use a
>> regex engine, so I think it would be best to reserve the name "Tokenizer" for
>> future use and rename the current Tokenizer to "RegexTokenizer".
> 
> Lucy::Tokenizer::Regex please!

Yes.  That makes more sense.

--
Andrew S. Townley <as...@atownley.org>
http://atownley.org


Re: [lucy-dev] RegexTokenizer

Posted by "David E. Wheeler" <da...@kineticode.com>.
On Mar 8, 2011, at 7:56 PM, Marvin Humphrey wrote:

> Like how we're sneaking namespaces into Lucy's C code via prefixes? :)  Or how
> we jammed namespaces into JavaScript via objects back in our OpenJSAN 
> days?  :)

For languages that don't have namespaces, you do what'cha gotta do.

> I don't think it's a good idea for Lucy's class hierarchy to be organized
> differently for each host language binding.  (That seems like the logical
> extrapolation of your remark, though I believe you intended to express an
> ideal rather than make a concrete recommendation.)

No, not differently, just more naturally. So what might be Lucy::Tokenizer::Regex in Perl might be LucyRegexTokenizer elsewhere.

> Since the class hierarchy must be shared, its design has to balance many
> competing interests and work well across the gamut of hosts.  What we have now
> doesn't violate anybody's language rules or conventions to the best of my
> knowledge.  It's internally consistent, and works OK for our C code.

Good.

> It's technically doable.  
> 
> Still, Lucy's namespacing scheme and class hierarchy have been mulled
> over very hard over a very long time.  When renaming Lucy::Analysis::Tokenizer
> to something else, we should strive to operate within the existing
> conventions.

Agreed. Which is why you should ignore my ignorant ass.

> Upending the existing hierarchy and changing the rules would be a much larger
> undertaking.  It's not even worth contemplating without someone willing to do
> the work -- and I rather suspect that such a volunteer would become frustrated
> quickly by all the concerns I'd raise as someone who works on Lucy's C code.

Okay.

> There are two or three hundred classes in Lucy, and there will likely be
> hundreds more in time.  I think we should be conservative about what we put at
> the second level of the hierarchy, so that scanning any one directory with the
> naked eye produces sensible results.

Sure, but not *that* conservative. Oh, and 100s of classes? Yow!

> We inherited all the dirs under Lucy except for Lucy/Plan and Lucy/Object from
> Lucene.  IMO the organization has served us pretty well.

Great, I'll STFU then!

>> But I'm very late to this discussion, so feel free to ignore my ignorant
>> harping. :-)
> 
> I see your smiley, but I'll emphasize this anyway: we're definitely not
> ignoring your suggestion even if we don't adopt it.

Clearly. You're too kind.

David



Re: [lucy-dev] RegexTokenizer

Posted by Andi Vajda <va...@osafoundation.org>.
On Mar 8, 2011, at 19:56, Marvin Humphrey <ma...@rectangular.com> wrote:

> On Tue, Mar 08, 2011 at 12:25:49PM -0800, David E. Wheeler wrote:
>> Yeah. It just drives me nuts to see the namespacing conventions of one
>> language forced on another. 
> 
> Like how we're sneaking namespaces into Lucy's C code via prefixes? :)  Or how
> we jammed namespaces into JavaScript via objects back in our OpenJSAN 
> days?  :)
> 
>> Each language should have names that make sense by the conventions of that
>> language IMHO.
> 
> I don't think it's a good idea for Lucy's class hierarchy to be organized
> differently for each host language binding.  (That seems like the logical
> extrapolation of your remark, though I believe you intended to express an
> ideal rather than make a concrete recommendation.)
> 
> Since the class hierarchy must be shared, its design has to balance many
> competing interests and work well across the gamut of hosts.  What we have now
> doesn't violate anybody's language rules or conventions to the best of my
> knowledge.  It's internally consistent, and works OK for our C code.

With JCC I made the conscious decision a long time ago to not carry over the Java package structure into Python modules but use a flat namespace instead, generating one Python module for an entire class tree.

Name collisions are surprisingly rare. When they occur, the conflicts can usually be resolved with a --rename.

In the C++ layer, I keep the Java package structure because it's free but in the Python layer, it's in the way.

It seems that for API entrypoints that matter, people tend to pick unique class  names anyway.

Andi..

> 
>>> If someone is willing to work up a patch which makes "Lucy::Tokenizer::Regex"
>>> possible, then we can consider it.  Until then, it has to be ruled out for
>>> technical reasons.
>> 
>> Probably not too difficult.
> 
> It's technically doable.  
> 
> Still, Lucy's namespacing scheme and class hierarchy have been mulled
> over very hard over a very long time.  When renaming Lucy::Analysis::Tokenizer
> to something else, we should strive to operate within the existing
> conventions.
> 
> Upending the existing hierarchy and changing the rules would be a much larger
> undertaking.  It's not even worth contemplating without someone willing to do
> the work -- and I rather suspect that such a volunteer would become frustrated
> quickly by all the concerns I'd raise as someone who works on Lucy's C code.
> 
>>> FWIW, "Lucy::Tokenizer::Regex" implies that we would have a Lucy::Tokenizer
>>> class, which would break another convention -- we no longer have any classes
>>> which live directly under Lucy.
>> 
>> Now that's a shame. Seems like a waste of namespace hierarchy.
> 
> There are two or three hundred classes in Lucy, and there will likely be
> hundreds more in time.  I think we should be conservative about what we put at
> the second level of the hierarchy, so that scanning any one directory with the
> naked eye produces sensible results.
> 
> We inherited all the dirs under Lucy except for Lucy/Plan and Lucy/Object from
> Lucene.  IMO the organization has served us pretty well.
> 
>> But I'm very late to this discussion, so feel free to ignore my ignorant
>> harping. :-)
> 
> I see your smiley, but I'll emphasize this anyway: we're definitely not
> ignoring your suggestion even if we don't adopt it.
> 
> Cheers,
> 
> Marvin Humphrey
> 

Re: [lucy-dev] RegexTokenizer

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Mar 08, 2011 at 12:25:49PM -0800, David E. Wheeler wrote:
> Yeah. It just drives me nuts to see the namespacing conventions of one
> language forced on another. 

Like how we're sneaking namespaces into Lucy's C code via prefixes? :)  Or how
we jammed namespaces into JavaScript via objects back in our OpenJSAN 
days?  :)

> Each language should have names that make sense by the conventions of that
> language IMHO.

I don't think it's a good idea for Lucy's class hierarchy to be organized
differently for each host language binding.  (That seems like the logical
extrapolation of your remark, though I believe you intended to express an
ideal rather than make a concrete recommendation.)

Since the class hierarchy must be shared, its design has to balance many
competing interests and work well across the gamut of hosts.  What we have now
doesn't violate anybody's language rules or conventions to the best of my
knowledge.  It's internally consistent, and works OK for our C code.

> > If someone is willing to work up a patch which makes "Lucy::Tokenizer::Regex"
> > possible, then we can consider it.  Until then, it has to be ruled out for
> > technical reasons.
> 
> Probably not too difficult.

It's technically doable.  

Still, Lucy's namespacing scheme and class hierarchy have been mulled
over very hard over a very long time.  When renaming Lucy::Analysis::Tokenizer
to something else, we should strive to operate within the existing
conventions.

Upending the existing hierarchy and changing the rules would be a much larger
undertaking.  It's not even worth contemplating without someone willing to do
the work -- and I rather suspect that such a volunteer would become frustrated
quickly by all the concerns I'd raise as someone who works on Lucy's C code.

> > FWIW, "Lucy::Tokenizer::Regex" implies that we would have a Lucy::Tokenizer
> > class, which would break another convention -- we no longer have any classes
> > which live directly under Lucy.
> 
> Now that's a shame. Seems like a waste of namespace hierarchy.

There are two or three hundred classes in Lucy, and there will likely be
hundreds more in time.  I think we should be conservative about what we put at
the second level of the hierarchy, so that scanning any one directory with the
naked eye produces sensible results.

We inherited all the dirs under Lucy except for Lucy/Plan and Lucy/Object from
Lucene.  IMO the organization has served us pretty well.

> But I'm very late to this discussion, so feel free to ignore my ignorant
> harping. :-)

I see your smiley, but I'll emphasize this anyway: we're definitely not
ignoring your suggestion even if we don't adopt it.

Cheers,

Marvin Humphrey


Re: [lucy-dev] RegexTokenizer

Posted by "David E. Wheeler" <da...@kineticode.com>.
On Mar 8, 2011, at 12:19 PM, Marvin Humphrey wrote:

>> It doesn't take a lot of munging to get what you want:
> 
> Your demo of the transform is admirably compact.  However, the existing naming
> scheme is fairly deeply baked in to our object system, and a good deal of the
> code that touches on it is written in C.  Breaking the existing convention
> would require a certain amount of work.

Yeah. It just drives me nuts to see the namespacing conventions of one language forced on another. Each language should have names that make sense by the conventions of that language IMHO.

> If someone is willing to work up a patch which makes "Lucy::Tokenizer::Regex"
> possible, then we can consider it.  Until then, it has to be ruled out for
> technical reasons.

Probably not too difficult.

> FWIW, "Lucy::Tokenizer::Regex" implies that we would have a Lucy::Tokenizer
> class, which would break another convention -- we no longer have any classes
> which live directly under Lucy.

Now that's a shame. Seems like a waste of namespace hierarchy.

But I'm very late to this discussion, so feel free to ignore my ignorant harping. :-)

Best,

David


Re: [lucy-dev] RegexTokenizer

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Mar 08, 2011 at 11:15:20AM -0800, David E. Wheeler wrote:
> I suggest bending C to Perl's namespacing rather than the other way around.

Python users have the option of aliasing the module:

    from Lucy.Analysis import RegexTokenizer as RegexTokenizer
    tokenizer = RegexTokenizer.new()

So do Perl users, thanks to the 'aliased' module from CPAN:

    use aliased 'Lucy::Analysis::RegexTokenizer' => 'RegexTokenizer';
    my $tokenizer = RegexTokenizer->new;

> It doesn't take a lot of munging to get what you want:

Your demo of the transform is admirably compact.  However, the existing naming
scheme is fairly deeply baked in to our object system, and a good deal of the
code that touches on it is written in C.  Breaking the existing convention
would require a certain amount of work.

If someone is willing to work up a patch which makes "Lucy::Tokenizer::Regex"
possible, then we can consider it.  Until then, it has to be ruled out for
technical reasons.

FWIW, "Lucy::Tokenizer::Regex" implies that we would have a Lucy::Tokenizer
class, which would break another convention -- we no longer have any classes
which live directly under Lucy.

Marvin Humphrey


Re: [lucy-dev] RegexTokenizer

Posted by "David E. Wheeler" <da...@kineticode.com>.
On Mar 8, 2011, at 11:09 AM, Marvin Humphrey wrote:

> On Tue, Mar 08, 2011 at 09:40:30AM -0800, David E. Wheeler wrote:
>> Lucy::Tokenizer::Regex please!
> 
> Sorry, but due to C's flat namespace we have a limitation in our class naming
> scheme which excludes that possibility.  The last part of the class name is
> used for the C struct name, which means that at the C level, an object
> belonging to the class "Lucy::Tokenizer::Regex" would be a "Regex".  That's
> obviously inappropriate.

I suggest bending C to Perl's namespacing rather than the other way around. It doesn't take a lot of munging to get what you want:

package Lucy::Tokenizer::Regex;

use v5.10;  # for "say"

(my $cname = __PACKAGE__) =~ s/^Lucy:://;
$cname = join '', reverse split /::/ => $cname;
say $cname; # RegexTokenizer

Best,

David

Re: [lucy-dev] RegexTokenizer

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Mar 08, 2011 at 09:40:30AM -0800, David E. Wheeler wrote:
> Lucy::Tokenizer::Regex please!

Sorry, but due to C's flat namespace we have a limitation in our class naming
scheme which excludes that possibility.  The last part of the class name is
used for the C struct name, which means that at the C level, an object
belonging to the class "Lucy::Tokenizer::Regex" would be a "Regex".  That's
obviously inappropriate.

Marvin Humphrey


Re: [lucy-dev] RegexTokenizer

Posted by Peter Karman <pe...@peknet.com>.
David E. Wheeler wrote on 03/08/2011 11:40 AM:
> On Mar 8, 2011, at 9:36 AM, Marvin Humphrey wrote:
> 
>> Greets,
>>
>> Right now, Lucy only has one tokenizer-style Analyzer subclass:
>> Lucy::Analysis::Tokenizer, which is regex based.  
>>
>> At some point, I expect we will have other tokenizer classes which don't use a
>> regex engine, so I think it would be best to reserve the name "Tokenizer" for
>> future use and rename the current Tokenizer to "RegexTokenizer".
> 
> Lucy::Tokenizer::Regex please!
> 

+1

(
although, iirc there was some discussion awhile back about class naming
and depth vs breadth.

/me can recall neither the outcome nor the gist, and my finding-fu is
weak atm.
)

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] RegexTokenizer

Posted by "David E. Wheeler" <da...@kineticode.com>.
On Mar 8, 2011, at 9:36 AM, Marvin Humphrey wrote:

> Greets,
> 
> Right now, Lucy only has one tokenizer-style Analyzer subclass:
> Lucy::Analysis::Tokenizer, which is regex based.  
> 
> At some point, I expect we will have other tokenizer classes which don't use a
> regex engine, so I think it would be best to reserve the name "Tokenizer" for
> future use and rename the current Tokenizer to "RegexTokenizer".

Lucy::Tokenizer::Regex please!

Best,

David