Posted to dev@lucene.apache.org by Itamar Syn-Hershko <it...@code972.com> on 2014/11/13 15:19:45 UTC

JFlex, tokenization, and custom token exceptions

Hey all,

I also posted this question to the JFlex list [1], as it seems the more
appropriate place, but I thought I should raise it here as well.

I'm looking for ways to use Lucene's tokenizers while preserving some custom
tokens defined by the user. For example, use StandardTokenizer but preserve
C++, C# and i-phone as whole tokens. The catch is that I want that list to
be loaded at runtime rather than compiled into the tokenizer, mainly because
it will change over time.
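
To make the desired behavior concrete, here is a minimal sketch in plain
Java. It is only an illustration under stated assumptions: it does not use
Lucene's StandardTokenizer or any JFlex-generated code, the class and method
names (ProtectedTokenSketch, tokenize, endsAtBoundary) are hypothetical, and
the "split on anything that is not a letter or digit" rule is a crude
stand-in for real word-break rules. The point is just that the protected
list is ordinary data passed in at runtime.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene or JFlex.
public class ProtectedTokenSketch {

    // Splits text into runs of letters/digits, except that any entry of
    // protectedTokens found at a token start is emitted as one whole token.
    public static List<String> tokenize(String text, Set<String> protectedTokens) {
        // Try longer entries first so the longest protected token wins.
        List<String> byLength = new ArrayList<>(protectedTokens);
        byLength.sort(Comparator.comparingInt(String::length).reversed());

        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String match = null;
            for (String candidate : byLength) {
                if (!candidate.isEmpty()
                        && text.startsWith(candidate, i)
                        && endsAtBoundary(text, i + candidate.length())) {
                    match = candidate;
                    break;
                }
            }
            if (match != null) {
                // Protected token: keep it intact, punctuation included.
                tokens.add(match);
                i += match.length();
            } else if (Character.isLetterOrDigit(text.charAt(i))) {
                // Ordinary token: consume a run of letters and digits.
                int start = i;
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i));
            } else {
                // Separator (whitespace, punctuation): skip it.
                i++;
            }
        }
        return tokens;
    }

    // A protected match must not be followed directly by a letter or digit,
    // so "C#" is not carved out of a longer run such as "C#x".
    private static boolean endsAtBoundary(String text, int end) {
        return end >= text.length() || !Character.isLetterOrDigit(text.charAt(end));
    }

    public static void main(String[] args) {
        // In practice this set would be loaded from a file or other store at runtime.
        Set<String> protectedTokens = new HashSet<>(Arrays.asList("C++", "C#", "i-phone"));
        System.out.println(tokenize("I use C++, C# and an i-phone daily", protectedTokens));
    }
}
```

With an empty set the same input comes out with "C", "C", "i" and "phone"
as separate tokens, which is the behavior the exception list is meant to
override.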

The problem is that there's currently no real way of doing this. While I
have implemented this myself, JFlex doesn't seem to support it (other than
by defining new macros, regenerating the Java code, recompiling, etc.).

I discussed this with Rob Muir a couple of months back and he seemed
interested. I'd be happy to hear whether there's interest in pursuing this,
or to get any new ideas on how to enable it more easily at the JFlex layer
or otherwise. I'm happy to take this on, but every approach I'm looking at
currently has some significant flaws.

Cheers,

  [1] http://sourceforge.net/p/jflex/mailman/jflex-users/?viewmonth=201411

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko>
Freelance Developer & Consultant
Author of RavenDB in Action <http://manning.com/synhershko/>