Posted to java-user@lucene.apache.org by Tavi Nathanson <ta...@gmail.com> on 2011/02/10 22:01:34 UTC

Running a string through a simple Tokenizer, and then additional Tokenizers (vs. TokenFilters)

Hey everyone,



I'm trying to do the following:



1. Run a string through a simple tokenizer (e.g. WhitespaceTokenizer)

2. Run the resultant tokens through my current tokenizer as well as
StandardTokenizer, in order to isolate the tokens that differ between them.
(Background: I want to do this so that I can keep my current tokenization
scheme while also allowing matches against the StandardTokenizer scheme.)

~~

Example:



Original string: "Bob v2document"



1. "Bob v2document" => ["Bob", "v2document"]

2. Run each of these through my current tokenizer and StandardTokenizer. For
"v2document", let's say my tokenizer spits out ["v", "2", "document"] and
StandardTokenizer spits out ["v2document"]. I would append the
StandardTokenizer token and end up with something like ["Bob", "v", "2",
"document", GAP_SO_PHRASE_QUERIES_SKIP, "v2document"].

~~
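To make step 2 concrete, here's roughly what it looks like for a single
token, sketched against the Lucene 3.x API (StandardTokenizer(Version,
Reader) plus TermAttribute); the class and helper names are just ones I
made up:

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public class ReTokenizeSketch {

        // Run one already-whitespace-split token through StandardTokenizer
        // and collect whatever terms it produces.
        static List<String> standardTerms(String token) throws IOException {
            StandardTokenizer ts = new StandardTokenizer(
                Version.LUCENE_30, new StringReader(token));
            TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
            List<String> terms = new ArrayList<String>();
            while (ts.incrementToken()) {
                terms.add(termAtt.term());
            }
            ts.end();
            ts.close();
            return terms;
        }
    }

So standardTerms("v2document") should come back as ["v2document"], while my
own tokenizer would split it into ["v", "2", "document"].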

I'll be doing this at scale whenever I index a document, and I'm concerned
about performance: for every single original token, I'll need to build a
StringReader and pass it through both Tokenizers.
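
The one mitigation I've thought of: assuming Lucene 3.x, where
Tokenizer.reset(Reader) lets you rebind an existing instance to a new
reader, I could build each Tokenizer once and reset it per token, so at
least I'm not reconstructing the tokenizer and its attributes every time
(I'd still pay for a StringReader per token, though). Something like this,
where whitespaceTokens stands in for the step-1 output:

    // Build once, reset per token (assumes the 3.x reset(Reader) API).
    StandardTokenizer ts = new StandardTokenizer(
        Version.LUCENE_30, new StringReader(""));
    TermAttribute termAtt = ts.addAttribute(TermAttribute.class);

    for (String token : whitespaceTokens) { // hypothetical step-1 output
        ts.reset(new StringReader(token));  // rebind, don't reconstruct
        while (ts.incrementToken()) {
            String term = termAtt.term();
            // ... compare against my current tokenizer's output for token
        }
        ts.end();
    }
    ts.close();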


I know that, in general, what I'm doing seems more appropriate for
TokenFilters, but I see two issues:


- It looks like there's no TokenFilter version of StandardTokenizer that
applies its rules token by token. Maybe I'm overlooking something?

- If I pass ["Bob", "v2document"] through the two chains as separate
TokenFilters, I'll get something like ["Bob", "v", "2", "document"] vs.
["Bob", "v2document"], and then I won't know which output tokens came from
which original token (see the sketch below).



Any suggestions for doing this in a performant way?


 Thanks!

Tavi