Posted to solr-dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2007/12/27 15:10:25 UTC
TeeTokenFilter and SinkTokenizer
Anyone have any thoughts on how best to integrate Lucene's new
SinkTokenizer and TeeTokenFilter (https://issues.apache.org/jira/browse/LUCENE-1058
and
http://www.gossamer-threads.com/lists/lucene/java-dev/55927?search_string=TeeTokenFilter;#55927)
into Solr? It doesn't fit the TokenizerFactory and
TokenFilterFactory create() model, since the constructors have
dependencies on things other than a Reader and a TokenStream.
I can do one-off Analyzer constructions, but that doesn't really fit
with the Solr way. I think this could have some nice benefits for the
copyField mechanism, as well, but that is more work to get right.
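To make the tee/sink idea concrete, here is a minimal sketch of the pattern in plain Java. This models the concept only; it is not the actual Lucene API from LUCENE-1058, and all class and method names here (Stream, Sink, Tee, drain, demo) are illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Minimal model of the tee/sink idea: a Tee passes tokens through
// unchanged while copying each one into a Sink; the Sink can later
// replay the cached tokens without re-running the upstream analysis.
public class TeeSinkSketch {
    /** A token source: returns the next token, or null when exhausted. */
    interface Stream { String next(); }

    /** Caches tokens fed to it by a Tee and replays them on demand. */
    static class Sink implements Stream {
        private final List<String> cached = new ArrayList<>();
        private Iterator<String> it;
        void add(String token) { cached.add(token); }
        public String next() {
            if (it == null) it = cached.iterator();
            return it.hasNext() ? it.next() : null;
        }
    }

    /** Passes tokens through while copying them into the Sink. */
    static class Tee implements Stream {
        private final Stream input;
        private final Sink sink;
        Tee(Stream input, Sink sink) { this.input = input; this.sink = sink; }
        public String next() {
            String t = input.next();
            if (t != null) sink.add(t);
            return t;
        }
    }

    /** Drains a stream into a list (consumes it fully). */
    static List<String> drain(Stream s) {
        List<String> out = new ArrayList<>();
        for (String t = s.next(); t != null; t = s.next()) out.add(t);
        return out;
    }

    /** Runs one analysis through a Tee, then replays from the Sink. */
    public static List<List<String>> demo() {
        Iterator<String> src = Arrays.asList("quick", "brown", "fox").iterator();
        Stream source = () -> src.hasNext() ? src.next() : null;
        Sink sink = new Sink();
        List<String> first = drain(new Tee(source, sink)); // full analysis runs once
        List<String> replay = drain(sink);                 // second field reuses tokens
        return Arrays.asList(first, replay);
    }
}
```

The payoff for Solr would be that the expensive part of the pipeline (tokenization and any filters upstream of the tee) runs once, while any number of downstream consumers read from the sink.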
My initial (half-baked?) thinking is that we need the ability to name
TokenStreams (Tokenizers and TokenFilters) so that we could do
something like:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="my.TokenFilterFactory" name="step1"/>
-->
<filter class="my.next.TFF" name="step2"/>
<filter class="my.other.TFF" name="step3"/>
</analyzer>
Thus, each of the named filters creates a TeeTokenFilter and has an
associated SinkTokenizer. Then, I can declare another analyzer that
looks like:
<analyzer type="index">
<tokenizer name="step2"/>
</analyzer>
which would just use the tokens saved by step 2 in the first
analysis. Similarly, we do that for step 3 with some other filters
added like:
<analyzer type="index">
<tokenizer name="step3"/>
<filter class="StopFilterFactory"/>
</analyzer>
Now, Solr would need to be smart about this and know that it has to
index the fields using the first analyzer before those using the
sinks. And there might be some concerns about what to do if multiple
fields use the same "Tee" analyzer and whether that affects the
sinks. The "name" attribute, of course, would be optional. There
also is the issue of initialization in that we would most likely need
2 pass initialization so that the names of the token streams are known
ahead of time.
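The two-pass initialization could look roughly like the sketch below (this is hypothetical, not Solr code; the names TwoPassInit, registerNames, and resolve are all made up for illustration): pass one walks the analyzer declarations and registers every declared stream name, pass two resolves name references against that registry, failing fast on unknown names.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of two-pass initialization for named token streams.
// Pass 1 registers every declared name; pass 2 resolves references,
// so an analyzer that reads from a sink may be declared before the
// analyzer containing the tee that feeds it.
public class TwoPassInit {
    /** Pass 1: collect declared stream names from all analyzer configs. */
    static Map<String, String> registerNames(Map<String, List<String>> analyzers) {
        Map<String, String> registry = new HashMap<>();
        for (Map.Entry<String, List<String>> e : analyzers.entrySet())
            for (String name : e.getValue())
                registry.put(name, e.getKey()); // stream name -> owning analyzer
        return registry;
    }

    /** Pass 2: resolve a name reference, failing fast if it is unknown. */
    static String resolve(Map<String, String> registry, String ref) {
        String owner = registry.get(ref);
        if (owner == null)
            throw new IllegalArgumentException("unknown token stream: " + ref);
        return owner;
    }

    public static String demo() {
        // One analyzer ("text") declaring the named streams from the example.
        Map<String, List<String>> analyzers = new LinkedHashMap<>();
        analyzers.put("text", List.of("step2", "step3"));
        Map<String, String> registry = registerNames(analyzers);
        return resolve(registry, "step2");
    }
}
```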
I know, of course, the proof is in the pudding, as they say, and a
patch does wonders, but I am wondering if people have any initial
thoughts on this. I think the performance upside can be significant
in some common cases, especially once we work out issues w/ Lucene's
clone method and in the case where the SinkTokenizer is not keeping
all tokens.
-Grant
Re: TeeTokenFilter and SinkTokenizer
Posted by Chris Hostetter <ho...@fucit.org>.
: My initial (half-baked?) thinking is that we need the ability to name
: TokenStreams (Tokenizers and TokenFilters) so that we could do something like:
....
: Thus, each of the named filters create a TeeTokenFilter and have an associated
: SinkTokenizer. Then, I can declare another analyzer that looks like:
I don't think it's that half-baked ... but I'm not sure why it would need
to be that implicit. I think it would make a lot of sense to have a
TeeTokenFilterFactory that takes the name of a "tee" to write to (ie: no
implicit creation of TeeTokenFilter's between every existing Factory) ...
the question is then what to do with those "tees"...
This seems very analogous to the way copyField works ... let the user
specify that anytime something goes into a field named "foo" (or
matching a pattern of "foo*") and comes out of a tee named "bar", it should
be sent to a field named "baz" ... where "baz" must have a fieldtype that
uses the SinkTokenizer (or perhaps the SinkTokenizer can be implicit, at
least since we'll want to do error checking that you don't attempt to
"tee" to a field that has some other Analyzer or TokenizerFactory)...
<fieldtype name="text" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"/>
<filter class="solr.TeeFilterFactory" tee="text_nostop" />
<filter class="solr.WordDelimiterFilterFactory" .. />
<filter class="solr.TeeFilterFactory" tee="text" />
</analyzer>
...
</fieldtype>
<fieldtype name="caseInsensitive" class="solr.TextField">
<analyzer type="index">
<!-- no tokenizer, not an error, see below -->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<!-- must still have a tokenizer here -->
...
</analyzer>
</fieldtype>
<fieldtype name="properNouns" class="solr.TextField">
<analyzer type="index">
<!-- no tokenizer, not an error, see below -->
<filter class="my.IgnoreAllImproperNounsTokenFilter" />
</analyzer>
<analyzer type="query">
<!-- must still have a tokenizer here -->
...
</analyzer>
</fieldtype>
<field name="body" type="text" />
<!-- since these fields use types whose index analyzer has no
tokenizer, it must be in a teeField declaration (or error at
startup), and you cannot index to it directly (error when adding doc)
-->
<field name="bodyCaseInsensitive" type="caseInsensitive" />
<field name="nounsInBody" type="properNouns" />
<teeField fromField="body" fromTee="text" toField="bodyCaseInsensitive" />
<teeField fromField="body" fromTee="text_nostop" toField="nounsInBody" />
...hmmmm, except ideally you'd want to be able to string together an
arbitrary number of "pipelines" to make a nice big mesh graph of
interconnected analysis, and this would only let you do two ... Ah! except
that fields don't have to be stored or indexed, so the "toField" of a
<teeField/> could exist purely to point at some bits of an analysis
pipeline and then be the "fromField" of other <teeField/> rules.
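Ordering the fields so every teeField's fromField is analyzed before its toField, even when teeFields chain into a mesh, amounts to a topological sort over the teeField edges. A minimal sketch of that ordering (the field names come from the example above; the class TeeFieldOrder and its methods are hypothetical, and cycle detection is omitted):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: order fields so that every teeField's fromField is analyzed
// before its toField, even when teeFields chain into longer pipelines.
// Uses postorder DFS plus a reversal, the standard topological sort.
public class TeeFieldOrder {
    /** Topologically sort fields over fromField -> toField edges. */
    static List<String> order(Map<String, List<String>> edges, List<String> fields) {
        List<String> post = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String f : fields) visit(f, edges, seen, post);
        Collections.reverse(post); // postorder reversed = dependencies first
        return post;
    }

    private static void visit(String f, Map<String, List<String>> edges,
                              Set<String> seen, List<String> post) {
        if (!seen.add(f)) return;          // already handled
        for (String to : edges.getOrDefault(f, List.of()))
            visit(to, edges, seen, post);  // fields fed by this field's tees
        post.add(f);                       // f lands after everything it feeds
    }

    public static List<String> demo() {
        // The two <teeField/> rules from the example: body feeds both sinks.
        Map<String, List<String>> edges = new HashMap<>();
        edges.put("body", List.of("bodyCaseInsensitive", "nounsInBody"));
        return order(edges, List.of("body", "bodyCaseInsensitive", "nounsInBody"));
    }
}
```

With this ordering in hand, "body" is always analyzed first, and a toField of one teeField can safely act as the fromField of another, which is exactly the mesh case above.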
-Hoss