Posted to solr-dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2007/12/27 15:10:25 UTC

TeeTokenFilter and SinkTokenizer

Anyone have any thoughts on how best to integrate Lucene's new
SinkTokenizer and TeeTokenFilter
(https://issues.apache.org/jira/browse/LUCENE-1058 and
http://www.gossamer-threads.com/lists/lucene/java-dev/55927?search_string=TeeTokenFilter;#55927)
into Solr?  It doesn't fit into the TokenizerFactory and
TokenFilterFactory model for create() since the constructors have
dependencies on things other than a Reader and a TokenStream.

I can do one-off Analyzer constructions, but that doesn't really fit
with the Solr way.  I think this could have some nice benefits for the
copyField mechanism as well, but that is more work to get right.
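To make the constructor problem concrete, here is a stdlib-only sketch of the Tee/Sink idea from LUCENE-1058 (tokens are plain Strings here rather than Lucene Token objects, and the class names are simplified stand-ins, not the real Lucene API):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for SinkTokenizer: buffers tokens for later replay.
class Sink {
    final List<String> buffered = new ArrayList<>();
}

// Simplified stand-in for TeeTokenFilter.
class Tee {
    private final Iterator<String> input;
    private final Sink sink;

    // The constructor needs a Sink, not just a Reader/TokenStream -- this
    // is exactly what breaks the TokenizerFactory/TokenFilterFactory model.
    Tee(Iterator<String> input, Sink sink) {
        this.input = input;
        this.sink = sink;
    }

    // Pass each token through unchanged, copying it into the sink as a side effect.
    String next() {
        if (!input.hasNext()) return null;
        String tok = input.next();
        sink.buffered.add(tok);
        return tok;
    }
}

public class TeeSinkDemo {
    static List<String> firstPass(Tee tee) {
        List<String> out = new ArrayList<>();
        for (String t = tee.next(); t != null; t = tee.next()) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        Sink sink = new Sink();
        Tee tee = new Tee(List.of("quick", "brown", "fox").iterator(), sink);
        List<String> seen = firstPass(tee);    // first analysis consumes the tee
        List<String> replayed = sink.buffered; // second analysis replays the sink
        System.out.println(seen + " / " + replayed);
        // [quick, brown, fox] / [quick, brown, fox]
    }
}
```

The point is that the second "analysis" never re-tokenizes the raw text; it just replays what the tee captured.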

My initial (half-baked?) thinking is that we need the ability to name  
TokenStreams (Tokenizers and TokenFilters) so that we could do  
something like:

<analyzer type="index">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <!-- in this example, we will only use synonyms at query time
         <filter class="my.TokenFilterFactory" name="step1"/>
         -->
         <filter class="my.next.TFF" name="step2"/>
         <filter class="my.other.TFF" name="step3"/>
       </analyzer>

Thus, each of the named filters creates a TeeTokenFilter and has an
associated SinkTokenizer.  Then, I can declare another analyzer that
looks like:
<analyzer type="index">
	<tokenizer name="step2"/>
</analyzer>

which would just use the tokens saved by step 2 in the first
analysis.  Similarly, we do that for step 3 with some other filters
added like:
<analyzer type="index">
	<tokenizer name="step3"/>
	<filter class="StopFilterFactory"/>
</analyzer>
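The effect of that last analyzer can be sketched in stdlib-only Java (the token data is made up, and the stop-word filtering is a plain predicate standing in for StopFilterFactory):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of <tokenizer name="step3"/> + <filter class="StopFilterFactory"/>:
// replay the tokens the "step3" sink captured, then filter the replayed stream.
public class SinkReplayDemo {
    static List<String> replayWithStopFilter(List<String> sinkTokens, Set<String> stopWords) {
        return sinkTokens.stream()
                .filter(t -> !stopWords.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Tokens the "step3" sink saved during the first analysis (made-up data).
        List<String> step3Sink = List.of("the", "quick", "brown", "fox");
        System.out.println(replayWithStopFilter(step3Sink, Set.of("the", "a", "an")));
        // [quick, brown, fox]
    }
}
```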

Now, Solr would need to be smart about this and know that it has to  
index the fields using the first analyzer before those using the  
sinks.  And there might be some concerns about what to do if multiple  
fields use the same "Tee" analyzer and whether that affects the
sinks.  The "name" attribute, of course, would be optional.  There
also is the issue of initialization in that we would most likely need  
2 pass initialization so that the names of the token streams are known  
ahead of time.
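One way the two-pass initialization could look, sketched with hypothetical names and plain stdlib types (nothing here is existing Solr code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Pass 1 registers every named stream declared in the schema; pass 2 resolves
// <tokenizer name="..."/> references, so unknown names fail fast and Solr
// knows which analyzer must run first to fill each sink.
public class TwoPassInit {
    static Map<String, String> registerNames(List<String[]> declarations) {
        Map<String, String> namedStreams = new HashMap<>();
        // Each declaration is {stream name, analyzer that declares it}.
        for (String[] d : declarations) namedStreams.put(d[0], d[1]);
        return namedStreams;
    }

    static String resolve(Map<String, String> namedStreams, String name) {
        String owner = namedStreams.get(name);
        if (owner == null) throw new IllegalStateException("unknown token stream: " + name);
        return owner; // the analyzer that must run before this sink is read
    }

    public static void main(String[] args) {
        Map<String, String> names = registerNames(List.of(
                new String[] {"step2", "text/index"},
                new String[] {"step3", "text/index"}));
        System.out.println("step3 is filled by " + resolve(names, "step3"));
    }
}
```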

I know, of course, the proof is in the pudding, as they say, and a
patch does wonders, but I am wondering if people have any initial
thoughts on this.  I think the performance upside can be significant  
in some common cases, especially once we work out issues w/ Lucene's  
clone method and in the case where the SinkTokenizer is not keeping  
all tokens.

-Grant

Re: TeeTokenFilter and SinkTokenizer

Posted by Chris Hostetter <ho...@fucit.org>.
: My initial (half-baked?) thinking is that we need the ability to name
: TokenStreams (Tokenizers and TokenFilters) so that we could do something like:
	....
: Thus, each of the named filters create a TeeTokenFilter and have an associated
: SinkTokenizer.  Then, I can declare another analyzer that looks like:

I don't think it's that half-baked ... but i'm not sure why it would need 
to be that implicit.  i think it would make a lot of sense to have a 
TeeTokenFilterFactory that takes the name of a "tee" to write to (ie: no 
implicit creation of TeeTokenFilter's between every existing Factory) ... 
the question is then what to do with those "tees"...

This seems very analogous to the way copyField works ... let the user 
specify that anytime something which went into a field named "foo" (or 
matching a pattern of "foo*") and comes out of a tee named "bar" it should 
be sent to a field named "baz" ... where "baz" must have a fieldtype that 
uses the SinkTokenizer (or perhaps the SinkTokenizer can be implicit at 
least, since we'll want to do error checking that you don't attempt to 
"tee" to a field that has some other Analyzer or TokenizerFactory...)

    <fieldType name="text" class="solr.TextField" >
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
        <filter class="solr.TeeFilterFactory" tee="text_nostop" />
        <filter class="solr.WordDelimiterFilterFactory" .. />
        <filter class="solr.TeeFilterFactory" tee="text" />
      </analyzer>
      ...
    </fieldType>
    <fieldType name="caseInsensitive" class="solr.TextField">
      <analyzer type="index">
        <!-- no tokenizer, not an error, see below -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <!-- must still have a tokenizer here -->
        ...
      </analyzer>
    </fieldType>
    <fieldType name="properNouns" class="solr.TextField">
      <analyzer type="index">
        <!-- no tokenizer, not an error, see below -->
        <filter class="my.IgnoreAllImproperNounsTokenFilter" />
      </analyzer>
      <analyzer type="query">
        <!-- must still have a tokenizer here -->
        ...
      </analyzer>
    </fieldType>
   <field name="body" type="text" />
   <!-- since these fields use types whose index analyzer has no
        tokenizer, they must appear in a teeField declaration (or error at
        startup), and you cannot index to them directly (error when adding doc)
   -->
   <field name="bodyCaseInsensitive" type="caseInsensitive" />
   <field name="nounsInBody" type="properNouns" />

   <teeField fromField="body" fromTee="text" toField="bodyCaseInsensitive" />
   <teeField fromField="body" fromTee="text_nostop" toField="nounsInBody"  />


...hmmmm, except ideally you'd want to be able to string together an 
arbitrary number of "pipelines" to make a nice big mesh graph of 
interconnected analysis, and this would only let you do two ... Ah! except 
that fields don't have to be stored or indexed.  So the "toField" of a 
<teeField/> could exist purely to point at some bits of an analysis 
pipeline and then be the "fromField" of other <teeField/> rules.
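That chaining idea can be sketched in stdlib-only Java; the field names, the one-transform-per-field analyzers, and the single-tee-per-field rule table are all hypothetical simplifications, not Solr syntax:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// A <teeField/> target can itself be the fromField of another rule, so
// analysis pipelines chain through fields that are neither stored nor indexed.
public class TeeMeshDemo {
    final Map<String, UnaryOperator<String>> analyzers = new LinkedHashMap<>();
    final Map<String, String> teeRules = new LinkedHashMap<>(); // fromField -> toField
    final Map<String, List<String>> indexed = new LinkedHashMap<>();

    void index(String field, List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(analyzers.get(field).apply(t));
        indexed.put(field, out);
        String next = teeRules.get(field);  // follow the teeField rule, if any,
        if (next != null) index(next, out); // letting the target tee onward in turn
    }

    public static void main(String[] args) {
        TeeMeshDemo schema = new TeeMeshDemo();
        schema.analyzers.put("body", t -> t); // identity per token, whitespace-ish
        schema.analyzers.put("bodyCaseInsensitive", t -> t.toLowerCase(Locale.ROOT));
        schema.teeRules.put("body", "bodyCaseInsensitive");
        schema.index("body", List.of("The", "Quick", "Fox"));
        System.out.println(schema.indexed);
        // {body=[The, Quick, Fox], bodyCaseInsensitive=[the, quick, fox]}
    }
}
```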


-Hoss