Posted to dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2007/12/17 05:40:48 UTC

TeeTokenFilter performance testing

For those who don't recall, TeeTokenFilter was added in https://issues.apache.org/jira/browse/LUCENE-1058
  to handle what I would consider a somewhat common case whereby two
or more fields share a fair number of common analysis steps.  For
instance, if one wants a field that contains the proper nouns found
in the body of a text, or just the dates, or the Tokens that match a
certain type, the TeeTokenFilter can be set up so that those
special tokens are "siphoned off" from the main analysis path and
saved for the other field, eliminating the need to go back
through the analysis steps.
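
To make that concrete, here is a minimal sketch of the intended usage,
assuming the TeeTokenFilter/SinkTokenizer API from LUCENE-1058.  The
TypeSinkTokenizer subclass and the "DATE" token type are hypothetical,
purely for illustration (reader is a java.io.Reader and doc a Document
assumed to be in scope):

   // Collect only tokens of a given type; every token still flows
   // through the main analysis path untouched.
   class TypeSinkTokenizer extends SinkTokenizer {
     private final String type;
     TypeSinkTokenizer(String type) { this.type = type; }
     public void add(Token t) {
       if (t != null && type.equals(t.type())) {
         lst.add(t.clone()); // clone: the source stream reuses its Token
       }
     }
   }

   // Analyze the body once; "DATE" tokens are siphoned into the sink
   // and can be indexed as a second field without re-analysis.
   SinkTokenizer dateSink = new TypeSinkTokenizer("DATE");
   TokenStream bodyStream = new TeeTokenFilter(new StandardFilter(
       new StandardTokenizer(reader)), dateSink);
   doc.add(new Field("body", bodyStream));
   doc.add(new Field("dates", dateSink));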

So, I have done some preliminary performance testing of the
TeeTokenFilter using the patch inlined below (unfortunately, some of
the changes are formatting-only).  See the testPerformance() method at
the bottom of this message.  I simulated some of the above scenarios by
siphoning off every Xth token in the sink test, while doing
the analysis twice in the non-sink test (with the second pass keeping
only every Xth token).  I then looped over various values of X
ranging from 1 to 500 and analyzed a String containing Y tokens, where
Y ranges from 100 to 50,000.

For the smaller token counts, any performance difference is
negligible.  However, even at 500 tokens, one starts to see a
difference.  The first thing to note is that TeeTokenFilter (TTF) is
much _slower_ in the case where all tokens are siphoned off (X = 1).  I
believe the reason is the cost of Token.clone(), but I have not
validated this yet by profiling.  However, once X > 2, the time spent
producing analysis tokens drops to somewhere in the 50-65% range of
the total time of the two-field approach.

My code is below; just add it to the TeeSinkTokenTest in the
o.a.l.analysis test package.

Is my test/thinking valid on this idea?  Intuitively it makes sense to
me, but I am also tired and did most of the write-up of this on the
plane the other day, so mistakes happen.  In fact, I was kind of
surprised by the poor showing in the X = 1 case, as I would have
thought it would be in the ballpark of the two-field case, but I will
need to investigate more.

Thanks,
Grant

-----
   /**
    * Not an explicit test, just useful to print out some info on performance.
    *
    * @throws Exception
    */
   public void testPerformance() throws Exception {
     int[] tokCount = {100, 500, 1000, 2000, 5000, 10000, 50000};
     for (int k = 0; k < tokCount.length; k++) {
       StringBuffer buffer = new StringBuffer();
       System.out.println("-----Tokens: " + tokCount[k] + "-----");
       for (int i = 0; i < tokCount[k]; i++) {
         buffer.append(English.intToEnglish(i).toUpperCase()).append(' ');
       }
       //make sure we produce the same tokens
       ModuloSinkTokenizer sink = new ModuloSinkTokenizer(100);
       Token next = new Token();
       TokenStream result = new TeeTokenFilter(new StandardFilter(
           new StandardTokenizer(new StringReader(buffer.toString()))), sink);
       while ((next = result.next(next)) != null) {
       }
       result = new ModuloTokenFilter(new StandardFilter(
           new StandardTokenizer(new StringReader(buffer.toString()))), 100);
       next = new Token();
       List tmp = new ArrayList();
       while ((next = result.next(next)) != null) {
         tmp.add(next.clone());
       }
       List sinkList = sink.getTokens();
       assertTrue("tmp Size: " + tmp.size() + " is not: " + sinkList.size(),
           tmp.size() == sinkList.size());
       for (int i = 0; i < tmp.size(); i++) {
         Token tfTok = (Token) tmp.get(i);
         Token sinkTok = (Token) sinkList.get(i);
         assertTrue(tfTok.termText() + " is not equal to "
             + sinkTok.termText() + " at token: " + i,
             tfTok.termText().equals(sinkTok.termText()));
       }
       //simulate two fields, each being analyzed once
       int[] modCounts = {1, 2, 5, 10, 20, 50, 100, 200, 500};
       for (int j = 0; j < modCounts.length; j++) {
         int tfPos = 0;
         long start = System.currentTimeMillis();
         for (int i = 0; i < 20; i++) {
           next = new Token();
           result = new StandardFilter(
               new StandardTokenizer(new StringReader(buffer.toString())));
           while ((next = result.next(next)) != null) {
             tfPos += next.getPositionIncrement();
           }
           next = new Token();
           result = new ModuloTokenFilter(new StandardFilter(
               new StandardTokenizer(new StringReader(buffer.toString()))),
               modCounts[j]);
           while ((next = result.next(next)) != null) {
             tfPos += next.getPositionIncrement();
           }
         }
         long finish = System.currentTimeMillis();
         System.out.println("ModCount: " + modCounts[j]
             + " Two fields took " + (finish - start) + " ms");
         int sinkPos = 0;
         start = System.currentTimeMillis();
         for (int i = 0; i < 20; i++) {
           sink = new ModuloSinkTokenizer(modCounts[j]);
           next = new Token();
           result = new TeeTokenFilter(new StandardFilter(
               new StandardTokenizer(new StringReader(buffer.toString()))), sink);
           while ((next = result.next(next)) != null) {
             sinkPos += next.getPositionIncrement();
           }
           //System.out.println("Modulo--------");
           result = sink;
           while ((next = result.next(next)) != null) {
             sinkPos += next.getPositionIncrement();
           }
         }
         finish = System.currentTimeMillis();
         System.out.println("ModCount: " + modCounts[j]
             + " Tee fields took " + (finish - start) + " ms");
         assertTrue(sinkPos + " does not equal: " + tfPos, sinkPos == tfPos);
       }
       System.out.println("- End Tokens: " + tokCount[k] + "-----");
     }
   }


   class ModuloTokenFilter extends TokenFilter {

     int modCount;
     int count = 0;

     ModuloTokenFilter(TokenStream input, int mc) {
       super(input);
       modCount = mc;
     }

     //return only every modCount'th token; consume and skip the rest
     public Token next(Token result) throws IOException {
       while ((result = input.next(result)) != null && count % modCount != 0) {
         count++;
       }
       count++;
       return result;
     }
   }

   class ModuloSinkTokenizer extends SinkTokenizer {
     int count = 0;
     int modCount;


     ModuloSinkTokenizer(int mc) {
       modCount = mc;
       lst = new ArrayList(modCount + 1);
     }

     //save a clone of every modCount'th token; the source stream reuses
     //the Token instance it passes in, so a copy must be kept
     public void add(Token t) {
       if (t != null && count % modCount == 0) {
         lst.add(t.clone());
       }
       count++;
     }
   }




Re: TeeTokenFilter performance testing

Posted by Karl Wettin <ka...@gmail.com>.
On 20 Dec 2007, at 16:07, Grant Ingersoll wrote:

> I must admit I don't know a lot about the perf. differences between  
> clone and new, but I would think the cost should be on par, if not a  
> little cheaper, otherwise what's the point?

My guess is that clone() is a convenience implementation that to some
extent uses reflection.

> It also seems like we shouldn't have to go through nulling out/ 
> reinitialization of the term buffer, since we already know the size,  
> etc. of the term buffer.

Good call.


-- 
karl



Re: TeeTokenFilter performance testing

Posted by Yonik Seeley <yo...@apache.org>.
On Dec 20, 2007 10:07 AM, Grant Ingersoll <gs...@apache.org> wrote:
> Hmmm, I will have to take a look at Token.clone.  I must admit I don't
> know a lot about the perf. differences between clone and new, but I
> would think the cost should be on par, if not a little cheaper,
> otherwise what's the point?

> It also seems like we shouldn't have to
> go through nulling out/reinitialization of the term buffer, since we
> already know the size, etc. of the term buffer.

I assume you are referring to this part of the clone() code?

      if (termBuffer != null) {
        t.termBuffer = null;
        t.setTermBuffer(termBuffer, 0, termLength);
      }

The nulling out was just so that the setTermBuffer() code could be re-used.
I bet most of the cost in cloning is due to the native clone() call.
It looks like maybe we should replace that with new Token()... the
slight downside being that it would no longer work the same way if
someone subclasses Token.
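
For what it's worth, a hypothetical sketch (untested) of what skipping
the null-out might look like inside clone(), copying the buffer
directly since the length is already known:

      if (termBuffer != null) {
        // allocate at the exact size and copy in place, rather than
        // nulling out and letting setTermBuffer() re-grow the buffer
        t.termBuffer = new char[termLength];
        System.arraycopy(termBuffer, 0, t.termBuffer, 0, termLength);
        t.termLength = termLength;
      }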

-Yonik



Re: TeeTokenFilter performance testing

Posted by Grant Ingersoll <gs...@apache.org>.
Hmmm, I will have to take a look at Token.clone.  I must admit I don't  
know a lot about the perf. differences between clone and new, but I  
would think the cost should be on par, if not a little cheaper,  
otherwise what's the point?  It also seems like we shouldn't have to  
go through nulling out/reinitialization of the term buffer, since we  
already know the size, etc. of the term buffer.

-Grant

On Dec 18, 2007, at 11:07 AM, Karl Wettin wrote:

>
> On 18 Dec 2007, at 13:26, Grant Ingersoll wrote:
>>> I might be missing something here, but why do you clone?
>
>> Because the Token is changing and I am not saving all Tokens, just  
>> the ones changed.
>
> Aha!
>
>>>> The first thing to note is that TeeTokenFilter (TTF) is much  
>>>> _slower_ in the case that all tokens are siphoned off (X = 1).
>
>
> I added a line to the benchmark so that the "Two fields" case does
> the same thing as the "Tee field" case.
>
>         next = new Token();
>         result = new ModuloTokenFilter(new StandardFilter(
>             new StandardTokenizer(new StringReader(buffer.toString()))),
>             modCounts[j]);
>         while ((next = result.next(next)) != null) {
> +           next.clone(); // simulate what sink does
>           tfPos += next.getPositionIncrement();
>         }
>       }
>       long finish = System.currentTimeMillis();
>       System.out.println("ModCount: " + modCounts[j]
>           + " Two fields took " + (finish - start) + " ms");
>
> This has some effect, but much less than tweaking JVM settings:
>
> MacBook running
>
> -server -Xms64M -Xmx256M
>
> -----Tokens: 50000-----
> ModCount: 1 Two fields took 1943 ms
> ModCount: 1 Tee fields took 1172 ms
> ModCount: 2 Two fields took 833 ms
> ModCount: 2 Tee fields took 759 ms
> ModCount: 5 Two fields took 632 ms
> ModCount: 5 Tee fields took 473 ms
>
> -client -Xmx256M
>
> -----Tokens: 50000-----
> ModCount: 1 Two fields took 2025 ms
> ModCount: 1 Tee fields took 2537 ms
> ModCount: 2 Two fields took 1535 ms
> ModCount: 2 Tee fields took 1479 ms
> ModCount: 5 Two fields took 1314 ms
> ModCount: 5 Tee fields took 1088 ms
>
> Then I cut down the time spent even more by rewriting Token#clone:
>
> -server -Xms64M -Xmx256M
>
> -----Tokens: 50000-----
> ModCount: 1 Two fields took 1263 ms
> ModCount: 1 Tee fields took 854 ms
> ModCount: 2 Two fields took 692 ms
> ModCount: 2 Tee fields took 562 ms
> ModCount: 5 Two fields took 799 ms
> ModCount: 5 Tee fields took 432 ms
>
> -client -Xmx256M
>
> ModCount: 1 Two fields took 1600 ms
> ModCount: 1 Tee fields took 1981 ms
> ModCount: 2 Two fields took 1363 ms
> ModCount: 2 Tee fields took 1244 ms
> ModCount: 5 Two fields took 1252 ms
> ModCount: 5 Tee fields took 896 ms
>
>
> //  public Object clone() {
> //    try {
> //      Token t = (Token)super.clone();
> //      if (termBuffer != null) {
> //        t.termBuffer = null;
> //        t.setTermBuffer(termBuffer, 0, termLength);
> //      }
> //      if (payload != null) {
> //        t.setPayload((Payload) payload.clone());
> //      }
> //      return t;
> //    } catch (CloneNotSupportedException e) {
> //      throw new RuntimeException(e);  // shouldn't happen
> //    }
> //  }
>
>  public Object clone() {
>    Token t = new Token();
>    t.setStartOffset(startOffset);
>    t.setEndOffset(endOffset);
>    t.setType(type);
>    // also copy the position increment, which super.clone() got for free
>    t.setPositionIncrement(positionIncrement);
>    if (termBuffer != null) {
>      t.termBuffer = null;
>      t.setTermBuffer(termBuffer, 0, termLength);
>    }
>    if (payload != null) {
>      t.setPayload((Payload) payload.clone());
>    }
>    return t;
>  }
>
>>>> Is my test/thinking valid on this idea?
>
> Yes, I think so.
>
>
> -- 
> karl
>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






Re: TeeTokenFilter performance testing

Posted by Karl Wettin <ka...@gmail.com>.
On 18 Dec 2007, at 13:26, Grant Ingersoll wrote:
>> I might be missing something here, but why do you clone?

> Because the Token is changing and I am not saving all Tokens, just  
> the ones changed.

Aha!

>>> The first thing to note is that TeeTokenFilter (TTF) is much  
>>> _slower_ in the case that all tokens are siphoned off (X = 1).


I added a line to the benchmark so that the "Two fields" case does the
same thing as the "Tee field" case.

          next = new Token();
          result = new ModuloTokenFilter(new StandardFilter(
              new StandardTokenizer(new StringReader(buffer.toString()))),
              modCounts[j]);
          while ((next = result.next(next)) != null) {
+           next.clone(); // simulate what sink does
            tfPos += next.getPositionIncrement();
          }
        }
        long finish = System.currentTimeMillis();
        System.out.println("ModCount: " + modCounts[j]
            + " Two fields took " + (finish - start) + " ms");

This has some effect, but much less than tweaking JVM settings:

MacBook running

-server -Xms64M -Xmx256M

-----Tokens: 50000-----
ModCount: 1 Two fields took 1943 ms
ModCount: 1 Tee fields took 1172 ms
ModCount: 2 Two fields took 833 ms
ModCount: 2 Tee fields took 759 ms
ModCount: 5 Two fields took 632 ms
ModCount: 5 Tee fields took 473 ms

-client -Xmx256M

-----Tokens: 50000-----
ModCount: 1 Two fields took 2025 ms
ModCount: 1 Tee fields took 2537 ms
ModCount: 2 Two fields took 1535 ms
ModCount: 2 Tee fields took 1479 ms
ModCount: 5 Two fields took 1314 ms
ModCount: 5 Tee fields took 1088 ms

Then I cut down the time spent even more by rewriting Token#clone:

-server -Xms64M -Xmx256M

-----Tokens: 50000-----
ModCount: 1 Two fields took 1263 ms
ModCount: 1 Tee fields took 854 ms
ModCount: 2 Two fields took 692 ms
ModCount: 2 Tee fields took 562 ms
ModCount: 5 Two fields took 799 ms
ModCount: 5 Tee fields took 432 ms

-client -Xmx256M

ModCount: 1 Two fields took 1600 ms
ModCount: 1 Tee fields took 1981 ms
ModCount: 2 Two fields took 1363 ms
ModCount: 2 Tee fields took 1244 ms
ModCount: 5 Two fields took 1252 ms
ModCount: 5 Tee fields took 896 ms


//  public Object clone() {
//    try {
//      Token t = (Token)super.clone();
//      if (termBuffer != null) {
//        t.termBuffer = null;
//        t.setTermBuffer(termBuffer, 0, termLength);
//      }
//      if (payload != null) {
//        t.setPayload((Payload) payload.clone());
//      }
//      return t;
//    } catch (CloneNotSupportedException e) {
//      throw new RuntimeException(e);  // shouldn't happen
//    }
//  }

   public Object clone() {
     Token t = new Token();
     t.setStartOffset(startOffset);
     t.setEndOffset(endOffset);
     t.setType(type);
     // also copy the position increment, which super.clone() got for free
     t.setPositionIncrement(positionIncrement);
     if (termBuffer != null) {
       t.termBuffer = null;
       t.setTermBuffer(termBuffer, 0, termLength);
     }
     if (payload != null) {
       t.setPayload((Payload) payload.clone());
     }
     return t;
   }

>>> Is my test/thinking valid on this idea?

Yes, I think so.


-- 
karl





Re: TeeTokenFilter performance testing

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 18, 2007, at 2:55 AM, Karl Wettin wrote:

>
> On 17 Dec 2007, at 05:40, Grant Ingersoll wrote:
>
>> a somewhat common case whereby two or more fields share a fair  
>> number of common analysis steps.
>
> Right.
>
>> For the smaller token counts, any performance difference is  
>> negligible.  However, even at 500 tokens, one starts to see a  
>> difference.  The first thing to note is that TeeTokenFilter (TTF)  
>> is much _slower_ in the case that all tokens are siphoned off (X =  
>> 1).  I believe the reason is the cost of Token.clone()
>
> I might be missing something here, but why do you clone? I usually  
> fill a List<Token> with the same instances and then only clone the  
> tokens I need to update. The same token instances are used in  
> multiple fields and queries at the same time and I never had any  
> problems with that. Should I be expecting some?

Because the Token is changing and I am not saving all Tokens, just the  
ones changed. 
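
In other words: with the reuse-style next(Token), the consumer passes
the same Token instance back in on every call, so a sink that kept
tokens by reference would see each saved token overwritten by the next
one.  A tiny sketch of the hazard (saved is a hypothetical List):

   Token next = new Token();
   while ((next = result.next(next)) != null) {
     // saved.add(next);      // broken: every entry is the same,
     //                       // repeatedly-overwritten instance
     saved.add(next.clone()); // correct: keep a stable copy
   }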



Re: TeeTokenFilter performance testing

Posted by Karl Wettin <ka...@gmail.com>.
On 17 Dec 2007, at 05:40, Grant Ingersoll wrote:

> a somewhat common case whereby two or more fields share a fair  
> number of common analysis steps.

Right.

> For the smaller token counts, any performance difference is  
> negligible.  However, even at 500 tokens, one starts to see a  
> difference.  The first thing to note is that TeeTokenFilter (TTF) is  
> much _slower_ in the case that all tokens are siphoned off (X = 1).   
> I believe the reason is the cost of Token.clone()

I might be missing something here, but why do you clone? I usually  
fill a List<Token> with the same instances and then only clone the  
tokens I need to update. The same token instances are used in multiple  
fields and queries at the same time and I never had any problems with  
that. Should I be expecting some?
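
A rough sketch of that pattern (hypothetical; note it relies on the old
no-argument next(), which hands back a fresh Token per call, rather
than the reusing next(Token)):

   // cache the stream's tokens once; the same instances can then be
   // fed to several fields
   List<Token> cached = new ArrayList<Token>();
   Token t;
   while ((t = stream.next()) != null) {
     cached.add(t);
   }
   // clone only when a cached token actually needs to change
   Token changed = (Token) cached.get(0).clone();
   changed.setPositionIncrement(0);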

-- 
karl
