Posted to java-user@lucene.apache.org by Teruhiko Kurosaka <Ku...@basistech.com> on 2008/08/22 21:47:32 UTC

How do TeeTokenizer and SinkTokenizer work?

Hello,
I'm interested in knowing how these tokenizers work together.
The API doc for TeeTokenFilter
http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/TeeTokenFilter.html

has this sample code:
SinkTokenizer sink1 = new SinkTokenizer(null);
SinkTokenizer sink2 = new SinkTokenizer(null);

TokenStream source1 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader1), sink1), sink2);
TokenStream source2 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader2), sink1), sink2);

TokenStream final3 = new EntityDetect(sink1);
TokenStream final4 = new URLDetect(sink2);

with an explanation that reads "sink1 and sink2 will both get tokens from both reader1 and reader2 after whitespace tokenizer",
but I don't understand how the input from reader1 and reader2 is mixed together.
Will sink1 first return the reader1 text, and then the reader2 text?
Or are they mixed randomly?

-Kuro
 



Re: How do TeeTokenizer and SinkTokenizer work?

Posted by Grant Ingersoll <gs...@apache.org>.
On Aug 25, 2008, at 7:29 PM, Teruhiko Kurosaka wrote:

> Thank you, Grant and (Koji) Sekiguchi-san.
>
>
>>> but I don't understand how the input from reader1 and reader2 is
>>> mixed together.
>>> Will sink1 first return the reader1 text, and then the reader2 text?
>>
>> It depends on the order the fields are added.  If source1 is
>> used first, then reader1 will be first.
>
> This puzzles me.  Is this really useful if how SinkTokenizer
> and TeeTokenizer behave depends on the order in which they are read?

Fields in a Document are added as a List, so the Field ordering is  
always the same.
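
If you want to convince yourself, something like the snippet below
(2.3-era API; the field names are arbitrary and just for illustration)
prints the names back in add() order, since getFields() exposes the
underlying List:

import java.util.Iterator;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Fieldable;

public class FieldOrder {
  public static void main(String[] args) {
    Document doc = new Document();
    doc.add(new Field("body",   "a b c", Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("person", "x y z", Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("org",    "p q r", Field.Store.YES, Field.Index.TOKENIZED));
    // getFields() returns the fields in the order they were add()'ed
    for (Iterator it = doc.getFields().iterator(); it.hasNext();) {
      System.out.println(((Fieldable) it.next()).name());  // body, person, org
    }
  }
}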

>
> I've read the source code of these Tokenizers but that
> didn't answer my question.
>
> This is an excerpt from Sekiguchi-san's code sample:
>
> 	Analyzer analyzer = new Analyzer() {
>
> 		public TokenStream tokenStream(String field, Reader in) {
> 				return new TeeTokenFilter(
> 					new TeeTokenFilter( new SenTokenizer( in, SEN_CONF ),
> 								sinkPerson ), sinkOrg );
> 		}
> 	};
>
> 	TokenFilter exPerson = new EntityExtractor( sinkPerson, T_PERSON );
> 	TokenFilter exOrg = new EntityExtractor( sinkOrg, T_ORG );
> 	IndexWriter writer = new IndexWriter( INDEX, analyzer, true );
> 	Document doc = new Document();
> 	doc.add( new Field( F_BODY, CONTENT, Store.YES, Index.TOKENIZED ) );
> 	doc.add( new Field( F_PERSON, exPerson ) );
> 	doc.add( new Field( F_ORG, exOrg ) );
> 	writer.addDocument( doc );
>
> It seems that the code works as expected only if the token stream from
> the analyzer on CONTENT is read completely, then the token stream from
> sinkPerson is read completely, followed by that from sinkOrg.
>
> Does Lucene's core guarantee that a field's token stream is read completely
> before the next field's token stream is read, in the order the Fields are add()'ed?

Yes, it processes all of one Field first, then the next one.  If it
doesn't, then we have a bug, IMO, and we would need a different
approach for the Tee/Sink.
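
To make that concrete, the pattern is to add() the teed source field
first and the sink-backed fields after it, so the sinks are already
filled by the time their streams are pulled.  A rough sketch against
the 2.3-era API (the field names, index path and analyzer wiring below
are made up for illustration; it just mirrors Sekiguchi-san's excerpt):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SinkTokenizer;
import org.apache.lucene.analysis.TeeTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class TeeSinkOrder {
  public static void main(String[] args) throws Exception {
    // Every token of the "body" field is also copied into this sink.
    final SinkTokenizer sink = new SinkTokenizer(null);

    Analyzer analyzer = new Analyzer() {
      public TokenStream tokenStream(String field, Reader in) {
        return new TeeTokenFilter(new WhitespaceTokenizer(in), sink);
      }
    };

    // Index path is arbitrary for this sketch.
    IndexWriter writer = new IndexWriter("/tmp/tee-sink-demo", analyzer, true);
    Document doc = new Document();
    // 1) The source field: consuming its token stream is what fills the sink.
    doc.add(new Field("body", "some content here", Field.Store.YES, Field.Index.TOKENIZED));
    // 2) The sink-backed field: it is read only after "body" is exhausted,
    //    because fields are processed completely, one at a time, in add() order.
    doc.add(new Field("copy", sink));
    writer.addDocument(doc);
    writer.close();
  }
}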

-Grant



RE: How do TeeTokenizer and SinkTokenizer work?

Posted by Teruhiko Kurosaka <Ku...@basistech.com>.
Thank you, Grant and (Koji) Sekiguchi-san.


> > but I don't understand how the input from reader1 and reader2 is
> > mixed together.
> > Will sink1 first return the reader1 text, and then the reader2 text?
> 
> It depends on the order the fields are added.  If source1 is 
> used first, then reader1 will be first.

This puzzles me.  Is this really useful if how SinkTokenizer
and TeeTokenizer behave depends on the order in which they are read?
I've read the source code of these Tokenizers but that
didn't answer my question.

This is an excerpt from Sekiguchi-san's code sample:

	Analyzer analyzer = new Analyzer() {

		public TokenStream tokenStream(String field, Reader in) {
				return new TeeTokenFilter( 
					new TeeTokenFilter( new SenTokenizer( in, SEN_CONF ), 
								sinkPerson ), sinkOrg );
		}
	};

	TokenFilter exPerson = new EntityExtractor( sinkPerson, T_PERSON );
	TokenFilter exOrg = new EntityExtractor( sinkOrg, T_ORG );
	IndexWriter writer = new IndexWriter( INDEX, analyzer, true );
	Document doc = new Document();
	doc.add( new Field( F_BODY, CONTENT, Store.YES, Index.TOKENIZED ) );
	doc.add( new Field( F_PERSON, exPerson ) );
	doc.add( new Field( F_ORG, exOrg ) );
	writer.addDocument( doc );

It seems that the code works as expected only if the token stream from
the analyzer on CONTENT is read completely, then the token stream from
sinkPerson is read completely, followed by that from sinkOrg.

Does Lucene's core guarantee that a field's token stream is read completely
before the next field's token stream is read, in the order the Fields are add()'ed?

- Kuro



Re: How do TeeTokenizer and SinkTokenizer work?

Posted by Grant Ingersoll <gs...@apache.org>.
On Aug 22, 2008, at 3:47 PM, Teruhiko Kurosaka wrote:

> Hello,
> I'm interested in knowing how these tokenizers work together.
> The API doc for TeeTokenFilter
> http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/TeeTokenFilter.html
>
> has this sample code:
> SinkTokenizer sink1 = new SinkTokenizer(null);
> SinkTokenizer sink2 = new SinkTokenizer(null);
>
> TokenStream source1 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader1), sink1), sink2);
> TokenStream source2 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader2), sink1), sink2);
>
> TokenStream final3 = new EntityDetect(sink1);
> TokenStream final4 = new URLDetect(sink2);
>
> with an explanation that reads "sink1 and sink2 will both get tokens  
> from both reader1 and reader2 after whitespace tokenizer",
> but I don't understand how the input from reader1 and reader2 is
> mixed together.
> Will sink1 first return the reader1 text, and then the reader2 text?

It depends on the order the fields are added.  If source1 is used  
first, then reader1 will be first.

Try out the code at the bottom.  I get the following if source1 is  
first:

------
final 1
(a,0,1)
(b,2,3)
(c,4,5)
(d,6,7)
(f,8,9)
(g,10,11)
-------- end final 1 -------
------
final 2
(h,0,1)
(i,2,3)
(J,4,5)
(k,6,7)
(L,8,9)
(m,10,11)
-------- end final 2 -------
------
final 3
(a,0,1)
(c,4,5)
(F,8,9)
(g,10,11)
(h,0,1)
(i,2,3)
(J,4,5)
(k,6,7)
(L,8,9)
(m,10,11)
-------- end final 3 -------
------
final 4
(a,0,1)
(b,2,3)
(c,4,5)
(d,6,7)
(F,8,9)
(h,0,1)
(i,2,3)
(J,4,5)
(k,6,7)
(L,8,9)
-------- end final 4 -------

and this if final2 is first:

------
final 2
(h,0,1)
(i,2,3)
(J,4,5)
(k,6,7)
(L,8,9)
(m,10,11)
-------- end final 2 -------
------
final 1
(a,0,1)
(b,2,3)
(c,4,5)
(d,6,7)
(f,8,9)
(g,10,11)
-------- end final 1 -------
------
final 3
(h,0,1)
(i,2,3)
(J,4,5)
(k,6,7)
(L,8,9)
(m,10,11)
(a,0,1)
(c,4,5)
(F,8,9)
(g,10,11)
-------- end final 3 -------
------
final 4
(h,0,1)
(i,2,3)
(J,4,5)
(k,6,7)
(L,8,9)
(a,0,1)
(b,2,3)
(c,4,5)
(d,6,7)
(F,8,9)
-------- end final 4 -------




import java.io.IOException;
import java.io.StringReader;

import junit.framework.TestCase;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.SinkTokenizer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TeeTokenFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class SinkTest extends TestCase {
   public void testSink() throws Exception {
     StringReader reader1 = new StringReader("a b c d F g");
     StringReader reader2 = new StringReader("h i J k L m");

     SinkTokenizer sink1 = new SinkTokenizer(null);
     SinkTokenizer sink2 = new SinkTokenizer(null);

     // Every token read from source1/source2 is also copied into sink1 and sink2.
     TokenStream source1 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader1), sink1), sink2);
     TokenStream source2 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader2), sink1), sink2);

     TokenStream final1 = new LowerCaseFilter(source1);
     TokenStream final2 = source2;
     String[] stops1 = {"b", "d"};
     TokenStream final3 = new StopFilter(sink1, stops1);
     String[] stops2 = {"m", "g"};
     TokenStream final4 = new StopFilter(sink2, stops2);

     // The sinks are filled in the order the sources are consumed, so swapping
     // the next two lines changes the order of the tokens seen by final3/final4.
     printTokens(final1, "final 1");
     printTokens(final2, "final 2");

     printTokens(final3, "final 3");
     printTokens(final4, "final 4");
   }

   private void printTokens(TokenStream input, String label) throws IOException {
     Token next = new Token();
     System.out.println("------");
     System.out.println(label);
     while ((next = input.next(next)) != null) {
       System.out.println(next);
     }
     System.out.println("-------- end " + label + " -------");
   }
}




Re: How do TeeTokenizer and SinkTokenizer work?

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
Hi Kurosaka-san,

I wrote an article on my blog several months ago about SinkTokenizer
and TeeTokenFilter.

See:
http://lucene.jugem.jp/?eid=172

Sorry, but it is all written in Japanese...

Koji


Teruhiko Kurosaka wrote:
> Hello,
> I'm interested in knowing how these tokenizers work together.
> The API doc for TeeTokenFilter
> http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/TeeTokenFilter.html
>
> has this sample code:
> SinkTokenizer sink1 = new SinkTokenizer(null);
> SinkTokenizer sink2 = new SinkTokenizer(null);
>
> TokenStream source1 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader1), sink1), sink2);
> TokenStream source2 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader2), sink1), sink2);
>
> TokenStream final3 = new EntityDetect(sink1);
> TokenStream final4 = new URLDetect(sink2);
>
> with an explanation that reads "sink1 and sink2 will both get tokens from both reader1 and reader2 after whitespace tokenizer",
> but I don't understand how the input from reader1 and reader2 is mixed together.
> Will sink1 first return the reader1 text, and then the reader2 text?
> Or are they mixed randomly?
>
> -Kuro

