Posted to solr-user@lucene.apache.org by Robert Gründler <ro...@dubture.com> on 2010/11/11 00:39:12 UTC

Concatenate multiple tokens into one

Hi,

I've created the following filter chain in a field type; the idea is to use it for autocompletion purposes:

<tokenizer class="solr.WhitespaceTokenizerFactory"/> <!-- create tokens separated by whitespace -->
<filter class="solr.LowerCaseFilterFactory"/> <!-- lowercase everything -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />  <!-- throw out stopwords -->
<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />  <!-- throw out all everything except a-z -->

<!-- actually, here i would like to join multiple tokens together again, to provide one token for the EdgeNGramFilterFactory -->

<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" /> <!-- create edgeNGram tokens for autocomplete matches -->

With that filter chain, the EdgeNGramFilterFactory receives multiple tokens for input strings that contain whitespace. This leads to the following results:
Input Query: "George Cloo"
Matches:
- "George Harrison"
- "John Clooridge"
- "George Smith"
- "George Clooney"
- etc.

However, only "George Clooney" should match in the autocompletion use case.
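
The reason is that the edge n-grams are built per token. For illustration, here is roughly what the chain produces at index time for the value "George Clooney" (sketched by hand, not copied from the analysis page):

"George Clooney"
-> WhitespaceTokenizer + LowerCaseFilter:  george | clooney
-> EdgeNGramFilter:                        g, ge, geo, ..., george, c, cl, clo, cloo, ..., clooney

A query token like "cloo" therefore matches every document containing any word that starts with "cloo", e.g. "John Clooridge".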
Therefore, I'd like to add a filter before the EdgeNGramFilterFactory which concatenates all the tokens generated by the WhitespaceTokenizerFactory.
Are there filters that can do such a thing?

If not, are there examples of how to implement a custom TokenFilter?

thanks!

-robert


 


Re: Concatenate multiple tokens into one

Posted by Robert Gründler <ro...@dubture.com>.
This is the full source code, but be warned: I'm not a Java developer, and I have no background in Lucene/Solr development:

// ConcatFilter

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

/**
 * Concatenates all "word" tokens of the input stream into a single token.
 */
public class ConcatFilter extends TokenFilter {

  private final TermAttribute termAttribute;
  private final TypeAttribute typeAttribute;
  private boolean exhausted = false;

  public ConcatFilter(TokenStream input) {
    super(input);
    // a TokenFilter shares its attribute instances with the input stream
    termAttribute = addAttribute(TermAttribute.class);
    typeAttribute = addAttribute(TypeAttribute.class);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (exhausted) {
      return false;
    }
    exhausted = true;

    StringBuilder builder = new StringBuilder();
    boolean hasToken = false;

    // consume the entire input stream, collecting the terms of all "word" tokens
    while (input.incrementToken()) {
      if ("word".equals(typeAttribute.type())) {
        builder.append(termAttribute.term());
        hasToken = true;
      }
    }

    if (!hasToken) {
      return false;
    }

    // emit the concatenation as the one and only token of this stream
    clearAttributes();
    termAttribute.setTermBuffer(builder.toString());
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    exhausted = false;
  }
}

// ConcatFilterFactory

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class ConcatFilterFactory extends BaseTokenFilterFactory {

  @Override
  public TokenStream create(TokenStream stream) {
    return new ConcatFilter(stream);
  }
}
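
Before wiring it into Solr, you can sanity-check the filter from plain Java. A minimal sketch (class name and test input are made up for illustration; it assumes the jars listed below are on the classpath):

// ConcatFilterSmokeTest

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class ConcatFilterSmokeTest {

  public static void main(String[] args) throws Exception {
    // two whitespace-separated tokens in, one concatenated token out
    TokenStream stream = new ConcatFilter(new WhitespaceTokenizer(new StringReader("beastie boys")));
    TermAttribute term = stream.getAttribute(TermAttribute.class);
    while (stream.incrementToken()) {
      System.out.println(term.term());  // prints: beastieboys
    }
  }
}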

In your schema.xml, you can then add the filter factory using this element:

<filter class="com.example.ConcatFilterFactory" />

Jar files I have included in the build path (they can be found in the Solr download package):

apache-solr-core-1.4.1.jar
lucene-analyzers-2.9.3.jar
lucene-core-2.9.3.jar


good luck ;)


-robert






Re: Concatenate multiple tokens into one

Posted by Nick Martin <ia...@googlemail.com>.
Thanks Robert, I had been trying to get your ConcatFilter to work, but I'm not sure what I need in the classpath or where Token comes from.
Will check the thread you mention.

Best

Nick



Re: Concatenate multiple tokens into one

Posted by Robert Gründler <ro...@dubture.com>.
I've posted a ConcatFilter in my previous mail which does concatenate tokens. This works fine, but I
realized that what I wanted to achieve is easier to implement in another way (by using two separate field types).

Have a look at a previous mail I wrote to the list and the reply from Ahmet Arslan (topic: "EdgeNGram relevancy").
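
(The details are in that thread, but the idea is roughly the following sketch; field and type names here are made up:)

<field name="title" type="text" indexed="true" stored="true"/>
<field name="title_autocomplete" type="edgytext" indexed="true" stored="false"/>
<copyField source="title" dest="title_autocomplete"/>

One field keeps the normally tokenized text for relevancy, while the other is analyzed for prefix matching.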


best


-robert






Re: Concatenate multiple tokens into one

Posted by Nick Martin <ia...@googlemail.com>.
Hi Robert, All,

I have a similar problem; here is my fieldType: http://paste.pocoo.org/show/289910/
I want to include stopword removal and lowercase the incoming terms, the idea being to take "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the EdgeNGram filter factory.
If anyone can tell me a simple way to concatenate tokens into one token again, similar to the KeywordTokenizer, that would be super helpful.
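
In other words, the analysis I'm after (assuming "ltd" ends up removed as a stopword):

"Foo Bar Baz Ltd"
-> tokenize + lowercase + stopword removal:  foo | bar | baz
-> concatenate:                              foobarbaz
-> EdgeNGramFilter:                          f, fo, foo, foob, fooba, foobar, ...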

Many thanks

Nick



Re: Concatenate multiple tokens into one

Posted by Robert Gründler <ro...@dubture.com>.
On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:

> Are you sure you really want to throw out stopwords for your use case? I don't think autocompletion will work the way you want if you do.

In our case I think it makes sense. The content is targeting the electronic music / DJ scene, so we have a lot of words like "DJ" or "featuring" which
make sense to throw out of the query. Also, searches for "the beastie boys" and "beastie boys" should both return a match in the autocompletion.

> 
> And if you don't... then why use the WhitespaceTokenizer and then try to jam the tokens back together? Why not just NOT tokenize in the first place? Use the KeywordTokenizer, which really should be called the NonTokenizingTokenizer, because it doesn't tokenize at all; it just creates one token from the entire input string.

I started out with the KeywordTokenizer, which worked well except for the stopword problem: since the KeywordTokenizer emits the whole input as a single token, the StopFilter never sees the individual words and can't remove them.

For now, I've come up with a quick-and-dirty custom "ConcatFilter" which does what I'm after:

public class ConcatFilter extends TokenFilter {

	protected ConcatFilter(TokenStream input) {
		super(input);
	}

	@Override
	public Token next() throws IOException {

		Token token = new Token();
		StringBuilder builder = new StringBuilder();

		// the filter shares its attribute instances with the input stream
		TermAttribute termAttribute = (TermAttribute) input.getAttribute(TermAttribute.class);
		TypeAttribute typeAttribute = (TypeAttribute) input.getAttribute(TypeAttribute.class);

		boolean hasToken = false;

		// consume the whole input stream, concatenating the "word" tokens
		while (input.incrementToken()) {
			if (typeAttribute.type().equals("word")) {
				builder.append(termAttribute.term());
				hasToken = true;
			}
		}

		// only emit a token if we actually collected something
		if (hasToken) {
			token.setTermBuffer(builder.toString());
			return token;
		}

		return null;
	}
}

I'm not sure if this is a safe way to do this, as I'm not familiar with the whole Solr/Lucene implementation.


best


-robert






RE: Concatenate multiple tokens into one

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Are you sure you really want to throw out stopwords for your use case? I don't think autocompletion will work the way you want if you do.

And if you don't... then why use the WhitespaceTokenizer and then try to jam the tokens back together? Why not just NOT tokenize in the first place? Use the KeywordTokenizer, which really should be called the NonTokenizingTokenizer, because it doesn't tokenize at all; it just creates one token from the entire input string.

Then lowercase, remove whitespace (or not), do whatever else you want to do to your single token to normalize it, and then edgengram it. 

If you include whitespace in the token, then when making your queries for auto-complete, be sure to use a query parser that doesn't do "pre-tokenization"; the 'field' query parser should work well for this.
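
For example, something like this (untested; the field name and the exact pattern are made up for illustration):

<analyzer type="index">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement="" replace="all"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement="" replace="all"/>
</analyzer>

and then a query through the 'field' parser, which hands the raw string to the field's query analyzer instead of pre-tokenizing it:

q={!field f=name_autocomplete}George Cloo

The query side analyzes "George Cloo" to "georgecloo", which matches the indexed edge n-gram of "George Clooney".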

Jonathan


