Posted to dev@lucenenet.apache.org by "Michael Garski (JIRA)" <ji...@apache.org> on 2010/01/19 18:37:54 UTC

[jira] Created: (LUCENENET-337) TokenAttribute for Selectively Including Tokens in Length Norm

TokenAttribute for Selectively Including Tokens in Length Norm
--------------------------------------------------------------

                 Key: LUCENENET-337
                 URL: https://issues.apache.org/jira/browse/LUCENENET-337
             Project: Lucene.Net
          Issue Type: Improvement
            Reporter: Michael Garski
            Priority: Minor


This patch adds functionality to Lucene.Net that allows a TokenFilter to mark a Token as excluded from the length norm calculation, through a new TokenAttribute interface, LengthNormAttribute, and a corresponding implementation, LengthNormAttributeImpl.  This is useful for keeping injected synonyms from inflating the field length used for the length norm, particularly when there are many synonyms relative to the number of original tokens.

Following is an example of how to use the new attribute.

Within your custom TokenFilter, define a field to hold a reference to the attribute and set its value in the constructor.  When the stream advances to a new Token within the call to IncrementToken(), set the IncludeInLengthNorm property of the attribute to false for Tokens that should not be included in the length norm calculation.  The property defaults to true and is reset to true after each Token is consumed within DocInverterPerField.ProcessFields.

{code:title=CustomTokenFilter.cs|borderStyle=solid}
public class CustomTokenFilter : TokenFilter
{
	private LengthNormAttribute lnAttribute;
	
	public CustomTokenFilter(TokenStream input) : base(input)
	{
		this.lnAttribute = (LengthNormAttribute)AddAttribute(typeof(LengthNormAttribute));
	}
		
	public override bool IncrementToken()
	{
		if (input.IncrementToken())
		{
			// Decide here whether the current token should count toward
			// the length norm. This example excludes every token.
			this.lnAttribute.IncludeInLengthNorm = false;

			return true;
		}
		else
		{
			return false;
		}
	}    
}
{code}
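
To put rough numbers on the effect described above, here is a minimal sketch that is not part of the patch. It assumes the 1/sqrt(numTerms) formula that DefaultSimilarity uses for its length norm; the class name LengthNormDemo and the token counts are made up for illustration.

```csharp
using System;

public static class LengthNormDemo
{
    // Assumption: same formula as DefaultSimilarity's length norm,
    // 1 / sqrt(number of terms counted for the field).
    public static float LengthNorm(int numTerms)
    {
        return (float)(1.0 / Math.Sqrt(numTerms));
    }

    public static void Main()
    {
        int originalTokens = 4;    // hypothetical field with four source tokens
        int injectedSynonyms = 12; // hypothetical: three synonyms per token

        // Every token counted: 16 terms, so the norm evaluates to 0.25.
        float allCounted = LengthNorm(originalTokens + injectedSynonyms);

        // Only the original tokens counted: 4 terms, so the norm evaluates to 0.5.
        float originalsOnly = LengthNorm(originalTokens);

        Console.WriteLine("all tokens counted:    " + allCounted);
        Console.WriteLine("original tokens only:  " + originalsOnly);
    }
}
```

In this made-up case the injected synonyms halve the norm, which is the kind of penalty that marking them with IncludeInLengthNorm = false is meant to avoid.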



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (LUCENENET-337) TokenAttribute for Selectively Including Tokens in Length Norm

Posted by "Artem Chereisky (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENENET-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828450#action_12828450 ] 

Artem Chereisky commented on LUCENENET-337:
-------------------------------------------

Michael, thank you for the patch. The SynonymFilter I have is implemented with the TokenStream.Next method, which is obsolete in 2.9. Are you aware of any port of SynonymFilter to 2.9 that uses the IncrementToken() and AttributeSource APIs instead?


[jira] Commented: (LUCENENET-337) TokenAttribute for Selectively Including Tokens in Length Norm

Posted by "Michael Garski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENENET-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843812#action_12843812 ] 

Michael Garski commented on LUCENENET-337:
------------------------------------------

There is an alternate way to avoid having synonyms add to the length norm of a field, which is currently in the trunk.

Rather than using the default Similarity as-is, call SetDiscountOverlaps(true) on DefaultSimilarity; overlapping tokens such as synonyms will then not be counted toward the length norm.

That being said, there are cases where the approach in the patch works better, such as when you wish to maintain positional information but base the length norm on something other than the number of tokens, for example when using synonyms with acronyms.  This is a case I run into quite a bit with names of musicians and bands.
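
One reading of the acronym case, as a self-contained sketch rather than actual Lucene.Net code: the Tok struct, the counting helpers, and the sample expansion are all hypothetical. It assumes that discounting overlaps skips only tokens whose position increment is 0, while the attribute from the patch lets a filter exclude any token it chooses.

```csharp
using System;
using System.Collections.Generic;

public static class NormCountingDemo
{
    // Hypothetical stand-in for a token: term text, position increment,
    // and the IncludeInLengthNorm flag proposed in the patch.
    public struct Tok
    {
        public string Term;
        public int PositionIncrement;
        public bool IncludeInLengthNorm;

        public Tok(string term, int positionIncrement, bool includeInLengthNorm)
        {
            Term = term;
            PositionIncrement = positionIncrement;
            IncludeInLengthNorm = includeInLengthNorm;
        }
    }

    // Default behaviour: every token adds to the field length.
    public static int CountAll(IList<Tok> tokens)
    {
        return tokens.Count;
    }

    // Assumed effect of discounting overlaps: tokens stacked on the
    // previous position (increment 0) are skipped.
    public static int CountDiscountingOverlaps(IList<Tok> tokens)
    {
        int n = 0;
        foreach (Tok t in tokens)
        {
            if (t.PositionIncrement != 0) n++;
        }
        return n;
    }

    // Approach in the patch: only tokens still flagged for inclusion count.
    public static int CountFlagged(IList<Tok> tokens)
    {
        int n = 0;
        foreach (Tok t in tokens)
        {
            if (t.IncludeInLengthNorm) n++;
        }
        return n;
    }

    // Hypothetical input "nin live", with the acronym expanded to a
    // multi-word synonym that keeps its own positions.
    public static List<Tok> SampleTokens()
    {
        return new List<Tok>
        {
            new Tok("nin",   1, true),
            new Tok("nine",  0, false), // overlaps "nin"
            new Tok("inch",  1, false), // advances position, so not an overlap
            new Tok("nails", 1, false), // advances position, so not an overlap
            new Tok("live",  1, true),
        };
    }

    public static void Main()
    {
        List<Tok> tokens = SampleTokens();
        Console.WriteLine("all tokens:         " + CountAll(tokens));
        Console.WriteLine("discount overlaps:  " + CountDiscountingOverlaps(tokens));
        Console.WriteLine("flagged by filter:  " + CountFlagged(tokens));
    }
}
```

Under these assumptions the three strategies count 5, 4, and 2 tokens respectively: discounting overlaps still charges the field for "inch" and "nails", whereas the per-token flag keeps the count at the two tokens that were in the source text.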


[jira] Commented: (LUCENENET-337) TokenAttribute for Selectively Including Tokens in Length Norm

Posted by "Artem Chereisky (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENENET-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828841#action_12828841 ] 

Artem Chereisky commented on LUCENENET-337:
-------------------------------------------

I found a Java version of a multi-word synonym filter, http://www.java2s.com/Open-Source/Java-Document/Search-Engine/apache-solr-1.2.0/org/apache/solr/analysis/SynonymFilter.java.htm, and coded it in C#. I thought it was a de facto standard. Now I'm beginning to realize there is no standard.

The issue is that it uses a look-ahead method to determine the longest possible match, and I can't figure out how to do look-ahead using IncrementToken().


[jira] Updated: (LUCENENET-337) TokenAttribute for Selectively Including Tokens in Length Norm

Posted by "Michael Garski (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENENET-337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Garski updated LUCENENET-337:
-------------------------------------

    Attachment: LengthNorm.patch

Patch is attached.  Note that the updated project file has been converted to the VS2008 format, so if you are using VS2005 you will want to skip updating the project file and manually add the two new files to the project:

Analysis\Tokenattributes\LengthNormAttribute.cs
Analysis\Tokenattributes\LengthNormAttributeImpl.cs


[jira] Commented: (LUCENENET-337) TokenAttribute for Selectively Including Tokens in Length Norm

Posted by "Michael Garski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENENET-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828668#action_12828668 ] 

Michael Garski commented on LUCENENET-337:
------------------------------------------

What SynonymFilter are you referring to?  I'm not aware of any; I use the functionality in the patch in a few different TokenFilters of our own.


