Posted to java-user@lucene.apache.org by oh...@cox.net on 2009/07/31 04:12:22 UTC

Is there a list of "special" characters for standard analyzer?

Hi,

I was wondering whether there is a list of special characters for the standard analyzer?

What I mean by "special" is characters that the analyzer treats as break characters.  For example, if I have something like "foo=something", the analyzer apparently treats this as two terms, "foo" and "something".

Thanks,
Jim

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Is there a list of "special" characters for standard analyzer?

Posted by Simon Willnauer <si...@googlemail.com>.
On Fri, Jul 31, 2009 at 5:00 PM, <oh...@cox.net> wrote:
> Hi Ahmet,
>
> Thanks for the clarification and information!  That was exactly what I was looking for.
>
> Jim
>
>
> ---- AHMET ARSLAN <io...@yahoo.com> wrote:
>>
>> > I guess that the obvious question is "Which characters are
>> > considered 'punctuation characters'?".
>>
>> Punctuation = ("_"|"-"|"/"|"."|",")
Those punctuation characters apply only to tokens such as floating-point
numbers, IP addresses, etc. StandardTokenizer does not have an explicit
punctuation set. You can assume that it will split on, and drop, almost
all punctuation that appears in the input string.

Have a look at StandardTokenizerImpl.jflex; the grammar is quite easy to
understand and gives you a better idea of what this tokenizer does.

simon


Re: Is there a list of "special" characters for standard analyzer?

Posted by oh...@cox.net.
Hi Ahmet,

Thanks for the clarification and information!  That was exactly what I was looking for.

Jim




Re: Is there a list of "special" characters for standard analyzer?

Posted by AHMET ARSLAN <io...@yahoo.com>.
> I guess that the obvious question is "Which characters are
> considered 'punctuation characters'?".
 
Punctuation = ("_"|"-"|"/"|"."|",")

> In particular, does the analyzer consider "=" (equal) and
> ":" (colon) to be punctuation characters?

":" is a special character in QueryParser (if you are using it). If you want to search for it, you need to escape it first. At index time this character is dropped, like the punctuation characters. The string ahmet:arslan will produce two tokens, ahmet and arslan. The analyzer also breaks words at the "=" character at both query and index time.
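
As an illustration of the escaping step, here is a minimal stand-alone sketch in plain Java. It does not use Lucene; the class name QueryEscapeSketch and the exact set of special characters are assumptions for illustration (the set varies between Lucene versions), so in real code the QueryParser.escape(String) method that ships with Lucene is probably the better choice.

```java
// Stand-alone sketch, no Lucene dependency.
// ASSUMPTION: the special-character set below is illustrative and may not
// match the QueryParser of your Lucene version.
class QueryEscapeSketch {
    // Characters assumed to be special to the query parser.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?";

    /** Prefixes every assumed-special character with a backslash. */
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("ahmet:arslan"));
        System.out.println(escape("foo=something"));
    }
}
```

Note that "=" is left alone here because it is not query syntax. Escaping only gets a character past the query parser; the analyzer would still be expected to split the escaped text into separate tokens, as described above.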

If you want to understand the behavior of StandardTokenizer, you need to look at the file StandardTokenizerImpl.jflex. It recognizes each of the following as one token: {ALPHANUM}, {APOSTROPHE}, {ACRONYM}, {COMPANY}, {EMAIL}, {HOST}, {NUM}, {CJ}, {ACRONYM_DEP}, and ignores the rest. The file defines these token types in a notation similar to regular expressions. You can change the behavior of StandardTokenizer by editing this file and regenerating StandardTokenizerImpl.java from it. There is also another jflex file, WikipediaTokenizerImpl.jflex; by looking at it you can see how new token types can be added.
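
To make the "recognize some token types, ignore the rest" idea concrete, here is a rough sketch using java.util.regex. It is not the real grammar: the class TokenTypeSketch and its two patterns are hypothetical, heavily simplified stand-ins for {ALPHANUM} and {EMAIL}, and java.util.regex tries alternatives in order rather than picking the longest match the way a JFlex scanner does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-alone sketch, no Lucene dependency.
// ASSUMPTION: both patterns are simplified illustrations, not the
// definitions found in StandardTokenizerImpl.jflex.
class TokenTypeSketch {
    // One or more letters or digits.
    private static final String ALPHANUM = "[\\p{L}\\p{N}]+";
    // A very loose email shape: local part, "@", host with at least one dot or hyphen part.
    private static final String EMAIL =
        ALPHANUM + "(?:[._-]" + ALPHANUM + ")*@" + ALPHANUM + "(?:[.-]" + ALPHANUM + ")+";
    // EMAIL comes first because alternatives are tried left to right.
    private static final Pattern TOKEN = Pattern.compile(EMAIL + "|" + ALPHANUM);

    /** Returns every match of a known token type; everything else is skipped. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("foo=something"));
        System.out.println(tokenize("ahmet:arslan"));
        System.out.println(tokenize("mail jim@example.com now"));
    }
}
```

In this sketch "=" and ":" are never listed as break characters. They simply match no token pattern, so they fall between tokens, which is the same reason a fixed list of "special" characters is hard to give.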

Ahmet


      



Re: Is there a list of "special" characters for standard analyzer?

Posted by oh...@cox.net.
Hi Phil,

I guess that the obvious question is "Which characters are considered 'punctuation characters'?".

In particular, does the analyzer consider "=" (equal) and ":" (colon) to be punctuation characters?

Thanks,
Jim



Re: Is there a list of "special" characters for standard analyzer?

Posted by Phil Whelan <ph...@gmail.com>.
On Thu, Jul 30, 2009 at 7:12 PM, <oh...@cox.net> wrote:
> I was wondering whether there is a list of special characters for the standard analyzer?
>
> What I mean by "special" is characters that the analyzer treats as break characters.
> For example, if I have something like "foo=something", the analyzer apparently
> treats this as two terms, "foo" and "something".

Hi Jim,

This is what I could find in the docs...

StandardAnalyzer uses StandardTokenizer

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html
* Splits words at punctuation characters, removing punctuation.
However, a dot that's not followed by whitespace is considered part of
a token.
* Splits words at hyphens, unless there's a number in the token, in
which case the whole token is interpreted as a product number and is
not split.
* Recognizes email addresses and internet hostnames as one token.

Also, these are the stop words, i.e. the tokens that will be removed:

  public static final String[] ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
  };
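
For illustration, the stop-word step can be sketched in plain Java as below. This is a stand-alone approximation, not Lucene's StopFilter: the class StopWordSketch and its method are hypothetical, and it assumes tokens are lowercased before the stop list is checked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Stand-alone sketch, no Lucene dependency.
// ASSUMPTION: tokens are lowercased before the stop-word check.
class StopWordSketch {
    // Same words as the ENGLISH_STOP_WORDS array quoted above.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"));

    /** Lowercases each token and keeps only those not in the stop list. */
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords(
            Arrays.asList("The", "value", "of", "foo", "is", "something")));
    }
}
```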

Thanks,
Phil
