Posted to java-user@lucene.apache.org by dr <bf...@126.com> on 2016/06/16 11:01:42 UTC

Some questions about StandardTokenizer and UNICODE Regular Expressions

Hi guys
   Currently, I'm looking into the rules of StandardTokenizer, but I've run into some problems.
    As the docs say, StandardTokenizer implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. It is generated by JFlex, a lexer/scanner generator.

   In StandardTokenizerImpl.jflex, the regular expressions are expressed as follows:
     "
    HangulEx            = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] [\p{WB:Format}\p{WB:Extend}]*
HebrewOrALetterEx   = [\p{WB:HebrewLetter}\p{WB:ALetter}]                       [\p{WB:Format}\p{WB:Extend}]*
NumericEx           = [\p{WB:Numeric}[\p{Blk:HalfAndFullForms}&&\p{Nd}]]        [\p{WB:Format}\p{WB:Extend}]*
KatakanaEx          = \p{WB:Katakana}                                           [\p{WB:Format}\p{WB:Extend}]* 
MidLetterEx         = [\p{WB:MidLetter}\p{WB:MidNumLet}\p{WB:SingleQuote}]      [\p{WB:Format}\p{WB:Extend}]* 
......
"
What do these macros, such as HangulEx or NumericEx, mean?
In ClassicTokenizerImpl.jflex, the NUM rule is expressed like this:
"
P           = ("_"|"-"|"/"|"."|",")
NUM        = ({ALPHANUM} {P} {HAS_DIGIT}
           | {HAS_DIGIT} {P} {ALPHANUM}
           | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
           | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
"
This is easy to understand: '29.3', '29-3', and '29_3' will all be tokenized as NUM (a bare '29' matches the ALPHANUM rule instead).
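For illustration, the first two NUM alternatives can be sketched in plain java.util.regex. This is only a sketch, not the actual JFlex-generated scanner; ALPHANUM and HAS_DIGIT are simplified to ASCII letters and digits here:

```java
import java.util.regex.Pattern;

public class NumRuleSketch {
    // Rough java.util.regex translations of the ClassicTokenizer macros:
    //   ALPHANUM  ~ a run of letters/digits
    //   HAS_DIGIT ~ a run of letters/digits containing at least one digit
    //   P         ~ one of _ - / . ,
    static final String ALPHANUM  = "[A-Za-z0-9]+";
    static final String HAS_DIGIT = "[A-Za-z0-9]*[0-9][A-Za-z0-9]*";
    static final String P         = "[_\\-/.,]";

    // Just the first two NUM alternatives:
    //   {ALPHANUM} {P} {HAS_DIGIT} | {HAS_DIGIT} {P} {ALPHANUM}
    static final Pattern NUM = Pattern.compile(
        ALPHANUM + P + HAS_DIGIT + "|" + HAS_DIGIT + P + ALPHANUM);

    static boolean isNum(String s) {
        return NUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"29.3", "29-3", "29_3", "29"}) {
            // A bare "29" has no {P} separator, so it falls to ALPHANUM, not NUM
            System.out.println(s + " -> " + (isNum(s) ? "NUM" : "not NUM"));
        }
    }
}
```

Running this shows '29.3', '29-3', and '29_3' matching NUM while '29' alone does not, mirroring the requirement that NUM contains at least one punctuation separator.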



 I have read Unicode Standard Annex #29 (Unicode Text Segmentation), Unicode Standard Annex #18 (Unicode Regular Expressions), and Unicode Standard Annex #44 (Unicode Character Database), but they contain a great deal of information and are hard to follow.
Does anyone have a reference for these kinds of regular expressions, or know where I can find the meanings of these Unicode regular expressions?


Thanks.

Re:Re: Some questions about StandardTokenizer and UNICODE Regular Expressions

Posted by dr <bf...@126.com>.
Thank you so much, Steve. Your reply is very helpful.

At 2016-06-16 23:01:18, "Steve Rowe" <sa...@gmail.com> wrote:

Re: Some questions about StandardTokenizer and UNICODE Regular Expressions

Posted by Steve Rowe <sa...@gmail.com>.
Hi dr,

Unicode’s character property model is described here: <http://unicode.org/reports/tr23/>.

Wikipedia has a description of Unicode character properties: <https://en.wikipedia.org/wiki/Unicode_character_property>

JFlex allows you to refer to the set of characters that have a given Unicode property using the \p{PropertyName} syntax.  In the case of the HangulEx macro:

  HangulEx = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] [\p{WB:Format}\p{WB:Extend}]*

This matches a Hangul script character (\p{Script:Hangul})[1] that also either has the Word-Break property “ALetter” or “Hebrew_Letter”, followed by zero or more characters that have either the “Format” or “Extend” Word-Break properties[2].  
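To see the property-matching part of that macro in action without JFlex, here is a rough java.util.regex approximation. Note this is an assumption-laden sketch: Java's regex engine has no Word-Break (WB:) properties, so \p{IsHangul} stands in for Script:Hangul and \p{M} (combining marks) only loosely stands in for the WB:Format/WB:Extend tail:

```java
import java.util.regex.Pattern;

public class HangulExSketch {
    // Approximation of the HangulEx macro. Java regex supports Unicode
    // script properties (\p{IsHangul}) but not Word-Break properties, so
    // the [\p{WB:Format}\p{WB:Extend}]* tail is crudely modeled as \p{M}*.
    static final Pattern HANGUL_EX = Pattern.compile("\\p{IsHangul}\\p{M}*");

    static boolean matchesHangulEx(String s) {
        return HANGUL_EX.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesHangulEx("\uD55C")); // U+D55C, a Hangul syllable
        System.out.println(matchesHangulEx("a"));      // a Latin letter
    }
}
```

The Hangul syllable matches while the Latin letter does not, which is the intersection behavior the macro relies on.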

Some helpful resources:

* Character code charts organized by Unicode block: <http://www.unicode.org/charts/>
* UnicodeSet utility: <http://unicode.org/cldr/utility/list-unicodeset.jsp> - note that this utility supports a different regex syntax from JFlex - click on the “help” link for more info.

[1] All characters matching \p{Script:Hangul}: <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{Script:Hangul}>
[2] Word-Break properties, which in JFlex can be referred to with the abbreviation “WB:” in \p{WB:property-name}, are described in the table at <http://www.unicode.org/reports/tr29/#Default_Word_Boundaries>.
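One more way to explore the default word-break rules from plain Java is java.text.BreakIterator, whose word instance follows Unicode's default word-boundary behavior. Its results are close to, but not necessarily identical with, StandardTokenizer's output, since the tokenizer applies its own tailorings and emits only word tokens:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class Uax29WordsDemo {
    // Walk the word boundaries the JDK's BreakIterator finds and keep only
    // tokens that start with a letter or digit (dropping spaces/punctuation).
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String token = text.substring(start, end);
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Unicode text segmentation, e.g. 29.3"));
    }
}
```

Comparing this output against StandardTokenizer on the same input is a quick way to spot where the JFlex macros above add behavior beyond the default rules.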

--
Steve
www.lucidworks.com


> On Jun 16, 2016, at 7:01 AM, dr <bf...@126.com> wrote:


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org