Posted to solr-user@lucene.apache.org by Tom Burton-West <tb...@umich.edu> on 2012/12/19 23:10:53 UTC

ICUTokenizer labels number as Han character?

Hello,

Don't know if the Solr admin panel is lying, or if this is a weird bug.
The string: "1986年"  gets analyzed by the ICUTokenizer with "1986" being
identified as type:NUM and script:Han.  Then the CJKBigram filter
identifies "1986" as type:Num and script:Han and "年" as type:Single and
script: Common.

This doesn't seem right.   Couldn't fit the whole analysis output on one
screen so there are two screenshots attached.

Any clues as to what is going on and whether it is a problem?

Tom

Re: ICUTokenizer labels number as Han character?

Posted by Robert Muir <rc...@gmail.com>.

Your attachment didn't come through; I think the list strips them.
Maybe just open a JIRA issue and attach your screenshots, or put them
elsewhere and include a link?

As far as the ultimate behavior, I think it's correct. Keep in mind
tokens don't really get a script value: runs of untokenized text do.
"Common" is stuff like numbers, punctuation, etc. that doesn't start a
new run; it just takes on whatever script the run has (e.g. Han).
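
To make the point about runs concrete, here is a minimal sketch of
script-run segmentation. It is not the ICUTokenizer code: the class and
method names are hypothetical, it uses the JDK's Character.UnicodeScript
instead of ICU, and it assumes that leading Common characters adopt the
first concrete script that follows them, which is consistent with the
output reported above.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene's or ICU's implementation.
class ScriptRunSketch {

    /** A stretch of text that is labeled with a single script. */
    static final class Run {
        final String text;
        final UnicodeScript script;

        Run(String text, UnicodeScript script) {
            this.text = text;
            this.script = script;
        }

        @Override
        public String toString() {
            return text + " -> " + script;
        }
    }

    /** Common and Inherited characters never start a new run. */
    static boolean isNeutral(UnicodeScript s) {
        return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
    }

    static List<Run> scriptRuns(String input) {
        List<Run> runs = new ArrayList<>();
        int start = 0;
        UnicodeScript current = UnicodeScript.COMMON;
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            if (!isNeutral(s)) {
                if (isNeutral(current)) {
                    // Assumption: the first concrete script names the run,
                    // so leading digits/punctuation are absorbed into it.
                    current = s;
                } else if (s != current) {
                    // A different concrete script closes the current run.
                    runs.add(new Run(input.substring(start, i), current));
                    start = i;
                    current = s;
                }
            }
            i += Character.charCount(cp);
        }
        if (start < input.length()) {
            runs.add(new Run(input.substring(start), current));
        }
        return runs;
    }

    public static void main(String[] args) {
        // Digits are Common, so they join the Han run.
        for (Run r : scriptRuns("1986年")) {
            System.out.println(r); // prints "1986年 -> HAN"
        }
    }
}
```

Under this model the digits in "1986年" have no script of their own, so
a token cut from that run ("1986") is reported with the run's script, Han.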

And the bigram filter only bigrams text with certain token types (NUM
is not one of them), so making a singleton is correct.
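
The bigramming rule can be sketched the same way. This is a simplified
model, not the CJKBigramFilter source: the class, the Token type, and the
set of bigrammable type labels are assumptions made for illustration, and
the sketch ignores offsets (it treats consecutive tokens as adjacent).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not Lucene's implementation.
class BigramSketch {

    static final class Token {
        final String text;
        final String type;

        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }

        @Override
        public String toString() {
            return text + " " + type;
        }
    }

    // Assumed set of token types that are eligible for bigramming.
    static final Set<String> CJK_TYPES = new HashSet<>(Arrays.asList(
            "<IDEOGRAPHIC>", "<HIRAGANA>", "<KATAKANA>", "<HANGUL>"));

    static List<Token> bigram(List<Token> in) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < in.size()) {
            Token t = in.get(i);
            if (!CJK_TYPES.contains(t.type)) {
                out.add(t); // e.g. <NUM> passes through untouched
                i++;
                continue;
            }
            // Find the end of the stretch of consecutive CJK-typed tokens.
            int j = i;
            while (j < in.size() && CJK_TYPES.contains(in.get(j).type)) {
                j++;
            }
            if (j - i == 1) {
                // Nothing to pair with: emit a singleton.
                out.add(new Token(t.text, "<SINGLE>"));
            } else {
                // Emit overlapping pairs: AB, BC, CD, ...
                for (int k = i; k + 1 < j; k++) {
                    out.add(new Token(in.get(k).text + in.get(k + 1).text, "<DOUBLE>"));
                }
            }
            i = j;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
                new Token("1986", "<NUM>"),
                new Token("年", "<IDEOGRAPHIC>"));
        for (Token t : bigram(tokens)) {
            System.out.println(t); // prints "1986 <NUM>" then "年 <SINGLE>"
        }
    }
}
```

Because "1986" is typed NUM it is never a bigram partner, which leaves
"年" with no neighbor and explains the Single type in the analysis output.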

On Wed, Dec 19, 2012 at 5:10 PM, Tom Burton-West <tb...@umich.edu> wrote:
> Hello,
>
> Don't know if the Solr admin panel is lying, or if this is a weird bug.
> The string: "1986年"  gets analyzed by the ICUTokenizer with "1986" being
> identified as type:NUM and script:Han.  Then the CJKBigram filter identifies
> "1986" as type:Num and script:Han and "年" as type:Single and script: Common.
>
> This doesn't seem right.   Couldn't fit the whole analysis output on one
> screen so there are two screenshots attached.
>
> Any clues as to what is going on and whether it is a problem?
>
> Tom