Posted to solr-user@lucene.apache.org by Yasufumi Mizoguchi <ya...@gmail.com> on 2018/10/01 04:14:45 UTC

Creating CJK bigram tokens with ClassicTokenizer

Hi,

I am looking for a way to create CJK bigram tokens with ClassicTokenizer.
I tried using CJKBigramFilter, but it seems to support only
StandardTokenizer...

So, is there any good way to do that?

Thanks,
Yasufumi

Re: Creating CJK bigram tokens with ClassicTokenizer

Posted by Yasufumi Mizoguchi <ya...@gmail.com>.
Hi, Shawn

Thank you for your reply.

> CJKBigramFilter shouldn't care what tokenizer you're using.  It should
> work with any tokenizer.  What problem are you seeing that you're trying
> to solve?  What version of Solr, what configuration, and what does it do
> that you're not expecting, and what do you want it to do?

I am sorry for the lack of information. I tried this with Solr 5.5.5 and 7.5.0,
and here is the analyzer configuration from my managed-schema:

<fieldType name="text_classic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>


And what I want to do is:
1. to create CJK bigram tokens, and
2. to extract each hyphenated word as a single token, even when its parts
   are stopwords (e.g. as-is, to-be, etc.), from CJK and English sentences.
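For reference, the bigram behaviour I am after (goal 1) looks like this in
plain Java; `cjkBigrams` is just an illustrative helper of mine, not a Lucene
API:

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Overlapping bigrams over a run of CJK characters, mimicking the
    // output CJKBigramFilter produces, e.g. "ABC" (three CJK chars)
    // becomes the two tokens "AB" and "BC".
    static List<String> cjkBigrams(String cjkRun) {
        List<String> out = new ArrayList<>();
        if (cjkRun.length() == 1) {   // a lone CJK character stays as-is
            out.add(cjkRun);
            return out;
        }
        for (int i = 0; i + 1 < cjkRun.length(); i++) {
            out.add(cjkRun.substring(i, i + 2));
        }
        return out;
    }
}
```

So for the input 東京都 I would expect the two tokens 東京 and 京都.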

CJKBigramFilter seems to check the TOKEN_TYPES attribute that
StandardTokenizer adds when creating CJK bigram tokens.
(See
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java#L64
)

ClassicTokenizer also adds token types, assigning the obsolete "<CJ>" type
to CJ tokens and "<ALPHANUM>" to Korean text, but neither of these is a
target for CJKBigramFilter...
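The check itself can be sketched without any Lucene dependency (the type
strings below are the ones StandardTokenizer and ClassicTokenizer actually
emit; the class and method names are mine):

```java
import java.util.Set;

public class TypeCheckDemo {
    // The token types CJKBigramFilter reacts to; these come from
    // StandardTokenizer.TOKEN_TYPES.
    static final Set<String> BIGRAM_TYPES =
            Set.of("<IDEOGRAPHIC>", "<HIRAGANA>", "<KATAKANA>", "<HANGUL>");

    // ClassicTokenizer labels CJK runs with the legacy "<CJ>" type instead,
    // so this membership test fails and the filter passes the tokens
    // through without forming bigrams.
    static boolean wouldBigram(String tokenType) {
        return BIGRAM_TYPES.contains(tokenType);
    }
}
```

So one possible (untested) workaround might be a small custom TokenFilter,
placed before CJKBigramFilter, that rewrites the "<CJ>" type to
"<IDEOGRAPHIC>".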

Thanks,
Yasufumi

On Tue, Oct 2, 2018 at 0:05, Shawn Heisey <ap...@elyograg.org> wrote:

> On 9/30/2018 10:14 PM, Yasufumi Mizoguchi wrote:
> > I am looking for a way to create CJK bigram tokens with
> > ClassicTokenizer.
> > I tried using CJKBigramFilter, but it seems to support only
> > StandardTokenizer...
>
> CJKBigramFilter shouldn't care what tokenizer you're using.  It should
> work with any tokenizer.  What problem are you seeing that you're trying
> to solve?  What version of Solr, what configuration, and what does it do
> that you're not expecting, and what do you want it to do?
>
> I don't have access to the systems where I was using that filter, but if
> I recall correctly, I was using the whitespace tokenizer.
>
> Thanks,
> Shawn
>
>

Re: Creating CJK bigram tokens with ClassicTokenizer

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/30/2018 10:14 PM, Yasufumi Mizoguchi wrote:
> I am looking for a way to create CJK bigram tokens with ClassicTokenizer.
> I tried using CJKBigramFilter, but it seems to support only
> StandardTokenizer...

CJKBigramFilter shouldn't care what tokenizer you're using.  It should 
work with any tokenizer.  What problem are you seeing that you're trying 
to solve?  What version of Solr, what configuration, and what does it do 
that you're not expecting, and what do you want it to do?

I don't have access to the systems where I was using that filter, but if 
I recall correctly, I was using the whitespace tokenizer.

Thanks,
Shawn