Posted to solr-user@lucene.apache.org by Eswar K <kj...@gmail.com> on 2007/11/26 14:30:52 UTC

CJK Analyzers for Solr

Hi,

Does Solr come with Language analyzers for CJK? If not, can you please
direct me to some good CJK analyzers?

Regards,
Eswar

Re: CJK Analyzers for Solr

Posted by James liu <li...@gmail.com>.

If your analyzer is the standard one, you can try using the tokenizer. (You can find the answer
in the analyzer source code and schema.xml.)


On Nov 27, 2007 9:39 AM, zx zhang <bn...@gmail.com> wrote:

> Lance,
>
> The following is an example schema fieldtype using Solr 1.2 and the CJK
> package, and it works. As you said, the CJK analyzer parses a CJK string
> as bigrams, turning 'C1C2C3C4' into 'C1C2 C2C3 C3C4'.
>
> More to the point, it is worth mentioning that using the CJK package makes
> the index grow beyond what is tolerable, and documents take a long time to
> index. For most enterprise applications, I think a more efficient string
> parser is needed.
>
>
> <fieldtype name="text_cjk" class="solr.TextField">
>      <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
> </fieldtype>

-- 
regards
jl

Re: CJK Analyzers for Solr

Posted by zx zhang <bn...@gmail.com>.

Lance,

The following is an example schema fieldtype using Solr 1.2 and the CJK
package, and it works. As you said, the CJK analyzer parses a CJK string as
bigrams, turning 'C1C2C3C4' into 'C1C2 C2C3 C3C4'.

More to the point, it is worth mentioning that using the CJK package makes
the index grow beyond what is tolerable, and documents take a long time to
index. For most enterprise applications, I think a more efficient string
parser is needed.


<fieldtype name="text_cjk" class="solr.TextField">
      <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
</fieldtype>
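
As a rough illustration of why bigram indexing can inflate the index, the sketch below counts distinct terms for single characters versus overlapping bigrams over one short run. It is a simplified stand-in written for this note, not Lucene's CJKTokenizer code; the class name `CjkVocabularySketch`, the method `distinctNGrams`, and the sample string are all assumptions made for the example.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class CjkVocabularySketch {

    // Collect the distinct overlapping n-grams (by code point) in a run of text.
    // Hypothetical helper for illustration only: the real tokenizer also deals
    // with Latin text and token offsets, which this sketch ignores.
    static Set<String> distinctNGrams(String run, int n) {
        int[] cps = run.codePoints().toArray();
        Set<String> terms = new LinkedHashSet<>();
        for (int i = 0; i + n <= cps.length; i++) {
            terms.add(new String(cps, i, n));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Eight characters drawn from only three distinct ones.
        String run = "一二三二一三一二";
        System.out.println("single characters: " + distinctNGrams(run, 1));
        System.out.println("bigrams:           " + distinctNGrams(run, 2));
    }
}
```

With three distinct characters, this run produces three single-character terms but six distinct bigram terms. Over a real corpus the bigram vocabulary can be far larger than the character set, which is one plausible contributor to the index growth described above.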



On 11/27/07, Norskog, Lance <la...@divvio.com> wrote:
>
> I notice this is in the future tense. Is the CJKTokenizer available yet?
> From what I can see, the CJK code should be a Filter instead anyway.
> Also, the ChineseFilter and CJKTokenizer do two different things.
>
> CJKTokenizer turns C1C2C3C4 into 'C1C2 C2C3 C3C4'. ChineseFilter (from
> 2001) turns C1C2 into 'C1 C2'. I hope someone who speaks Mandarin or
> Cantonese understands what this should do.
>
> Lance

RE: CJK Analyzers for Solr

Posted by Chris Hostetter <ho...@fucit.org>.

: I notice this is in the future tense. Is the CJKTokenizer available yet?

CJKTokenizer and CJKAnalyzer are both available in Solr 1.2, but no
TokenizerFactory was provided for CJKTokenizer in 1.2, so it wasn't
possible to use it "out of the box" without writing a 3-line Java plugin.
That 3-line plugin will be available in 1.3. Since CJKAnalyzer has a
no-arg constructor, it is usable out of the box in Solr 1.2.

: From what I can see, the CJK code should be a Filter instead anyway.
: Also, the ChineseFilter and CJKTokenizer do two different things. 

These questions may be better suited to the java-user@lucene mailing list,
as you are more likely to find existing users there who can discuss the
advantages of each approach.



-Hoss

RE: CJK Analyzers for Solr

Posted by "Norskog, Lance" <la...@divvio.com>.

I notice this is in the future tense. Is the CJKTokenizer available yet?
From what I can see, the CJK code should be a Filter instead anyway.
Also, the ChineseFilter and CJKTokenizer do two different things. 

CJKTokenizer turns C1C2C3C4 into 'C1C2 C2C3 C3C4'. ChineseFilter (from
2001) turns C1C2 into 'C1 C2'. I hope someone who speaks Mandarin or
Cantonese understands what this should do.
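
To make the two behaviours described above concrete, here is a minimal sketch that splits the same run of characters both ways: one token per character, and overlapping two-character tokens. It is not the Lucene source; `UnigramVsBigramSketch`, `unigrams`, and `bigrams` are hypothetical names, and the four-character sample stands in for C1C2C3C4.

```java
import java.util.ArrayList;
import java.util.List;

public class UnigramVsBigramSketch {

    // One token per character: C1C2 becomes 'C1 C2'.
    static List<String> unigrams(String run) {
        List<String> tokens = new ArrayList<>();
        int[] cps = run.codePoints().toArray();
        for (int i = 0; i < cps.length; i++) {
            tokens.add(new String(cps, i, 1));
        }
        return tokens;
    }

    // Overlapping pairs: C1C2C3C4 becomes 'C1C2 C2C3 C3C4'.
    // Simplification: a run shorter than two characters yields no tokens here.
    static List<String> bigrams(String run) {
        List<String> tokens = new ArrayList<>();
        int[] cps = run.codePoints().toArray();
        for (int i = 0; i + 1 < cps.length; i++) {
            tokens.add(new String(cps, i, 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String run = "一二三四"; // stands in for C1C2C3C4
        System.out.println(unigrams(run)); // [一, 二, 三, 四]
        System.out.println(bigrams(run));  // [一二, 二三, 三四]
    }
}
```

Under the bigram scheme a two-character query such as 二三 corresponds to a single indexed term, whereas with per-character tokens it has to be matched as two adjacent terms.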

Lance

-----Original Message-----
From: Eswar K [mailto:kja.eswar@gmail.com] 
Sent: Monday, November 26, 2007 10:28 AM
To: solr-user@lucene.apache.org
Subject: Re: CJK Analyzers for Solr

Hoss,

Thanks a lot. Will look into it.

Regards,
Eswar


Re: CJK Analyzers for Solr

Posted by Eswar K <kj...@gmail.com>.

Hoss,

Thanks a lot. Will look into it.

Regards,
Eswar

On Nov 26, 2007 11:55 PM, Chris Hostetter <ho...@fucit.org> wrote:

>
> : Does Solr come with Language analyzers for CJK? If not, can you please
> : direct me to some good CJK analyzers?
>
> Lucene has a CJKTokenizer and CJKAnalyzer in the contrib/analyzers jar.
> They can be used in Solr. Both have been included in Solr for a while
> now, so you can specify CJKAnalyzer in your schema with Solr 1.2, but
> starting with Solr 1.3 a Factory for the Tokenizer will also be included
> so it can be used in a more complex analysis chain defined in the schema.
>
>
>
> -Hoss
>
>

Re: CJK Analyzers for Solr

Posted by Chris Hostetter <ho...@fucit.org>.

: Does Solr come with Language analyzers for CJK? If not, can you please
: direct me to some good CJK analyzers?

Lucene has a CJKTokenizer and CJKAnalyzer in the contrib/analyzers jar.
They can be used in Solr. Both have been included in Solr for a while
now, so you can specify CJKAnalyzer in your schema with Solr 1.2, but
starting with Solr 1.3 a Factory for the Tokenizer will also be included
so it can be used in a more complex analysis chain defined in the schema.



-Hoss