Posted to user@lucy.apache.org by Hao Wu <ec...@gmail.com> on 2017/02/17 22:44:21 UTC
[lucy-user] Chinese support?
Hi all,
I use the StandardTokenizer. Searching by English words works, but Chinese
gives me strange results.
my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
my $raw_type  = Lucy::Plan::FullTextType->new(
    analyzer => $tokenizer,
);
I was also going to use the EasyAnalyzer
(https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod),
but Chinese is not supported.
What is the simplest way to use Lucy with Chinese documents? Thanks.
Best,
Hao Wu
Re: [lucy-user] Custom Analyzer [was Chinese support?]
Posted by Hao Wu <ec...@gmail.com>.
Hi Peter,
It works great. The documentation is significantly better now.
Thanks, everyone, for taking care of these issues.
Best,
Hao
On Thu, Feb 23, 2017 at 8:23 AM, Peter Karman <pe...@peknet.com> wrote:
> Nick Wellnhofer wrote on 2/23/17 6:49 AM:
>
>> On 23/02/2017 04:52, Peter Karman wrote:
>>
>>> package MyAnalyzer {
>>> use base qw( Lucy::Analysis::Analyzer );
>>> sub transform { $_[1] }
>>> }
>>>
>>
>> Every Analyzer needs an `equals` method. For simple Analyzers, it can
>> simply
>> check whether the class of the other object matches:
>>
>> package MyAnalyzer {
>> use base qw( Lucy::Analysis::Analyzer );
>> sub transform { $_[1] }
>> sub equals { $_[1]->isa(__PACKAGE__) }
>> }
>>
>> If the Analyzer uses (inside-out) member variables, you'll also need dump
>> and
>> load methods. Unfortunately, we don't have good documentation for writing
>> custom
>> analyzers yet.
>>
>>
>
> Thanks for the quick response and accurate diagnosis, Nick. I see you've
> already committed changes to the POD so that will be very helpful in future.
>
> Hao, if you add the `sub equals` method to your ChineseAnalyzer, I think
> that should fix your problem. I have confirmed that locally with my own
> tests.
>
>
>
> --
> Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
>
Re: [lucy-user] Custom Analyzer [was Chinese support?]
Posted by Peter Karman <pe...@peknet.com>.
Nick Wellnhofer wrote on 2/23/17 6:49 AM:
> On 23/02/2017 04:52, Peter Karman wrote:
> package MyAnalyzer {
>     use base qw( Lucy::Analysis::Analyzer );
>     sub transform { $_[1] }
> }
>
> Every Analyzer needs an `equals` method. For simple Analyzers, it can simply
> check whether the class of the other object matches:
>
> package MyAnalyzer {
>     use base qw( Lucy::Analysis::Analyzer );
>     sub transform { $_[1] }
>     sub equals { $_[1]->isa(__PACKAGE__) }
> }
>
> If the Analyzer uses (inside-out) member variables, you'll also need dump and
> load methods. Unfortunately, we don't have good documentation for writing custom
> analyzers yet.
>
Thanks for the quick response and accurate diagnosis, Nick. I see you've already
committed changes to the POD, so that will be very helpful in the future.
Hao, if you add the `sub equals` method to your ChineseAnalyzer, I think that
should fix your problem. I have confirmed that locally with my own tests.
--
Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
Re: [lucy-user] Custom Analyzer [was Chinese support?]
Posted by Nick Wellnhofer <we...@aevum.de>.
On 23/02/2017 04:52, Peter Karman wrote:
> package MyAnalyzer {
>     use base qw( Lucy::Analysis::Analyzer );
>     sub transform { $_[1] }
> }
Every Analyzer needs an `equals` method. For simple Analyzers, it can simply
check whether the class of the other object matches:
package MyAnalyzer {
    use base qw( Lucy::Analysis::Analyzer );
    sub transform { $_[1] }
    sub equals { $_[1]->isa(__PACKAGE__) }
}
If the Analyzer uses (inside-out) member variables, you'll also need dump and
load methods. Unfortunately, we don't have good documentation for writing
custom analyzers yet.
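[Editor's note] As a standalone illustration of the class-check `equals` idea, here is a sketch that runs without Lucy installed. The class names are hypothetical stand-ins (the real base class is Lucy::Analysis::Analyzer), and `Scalar::Util::blessed` is added to guard against unblessed arguments:

```perl
use strict;
use warnings;
use Scalar::Util ();

# Hypothetical stand-in for a Lucy-style Analyzer subclass.
package MyAnalyzer;
sub new { bless {}, shift }

# equals: true only when the other object belongs to (a subclass of)
# this class; blessed() keeps isa() from dying on non-objects.
sub equals {
    my ( $self, $other ) = @_;
    return Scalar::Util::blessed($other) && $other->isa(__PACKAGE__);
}

# A different analyzer class, for comparison.
package OtherAnalyzer;
sub new { bless {}, shift }

package main;

my $analyzer = MyAnalyzer->new;
print $analyzer->equals( MyAnalyzer->new )    ? "same\n" : "different\n";    # same
print $analyzer->equals( OtherAnalyzer->new ) ? "same\n" : "different\n";    # different
```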
> *** Error in `perl': corrupted double-linked list: 0x00000000021113a0 ***
This is something that should be fixed. I think that the following happens:
- The exception thrown in the Lucy code causes a refcount leak.
- Because of the leak, the object still exists in Perl's global destruction
phase where the DESTROY method is invoked on the remaining objects in
random order.
- So it can happen that Clownfish object A is destroyed with object B still
referencing it. When B is destroyed, it tries to decrease the refcount of
A, causing memory corruption.
We'll need a custom DESTROY implementation for Perl that ignores objects with
a non-zero refcount or checks whether we're in the global destruction phase.
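[Editor's note] The global-destruction check Nick mentions can be sketched in plain Perl with `${^GLOBAL_PHASE}` (available since Perl 5.14). The class below is a hypothetical stand-in, not Lucy's actual DESTROY:

```perl
use strict;
use warnings;
use v5.14;    # ${^GLOBAL_PHASE} was added in Perl 5.14

package Guarded;

sub new { bless { name => $_[1] }, $_[0] }

sub DESTROY {
    my ($self) = @_;
    # During global destruction, objects are reaped in arbitrary order,
    # so touching other objects (e.g. decrementing their refcounts)
    # is unsafe; bail out instead of doing real cleanup.
    return if ${^GLOBAL_PHASE} eq 'DESTRUCT';
    print "cleanly destroying $self->{name}\n";
}

package main;

{
    my $obj = Guarded->new('scoped');    # destroyed at end of this block, phase RUN
}
our $global = Guarded->new('global');    # destroyed during global destruction, skipped
print "end of main\n";
```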
Nick
[lucy-user] Custom Analyzer [was Chinese support?]
Posted by Peter Karman <pe...@peknet.com>.
Marvin, Nick, others with more Lucy-fu than I possess:
The example below is failing. I have created a similar test case to create a PR
but am having trouble running tests within my local git checkout at the moment.
----------------------------------------------------------
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;
use Lucy;

package MyAnalyzer {
    use base qw( Lucy::Analysis::Analyzer );
    sub transform { $_[1] }
}

package main;
use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Index::Indexer;

my $path_to_index = shift(@ARGV) or die "usage: $0 path/to/index";

for my $try ( 1 .. 3 ) {
    my $schema      = Lucy::Plan::Schema->new;
    my $my_analyzer = MyAnalyzer->new();
    my $raw_type    = Lucy::Plan::FullTextType->new( analyzer => $my_analyzer );
    $schema->spec_field( name => 'body', type => $raw_type );

    my $indexer = Lucy::Index::Indexer->new(
        index  => $path_to_index,
        schema => $schema,
        create => 1,
    );

    my $doc = { body => 'test' };
    $indexer->add_doc($doc);
    $indexer->commit;
    say "finished $try";
}
--------------------------------------------------------
The example above is based on the gist quoted below.
Hao Wu wrote on 2/20/17 11:40 PM:
> Hi Peter,
>
> Thanks for spending time in the script.
>
> I clean it up a bit, so there is no dependency now.
>
> https://gist.github.com/swuecho/1b960ae17a1f47466be006fd14e3b7ff
>
> still do not work.
>
--
Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
Re: [lucy-user] Chinese support?
Posted by Hao Wu <ec...@gmail.com>.
Hi Peter,
Thanks for spending time on the script.
I cleaned it up a bit, so there are no dependencies now.
https://gist.github.com/swuecho/1b960ae17a1f47466be006fd14e3b7ff
It still does not work.
On Mon, Feb 20, 2017 at 9:03 PM, Peter Karman <pe...@peknet.com> wrote:
> Hao Wu wrote on 2/20/17 10:18 PM:
>
>> Hi Peter,
>>
>> Thanks for reply.
>>
>> That could be a problem. But probably not in my case.
>>
>> I removed the old index.
>>
>> run the program with 'ChineseAnalyzer' and truncate => 0 twice. the
>> second
>> time, will give me the error.
>>
>> 'body' assigned conflicting FieldType
>> LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
>> at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
>> line 118.
>> Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
>> '/home/hwu/data/lucy/mitbbs.index', 'schema',
>> 'Lucy::Plan::Schema=SCALAR(0x1c56798)', 'create', 1) called at
>> mitbbs_index.pl
>> <http://mitbbs_index.pl> line 26
>>
>> run the program with 'ChineseAnalyzer' and truncate => 0 twice, no
>> error. but I
>> want to update the index.
>>
>> run the program with 'StandardTokenizer', with truncate 0 or 1, both
>> work fine.
>>
>> So, this make me think I must miss something in the 'ChineseAnalyzer' I
>> have.
>>
>>
>
> This is not your default, I don't think. This seems like a bug.
>
> Here's a smaller gist demonstrating the problem:
>
> https://gist.github.com/karpet/d8fe12085246b8419f9e4ab44930c1cc
>
> With the 2 files in the gist, I get this result:
>
> [karpet@pekmac:~/tmp/chinese-analyzer]$ perl indexer.pl test-index
> Building prefix dict from the default dictionary ...
> Loading model from cache /var/folders/r3/yk7hmbb9125fns
> df9bqs6lrm0000gp/T/jieba.cache
> Loading model cost 0.553 seconds.
> Prefix dict has been built succesfully.
> Finished.
>
> [karpet@pekmac:~/tmp/chinese-analyzer]$ perl indexer.pl test-index
> 'body' assigned conflicting FieldType
> LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
> at /usr/local/perl/5.24.0/lib/site_perl/5.24.0/darwin-2level/Lucy.pm
> line 118.
> Lucy::Index::Indexer::new("Lucy::Index::Indexer", "index",
> "test-index", "schema", Lucy::Plan::Schema=SCALAR(0x7f9b0b004a18),
> "create", 1) called at indexer.pl line 23
> Segmentation fault: 11
>
>
>
> I would expect the code to work as you wrote it, so maybe someone else can
> spot what's going wrong.
>
> Here's what the schema_1.json file looks like after the initial index
> creation:
>
> {
> "_class": "Lucy::Plan::Schema",
> "analyzers": [
> null,
> {
> "_class": "ChineseAnalyzer"
> }
> ],
> "fields": {
> "body": {
> "analyzer": "1",
> "type": "fulltext"
>
> }
> }
> }
>
>
> --
> Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
>
Re: [lucy-user] Chinese support?
Posted by Peter Karman <pe...@peknet.com>.
Hao Wu wrote on 2/20/17 10:18 PM:
> Hi Peter,
>
> Thanks for reply.
>
> That could be a problem. But probably not in my case.
>
> I removed the old index.
>
> run the program with 'ChineseAnalyzer' and truncate => 0 twice. the second
> time, will give me the error.
>
> 'body' assigned conflicting FieldType
> LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
> at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm line 118.
> Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
> '/home/hwu/data/lucy/mitbbs.index', 'schema',
> 'Lucy::Plan::Schema=SCALAR(0x1c56798)', 'create', 1) called at mitbbs_index.pl
> <http://mitbbs_index.pl> line 26
>
> run the program with 'ChineseAnalyzer' and truncate => 0 twice, no error. but I
> want to update the index.
>
> run the program with 'StandardTokenizer', with truncate 0 or 1, both work fine.
>
> So, this make me think I must miss something in the 'ChineseAnalyzer' I have.
>
This is not your fault, I don't think. This seems like a bug.
Here's a smaller gist demonstrating the problem:
https://gist.github.com/karpet/d8fe12085246b8419f9e4ab44930c1cc
With the 2 files in the gist, I get this result:
[karpet@pekmac:~/tmp/chinese-analyzer]$ perl indexer.pl test-index
Building prefix dict from the default dictionary ...
Loading model from cache
/var/folders/r3/yk7hmbb9125fnsdf9bqs6lrm0000gp/T/jieba.cache
Loading model cost 0.553 seconds.
Prefix dict has been built succesfully.
Finished.
[karpet@pekmac:~/tmp/chinese-analyzer]$ perl indexer.pl test-index
'body' assigned conflicting FieldType
LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
at /usr/local/perl/5.24.0/lib/site_perl/5.24.0/darwin-2level/Lucy.pm line 118.
Lucy::Index::Indexer::new("Lucy::Index::Indexer", "index", "test-index",
"schema", Lucy::Plan::Schema=SCALAR(0x7f9b0b004a18), "create", 1) called at
indexer.pl line 23
Segmentation fault: 11
I would expect the code to work as you wrote it, so maybe someone else can spot
what's going wrong.
Here's what the schema_1.json file looks like after the initial index creation:
{
"_class": "Lucy::Plan::Schema",
"analyzers": [
null,
{
"_class": "ChineseAnalyzer"
}
],
"fields": {
"body": {
"analyzer": "1",
"type": "fulltext"
}
}
}
--
Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
Re: [lucy-user] Chinese support?
Posted by Hao Wu <ec...@gmail.com>.
Hi Peter,
Thanks for the reply.
That could be a problem. But probably not in my case.
I removed the old index.
Running the program with 'ChineseAnalyzer' and truncate => 0 twice, the second
run gives me this error:
'body' assigned conflicting FieldType
LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
line 118.
Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
'/home/hwu/data/lucy/mitbbs.index', 'schema',
'Lucy::Plan::Schema=SCALAR(0x1c56798)', 'create', 1) called at
mitbbs_index.pl line 26
Running the program with 'ChineseAnalyzer' and truncate => 1 twice gives no
error, but I want to update the index, not recreate it.
Running the program with 'StandardTokenizer' works fine with truncate set to
either 0 or 1.
So this makes me think I must be missing something in the 'ChineseAnalyzer' I
have.
On Mon, Feb 20, 2017 at 6:47 PM, Peter Karman <pe...@peknet.com> wrote:
> Hao Wu wrote on 2/20/17 6:12 PM:
>
>> Still have problem when I try to update the index using the custom
>> analyzer.
>>
>> If I comment out the
>> truncate => 1
>>
>> rerun I got the following errror.
>>
>>
>> 'body' assigned conflicting FieldType
>> LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
>> at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy
>> .pm
>> line 118.
>> Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
>> '/home/hwu/data/lucy/mitbbs.index', 'schema',
>> 'Lucy::Plan::Schema=SCALAR(0x211c758)', 'create', 1) called at
>> mitbbs_index.pl line 26
>> *** Error in `perl': corrupted double-linked list: 0x00000000021113a0 ***
>>
>> If I switch the analyzer to Lucy::Analysis::StandardTokenize. works
>> fine.
>> a new seg_2 is created.
>>
>> my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
>> my $raw_type = Lucy::Plan::FullTextType->new(
>> analyzer => $tokenizer,
>> );
>>
>> So I guess I must miss something in the custom Chinese Analyzer.
>>
>>
> since you changed the field definition with a new analyzer, you must
> create a new index. You cannot update an existing index with 2 different
> field definitions in the same schema.
>
>
>
> --
> Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
>
Re: [lucy-user] Chinese support?
Posted by Peter Karman <pe...@peknet.com>.
Hao Wu wrote on 2/20/17 6:12 PM:
> Still have problem when I try to update the index using the custom analyzer.
>
> If I comment out the
> truncate => 1
>
> rerun I got the following errror.
>
>
> 'body' assigned conflicting FieldType
> LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
> at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
> line 118.
> Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
> '/home/hwu/data/lucy/mitbbs.index', 'schema',
> 'Lucy::Plan::Schema=SCALAR(0x211c758)', 'create', 1) called at
> mitbbs_index.pl line 26
> *** Error in `perl': corrupted double-linked list: 0x00000000021113a0 ***
>
> If I switch the analyzer to Lucy::Analysis::StandardTokenize. works fine.
> a new seg_2 is created.
>
> my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
> my $raw_type = Lucy::Plan::FullTextType->new(
> analyzer => $tokenizer,
> );
>
> So I guess I must miss something in the custom Chinese Analyzer.
>
Since you changed the field definition with a new analyzer, you must create a
new index. You cannot update an existing index with two different field
definitions in the same schema.
--
Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
Re: [lucy-user] Chinese support?
Posted by Hao Wu <ec...@gmail.com>.
I still have a problem when I try to update the index using the custom analyzer.
If I comment out the
truncate => 1
line and rerun, I get the following error:
'body' assigned conflicting FieldType
LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
line 118.
Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
'/home/hwu/data/lucy/mitbbs.index', 'schema',
'Lucy::Plan::Schema=SCALAR(0x211c758)', 'create', 1) called at
mitbbs_index.pl line 26
*** Error in `perl': corrupted double-linked list: 0x00000000021113a0 ***
If I switch the analyzer to Lucy::Analysis::StandardTokenizer, it works fine
and a new seg_2 is created.
my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
my $raw_type  = Lucy::Plan::FullTextType->new(
    analyzer => $tokenizer,
);
So I guess I must be missing something in the custom ChineseAnalyzer.
------------------my script--------------------
#!/usr/local/bin/perl
# TODO: update docs instead of recreating the index every time
use strict;
use warnings;

use DBI;
use File::Spec::Functions qw( catfile );
use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Index::Indexer;
use ChineseAnalyzer;

my $path_to_index = '/home/hwu/data/lucy/mitbbs.index';

# Create the schema.
my $schema   = Lucy::Plan::Schema->new;
my $chinese  = ChineseAnalyzer->new();
my $raw_type = Lucy::Plan::FullTextType->new(
    analyzer => $chinese,
);
$schema->spec_field( name => 'body', type => $raw_type );

# Create an Indexer object.
my $indexer = Lucy::Index::Indexer->new(
    index    => $path_to_index,
    schema   => $schema,
    create   => 1,
    truncate => 1,
);

my $driver   = "SQLite";
my $database = "/home/hwu/data/mitbbs.db";
my $dsn      = "DBI:$driver:dbname=$database";
# DBI->connect takes ($dsn, $user, $pass, \%attr); the attribute hash
# must be the fourth argument or RaiseError is silently ignored.
my $dbh = DBI->connect( $dsn, "", "", { RaiseError => 1 } ) or die $DBI::errstr;

my $stmt = qq(SELECT id, text FROM post WHERE id >= 100 AND id < 200;);
#my $stmt = qq(SELECT id, text FROM post WHERE id < 100;);
my $sth = $dbh->prepare($stmt);
my $rv  = $sth->execute() or die $DBI::errstr;

while ( my @row = $sth->fetchrow_array() ) {
    print "id = " . $row[0] . "\n";
    print $row[1];
    my $doc = { body => $row[1] };
    $indexer->add_doc($doc);
}
$indexer->commit;
print "Finished.\n";
On Sat, Feb 18, 2017 at 6:46 AM, Nick Wellnhofer <we...@aevum.de>
wrote:
> On 18/02/2017 07:22, Hao Wu wrote:
>
>> Thanks. Get it work.
>>
>
> Lucy's StandardTokenizer breaks up the text at the word boundaries defined
> in Unicode Standard Annex #29. Then we treat every Alphabetic character
> that doesn't have a Word_Break property as a single term. These are
> characters that match \p{Ideographic}, \p{Script: Hiragana}, or
> \p{Line_Break: Complex_Context}. This should work for Chinese but as Peter
> mentioned, we don't support n-grams.
>
> If you're using QueryParser, you're likely to run into problems, though.
> QueryParser will turn a sequence of Chinese characters into a PhraseQuery
> which is obviously wrong. A quick hack is to insert a space after every
> Chinese character before passing a query string to QueryParser:
>
> $query_string =~ s/\p{Ideographic}/$& /g;
>
> Nick
>
>
Re: [lucy-user] Chinese support?
Posted by Nick Wellnhofer <we...@aevum.de>.
On 18/02/2017 07:22, Hao Wu wrote:
> Thanks. Get it work.
Lucy's StandardTokenizer breaks up the text at the word boundaries defined in
Unicode Standard Annex #29. Then we treat every Alphabetic character that
doesn't have a Word_Break property as a single term. These are characters that
match \p{Ideographic}, \p{Script: Hiragana}, or \p{Line_Break:
Complex_Context}. This should work for Chinese but as Peter mentioned, we
don't support n-grams.
If you're using QueryParser, you're likely to run into problems, though.
QueryParser will turn a sequence of Chinese characters into a PhraseQuery
which is obviously wrong. A quick hack is to insert a space after every
Chinese character before passing a query string to QueryParser:
$query_string =~ s/\p{Ideographic}/$& /g;
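[Editor's note] The substitution above can be tried standalone, without Lucy; the sample query string here is arbitrary:

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Insert a space after every ideographic character so QueryParser would
# see separate terms instead of one long phrase.
my $query_string = "永和服装 limited";
$query_string =~ s/\p{Ideographic}/$& /g;
print "$query_string\n";    # "永 和 服 装  limited"
```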
Nick
Re: [lucy-user] Chinese support?
Posted by Hao Wu <ec...@gmail.com>.
Thanks. I got it to work.
Code pasted below in case anyone has a similar question.
package ChineseAnalyzer;
use Jieba;
use v5.10;
use Encode qw(decode_utf8);
use base qw( Lucy::Analysis::Analyzer );

sub new {
    my $self = shift->SUPER::new;
    return $self;
}

sub transform {
    my ( $self, $inversion ) = @_;
    return $inversion;
}

sub transform_text {
    my ( $self, $text ) = @_;
    my $inversion = Lucy::Analysis::Inversion->new;
    my @tokens    = Jieba::jieba_tokenize( decode_utf8($text) );
    $inversion->append(
        Lucy::Analysis::Token->new(
            text         => $_->[0],
            start_offset => $_->[1],
            end_offset   => $_->[2],
        )
    ) for @tokens;
    return $inversion;
}

1;

package Jieba;
use v5.10;

sub jieba_tokenize {
    jieba_tokenize_python(shift);
}

# TODO:
# result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')

use Inline Python => <<'END_OF_PYTHON_CODE';
from jieba import tokenize

def jieba_tokenize_python(text):
    seg_list = tokenize(text, mode='search')
    return list(seg_list)
END_OF_PYTHON_CODE

1;
On Fri, Feb 17, 2017 at 6:29 PM, Peter Karman <pe...@peknet.com> wrote:
> Hao Wu wrote on 2/17/17 4:44 PM:
>
>> Hi all,
>>
>> I use the StandardTokenizer. search by English word work, but in
>> Chinese give me strange results.
>>
>> my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
>> my $raw_type = Lucy::Plan::FullTextType->new(
>> analyzer => $tokenizer,
>> );
>>
>> also, I was going to use the EasyAnalyzer (
>> https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis
>> /EasyAnalyzer.pod
>> )
>> , but chinese in not supported.
>>
>> What is the simple way to use lucy with chinese doc? Thanks.
>>
>
> There is currently no equivalent of
> https://lucene.apache.org/core/4_0_0/analyzers-common/org/
> apache/lucene/analysis/cjk/CJKTokenizer.html
> within core Lucy.
>
> Furthermore, there is no automatic language detection in Lucy. You'll note
> in https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis
> /EasyAnalyzer.pod
> that the language must be explicitly specified, and that is for the
> stemming analyzer. Also, Chinese is not among the supported languages
> listed.
>
> Maybe something wrapped around https://metacpan.org/pod/Lingu
> a::CJK::Tokenizer would work as a custom analyzer.
>
> You can see an example in the documentation here
> https://metacpan.org/pod/Lucy::Analysis::Analyzer#new
>
>
>
> --
> Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
>
Re: [lucy-user] Chinese support?
Posted by Peter Karman <pe...@peknet.com>.
Hao Wu wrote on 2/17/17 4:44 PM:
> Hi all,
>
> I use the StandardTokenizer. search by English word work, but in
> Chinese give me strange results.
>
> my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
> my $raw_type = Lucy::Plan::FullTextType->new(
> analyzer => $tokenizer,
> );
>
> also, I was going to use the EasyAnalyzer (
> https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod
> )
> , but chinese in not supported.
>
> What is the simple way to use lucy with chinese doc? Thanks.
There is currently no equivalent of
https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKTokenizer.html
within core Lucy.
Furthermore, there is no automatic language detection in Lucy. You'll note in
https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod
that the language must be explicitly specified, and that is for the stemming
analyzer. Also, Chinese is not among the supported languages listed.
Maybe something wrapped around https://metacpan.org/pod/Lingua::CJK::Tokenizer
would work as a custom analyzer.
You can see an example in the documentation here
https://metacpan.org/pod/Lucy::Analysis::Analyzer#new
--
Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman