Posted to user@lucy.apache.org by Hao Wu <ec...@gmail.com> on 2017/02/17 22:44:21 UTC

[lucy-user] Chinese support?

Hi all,

I use the StandardTokenizer. Searching by English words works, but searching
in Chinese gives me strange results.

my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
my $raw_type = Lucy::Plan::FullTextType->new(
        analyzer => $tokenizer,
);

I was also going to use the EasyAnalyzer (
https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod
), but Chinese is not supported.

What is the simplest way to use Lucy with Chinese documents? Thanks.

Best,

Hao Wu

Re: [lucy-user] Custom Analyzer [was Chinese support?]

Posted by Hao Wu <ec...@gmail.com>.
Hi Peter,

It works great. The documentation is significantly better now.

Thanks, everyone, for taking care of these issues.

Best,

Hao

On Thu, Feb 23, 2017 at 8:23 AM, Peter Karman <pe...@peknet.com> wrote:

> Nick Wellnhofer wrote on 2/23/17 6:49 AM:
>
>> On 23/02/2017 04:52, Peter Karman wrote:
>>
>>> package MyAnalyzer {
>>>     use base qw( Lucy::Analysis::Analyzer );
>>>     sub transform { $_[1] }
>>> }
>>>
>>
>> Every Analyzer needs an `equals` method. For simple Analyzers, it can simply
>> check whether the class of the other object matches:
>>
>>     package MyAnalyzer {
>>         use base qw( Lucy::Analysis::Analyzer );
>>         sub transform { $_[1] }
>>         sub equals { $_[1]->isa(__PACKAGE__) }
>>     }
>>
>> If the Analyzer uses (inside-out) member variables, you'll also need dump and
>> load methods. Unfortunately, we don't have good documentation for writing
>> custom analyzers yet.
>>
>>
>
> Thanks for the quick response and accurate diagnosis, Nick. I see you've
> already committed changes to the POD, so that will be very helpful in the
> future.
>
> Hao, if you add the `sub equals` method to your ChineseAnalyzer, I think
> that should fix your problem. I have confirmed that locally with my own
> tests.
>
>
>
> --
> Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman
>

Re: [lucy-user] Custom Analyzer [was Chinese support?]

Posted by Peter Karman <pe...@peknet.com>.
Nick Wellnhofer wrote on 2/23/17 6:49 AM:
> On 23/02/2017 04:52, Peter Karman wrote:
>> package MyAnalyzer {
>>     use base qw( Lucy::Analysis::Analyzer );
>>     sub transform { $_[1] }
>> }
>
> Every Analyzer needs an `equals` method. For simple Analyzers, it can simply
> check whether the class of the other object matches:
>
>     package MyAnalyzer {
>         use base qw( Lucy::Analysis::Analyzer );
>         sub transform { $_[1] }
>         sub equals { $_[1]->isa(__PACKAGE__) }
>     }
>
> If the Analyzer uses (inside-out) member variables, you'll also need dump and
> load methods. Unfortunately, we don't have good documentation for writing custom
> analyzers yet.
>


Thanks for the quick response and accurate diagnosis, Nick. I see you've already 
committed changes to the POD, so that will be very helpful in the future.

Hao, if you add the `sub equals` method to your ChineseAnalyzer, I think that 
should fix your problem. I have confirmed that locally with my own tests.
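
For reference, a minimal untested sketch of ChineseAnalyzer.pm with that method
added, following Nick's example above:

    package ChineseAnalyzer;
    use base qw( Lucy::Analysis::Analyzer );

    # ... new, transform, and transform_text as before ...

    # Treat another object as equal if it belongs to (a subclass of) this
    # class; sufficient for an analyzer with no member variables.
    sub equals { $_[1]->isa(__PACKAGE__) }

    1;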


-- 
Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman

Re: [lucy-user] Custom Analyzer [was Chinese support?]

Posted by Nick Wellnhofer <we...@aevum.de>.
On 23/02/2017 04:52, Peter Karman wrote:
> package MyAnalyzer {
>     use base qw( Lucy::Analysis::Analyzer );
>     sub transform { $_[1] }
> }

Every Analyzer needs an `equals` method. For simple Analyzers, it can simply 
check whether the class of the other object matches:

     package MyAnalyzer {
         use base qw( Lucy::Analysis::Analyzer );
         sub transform { $_[1] }
         sub equals { $_[1]->isa(__PACKAGE__) }
     }

If the Analyzer uses (inside-out) member variables, you'll also need dump and 
load methods. Unfortunately, we don't have good documentation for writing 
custom analyzers yet.
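
As a rough, untested sketch (the class name and `language` member here are
invented for illustration), an analyzer holding an inside-out member variable
might override equals, dump, and load like this:

     package MyStemmer;
     use base qw( Lucy::Analysis::Analyzer );

     # Inside-out member storage, keyed on the underlying object address.
     my %language;

     sub new {
         my ( $class, %args ) = @_;
         my $language = delete $args{language};
         my $self     = $class->SUPER::new(%args);
         $language{$$self} = $language;
         return $self;
     }

     sub transform { $_[1] }

     sub equals {
         my ( $self, $other ) = @_;
         return 0 unless $other->isa(__PACKAGE__);
         return $language{$$self} eq $language{$$other};
     }

     sub dump {
         my $self = shift;
         my $dump = $self->SUPER::dump;
         $dump->{language} = $language{$$self};
         return $dump;
     }

     sub load {
         my ( $either, $dump ) = @_;
         my $self = $either->SUPER::load($dump);
         $language{$$self} = $dump->{language};
         return $self;
     }

     sub DESTROY {
         my $self = shift;
         delete $language{$$self};
         $self->SUPER::DESTROY;
     }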

> *** Error in `perl': corrupted double-linked list: 0x00000000021113a0 ***

This is something that should be fixed. I think that the following happens:

- The exception thrown in the Lucy code causes a refcount leak.
- Because of the leak, the object still exists in Perl's global destruction
   phase where the DESTROY method is invoked on the remaining objects in
   random order.
- So it can happen that Clownfish object A is destroyed with object B still
   referencing it. When B is destroyed, it tries to decrease the refcount of
   A, causing memory corruption.

We'll need a custom DESTROY implementation for Perl that ignores objects with 
a non-zero refcount or checks whether we're in the global destruction phase.
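
At the Perl level, that kind of guard might look roughly like this (a sketch
only, not Lucy's actual code; the real fix would live in the bindings, and
${^GLOBAL_PHASE} requires Perl 5.14+):

     sub DESTROY {
         my $self = shift;
         # During global destruction, objects are reaped in random order,
         # so skip teardown rather than touch peers that may already be gone.
         return if ${^GLOBAL_PHASE} eq 'DESTRUCT';
         $self->SUPER::DESTROY;
     }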

Nick


[lucy-user] Custom Analyzer [was Chinese support?]

Posted by Peter Karman <pe...@peknet.com>.
Marvin, Nick, others with more Lucy-fu than I possess:

The example below is failing. I have created a similar test case for a PR but 
am having trouble running tests within my local git checkout at the moment.

----------------------------------------------------------
#!/usr/bin/env perl

use strict;
use warnings;
use v5.10;
use Lucy;

package MyAnalyzer {
     use base qw( Lucy::Analysis::Analyzer );
     sub transform { $_[1] }
}

package main;

use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Index::Indexer;

my $path_to_index = shift(@ARGV) or die "usage: $0 path/to/index";

for my $try ( ( 1 .. 3 ) ) {
     my $schema = Lucy::Plan::Schema->new;

     my $my_analyzer = MyAnalyzer->new();

     my $raw_type = Lucy::Plan::FullTextType->new( analyzer => $my_analyzer, );

     $schema->spec_field( name => 'body', type => $raw_type );

     my $indexer = Lucy::Index::Indexer->new(
         index  => $path_to_index,
         schema => $schema,
         create => 1,
     );

     my $doc = { body => 'test' };
     $indexer->add_doc($doc);

     $indexer->commit;

     say "finished $try";
}
--------------------------------------------------------

Example above ^^ based on the gist below.

Hao Wu wrote on 2/20/17 11:40 PM:
> Hi Peter,
>
> Thanks for spending time on the script.
>
> I cleaned it up a bit, so there are no dependencies now.
>
> https://gist.github.com/swuecho/1b960ae17a1f47466be006fd14e3b7ff
>
> It still does not work.
>




-- 
Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman

Re: [lucy-user] Chinese support?

Posted by Hao Wu <ec...@gmail.com>.
Hi Peter,

Thanks for spending time on the script.

I cleaned it up a bit, so there are no dependencies now.

https://gist.github.com/swuecho/1b960ae17a1f47466be006fd14e3b7ff

It still does not work.




On Mon, Feb 20, 2017 at 9:03 PM, Peter Karman <pe...@peknet.com> wrote:

> Hao Wu wrote on 2/20/17 10:18 PM:
>
>> Hi Peter,
>>
>> Thanks for the reply.
>>
>> That could be a problem. But probably not in my case.
>>
>> I removed the old index.
>>
>> Running the program with 'ChineseAnalyzer' and truncate => 0 twice: the
>> second time gives me the error.
>>
>> 'body' assigned conflicting FieldType
>>         LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
>>         at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
>> line 118.
>>         Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
>> '/home/hwu/data/lucy/mitbbs.index', 'schema',
>> 'Lucy::Plan::Schema=SCALAR(0x1c56798)', 'create', 1) called at
>> mitbbs_index.pl line 26
>>
>> Running the program with 'ChineseAnalyzer' and truncate => 1 twice: no
>> error. But I want to update the index.
>>
>> Running the program with 'StandardTokenizer', with truncate 0 or 1, both
>> work fine.
>>
>> So this makes me think I must be missing something in the
>> 'ChineseAnalyzer' I have.
>>
>>
>
> This is not your fault, I don't think. This seems like a bug.
>
> Here's a smaller gist demonstrating the problem:
>
> https://gist.github.com/karpet/d8fe12085246b8419f9e4ab44930c1cc
>
> With the 2 files in the gist, I get this result:
>
> [karpet@pekmac:~/tmp/chinese-analyzer]$ perl indexer.pl test-index
> Building prefix dict from the default dictionary ...
> Loading model from cache /var/folders/r3/yk7hmbb9125fnsdf9bqs6lrm0000gp/T/jieba.cache
> Loading model cost 0.553 seconds.
> Prefix dict has been built succesfully.
> Finished.
>
> [karpet@pekmac:~/tmp/chinese-analyzer]$ perl indexer.pl test-index
> 'body' assigned conflicting FieldType
>         LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
>         at /usr/local/perl/5.24.0/lib/site_perl/5.24.0/darwin-2level/Lucy.pm
> line 118.
>         Lucy::Index::Indexer::new("Lucy::Index::Indexer", "index",
> "test-index", "schema", Lucy::Plan::Schema=SCALAR(0x7f9b0b004a18),
> "create", 1) called at indexer.pl line 23
> Segmentation fault: 11
>
>
>
> I would expect the code to work as you wrote it, so maybe someone else can
> spot what's going wrong.
>
> Here's what the schema_1.json file looks like after the initial index
> creation:
>
> {
>   "_class": "Lucy::Plan::Schema",
>   "analyzers": [
>     null,
>     {
>       "_class": "ChineseAnalyzer"
>     }
>   ],
>   "fields": {
>     "body": {
>       "analyzer": "1",
>       "type": "fulltext"
>     }
>   }
> }
>
>
> --
> Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman
>

Re: [lucy-user] Chinese support?

Posted by Peter Karman <pe...@peknet.com>.
Hao Wu wrote on 2/20/17 10:18 PM:
> Hi Peter,
>
> Thanks for the reply.
>
> That could be a problem. But probably not in my case.
>
> I removed the old index.
>
> Running the program with 'ChineseAnalyzer' and truncate => 0 twice: the second
> time gives me the error.
>
> 'body' assigned conflicting FieldType
>         LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
>         at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm line 118.
>         Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
> '/home/hwu/data/lucy/mitbbs.index', 'schema',
> 'Lucy::Plan::Schema=SCALAR(0x1c56798)', 'create', 1) called at
> mitbbs_index.pl line 26
>
> Running the program with 'ChineseAnalyzer' and truncate => 1 twice: no error.
> But I want to update the index.
>
> Running the program with 'StandardTokenizer', with truncate 0 or 1, both work fine.
>
> So this makes me think I must be missing something in the 'ChineseAnalyzer' I have.
>


This is not your fault, I don't think. This seems like a bug.

Here's a smaller gist demonstrating the problem:

https://gist.github.com/karpet/d8fe12085246b8419f9e4ab44930c1cc

With the 2 files in the gist, I get this result:

[karpet@pekmac:~/tmp/chinese-analyzer]$ perl indexer.pl test-index
Building prefix dict from the default dictionary ...
Loading model from cache 
/var/folders/r3/yk7hmbb9125fnsdf9bqs6lrm0000gp/T/jieba.cache
Loading model cost 0.553 seconds.
Prefix dict has been built succesfully.
Finished.

[karpet@pekmac:~/tmp/chinese-analyzer]$ perl indexer.pl test-index
'body' assigned conflicting FieldType
	LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
	at /usr/local/perl/5.24.0/lib/site_perl/5.24.0/darwin-2level/Lucy.pm line 118.
	Lucy::Index::Indexer::new("Lucy::Index::Indexer", "index", "test-index", 
"schema", Lucy::Plan::Schema=SCALAR(0x7f9b0b004a18), "create", 1) called at 
indexer.pl line 23
Segmentation fault: 11



I would expect the code to work as you wrote it, so maybe someone else can spot 
what's going wrong.

Here's what the schema_1.json file looks like after the initial index creation:

{
   "_class": "Lucy::Plan::Schema",
   "analyzers": [
     null,
     {
       "_class": "ChineseAnalyzer"
     }
   ],
   "fields": {
     "body": {
       "analyzer": "1",
       "type": "fulltext"
     }
   }
}


-- 
Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman

Re: [lucy-user] Chinese support?

Posted by Hao Wu <ec...@gmail.com>.
Hi Peter,

Thanks for the reply.

That could be a problem. But probably not in my case.

I removed the old index.

Running the program with 'ChineseAnalyzer' and truncate => 0 twice: the second
time gives me the error.

'body' assigned conflicting FieldType
        LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
        at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
line 118.
        Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
'/home/hwu/data/lucy/mitbbs.index', 'schema',
'Lucy::Plan::Schema=SCALAR(0x1c56798)', 'create', 1) called at
mitbbs_index.pl line 26

Running the program with 'ChineseAnalyzer' and truncate => 1 twice: no error.
But I want to update the index.

Running the program with 'StandardTokenizer', with truncate 0 or 1, both work
fine.

So this makes me think I must be missing something in the 'ChineseAnalyzer' I
have.




On Mon, Feb 20, 2017 at 6:47 PM, Peter Karman <pe...@peknet.com> wrote:

> Hao Wu wrote on 2/20/17 6:12 PM:
>
>> I still have a problem when I try to update the index using the custom
>> analyzer.
>>
>> If I comment out the
>>    truncate => 1
>>
>> and rerun, I get the following error.
>>
>>
>> 'body' assigned conflicting FieldType
>>         LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
>>         at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy
>> .pm
>> line 118.
>>         Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
>> '/home/hwu/data/lucy/mitbbs.index', 'schema',
>> 'Lucy::Plan::Schema=SCALAR(0x211c758)', 'create', 1) called at
>> mitbbs_index.pl line 26
>> *** Error in `perl': corrupted double-linked list: 0x00000000021113a0 ***
>>
>> If I switch the analyzer to Lucy::Analysis::StandardTokenizer, it works
>> fine; a new seg_2 is created.
>>
>> my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
>> my $raw_type = Lucy::Plan::FullTextType->new(
>>         analyzer => $tokenizer,
>> );
>>
>> So I guess I must be missing something in the custom Chinese Analyzer.
>>
>>
> since you changed the field definition with a new analyzer, you must
> create a new index. You cannot update an existing index with 2 different
> field definitions in the same schema.
>
>
>
> --
> Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman
>

Re: [lucy-user] Chinese support?

Posted by Peter Karman <pe...@peknet.com>.
Hao Wu wrote on 2/20/17 6:12 PM:
> I still have a problem when I try to update the index using the custom analyzer.
>
> If I comment out the
>    truncate => 1
>
> and rerun, I get the following error.
>
>
> 'body' assigned conflicting FieldType
>         LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
>         at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
> line 118.
>         Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
> '/home/hwu/data/lucy/mitbbs.index', 'schema',
> 'Lucy::Plan::Schema=SCALAR(0x211c758)', 'create', 1) called at
> mitbbs_index.pl line 26
> *** Error in `perl': corrupted double-linked list: 0x00000000021113a0 ***
>
> If I switch the analyzer to Lucy::Analysis::StandardTokenizer, it works fine;
> a new seg_2 is created.
>
> my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
> my $raw_type = Lucy::Plan::FullTextType->new(
>         analyzer => $tokenizer,
> );
>
> So I guess I must be missing something in the custom Chinese Analyzer.
>

since you changed the field definition with a new analyzer, you must create a 
new index. You cannot update an existing index with 2 different field 
definitions in the same schema.
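
For example (sketch only; the new path below is invented), point the indexer at 
a fresh index whenever the field definition changes:

my $indexer = Lucy::Index::Indexer->new(
    index  => '/home/hwu/data/lucy/mitbbs-chinese.index',  # a fresh path
    schema => $schema,    # the schema with the new ChineseAnalyzer field
    create => 1,
);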


-- 
Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman

Re: [lucy-user] Chinese support?

Posted by Hao Wu <ec...@gmail.com>.
I still have a problem when I try to update the index using the custom analyzer.

If I comment out the
   truncate => 1

and rerun, I get the following error.


'body' assigned conflicting FieldType
        LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
        at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
line 118.
        Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
'/home/hwu/data/lucy/mitbbs.index', 'schema',
'Lucy::Plan::Schema=SCALAR(0x211c758)', 'create', 1) called at
mitbbs_index.pl line 26
*** Error in `perl': corrupted double-linked list: 0x00000000021113a0 ***

If I switch the analyzer to Lucy::Analysis::StandardTokenizer, it works fine;
a new seg_2 is created.

my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
my $raw_type = Lucy::Plan::FullTextType->new(
        analyzer => $tokenizer,
);

So I guess I must be missing something in the custom Chinese Analyzer.



------------------my script--------------------

#!/usr/local/bin/perl
# TODO: update existing docs instead of recreating the index every time
use strict;
use warnings;
use DBI;
use File::Spec::Functions qw( catfile );

use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Index::Indexer;

use ChineseAnalyzer;

my $path_to_index = '/home/hwu/data/lucy/mitbbs.index';

# Create Schema.
my $schema = Lucy::Plan::Schema->new;

my $chinese = ChineseAnalyzer->new();

my $raw_type = Lucy::Plan::FullTextType->new(
        analyzer => $chinese,
);

$schema->spec_field( name => 'body',  type => $raw_type);

# Create an Indexer object.
my $indexer = Lucy::Index::Indexer->new(
    index    => $path_to_index,
    schema   => $schema,
    create   => 1,
    truncate => 1,
);

my $driver   = "SQLite";
my $database = "/home/hwu/data/mitbbs.db";
my $dsn = "DBI:$driver:dbname=$database";
my $dbh = DBI->connect( $dsn, "", "", { RaiseError => 1 } ) or die $DBI::errstr;


my $stmt = qq(SELECT id, text from post where id >= 100 and id < 200;);
#my $stmt = qq(SELECT id, text from post where id < 100;);
my $sth = $dbh->prepare( $stmt );
my $rv = $sth->execute() or die $DBI::errstr;

while(my @row = $sth->fetchrow_array()) {
      print "id = ". $row[0] . "\n";
      print $row[1];
      my $doc = { body => $row[1] };
      $indexer->add_doc($doc);
}

$indexer->commit;

print "Finished.\n";

On Sat, Feb 18, 2017 at 6:46 AM, Nick Wellnhofer <we...@aevum.de>
wrote:

> On 18/02/2017 07:22, Hao Wu wrote:
>
>> Thanks. Got it working.
>>
>
> Lucy's StandardTokenizer breaks up the text at the word boundaries defined
> in Unicode Standard Annex #29. Then we treat every Alphabetic character
> that doesn't have a Word_Break property as a single term. These are
> characters that match \p{Ideographic}, \p{Script: Hiragana}, or
> \p{Line_Break: Complex_Context}. This should work for Chinese but as Peter
> mentioned, we don't support n-grams.
>
> If you're using QueryParser, you're likely to run into problems, though.
> QueryParser will turn a sequence of Chinese characters into a PhraseQuery
> which is obviously wrong. A quick hack is to insert a space after every
> Chinese character before passing a query string to QueryParser:
>
>     $query_string =~ s/\p{Ideographic}/$& /g;
>
> Nick
>
>

Re: [lucy-user] Chinese support?

Posted by Nick Wellnhofer <we...@aevum.de>.
On 18/02/2017 07:22, Hao Wu wrote:
> Thanks. Got it working.

Lucy's StandardTokenizer breaks up the text at the word boundaries defined in 
Unicode Standard Annex #29. Then we treat every Alphabetic character that 
doesn't have a Word_Break property as a single term. These are characters that 
match \p{Ideographic}, \p{Script: Hiragana}, or \p{Line_Break: 
Complex_Context}. This should work for Chinese but as Peter mentioned, we 
don't support n-grams.

If you're using QueryParser, you're likely to run into problems, though. 
QueryParser will turn a sequence of Chinese characters into a PhraseQuery 
which is obviously wrong. A quick hack is to insert a space after every 
Chinese character before passing a query string to QueryParser:

     $query_string =~ s/\p{Ideographic}/$& /g;
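
For instance (a sketch only; the index path and query text are invented), the 
hack can be applied just before parsing:

     use utf8;   # for the Chinese literal below
     use Lucy::Search::IndexSearcher;
     use Lucy::Search::QueryParser;

     my $searcher = Lucy::Search::IndexSearcher->new( index => 'path/to/index' );
     my $qparser  = Lucy::Search::QueryParser->new(
         schema => $searcher->get_schema,
     );

     my $query_string = '永和服装';
     $query_string =~ s/\p{Ideographic}/$& /g;   # now '永 和 服 装 '
     my $query = $qparser->parse($query_string);
     my $hits  = $searcher->hits( query => $query );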

Nick


Re: [lucy-user] Chinese support?

Posted by Hao Wu <ec...@gmail.com>.
Thanks. Got it working.

Code pasted below in case anyone has a similar question.

package ChineseAnalyzer;
use Jieba;
use v5.10;
use Encode qw(decode_utf8);
use base qw( Lucy::Analysis::Analyzer );

sub new {
    my $self = shift->SUPER::new;
    return $self;
}

sub transform {
    my ($self, $inversion)= @_;
    return $inversion;
}

sub transform_text {
    my ($self, $text) = @_;
    my $inversion = Lucy::Analysis::Inversion->new;
    my @tokens = Jieba::jieba_tokenize(decode_utf8($text));
    $inversion->append(
        Lucy::Analysis::Token->new(
            text         => $_->[0],
            start_offset => $_->[1],
            end_offset   => $_->[2],
        )
    ) for @tokens;
    return $inversion;
}

1;



package Jieba;
use v5.10;

sub jieba_tokenize {
    jieba_tokenize_python(shift);
}

# TODO:
#result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
use Inline Python => <<'END_OF_PYTHON_CODE';
from jieba import tokenize

def jieba_tokenize_python(text):
    seg_list = tokenize(text, mode='search')
    return(list(seg_list))

END_OF_PYTHON_CODE

1;


On Fri, Feb 17, 2017 at 6:29 PM, Peter Karman <pe...@peknet.com> wrote:

> Hao Wu wrote on 2/17/17 4:44 PM:
>
>> Hi all,
>>
>> I use the StandardTokenizer. Searching by English words works, but
>> searching in Chinese gives me strange results.
>>
>> my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
>> my $raw_type = Lucy::Plan::FullTextType->new(
>>         analyzer => $tokenizer,
>> );
>>
>> I was also going to use the EasyAnalyzer (
>> https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod
>> ), but Chinese is not supported.
>>
>> What is the simplest way to use Lucy with Chinese documents? Thanks.
>>
>
> There is currently no equivalent of
> https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKTokenizer.html
> within core Lucy.
>
> Furthermore, there is no automatic language detection in Lucy. You'll note in
> https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod
> that the language must be explicitly specified, and that is for the
> stemming analyzer. Also, Chinese is not among the supported languages
> listed.
>
> Maybe something wrapped around
> https://metacpan.org/pod/Lingua::CJK::Tokenizer would work as a custom
> analyzer.
>
> You can see an example in the documentation here:
> https://metacpan.org/pod/Lucy::Analysis::Analyzer#new
>
>
>
> --
> Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman
>

Re: [lucy-user] Chinese support?

Posted by Peter Karman <pe...@peknet.com>.
Hao Wu wrote on 2/17/17 4:44 PM:
> Hi all,
>
> I use the StandardTokenizer. Searching by English words works, but
> searching in Chinese gives me strange results.
>
> my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
> my $raw_type = Lucy::Plan::FullTextType->new(
>         analyzer => $tokenizer,
> );
>
> I was also going to use the EasyAnalyzer (
> https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod
> ), but Chinese is not supported.
>
> What is the simplest way to use Lucy with Chinese documents? Thanks.

There is currently no equivalent of
https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKTokenizer.html
within core Lucy.

Furthermore, there is no automatic language detection in Lucy. You'll note in 
https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod
that the language must be explicitly specified, and that is for the stemming 
analyzer. Also, Chinese is not among the supported languages listed.

Maybe something wrapped around https://metacpan.org/pod/Lingua::CJK::Tokenizer 
would work as a custom analyzer.

You can see an example in the documentation here:
https://metacpan.org/pod/Lucy::Analysis::Analyzer#new



-- 
Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman