Posted to java-user@lucene.apache.org by CassUser CassUser <ca...@gmail.com> on 2011/02/17 20:05:45 UTC

Splitting word tokens - other languages

Hey all,

I'm somewhat new to Lucene; I used it some time ago for a parser we
wrote to tokenize documents into word grams.

The approach I took was straightforward:

1. Extended the Lucene Analyzer class.
2. In the tokenStream method, used a ShingleMatrixFilter, passing in the
standard tokenizer plus the shingle min/max sizes and spacer character
(a rough sketch is below).
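
Roughly, the analyzer looked something like this (Lucene 3.x contrib
shingle module; the shingle sizes and the space spacer character here are
just example values, not our exact settings):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Sketch of the word-gram analyzer described above (Lucene 3.x style API).
public class WordGramAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Split the text into word tokens first.
    TokenStream words = new StandardTokenizer(Version.LUCENE_30, reader);
    // Then combine adjacent words into shingles (word grams) of 2 to 3 words,
    // joined with a space as the spacer character.
    return new ShingleMatrixFilter(words, 2, 3, ' ');
  }
}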

This worked pretty well for us.  Now we would like to tokenize Hangul/Korean
text into word grams.

I'm curious whether others have done something similar and would be willing
to share their experience.  Any pointers for getting started with this would be great.

Thanks.

Re: Splitting word tokens - other languages

Posted by Simon Willnauer <si...@googlemail.com>.
Hey,

I am not an expert on this, but I think you should look into
CJKAnalyzer / CJKTokenizer.
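
Untested, but roughly something like this (Lucene 3.x contrib analyzers;
the field name and sample string are arbitrary, and note that CJKAnalyzer
emits overlapping character bigrams, so check whether that is the gram
size you actually want for Hangul):

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class CJKDemo {
  public static void main(String[] args) throws Exception {
    // CJKAnalyzer wraps CJKTokenizer, which emits overlapping bigrams of CJK characters.
    CJKAnalyzer analyzer = new CJKAnalyzer(Version.LUCENE_31);
    TokenStream ts = analyzer.tokenStream("field", new StringReader("안녕하세요 루씬"));
    // On Lucene before 3.1, use TermAttribute instead of CharTermAttribute.
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset(); // required on newer Lucene versions, harmless here
    while (ts.incrementToken()) {
      System.out.println(term.toString());
    }
    ts.close();
  }
}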

simon

