Posted to java-user@lucene.apache.org by CassUser CassUser <ca...@gmail.com> on 2011/02/17 20:05:45 UTC
Splitting word tokens - other languages
Hey all,
I'm somewhat new to Lucene: I used it some time ago for a parser we
wrote to tokenize a document into word grams.
The approach I took was simple:
1. Extended the Lucene Analyzer class.
2. In the tokenStream method, used a ShingleMatrixFilter, passing in the
standard tokenizer and the shingle min/max/splitter settings.
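For readers unfamiliar with the term, the "word grams" (shingles) that the two steps above produce can be illustrated with a minimal plain-Java sketch. This is not Lucene code and not the poster's analyzer: the class name `WordShingles`, the method `shingles`, and its parameters are hypothetical, and whitespace splitting stands in for the standard tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: plain Java, not Lucene's ShingleMatrixFilter.
// All names here are hypothetical.
final class WordShingles {

    // Returns every run of minSize..maxSize consecutive words,
    // joined with the given spacer character.
    static List<String> shingles(String text, int minSize, int maxSize, char spacer) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        // Assumption: whitespace splitting stands in for a real tokenizer.
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder gram = new StringBuilder();
            for (int size = 1; size <= maxSize && start + size <= words.length; size++) {
                if (size > 1) {
                    gram.append(spacer);
                }
                gram.append(words[start + size - 1]);
                if (size >= minSize) {
                    out.add(gram.toString());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints one shingle per line for sizes 1 and 2.
        for (String s : shingles("please divide this sentence", 1, 2, '_')) {
            System.out.println(s);
        }
    }
}
```

With min 1 and max 2, each word is emitted on its own and then paired with the word that follows it.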
This worked pretty well for us. Now we would like to tokenize Hangul/Korean
into word grams.
I'm curious whether others have done something similar and would share their
experience. Any pointers to get started with this would be great.
Thanks.
Re: Splitting word tokens - other languages
Posted by Simon Willnauer <si...@googlemail.com>.
Hey,
I am not an expert on this, but I think you should look into
CJKAnalyzer / CJKTokenizer.
simon
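As far as its documentation describes it, CJKTokenizer emits every two adjacent CJK characters as an overlapping token. The plain-Java sketch below illustrates that idea for Hangul only. It is not Lucene's implementation: the class `HangulBigrams` and its methods are hypothetical, and unlike the real tokenizer it simply drops non-Hangul text.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: overlapping character bigrams over runs of Hangul.
// Not Lucene's CJKTokenizer; all names here are hypothetical.
final class HangulBigrams {

    static boolean isHangul(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HANGUL;
    }

    // Splits text into runs of Hangul characters and emits overlapping
    // two-character tokens per run; a run of one character is emitted as-is.
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        List<Integer> run = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isHangul(cp)) {
                run.add(cp);
            } else {
                flush(run, out); // any non-Hangul character ends the run
            }
        }
        flush(run, out);
        return out;
    }

    private static void flush(List<Integer> run, List<String> out) {
        if (run.size() == 1) {
            out.add(new String(Character.toChars(run.get(0))));
        }
        for (int k = 0; k + 1 < run.size(); k++) {
            out.add(new StringBuilder()
                    .appendCodePoint(run.get(k))
                    .appendCodePoint(run.get(k + 1))
                    .toString());
        }
        run.clear();
    }

    public static void main(String[] args) {
        // Two Hangul words separated by a space, written as Unicode escapes.
        String sample = "\uD55C\uAD6D\uC5B4 \uCC98\uB9AC";
        for (String token : bigrams(sample)) {
            System.out.println(token);
        }
    }
}
```

Tokens like these could then be fed into a shingle step in place of whitespace-separated words, though whether character bigrams or whole space-delimited words suit Korean word grams better depends on the application.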