Posted to dev@lucene.apache.org by Pichai Ongvasith <pi...@yahoo.com> on 2004/02/18 01:46:04 UTC
Thai analyzer
Hi,
I have written some simple adaptor/wrapper classes for
java.text.BreakIterator, available in JDK 1.4 and
later. I also created a ThaiAnalyzer class based on
those wrappers.
Thai is one of those languages that has no whitespace
between words. Because of this, Lucene's
StandardTokenizer can't tokenize a Thai sentence; it
returns the whole sentence as a single token.
JDK 1.4 comes with a simple dictionary-based tokenizer
for Thai. With the wrappers, I can use the Thai
BreakIterator to tokenize the Thai sentences returned by
StandardTokenizer.
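As a minimal sketch of the idea above (not the code from the zip file), the
following shows how java.text.BreakIterator can be asked for word boundaries
in a given locale. The class name WordBreakSketch and the method words() are
hypothetical names chosen for illustration. How finely the Thai text is
segmented depends on the break data shipped with the Java runtime in use, so
no particular output is claimed here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not one of the wrapper classes from the post.
class WordBreakSketch {

    /**
     * Splits text at the word boundaries reported by BreakIterator and keeps
     * only the segments that contain at least one letter or digit, so that
     * whitespace and punctuation segments are dropped.
     */
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (containsLetterOrDigit(segment)) {
                out.add(segment);
            }
        }
        return out;
    }

    private static boolean containsLetterOrDigit(String s) {
        return s.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    public static void main(String[] args) {
        // Unicode escapes for a short Thai phrase meaning "Thai language".
        String thai = "\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22";
        // The segmentation printed here depends on the runtime's Thai break data.
        System.out.println(words(thai, Locale.forLanguageTag("th")));
    }
}
```

Whatever boundaries the runtime reports, the segments returned for an
all-Thai string join back into the original string, which makes the helper
easy to sanity-check.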
My design is quite simple. I added a <THAI> tag to
StandardTokenizer.jj (I renamed it to
TestStandardTokenizer.jj in my test). The
StandardTokenizer then returns a Thai sentence with
the tag <THAI>, among other ordinary tokens. Then
BreakIteratorTokenTokenizer detects the token and
further breaks it down into smaller tokens, which
represent actual Thai words.
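The two-stage design described above can be sketched roughly as follows. This
is plain Java with no Lucene dependency: the Tok class is a hypothetical
stand-in for a Lucene token, and TwoStageSketch and splitThaiRuns() are
made-up names, so this should not be read as the actual
BreakIteratorTokenTokenizer. It assumes the first stage has already tagged
each run of Thai text with the type <THAI>.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only; names and structure are assumptions, not the posted code.
class TwoStageSketch {

    static final String THAI = "<THAI>";

    // Hypothetical stand-in for a token: text, type tag, and character offsets.
    static final class Tok {
        final String text;
        final String type;
        final int start;
        final int end;

        Tok(String text, String type, int start, int end) {
            this.text = text;
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Second stage: passes ordinary tokens through unchanged and breaks each
     * token tagged <THAI> into smaller tokens at BreakIterator word
     * boundaries, shifting offsets so they stay relative to the original text.
     */
    static List<Tok> splitThaiRuns(List<Tok> input, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        List<Tok> out = new ArrayList<>();
        for (Tok t : input) {
            if (!THAI.equals(t.type)) {
                out.add(t); // not a Thai run, keep as-is
                continue;
            }
            it.setText(t.text);
            int s = it.first();
            for (int e = it.next(); e != BreakIterator.DONE; s = e, e = it.next()) {
                out.add(new Tok(t.text.substring(s, e), THAI, t.start + s, t.start + e));
            }
        }
        return out;
    }
}
```

Keeping the Thai handling in a separate second stage means the grammar only
needs one extra token type, and the dictionary lookup is delegated to the
BreakIterator.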
The source code is available here
http://pichai.netfirms.com/thai_analyzer.zip
I'm not sure if this code is worth being part of
Lucene. If it is, I can modify the code as you guys
suggest, and contribute it to the Lucene project.
Thanks,
Pichai
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: Thai analyzer
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Pichai,
This contribution looks fine. If you are fine with it being part of
the sandbox analyzer area, please attach this to a Bugzilla issue
(first create the issue, then attach the .zip file) - this way it won't
get lost in the shuffle and will eventually be added to the sandbox.
Thanks
Erik
On Feb 23, 2004, at 5:11 AM, Pichai Ongvasith wrote:
> The code and a test case are available at
> http://pichai.netfirms.com/download.html
>
> There are quite a few files in the th package, but
> most of them are generated by JavaCC. I also copied
> some of them from the analysis.standard package.
>
> thanks,
> pichai
Re: Thai analyzer
Posted by Pichai Ongvasith <pi...@yahoo.com>.
The code and a test case are available at
http://pichai.netfirms.com/download.html
There are quite a few files in the th package, but
most of them are generated by JavaCC. I also copied
some of them from the analysis.standard package.
thanks,
pichai
--- Erik Hatcher <er...@ehatchersolutions.com> wrote:
>
> We could certainly add it to the Sandbox analyzers
> section. If you could package it up with the Apache
> License attached, as in the code in the
> jakarta-lucene-sandbox repository, along with some
> test cases, I would add it there.
>
> Erik
>
>
>
Re: Thai analyzer
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 17, 2004, at 7:46 PM, Pichai Ongvasith wrote:
> I'm not sure if this code is worth being part of
> Lucene. If it is, I can modify the code as you guys
> suggest, and contribute it to Lucene project.
We could certainly add it to the Sandbox analyzers section. If you
could package it up with the Apache License attached, as in the code
in the jakarta-lucene-sandbox repository, along with some test cases,
I would add it there.
Erik