Posted to dev@lucene.apache.org by Pichai Ongvasith <pi...@yahoo.com> on 2004/02/18 01:46:04 UTC

Thai analyzer

Hi,

I have written some simple adaptor/wrapper classes for
java.text.BreakIterator, which is available in JDK 1.4
and later. I also created a ThaiAnalyzer class based on
those wrappers.

Thai is one of those languages that have no whitespace
between words. Because of this, Lucene's
StandardTokenizer can't split a Thai sentence into
words and returns the whole sentence as a single token.

JDK 1.4 comes with a simple dictionary-based tokenizer
for Thai. With the wrappers, I can use the Thai
BreakIterator to tokenize the Thai sentences returned
by StandardTokenizer.
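
For readers who have not used java.text.BreakIterator, here is a
minimal sketch of word breaking with it. It is not the code from the
zip file; the class name ThaiBreakDemo and the method words() are made
up for illustration, and it uses current Java syntax rather than
JDK 1.4. Whether Thai text is actually split into words depends on the
Thai locale data shipped with the running JDK, so no output is shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of the posted ThaiAnalyzer code.
final class ThaiBreakDemo {

    /**
     * Splits text at the word boundaries reported by the BreakIterator
     * for the given locale. Whitespace-only segments are dropped;
     * punctuation segments are kept as they are.
     */
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end).trim();
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The Thai phrase for "Thai language", written with Unicode
        // escapes so the source file stays pure ASCII.
        String thai = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";
        System.out.println(words(thai, Locale.forLanguageTag("th")));
        System.out.println(words("hello world", Locale.ENGLISH));
    }
}
```

If the JDK has no Thai break data, getWordInstance falls back to its
default rules and the Thai phrase may come back as a single piece.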

My design is quite simple. I added a <THAI> tag to
StandardTokenizer.jj (I renamed it to
TestStandardTokenizer.jj in my test). The
StandardTokenizer then returns each Thai sentence as a
token tagged <THAI>, among other ordinary tokens. Then
BreakIteratorTokenTokenizer detects that token and
breaks it down further into smaller tokens, which
represent the actual Thai words.
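
The two-stage idea described above can be sketched without any Lucene
classes. This is only an illustration under assumptions, not the
BreakIteratorTokenTokenizer from the zip file: Tok is a hypothetical
stand-in for a Lucene token (text, offsets, type), and split() and
ThaiSplitSketch are names invented here. Tokens whose type matches are
re-broken with a BreakIterator and their offsets are shifted so they
still point into the original text; all other tokens pass through.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not the posted implementation.
final class ThaiSplitSketch {

    /** Stand-in for a Lucene token: text, start/end offsets, type tag. */
    static final class Tok {
        final String text;
        final int start;
        final int end;
        final String type;

        Tok(String text, int start, int end, String type) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + "]=" + text;
        }
    }

    /**
     * Passes tokens through unchanged unless their type equals
     * typeToSplit; those are broken into smaller tokens at the
     * boundaries reported by the given BreakIterator.
     */
    static List<Tok> split(List<Tok> in, BreakIterator breaker, String typeToSplit) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            if (!typeToSplit.equals(t.type)) {
                out.add(t);
                continue;
            }
            breaker.setText(t.text);
            int s = breaker.first();
            for (int e = breaker.next(); e != BreakIterator.DONE; s = e, e = breaker.next()) {
                String piece = t.text.substring(s, e);
                if (piece.trim().isEmpty()) {
                    continue; // skip whitespace-only segments
                }
                // Offsets are relative to the original input text.
                out.add(new Tok(piece, t.start + s, t.start + e, t.type));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> in = new ArrayList<>();
        in.add(new Tok("Lucene", 0, 6, "<ALPHANUM>"));
        // "Thai language" in Thai, as Unicode escapes (ASCII-only source).
        in.add(new Tok("\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22", 7, 14, "<THAI>"));
        BreakIterator thai = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        System.out.println(split(in, thai, "<THAI>"));
    }
}
```

Passing the BreakIterator in as a parameter keeps the splitting step
independent of Thai, so the same wrapper could serve other locales.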

The source code is available here
http://pichai.netfirms.com/thai_analyzer.zip

I'm not sure if this code is worth being part of
Lucene. If it is, I can modify the code as you guys
suggest, and contribute it to the Lucene project.

Thanks,
Pichai


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Thai analyzer

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Pichai,

This contribution looks fine.  If you are fine with it being part of 
the sandbox analyzer area, please attach this to a Bugzilla issue 
(first create the issue, then attach the .zip file) - this way it won't 
get lost in the shuffle and will eventually be added to the sandbox.

Thanks
	Erik

On Feb 23, 2004, at 5:11 AM, Pichai Ongvasith wrote:

> The code and a test case are available at
> http://pichai.netfirms.com/download.html
>
> There are quite a few files in the th package, but
> most of them are generated by JavaCC. I also copied
> some of them from the analysis.standard package.
>
> thanks,
> pichai
> --- Erik Hatcher <er...@ehatchersolutions.com> wrote:
>>
>> We could certainly add it to the Sandbox analyzers
>> section.  If you
>> could package it up with the Apache License attached
>> like you see the
>> code in the jakarta-lucene-sandbox repository along
>> with some test
>> cases I would add it there.
>>
>> 	Erik
>>
>>
>>




Re: Thai analyzer

Posted by Pichai Ongvasith <pi...@yahoo.com>.
The code and a test case are available at
http://pichai.netfirms.com/download.html

There are quite a few files in the th package, but
most of them are generated by JavaCC. I also copied
some of them from the analysis.standard package.

thanks,
pichai
--- Erik Hatcher <er...@ehatchersolutions.com> wrote:
> 
> We could certainly add it to the Sandbox analyzers
> section.  If you 
> could package it up with the Apache License attached
> like you see the 
> code in the jakarta-lucene-sandbox repository along
> with some test 
> cases I would add it there.
> 
> 	Erik
> 
> 
> 





Re: Thai analyzer

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 17, 2004, at 7:46 PM, Pichai Ongvasith wrote:
> I'm not sure if this code is worth being part of
> Lucene. If it is, I can modify the code as you guys
> suggest, and contribute it to Lucene project.

We could certainly add it to the Sandbox analyzers section.  If you 
could package it up with the Apache License attached, as in the code 
in the jakarta-lucene-sandbox repository, along with some test cases, 
I would add it there.

	Erik

