Posted to user@mahout.apache.org by Bhaskar Ghosh <bj...@yahoo.co.in> on 2010/10/02 17:46:40 UTC

How to get multi-language support for training/classifying text into classes through Mahout?

Dear All,

I have a requirement to classify text in a non-English language. I
have heard that Mahout supports multiple languages. Can anyone please tell me
how to achieve this? Documents or links with examples on this
would be really helpful.
 Regards
Bhaskar Ghosh
Hyderabad, India

http://www.google.com/profiles/bjgindia

"Ignorance is Bliss... Knowledge never brings Peace!!!"



Re: How to get multi-language support for training/classifying text into classes through Mahout?

Posted by Robin Anil <ro...@gmail.com>.
The classifier supports non-English tokens (it assumes the string is UTF-8 encoded).
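A minimal, hypothetical sketch of what that means in practice (plain JDK, not Mahout code): once the bytes are decoded as UTF-8, a naive whitespace split treats Devanagari and Latin words identically, with no per-language configuration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class Utf8Tokens {
    // Decode UTF-8 bytes into a String, then split on whitespace;
    // no language-specific setup is needed for this step.
    static List<String> tokenize(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        byte[] raw = "नमस्ते दुनिया hello world".getBytes(StandardCharsets.UTF_8);
        // Hindi and English words fall out of the same split.
        System.out.println(tokenize(raw));  // [नमस्ते, दुनिया, hello, world]
    }
}
```

The encoding question and the tokenization question are separate: UTF-8 only gets you correct characters; whether whitespace is a good word boundary depends on the language.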

Robin


On Sat, Oct 2, 2010 at 9:16 PM, Bhaskar Ghosh <bj...@yahoo.co.in> wrote:

> Dear All,
>
> I have a requirement where I need to classify text in a non-English
> language. I
> have heard that Mahout supports multi-language. Can anyone please tell me
> how do
> I achieve this? Some documents/links where I can get some examples on this,
> would be really really helpful.
>  Regards
> Bhaskar Ghosh
> Hyderabad, India
>
> http://www.google.com/profiles/bjgindia
>
> "Ignorance is Bliss... Knowledge never brings Peace!!!"
>
>
>

Re: How to get multi-language support for training/classifying text into classes through Mahout?

Posted by Lance Norskog <go...@gmail.com>.
Lucene these days has custom code for Hindi. I have no idea how good it is.

The custom code in Lucene for particular languages often solves one 
person's problem, rather than being a general-purpose tool.

Ted Dunning wrote:
> Yes.  Post to the Lucene list and if you get an answer from Robert Muir,
> listen especially carefully.
>
> To answer the question, this code snippet could be adapted/corrected to
> print out tokens in your data.  (don't
> assume it works as it stands!)
>
>      for (String line : Files.readLines(new File("my/file/here"))) {
>         TokenStream ts = analyzer.tokenStream("text", new
> StringReader(line));
>         ts.addAttribute(TermAttribute.class);
>         while (ts.incrementToken()) {
>            String s = ts.getAttribute(TermAttribute.class).term();
>            words.add(s);
>         }
>      }
>
>
> On Sun, Oct 3, 2010 at 9:34 AM, Ken Krugler<kk...@transpac.com>wrote:
>
>    
>> Hi Neil,
>>
>> That's really a Lucene question, not something for Mahout.
>>
>> If you post to the Lucene list, you're also likely get some useful feedback
>> from the community about whether there are issues with tokenizing Hindi.
>>
>> E.g. there was an email from last summer about this same topic. Snippet is:
>>
>>   Apart from using WhiteSpaceAnalyzer which will tokenize words based on
>>      
>>> spaces, you can try writing a simple custom analyzer which'll a bit more.
>>> I
>>> did the following for handling Indic languages intermingled with English
>>> content,
>>>
>>> /**
>>> * Analyzer for Indian language.
>>> */
>>> public class IndicAnalyzerIndex extends Analyzer {
>>>    public TokenStream tokenStream(String fieldName, Reader reader) {
>>>        TokenStream ts = new WhitespaceTokenizer(reader);
>>>        /**
>>>
>>>        
>>
>> -- Ken
>>
>> PS - latest Lucene that's released is 3.0.2, not 2.4.0 (what you reference
>> below)
>>
>>
>> On Oct 3, 2010, at 12:10am, Neil Ghosh wrote:
>>
>>   Ted , I think this is the latest tokenizer.
>>      
>>>
>>> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/Token.html
>>>
>>> Do you have any suggestion , how do I see the intermediate tokens
>>> generated
>>> ? So that I can verify with Hindi text as string ?
>>>
>>> Thanks
>>> Neil
>>>
>>> On Sun, Oct 3, 2010 at 10:20 AM, Ted Dunning<te...@gmail.com>
>>> wrote:
>>>
>>>   Hindi should be pretty good to go with the default Lucene analyzer.  You
>>>        
>>>> should look at the
>>>> tokens to be sure they are reasonable.  Punctuation and some other work
>>>> breaking characters
>>>> in Hindi may not be handled well, but if the first five sentences work
>>>> well,
>>>> you should be OK.
>>>>
>>>> On Sat, Oct 2, 2010 at 9:31 PM, Bhaskar Ghosh<bj...@yahoo.co.in>
>>>> wrote:
>>>>
>>>>   Hi Ted,
>>>>          
>>>>> I need to tokenize Hindi, an Indian language. I learnt from Robin
>>>>> earlier
>>>>> that
>>>>> "Classifier supports non english tokens(its assumes string is Utf8
>>>>> encoded)",
>>>>> Does that mean that the Classifier would just tokenize based on unicode
>>>>> encoding, so that we do not need to worry about the language? Or, we do
>>>>> need to
>>>>> make some configurations?
>>>>>
>>>>> I do not have a knowledge of factors that makes a language harder to
>>>>> tokenize.
>>>>> But, I have learnt from earlier conversations in this mailing list, that
>>>>> languages in which a word is represented as multi-worded (sequence of
>>>>> words),
>>>>> are hard to tokenize. In that sense, I can assume that words in Hindi
>>>>>
>>>>>            
>>>> would
>>>>
>>>>          
>>>>> be
>>>>> single words.
>>>>>
>>>>> Thanks
>>>>> Bhaskar Ghosh
>>>>> Hyderabad, India
>>>>>
>>>>> http://www.google.com/profiles/bjgindia
>>>>>
>>>>> "Ignorance is Bliss... Knowledge never brings Peace!!!"
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Ted Dunning<te...@gmail.com>
>>>>> To: user@mahout.apache.org
>>>>> Sent: Sun, 3 October, 2010 12:53:37 AM
>>>>> Subject: Re: How to get multi-language support for training/classifying
>>>>> text
>>>>> into classes through Mahout?
>>>>>
>>>>> You will need to make sure that the tokenization is done reasonable.
>>>>>
>>>>> There is an example program for a sequential classifier in
>>>>> org.apache.mahout.classifiers.sgd.TrainNewsGroups
>>>>>
>>>>> It assumes data in the 20 news groups format and uses a Lucene
>>>>> tokenizer.
>>>>>
>>>>> The NaiveBayes code also uses a Lucene tokenizer that you can specify on
>>>>> the
>>>>> command line.
>>>>>
>>>>> Can you say which languages?  Are they easy to tokenize (like French)?
>>>>>
>>>>>            
>>>> Or
>>>>
>>>>          
>>>>> medium (like German/Turkish)?
>>>>> Or hard (like Chinese/Japanese)?
>>>>>
>>>>> Can you say how much data?
>>>>>
>>>>> On Sat, Oct 2, 2010 at 8:46 AM, Bhaskar Ghosh<bj...@yahoo.co.in>
>>>>> wrote:
>>>>>
>>>>>   Dear All,
>>>>>            
>>>>>> I have a requirement where I need to classify text in a non-English
>>>>>> language. I
>>>>>> have heard that Mahout supports multi-language. Can anyone please tell
>>>>>>
>>>>>>              
>>>>> me
>>>>>            
>>>>          
>>>>> how do
>>>>>            
>>>>>> I achieve this? Some documents/links where I can get some examples on
>>>>>>
>>>>>>              
>>>>> this,
>>>>>
>>>>>            
>>>>>> would be really really helpful.
>>>>>> Regards
>>>>>> Bhaskar Ghosh
>>>>>> Hyderabad, India
>>>>>>
>>>>>> http://www.google.com/profiles/bjgindia
>>>>>>
>>>>>> "Ignorance is Bliss... Knowledge never brings Peace!!!"
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>              
>>>>>
>>>>>
>>>>>            
>>>>          
>>>
>>> --
>>> Thanks and Regards
>>> Neil
>>> http://neilghosh.com
>>>
>>>        
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> e l a s t i c   w e b   m i n i n g
>>
>>
>>
>>
>>
>>
>>      
>    

Re: How to get multi-language support for training/classifying text into classes through Mahout?

Posted by Ted Dunning <te...@gmail.com>.
Yes.  Post to the Lucene list and if you get an answer from Robert Muir,
listen especially carefully.

To answer the question, this code snippet could be adapted/corrected to
print out tokens in your data.  (don't
assume it works as it stands!)

    // Guava's Files.readLines requires a Charset; capture the attribute once, outside the loop.
    for (String line : Files.readLines(new File("my/file/here"), Charsets.UTF_8)) {
       TokenStream ts = analyzer.tokenStream("text", new StringReader(line));
       TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
       while (ts.incrementToken()) {
          words.add(termAtt.term());
       }
    }


On Sun, Oct 3, 2010 at 9:34 AM, Ken Krugler <kk...@transpac.com>wrote:

> Hi Neil,
>
> That's really a Lucene question, not something for Mahout.
>
> If you post to the Lucene list, you're also likely get some useful feedback
> from the community about whether there are issues with tokenizing Hindi.
>
> E.g. there was an email from last summer about this same topic. Snippet is:
>
>  Apart from using WhiteSpaceAnalyzer which will tokenize words based on
>> spaces, you can try writing a simple custom analyzer which'll a bit more.
>> I
>> did the following for handling Indic languages intermingled with English
>> content,
>>
>> /**
>> * Analyzer for Indian language.
>> */
>> public class IndicAnalyzerIndex extends Analyzer {
>>   public TokenStream tokenStream(String fieldName, Reader reader) {
>>       TokenStream ts = new WhitespaceTokenizer(reader);
>>       /**
>>
>
>
> -- Ken
>
> PS - latest Lucene that's released is 3.0.2, not 2.4.0 (what you reference
> below)
>
>
> On Oct 3, 2010, at 12:10am, Neil Ghosh wrote:
>
>  Ted , I think this is the latest tokenizer.
>>
>>
>> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/Token.html
>>
>> Do you have any suggestion , how do I see the intermediate tokens
>> generated
>> ? So that I can verify with Hindi text as string ?
>>
>> Thanks
>> Neil
>>
>> On Sun, Oct 3, 2010 at 10:20 AM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>>  Hindi should be pretty good to go with the default Lucene analyzer.  You
>>> should look at the
>>> tokens to be sure they are reasonable.  Punctuation and some other work
>>> breaking characters
>>> in Hindi may not be handled well, but if the first five sentences work
>>> well,
>>> you should be OK.
>>>
>>> On Sat, Oct 2, 2010 at 9:31 PM, Bhaskar Ghosh <bj...@yahoo.co.in>
>>> wrote:
>>>
>>>  Hi Ted,
>>>>
>>>> I need to tokenize Hindi, an Indian language. I learnt from Robin
>>>> earlier
>>>> that
>>>> "Classifier supports non english tokens(its assumes string is Utf8
>>>> encoded)",
>>>> Does that mean that the Classifier would just tokenize based on unicode
>>>> encoding, so that we do not need to worry about the language? Or, we do
>>>> need to
>>>> make some configurations?
>>>>
>>>> I do not have a knowledge of factors that makes a language harder to
>>>> tokenize.
>>>> But, I have learnt from earlier conversations in this mailing list, that
>>>> languages in which a word is represented as multi-worded (sequence of
>>>> words),
>>>> are hard to tokenize. In that sense, I can assume that words in Hindi
>>>>
>>> would
>>>
>>>> be
>>>> single words.
>>>>
>>>> Thanks
>>>> Bhaskar Ghosh
>>>> Hyderabad, India
>>>>
>>>> http://www.google.com/profiles/bjgindia
>>>>
>>>> "Ignorance is Bliss... Knowledge never brings Peace!!!"
>>>>
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Ted Dunning <te...@gmail.com>
>>>> To: user@mahout.apache.org
>>>> Sent: Sun, 3 October, 2010 12:53:37 AM
>>>> Subject: Re: How to get multi-language support for training/classifying
>>>> text
>>>> into classes through Mahout?
>>>>
>>>> You will need to make sure that the tokenization is done reasonable.
>>>>
>>>> There is an example program for a sequential classifier in
>>>> org.apache.mahout.classifiers.sgd.TrainNewsGroups
>>>>
>>>> It assumes data in the 20 news groups format and uses a Lucene
>>>> tokenizer.
>>>>
>>>> The NaiveBayes code also uses a Lucene tokenizer that you can specify on
>>>> the
>>>> command line.
>>>>
>>>> Can you say which languages?  Are they easy to tokenize (like French)?
>>>>
>>> Or
>>>
>>>> medium (like German/Turkish)?
>>>> Or hard (like Chinese/Japanese)?
>>>>
>>>> Can you say how much data?
>>>>
>>>> On Sat, Oct 2, 2010 at 8:46 AM, Bhaskar Ghosh <bj...@yahoo.co.in>
>>>> wrote:
>>>>
>>>>  Dear All,
>>>>>
>>>>> I have a requirement where I need to classify text in a non-English
>>>>> language. I
>>>>> have heard that Mahout supports multi-language. Can anyone please tell
>>>>>
>>>> me
>>>
>>>> how do
>>>>> I achieve this? Some documents/links where I can get some examples on
>>>>>
>>>> this,
>>>>
>>>>> would be really really helpful.
>>>>> Regards
>>>>> Bhaskar Ghosh
>>>>> Hyderabad, India
>>>>>
>>>>> http://www.google.com/profiles/bjgindia
>>>>>
>>>>> "Ignorance is Bliss... Knowledge never brings Peace!!!"
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Thanks and Regards
>> Neil
>> http://neilghosh.com
>>
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>
>

Re: How to get multi-language support for training/classifying text into classes through Mahout?

Posted by Ken Krugler <kk...@transpac.com>.
Hi Neil,

That's really a Lucene question, not something for Mahout.

If you post to the Lucene list, you're also likely to get some useful  
feedback from the community about whether there are issues with  
tokenizing Hindi.

E.g. there was an email from last summer about this same topic.  
Snippet is:

> Apart from using WhiteSpaceAnalyzer which will tokenize words based on
> spaces, you can try writing a simple custom analyzer which'll do a bit  
> more. I
> did the following for handling Indic languages intermingled with  
> English
> content,
>
> /**
> * Analyzer for Indian language.
> */
> public class IndicAnalyzerIndex extends Analyzer {
>    public TokenStream tokenStream(String fieldName, Reader reader) {
>        TokenStream ts = new WhitespaceTokenizer(reader);
>        /**


-- Ken

PS - latest Lucene that's released is 3.0.2, not 2.4.0 (what you  
reference below)

On Oct 3, 2010, at 12:10am, Neil Ghosh wrote:

> Ted , I think this is the latest tokenizer.
>
> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/Token.html
>
> Do you have any suggestion , how do I see the intermediate tokens  
> generated
> ? So that I can verify with Hindi text as string ?
>
> Thanks
> Neil
>
> On Sun, Oct 3, 2010 at 10:20 AM, Ted Dunning <te...@gmail.com>  
> wrote:
>
>> Hindi should be pretty good to go with the default Lucene  
>> analyzer.  You
>> should look at the
>> tokens to be sure they are reasonable.  Punctuation and some other  
>> work
>> breaking characters
>> in Hindi may not be handled well, but if the first five sentences  
>> work
>> well,
>> you should be OK.
>>
>> On Sat, Oct 2, 2010 at 9:31 PM, Bhaskar Ghosh <bj...@yahoo.co.in>
>> wrote:
>>
>>> Hi Ted,
>>>
>>> I need to tokenize Hindi, an Indian language. I learnt from Robin  
>>> earlier
>>> that
>>> "Classifier supports non english tokens(its assumes string is Utf8
>>> encoded)",
>>> Does that mean that the Classifier would just tokenize based on  
>>> unicode
>>> encoding, so that we do not need to worry about the language? Or,  
>>> we do
>>> need to
>>> make some configurations?
>>>
>>> I do not have a knowledge of factors that makes a language harder to
>>> tokenize.
>>> But, I have learnt from earlier conversations in this mailing  
>>> list, that
>>> languages in which a word is represented as multi-worded (sequence  
>>> of
>>> words),
>>> are hard to tokenize. In that sense, I can assume that words in  
>>> Hindi
>> would
>>> be
>>> single words.
>>>
>>> Thanks
>>> Bhaskar Ghosh
>>> Hyderabad, India
>>>
>>> http://www.google.com/profiles/bjgindia
>>>
>>> "Ignorance is Bliss... Knowledge never brings Peace!!!"
>>>
>>>
>>>
>>>
>>> ________________________________
>>> From: Ted Dunning <te...@gmail.com>
>>> To: user@mahout.apache.org
>>> Sent: Sun, 3 October, 2010 12:53:37 AM
>>> Subject: Re: How to get multi-language support for training/ 
>>> classifying
>>> text
>>> into classes through Mahout?
>>>
>>> You will need to make sure that the tokenization is done reasonable.
>>>
>>> There is an example program for a sequential classifier in
>>> org.apache.mahout.classifiers.sgd.TrainNewsGroups
>>>
>>> It assumes data in the 20 news groups format and uses a Lucene  
>>> tokenizer.
>>>
>>> The NaiveBayes code also uses a Lucene tokenizer that you can  
>>> specify on
>>> the
>>> command line.
>>>
>>> Can you say which languages?  Are they easy to tokenize (like  
>>> French)?
>> Or
>>> medium (like German/Turkish)?
>>> Or hard (like Chinese/Japanese)?
>>>
>>> Can you say how much data?
>>>
>>> On Sat, Oct 2, 2010 at 8:46 AM, Bhaskar Ghosh <bj...@yahoo.co.in>
>>> wrote:
>>>
>>>> Dear All,
>>>>
>>>> I have a requirement where I need to classify text in a non-English
>>>> language. I
>>>> have heard that Mahout supports multi-language. Can anyone please  
>>>> tell
>> me
>>>> how do
>>>> I achieve this? Some documents/links where I can get some  
>>>> examples on
>>> this,
>>>> would be really really helpful.
>>>> Regards
>>>> Bhaskar Ghosh
>>>> Hyderabad, India
>>>>
>>>> http://www.google.com/profiles/bjgindia
>>>>
>>>> "Ignorance is Bliss... Knowledge never brings Peace!!!"
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>
> -- 
> Thanks and Regards
> Neil
> http://neilghosh.com

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: How to get multi-language support for training/classifying text into classes through Mahout?

Posted by Neil Ghosh <ne...@gmail.com>.
Ted, I think this is the latest tokenizer.

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/Token.html

Do you have any suggestions on how I can see the intermediate tokens
generated, so that I can verify them with Hindi text as input?

Thanks
Neil

On Sun, Oct 3, 2010 at 10:20 AM, Ted Dunning <te...@gmail.com> wrote:

> Hindi should be pretty good to go with the default Lucene analyzer.  You
> should look at the
> tokens to be sure they are reasonable.  Punctuation and some other work
> breaking characters
> in Hindi may not be handled well, but if the first five sentences work
> well,
> you should be OK.
>
> On Sat, Oct 2, 2010 at 9:31 PM, Bhaskar Ghosh <bj...@yahoo.co.in>
> wrote:
>
> > Hi Ted,
> >
> > I need to tokenize Hindi, an Indian language. I learnt from Robin earlier
> > that
> > "Classifier supports non english tokens(its assumes string is Utf8
> > encoded)",
> > Does that mean that the Classifier would just tokenize based on unicode
> > encoding, so that we do not need to worry about the language? Or, we do
> > need to
> > make some configurations?
> >
> > I do not have a knowledge of factors that makes a language harder to
> > tokenize.
> > But, I have learnt from earlier conversations in this mailing list, that
> > languages in which a word is represented as multi-worded (sequence of
> > words),
> > are hard to tokenize. In that sense, I can assume that words in Hindi
> would
> > be
> > single words.
> >
> >  Thanks
> > Bhaskar Ghosh
> > Hyderabad, India
> >
> > http://www.google.com/profiles/bjgindia
> >
> > "Ignorance is Bliss... Knowledge never brings Peace!!!"
> >
> >
> >
> >
> > ________________________________
> > From: Ted Dunning <te...@gmail.com>
> > To: user@mahout.apache.org
> > Sent: Sun, 3 October, 2010 12:53:37 AM
> > Subject: Re: How to get multi-language support for training/classifying
> > text
> > into classes through Mahout?
> >
> > You will need to make sure that the tokenization is done reasonable.
> >
> > There is an example program for a sequential classifier in
> > org.apache.mahout.classifiers.sgd.TrainNewsGroups
> >
> > It assumes data in the 20 news groups format and uses a Lucene tokenizer.
> >
> > The NaiveBayes code also uses a Lucene tokenizer that you can specify on
> > the
> > command line.
> >
> > Can you say which languages?  Are they easy to tokenize (like French)?
>  Or
> > medium (like German/Turkish)?
> > Or hard (like Chinese/Japanese)?
> >
> > Can you say how much data?
> >
> > On Sat, Oct 2, 2010 at 8:46 AM, Bhaskar Ghosh <bj...@yahoo.co.in>
> > wrote:
> >
> > > Dear All,
> > >
> > > I have a requirement where I need to classify text in a non-English
> > > language. I
> > > have heard that Mahout supports multi-language. Can anyone please tell
> me
> > > how do
> > > I achieve this? Some documents/links where I can get some examples on
> > this,
> > > would be really really helpful.
> > >  Regards
> > > Bhaskar Ghosh
> > > Hyderabad, India
> > >
> > > http://www.google.com/profiles/bjgindia
> > >
> > > "Ignorance is Bliss... Knowledge never brings Peace!!!"
> > >
> > >
> > >
> >
> >
> >
>



-- 
Thanks and Regards
Neil
http://neilghosh.com

Re: How to get multi-language support for training/classifying text into classes through Mahout?

Posted by Bhaskar Ghosh <bj...@yahoo.co.in>.
Thanks a lot, Ted. I will try it.
 Regards
Bhaskar Ghosh
Hyderabad, India

http://www.google.com/profiles/bjgindia

"Ignorance is Bliss... Knowledge never brings Peace!!!"




________________________________
From: Ted Dunning <te...@gmail.com>
To: user@mahout.apache.org
Sent: Sun, 3 October, 2010 10:20:52 AM
Subject: Re: How to get multi-language support for training/classifying text 
into classes through Mahout?

Hindi should be pretty good to go with the default Lucene analyzer.  You
should look at the
tokens to be sure they are reasonable.  Punctuation and some other work
breaking characters
in Hindi may not be handled well, but if the first five sentences work well,
you should be OK.

On Sat, Oct 2, 2010 at 9:31 PM, Bhaskar Ghosh <bj...@yahoo.co.in> wrote:

> Hi Ted,
>
> I need to tokenize Hindi, an Indian language. I learnt from Robin earlier
> that
> "Classifier supports non english tokens(its assumes string is Utf8
> encoded)",
> Does that mean that the Classifier would just tokenize based on unicode
> encoding, so that we do not need to worry about the language? Or, we do
> need to
> make some configurations?
>
> I do not have a knowledge of factors that makes a language harder to
> tokenize.
> But, I have learnt from earlier conversations in this mailing list, that
> languages in which a word is represented as multi-worded (sequence of
> words),
> are hard to tokenize. In that sense, I can assume that words in Hindi would
> be
> single words.
>
>  Thanks
> Bhaskar Ghosh
> Hyderabad, India
>
> http://www.google.com/profiles/bjgindia
>
> "Ignorance is Bliss... Knowledge never brings Peace!!!"
>
>
>
>
> ________________________________
> From: Ted Dunning <te...@gmail.com>
> To: user@mahout.apache.org
> Sent: Sun, 3 October, 2010 12:53:37 AM
> Subject: Re: How to get multi-language support for training/classifying
> text
> into classes through Mahout?
>
> You will need to make sure that the tokenization is done reasonable.
>
> There is an example program for a sequential classifier in
> org.apache.mahout.classifiers.sgd.TrainNewsGroups
>
> It assumes data in the 20 news groups format and uses a Lucene tokenizer.
>
> The NaiveBayes code also uses a Lucene tokenizer that you can specify on
> the
> command line.
>
> Can you say which languages?  Are they easy to tokenize (like French)?  Or
> medium (like German/Turkish)?
> Or hard (like Chinese/Japanese)?
>
> Can you say how much data?
>
> On Sat, Oct 2, 2010 at 8:46 AM, Bhaskar Ghosh <bj...@yahoo.co.in>
> wrote:
>
> > Dear All,
> >
> > I have a requirement where I need to classify text in a non-English
> > language. I
> > have heard that Mahout supports multi-language. Can anyone please tell me
> > how do
> > I achieve this? Some documents/links where I can get some examples on
> this,
> > would be really really helpful.
> >  Regards
> > Bhaskar Ghosh
> > Hyderabad, India
> >
> > http://www.google.com/profiles/bjgindia
> >
> > "Ignorance is Bliss... Knowledge never brings Peace!!!"
> >
> >
> >
>
>
>



Re: How to get multi-language support for training/classifying text into classes through Mahout?

Posted by Ted Dunning <te...@gmail.com>.
Hindi should be pretty good to go with the default Lucene analyzer.  You
should look at the tokens to be sure they are reasonable.  Punctuation and
some other word-breaking characters in Hindi may not be handled well, but if
the first five sentences work well, you should be OK.
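One cheap way to do that sanity check, sketched here with the JDK's own java.text.BreakIterator rather than Lucene (so this illustrates inspecting word breaks, not Mahout's actual pipeline):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreaks {
    // Collect the word tokens BreakIterator finds, skipping
    // whitespace and punctuation runs between them.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // Keep only pieces that begin with a letter (drops spaces, danda, etc.).
            if (!piece.trim().isEmpty() && Character.isLetter(piece.codePointAt(0))) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A short Hindi sentence; eyeball the breaks for the first few sentences of your data.
        System.out.println(words("यह हिन्दी है।", new Locale("hi", "IN")));
    }
}
```

If the breaks BreakIterator produces look wrong on your data, the Lucene analyzer's output deserves the same scrutiny.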

On Sat, Oct 2, 2010 at 9:31 PM, Bhaskar Ghosh <bj...@yahoo.co.in> wrote:

> Hi Ted,
>
> I need to tokenize Hindi, an Indian language. I learnt from Robin earlier
> that
> "Classifier supports non english tokens(its assumes string is Utf8
> encoded)",
> Does that mean that the Classifier would just tokenize based on unicode
> encoding, so that we do not need to worry about the language? Or, we do
> need to
> make some configurations?
>
> I do not have a knowledge of factors that makes a language harder to
> tokenize.
> But, I have learnt from earlier conversations in this mailing list, that
> languages in which a word is represented as multi-worded (sequence of
> words),
> are hard to tokenize. In that sense, I can assume that words in Hindi would
> be
> single words.
>
>  Thanks
> Bhaskar Ghosh
> Hyderabad, India
>
> http://www.google.com/profiles/bjgindia
>
> "Ignorance is Bliss... Knowledge never brings Peace!!!"
>
>
>
>
> ________________________________
> From: Ted Dunning <te...@gmail.com>
> To: user@mahout.apache.org
> Sent: Sun, 3 October, 2010 12:53:37 AM
> Subject: Re: How to get multi-language support for training/classifying
> text
> into classes through Mahout?
>
> You will need to make sure that the tokenization is done reasonable.
>
> There is an example program for a sequential classifier in
> org.apache.mahout.classifiers.sgd.TrainNewsGroups
>
> It assumes data in the 20 news groups format and uses a Lucene tokenizer.
>
> The NaiveBayes code also uses a Lucene tokenizer that you can specify on
> the
> command line.
>
> Can you say which languages?  Are they easy to tokenize (like French)?  Or
> medium (like German/Turkish)?
> Or hard (like Chinese/Japanese)?
>
> Can you say how much data?
>
> On Sat, Oct 2, 2010 at 8:46 AM, Bhaskar Ghosh <bj...@yahoo.co.in>
> wrote:
>
> > Dear All,
> >
> > I have a requirement where I need to classify text in a non-English
> > language. I
> > have heard that Mahout supports multi-language. Can anyone please tell me
> > how do
> > I achieve this? Some documents/links where I can get some examples on
> this,
> > would be really really helpful.
> >  Regards
> > Bhaskar Ghosh
> > Hyderabad, India
> >
> > http://www.google.com/profiles/bjgindia
> >
> > "Ignorance is Bliss... Knowledge never brings Peace!!!"
> >
> >
> >
>
>
>

Re: How to get multi-language support for training/classifying text into classes through Mahout?

Posted by Bhaskar Ghosh <bj...@yahoo.co.in>.
Hi Ted,

I need to tokenize Hindi, an Indian language. I learnt from Robin earlier that 
"Classifier supports non english tokens(its assumes string is Utf8 encoded)".
Does that mean that the classifier would just tokenize based on Unicode 
encoding, so that we do not need to worry about the language? Or do we need to 
make some configuration?

I do not know the factors that make a language harder to tokenize. 
But I have learnt from earlier conversations on this mailing list that 
languages in which a word is represented as a sequence of words 
are hard to tokenize. In that sense, I can assume that words in Hindi would be 
single words.

 Thanks
Bhaskar Ghosh
Hyderabad, India

http://www.google.com/profiles/bjgindia

"Ignorance is Bliss... Knowledge never brings Peace!!!"




________________________________
From: Ted Dunning <te...@gmail.com>
To: user@mahout.apache.org
Sent: Sun, 3 October, 2010 12:53:37 AM
Subject: Re: How to get multi-language support for training/classifying text 
into classes through Mahout?

You will need to make sure that the tokenization is done reasonable.

There is an example program for a sequential classifier in
org.apache.mahout.classifiers.sgd.TrainNewsGroups

It assumes data in the 20 news groups format and uses a Lucene tokenizer.

The NaiveBayes code also uses a Lucene tokenizer that you can specify on the
command line.

Can you say which languages?  Are they easy to tokenize (like French)?  Or
medium (like German/Turkish)?
Or hard (like Chinese/Japanese)?

Can you say how much data?

On Sat, Oct 2, 2010 at 8:46 AM, Bhaskar Ghosh <bj...@yahoo.co.in> wrote:

> Dear All,
>
> I have a requirement where I need to classify text in a non-English
> language. I
> have heard that Mahout supports multi-language. Can anyone please tell me
> how do
> I achieve this? Some documents/links where I can get some examples on this,
> would be really really helpful.
>  Regards
> Bhaskar Ghosh
> Hyderabad, India
>
> http://www.google.com/profiles/bjgindia
>
> "Ignorance is Bliss... Knowledge never brings Peace!!!"
>
>
>



Re: How to get multi-language support for training/classifying text into classes through Mahout?

Posted by Ted Dunning <te...@gmail.com>.
You will need to make sure that the tokenization is done reasonably.

There is an example program for a sequential classifier in
org.apache.mahout.classifier.sgd.TrainNewsGroups

It assumes data in the 20 news groups format and uses a Lucene tokenizer.

The NaiveBayes code also uses a Lucene tokenizer that you can specify on the
command line.

Can you say which languages?  Are they easy to tokenize (like French)?  Or
medium (like German/Turkish)?
Or hard (like Chinese/Japanese)?

Can you say how much data?
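To make the easy/hard distinction concrete, here is a tiny, hypothetical illustration (plain Java, unrelated to Mahout's actual tokenizers): a whitespace tokenizer recovers French words directly, while Chinese text, which has no inter-word spaces, collapses into a single token and needs a real segmenter.

```java
public class TokenizeDifficulty {
    // Count tokens produced by the simplest possible tokenizer: split on whitespace.
    static int whitespaceTokenCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        // French: words are space-delimited, so a trivial tokenizer does fine.
        System.out.println(whitespaceTokenCount("je mange une pomme"));  // 4
        // Chinese: no spaces between words, so the same tokenizer sees one "token".
        System.out.println(whitespaceTokenCount("我吃苹果"));            // 1
    }
}
```

German and Turkish sit in the middle: whitespace splitting still works, but compounding and agglutination inflate the vocabulary, which is why they are harder than French yet easier than Chinese or Japanese.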

On Sat, Oct 2, 2010 at 8:46 AM, Bhaskar Ghosh <bj...@yahoo.co.in> wrote:

> Dear All,
>
> I have a requirement where I need to classify text in a non-English
> language. I
> have heard that Mahout supports multi-language. Can anyone please tell me
> how do
> I achieve this? Some documents/links where I can get some examples on this,
> would be really really helpful.
>  Regards
> Bhaskar Ghosh
> Hyderabad, India
>
> http://www.google.com/profiles/bjgindia
>
> "Ignorance is Bliss... Knowledge never brings Peace!!!"
>
>
>