Posted to user@mahout.apache.org by WaleedAzmy <wa...@tayait.com> on 2011/02/17 12:57:06 UTC
Arabic K-mean clustering
Dear all,
I tried to test Mahout k-means clustering on Arabic data, but I think there
is an encoding problem.
I ran the following commands:
=======================
$ ./mahout seqdirectory -i "....\Arabic_data" -o
"....\ArabicTest\Arabic_data-seqdir" -c UTF-8 -chunk 5
$ ./mahout seq2sparse -i "....\ArabicTest\Arabic_data-seqdir" -o
"....\ArabicTest\Arabic_data_out-seqdir"
$ ./mahout kmeans -i "....\ArabicTest\Arabic_data_out-seqdir\tfidf-vectors/"
-c "....\ArabicTest\clusters" -o "....\ArabicTest\arabic-kmeans" -x 10 -k 20
-ow
$ ./mahout clusterdump -s "....\ArabicTest\arabic-kmeans\clusters-1" -d
"....\ArabicTest\Arabic_data_out-seqdir\dictionary.file-0" -dt sequencefile
-b 100 -n 20
clusterdump generates the following output:
===================================
no HADOOP_HOME set, running locally
:VL-1{n=1 c=[24:6.187, 31:5.912, 53:7.643, 69:7.958, 77:8.365, ??:2.260,
?????:5.627, ?????:5.627, ??
Top Terms:
???? => 11.830205917358398
????? => 10.808554649353027
??????? => 8.93863296508789
????? => 8.93863296508789
??????? => 8.93863296508789
??????? => 8.93863296508789
77 => 8.365219116210938
???? => 8.365219116210938
?????? => 8.365219116210938
??????????? => 8.365219116210938
69 => 7.958374977111816
????? => 7.6428022384643555
53 => 7.6428022384643555
??? => 7.6428022384643555
??? => 7.384960651397705
????? => 7.384960651397705
????? => 7.166958332061768
24 => 6.186699867248535
31 => 5.9121222496032715
????? => 5.627420902252197
:VL-104{n=1 c=[??:6.089, ????:5.404, ??????:3.795, ???????:5.915,
??????:7.385, ????????:8.939, ?????
Top Terms:
???????? => 12.641136169433594
?????? => 9.422260284423828
????????? => 8.93863296508789
???? => 8.93863296508789
===============================================================
I think the meaningless question marks (?) are caused by an encoding problem.
Can anyone help me with this?
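One way to narrow down a symptom like this is to first confirm that the input files really are valid UTF-8, so that the problem can be pinned on the output side rather than on the data. The sketch below is only an illustration and not part of Mahout: the helper names is_utf8 and report_non_utf8 are hypothetical, the directory argument is a placeholder for your own input path, and it assumes a Unix-like shell with iconv available.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
# iconv exits non-zero when it meets a byte sequence that is not valid UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Hypothetical helper: print the name of every regular file in a directory
# that fails the UTF-8 check above. "$1" is a placeholder for the input dir.
report_non_utf8() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        is_utf8 "$f" || echo "not UTF-8: $f"
    done
}
```

If every file passes, the input is probably not at fault. In that case it may be worth checking the JVM default charset, for example by setting the standard JVM environment variable JAVA_TOOL_OPTIONS to "-Dfile.encoding=UTF-8" before running clusterdump, since a Java program that writes through a default charset that cannot represent Arabic will substitute ? for each such character. Whether that applies here is an assumption, not something the output above proves.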
I would also like a tutorial that describes the k-means clustering command
and its options, and explains what the clusterdump output represents.
Thank you.
--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-K-mean-clustering-tp2518248p2518248.html
Sent from the Mahout User List mailing list archive at Nabble.com.
Re: Arabic K-mean clustering
Posted by Lance Norskog <go...@gmail.com>.
Aren't the Hudson build messages about this?
On Fri, Feb 18, 2011 at 9:46 AM, Matthew Runo <ma...@gmail.com> wrote:
> This brings up a question I have:
>
> How often is trunk pushed up to the apache maven snapshot repo?
>
> <repository>
>   <snapshots>
>     <enabled>true</enabled>
>   </snapshots>
>   <name>Apache Snapshots</name>
>   <id>apache-snapshots</id>
>   <url>http://repository.apache.org/snapshots</url>
> </repository>
>
>
> <dependency>
>   <groupId>org.apache.mahout</groupId>
>   <artifactId>mahout</artifactId>
>   <version>0.5-SNAPSHOT</version>
> </dependency>
>
> Thanks!
>
> Matthew
--
Lance Norskog
goksron@gmail.com
Re: Arabic K-mean clustering
Posted by Matthew Runo <ma...@gmail.com>.
This brings up a question I have:
How often is trunk pushed up to the Apache Maven snapshot repo?
<repository>
  <snapshots>
    <enabled>true</enabled>
  </snapshots>
  <name>Apache Snapshots</name>
  <id>apache-snapshots</id>
  <url>http://repository.apache.org/snapshots</url>
</repository>
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout</artifactId>
  <version>0.5-SNAPSHOT</version>
</dependency>
Thanks!
Matthew
On Thu, Feb 17, 2011 at 9:44 PM, Lance Norskog <go...@gmail.com> wrote:
> Waleed: a fix for this was checked in on January 27. Are you using the
> trunk, or the 0.4 release? Most people use the trunk, and they
> generally recommend it. If you're on the trunk, it is time to do an
> update to the latest code.
>
> Lance
Re: Arabic K-mean clustering
Posted by Lance Norskog <go...@gmail.com>.
Waleed: a fix for this was checked in on January 27. Are you using the
trunk, or the 0.4 release? Most people use the trunk, and they
generally recommend it. If you're on the trunk, it is time to do an
update to the latest code.
Lance
On Thu, Feb 17, 2011 at 3:16 PM, Shige Takeda <sm...@gmail.com> wrote:
> Hi, I believe this was already addressed by the following issue:
> https://issues.apache.org/jira/browse/MAHOUT-594
>
> Thanks, -- Shige
--
Lance Norskog
goksron@gmail.com
Re: Arabic K-mean clustering
Posted by Shige Takeda <sm...@gmail.com>.
Hi, I believe this was already addressed by the following issue:
https://issues.apache.org/jira/browse/MAHOUT-594
Thanks, -- Shige