Posted to user@mahout.apache.org by WaleedAzmy <wa...@tayait.com> on 2011/02/17 12:57:06 UTC

Arabic K-mean clustering

Dear All...

I tried to test Mahout k-means clustering on Arabic data, but I think there
is an encoding problem.

I tried the following commands:
=======================

$ ./mahout seqdirectory -i "....\Arabic_data" -o
"....\ArabicTest\Arabic_data-seqdir" -c UTF-8 -chunk 5

$ ./mahout seq2sparse -i "....\ArabicTest\Arabic_data-seqdir" -o
"....\ArabicTest\Arabic_data_out-seqdir"

$ ./mahout kmeans -i "....\ArabicTest\Arabic_data_out-seqdir\tfidf-vectors/"
-c "....\ArabicTest\clusters" -o "....\ArabicTest\arabic-kmeans" -x 10 -k 20
-ow

$ ./mahout clusterdump -s "....\ArabicTest\arabic-kmeans\clusters-1" -d
"....\ArabicTest\Arabic_data_out-seqdir\dictionary.file-0" -dt sequencefile
-b 100 -n 20


clusterdump generates the following output:
===================================

no HADOOP_HOME set, running locally
:VL-1{n=1 c=[24:6.187, 31:5.912, 53:7.643, 69:7.958, 77:8.365, ??:2.260,
?????:5.627, ?????:5.627, ??
	Top Terms: 
		????                                    =>  11.830205917358398
		?????                                   =>  10.808554649353027
		???????                                 =>    8.93863296508789
		?????                                   =>    8.93863296508789
		???????                                 =>    8.93863296508789
		???????                                 =>    8.93863296508789
		77                                      =>   8.365219116210938
		????                                    =>   8.365219116210938
		??????                                  =>   8.365219116210938
		???????????                             =>   8.365219116210938
		69                                      =>   7.958374977111816
		?????                                   =>  7.6428022384643555
		53                                      =>  7.6428022384643555
		???                                     =>  7.6428022384643555
		???                                     =>   7.384960651397705
		?????                                   =>   7.384960651397705
		?????                                   =>   7.166958332061768
		24                                      =>   6.186699867248535
		31                                      =>  5.9121222496032715
		?????                                   =>   5.627420902252197
:VL-104{n=1 c=[??:6.089, ????:5.404, ??????:3.795, ???????:5.915,
??????:7.385, ????????:8.939, ?????
	Top Terms: 
		????????                                =>  12.641136169433594
		??????                                  =>   9.422260284423828
		?????????                               =>    8.93863296508789
		????                                    =>    8.93863296508789


===============================================================
I think the meaningless question marks (?) are an encoding problem. Can anyone
help me with this?
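
For illustration, here is a minimal Java sketch of how output like the above can
arise. It assumes that somewhere in the pipeline the text is encoded with a
charset that cannot represent Arabic letters. Nothing in this thread confirms
where that would happen in Mahout. The class name, the sample term, and the use
of US-ASCII as the lossy charset are all hypothetical stand-ins.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Mahout.
class EncodingDemo {

    // Encode the text to bytes with the given charset, then decode it again.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for every character the charset cannot represent.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Hypothetical sample term: five Arabic letters, written as escapes
        // so the source file itself does not depend on its own encoding.
        String term = "\u0645\u0631\u062D\u0628\u0627";

        // US-ASCII (an assumed stand-in for a lossy charset) cannot
        // represent Arabic, so each letter comes back as '?'.
        String lossy = roundTrip(term, StandardCharsets.US_ASCII);
        System.out.println(lossy.equals("?????"));

        // Plain digits survive, like the numeric terms in the dump.
        System.out.println(roundTrip("77", StandardCharsets.US_ASCII).equals("77"));

        // UTF-8 can represent every character, so the term is unchanged.
        System.out.println(roundTrip(term, StandardCharsets.UTF_8).equals(term));
    }
}
```

The sketch reproduces the pattern in the dump: one "?" per Arabic letter, with
numeric terms left readable. Once a term has been written this way the original
letters cannot be recovered, so the fix has to be applied where the text is
written, not where it is displayed.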

I would also like a tutorial that describes the k-means clustering command and
its options, and explains what the clusterdump output represents.

Thank you....
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-K-mean-clustering-tp2518248p2518248.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: Arabic K-mean clustering

Posted by Lance Norskog <go...@gmail.com>.
Aren't the Hudson build messages about this?

On Fri, Feb 18, 2011 at 9:46 AM, Matthew Runo <ma...@gmail.com> wrote:
> This brings up a question I have:
>
> How often is trunk pushed up to the apache maven snapshot repo?



-- 
Lance Norskog
goksron@gmail.com

Re: Arabic K-mean clustering

Posted by Matthew Runo <ma...@gmail.com>.
This brings up a question I have:

How often is trunk pushed to the Apache Maven snapshot repo?

<repository>
    <snapshots>
        <enabled>true</enabled>
    </snapshots>
    <name>Apache Snapshots</name>
    <id>apache-snapshots</id>
    <url>http://repository.apache.org/snapshots</url>
</repository>

<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout</artifactId>
    <version>0.5-SNAPSHOT</version>
</dependency>

Thanks!

Matthew

On Thu, Feb 17, 2011 at 9:44 PM, Lance Norskog <go...@gmail.com> wrote:
> Waleed: a fix for this was checked in on January 27. Are you using the
> trunk, or the 0.4 release? Most people use the trunk, and they
> generally recommend it. If you're on the trunk, it is time to do an
> update to the latest code.
>
> Lance

Re: Arabic K-mean clustering

Posted by Lance Norskog <go...@gmail.com>.
Waleed: a fix for this was checked in on January 27. Are you using the
trunk, or the 0.4 release? Most people use the trunk, and they
generally recommend it. If you're on the trunk, it is time to do an
update to the latest code.

Lance

On Thu, Feb 17, 2011 at 3:16 PM, Shige Takeda <sm...@gmail.com> wrote:
> hi, I believe the following bug already addressed the issue:
> https://issues.apache.org/jira/browse/MAHOUT-594
>
> Thanks, -- Shige



-- 
Lance Norskog
goksron@gmail.com

Re: Arabic K-mean clustering

Posted by Shige Takeda <sm...@gmail.com>.
Hi, I believe the following bug report already addresses this issue:
https://issues.apache.org/jira/browse/MAHOUT-594

Thanks, -- Shige
