
Posted to user@mahout.apache.org by 熊田聖也 <se...@cct-inc.co.jp> on 2015/07/14 10:20:35 UTC

how to interpret the result of the clustering by “mahout kmeans”

Nice to meet you.

This is my first question on the Mahout mailing list.


I am currently running a clustering job with “mahout kmeans.”

My data is as follows:


@RELATION rfm

@ATTRIBUTE recency NUMERIC

@ATTRIBUTE frequency NUMERIC

@ATTRIBUTE money NUMERIC

@ATTRIBUTE location NUMERIC

@ATTRIBUTE position NUMERIC

@DATA

0.472,0.275,0.099,0.952,0.047,

0.000,0.824,0.936,0.214,0.000,

0.000,0.537,0.656,0.591,0.000,

....

0.908,0.000,0.000,0.078,0.136,

0.134,0.000,0.000,0.781,0.160,

0.302,0.000,0.000,0.513,0.715,

0.472,0.000,0.000,0.749,0.047,


The file is in ARFF format.

Each row is a 5-dimensional vector, and most rows contain zero values.

I converted the ARFF file to Mahout's vector format for use with "mahout kmeans."

The resulting file is as follows:


Key: 0: Value: {0:0.472,1:0.275,2:0.099,3:0.952,4:0.047}

Key: 1: Value: {1:0.824,2:0.936,3:0.214}

Key: 2: Value: {1:0.537,2:0.656,3:0.591}

Key: 3: Value: {1:0.954,2:0.253,3:0.721}

Key: 4: Value: {1:0.187,2:0.735,3:0.782}

Key: 5: Value: {1:0.517,2:0.276,3:0.096}

Key: 6: Value: {1:0.189,2:0.127,3:0.517}

...

Key: 993: Value: {0:0.662,3:0.218,4:0.69}

Key: 994: Value: {0:0.56,3:0.682,4:0.153}

Key: 995: Value: {0:0.788,3:0.929,4:0.967}

Key: 996: Value: {0:0.908,3:0.078,4:0.136}

Key: 997: Value: {0:0.134,3:0.781,4:0.16}

Key: 998: Value: {0:0.302,3:0.513,4:0.715}

Key: 999: Value: {0:0.472,3:0.749,4:0.047}


In the above output, each vector is written in a dictionary (index:value) format, e.g.

{0:0.472,1:0.275,2:0.099,3:0.952,4:0.047}.
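To make the dictionary format concrete, here is a minimal Python sketch that expands one such entry into a dense row. The function name `to_dense` and the `cardinality` parameter are only illustrative and are not part of Mahout; the length of 5 comes from the five ARFF attributes above.

```python
def to_dense(sparse_text, cardinality):
    """Expand '{index:value,...}' into a dense list of length `cardinality`."""
    dense = [0.0] * cardinality
    # Drop surrounding whitespace and the enclosing braces.
    body = sparse_text.strip().strip("{}")
    if body:
        for pair in body.split(","):
            index, value = pair.split(":")
            dense[int(index)] = float(value)
    return dense


row = to_dense("{1:0.824,2:0.936,3:0.214}", 5)
print(row)  # [0.0, 0.824, 0.936, 0.214, 0.0]
```

Indices that do not appear in the entry are taken to be zero, which is how the second ARFF row (0.000,0.824,0.936,0.214,0.000) becomes {1:0.824,2:0.936,3:0.214}.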


Using this file, I ran "mahout kmeans."

(The Mahout version is 0.9.)

After the run, I executed “mahout clusterdump”

and got the following result:


VL-648{n=172 c=[0.733, 0.608, 0.563] r=[0.168, 0.221, 0.235]}

VL-677{n=57 c=[0.445, 0.145, 0.839] r=[0.271, 0.099, 0.097]}

VL-429{n=40 c=[0.117, 0.768, 0.674] r=[0.078, 0.156, 0.159]}

VL-801{n=92 c=[0.318, 0.016, 0.007, 0.810, 0.191] r=[0.238, 0.060, 0.023, 0.137, 0.155]}

VL-322{n=55 c=[0.605, 0.872, 0.380] r=[0.217, 0.083, 0.204]}

VL-725{n=88 c=[0.351, 0.559, 0.760] r=[0.197, 0.206, 0.153]}

VL-197{n=176 c=[0.500, 0.482, 0.774] r=[0.264, 0.260, 0.141]}

VL-438{n=159 c=[0.618, 0.351, 0.288] r=[0.215, 0.203, 0.163]}

VL-58{n=54 c=[0.157, 0.515, 0.211] r=[0.102, 0.229, 0.143]}

VL-971{n=117 c=[0.339, 0.014, 0.007, 0.195, 0.282] r=[0.252, 0.052, 0.025, 0.133, 0.192]}


On the other hand, when the same calculation is run with Mahout 0.7, the result is as follows:


VL-982{n=82 c=[0.124, 0.120, 0.108, 0.168, 0.150] r=[0.140, 0.177, 0.157, 0.115, 0.168]}

VL-989{n=72 c=[0:0.687, 3:0.185, 4:0.463] r=[0:0.145, 3:0.122, 4:0.207]}

VL-990{n=25 c=[0:0.808, 3:0.868, 4:0.320] r=[0:0.130, 3:0.103, 4:0.158]}

VL-992{n=45 c=[0:0.276, 3:0.821, 4:0.753] r=[0:0.135, 3:0.104, 4:0.165]}

VL-994{n=49 c=[0:0.630, 3:0.618, 4:0.336] r=[0:0.153, 3:0.130, 4:0.146]}

VL-995{n=74 c=[0:0.782, 3:0.673, 4:0.771] r=[0:0.127, 3:0.179, 4:0.136]}

VL-996{n=14 c=[0:0.842, 3:0.142, 4:0.147] r=[0:0.082, 3:0.140, 4:0.115]}

VL-997{n=452 c=[1:0.494, 2:0.521, 3:0.528] r=[1:0.280, 2:0.277, 3:0.275]}

VL-998{n=110 c=[0:0.354, 3:0.304, 4:0.764] r=[0:0.216, 3:0.178, 4:0.142]}

VL-999{n=77 c=[0.232, 0.012, 0.008, 0.732, 0.157] r=[0.169, 0.040, 0.026, 0.170, 0.135]}


In the output from version 0.7, the centroid coordinates are written in the dictionary format, e.g.

c=[0:0.687, 3:0.185, 4:0.463], which means [0.687, 0, 0, 0.185, 0.463].

However, in the output from version 0.9, the centroid coordinates cannot be determined,

because the positions of the zeros are not shown.
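The ambiguity can be illustrated with a small Python sketch that reads the `c=[...]` part of a clusterdump line. It is only an illustration built on the two listings above: the helper name `parse_centroid` is hypothetical, and the vector length of 5 is assumed from the ARFF header. It returns a dense centroid when the entries carry indices or when all five values are present, and `None` when a shortened list gives no way to place the zeros.

```python
import re


def parse_centroid(line, cardinality):
    """Return the centroid as a dense list, or None when positions are unknown."""
    match = re.search(r"c=\[([^\]]*)\]", line)
    if match is None:
        return None
    entries = [e.strip() for e in match.group(1).split(",") if e.strip()]
    if all(":" in e for e in entries):
        # Indexed form (as in the 0.7 output): place each value at its index.
        dense = [0.0] * cardinality
        for e in entries:
            index, value = e.split(":")
            dense[int(index)] = float(value)
        return dense
    if len(entries) == cardinality and not any(":" in e for e in entries):
        # Full-length list without indices: positions are unambiguous.
        return [float(e) for e in entries]
    # Fewer values than dimensions and no indices: cannot be reconstructed.
    return None


print(parse_centroid("VL-989{n=72 c=[0:0.687, 3:0.185, 4:0.463] r=[0:0.145, 3:0.122, 4:0.207]}", 5))
print(parse_centroid("VL-648{n=172 c=[0.733, 0.608, 0.563] r=[0.168, 0.221, 0.235]}", 5))
```

The first call yields a five-element list; the second yields `None`, which is exactly the case in question for the 0.9 output.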


Could you tell me how to interpret the output from version 0.9?

RE: how to interpret the result of the clustering by “mahout kmeans”

Posted by 熊田聖也 <se...@cct-inc.co.jp>.

Dear Ankit Goel

Thank you for your detailed answer.
I'm going to check the correctness of my sequence files according to your answer.

Regards, 
S. Kumada




Re: how to interpret the result of the clustering by “mahout kmeans”

Posted by Ankit Goel <an...@gmail.com>.

Hi Kumada,
I had the same problem until two days ago. Here are a few things I figured out
that I think will help. However, I'm working with Mahout 0.10.0, so I
might be slightly off in what I'm saying.

First, from your results, the clusterdump format in 0.9 does seem to omit the column
id, as you mentioned. I think Pat suspects there might be a problem with the way
the data was entered. The workaround is to access the output through Java
rather than the command line. I was quite confused by some things (my
dictionary had over 3400 terms), so using Java helped me get clarity on a
lot of them. Through Java code, you will be able to extract the values of
the columns properly.

Mahout is built on Hadoop, which uses a file format called sequence
files. They have several storage benefits; the only one I know is
that they store data more compactly. So any program you write
deals with sequence files. In fact, you came across them when you
were converting your data from a text file to the Mahout vector format. You probably
used *mahout seqdirectory* at the very start. You can explore sequence
files with *mahout seqdumper*. So what Pat is asking (correct me if I'm
wrong) is whether, after converting your raw data to a Mahout-readable format, you
checked that the result was right.
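One way to do that check, sketched in Python under the assumption that the dumped text looks like the "Key: N: Value: {...}" lines shown in the original post: convert an ARFF data row to its nonzero entries and compare them with the matching dumped line. Both helper names are hypothetical and neither is part of Mahout.

```python
def arff_row_to_sparse(row):
    """Map a comma-separated ARFF data row to {column index: nonzero value}.

    Assumes the only empty field is the one after a trailing comma.
    """
    fields = [f for f in row.strip().split(",") if f.strip()]
    return {i: float(f) for i, f in enumerate(fields) if float(f) != 0.0}


def dump_line_to_sparse(line):
    """Parse a 'Key: N: Value: {i:v,...}' line into {column index: value}."""
    body = line.split("Value:", 1)[1].strip().strip("{}")
    pairs = [p.split(":") for p in body.split(",") if p]
    return {int(i): float(v) for i, v in pairs}


arff = "0.000,0.824,0.936,0.214,0.000,"
dumped = "Key: 1: Value: {1:0.824,2:0.936,3:0.214}"
print(arff_row_to_sparse(arff) == dump_line_to_sparse(dumped))  # True
```

If the comparison holds for every row, the input vectors carry the right column ids, and the missing ids are introduced later, when the centroids are printed.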


-- 
Regards,
Ankit Goel
http://about.me/ankitgoel

Re: how to interpret the result of the clustering by “mahout kmeans”

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Clusterdump is a tool for examining the output. The sequencefiles *are* the output.

Run “mahout kmeans” to get a list of the options and to see where the output is stored.


RE: how to interpret the result of the clustering by “mahout kmeans”

Posted by 熊田聖也 <se...@cct-inc.co.jp>.

Thank you for your reply.
I use Amazon Elastic MapReduce (EMR).
It supports mahout-0.9/0.8, but not 0.7.
With mahout-0.9/0.8, the output of “mahout clusterdump” does not contain the column id, but the output of 0.7 does.

I have a question about your statement "Are the results in the sequence files correct?"
What are the sequence files?
Which "mahout" command produces them?

Sincerely yours,
S. Kumada



Re: how to interpret the result of the clustering by “mahout kmeans”

Posted by Pat Ferrel <pa...@occamsmachete.com>.

This is probably a clusterdump formatting problem in Mahout 0.9. Have you tried Mahout 0.10.1, which is the latest version?

Are the results in the sequence files correct? They are sparse vectors, so they must contain the column id.

