Posted to user@mahout.apache.org by Matt Tanquary <ma...@gmail.com> on 2010/09/29 20:26:14 UTC

kmeans vectors

I was able to run the tutorials, etc. Now I would like to generate my
own small test.

I have created a data.dat file and put these contents:
22 21
19 20
18 22
1 3
3 2

Then I ran: mahout seqdirectory -i ~/data/kmeans/data.dat -o kmeans/seqdir

This created kmeans/seqdir/chunk-o in my dfs with the following content:
¼/%
        /data.dat22 21
19 20
18 22
1 3
3 2

Next I ran:  mahout seq2sparse -i kmeans/seqdir -o kmeans/input

This generated several things in kmeans/input including the
'tfidf/vectors' folder. Inside the vectors folder I get: part-00000
which contains:
øÏân
        /data.dat7org.apache.mahout.math.RandomAccessSparseVectorWritable
     /data.dat@@

It does not seem to have the numeric data at this point.

I am hoping someone can shed some light on how I can get my datapoint
file into the proper vector format for running mahout kmeans.

Just fyi, when I run kmeans against that file (mahout kmeans -i
kmeans/input/tfidf/vectors -c kmeans/clusters -o kmeans/output -k 2
-w) I get:

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index:
1, Size: 1
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)

which tells me it was unable to find even 1 vector in the given input folder.

Thanks for any comments you provide.
-M@
-- 
Have you thanked a teacher today? ---> http://www.liftateacher.org
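Matt's question boils down to turning space-delimited point lines into numeric vectors. A minimal Python sketch of just that parsing step (a hypothetical illustration, not Mahout code; Mahout's InputDriver does this conversion and writes the results as VectorWritable SequenceFiles):

```python
# Hypothetical sketch (not Mahout code): turn each whitespace-delimited
# line of numbers into a numeric vector, skipping blank lines.

def parse_points(lines):
    """Parse lines like '22 21' into lists of floats."""
    points = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        points.append([float(tok) for tok in line.split()])
    return points

data = """\
22 21
19 20
18 22
1 3
3 2
"""
print(parse_points(data.splitlines()))
```

Each parsed list corresponds to one input point for clustering.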

Re: kmeans vectors

Posted by Matt Tanquary <ma...@gmail.com>.
I used the InputDriver that Jeff placed in Utils to convert my input
to a SeqFile and ran it through mahout kmeans; now I can specify the
'k' arg. Jeff, I know you tried to tell me; it just didn't sink in
until now. :-)

On Fri, Oct 1, 2010 at 7:22 AM, Matt Tanquary <ma...@gmail.com> wrote:
> I played around with the t1 and t2 until I got a k that I expected
> with my small set, but if I want to ensure say 3 clusters on a large
> set of data, then how do I use t1 and t2 to set k? Is there a formula
> for that?
>
> On Thu, Sep 30, 2010 at 8:24 PM, Lahiru Samarakoon <la...@gmail.com> wrote:
>> Hi Matt,
>>
>> As Jeff mentioned earlier, you have to choose t1 and t2 to get k
>> when you are using the syntheticcontrol.kmeans.Job program. So what you have
>> experienced is correct.
>>
>> Thanks,
>> Lahiru
>>
>
>
>
> --
> Have you thanked a teacher today? ---> http://www.liftateacher.org
>



-- 
Have you thanked a teacher today? ---> http://www.liftateacher.org

Re: kmeans vectors

Posted by Ted Dunning <te...@gmail.com>.
No, there isn't. Your other option is to use kmeans directly and set k
(as you seem to do now).

t1 and t2 can also be quite delicate parameters.

My own tendency is to use a good initialization scheme such as kmeans++
(which we don't yet have) and just specify the number of clusters. If
none of the clusters are small, I increase k. If some have just a very
few points, I decrease k. Then I look for temporal stability of cluster
size. At that point, the clusters are the clusters and I rarely change
them.

My justification for this is that clustering, for me, is just a way to
recode input variables, along the lines of a volume quantization (coding
each point with just the cluster) or near diagonalization (coding each
point with distance to all clusters).
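Ted's increase/decrease procedure can be sketched as a small search loop. The thresholds below (`small`, `tiny`) and the `sizes_for` callback are hypothetical stand-ins for "run the clustering and report cluster sizes", not Mahout parameters:

```python
# Paraphrase of the heuristic above, with hypothetical thresholds.

def tune_k(sizes_for, k, small=10, tiny=2, k_min=1, k_max=50):
    """Grow k while every cluster is comfortably large; shrink k when
    some cluster is nearly empty; stop once sizes look reasonable."""
    seen = set()
    while k_min < k < k_max and k not in seen:
        seen.add(k)                 # guard against oscillating forever
        smallest = min(sizes_for(k))
        if smallest >= small:       # no small clusters: try more clusters
            k += 1
        elif smallest <= tiny:      # near-empty clusters: too many
            k -= 1
        else:
            break                   # sizes in a sensible band: keep k
    return k

# Toy stand-in: 100 points always split evenly across k clusters.
print(tune_k(lambda k: [100 // k] * k, 3))
```

In practice `sizes_for` would rerun the clustering job at each candidate k, so the loop is only worth it when runs are cheap or sampled.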

On Fri, Oct 1, 2010 at 7:22 AM, Matt Tanquary <ma...@gmail.com>wrote:

> I played around with the t1 and t2 until I got a k that I expected
> with my small set, but if I want to ensure say 3 clusters on a large
> set of data, then how do I use t1 and t2 to set k? Is there a formula
> for that?
>
> On Thu, Sep 30, 2010 at 8:24 PM, Lahiru Samarakoon <la...@gmail.com>
> wrote:
> > Hi Matt,
> >
> > As Jeff mentioned earlier, you have to choose t1 and t2 to get k
> > when you are using the syntheticcontrol.kmeans.Job program. So what you
> > have experienced is correct.
> >
> > Thanks,
> > Lahiru
> >
>
>
>
> --
> Have you thanked a teacher today? ---> http://www.liftateacher.org
>

Re: kmeans vectors

Posted by Matt Tanquary <ma...@gmail.com>.
I played around with the t1 and t2 until I got a k that I expected
with my small set, but if I want to ensure say 3 clusters on a large
set of data, then how do I use t1 and t2 to set k? Is there a formula
for that?

On Thu, Sep 30, 2010 at 8:24 PM, Lahiru Samarakoon <la...@gmail.com> wrote:
> Hi Matt,
>
> As Jeff mentioned earlier, you have to choose t1 and t2 to get k
> when you are using the syntheticcontrol.kmeans.Job program. So what you have
> experienced is correct.
>
> Thanks,
> Lahiru
>



-- 
Have you thanked a teacher today? ---> http://www.liftateacher.org

Re: kmeans vectors

Posted by Lahiru Samarakoon <la...@gmail.com>.
Hi Matt,

As Jeff mentioned earlier, you have to choose t1 and t2 to get k
when you are using the syntheticcontrol.kmeans.Job program. So what you have
experienced is correct.

Thanks,
Lahiru

Re: kmeans vectors

Posted by Matt Tanquary <ma...@gmail.com>.
I tried to use -k with the syntheticcontrol.kmeans.Job program, but it
didn't recognize that argument.

On Thu, Sep 30, 2010 at 6:18 AM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
>  Not using the synthetic control jobs. They always run Canopy over the
> converted data, and you need to choose t1 and t2 to get the initial k. Once
> you have run it, however, copy the data file from the output into another
> folder. From there you can run k-means or any of the other clustering
> programs on that data using their normal jobs and normal parameters.
>
> When you run k-means on the data, you can supply a -k argument and your
> input points will be randomly sampled to prime the initial cluster centers
> for the subsequent iterations.
>
> I'm going to move the InputDriver and Mapper to utils since they have
> general utility outside of the synthetic control example. The driver can be
> run directly from the command line, and you can do that too.
>
> Smooth sailing,
> Jeff
>
>
> On 9/30/10 1:40 AM, Lahiru Samarakoon wrote:
>>
>> Hi Jeff,
>>
>> If we do this for k-means, how can we specify k (the number of clusters)
>> and the initial seeds for the algorithm?
>>
>> I understand that Canopy is used for this.
>>
>> Does Mahout have the flexibility to use k-means/fuzzy k-means independently
>> of Canopy, by supplying k and the initial seeds externally?
>>
>> Thanks,
>> Lahiru
>>
>
>



-- 
Have you thanked a teacher today? ---> http://www.liftateacher.org

Re: kmeans vectors

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Not using the synthetic control jobs. They always run Canopy over the
converted data, and you need to choose t1 and t2 to get the initial k.
Once you have run it, however, copy the data file from the output into
another folder. From there you can run k-means or any of the other
clustering programs on that data using their normal jobs and normal
parameters.

When you run k-means on the data, you can supply a -k argument and your
input points will be randomly sampled to prime the initial cluster
centers for the subsequent iterations.

I'm going to move the InputDriver and Mapper to utils since they have
general utility outside of the synthetic control example. The driver can
be run directly from the command line, and you can do that too.

Smooth sailing,
Jeff
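The -k behavior Jeff describes (randomly sample k input points as the initial centers, then iterate) can be illustrated with a toy in-memory k-means. This is a conceptual sketch, not Mahout's distributed implementation:

```python
import random

# Conceptual sketch of what `kmeans -k 2` does per the description above:
# randomly sample k input points as the initial centers, then repeatedly
# assign each point to its nearest center and recompute the means.

def kmeans(points, k, iterations=10, seed=42):
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # the "-k" random-sampling step
    buckets = [[] for _ in range(k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centers[i])))
            buckets[nearest].append(p)
        # Recompute each center as the mean of its bucket; keep the old
        # center if a bucket happens to be empty.
        centers = [[sum(col) / len(b) for col in zip(*b)] if b else centers[i]
                   for i, b in enumerate(buckets)]
    return centers, buckets

pts = [[22, 21], [19, 20], [18, 22], [1, 3], [3, 2]]
centers, buckets = kmeans(pts, 2)
print(sorted(len(b) for b in buckets))   # the two natural groups emerge
```

On Matt's five points any sampled initialization converges to the same 3-point/2-point split within a few iterations.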


On 9/30/10 1:40 AM, Lahiru Samarakoon wrote:
> Hi Jeff,
>
> If we do this for k-means, how can we specify k (the number of clusters)
> and the initial seeds for the algorithm?
>
> I understand that Canopy is used for this.
>
> Does Mahout have the flexibility to use k-means/fuzzy k-means independently
> of Canopy, by supplying k and the initial seeds externally?
>
> Thanks,
> Lahiru
>


Re: kmeans vectors

Posted by Lahiru Samarakoon <la...@gmail.com>.
Hi Jeff,

If we do this for k-means, how can we specify k (the number of clusters) and
the initial seeds for the algorithm?

I understand that Canopy is used for this.

Does Mahout have the flexibility to use k-means/fuzzy k-means independently of
Canopy, by supplying k and the initial seeds externally?

Thanks,
Lahiru

Re: kmeans vectors

Posted by Matt Tanquary <ma...@gmail.com>.
Thanks,

It was a permission issue. I had to change the group owner to the
current user's group; it's now building. I moved the build from one
server to another (which caused the user sync problem).

2010/9/30 Jeff Eastman <jd...@windwardsolutions.com>:
>  Don't think so. Try "mvn clean install" and let me know what happens.
>
> On 9/30/10 12:48 PM, Matt Tanquary wrote:
>>
>> Hi Jeff,
>>
>> Thanks for your reply. I just got trunk and started the install. It
>> ended with this error:
>>
>> Error loading supplemental data models: Cannot create file-based resource.
>> org.codehaus.plexus.resource.loader.FileResourceCreationException:
>> Cannot create file-based resource.
>>
>>
>> A lot built, so I went ahead and tried your command-line example, but got:
>>
>> ERROR: Could not find mahout-examples-*.job in
>> /mnt/install/tools/mahout or
>> /mnt/install/tools/mahout/examples/target, please run 'mvn install' to
>> create the .job file
>>
>> I retrieved trunk as follows: svn co
>> http://svn.apache.org/repos/asf/mahout/trunk
>>
>> Then ran 'mvn install' in the trunk folder.
>>
>> Any issues with trunk today?
>>
>> Thanks,
>> Matt
>>
>> On Wed, Sep 29, 2010 at 12:29 PM, Jeff Eastman
>> <jd...@windwardsolutions.com>  wrote:
>>>
>>>  Hi Matt,
>>>
>>>  From your command arguments, it looks like you are running 0.3. Due to
>>> the
>>> rate of change in Mahout we recommend you check out trunk and use that
>>> instead. With a little tweaking (added a --charset ASCII on seqdirectory)
>>> I
>>> was able to get as far as you did on trunk but seq2sparse is not what you
>>> want to use.
>>>
>>> The utilities you are using are intended for text preprocessing, to get
>>> documents word-counted, into term vector sequenceFiles and then running
>>> TF
>>> and/or TF-IDF processing on the results to produce VectorWritable
>>> sequence
>>> files suitable for clustering. For your problem, I suggest you instead
>>> look
>>> at the Synthetic Control clustering examples, starting with Canopy. These
>>> use an InputDriver to process text files containing space-delimited
>>> numbers
>>> like your data.dat file and produce the VectorWritable sequence files
>>> directly.
>>>
>>> I was able to run this on your data using trunk and it produced 3
>>> clusters.
>>> You should be able to run the other synthetic control jobs on it too:
>>>
>>> CommandLine:
>>> ./bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job \
>>> -i data \
>>> -o output \
>>> -t1 3 \
>>> -t2 2 \
>>> -ow \
>>> -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure
>>>
>>> Clusters output:
>>> C-0{n=1 c=[22.000, 21.000] r=[0.000, 0.000]}
>>>    Weight:  Point:
>>>    1.0: [22.000, 21.000]
>>> C-1{n=2 c=[18.250, 21.500] r=[0.250, 0.500]}
>>>    Weight:  Point:
>>>    1.0: [19.000, 20.000]
>>>    1.0: [18.000, 22.000]
>>> C-2{n=2 c=[2.500, 2.250] r=[0.500, 0.250]}
>>>    Weight:  Point:
>>>    1.0: [1.000, 3.000]
>>>    1.0: [3.000, 2.000]
>>>
>>>
>>> Good hunting,
>>> Jeff
>>>
>>
>>
>
>



-- 
Have you thanked a teacher today? ---> http://www.liftateacher.org

Re: kmeans vectors

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Don't think so. Try "mvn clean install" and let me know what happens.

On 9/30/10 12:48 PM, Matt Tanquary wrote:
> Hi Jeff,
>
> Thanks for your reply. I just got trunk and started the install. It
> ended with this error:
>
> Error loading supplemental data models: Cannot create file-based resource.
> org.codehaus.plexus.resource.loader.FileResourceCreationException:
> Cannot create file-based resource.
>
>
> A lot built, so I went ahead and tried your command-line example, but got:
>
> ERROR: Could not find mahout-examples-*.job in
> /mnt/install/tools/mahout or
> /mnt/install/tools/mahout/examples/target, please run 'mvn install' to
> create the .job file
>
> I retrieved trunk as follows: svn co
> http://svn.apache.org/repos/asf/mahout/trunk
>
> Then ran 'mvn install' in the trunk folder.
>
> Any issues with trunk today?
>
> Thanks,
> Matt
>
> On Wed, Sep 29, 2010 at 12:29 PM, Jeff Eastman
> <jd...@windwardsolutions.com>  wrote:
>>   Hi Matt,
>>
>>  From your command arguments, it looks like you are running 0.3. Due to the
>> rate of change in Mahout we recommend you check out trunk and use that
>> instead. With a little tweaking (added a --charset ASCII on seqdirectory) I
>> was able to get as far as you did on trunk but seq2sparse is not what you
>> want to use.
>>
>> The utilities you are using are intended for text preprocessing, to get
>> documents word-counted, into term vector sequenceFiles and then running TF
>> and/or TF-IDF processing on the results to produce VectorWritable sequence
>> files suitable for clustering. For your problem, I suggest you instead look
>> at the Synthetic Control clustering examples, starting with Canopy. These
>> use an InputDriver to process text files containing space-delimited numbers
>> like your data.dat file and produce the VectorWritable sequence files
>> directly.
>>
>> I was able to run this on your data using trunk and it produced 3 clusters.
>> You should be able to run the other synthetic control jobs on it too:
>>
>> CommandLine:
>> ./bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job \
>> -i data \
>> -o output \
>> -t1 3 \
>> -t2 2 \
>> -ow \
>> -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure
>>
>> Clusters output:
>> C-0{n=1 c=[22.000, 21.000] r=[0.000, 0.000]}
>>     Weight:  Point:
>>     1.0: [22.000, 21.000]
>> C-1{n=2 c=[18.250, 21.500] r=[0.250, 0.500]}
>>     Weight:  Point:
>>     1.0: [19.000, 20.000]
>>     1.0: [18.000, 22.000]
>> C-2{n=2 c=[2.500, 2.250] r=[0.500, 0.250]}
>>     Weight:  Point:
>>     1.0: [1.000, 3.000]
>>     1.0: [3.000, 2.000]
>>
>>
>> Good hunting,
>> Jeff
>>
>
>


Re: kmeans vectors

Posted by Matt Tanquary <ma...@gmail.com>.
Hi Jeff,

Thanks for your reply. I just got trunk and started the install. It
ended with this error:

Error loading supplemental data models: Cannot create file-based resource.
org.codehaus.plexus.resource.loader.FileResourceCreationException:
Cannot create file-based resource.


A lot built, so I went ahead and tried your command-line example, but got:

ERROR: Could not find mahout-examples-*.job in
/mnt/install/tools/mahout or
/mnt/install/tools/mahout/examples/target, please run 'mvn install' to
create the .job file

I retrieved trunk as follows: svn co
http://svn.apache.org/repos/asf/mahout/trunk

Then ran 'mvn install' in the trunk folder.

Any issues with trunk today?

Thanks,
Matt

On Wed, Sep 29, 2010 at 12:29 PM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
>  Hi Matt,
>
> From your command arguments, it looks like you are running 0.3. Due to the
> rate of change in Mahout we recommend you check out trunk and use that
> instead. With a little tweaking (added a --charset ASCII on seqdirectory) I
> was able to get as far as you did on trunk but seq2sparse is not what you
> want to use.
>
> The utilities you are using are intended for text preprocessing, to get
> documents word-counted, into term vector sequenceFiles and then running TF
> and/or TF-IDF processing on the results to produce VectorWritable sequence
> files suitable for clustering. For your problem, I suggest you instead look
> at the Synthetic Control clustering examples, starting with Canopy. These
> use an InputDriver to process text files containing space-delimited numbers
> like your data.dat file and produce the VectorWritable sequence files
> directly.
>
> I was able to run this on your data using trunk and it produced 3 clusters.
> You should be able to run the other synthetic control jobs on it too:
>
> CommandLine:
> ./bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job \
> -i data \
> -o output \
> -t1 3 \
> -t2 2 \
> -ow \
> -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure
>
> Clusters output:
> C-0{n=1 c=[22.000, 21.000] r=[0.000, 0.000]}
>    Weight:  Point:
>    1.0: [22.000, 21.000]
> C-1{n=2 c=[18.250, 21.500] r=[0.250, 0.500]}
>    Weight:  Point:
>    1.0: [19.000, 20.000]
>    1.0: [18.000, 22.000]
> C-2{n=2 c=[2.500, 2.250] r=[0.500, 0.250]}
>    Weight:  Point:
>    1.0: [1.000, 3.000]
>    1.0: [3.000, 2.000]
>
>
> Good hunting,
> Jeff
>
>
>



-- 
Have you thanked a teacher today? ---> http://www.liftateacher.org

Re: kmeans vectors

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Matt,

From your command arguments, it looks like you are running 0.3. Due to
the rate of change in Mahout, we recommend you check out trunk and use
that instead. With a little tweaking (adding a --charset ASCII on
seqdirectory) I was able to get as far as you did on trunk, but
seq2sparse is not what you want to use.

The utilities you are using are intended for text preprocessing: they
word-count documents into term-vector SequenceFiles and then run TF
and/or TF-IDF processing on the results to produce VectorWritable
sequence files suitable for clustering. For your problem, I suggest you
instead look at the Synthetic Control clustering examples, starting with
Canopy. These use an InputDriver to process text files containing
space-delimited numbers, like your data.dat file, and produce the
VectorWritable sequence files directly.

I was able to run this on your data using trunk and it produced 3 
clusters. You should be able to run the other synthetic control jobs on 
it too:

CommandLine:
./bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job \
-i data \
-o output \
-t1 3 \
-t2 2 \
-ow \
-dm org.apache.mahout.common.distance.EuclideanDistanceMeasure

Clusters output:
C-0{n=1 c=[22.000, 21.000] r=[0.000, 0.000]}
     Weight:  Point:
     1.0: [22.000, 21.000]
C-1{n=2 c=[18.250, 21.500] r=[0.250, 0.500]}
     Weight:  Point:
     1.0: [19.000, 20.000]
     1.0: [18.000, 22.000]
C-2{n=2 c=[2.500, 2.250] r=[0.500, 0.250]}
     Weight:  Point:
     1.0: [1.000, 3.000]
     1.0: [3.000, 2.000]


Good hunting,
Jeff
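The effect of t1 and t2 on the initial k can be seen in a simplified, classic canopy sketch: looser thresholds yield fewer, larger canopies. Mahout's CanopyClusterer differs in its details, so canopy counts from this sketch need not match the real run above; it is a hypothetical illustration only:

```python
import math

# Simplified classic canopy sketch (not Mahout's CanopyClusterer).

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopies(points, t1, t2, dist=euclid):
    """Pick a remaining point as a canopy center, put every point within
    t1 into that canopy, and drop every point within t2 from further
    consideration. With t2 <= t1, the thresholds control how many
    canopies (the initial k) emerge."""
    remaining = list(points)
    result = []
    while remaining:
        center = remaining[0]
        result.append([p for p in remaining if dist(center, p) <= t1])
        remaining = [p for p in remaining if dist(center, p) > t2]
    return result

pts = [[22, 21], [19, 20], [18, 22], [1, 3], [3, 2]]
print(len(canopies(pts, 3.5, 2.5)))   # looser thresholds: few canopies
print(len(canopies(pts, 0.5, 0.5)))   # tight thresholds: one per point
```

The canopy centers then seed k-means, which is why tuning t1/t2 is the indirect way of choosing k with the synthetic control job.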
