Posted to user@mahout.apache.org by Radek Maciaszek <ra...@gmail.com> on 2010/09/06 14:45:52 UTC

Transforming data for k-means analysis

Hi,

I am trying to use Mahout for my MSc project. I successfully ran all the
clustering examples and am now trying to analyse some of my own data,
unfortunately without much success.

The input data I want to cluster is a list of vectors in tab-separated
format:
1.2   0.0   0.0  3.414
0.0   0.4   0.0   0.3
16.2  0.0   0.0   0.0
etc.
I generated this file in Python and can easily change it to comma-separated
format or make any other necessary changes. It is a rather large file, with
many thousands of dimensions and millions of rows, containing TF/IDF values
calculated for users and the URLs they visited (each row is a user and each
column a URL). Each row is a sparse vector.

I would like to cluster the users into 20+ clusters using k-means. I am now
having problems running clustering on this data. To begin with, I simply put
this file in place of the "testdata" file on Hadoop (originally
synthetic_control.data) and ran "mahout
org.apache.mahout.clustering.syntheticcontrol.canopy.Job". I was hoping to
reuse the existing scripts, but that unfortunately gives me some null
pointer exceptions.

What would be the fastest/best way of analysing this matrix in order to
group the rows into clusters?
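
For reference, the replies below converge on converting each row into a
Mahout vector and writing <Text, VectorWritable> pairs into a Hadoop
sequence file. A minimal sketch of that conversion, assuming the Mahout and
Hadoop APIs of that era (the path and the "user-0" key are hypothetical):

// A sketch only: turn one tab-separated row of the TF/IDF matrix into a
// Mahout sparse vector and append it to a sequence file that kmeans can read.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class RowToVector {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("testdata/part-00000"), Text.class, VectorWritable.class);
    String row = "1.2\t0.0\t0.0\t3.414"; // one user's row
    String[] cells = row.split("\t");
    Vector v = new SequentialAccessSparseVector(cells.length);
    for (int i = 0; i < cells.length; i++) {
      double d = Double.parseDouble(cells[i]);
      if (d != 0.0) { // store only the non-zero entries
        v.setQuick(i, d);
      }
    }
    writer.append(new Text("user-0"), new VectorWritable(v)); // key = row id
    writer.close();
  }
}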

Many thanks for your advice,
Radek

Re: Transforming data for k-means analysis

Posted by Ted Dunning <te...@gmail.com>.
Glad we could help.

On Tue, Jul 5, 2011 at 7:09 AM, Radek Maciaszek <ra...@maciaszek.co.uk> wrote:

> Hello,
>
> I worked in the past on an MSc project which involved quite a lot of Mahout
> calculation. I finished it a while ago but only recently got around to
> posting it somewhere online.
>
> It would have been much more difficult to finish this work without the help
> from this list, so I wanted to say thank you! I thought that perhaps someone
> would find my code and research interesting, so here it is.
>
> The paper is on "How much behavioural targeting can help online
> advertising". There are quite a lot of calculations involved, written
> mostly in Python and Hadoop/Hive, and the clustering was performed by
> Mahout.
> http://www.dataminelab.com/blog/behavioural-targeting-online/
>
> Many thanks!
> Radek
>
> On 8 September 2010 09:52, rmx <ru...@hotmail.com> wrote:
>
> >
> > Hi Radek,
> > If you could post a tutorial, it would be fantastic.
> > I am a Machine Learning researcher without enough Java programming skills
> > to dig into the code.
> > I find Mahout's potential really impressive, and if I can manage to get it
> > working I would be up for convincing the rest of my research group to use
> > it.
> >
> > Hi Jeff, yes, the problems I got were with the non-trunk version. 2 or 3
> > weeks ago I tried to install trunk but I got some errors in the
> > installation tests. I will try to do it again, since there is probably a
> > new version.
> >
> > Thanks
> > Rui
>

Re: Transforming data for k-means analysis

Posted by Radek Maciaszek <ra...@maciaszek.co.uk>.
Hello,

I worked in the past on an MSc project which involved quite a lot of Mahout
calculation. I finished it a while ago but only recently got around to
posting it somewhere online.

It would have been much more difficult to finish this work without the help
from this list, so I wanted to say thank you! I thought that perhaps someone
would find my code and research interesting, so here it is.

The paper is on "How much behavioural targeting can help online
advertising". There are quite a lot of calculations involved, written mostly
in Python and Hadoop/Hive, and the clustering was performed by Mahout.
http://www.dataminelab.com/blog/behavioural-targeting-online/

Many thanks!
Radek

On 8 September 2010 09:52, rmx <ru...@hotmail.com> wrote:

>
> Hi Radek,
> If you could post a tutorial, it would be fantastic.
> I am a Machine Learning researcher without enough Java programming skills
> to dig into the code.
> I find Mahout's potential really impressive, and if I can manage to get it
> working I would be up for convincing the rest of my research group to use it.
>
> Hi Jeff, yes, the problems I got were with the non-trunk version. 2 or 3
> weeks ago I tried to install trunk but I got some errors in the
> installation tests. I will try to do it again, since there is probably a
> new version.
>
> Thanks
> Rui

Re: Transforming data for k-means analysis

Posted by Radek Maciaszek <ra...@maciaszek.co.uk>.
Jeff, you were absolutely right! By mistake I specified the wrong folder name
here, and as a result kmeans was unable to find the sequence files. Seeing the
array exception, I just assumed it was an error somewhere in the sequence and
spent hours trying to hunt it down. One good thing which came out of it is
that I finally configured debugging on my dev box - it makes testing much
easier compared to testing from the command line.

Thanks again,
Radek

On 15 September 2010 01:04, Jeff Eastman <jd...@windwardsolutions.com> wrote:

>  Hmm, this may also be caused by simply using the wrong input path to
> kmeans. It should be the directory containing your input vector sequence
> files, not the name of the file itself. This code is the first place in
> kmeans that the input path is queried. Judging by the fact that the index
> exception was for index 0 I don't think it found any of your sequence files.
>
> We've had similar reports of this nature before. I'll see if I can create a
> unit test to duplicate your problem and then fix it.
>
>
> On 9/14/10 4:43 PM, Jeff Eastman wrote:
>
>>  Hi Radek,
>>
>> Looking over your mapper code it looks mostly ok but I am curious why you
>> are writing the vector size in the context write's Text argument? Aren't
>> they all the same size? The Mahout document processing jobs generally put
>> the document ID in the key slot (and also sometimes in a NV in the value
>> slot). If you look at line 107 in the file, however, you will see that the
>> exception is likely the result of "chosenTexts.get(i)". Looking upward at
>> the reader loop above, it is scanning through all the input vectors so the
>> only way I can see a bounds exception is if "k" is greater than the number
>> of input vectors. What value did you specify in mahout kmeans?
>>
>> Have you tried running this in a debugger? Maybe a simple test to check
>> that k > chosenTexts.size (and also chosenClusters)? Probably by now you've
>> found the cause on your own...
>>
>>
>> On 9/14/10 3:45 PM, Radek Maciaszek wrote:
>>
>>> Hi Jeff,
>>>
>>> Thanks again for your help, I am starting to see the light at the end of
>>> my MSc eventually. I think I broke something in Mahout again ;) Due to the
>>> number of dimensions (around 14,000) I need to use sparse vectors (which
>>> are wrapped inside NamedVectors). I used the logic from the syntheticdata
>>> InputMapper, and the sequence files appear to be generated correctly -
>>> well, at least I cannot see any errors in that process. However, as soon
>>> as I am trying to pass that data to k-means clustering, the
>>> RandomSeedGenerator class gives me the following error:
>>> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0,
>>> Size: 0
>>>         at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>>         at java.util.ArrayList.get(ArrayList.java:322)
>>>         at
>>> org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
>>>
>>>
>>> It appears that the following line generates the
>>> IndexOutOfBoundsException:
>>> writer.append(chosenTexts.get(i), chosenClusters.get(i));
>>>
>>> I believe this is probably because the sequence files are not created
>>> correctly, or perhaps the code in RandomSeedGenerator is not compatible
>>> with the sequence file which I used... Here is the essence of the code I
>>> am using to create the sequence file, namely the map() method from
>>> InputMapper:
>>>
>>>   protected void map(LongWritable key, Text values, Context context)
>>> throws
>>> IOException, InterruptedException {
>>>
>>>     String[] numbers = InputMapper.SPACE.split(values.toString());
>>>     SequentialAccessSparseVector sparseVector = null;
>>>     String keyName = "";
>>>     int vectorSize = -1;
>>>     for (String value : numbers) {
>>>       if (keyName.equals("")) {
>>>           keyName = value;
>>>           continue;
>>>       } else if (vectorSize == -1) {
>>>           vectorSize = Integer.parseInt(value);
>>>           sparseVector = new SequentialAccessSparseVector(vectorSize);
>>>           continue;
>>>       } else if (value.length() > 0) {
>>>           String[] valuePair = InputMapper.COLON.split(value);
>>>           if (!valuePair[1].equals("NULL")) {
>>>             sparseVector.setQuick(Integer.parseInt(valuePair[0]),
>>> Double.valueOf(valuePair[1]));
>>>           }
>>>       }
>>>     }
>>>     try {
>>>       Vector result = new NamedVector(sparseVector, keyName);
>>>       VectorWritable vectorWritable = new VectorWritable(result);
>>>       context.write(new Text(String.valueOf(vectorSize)),
>>> vectorWritable);
>>>
>>>     } catch (Exception e) {
>>>       throw new IllegalStateException(e);
>>>     }
>>>   }
>>>
>>> My input data, which the mapper analyzes, is in the format:
>>> ID NumberOfDimensionsInVector Index1:Value1 Index5:Value5
>>> IndexY:ValueY...
>>>
>>> I am still trying to get my head around the implementation details of
>>> Mahout
>>> and I find it a bit difficult to debug some things. Thank you in advance
>>> for
>>> any tips.
>>>
>>> Best,
>>> Radek
>>>
>>> On 9 September 2010 16:26, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>>>
>>>   It's alive, how marvelous! On the number of clusters, I am uncertain.
>>>> You
>>>> may indeed have uncovered a <choke> defect. Let's work on characterizing
>>>> that a bit more. I get that you ran the mahout kmeans command with -k
>>>> 600
>>>> and only found 175 clusterIds referenced in the clusteredPoints
>>>> directory.
>>>> How many clusters were in your -c directory? That would be the initial
>>>> clusters produced by the RandomSeedGenerator. Try running cluster dumper
>>>> on
>>>> that directory. If there are still only 175 clusters then the generator
>>>> has
>>>> a problem.
>>>>
>>>> Canopy is a little hard to parametrize. If you are only getting a single
>>>> cluster out then the T2 distance you are using is too large. Try a
>>>> smaller
>>>> value and the number of clusters should increase dramatically at some
>>>> point
>>>> (in the limit to the number of vectors if T2=0). I use a binary search
>>>> to
>>>> converge on this value. T1 is less fussy and needs only to be larger
>>>> than
>>>> T2. It influences the number of nearby points that are not within T2
>>>> that
>>>> also need to contribute to the cluster center. For subsequent k-Means
>>>> processing, this is not so important.
>>>>
>>>> Finally, you should not have had to modify the cluster dumper to handle
>>>> NamedVectors, as the AbstractCluster.formatVector(dictionary) method it
>>>> calls should handle it. I would expect to see the name produced by
>>>> AbstractCluster.formatVector(v,bindings) in the <clusterId, <weight,
>>>> vector>> tuples it outputs after the cluster description. Can you
>>>> verify
>>>> that this is not the case? If so can you help to further characterize
>>>> that?
>>>>
>>>> I understand you are in the middle of your MSc. Good luck with that!
>>>> Jeff
>>>>
>>>>
>>>>
>>>>
>>
>

Re: Transforming data for k-means analysis

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Hmm, this may also be caused by simply using the wrong input path to 
kmeans. It should be the directory containing your input vector sequence 
files, not the name of the file itself. This code is the first place in 
kmeans that the input path is queried. Judging by the fact that the 
index exception was for index 0 I don't think it found any of your 
sequence files.

We've had similar reports of this nature before. I'll see if I can
create a unit test to duplicate your problem and then fix it.
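
A quick sanity check is to list what kmeans will actually see under the
input path. A minimal sketch (plain Hadoop, not Mahout code; the directory
name is hypothetical):

// Sketch: list what the kmeans input path actually contains. The path passed
// to kmeans should be this directory of sequence files, not one file in it.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckKMeansInput {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path input = new Path("output/vectors"); // hypothetical directory name
    for (FileStatus status : fs.listStatus(input)) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
  }
}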

On 9/14/10 4:43 PM, Jeff Eastman wrote:
>  Hi Radek,
>
> Looking over your mapper code it looks mostly ok but I am curious why 
> you are writing the vector size in the context write's Text argument? 
> Aren't they all the same size? The Mahout document processing jobs 
> generally put the document ID in the key slot (and also sometimes in a 
> NV in the value slot). If you look at line 107 in the file, however, 
> you will see that the exception is likely the result of 
> "chosenTexts.get(i)". Looking upward at the reader loop above, it is 
> scanning through all the input vectors so the only way I can see a 
> bounds exception is if "k" is greater than the number of input 
> vectors. What value did you specify in mahout kmeans?
>
> Have you tried running this in a debugger? Maybe a simple test to 
> check that k > chosenTexts.size (and also chosenClusters)? Probably by 
> now you've found the cause on your own...
>
>
> On 9/14/10 3:45 PM, Radek Maciaszek wrote:
>> Hi Jeff,
>>
>> Thanks again for your help, I am starting to see the light at the end of
>> my MSc eventually. I think I broke something in Mahout again ;) Due to the
>> number of dimensions (around 14,000) I need to use sparse vectors (which
>> are wrapped inside NamedVectors). I used the logic from the syntheticdata
>> InputMapper, and the sequence files appear to be generated correctly -
>> well, at least I cannot see any errors in that process. However, as soon
>> as I am trying to pass that data to k-means clustering, the
>> RandomSeedGenerator class gives me the following error:
>> Exception in thread "main" java.lang.IndexOutOfBoundsException: 
>> Index: 0,
>> Size: 0
>>          at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>          at java.util.ArrayList.get(ArrayList.java:322)
>>          at
>> org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107) 
>>
>>
>> It appears that the following line generates the 
>> IndexOutOfBoundsException:
>> writer.append(chosenTexts.get(i), chosenClusters.get(i));
>>
>> I believe this is probably because the sequence files are not created
>> correctly, or perhaps the code in RandomSeedGenerator is not compatible
>> with the sequence file which I used... Here is the essence of the code I
>> am using to create the sequence file, namely the map() method from
>> InputMapper:
>>
>>    protected void map(LongWritable key, Text values, Context context) 
>> throws
>> IOException, InterruptedException {
>>
>>      String[] numbers = InputMapper.SPACE.split(values.toString());
>>      SequentialAccessSparseVector sparseVector = null;
>>      String keyName = "";
>>      int vectorSize = -1;
>>      for (String value : numbers) {
>>        if (keyName.equals("")) {
>>            keyName = value;
>>            continue;
>>        } else if (vectorSize == -1) {
>>            vectorSize = Integer.parseInt(value);
>>            sparseVector = new SequentialAccessSparseVector(vectorSize);
>>            continue;
>>        } else if (value.length() > 0) {
>>            String[] valuePair = InputMapper.COLON.split(value);
>>            if (!valuePair[1].equals("NULL")) {
>>              sparseVector.setQuick(Integer.parseInt(valuePair[0]),
>> Double.valueOf(valuePair[1]));
>>            }
>>        }
>>      }
>>      try {
>>        Vector result = new NamedVector(sparseVector, keyName);
>>        VectorWritable vectorWritable = new VectorWritable(result);
>>        context.write(new Text(String.valueOf(vectorSize)), 
>> vectorWritable);
>>
>>      } catch (Exception e) {
>>        throw new IllegalStateException(e);
>>      }
>>    }
>>
>> My input data, which the mapper analyzes, is in the format:
>> ID NumberOfDimensionsInVector Index1:Value1 Index5:Value5 
>> IndexY:ValueY...
>>
>> I am still trying to get my head around the implementation details of 
>> Mahout
>> and I find it a bit difficult to debug some things. Thank you in 
>> advance for
>> any tips.
>>
>> Best,
>> Radek
>>
>> On 9 September 2010 16:26, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>>
>>>   It's alive, how marvelous! On the number of clusters, I am 
>>> uncertain. You
>>> may indeed have uncovered a <choke> defect. Let's work on 
>>> characterizing
>>> that a bit more. I get that you ran the mahout kmeans command with 
>>> -k 600
>>> and only found 175 clusterIds referenced in the clusteredPoints 
>>> directory.
>>> How many clusters were in your -c directory? That would be the initial
>>> clusters produced by the RandomSeedGenerator. Try running cluster 
>>> dumper on
>>> that directory. If there are still only 175 clusters then the 
>>> generator has
>>> a problem.
>>>
>>> Canopy is a little hard to parametrize. If you are only getting a 
>>> single
>>> cluster out then the T2 distance you are using is too large. Try a 
>>> smaller
>>> value and the number of clusters should increase dramatically at 
>>> some point
>>> (in the limit to the number of vectors if T2=0). I use a binary 
>>> search to
>>> converge on this value. T1 is less fussy and needs only to be larger 
>>> than
>>> T2. It influences the number of nearby points that are not within T2 
>>> that
>>> also need to contribute to the cluster center. For subsequent k-Means
>>> processing, this is not so important.
>>>
>>> Finally, you should not have had to modify the cluster dumper to handle
>>> NamedVectors, as the AbstractCluster.formatVector(dictionary) method it
>>> calls should handle it. I would expect to see the name produced by
>>> AbstractCluster.formatVector(v,bindings) in the <clusterId, <weight,
>>> vector>> tuples it outputs after the cluster description. Can you 
>>> verify
>>> that this is not the case? If so can you help to further 
>>> characterize that?
>>>
>>> I understand you are in the middle of your MSc. Good luck with that!
>>> Jeff
>>>
>>>
>>>
>


Re: Transforming data for k-means analysis

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Hi Radek,

Looking over your mapper code it looks mostly ok but I am curious why 
you are writing the vector size in the context write's Text argument? 
Aren't they all the same size? The Mahout document processing jobs 
generally put the document ID in the key slot (and also sometimes in a 
NV in the value slot). If you look at line 107 in the file, however, you 
will see that the exception is likely the result of 
"chosenTexts.get(i)". Looking upward at the reader loop above, it is 
scanning through all the input vectors so the only way I can see a 
bounds exception is if "k" is greater than the number of input vectors. 
What value did you specify in mahout kmeans?

Have you tried running this in a debugger? Maybe a simple test to check 
that k > chosenTexts.size (and also chosenClusters)? Probably by now 
you've found the cause on your own...


On 9/14/10 3:45 PM, Radek Maciaszek wrote:
> Hi Jeff,
>
> Thanks again for your help, I am starting to see the light at the end of my
> MSc eventually. I think I broke something in Mahout again ;) Due to the
> number of dimensions (around 14,000) I need to use sparse vectors (which are
> wrapped inside NamedVectors). I used the logic from the syntheticdata
> InputMapper, and the sequence files appear to be generated correctly - well,
> at least I cannot see any errors in that process. However, as soon as I am
> trying to pass that data to k-means clustering, the RandomSeedGenerator
> class gives me the following error:
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0,
> Size: 0
>          at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>          at java.util.ArrayList.get(ArrayList.java:322)
>          at
> org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
>
> It appears that the following line generates the IndexOutOfBoundsException:
> writer.append(chosenTexts.get(i), chosenClusters.get(i));
>
> I believe this is probably because the sequence files are not created
> correctly, or perhaps the code in RandomSeedGenerator is not compatible with
> the sequence file which I used... Here is the essence of the code I am using
> to create the sequence file, namely the map() method from InputMapper:
>
>    protected void map(LongWritable key, Text values, Context context) throws
> IOException, InterruptedException {
>
>      String[] numbers = InputMapper.SPACE.split(values.toString());
>      SequentialAccessSparseVector sparseVector = null;
>      String keyName = "";
>      int vectorSize = -1;
>      for (String value : numbers) {
>        if (keyName.equals("")) {
>            keyName = value;
>            continue;
>        } else if (vectorSize == -1) {
>            vectorSize = Integer.parseInt(value);
>            sparseVector = new SequentialAccessSparseVector(vectorSize);
>            continue;
>        } else if (value.length() > 0) {
>            String[] valuePair = InputMapper.COLON.split(value);
>            if (!valuePair[1].equals("NULL")) {
>              sparseVector.setQuick(Integer.parseInt(valuePair[0]),
> Double.valueOf(valuePair[1]));
>            }
>        }
>      }
>      try {
>        Vector result = new NamedVector(sparseVector, keyName);
>        VectorWritable vectorWritable = new VectorWritable(result);
>        context.write(new Text(String.valueOf(vectorSize)), vectorWritable);
>
>      } catch (Exception e) {
>        throw new IllegalStateException(e);
>      }
>    }
>
> My input data, which the mapper analyzes, is in the format:
> ID NumberOfDimensionsInVector Index1:Value1 Index5:Value5 IndexY:ValueY...
>
> I am still trying to get my head around the implementation details of Mahout
> and I find it a bit difficult to debug some things. Thank you in advance for
> any tips.
>
> Best,
> Radek
>
> On 9 September 2010 16:26, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>
>>   It's alive, how marvelous! On the number of clusters, I am uncertain. You
>> may indeed have uncovered a <choke> defect. Let's work on characterizing
>> that a bit more. I get that you ran the mahout kmeans command with -k 600
>> and only found 175 clusterIds referenced in the clusteredPoints directory.
>> How many clusters were in your -c directory? That would be the initial
>> clusters produced by the RandomSeedGenerator. Try running cluster dumper on
>> that directory. If there are still only 175 clusters then the generator has
>> a problem.
>>
>> Canopy is a little hard to parametrize. If you are only getting a single
>> cluster out then the T2 distance you are using is too large. Try a smaller
>> value and the number of clusters should increase dramatically at some point
>> (in the limit to the number of vectors if T2=0). I use a binary search to
>> converge on this value. T1 is less fussy and needs only to be larger than
>> T2. It influences the number of nearby points that are not within T2 that
>> also need to contribute to the cluster center. For subsequent k-Means
>> processing, this is not so important.
>>
>> Finally, you should not have had to modify the cluster dumper to handle
>> NamedVectors, as the AbstractCluster.formatVector(dictionary) method it
>> calls should handle it. I would expect to see the name produced by
>> AbstractCluster.formatVector(v,bindings) in the <clusterId, <weight,
>> vector>> tuples it outputs after the cluster description. Can you verify
>> that this is not the case? If so can you help to further characterize that?
>>
>> I understand you are in the middle of your MSc. Good luck with that!
>> Jeff
>>
>>
>>


Re: Transforming data for k-means analysis

Posted by Radek Maciaszek <ra...@maciaszek.co.uk>.
Hi Jeff,

Thanks again for your help, I am starting to see the light at the end of my
MSc eventually. I think I broke something in Mahout again ;) Due to the
number of dimensions (around 14,000) I need to use sparse vectors (which are
wrapped inside NamedVectors). I used the logic from the syntheticdata
InputMapper, and the sequence files appear to be generated correctly - well,
at least I cannot see any errors in that process. However, as soon as I am
trying to pass that data to k-means clustering, the RandomSeedGenerator class
gives me the following error:
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0,
Size: 0
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at
org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)

It appears that the following line generates the IndexOutOfBoundsException:
writer.append(chosenTexts.get(i), chosenClusters.get(i));

I believe this is probably because the sequence files are not created
correctly, or perhaps the code in RandomSeedGenerator is not compatible with
the sequence file which I used... Here is the essence of the code I am using
to create the sequence file, namely the map() method from InputMapper:

  protected void map(LongWritable key, Text values, Context context) throws
IOException, InterruptedException {

    String[] numbers = InputMapper.SPACE.split(values.toString());
    SequentialAccessSparseVector sparseVector = null;
    String keyName = "";
    int vectorSize = -1;
    for (String value : numbers) {
      if (keyName.equals("")) {
          keyName = value;
          continue;
      } else if (vectorSize == -1) {
          vectorSize = Integer.parseInt(value);
          sparseVector = new SequentialAccessSparseVector(vectorSize);
          continue;
      } else if (value.length() > 0) {
          String[] valuePair = InputMapper.COLON.split(value);
          if (!valuePair[1].equals("NULL")) {
            sparseVector.setQuick(Integer.parseInt(valuePair[0]),
Double.valueOf(valuePair[1]));
          }
      }
    }
    try {
      Vector result = new NamedVector(sparseVector, keyName);
      VectorWritable vectorWritable = new VectorWritable(result);
      context.write(new Text(String.valueOf(vectorSize)), vectorWritable);

    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

My input data, which the mapper analyzes, is in the format:
ID NumberOfDimensionsInVector Index1:Value1 Index5:Value5 IndexY:ValueY...
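
For example, a single input line might look like this (hypothetical values),
where 1234 is the userId and 14000 the number of dimensions:
1234 14000 0:1.2 3:3.414 17:0.3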

I am still trying to get my head around the implementation details of Mahout
and I find it a bit difficult to debug some things. Thank you in advance for
any tips.

Best,
Radek

On 9 September 2010 16:26, Jeff Eastman <jd...@windwardsolutions.com> wrote:

>  It's alive, how marvelous! On the number of clusters, I am uncertain. You
> may indeed have uncovered a <choke> defect. Let's work on characterizing
> that a bit more. I get that you ran the mahout kmeans command with -k 600
> and only found 175 clusterIds referenced in the clusteredPoints directory.
> How many clusters were in your -c directory? That would be the initial
> clusters produced by the RandomSeedGenerator. Try running cluster dumper on
> that directory. If there are still only 175 clusters then the generator has
> a problem.
>
> Canopy is a little hard to parametrize. If you are only getting a single
> cluster out then the T2 distance you are using is too large. Try a smaller
> value and the number of clusters should increase dramatically at some point
> (in the limit to the number of vectors if T2=0). I use a binary search to
> converge on this value. T1 is less fussy and needs only to be larger than
> T2. It influences the number of nearby points that are not within T2 that
> also need to contribute to the cluster center. For subsequent k-Means
> processing, this is not so important.
>
> Finally, you should not have had to modify the cluster dumper to handle
> NamedVectors, as the AbstractCluster.formatVector(dictionary) method it
> calls should handle it. I would expect to see the name produced by
> AbstractCluster.formatVector(v,bindings) in the <clusterId, <weight,
> vector>> tuples it outputs after the cluster description. Can you verify
> that this is not the case? If so can you help to further characterize that?
>
> I understand you are in the middle of your MSc. Good luck with that!
> Jeff
>
>
>

Re: Transforming data for k-means analysis

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  It's alive, how marvelous! On the number of clusters, I am uncertain. 
You may indeed have uncovered a <choke> defect. Let's work on 
characterizing that a bit more. I get that you ran the mahout kmeans 
command with -k 600 and only found 175 clusterIds referenced in the 
clusteredPoints directory. How many clusters were in your -c directory? 
That would be the initial clusters produced by the RandomSeedGenerator. 
Try running cluster dumper on that directory. If there are still only 
175 clusters then the generator has a problem.

Canopy is a little hard to parametrize. If you are only getting a single 
cluster out then the T2 distance you are using is too large. Try a 
smaller value and the number of clusters should increase dramatically at 
some point (in the limit to the number of vectors if T2=0). I use a 
binary search to converge on this value. T1 is less fussy and needs only 
to be larger than T2. It influences the number of nearby points that are 
not within T2 that also need to contribute to the cluster center. For 
subsequent k-Means processing, this is not so important.
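
A sketch of that binary search over T2, with runCanopyAndCountClusters() as
a hypothetical stand-in for running Canopy at the given thresholds and
counting the clusters it produces:

public class T2Search {
  // Hypothetical stand-in: run Canopy with the given thresholds and return
  // how many clusters it produced (e.g. by counting the clusters-0 output).
  static int runCanopyAndCountClusters(double t1, double t2) {
    throw new UnsupportedOperationException("wire this up to your Canopy run");
  }

  // Binary search for a T2 that yields roughly the target cluster count.
  public static double findT2(double maxT2, int target) {
    double lo = 0.0;   // T2 = 0 gives the most clusters (one per vector)
    double hi = maxT2; // a large T2 gives very few clusters
    double t2 = hi;
    for (int iter = 0; iter < 20; iter++) {
      t2 = (lo + hi) / 2;
      int clusters = runCanopyAndCountClusters(t2 * 1.5, t2); // T1 > T2
      if (clusters < target) {
        hi = t2; // too few clusters: try a smaller T2
      } else if (clusters > target) {
        lo = t2; // too many clusters: try a larger T2
      } else {
        break;
      }
    }
    return t2;
  }
}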

Finally, you should not have had to modify the cluster dumper to handle 
NamedVectors, as the AbstractCluster.formatVector(dictionary) method it 
calls should handle it. I would expect to see the name produced by 
AbstractCluster.formatVector(v,bindings) in the <clusterId, <weight, 
vector>> tuples it outputs after the cluster description. Can you verify 
that this is not the case? If so can you help to further characterize that?

I understand you are in the middle of your MSc. Good luck with that!
Jeff


On 9/9/10 4:49 AM, Radek Maciaszek wrote:
> Hi Jeff,
>
> Phew! I managed to wrap the vectors with NamedVector. I also needed to
> slightly modify the ClusterDumper to make it aware of the NamedVector and in
> order to get both the userId and clusterId in the output. The most important
> thing is that it seems to work! I will stress test it with more data and
> will let you know the results.
>
> One thing which I noticed is that instead of the expected 600 clusters I can
> see only 175 in the clusteredPoints. So far I have tested it with about 81k
> vectors. Is that possible, or should it not happen and is it caused by some
> error?
>
> I was planning to use Canopy for preprocessing; however, I am not sure how
> to select the parameters for Canopy in order to get, for example, 600
> clusters. It is rather difficult for me to estimate the distance between
> points with thousands of dimensions. Are you familiar with some rules of
> thumb which can help here? I tried various parameters but always got just
> one cluster no matter what I tried.
>
> Jeff, many thanks for all your help! Rui, as promised I will write up a
> quick tutorial in a few weeks' time - my MSc has priority at the moment.
>
> Best,
> Radek
>
> On 8 September 2010 17:53, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>
>>   Hi Radek,
>>
>> The clustering code is pretty stable but we have been having some unit test
>> failures in unrelated code that may frustrate you. I suggest you do a
>> trunk checkout and then run "mvn clean install -DskipTests=true" to get a
>> build without running all the tests. After that, I suggest running
>> "examples/bin/build-reuters.sh" which will get you a dataset that you can
>> explore using the mahout command line API. If you are already past that and
>> still are having problems let me know and I will try to help.
>>
>> Jeff
>>
>>
>>
>> On 9/8/10 1:52 AM, rmx wrote:
>>
>>> Hi Radek,
>>> If you could post a tutorial, it would be fantastic.
>>> I am a Machine Learning researcher without enough Java programming skills
>>> to dig into the code.
>>> I find Mahout's potential really impressive, and if I can manage to get it
>>> working I would be up for convincing the rest of my research group to use
>>> it.
>>>
>>> Hi Jeff, yes, the problems I got were with the non-trunk version. 2 or 3
>>> weeks ago I tried to install trunk but I got some errors in the
>>> installation tests. I will try to do it again, since there is probably a
>>> new version.
>>>
>>> Thanks
>>> Rui
>>>
>>


Re: Transforming data for k-means analysis

Posted by Radek Maciaszek <ra...@gmail.com>.
Hi Jeff,

Phew! I managed to wrap the vectors with NamedVector. I also needed to
slightly modify the ClusterDumper to make it aware of the NamedVector and in
order to get both the userId and clusterId in the output. The most important
thing is that it seems to work! I will stress test it with more data and
will let you know the results.

One thing which I noticed is that instead of the expected 600 clusters I can
see only 175 in the clusteredPoints. So far I have tested it with about 81k
vectors. Is that possible, or should it not happen and is it caused by some
error?

I was planning to use Canopy for preprocessing; however, I am not sure how
to select the parameters for Canopy in order to get, for example, 600
clusters. It is rather difficult for me to estimate the distance between
points with thousands of dimensions. Are you familiar with some rules of
thumb which can help here? I tried various parameters but always got just
one cluster no matter what I tried.

Jeff, many thanks for all your help! Rui, as promised I will write up a
quick tutorial in a few weeks' time - my MSc has priority at the moment.

Best,
Radek

On 8 September 2010 17:53, Jeff Eastman <jd...@windwardsolutions.com> wrote:

>  Hi Radek,
>
> The clustering code is pretty stable but we have been having some unit test
> failures in unrelated code that may frustrate you. I suggest you do a
> trunk checkout and then run "mvn clean install -DskipTests=true" to get a
> build without running all the tests. After that, I suggest running
> "examples/bin/build-reuters.sh" which will get you a dataset that you can
> explore using the mahout command line API. If you are already past that and
> still are having problems let me know and I will try to help.
>
> Jeff
>
>
>
> On 9/8/10 1:52 AM, rmx wrote:
>
>> Hi Radek,
>> If you could post a tutorial, it would be fantastic.
>> I am a Machine Learning researcher without enough Java programming skills
>> to dig into the code.
>> I find Mahout's potential really impressive, and if I can manage to get it
>> working I would be up for convincing the rest of my research group to use
>> it.
>>
>> Hi Jeff, yes, the problems I got were with the non-trunk version. 2 or 3
>> weeks ago I tried to install trunk but I got some errors in the
>> installation tests. I will try to do it again, since there is probably a
>> new version.
>>
>> Thanks
>> Rui
>>
>
>

Re: Transforming data for k-means analysis

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Hi Radek,

The clustering code is pretty stable but we have been having some unit 
test failures in unrelated code that may frustrate you. I suggest you do a
trunk checkout and then run "mvn clean install 
-DskipTests=true" to get a build without running all the tests. After 
that, I suggest running "examples/bin/build-reuters.sh" which will get 
you a dataset that you can explore using the mahout command line API. If 
you are already past that and still are having problems let me know and 
I will try to help.

Jeff


On 9/8/10 1:52 AM, rmx wrote:
> Hi Radek,
> If you could post a tutorial, it would be fantastic.
> I am a Machine Learning researcher without enough Java programming skills
> to dig into the code.
> I find Mahout's potential really impressive, and if I can manage to get it
> working I would be up for convincing the rest of my research group to use it.
>
> Hi Jeff, yes, the problems I got were with the non-trunk version. 2 or 3
> weeks ago I tried to install trunk but I got some errors in the
> installation tests. I will try to do it again, since there is probably a
> new version.
>
> Thanks
> Rui


Re: Transforming data for k-means analysis

Posted by rmx <ru...@hotmail.com>.
Hi Radek,
If you could post a tutorial, it would be fantastic.
I am a Machine Learning researcher without enough Java programming skills to
dig into the code.
I find Mahout's potential really impressive, and if I can manage to get it
working I would be up for convincing the rest of my research group to use it.

Hi Jeff, yes, the problems I got were with the non-trunk version. 2 or 3 weeks
ago I tried to install trunk but I got some errors in the installation
tests. I will try to do it again, since there is probably a new version.

Thanks
Rui

Re: Transforming data for k-means analysis

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  On 9/7/10 11:43 AM, Radek Maciaszek wrote:
> Hi Jeff, Rui,
>
> Jeff, thanks for a prompt reply. I tried your suggestion and I eventually
> succeeded! I tried two things. First I ran the whole analysis in "trunk" on
> the syntheticdata example but with custom parameters. However, whenever I
> tried to pass custom parameters to the kmeans job I was getting errors (I
> will include them below). Then I tried Mahout 0.3 but there were some issues
> with this as well.
With Mahout in such a state of flux you are almost always better off on
trunk. I'd need to see your actual command line invocation to help with
the number format exception. It appears to have found the argument but
the value was null?
> It appears that (in trunk) the parseInt code produces the following error:
> Exception in thread "main" java.lang.NumberFormatException: null
>          at java.lang.Integer.parseInt(Integer.java:417)
>          at java.lang.Integer.parseInt(Integer.java:499)
>          at
> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.run(Job.java:217)
>          at
> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:49)
>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>          at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>          at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>          at java.lang.reflect.Method.invoke(Method.java:597)
>          at
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>          at
> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>          at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:175)
>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>          at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>          at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>          at java.lang.reflect.Method.invoke(Method.java:597)
>          at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> I slightly changed that file, so the line numbers may be wrong, but the
> error seems to be caused by this line:
> clusters = RandomSeedGenerator.buildRandom(input, clusters,
> Integer.parseInt(argMap.get(DefaultOptionCreator.NUM_CLUSTERS_OPTION)),
> measure);
>
> for now I just hardcoded the actual number of clusters I need here, and
> this seems to work well enough.
>
> My current issue is that I am trying to get the list of points from my
> clusters. That is, I can see the clusters output from mahout clusterdump but
> I don't know how to read that data in order to see which points belong to
> which cluster. Here is sample output from clusterdump:
> VL-1484{n=192 c=[99:-3.837] r=[99:1.138]}
> VL-80153{n=36 c=[3804:5.833] r=[3804:2.263]}
> VL-10247{n=1 c=[1725:8.296] r=[0.000, 0.000, 0.000, 0.000, 0.000, 0.000,
> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000,
> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000,
> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000,
> 0.000, 0.000, 0.000, 0.000, 0.000.....
> (those zero vectors tend to be really long)
>
> So what I am looking for is something like:
> clusterId | row number in the input data
The clustering jobs all produce sequence files of clusteredPoints that 
have key=clusterId and value=WeightedVectorWritable. If you modify your 
input data processing to wrap NamedVectors (name = userId) around your 
input data points then this will propagate through to the output 
clustering and the WeightedVectorWritables will contain your 
NamedVectors with your userIds.
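
A minimal sketch of reading those <clusterId, WeightedVectorWritable> pairs
back to produce a clusterId | userId listing (the key type and part-file
path here are assumptions and may vary by Mahout version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.Vector;

public class DumpClusteredPoints {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path part = new Path("output/clusteredPoints/part-00000"); // hypothetical
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    IntWritable clusterId = new IntWritable();
    WeightedVectorWritable point = new WeightedVectorWritable();
    while (reader.next(clusterId, point)) {
      Vector v = point.getVector();
      if (v instanceof NamedVector) { // the vector's name carries the userId
        System.out.println(clusterId.get() + "\t" + ((NamedVector) v).getName());
      }
    }
    reader.close();
  }
}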
> Each row is defined for a specific user, so it would be even better if I
> could somehow map them as:
> clusterId | userId. I guess I would have to use a dictionary for this? If
> you can point me in any direction with some sample code which shows how to
> read the clustered output, that would be great.
>
> P.S. Rui, I may later put up a quick tutorial on how I managed to run my
> analysis if you think it would be useful.
>
> Once again thank you for your help and for any further suggestions,
> Radek
>
You bet. Glad you are making progress, even if it is slow.
Jeff

Re: Transforming data for k-means analysis

Posted by Radek Maciaszek <ra...@maciaszek.co.uk>.
Hi Jeff, Rui,

Jeff, thanks for a prompt reply. I tried your suggestion and I eventually
succeeded! I tried two things. First I ran the whole analysis in "trunk" on
the syntheticdata example but with custom parameters. However, whenever I
tried to pass custom parameters to the kmeans job I was getting errors (I
will include them below). Then I tried Mahout 0.3 but there were some issues
with this as well.

It appears that (in trunk) the parseInt code produces the following error:
Exception in thread "main" java.lang.NumberFormatException: null
        at java.lang.Integer.parseInt(Integer.java:417)
        at java.lang.Integer.parseInt(Integer.java:499)
        at
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.run(Job.java:217)
        at
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:49)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at
org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:175)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I slightly changed that file, so the line numbers may be wrong, but the
error seems to be caused by this line:
clusters = RandomSeedGenerator.buildRandom(input, clusters,
Integer.parseInt(argMap.get(DefaultOptionCreator.NUM_CLUSTERS_OPTION)),
measure);

for now I just hardcoded the actual number of clusters I need here, and
this seems to work well enough.
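
For what it's worth, "NumberFormatException: null" is exactly what
Integer.parseInt throws when handed a null string, so the -k value most
likely never made it into argMap. A small guard makes the failure clearer
(a sketch; the helper is hypothetical):

// Sketch of a clearer failure: Integer.parseInt(null) is what produces
// "NumberFormatException: null". This helper is hypothetical.
static int requiredInt(java.util.Map<String, String> argMap, String key) {
  String value = argMap.get(key);
  if (value == null) {
    // fail loudly instead of with an opaque NumberFormatException
    throw new IllegalArgumentException("option missing from argMap: " + key);
  }
  return Integer.parseInt(value);
}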

My current issue is that I am trying to get the list of points from my
clusters. That is, I can see the clusters output from mahout clusterdump but
I don't know how to read that data in order to see which points belong to
which cluster. Here is sample output from clusterdump:
VL-1484{n=192 c=[99:-3.837] r=[99:1.138]}
VL-80153{n=36 c=[3804:5.833] r=[3804:2.263]}
VL-10247{n=1 c=[1725:8.296] r=[0.000, 0.000, 0.000, 0.000, 0.000, 0.000,
0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000,
0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000,
0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000,
0.000, 0.000, 0.000, 0.000, 0.000.....
(those zero vectors tend to be really long)

So what I am looking for is something like:
clusterId | row number in the input data

Each row is defined for a specific user, so it would be even better if I
could somehow map them as:
clusterId | userId. I guess I would have to use a dictionary for this? If you
can point me in any direction with some sample code which shows how to read
the clustered output, that would be great.

P.S. Rui, I may later put up a quick tutorial on how I managed to run my
analysis if you think it would be useful.

Once again thank you for your help and for any further suggestions,
Radek


On 7 September 2010 18:30, Jeff Eastman <jd...@windwardsolutions.com> wrote:

>  When you run kmeans from the command line with a -k value, the run()
> method calls the RandomSeedGenerator before calling the job() method to run
> the iterations. It's only when using the job() method directly from user
> code that you would perhaps want to use the RandomSeedGenerator (or Canopy)
> to populate the clusters in the -ci directory. So, yes, from the command
> line the driver already does it.
>
> I suggested looking at the InputDriver code as that is what converts the
> space-delimited synthetic control text file to Mahout's VectorWritable
> sequence file format. Once you have data in that format you should be good
> to go with any of the clustering implementations.
>
>
> On 9/7/10 10:17 AM, rmx wrote:
>
>> Hi Radek,
>>
>> If you do not want to use the script, you can run the kmeans driver
>> directly from the command line.
>> I think first you need to convert your dataset to a Mahout vector format.
>> Then you need to convert it to sequence file format. Only after that can
>> you run the driver over your sequence file.
>> I have been trying to do this but have never been successful. Let me know
>> if you manage it...
>>
>> Jeff: when using the kmeans driver from the command line with a -k value,
>> do you need to use RandomSeedGenerator.buildRandom()? I thought the driver
>> already does it.
>>
>> Best,
>> Rui
>>
>
>

Re: Transforming data for k-means analysis

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  When you run kmeans from the command line with a -k value, the run() 
method calls the RandomSeedGenerator before calling the job() method to 
run the iterations. It's only when using the job() method directly from 
user code that you would perhaps want to use the RandomSeedGenerator (or 
Canopy) to populate the clusters in the -ci directory. So, yes, from the 
command line the driver already does it.

I suggested looking at the InputDriver code as that is what converts the
space-delimited synthetic control text file to Mahout's VectorWritable
sequence file format. Once you have data in that format you should be good
to go with any of the clustering implementations.

On 9/7/10 10:17 AM, rmx wrote:
> Hi Radek,
>
> If you do not want to use the script, you can run the kmeans driver directly
> from the command line.
> I think first you need to convert your dataset to a Mahout vector format.
> Then you need to convert it to sequence file format. Only after that can you
> run the driver over your sequence file.
> I have been trying to do this but have never been successful. Let me know if
> you manage it...
>
> Jeff: when using the kmeans driver from the command line with a -k value, do
> you need to use RandomSeedGenerator.buildRandom()? I thought the driver
> already does it.
>
> Best,
> Rui


Re: Transforming data for k-means analysis

Posted by rmx <ru...@hotmail.com>.
Hi Radek,

If you do not want to use the script, you can run the kmeans driver directly
from the command line.
I think first you need to convert your dataset to a Mahout vector format.
Then you need to convert it to sequence file format. Only after that can you
run the driver over your sequence file.
I have been trying to do this but have never been successful. Let me know if
you manage it...

Jeff: when using the kmeans driver from the command line with a -k value, do
you need to use RandomSeedGenerator.buildRandom()? I thought the driver
already does it.

Best,
Rui

Re: Transforming data for k-means analysis

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Hi Radek,

I think you are on the right track building off of the synthetic control 
example. It has an initial pre-processing step (canopy.InputDriver) that 
converts space-delimited text files into Mahout VectorWritable sequence 
files that are suitable for input to Canopy and k-Means. It could be as 
simple as changing your delimiter from tab to space or you might need to 
write your own pre-processor. The kmeans.Job file runs this job then 
fires off Canopy to produce the initial clusters. You will need to play 
with the T1 and T2 values in this step in order to get the number of 
clusters you want (~20). You can skip this step if you know a value of k 
that you want; simply add the -k argument to the mahout kmeans command
and run it from the command line. That will randomly sample your dataset
to determine the initial cluster centers. (Sorry, the KMeansDriver public
methods expect the initial clusters to be in the -ci directory already
and don't allow the sampling, but there is
RandomSeedGenerator.buildRandom() which you can use to produce these
from your input data.)
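
A rough sketch of that programmatic route; the buildRandom(input, clusters,
k, measure) call mirrors the one quoted earlier in this thread, the paths
are hypothetical, and exact signatures vary across Mahout versions:

// Sketch: sample k initial centers the way the -k option does; the resulting
// directory can then be passed to the kmeans job as its -c argument.
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class SeedInitialClusters {
  public static void main(String[] args) throws Exception {
    Path input = new Path("output/vectors"); // VectorWritable sequence files
    Path clustersIn = new Path("output/random-seeds");
    RandomSeedGenerator.buildRandom(input, clustersIn, 20,
        new EuclideanDistanceMeasure());
  }
}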

Let me know how this works for you,
Jeff


On 9/6/10 5:45 AM, Radek Maciaszek wrote:
> Hi,
>
> I am trying to use Mahout for my MSc project. I successfully ran all the
> clustering examples and am now trying to analyse some of my own data,
> unfortunately without much success.
>
> The input data I want to cluster is a list of vectors in tab-separated
> format:
> 1.2   0.0   0.0  3.414
> 0.0   0.4   0.0   0.3
> 16.2  0.0   0.0   0.0
> etc.
> I generated this file in Python and can easily change it to comma-separated
> format or make any other necessary changes. It is a rather large file, with
> many thousands of dimensions and millions of rows, containing TF/IDF values
> calculated for users and the URLs they visited (each row is a user and each
> column a URL). Each row is a sparse vector.
>
> I would like to cluster the users into 20+ clusters using k-means. I am now
> having problems running clustering on this data. To begin with, I simply put
> this file in place of the "testdata" file on Hadoop (originally
> synthetic_control.data) and ran "mahout
> org.apache.mahout.clustering.syntheticcontrol.canopy.Job". I was hoping to
> reuse the existing scripts, but that unfortunately gives me some null
> pointer exceptions.
>
> What would be the fastest/best way of analysing this matrix in order to
> group the rows into clusters?
>
> Many thanks for your advice,
> Radek
>