You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@mahout.apache.org by Radek Maciaszek <ra...@gmail.com> on 2010/09/06 14:45:52 UTC

Tranforming data for k-means analysis

Hi,

I am trying to use Mahout for my MSc project. I successfully run all
clustering examples and I now am trying to analyse some of my data,
unfortunately without much of success.

Input data which I want to cluster is a list of vectors in a tab separated
format:
1.2   0.0   0.0  3.414
0.0   0.4   0.0   0.3
16.2  0.0   0.0   0.0
etc.
I generated this file in python and can easily change it to be in comma
separated format or make any necessary changes. It is rather a large file
with many thousands of dimensions and millions of rows and it contains
TF/IDF numbers calculated for users and URLs they visited (each row is a
user and each column a URL). Each rows is a sparse vector.

I would like to cluster users using kmeans into 20+ clusters. Now I am
having problems with running clustering on this data. On the beginning I
tried simply to put this file instead of a "testdata" filename on hadoop
(originally synthetic_control.data) and was running "mahout
org.apache.mahout.clustering.syntheticcontrol.canopy.Job". I was hoping to
reuse the existing scripts but that unfortunately gives me some null pointer
exceptions.

What would be the fastest/best way of analysing this matrix in order to
group the rows into clusters?

Many thanks for your advice,
Radek

Re: Tranforming data for k-means analysis

Posted by Ted Dunning <te...@gmail.com>.

Glad we could help.

On Tue, Jul 5, 2011 at 7:09 AM, Radek Maciaszek <ra...@maciaszek.co.uk>wrote:

> Hello,
>
> I worked in the past on MSc project which involved quite a lot of Mahout
> calculation. I finished it a while ago but only recently got my head around
> posting it somewhere online.
>
> It would be much more difficult to finish this work without the help from
> this list so I wanted to say thank you! I thought that perhaps someone will
> find my code and research interesting so here it is.
>
> The paper is on "How much behavioural targeting can help online
> advertising". There are quite many calculations involved which were written
> mostly in Python and Hadoop/Hive and the clustering was performed by
> Mahout.
> http://www.dataminelab.com/blog/behavioural-targeting-online/
>
> Many thanks!
> Radek
>
> On 8 September 2010 09:52, rmx <ru...@hotmail.com> wrote:
>
> >
> > Hi Radek,
> > If you could post a tutorial, it would be fantastic.
> > I am a Machine Learning researcher without enough java programming skills
> > to
> > dig the code.
> > I found Mahout potential really impressive and if I could manage to work
> it
> > I would be up to convince the rest of research group to use it.
> >
> > Hi Jeff, yes the problems I got was from the non truck version. 2 or 3
> > weeks
> > ago I tried to install Truck but I got some errors on the installation
> > tests. I will try to do it again, since probably there is a new version.
> >
> > Thanks
> > Rui
>

Re: Tranforming data for k-means analysis

Posted by Radek Maciaszek <ra...@maciaszek.co.uk>.

Hello,

I worked in the past on MSc project which involved quite a lot of Mahout
calculation. I finished it a while ago but only recently got my head around
posting it somewhere online.

It would be much more difficult to finish this work without the help from
this list so I wanted to say thank you! I thought that perhaps someone will
find my code and research interesting so here it is.

The paper is on "How much behavioural targeting can help online
advertising". There are quite many calculations involved which were written
mostly in Python and Hadoop/Hive and the clustering was performed by Mahout.
http://www.dataminelab.com/blog/behavioural-targeting-online/

Many thanks!
Radek

On 8 September 2010 09:52, rmx <ru...@hotmail.com> wrote:

>
> Hi Radek,
> If you could post a tutorial, it would be fantastic.
> I am a Machine Learning researcher without enough java programming skills
> to
> dig the code.
> I found Mahout potential really impressive and if I could manage to work it
> I would be up to convince the rest of research group to use it.
>
> Hi Jeff, yes the problems I got was from the non truck version. 2 or 3
> weeks
> ago I tried to install Truck but I got some errors on the installation
> tests. I will try to do it again, since probably there is a new version.
>
> Thanks
> Rui