Posted to user@mahout.apache.org by Vishal Danech <vi...@gmail.com> on 2013/09/06 06:27:52 UTC

Mahout readable output

Hi

I have custom log data that contains the following fields:

1) UserName
2) MachineId
3) DateTime
4) Data - free text, such as search terms

I would like to use this data to find out
     #) how much time users spend on browsing etc.
     #) user-based search patterns

The first problem can be addressed with a Hive query.

For the second problem, I suppose clustering can be applied, and for this I
have converted the data to vectors. I used dense vectors and ran the Canopy
algorithm on them. I passed the resulting output to the ClusterDump utility,
but the output I got was not in readable form. I figured out that I need to
use named vectors so that the key is displayed in the output. This is where
I am stuck: how do I use NamedVector?

I perform the following steps to generate the vectors:
     #) Created a custom VectorIterable by implementing Iterable<Vector>.
     #) Created a custom VectorIterator by extending AbstractIterator<Vector>.
     #) Created a model class responsible for passing attribute values
(username, data, etc.) to the custom VectorIterator.
     #) The custom VectorIterator.computeNext() reads a line and creates a
dense vector whose size equals the number of attributes in a row.

Please let me know how to use NamedVector here so that I can get
readable output from the ClusterDump utility.

-- 
Thanks and Regards
Vishal Danech
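
For the computeNext() step described above, a minimal sketch of attaching a
readable key to each row might look like the following. It uses only the JDK
so that it runs as-is. The tab-separated layout, the timestamp pattern, the
two numeric features (hour of day and number of query terms) and the class
names LabeledRow and LogLineParser are all assumptions made for illustration;
none of them come from the original post. With mahout-math on the classpath,
the same name and values would instead be wrapped as
new NamedVector(new DenseVector(values), name), which should let the key
show up next to each point in the ClusterDump output.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;

// Hypothetical stand-in for Mahout's NamedVector: a label plus numeric values.
final class LabeledRow {
    final String name;
    final double[] values;

    LabeledRow(String name, double[] values) {
        this.name = name;
        this.values = values;
    }
}

class LogLineParser {
    // Assumed timestamp layout; adjust it to the real log format.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Parses "UserName<TAB>MachineId<TAB>DateTime<TAB>Data" (assumed layout).
    static LabeledRow parse(String line) {
        String[] fields = line.split("\t", 4);
        if (fields.length != 4) {
            throw new IllegalArgumentException("expected 4 tab-separated fields: " + line);
        }
        LocalDateTime when = LocalDateTime.parse(fields[2], FORMAT);
        String data = fields[3].trim();
        int termCount = data.isEmpty() ? 0 : data.split("\\s+").length;

        // Illustrative numeric features only: hour of day and query length.
        double[] values = { when.getHour(), termCount };

        // The key that makes the output readable: user and machine.
        String name = fields[0] + "@" + fields[1];

        // With Mahout available, this is where the wrapping would happen:
        //   new NamedVector(new DenseVector(values), name)
        return new LabeledRow(name, values);
    }

    public static void main(String[] args) {
        LabeledRow row = parse("alice\tpc-17\t2013-09-06 06:27:52\tmahout named vector");
        System.out.println(row.name + " -> " + Arrays.toString(row.values));
    }
}
```

The point of the sketch is that the name is chosen once per input line, at
the same place where the dense vector is built, so the iterator can keep
returning Vector while each element carries its own key.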

Re: Mahout readable output

Posted by Ted Dunning <te...@gmail.com>.

Darius's comments are good.

You also have to think about what similar means to you.  From the data you
describe, I see several possibilities:

- geo-location from machine id (if it includes IP address)

- content from the query

- frequency of posting

- diurnal phase of posting (tells us time zone)

Once you know what similar means, you can meaningfully talk about next
steps.

If you assume that only query content matters, then I see several ways
to go:

- cluster directly based on query histories using IDF weighting (likely to
be kinda sorta lousy results)

- use cooccurrence analysis to augment query histories and repeat the
clustering

- use SVD or ALS to generate user vectors and query term vectors and
cluster users using user vectors and then look for coherence.
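
The first option in the list above (query histories with IDF weighting) can
be sketched roughly as follows. This is only an illustration under stated
assumptions: each user's history is treated as one document, the weight is
term count times ln(N / df) where df is the number of users whose history
contains the term, and the class name QueryIdf and the sample data are made
up. Other IDF variants (smoothing, different log bases) are equally plausible.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class QueryIdf {
    // idf(term) = ln(N / df(term)); df counts users whose history has the term.
    static Map<String, Double> idf(Map<String, List<String>> histories) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> terms : histories.values()) {
            // A set ensures each user counts at most once per term.
            for (String term : new HashSet<>(terms)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        int userCount = histories.size();
        Map<String, Double> idf = new TreeMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) userCount / e.getValue()));
        }
        return idf;
    }

    // Sparse user vector: term -> (occurrences in this history) * idf(term).
    static Map<String, Double> weigh(List<String> terms, Map<String, Double> idf) {
        Map<String, Double> weights = new TreeMap<>();
        for (String term : terms) {
            // Adding idf once per occurrence equals count * idf.
            weights.merge(term, idf.getOrDefault(term, 0.0), Double::sum);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Made-up query histories, one list of terms per user.
        Map<String, List<String>> histories = new LinkedHashMap<>();
        histories.put("alice", Arrays.asList("mahout", "clustering", "mahout"));
        histories.put("bob", Arrays.asList("mahout", "hive"));
        histories.put("carol", Arrays.asList("hive", "query"));

        Map<String, Double> idf = idf(histories);
        for (Map.Entry<String, List<String>> e : histories.entrySet()) {
            System.out.println(e.getKey() + " -> " + weigh(e.getValue(), idf));
        }
    }
}
```

The resulting per-user maps are sparse vectors keyed by term, so a term that
every user searches for gets weight zero and drops out of the comparison.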

If you want to use geo, the question of scaling comes in.

If you want to use time, you have to derive some sort of features.  I find
latent variable methods useful for this.
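
As a much simpler stand-in for the latent variable methods mentioned above,
one hand-made time feature is a cyclical encoding of the time of day, which
captures the diurnal phase of posting. The sketch below is an assumption for
illustration, not the method suggested in this reply: it maps a clock time to
a point on the unit circle so that 23:30 and 00:30 end up close together. The
class name DiurnalFeatures and its methods are hypothetical.

```java
class DiurnalFeatures {
    // Maps a time of day to (sin, cos) of its angle on a 24-hour circle.
    static double[] encode(int hour, int minute) {
        double fractionOfDay = (hour * 60 + minute) / 1440.0;
        double angle = 2 * Math.PI * fractionOfDay;
        return new double[] { Math.sin(angle), Math.cos(angle) };
    }

    // Plain Euclidean distance between two encoded times.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] lateEvening = encode(23, 30);
        double[] afterMidnight = encode(0, 30);
        double[] noon = encode(12, 0);

        // Times on either side of midnight are near each other on the circle.
        System.out.println("23:30 vs 00:30: " + distance(lateEvening, afterMidnight));
        System.out.println("23:30 vs 12:00: " + distance(lateEvening, noon));
    }
}
```

Using the raw hour as a single number would put 23 and 0 at opposite ends of
the scale; the two-component encoding avoids that discontinuity.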




Re: Mahout readable output

Posted by Vishal Danech <vi...@gmail.com>.

Hi Darius

Thanks for your reply.

I based my program on the sample tool that Mahout provides for creating
vectors from Weka's ARFF format. I am able to compile the code and to
generate vectors. I have also used the resulting vector files as input to
the Canopy algorithm. The problem I am facing is how to interpret the
Canopy results.

"Creating vectors from Weka's ARFF format"
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-utils/0.2/org/apache/mahout/utils/vectors/arff/Driver.java


Attribute-Relation File Format
(ARFF)<http://www.cs.waikato.ac.nz/~ml/weka/arff.html>

Please let me know if you require more information.

Thanks

Vishal




-- 
Thanks and Regards
Vishal Danech

Re: Mahout readable output

Posted by Darius Miliauskas <da...@gmail.com>.

Dear Vishal,

Can you share the code showing how you performed the steps you mentioned:

 #) Created custom VectorIterable by inheriting Iterable<Vector>.
 #) Created custom VectorIterator by inheriting AbstractIterator<Vector>
 #) Model class which will be responsible to pass attribute values
(username or data etc) to custom VectorIterator
 #) Custom VectorIterator.computeNext() will read line, create dense
vector having size equal to number of attribute in a row.

Can you compile the code?


Best,

Darius


