
Posted to user@mahout.apache.org by Haddad Said <ha...@gmail.com> on 2013/01/10 07:10:24 UTC

Representing key value dataset into Mahout vector

Hi,

I have a data set in CSV format that consists of key-value pairs. The data
set is huge, and the values are a mixture of integers and short strings (i.e.
not lengthy texts, but rather keywords). I want to process it using
Mahout's clustering algorithms.

The issue is converting this CSV into vectors that can be consumed by
Mahout. I have been reading "Mahout in Action" and there seem to be two
options for vectorizing: using numeric values with Mahout's DenseVector,
RandomAccessSparseVector, or SequentialAccessSparseVector implementations,
or using the Vector Space Model to vectorize text documents.

The data I want to vectorize is not really a text document, but because it
is a huge data set with many different keys and values, it is difficult to
map it to numeric values. What is the best way to vectorize this kind of
data for use in Mahout?

Any pointers would be appreciated.

Thanks

Re: Representing key value dataset into Mahout vector

Posted by Haddad Said <ha...@gmail.com>.

Hi Ted

Thanks for the response. I had a quick look at chapter 14, and that part of
the book is about classification, i.e. supervised learning that involves
training. I want to run an unsupervised learning algorithm on the data,
because I don't have any training data. That is why I was looking at
clustering.

Actually, from my reading, it seems that Apriori or FP-growth would be the
most useful algorithms for extracting information from this data, but it
seems these algorithms have not been implemented in Mahout yet. So I guess
the question to ask is: given that I have data as key-value pairs where both
keys and values are strings, what unsupervised algorithms are available in
Mahout that I can use to learn about this data?

Many thanks

Haddad

On 10 January 2013 07:05, Ted Dunning <te...@gmail.com> wrote:

> Look at the last third of the book, especially chapter 14.
>
> One important thing to check is whether your integers represent codes or
> actually represent numbers.  Codes should be encoded as key words.
>
> Hashed vector encoding should work quite well.

Re: Representing key value dataset into Mahout vector

Posted by Ted Dunning <te...@gmail.com>.

Look at the last third of the book, especially chapter 14.

One important thing to check is whether your integers represent codes or
actually represent numbers.  Codes should be encoded as key words.

Hashed vector encoding should work quite well.
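
To illustrate the advice above, here is a minimal sketch of hashed feature
encoding for key-value records. It is plain Python and does not use Mahout's
own encoder classes. The function names, the vector size of 64, the use of two
probes, and the sample record are all assumptions made for the example. Keys
listed in numeric_keys are treated as real quantities and contribute their
value as a weight. Every other value, including integer codes such as a zip
code, is treated as a keyword and hashed as the string "key=value" with
weight 1.

```python
import hashlib


def hashed_index(name, size, probe=0):
    """Map a feature name to a stable index in [0, size)."""
    digest = hashlib.md5(f"{probe}:{name}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def encode_record(record, size=64, numeric_keys=(), probes=2):
    """Encode a dict of key/value pairs as a fixed-length list of floats.

    Keys in numeric_keys hold real numbers: the key is hashed and the
    value is added as the weight. All other values are treated as codes
    or keywords: the string "key=value" is hashed with weight 1.0.
    """
    vector = [0.0] * size
    for key, value in record.items():
        if key in numeric_keys:
            name, weight = key, float(value)
        else:
            name, weight = f"{key}={value}", 1.0
        # Several probes per feature soften the effect of hash collisions.
        for probe in range(probes):
            vector[hashed_index(name, size, probe)] += weight
    return vector


# Hypothetical record: "zip" is an integer code, "age" is a real number.
record = {"country": "UK", "zip": 90210, "age": 34}
vec = encode_record(record, size=64, numeric_keys={"age"})
```

The vector length is fixed up front, so no dictionary of keys and values has
to be built before encoding, which is the main attraction for a huge data set
with many distinct keys. The cost is that distinct features can collide in the
same slot. A larger size makes collisions rarer.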
