Posted to user@mahout.apache.org by Chirag Lakhani <cl...@zaloni.com> on 2014/03/02 19:31:06 UTC
sparsification of a Mahout vector
Hi,
I was wondering if there is a simple way to sparsify a vector in Mahout. I
basically have an n-dimensional vector (currently a DenseVector) and I want
to develop a method that sparsifies it by keeping only the largest s values
of the vector and setting the rest to 0. Is there a simple solution to
this given all that is included in the Vector class or do I need to create
my own method?
Chirag
--
*Chirag Lakhani*
Data Scientist
Zaloni, Inc. | www.zaloni.com
633 Davis Dr., Suite 200
Durham, NC 27713
e: clakhani@zaloni.com
p: 919.602.4965 x7020
Re: sparsification of a Mahout vector
Posted by Ted Dunning <te...@gmail.com>.
Chirag,
There isn't a ready-made method for this, but there are components that
can help you. For instance, the OnlineSummarizer can estimate a particular
quantile of the values. Filling it by iterating over the vector is easy
enough. For example:
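import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.stats.OnlineSummarizer;

Vector v; // original data, assumed to be initialized elsewhere
OnlineSummarizer s = new OnlineSummarizer();
// feed every element, zeros included, into the summarizer
for (Vector.Element e : v.all()) {
  s.add(e.get());
}
// pick any cutoff you like; 0.99 keeps roughly the top 1% of values
double cutoff = s.quantile(0.99);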
Then you can use this cutoff to copy only the items you need:
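// sparse vector with the same cardinality as the original
Vector r = new RandomAccessSparseVector(v.size());
for (Vector.Element e : v.all()) {
  double vi = e.get();
  // copy only the values strictly above the cutoff
  if (vi > cutoff) {
    r.set(e.index(), vi);
  }
}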
Note that if you want a genuinely sparse result, you have to perform a
selective copy. Setting elements of a DenseVector to zero does not reduce
its storage, because a DenseVector allocates space for every element
regardless of value. Only by copying selectively to a new vector of a
sparse type can you get the desired effect.
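The quantile cutoff above keeps approximately a fraction of the values, not an exact count. If exactly the s largest values are needed, one option is a bounded min-heap over the indices. The following is only a sketch under assumptions: it works on a plain double[] so that it runs without Mahout on the classpath, and the class name TopS and method keepLargest are hypothetical helpers, not part of Mahout.

```java
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of the Mahout API.
final class TopS {
    // Returns index -> value for the s largest entries of data.
    // Ties at the boundary are broken in favor of earlier indices.
    static Map<Integer, Double> keepLargest(double[] data, int s) {
        Map<Integer, Double> kept = new TreeMap<>();
        if (s <= 0) {
            return kept;
        }
        // Min-heap of indices ordered by value: the head is always the
        // smallest value among those currently kept.
        PriorityQueue<Integer> heap =
            new PriorityQueue<>((a, b) -> Double.compare(data[a], data[b]));
        for (int i = 0; i < data.length; i++) {
            if (heap.size() < s) {
                heap.add(i);
            } else if (data[i] > data[heap.peek()]) {
                // Current value beats the smallest kept value: swap it in.
                heap.poll();
                heap.add(i);
            }
        }
        for (int i : heap) {
            kept.put(i, data[i]);
        }
        return kept;
    }
}
```

To end up with a Mahout vector, the kept entries could then be copied with r.set(index, value) into a new RandomAccessSparseVector(v.size()), as in the selective copy above. The heap holds at most s indices, so the pass costs O(n log s) time.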