Posted to user@mahout.apache.org by Ashwini P <as...@gmail.com> on 2013/08/12 06:24:02 UTC

Help regarding Seq2sparse utility

Hello,

I am new to Mahout. I want to know how I can get the list of features that
were extracted from the corpus by seq2sparse, and the total number of
features.

My problem is that when I view the clustering output using clusterdumper, I
get only dense vectors for each point that belongs to the cluster, but I
want the sparse vector for each point. What I want to know is whether the
vectors output by the clustering algorithm are stored as dense vectors, or
whether clusterdumper is converting them to dense vectors. If the
clustering algorithm generates sparse vectors, I can use them directly;
otherwise I will have to convert the vectors from dense to sparse, for which
I need the information mentioned in the paragraph above.

Your suggestions on this are welcome.

Thanks,
Ashvini

Re: Help regarding Seq2sparse utility

Posted by Ted Dunning <te...@gmail.com>.

Ah.

I get it.  Ish.

I think, but am not entirely sure, that there are two possible outputs that
you might be seeing.

One is the cluster centroids themselves.  These tend to densify, but
I am not sure whether they are actually stored as dense vectors (I would
tend to think so).  That might be what you are seeing.

The second is the assignment of your original vectors to the nearest
cluster.  Here, the vector is just your original vector.  This output could
be in the form of a cluster id followed by the ids of all the vectors in
that cluster.  That doesn't look like what you are seeing.
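
The remark that centroids "tend to densify" can be illustrated with a small
sketch. It assumes a plain Java map (index -> value) as a stand-in for a
sparse vector; the class name CentroidDensity and the sample points are
hypothetical, and this is not Mahout's own code. The mean of several sparse
points is non-zero wherever any one of the points is non-zero, so the
centroid usually has more non-zero entries than any single point.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative stand-in only: a sparse vector is modelled as index -> value.
final class CentroidDensity {
    // Mean of the given sparse vectors; an index missing from a point counts as 0.
    static Map<Integer, Double> centroid(List<Map<Integer, Double>> points) {
        Map<Integer, Double> sum = new TreeMap<>();
        for (Map<Integer, Double> p : points) {
            for (Map.Entry<Integer, Double> e : p.entrySet()) {
                sum.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        int n = points.size();
        sum.replaceAll((index, total) -> total / n);
        return sum;
    }

    public static void main(String[] args) {
        // Each hypothetical point has at most two non-zero entries.
        List<Map<Integer, Double>> points = List.of(
            Map.of(0, 1.0, 5, 2.0),
            Map.of(5, 4.0, 9, 3.0),
            Map.of(2, 6.0));
        Map<Integer, Double> c = centroid(points);
        System.out.println(c.size() + " non-zero entries in the centroid: " + c);
    }
}
```

With many points per cluster the same effect can fill in a large share of
the feature space, which is why a centroid may look dense even when every
input vector is sparse.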

Can you say what actual commands you are running?  Without that, it is
a bit hard to say what you are seeing.

On Sun, Aug 11, 2013 at 10:57 PM, Ashwini P <as...@gmail.com> wrote:

> Hi Ted,
>
> My apologies for not framing the question about clusterdumper properly. I am
> getting the output from clusterdumper in the expected format. A sample
> vector from the clusterdumper output is shown below:
>
>     1.0: /all-exchanges-strings.lc.txt = [amex:0.161, ase:0.161, asx:0.161,
> biffex:0.161, bse:0.161, cboe:0.161, cbt:0.161, cme:0.161, comex:0.161,
> cse:0.161, fox:0.136, fse:0.161, hkse:0.161, ipe:0.161, jse:0.161,
> klce:0.161, klse:0.161, liffe:0.161, lme:0.161, lse:0.161, mase:0.161,
> mise:0.161, mnse:0.161, mose:0.161, nasdaq:0.161, nyce:0.161, nycsce:0.161,
> nymex:0.161, nyse:0.161, ose:0.161, pse:0.161, set:0.136, simex:0.161,
> sse:0.161, stse:0.161, tose:0.161, tse:0.161, wce:0.161, zse:0.161]
>
> What I originally wanted to know is whether these vectors are stored just
> the way clusterdumper prints them (i.e. as dense vectors), or whether they
> are sparse vectors and clusterdumper iterates over the non-zero values and
> prints only those. If they are sparse vectors, can you kindly tell me in
> which directory the algorithm writes the vectors, so I can read
> them?
>
> If the vectors are in dense format, then I need to convert them to sparse
> vectors. As can be seen from the clusterdump output sample above, only the
> features that have non-zero values for each vector are printed. The
> set of features that have non-zero values will differ from vector to
> vector. Suppose we have 3 vectors f1, f2, f3, with sets of non-zero
> features s1, s2 and s3 respectively. What I want is the set
>              S = s1 U s2 U s3
> i.e. S is the union of the sets of non-zero features of all the vectors, so
> that I can convert the dense vectors to sparse vectors.
>
> Your thoughts on this are welcome.
>
> Thanks,
> Ashvini

Re: Help regarding Seq2sparse utility

Posted by Ashwini P <as...@gmail.com>.

Hi Ted,

My apologies for not framing the question about clusterdumper properly. I am
getting the output from clusterdumper in the expected format. A sample
vector from the clusterdumper output is shown below:

    1.0: /all-exchanges-strings.lc.txt = [amex:0.161, ase:0.161, asx:0.161,
biffex:0.161, bse:0.161, cboe:0.161, cbt:0.161, cme:0.161, comex:0.161,
cse:0.161, fox:0.136, fse:0.161, hkse:0.161, ipe:0.161, jse:0.161,
klce:0.161, klse:0.161, liffe:0.161, lme:0.161, lse:0.161, mase:0.161,
mise:0.161, mnse:0.161, mose:0.161, nasdaq:0.161, nyce:0.161, nycsce:0.161,
nymex:0.161, nyse:0.161, ose:0.161, pse:0.161, set:0.136, simex:0.161,
sse:0.161, stse:0.161, tose:0.161, tse:0.161, wce:0.161, zse:0.161]

What I originally wanted to know is whether these vectors are stored just
the way clusterdumper prints them (i.e. as dense vectors), or whether they
are sparse vectors and clusterdumper iterates over the non-zero values and
prints only those. If they are sparse vectors, can you kindly tell me in
which directory the algorithm writes the vectors, so I can read
them?
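
As a rough illustration of why the printed form alone does not settle this
question, here is a minimal sketch of a formatter that emits only the
non-zero components as term:weight pairs. It is not clusterdumper's actual
code; the class name DumpFormat, the four-term dictionary, and the weights
are made up for the example. The input is deliberately a dense array, yet
the output has the same shape as the sample above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical formatter, written only to illustrate the point above.
final class DumpFormat {
    // Formats only the non-zero components as "term:weight" pairs.
    static String format(double[] weights, String[] terms) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < weights.length; i++) {
            if (weights[i] != 0.0) {
                parts.add(terms[i] + ":" + weights[i]);
            }
        }
        return "[" + String.join(", ", parts) + "]";
    }

    public static void main(String[] args) {
        String[] terms = {"amex", "bond", "fox", "zinc"};  // made-up dictionary
        double[] dense = {0.161, 0.0, 0.136, 0.0};         // stored densely
        System.out.println(format(dense, terms));
    }
}
```

A sparse representation passed through an equivalent formatter would print
the same text, so the storage type has to be checked on the vector objects
themselves rather than inferred from the dump.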

If the vectors are in dense format, then I need to convert them to sparse
vectors. As can be seen from the clusterdump output sample above, only the
features that have non-zero values for each vector are printed. The
set of features that have non-zero values will differ from vector to
vector. Suppose we have 3 vectors f1, f2, f3, with sets of non-zero
features s1, s2 and s3 respectively. What I want is the set
             S = s1 U s2 U s3
i.e. S is the union of the sets of non-zero features of all the vectors, so
that I can convert the dense vectors to sparse vectors.
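
The union S and the dense-to-sparse conversion described here can be
sketched as follows. This is a minimal sketch under stated assumptions:
dense vectors are plain double arrays, sparse vectors are index -> value
maps, and the names FeatureUnion, toSparse and union are hypothetical, not
part of Mahout.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative stand-ins only; none of these names come from Mahout.
final class FeatureUnion {
    // Dense array -> sparse map holding only the non-zero components.
    static Map<Integer, Double> toSparse(double[] dense) {
        Map<Integer, Double> sparse = new TreeMap<>();
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0) {
                sparse.put(i, dense[i]);
            }
        }
        return sparse;
    }

    // S = s1 U s2 U ... : every index that is non-zero in at least one vector.
    static Set<Integer> union(List<Map<Integer, Double>> vectors) {
        Set<Integer> s = new TreeSet<>();
        for (Map<Integer, Double> v : vectors) {
            s.addAll(v.keySet());
        }
        return s;
    }

    public static void main(String[] args) {
        Map<Integer, Double> f1 = toSparse(new double[] {0.0, 0.161, 0.0, 0.136});
        Map<Integer, Double> f2 = toSparse(new double[] {0.5, 0.0, 0.0, 0.136});
        Map<Integer, Double> f3 = toSparse(new double[] {0.0, 0.0, 0.0, 0.2});
        System.out.println("S = " + union(List.of(f1, f2, f3)));
    }
}
```

In this sketch each vector is converted on its own, without consulting S; S
is only needed when the overall set of features in use, or its size, is
wanted.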

Your thoughts on this are welcome.

Thanks,
Ashvini



On Mon, Aug 12, 2013 at 10:55 AM, Ted Dunning <te...@gmail.com> wrote:

> Aside from your issues with clusterdumper, the values you want can be had
> from a sparse vector using v.iterateNonZero() and v.norm(0).
>
> The issue with clusterdumper is odd.
>
> Are you saying that the display shows all the components of the vector?  Or
> that there is an in-memory representation that has been densified?

Re: Help regarding Seq2sparse utility

Posted by Ted Dunning <te...@gmail.com>.

Aside from your issues with clusterdumper, the values you want can be had
from a sparse vector using v.iterateNonZero() and v.norm(0).
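
To show what these two calls give you without pulling in Mahout itself, here
is a small stand-in sketch. It assumes a sparse vector modelled as a plain
Java map (index -> value); the class name NonZeroWalk and the sample values
are hypothetical. Walking the map entries plays the role of iterating over
the non-zero elements, and countNonZero plays the role of the 0-norm, that
is, the number of non-zero components.

```java
import java.util.Map;
import java.util.TreeMap;

// Stand-in for a sparse vector: index -> value. This is not Mahout's Vector class.
final class NonZeroWalk {
    // Counts the stored entries whose value is not zero (the "0-norm").
    static int countNonZero(Map<Integer, Double> v) {
        int n = 0;
        for (double x : v.values()) {
            if (x != 0.0) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        Map<Integer, Double> v = new TreeMap<>();
        v.put(3, 0.161);
        v.put(10, 0.136);
        v.put(42, 0.161);
        // Visit only the stored (non-zero) entries, in index order.
        for (Map.Entry<Integer, Double> e : v.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        System.out.println("non-zero count = " + countNonZero(v));
    }
}
```

The indexes visited this way are the feature ids of one vector; mapping them
back to terms needs the dictionary produced during vectorization.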

The issue with clusterdumper is odd.

Are you saying that the display shows all the components of the vector?  Or
that there is an in-memory representation that has been densified?

On Sun, Aug 11, 2013 at 9:24 PM, Ashwini P <as...@gmail.com> wrote:

> Hello,
>
> I am new to Mahout. I want to know how I can get the list of features that
> were extracted from the corpus by seq2sparse, and the total number of
> features.
>
> My problem is that when I view the clustering output using clusterdumper, I
> get only dense vectors for each point that belongs to the cluster, but I
> want the sparse vector for each point. What I want to know is whether the
> vectors output by the clustering algorithm are stored as dense vectors, or
> whether clusterdumper is converting them to dense vectors. If the
> clustering algorithm generates sparse vectors, I can use them directly;
> otherwise I will have to convert the vectors from dense to sparse, for which
> I need the information mentioned in the paragraph above.
>
> Your suggestions on this are welcome.
>
> Thanks,
> Ashvini
>