Posted to user@mahout.apache.org by Vckay <da...@gmail.com> on 2011/05/05 05:54:41 UTC

Question Regarding Distributed Row Matrix

Hello all,
  I am trying to create a distributed row matrix from my data, which is
currently available as text input, with each line supposed to become a row
of the matrix. I am using the Spectral KMeans code as a way of
understanding how DistributedRowMatrix works, and I am somewhat confused.
Specifically: does DistributedRowMatrix require that the SequenceFiles have
the row ID as the key?
(The Spectral KMeans code does that, which is easy because the first word of
each input line carries that information. However, as far as I can see,
TextInputFormat just supplies a unique byte offset as the key (not
necessarily the line number), so I can't recover the line number from my
data. Furthermore, if I change my data to, say, a bunch of images living in
a flat directory, I am thinking of using some combination of the file number
and this byte offset as the key.)

Thanks
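
The byte-offset point above can be illustrated with a small sketch. It does
not use Hadoop at all: the class LineOffsets and its method offsets are
hypothetical names, and the code only mimics the kind of key a line-oriented
reader such as TextInputFormat hands to a mapper (the byte offset at which
each line starts).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustration only (hypothetical class, no Hadoop involved): computes the
// byte offset at which each line of a text starts.
class LineOffsets {
    static List<Long> offsets(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        List<Long> result = new ArrayList<>();
        long start = 0;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == '\n') {
                result.add(start);   // key for the line that just ended
                start = i + 1;       // next line begins after the newline
            }
        }
        if (start < bytes.length) {
            result.add(start);       // last line without a trailing newline
        }
        return result;
    }

    public static void main(String[] args) {
        // Three rows of different lengths.
        System.out.println(offsets("1 0 2\n0 3\n4 5 6 7\n")); // prints [0, 6, 10]
    }
}
```

The keys 0, 6 and 10 are unique within the file, but they are not the row
numbers 0, 1 and 2, and the row number cannot be derived from an offset
without knowing the lengths of all earlier lines.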

Re: Question Regarding Distributed Row Matrix

Posted by Vckay <da...@gmail.com>.

On Thu, May 5, 2011 at 12:19 PM, Jake Mannix <ja...@gmail.com> wrote:

> Vckay,
>
>  People don't typically take a raw Text file which has no keys, and build
> a DistributedRowMatrix from it.  You typically have something you want
> to key on (file name, guid from a database, embedded timestamp, etc).
> If you don't have any ids for your rows, you'll need to generate some.
>
>  If you look at what we do in RowIdJob, it maps over a SequenceFile
> of Text -> VectorWritable (which is the output of the seqdirectory
> script: filename -> vector), and turns this into a pair of sequence files,
> Int -> Text, and Int -> VectorWritable.  The first is a "dictionary" of
> what ints (docId) maps to what filename, and the latter is a true
> DistributedRowMatrix, ready for working with transpose, svd, etc.
>
>  Note that RowIdJob is not truly scalable: it iterates over your entire
> text directly, so it does not use any parallelism.
>
>
Ah OK. Thanks a lot. That sounds like exactly what I was looking for. To
clarify why I was working with a raw text file: I wanted to make sure I got
everything working on a small file whose results I could compare against a
sequential run. I eventually plan to test the algorithm on image data, where
I guess I can use the file name of the image to identify a row.

Re: Question Regarding Distributed Row Matrix

Posted by Jake Mannix <ja...@gmail.com>.

Vckay,

  People don't typically take a raw Text file which has no keys, and build
a DistributedRowMatrix from it.  You typically have something you want
to key on (file name, guid from a database, embedded timestamp, etc).
If you don't have any ids for your rows, you'll need to generate some.

  If you look at what we do in RowIdJob, it maps over a SequenceFile
of Text -> VectorWritable (which is the output of the seqdirectory
script: filename -> vector), and turns this into a pair of sequence files,
Int -> Text, and Int -> VectorWritable.  The first is a "dictionary" of
which int (docId) maps to which filename, and the latter is a true
DistributedRowMatrix, ready for working with transpose, SVD, etc.

  Note that RowIdJob is not truly scalable: it iterates over your entire
input directly, one record after another, so it does not use any parallelism.

  -jake
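
The renumbering described above can be sketched in plain Java. This is not
the RowIdJob source and uses no Mahout or Hadoop classes: RowIdSketch,
docIndex, matrix and assign are hypothetical names, and ordinary maps stand
in for the SequenceFiles.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only (hypothetical class): turns key -> vector pairs into
// an int -> key dictionary plus an int -> vector row matrix.
class RowIdSketch {
    final Map<Integer, String> docIndex = new LinkedHashMap<>();   // Int -> Text
    final Map<Integer, double[]> matrix = new LinkedHashMap<>();   // Int -> vector

    // One sequential pass: each record gets the next integer row id.
    static RowIdSketch assign(Map<String, double[]> keyedVectors) {
        RowIdSketch out = new RowIdSketch();
        int nextRowId = 0;
        for (Map.Entry<String, double[]> entry : keyedVectors.entrySet()) {
            out.docIndex.put(nextRowId, entry.getKey());
            out.matrix.put(nextRowId, entry.getValue());
            nextRowId++;
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, double[]> input = new LinkedHashMap<>();
        input.put("a.txt", new double[] {1.0, 0.0});
        input.put("b.txt", new double[] {3.0, 4.0});
        RowIdSketch result = assign(input);
        System.out.println(result.docIndex); // row id -> original file name
    }
}
```

The single counter is what makes the pass sequential: every record has to go
through the same loop to receive its id.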


On Thu, May 5, 2011 at 6:57 AM, Vckay <da...@gmail.com> wrote:

> OK. I do plan to use SVD and transpose. Assuming you are correct, I am
> curious then: How are people solving this problem? (Surely not all data has
> row tags in it). A solution I had in mind was to use a single reducer (have
> one key coming in from mapper) so that the single reducer is able to put in
> a row number. However, this is not a clean solution since it appears to
> have to do it serially.

Re: Question Regarding Distributed Row Matrix

Posted by Vckay <da...@gmail.com>.

OK. I do plan to use SVD and transpose. Assuming you are correct, I am
curious then: How are people solving this problem? (Surely not all data has
row tags in it). A solution I had in mind was to use a single reducer (have
one key coming in from mapper) so that the single reducer is able to put in
a row number. However, this is not a clean solution since it appears to have
to do it serially.
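
For comparison with the single-reducer idea, here is a sketch of a different
technique that nobody in this thread proposes: a two-pass numbering in which
each partition is first counted, prefix sums over the counts give each
partition its first row number, and each partition then numbers its own rows
independently. All names (ParallelRowNumbers, startOffsets, number) are
hypothetical, and in-memory lists stand in for input splits.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only (hypothetical class): two-pass row numbering.
class ParallelRowNumbers {
    // Pass 1 result: prefix sums of partition sizes. Counting each partition
    // could run in parallel; only this short summation is serial.
    static long[] startOffsets(List<List<String>> partitions) {
        long[] starts = new long[partitions.size()];
        long running = 0;
        for (int p = 0; p < partitions.size(); p++) {
            starts[p] = running;
            running += partitions.get(p).size();
        }
        return starts;
    }

    // Pass 2: a partition numbers its rows using only its own start offset,
    // so partitions do not need to coordinate with each other.
    static List<String> number(List<String> partition, long start) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < partition.size(); i++) {
            out.add((start + i) + "\t" + partition.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> parts = List.of(
            List.of("row a", "row b"), List.of("row c"), List.of("row d", "row e"));
        long[] starts = startOffsets(parts);
        for (int p = 0; p < parts.size(); p++) {
            System.out.println(number(parts.get(p), starts[p]));
        }
    }
}
```

The cost is reading the input twice, and the numbering only matches the
original line order if the partitions are taken in file order.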

On Thu, May 5, 2011 at 12:49 AM, Dmitriy Lyubimov <dl...@apache.org>wrote:

> The interpretation of key in sequence files is subject to restrictions
> of a particular algorithm. We held a discussion on this recently, and
> i think the consensus was that we don't want to lock DRM as a format
> to a particular interpretation of keys in the file -- it is left to
> client's code to interpret those and for ultimate goal of
> vectorization.
>
> However, different algorithms may interpret it differently. E.g.
> stochastic SVD is agnostic of both the key and its class and just
> copies it into keys of left eigenvector matrix whereas Lanczos SVD (I
> think) requires them to be IntWritable (and may also require them to
> be unique -- i am not 100% sure). Similarly, matrix transpose (I
> think) would also require them to be IntWritable and on top of them
> interpret them as row numbers for the sake of transposition. (I might
> be wrong about that last one).
>
> I am not sure about KMeans code.

Re: Question Regarding Distributed Row Matrix

Posted by Dmitriy Lyubimov <dl...@apache.org>.

The interpretation of the key in sequence files is subject to the
restrictions of a particular algorithm. We held a discussion on this
recently, and I think the consensus was that we don't want to lock DRM as a
format to a particular interpretation of keys in the file -- it is left to
the client's code to interpret them for the ultimate goal of
vectorization.

However, different algorithms may interpret them differently. E.g.
stochastic SVD is agnostic of both the key and its class and just
copies it into the keys of the left eigenvector matrix, whereas Lanczos SVD
(I think) requires them to be IntWritable (and may also require them to
be unique -- I am not 100% sure). Similarly, matrix transpose (I
think) would also require them to be IntWritable and on top of that
interpret them as row numbers for the sake of transposition. (I might
be wrong about that last one.)

I am not sure about KMeans code.
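
To show why a transpose would need integer row keys, here is a minimal
sketch over a sparse matrix held in nested maps. It is not Mahout's
transpose job: TransposeSketch and transpose are hypothetical names, and
the only point is that each row key i becomes a column index in the result,
which works only if the key is a number.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustration only (hypothetical class): sparse transpose where the outer
// key is the row number and the inner key is the column number.
class TransposeSketch {
    static Map<Integer, Map<Integer, Double>> transpose(
            Map<Integer, Map<Integer, Double>> rows) {
        Map<Integer, Map<Integer, Double>> result = new TreeMap<>();
        for (Map.Entry<Integer, Map<Integer, Double>> row : rows.entrySet()) {
            int i = row.getKey();                  // row key used as an index
            for (Map.Entry<Integer, Double> cell : row.getValue().entrySet()) {
                int j = cell.getKey();
                // Entry (i, j) of the input becomes entry (j, i) of the output.
                result.computeIfAbsent(j, k -> new TreeMap<>())
                      .put(i, cell.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Map<Integer, Double>> m = new TreeMap<>();
        m.put(0, new TreeMap<>(Map.of(1, 2.0)));
        m.put(1, new TreeMap<>(Map.of(0, 3.0, 1, 4.0)));
        System.out.println(transpose(m));
    }
}
```

With a file name or other non-numeric key in place of i, the inner put would
have no column position to write to, which is the restriction described
above.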
