Posted to user@mahout.apache.org by Lance Norskog <go...@gmail.com> on 2011/11/13 00:23:02 UTC

Multi-file Matrices?

Is there a convention for multi-file matrices? For example, the
DistributedRowMatrix?

-- 
Lance Norskog
goksron@gmail.com

Re: Multi-file Matrices?

Posted by Ted Dunning <te...@gmail.com>.
I should have said "don't forget which row is which".

On Mon, Nov 14, 2011 at 12:06 AM, Jake Mannix <ja...@gmail.com> wrote:

> The ordering *can* be chosen to be that.  But nothing in our API
> documentation implies we will always do this, and in fact it completely
> depends on whether the MR job used to create the matrix had reducer
> outputs creating row numbers sequentially.
>
>  -jake

Re: Multi-file Matrices?

Posted by Jake Mannix <ja...@gmail.com>.
The ordering *can* be chosen to be that.  But nothing in our API
documentation implies we will always do this, and in fact it completely
depends on whether the MR job used to create the matrix had reducer
outputs creating row numbers sequentially.

  -jake
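
A rough sketch of why the order across part files depends on the job that wrote them. It assumes hash partitioning on an integer row key reduces to "row index modulo number of reducers", and compares that with a hypothetical range partitioner that hands each reducer a contiguous block of rows. The function names are made up for illustration; none of this is Mahout or Hadoop code.

```python
def hash_partition(row_index, num_reducers):
    """Mimic hash partitioning of an integer key: non-negative hash mod reducers."""
    return (row_index & 0x7FFFFFFF) % num_reducers


def range_partition(row_index, num_rows, num_reducers):
    """Hypothetical alternative: give each reducer a contiguous block of rows."""
    block = -(-num_rows // num_reducers)  # ceiling division
    return row_index // block


def concatenated_keys(partition_of, num_rows, num_reducers):
    """Row keys in the order seen when part files are read one after another."""
    parts = [[] for _ in range(num_reducers)]
    for row_index in range(num_rows):
        parts[partition_of(row_index)].append(row_index)
    # Each reducer sees its own keys in sorted order.
    return [key for part in parts for key in sorted(part)]


hashed = concatenated_keys(lambda i: hash_partition(i, 3), 8, 3)
ranged = concatenated_keys(lambda i: range_partition(i, 8, 3), 8, 3)
```

Under these assumptions only the range-partitioned output reads back in row order; the hash-partitioned output is sorted within each part but interleaved across parts.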

On Sun, Nov 13, 2011 at 11:28 PM, Lance Norskog <go...@gmail.com> wrote:

> So, a DRM is a set of one or more files, where each SequenceFile int/vector
> pair is a row number and a fully wide vector? Then ordering is in the
> IntWritable keys.

Re: Multi-file Matrices?

Posted by Lance Norskog <go...@gmail.com>.
So, a DRM is a set of one or more files, where each SequenceFile int/vector
pair is a row number and a fully wide vector? Then ordering is in the
IntWritable keys.
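
A small sketch of what "ordering is in the keys" means in practice. Plain Python tuples stand in for the int/vector pairs, and `to_dense` is a hypothetical helper, not part of any library. Each row is placed by its key, so arrival order does not matter; dropping the keys and trusting arrival order gives a different matrix.

```python
def to_dense(keyed_rows, num_rows, num_cols):
    """Place each row at the position given by its key, not by arrival order."""
    dense = [[0.0] * num_cols for _ in range(num_rows)]
    for row_index, row in keyed_rows:
        dense[row_index] = list(row)
    return dense


# Rows as they might arrive when part files are read in arbitrary order.
arrival = [(2, [5.0, 6.0]), (0, [1.0, 2.0]), (1, [3.0, 4.0])]

by_key = to_dense(arrival, num_rows=3, num_cols=2)

# Discarding the keys keeps arrival order, which is not row order here.
by_position = [row for _, row in arrival]
```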

On Sun, Nov 13, 2011 at 10:56 PM, Jake Mannix <ja...@gmail.com> wrote:

> I don't think we currently make any guarantees about sort-order of the
> parts themselves, or among the various part-files, as they may be created
> by any number of map-reduce jobs, and are then consumed by map-reduce jobs
> which have no inter-process communication.
>
> What would ordering even *mean* among map-inputs?  Or are you just
> referring to in each chunk itself?  Or for non-MR use of the files?
>
>  -jake



-- 
Lance Norskog
goksron@gmail.com

Re: Multi-file Matrices?

Posted by Jake Mannix <ja...@gmail.com>.
I don't think we currently make any guarantees about sort-order of the parts
themselves, or among the various part-files, as they may be created by any
number of map-reduce jobs, and are then consumed by map-reduce jobs
which have no inter-process communication.

What would ordering even *mean* among map-inputs?  Or are you just
referring to in each chunk itself?  Or for non-MR use of the files?

  -jake
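
To illustrate why ordering among map inputs need not matter, here is a minimal sketch of a per-row computation (a matrix-vector product) over keyed rows. The `times` function and the sample data are hypothetical and only model the idea; this is not the Mahout implementation. Each row is processed independently and the result is stored under the row's key, so reading the rows in a different order gives the same answer.

```python
import random


def times(keyed_rows, x):
    """Compute y = A x from (row_index, row_vector) pairs read in any order."""
    y = {}
    for row_index, row in keyed_rows:
        # Each row contributes one output entry, stored under its own key.
        y[row_index] = sum(a * b for a, b in zip(row, x))
    return y


rows = [(0, [1.0, 2.0]), (1, [3.0, 4.0]), (2, [5.0, 6.0])]
x = [2.0, 1.0]

in_order = times(rows, x)

shuffled = rows[:]
random.Random(7).shuffle(shuffled)  # simulate an arbitrary read order
out_of_order = times(shuffled, x)
```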

On Sun, Nov 13, 2011 at 10:38 PM, Ted Dunning <te...@gmail.com> wrote:

> Make sure that the files can be ordered, of course.  Losing the ordering
> can be really bad.

Re: Multi-file Matrices?

Posted by Ted Dunning <te...@gmail.com>.
Make sure that the files can be ordered, of course.  Losing the ordering
can be really bad.

On Sun, Nov 13, 2011 at 10:34 PM, Jake Mannix <ja...@gmail.com> wrote:

> Yeah, in particular, DistributedRowMatrix "is" simply a
> SequenceFile<IntWritable,VectorWritable>, when in its serialized form.
> As such, this "file" can be (and typically is) a series of part-* files
> in a directory (typically on HDFS).
>
>  -jake

Re: Multi-file Matrices?

Posted by Jake Mannix <ja...@gmail.com>.
Yeah, in particular, DistributedRowMatrix "is" simply a
SequenceFile<IntWritable,VectorWritable>, when in its serialized form.
As such, this "file" can be (and typically is) a series of part-* files
in a directory (typically on HDFS).

  -jake
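
A minimal sketch of the directory-of-parts layout described above. It does not read real SequenceFiles: JSON lines in a local temporary directory stand in for the serialized int/vector pairs, and `write_part` and `read_matrix_dir` are hypothetical helpers invented for this illustration. The point is only that one logical matrix is the union of the keyed rows in every part-* file.

```python
import json
import tempfile
from pathlib import Path


def write_part(directory, part_index, rows):
    """Write one stand-in part file: one JSON line per (row_index, vector) pair."""
    path = Path(directory) / f"part-{part_index:05d}"
    with path.open("w") as f:
        for row_index, vector in rows:
            f.write(json.dumps([row_index, vector]) + "\n")
    return path


def read_matrix_dir(directory):
    """Yield (row_index, vector) pairs from every part-* file in the directory."""
    for path in sorted(Path(directory).glob("part-*")):
        with path.open() as f:
            for line in f:
                row_index, vector = json.loads(line)
                yield row_index, vector


with tempfile.TemporaryDirectory() as d:
    # Two parts; rows are deliberately not in key order within or across parts.
    write_part(d, 0, [(2, [0.0, 5.0]), (0, [1.0, 2.0])])
    write_part(d, 1, [(1, [3.0, 4.0])])
    part_names = sorted(p.name for p in Path(d).glob("part-*"))
    pairs = list(read_matrix_dir(d))
```

Collecting the pairs into a dict keyed by row index recovers the matrix regardless of which part a row came from.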

On Sun, Nov 13, 2011 at 10:23 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> It's my understanding drm can be multifile. In fact, stuff like seq2sparse
> will produce multifile output, being a MR job itself.

Re: Multi-file Matrices?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
It's my understanding that a DRM can be multi-file. In fact, stuff like
seq2sparse will produce multi-file output, being an MR job itself.
On Nov 12, 2011 3:23 PM, "Lance Norskog" <go...@gmail.com> wrote:

> Is there a convention for multi-file matrices? For example, the
> DistributedRowMatrix?
>
> --
> Lance Norskog
> goksron@gmail.com
>