Posted to dev@mahout.apache.org by Shannon Quinn <sq...@gatech.edu> on 2010/08/03 00:13:08 UTC

Re: M/R over two matrices, and computing the median

>
>
> Accessing a separate SequenceFile from within a Mapper is *way inefficient*
> (orders of magnitude slower).
>
> You want to do a map-side join.  This is what is done in MatrixMultiplyJob -
> your Mapper gets IntWritable as key, and the value is a Pair of
> VectorWritables - one from each matrix.
>

Excellent. Any idea what the Hadoop 0.20.2 equivalent for
CompositeInputFormat is? :)
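
For reference, the old mapred API wires up the join Jake describes roughly
like this (a sketch; the class name and path names are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class OldApiJoinSetup {
  // Configure an inner map-side join over two sorted, identically
  // partitioned sequence files; each map() call then receives the row
  // key plus a TupleWritable holding one value from each input.
  public static JobConf configure() {
    JobConf conf = new JobConf();
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", SequenceFileInputFormat.class,
        new Path("matrixA"), new Path("matrixB")));  // illustrative paths
    return conf;
  }
}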

Re: M/R over two matrices, and computing the median

Posted by Ted Dunning <te...@gmail.com>.
Yes.  The general principle is to label each incoming record with the file
it came from.  That can be done in the same data structure that provides the
polymorphism you want.  The records then carry their labels into the reducer,
where you can inspect them and decide what to do.  It is also common to sort
in some clever way so that you know the order in which the labels arrive.
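
A minimal sketch of such a labelled record, assuming Mahout's VectorWritable
as the payload (the class name and tag convention are illustrative, not an
existing Mahout type):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.math.VectorWritable;

// A VectorWritable labelled with the matrix it came from, so the reducer
// can branch on the label.
public class TaggedVectorWritable implements Writable {
  private byte source;  // e.g. 0 = matrix A, 1 = matrix B
  private VectorWritable vector = new VectorWritable();

  public byte getSource() { return source; }
  public VectorWritable getVector() { return vector; }

  public void set(byte source, VectorWritable vector) {
    this.source = source;
    this.vector = vector;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeByte(source);  // label first, so readers can branch on it
    vector.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    source = in.readByte();
    vector.readFields(in);
  }
}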

On Tue, Aug 3, 2010 at 2:09 PM, Shannon Quinn <sq...@gatech.edu> wrote:

> Store the Path.getName() of the two separate Paths prior to the job, then
> within the Mapper, look up context.getWorkingDirectory().getName() and
> compare it to the two variables set before. Whichever one matches, you know
> you're working with that specific SequenceFile.
>
> Would that even work?
>

Re: M/R over two matrices, and computing the median

Posted by Shannon Quinn <sq...@gatech.edu>.
I don't know how dirty of a hack this might be, but what about this:

Store the Path.getName() of the two separate Paths prior to the job, then
within the Mapper, look up context.getWorkingDirectory().getName() and
compare it to the two variables set before. Whichever one matches, you know
you're working with that specific SequenceFile.

Would that even work?
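
For what it's worth, the more common way to recover the source file under the
new API is to ask the map context for its InputSplit rather than the working
directory; a sketch, assuming file-based splits and Mahout's VectorWritable:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.mahout.math.VectorWritable;

public class SourceAwareMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  private boolean fromMatrixA;

  @Override
  protected void setup(Context context) {
    // Each map task processes one split, and for sequence file input the
    // split knows which file it came from.
    Path source = ((FileSplit) context.getInputSplit()).getPath();
    // "matrixA" is an illustrative name; compare against whatever
    // Path.getName() values were recorded before submitting the job.
    fromMatrixA = "matrixA".equals(source.getName());
  }

  @Override
  protected void map(IntWritable row, VectorWritable vec, Context context)
      throws IOException, InterruptedException {
    // route or tag the vector based on fromMatrixA, then emit
    context.write(row, vec);
  }
}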

On Tue, Aug 3, 2010 at 3:50 PM, Shannon Quinn <sq...@gatech.edu> wrote:

> Here's my next question, then: within the Mapper itself, how do I know the
> source SequenceFile of the VectorWritable I'm currently holding, A or B?
>
>
> On Tue, Aug 3, 2010 at 1:33 PM, Ted Dunning <te...@gmail.com> wrote:
>
>> Well, if both vectors are the same size, then a map-reduce keyed on vector
>> number is the natural solution here.
>>
>> A map-side join is only useful when one or the other operand is relatively
>> small.
>>
>> On Tue, Aug 3, 2010 at 10:12 AM, Shannon Quinn <sq...@gatech.edu> wrote:
>>
>> > but how to gain access
>> > to the two original vectors at the same time is beyond me.
>> >
>>
>
>

Re: M/R over two matrices, and computing the median

Posted by Shannon Quinn <sq...@gatech.edu>.
Here's my next question, then: within the Mapper itself, how do I know the
source SequenceFile of the VectorWritable I'm currently holding, A or B?

On Tue, Aug 3, 2010 at 1:33 PM, Ted Dunning <te...@gmail.com> wrote:

> Well, if both vectors are the same size, then a map-reduce keyed on vector
> number is the natural solution here.
>
> A map-side join is only useful when one or the other operand is relatively
> small.
>
> On Tue, Aug 3, 2010 at 10:12 AM, Shannon Quinn <sq...@gatech.edu> wrote:
>
> > but how to gain access
> > to the two original vectors at the same time is beyond me.
> >
>

Re: M/R over two matrices, and computing the median

Posted by Ted Dunning <te...@gmail.com>.
Well, if both vectors are the same size, then a map-reduce keyed on vector
number is the natural solution here.

A map-side join is only useful when one or the other operand is relatively
small.

On Tue, Aug 3, 2010 at 10:12 AM, Shannon Quinn <sq...@gatech.edu> wrote:

> but how to gain access
> to the two original vectors at the same time is beyond me.
>

Re: M/R over two matrices, and computing the median

Posted by Shannon Quinn <sq...@gatech.edu>.
And the key in each map() would correspond to the row in whichever
SequenceFile it's parsing, so as long as the two files line up on their keys,
I'll have exactly two VectorWritables (or whatever Writable) per key in the
Reducer.
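
In reducer terms, that might look something like the following (a sketch; it
assumes Mahout's VectorWritable values and exactly one row vector arriving
from each file):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class TwoVectorReducer
    extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  @Override
  protected void reduce(IntWritable row, Iterable<VectorWritable> values,
                        Context ctx) throws IOException, InterruptedException {
    Vector first = null;
    Vector second = null;
    for (VectorWritable vw : values) {
      // clone defensively: Hadoop may reuse the Writable across iterations
      if (first == null) {
        first = vw.get().clone();
      } else {
        second = vw.get().clone();
      }
    }
    // combine the two row vectors here (e.g. the median computation)
    ctx.write(row, new VectorWritable(first));
  }
}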

Oy. That's about as simple as it gets. Thank you very much!!

Shannon

On Tue, Aug 3, 2010 at 1:17 PM, Sean Owen <sr...@gmail.com> wrote:

> You want row N from matrix A and B?
>
> Map A to (row # -> row vector) and likewise for B. Both are input paths.
> Then the reducer has, for each row, both row vectors.
>
> You can add a custom Writable with more info about, say, which vector
> is which, if you like.
>
> On Tue, Aug 3, 2010 at 10:12 AM, Shannon Quinn <sq...@gatech.edu> wrote:
> > Right, that's the concept I'd had in mind, but to me it always seems to
> > come down to having access to two distinct vectors at the same time, and
> > I'm not sure how you would do that. In my case, both the dimensions and
> > the data types of the two vectors are identical, so we're talking a merged
> > vector of floats that's simply twice as long as the original, but how to
> > gain access to the two original vectors at the same time is beyond me.
> >
> > But still, the data types I need for this are in a newer Hadoop commit;
> > I'm just trying to figure out how to build that commit manually and
> > integrate it into the core Hadoop .jar file.
> >
> > Any suggestions that would speed along either of these options are most
> > welcome.
> >
> > Shannon
>

Re: M/R over two matrices, and computing the median

Posted by Sean Owen <sr...@gmail.com>.
You want row N from matrix A and B?

Map A to (row # -> row vector) and likewise for B. Both are input paths.
Then the reducer has, for each row, both row vectors.

You can add a custom Writable with more info about, say, which vector
is which, if you like.
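
The mapper half is essentially an identity map; a sketch (assuming the
sequence files are keyed by an IntWritable row number, as Mahout's matrix
files are):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.VectorWritable;

// Rows from both A and B pass through keyed by row number, so the
// shuffle groups the two row vectors for each row together.
public class RowVectorMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  @Override
  protected void map(IntWritable row, VectorWritable vec, Context ctx)
      throws IOException, InterruptedException {
    ctx.write(row, vec);
  }
}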

On Tue, Aug 3, 2010 at 10:12 AM, Shannon Quinn <sq...@gatech.edu> wrote:
> Right, that's the concept I'd had in mind, but to me it always seems to come
> down to having access to two distinct vectors at the same time, and I'm not
> sure how you would do that. In my case, both the dimensions and the data
> types of the two vectors are identical, so we're talking a merged vector of
> floats that's simply twice as long as the original, but how to gain access
> to the two original vectors at the same time is beyond me.
>
> But still, the data types I need for this are in a newer Hadoop commit; I'm
> just trying to figure out how to build that commit manually and integrate
> it into the core Hadoop .jar file.
>
> Any suggestions that would speed along either of these options are most
> welcome.
>
> Shannon

Re: M/R over two matrices, and computing the median

Posted by Shannon Quinn <sq...@gatech.edu>.
Right, that's the concept I'd had in mind, but to me it always seems to come
down to having access to two distinct vectors at the same time, and I'm not
sure how you would do that. In my case, both the dimensions and the data
types of the two vectors are identical, so we're talking a merged vector of
floats that's simply twice as long as the original, but how to gain access
to the two original vectors at the same time is beyond me.

But still, the data types I need for this are in a newer Hadoop commit; I'm
just trying to figure out how to build that commit manually and integrate it
into the core Hadoop .jar file.

Any suggestions that would speed along either of these options are most
welcome.

Shannon

On Tue, Aug 3, 2010 at 11:50 AM, Sean Owen <sr...@gmail.com> wrote:

> What I ended up doing in this case, IIRC, is to use another phase to
> convert inputs 1 and 2 into some contrived new single Writable format.
> Then both sets of input are merely fed into one mapper. So I'd
> literally have Writable classes that contained, inside, either a
> FooWritable or BarWritable. A little ugly but not bad.
>
> On Mon, Aug 2, 2010 at 3:24 PM, Shannon Quinn <sq...@gatech.edu> wrote:
> > CompositeInputFormat implements a hadoop.mapred.join interface, whereas
> > job.setInputFormatClass() expects a class that extends the new
> > hadoop.mapreduce InputFormat. TupleWritable is in the deprecated
> > hadoop.mapred package, too.
> >
> > Still hunting around the API for the newer equivalent; there has to be a
> > way of doing this?
> >
> > On Mon, Aug 2, 2010 at 6:20 PM, Jake Mannix <ja...@gmail.com> wrote:
> >
> >> On Mon, Aug 2, 2010 at 3:13 PM, Shannon Quinn <sq...@gatech.edu>
> >> wrote:
> >> >
> >> > Excellent. Any idea what the Hadoop 0.20.2 equivalent for
> >> > CompositeInputFormat is? :)
> >> >
> >>
> >> Ah, there is that part.  Hmm... it's really really annoying to not have
> >> that in 0.20.2.
> >>
> >> This is actually why I haven't migrated the distributed matrix stuff to
> >> the newest Hadoop API - map-side join is pretty seriously useful
> >> sometimes.
> >>
> >> Does the old CompositeInputFormat work with the new API, does anyone
> >> know?
> >>
> >>  -jake
> >>
> >
>

Re: M/R over two matrices, and computing the median

Posted by Sean Owen <sr...@gmail.com>.
What I ended up doing in this case, IIRC, is to use another phase to
convert inputs 1 and 2 into some contrived new single Writable format.
Then both sets of input are merely fed into one mapper. So I'd
literally have Writable classes that contained, inside, either a
FooWritable or BarWritable. A little ugly but not bad.
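
A rough sketch of that container (FooWritable and BarWritable are stand-in
names from the description above; the boolean tag is an illustrative detail):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Union-style Writable: holds either a FooWritable or a BarWritable and
// records which of the two was serialized.
public class EitherWritable implements Writable {
  private boolean isFoo;
  private FooWritable foo = new FooWritable();  // stand-in type
  private BarWritable bar = new BarWritable();  // stand-in type

  public void setFoo(FooWritable f) { foo = f; isFoo = true; }
  public void setBar(BarWritable b) { bar = b; isFoo = false; }
  public boolean isFoo() { return isFoo; }
  public FooWritable getFoo() { return foo; }
  public BarWritable getBar() { return bar; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeBoolean(isFoo);
    if (isFoo) { foo.write(out); } else { bar.write(out); }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    isFoo = in.readBoolean();
    if (isFoo) { foo.readFields(in); } else { bar.readFields(in); }
  }
}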

On Mon, Aug 2, 2010 at 3:24 PM, Shannon Quinn <sq...@gatech.edu> wrote:
> CompositeInputFormat implements a hadoop.mapred.join interface, whereas
> job.setInputFormatClass() expects a class that extends the new
> hadoop.mapreduce InputFormat. TupleWritable is in the deprecated
> hadoop.mapred package, too.
>
> Still hunting around the API for the newer equivalent; there has to be a way
> of doing this?
>
> On Mon, Aug 2, 2010 at 6:20 PM, Jake Mannix <ja...@gmail.com> wrote:
>
>> On Mon, Aug 2, 2010 at 3:13 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>> >
>> > Excellent. Any idea what the Hadoop 0.20.2 equivalent for
>> > CompositeInputFormat is? :)
>> >
>>
>> Ah, there is that part.  Hmm... it's really really annoying to not have
>> that in 0.20.2.
>>
>> This is actually why I haven't migrated the distributed matrix stuff to the
>> newest Hadoop API - map-side join is pretty seriously useful sometimes.
>>
>> Does the old CompositeInputFormat work with the new API, does anyone know?
>>
>>  -jake
>>
>

Re: M/R over two matrices, and computing the median

Posted by Shannon Quinn <sq...@gatech.edu>.
Ok, turns out those same classes do exist in the new API, but weren't
included in the 0.20.2 release for some reason - they're in a much more
recent SVN commit:

http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/join/

I've had to check out the latest mapreduce revision and build it manually; it
looks like Ant's "zipgroupfileset" element on the jar task is my best bet for
flattening and merging the new mapreduce jar into the overall core Hadoop jar.
Hopefully this won't cause any flagrant problems...
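
For what it's worth, that merge might look something like this in the Ant
build file (a sketch; the jar names and lib directory are assumptions):

<jar destfile="hadoop-0.20.2-with-join.jar">
  <!-- flatten the class files from both jars into one output jar -->
  <zipgroupfileset dir="lib">
    <include name="hadoop-0.20.2-core.jar"/>
    <include name="hadoop-mapred-trunk.jar"/>
  </zipgroupfileset>
</jar>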

Shannon

On Mon, Aug 2, 2010 at 6:24 PM, Shannon Quinn <sq...@gatech.edu> wrote:

> CompositeInputFormat implements a hadoop.mapred.join interface, whereas
> job.setInputFormatClass() expects a class that extends the new
> hadoop.mapreduce InputFormat. TupleWritable is in the deprecated
> hadoop.mapred package, too.
>
> Still hunting around the API for the newer equivalent; there has to be a
> way of doing this?
>
> On Mon, Aug 2, 2010 at 6:20 PM, Jake Mannix <ja...@gmail.com> wrote:
>
>> On Mon, Aug 2, 2010 at 3:13 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>> >
>> > Excellent. Any idea what the Hadoop 0.20.2 equivalent for
>> > CompositeInputFormat is? :)
>> >
>>
>> Ah, there is that part.  Hmm... it's really really annoying to not have
>> that in 0.20.2.
>>
>> This is actually why I haven't migrated the distributed matrix stuff to the
>> newest Hadoop API - map-side join is pretty seriously useful sometimes.
>>
>> Does the old CompositeInputFormat work with the new API, does anyone know?
>>
>>  -jake
>>
>
>

Re: M/R over two matrices, and computing the median

Posted by Shannon Quinn <sq...@gatech.edu>.
CompositeInputFormat implements a hadoop.mapred.join interface, whereas
job.setInputFormatClass() expects a class that extends the new
hadoop.mapreduce InputFormat. TupleWritable is in the deprecated
hadoop.mapred package, too.

Still hunting around the API for the newer equivalent; there has to be a way
of doing this?

On Mon, Aug 2, 2010 at 6:20 PM, Jake Mannix <ja...@gmail.com> wrote:

> On Mon, Aug 2, 2010 at 3:13 PM, Shannon Quinn <sq...@gatech.edu> wrote:
> >
> > Excellent. Any idea what the Hadoop 0.20.2 equivalent for
> > CompositeInputFormat is? :)
> >
>
> Ah, there is that part.  Hmm... it's really really annoying to not have
> that in 0.20.2.
>
> This is actually why I haven't migrated the distributed matrix stuff to the
> newest Hadoop API - map-side join is pretty seriously useful sometimes.
>
> Does the old CompositeInputFormat work with the new API, does anyone know?
>
>  -jake
>

Re: M/R over two matrices, and computing the median

Posted by Jake Mannix <ja...@gmail.com>.
On Mon, Aug 2, 2010 at 3:13 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>
> Excellent. Any idea what the Hadoop 0.20.2 equivalent for
> CompositeInputFormat is? :)
>

Ah, there is that part.  Hmm... it's really really annoying to not have that
in 0.20.2.

This is actually why I haven't migrated the distributed matrix stuff to the
newest Hadoop API - map-side join is pretty seriously useful sometimes.

Does the old CompositeInputFormat work with the new API, does anyone know?

  -jake