Posted to dev@mahout.apache.org by "Shannon Quinn (JIRA)" <ji...@apache.org> on 2010/11/03 22:26:25 UTC

[jira] Created: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
-------------------------------------------------------------

                 Key: MAHOUT-537
                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
             Project: Mahout
          Issue Type: Improvement
    Affects Versions: 0.4
            Reporter: Shannon Quinn
            Assignee: Shannon Quinn


Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.
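
For illustration, the change amounts to roughly the following (a hedged sketch:
MyJob, MyMapper, MyReducer, and the paths are placeholders, not the actual
Mahout classes; the FileInputFormat/FileOutputFormat in the second snippet are
the mapreduce.lib versions):

    // Old, deprecated 0.18-style API: everything hangs off JobConf.
    JobConf conf = new JobConf(MyJob.class);
    conf.setMapperClass(MyMapper.class);             // org.apache.hadoop.mapred.Mapper
    conf.setReducerClass(MyReducer.class);
    conf.setOutputKeyClass(IntWritable.class);
    conf.setOutputValueClass(VectorWritable.class);
    FileInputFormat.setInputPaths(conf, inputPath);
    FileOutputFormat.setOutputPath(conf, outputPath);
    JobClient.runJob(conf);

    // New 0.20.2-style API: Configuration holds settings, Job drives execution.
    Configuration conf2 = new Configuration();
    Job job = new Job(conf2, "my job");
    job.setJarByClass(MyJob.class);
    job.setMapperClass(MyMapper.class);              // org.apache.hadoop.mapreduce.Mapper
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(VectorWritable.class);
    FileInputFormat.addInputPath(job, inputPath);
    FileOutputFormat.setOutputPath(job, outputPath);
    job.waitForCompletion(true);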

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Shannon Quinn <sq...@gatech.edu>.
Ok, silly question...how do I go about plugging in a different version of
Hadoop? I moved the 0.21 version (the tar.gz from the Hadoop site) into the
same path as HADOOP_HOME, and I wiped out the .m2/ repository so that all the
dependencies would be re-downloaded on the next Mahout build. Still getting
the 0.20.2 packages.

Grant has good points, just want to see if I can get this running...

On Sat, Nov 6, 2010 at 12:12 PM, Ted Dunning <te...@gmail.com> wrote:

> Remember Flume != FlumeJava.
>
> Flume is Cloudera's semi-proprietary ETL system.
>
> FlumeJava is a high level API for creating map-reduce programs in Java.
>  The
> level of abstraction is similar to Pig.
>
> Plume is an open source project I started to clone FlumeJava by filling in
> the details omitted from the Google paper.  As an
> example of how high level Plume is, word count in raw map-reduce is >200
> lines of code.  In Plume, it is about 20 and you
> can't tell which version of Hadoop, if any, your code is running on.
>
> On Fri, Nov 5, 2010 at 7:09 AM, Grant Ingersoll <gs...@apache.org>
> wrote:
>
> The Plume/Flume stuff seems promising for helping with that as well as
> giving some other benefits, but that relies on us having an open source
> version of Flume (which Ted and others have started).  I don't know that
> it is all that practical in the short term and I'm not proposing any
> rewrites at this point, but we should consider it, as working at that
> layer might allow us to plug in different backends that perform better
> for certain setups (local, small cluster, large cluster).  Such a bit of
> insulation might allow us to plug in other capabilities as well.  One of
> the things Hadoop has spawned is a whole lot more interest in these kinds
> of capabilities and I fully expect to see new/related paradigms coming
> out.  Obviously, we aren't just going to jump on anything, but we should
> think about ways we might be able to plug them in.  Thoughts?
>

Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Ted Dunning <te...@gmail.com>.
Remember Flume != FlumeJava.

Flume is Cloudera's semi-proprietary ETL system.

FlumeJava is a high level API for creating map-reduce programs in Java.  The
level of abstraction is similar to Pig.

Plume is an open source project I started to clone FlumeJava by filling in
the details omitted from the Google paper.  As an
example of how high level Plume is, word count in raw map-reduce is >200
lines of code.  In Plume, it is about 20 and you
can't tell which version of Hadoop, if any, your code is running on.
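
For a feel of the scale difference, word count in a FlumeJava-style API looks
something like the sketch below. This is purely illustrative: the names
(PCollection, PTable, parallelDo, DoFn, count) follow the FlumeJava paper and
may not match Plume's actual classes.

    // Hypothetical FlumeJava-paper-style word count; all names illustrative.
    PCollection<String> lines = readTextFile("/data/input.txt");
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      public void process(String line, EmitFn<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    });
    PTable<String, Integer> counts = words.count();  // group + sum, one logical step
    writeTextFile(counts, "/data/wordcounts");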

On Fri, Nov 5, 2010 at 7:09 AM, Grant Ingersoll <gs...@apache.org> wrote:

> The Plume/Flume stuff seems promising for helping with that as well as
> giving some other benefits, but that relies on us having an open source
> version of Flume (which Ted and others have started).  I don't know that
> it is all that practical in the short term and I'm not proposing any
> rewrites at this point, but we should consider it, as working at that
> layer might allow us to plug in different backends that perform better
> for certain setups (local, small cluster, large cluster).  Such a bit of
> insulation might allow us to plug in other capabilities as well.  One of
> the things Hadoop has spawned is a whole lot more interest in these kinds
> of capabilities and I fully expect to see new/related paradigms coming
> out.  Obviously, we aren't just going to jump on anything, but we should
> think about ways we might be able to plug them in.  Thoughts?

Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Second that.

As far as I know (and our company is following this too), pretty much
everybody is using CDH3b3 in production, which is technically based on the
0.20 tree. If you move to 0.21, that would render Mahout incompatible with a
lot of stuff running in production, IMO.

Also, it's not just Hadoop; it's Hive, Pig, HBase, etc. Pig, for one, has
not been ported to 0.21, and from what I heard there's not even an effort on
the horizon to break ground in that direction. That alone would preclude a
lot of folks from moving on to 0.21. A lot of people are locked in to
Cloudera's distro and ecosystem stuff that has various degrees of readiness
(or none at all).

I personally prefer to use the new API from CDH3b3 (the append API and HBase
enhancements are especially hard to ignore) but I imagine we will not switch
to 0.21 until there's at least a stable Pig version for it. My guess is this
reasoning is pretty typical.

Thanks.

-Dmitriy

On Fri, Nov 5, 2010 at 7:09 AM, Grant Ingersoll <gs...@apache.org> wrote:

> I didn't get a strong sense from the Hadoop community that 0.21 is all that
> well baked.  To quote the website:
> "This release contains many improvements, new features, bug fixes and
> optimizations. It has not undergone testing at scale and should not be
> considered stable or suitable for production. This release is being
> classified as a minor release, which means that it should be API compatible
> with 0.20.2."
>
> If they can't give it a vote of confidence, then I don't think we should
> either.
>
> It also reminds me that I think we should at a minimum have a conversation
> about ways we might insulate ourselves a little bit from Hadoop while still
> harnessing all of its power.  Ted and I talked about it a bit at the Bay
> Area meetup we had a few months ago.  The Plume/Flume stuff seems promising
> for helping with that as well as giving some other benefits, but that relies
> on us having an open source version of Flume (which Ted and others have
> started).  I don't know that it is all that practical in the short term and
> I'm not proposing any rewrites at this point, but we should consider it, as
> working at that layer might allow us to plug in different backends that
> perform better for certain setups (local, small cluster, large cluster).
> Such a bit of insulation might allow us to plug in other capabilities as
> well.  One of the things Hadoop has spawned is a whole lot more interest in
> these kinds of capabilities and I fully expect to see new/related paradigms
> coming out.  Obviously, we aren't just going to jump on anything, but we
> should think about ways we might be able to plug them in.  Thoughts?
>
> -Grant
>
> On Nov 4, 2010, at 3:35 PM, Jeff Eastman wrote:
>
> > We have historically tracked the latest versions of Hadoop pretty soon
> after they have been available. If the tests run on 0.21 and it has the
> CompositeInputFormat then I'd be +1 to move forward. Hopefully there will be
> a Cloudera version that tracks it pretty soon too, else users will have to
> build their own AMIs again.
> >
> > -----Original Message-----
> > From: Shannon Quinn (JIRA) [mailto:jira@apache.org]
> > Sent: Thursday, November 04, 2010 12:27 PM
> > To: dev@mahout.apache.org
> > Subject: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into
> compliance with Hadoop 0.20.2
> >
> >
> >    [
> https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928314#action_12928314]
> >
> > Shannon Quinn commented on MAHOUT-537:
> > --------------------------------------
> >
> > Something worth discussing: Hadoop just released version 0.21.0, which
> re-includes the updated CompositeInputFormat that was missing in 0.20.2 and
> deprecated in 0.18. I'm going to install v0.21 and see if tests pass on the
> trunk, but provided they do then I'm wondering if I should go ahead and
> implement this patch using Hadoop 0.21. Any thoughts?
> >
> >> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> >> -------------------------------------------------------------
> >>
> >>                Key: MAHOUT-537
> >>                URL: https://issues.apache.org/jira/browse/MAHOUT-537
> >>            Project: Mahout
> >>         Issue Type: Improvement
> >>   Affects Versions: 0.4
> >>           Reporter: Shannon Quinn
> >>           Assignee: Shannon Quinn
> >>        Attachments: MAHOUT-537.patch
> >>
> >>
> >> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2
> API, in particular eliminate dependence on the deprecated JobConf, using
> instead the separate Job and Configuration objects.
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Grant Ingersoll <gs...@apache.org>.
I didn't get a strong sense from the Hadoop community that 0.21 is all that well baked.  To quote the website:
"This release contains many improvements, new features, bug fixes and optimizations. It has not undergone testing at scale and should not be considered stable or suitable for production. This release is being classified as a minor release, which means that it should be API compatible with 0.20.2."

If they can't give it a vote of confidence, then I don't think we should either.

It also reminds me that I think we should at a minimum have a conversation about ways we might insulate ourselves a little bit from Hadoop while still harnessing all of its power.  Ted and I talked about it a bit at the Bay Area meetup we had a few months ago.  The Plume/Flume stuff seems promising for helping with that as well as giving some other benefits, but that relies on us having an open source version of Flume (which Ted and others have started).  I don't know that it is all that practical in the short term and I'm not proposing any rewrites at this point, but we should consider it, as working at that layer might allow us to plug in different backends that perform better for certain setups (local, small cluster, large cluster).  Such a bit of insulation might allow us to plug in other capabilities as well.  One of the things Hadoop has spawned is a whole lot more interest in these kinds of capabilities and I fully expect to see new/related paradigms coming out.  Obviously, we aren't just going to jump on anything, but we should think about ways we might be able to plug them in.  Thoughts?

-Grant

On Nov 4, 2010, at 3:35 PM, Jeff Eastman wrote:

> We have historically tracked the latest versions of Hadoop pretty soon after they have been available. If the tests run on 0.21 and it has the CompositeInputFormat then I'd be +1 to move forward. Hopefully there will be a Cloudera version that tracks it pretty soon too, else users will have to build their own AMIs again.
> 
> -----Original Message-----
> From: Shannon Quinn (JIRA) [mailto:jira@apache.org] 
> Sent: Thursday, November 04, 2010 12:27 PM
> To: dev@mahout.apache.org
> Subject: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> 
> 
>    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928314#action_12928314 ] 
> 
> Shannon Quinn commented on MAHOUT-537:
> --------------------------------------
> 
> Something worth discussing: Hadoop just released version 0.21.0, which re-includes the updated CompositeInputFormat that was missing in 0.20.2 and deprecated in 0.18. I'm going to install v0.21 and see if tests pass on the trunk, but provided they do then I'm wondering if I should go ahead and implement this patch using Hadoop 0.21. Any thoughts?
> 
>> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
>> -------------------------------------------------------------
>> 
>>                Key: MAHOUT-537
>>                URL: https://issues.apache.org/jira/browse/MAHOUT-537
>>            Project: Mahout
>>         Issue Type: Improvement
>>   Affects Versions: 0.4
>>           Reporter: Shannon Quinn
>>           Assignee: Shannon Quinn
>>        Attachments: MAHOUT-537.patch
>> 
>> 
>> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search


RE: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Jeff Eastman <je...@Narus.com>.
We have historically tracked the latest versions of Hadoop pretty soon after they have been available. If the tests run on 0.21 and it has the CompositeInputFormat then I'd be +1 to move forward. Hopefully there will be a Cloudera version that tracks it pretty soon too, else users will have to build their own AMIs again.

-----Original Message-----
From: Shannon Quinn (JIRA) [mailto:jira@apache.org] 
Sent: Thursday, November 04, 2010 12:27 PM
To: dev@mahout.apache.org
Subject: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2


    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928314#action_12928314 ] 

Shannon Quinn commented on MAHOUT-537:
--------------------------------------

Something worth discussing: Hadoop just released version 0.21.0, which re-includes the updated CompositeInputFormat that was missing in 0.20.2 and deprecated in 0.18. I'm going to install v0.21 and see if tests pass on the trunk, but provided they do then I'm wondering if I should go ahead and implement this patch using Hadoop 0.21. Any thoughts?

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.4
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>         Attachments: MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Ted Dunning <te...@gmail.com>.
I think that the production momentum is definitely on the 0.20.20x series.

Those are pretty much guaranteed to be compatible with 0.23 (aka MapReduce
nextGen) because of who is pushing them.

On Sun, May 22, 2011 at 11:04 AM, Grant Ingersoll <gs...@apache.org> wrote:

> The release notes for 0.21 weren't exactly inspirational when it comes to
> adoption:
> "It has not undergone testing at scale and should not be considered stable
> or suitable for production." --
> http://hadoop.apache.org/common/releases.html
>
> -G
>
> On May 21, 2011, at 2:43 PM, Ted Dunning wrote:
>
> > StumbleUpon and TrendMicro are on 0.20, I think.
> >
> > Yahoo might have some 0.21 stuff going.
> >
> > FB is 0.20 for the hbase stuff.
> >
> > On Sat, May 21, 2011 at 2:39 PM, Jake Mannix <ja...@gmail.com>
> wrote:
> >
> >> On Sat, May 21, 2011 at 2:16 PM, Ted Dunning <te...@gmail.com>
> >> wrote:
> >>
> >>> Actually, I don't know of many who are even using 0.21.  Pretty much
> >>> everybody I know is using 0.20.3
> >>>
> >>
> >> Yep, I think we're actually on some 0.20.1 variant, still.  And isn't
> >> Facebook
> >> on 0.20-append (or some variation thereof)?  I don't know anyone big on
> 0.21
> >> or higher.
> >>
> >> On Sat, May 21, 2011 at 1:44 PM, Shannon Quinn <sq...@gatech.edu>
> wrote:
> >>>> Unless, of course, everyone has been using 0.21 or even 0.22 as you
> >> have
> >>> :)
> >>>>
> >>>
> >>
> >>  -jake
> >>
>
>
>

Re: [jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Sean Owen <sr...@gmail.com>.
Well Jake has a point, that some of these "new" 0.21+ APIs that you
(and I) would like to use are actually already in Hadoop but in the
old deprecated .mapred. APIs. And if we're of a mind to not move past
0.20.x, well, it doesn't 100% mean we can't use these functions.

It does mean using some deprecated code, which is ugly. It's not so ugly,
though, since these APIs are not only coming back in 0.21, but are even
un-deprecated in their old form in later versions. Confusing. I personally
would support an "exception" for implementations that use old APIs for this
reason.

Keep in mind it is unfortunately hard to mix-n-match APIs. You'll
probably have to use all .mapred. stuff if you use any.


It's also possible to rewrite all this to not use the MultipleInputs
stuff on the newer APIs, with some of the kinds of techniques Ted and I
mentioned. It'd also be valid to go this way, but I imagine it will be
slower, perhaps a lot. And I think that would be a good reason not to go
this way.


I think I (and likely others) trust your judgment to do what's best
here as you're actively working on this.
But if you mean you really do want advice on what's better ... erm ask Jake?


On Sun, May 22, 2011 at 10:32 PM, Shannon Quinn <sq...@gatech.edu> wrote:
> But what it sounds like you're saying - cleverness with the keys and
> organization of the input paths has the possibility to keep the process as a
> single job (rather than 3), it'll just be a little "hacky" until some point
> down the road when we switch to some later version (0.21? 0.22?) of the API.
> Ultimately, though, we'll have eliminated the 0.18 mapred.* libraries.
>
> Is this what you're getting at?
>
> On 5/22/2011 4:49 PM, Sean Owen wrote:
>>
>> Ah righty -- this exists in the old API doesn't it... even in 0.20.x
>> But it's deprecated. But it's not deprecated in 0.21+.
>>
>> Yes I think there's a strong argument to make use of that even if it
>> is deprecated.
>>
>> I had in mind the unnecessary use of old .mapred. APIs for simple
>> Mappers and Reducers.
>>
>> On Sun, May 22, 2011 at 9:41 PM, Jake Mannix<ja...@gmail.com>
>>  wrote:
>>>
>>> Wait, are you saying that we should force things like matrix
>>> multiplication
>>> to become a 3-job process, instead of the current 1-job process?
>>>
>>> I thought we've already discussed and decided that moving to 0.20 APIs
>>> where possible should be done, but where it removes functionality and
>>> efficiency, we would allow the old API?
>>>
>>>  -jake
>>>
>
>

Re: [jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Shannon Quinn <sq...@gatech.edu>.
But what it sounds like you're saying - cleverness with the keys and 
organization of the input paths has the possibility to keep the process 
as a single job (rather than 3), it'll just be a little "hacky" until 
some point down the road when we switch to some later version (0.21? 0.22?) 
of the API. Ultimately, though, we'll have eliminated the 0.18 mapred.* 
libraries.

Is this what you're getting at?

On 5/22/2011 4:49 PM, Sean Owen wrote:
> Ah righty -- this exists in the old API doesn't it... even in 0.20.x
> But it's deprecated. But it's not deprecated in 0.21+.
>
> Yes I think there's a strong argument to make use of that even if it
> is deprecated.
>
> I had in mind the unnecessary use of old .mapred. APIs for simple
> Mappers and Reducers.
>
> On Sun, May 22, 2011 at 9:41 PM, Jake Mannix<ja...@gmail.com>  wrote:
>> Wait, are you saying that we should force things like matrix multiplication
>> to become a 3-job process, instead of the current 1-job process?
>>
>> I thought we've already discussed and decided that moving to 0.20 APIs
>> where possible should be done, but where it removes functionality and
>> efficiency, we would allow the old API?
>>
>>   -jake
>>


Re: [jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Jake Mannix <ja...@gmail.com>.
On Sun, May 22, 2011 at 1:49 PM, Sean Owen <sr...@gmail.com> wrote:

> Ah righty -- this exists in the old API doesn't it... even in 0.20.x
> But it's deprecated. But it's not deprecated in 0.21+.
>

Exactly.  In 0.21+, the new APIs have the old functionality, but in 0.20,
you have to use the old APIs to get it.


> Yes I think there's a strong argument to make use of that even if it
> is deprecated.
>
> I had in mind the unnecessary use of old .mapred. APIs for simple
> Mappers and Reducers.
>

Yeah, the problem with some of these things, like MultipleOutputFormat,
or map-side join, is that to get access to them, *everything* in a job
that wants to use them (including mappers, reducers, etc.) needs to use
the *.mapred.* classes, not the new mapreduce classes, which is annoying,
but necessary.
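
For example, wiring up the old map-side join looks roughly like this (a
sketch; MyJoinJob, JoinMapper, and the paths are placeholders -- only the
CompositeInputFormat pieces are the actual 0.20 .mapred. API):

    // Everything must come from org.apache.hadoop.mapred.*; mixing in the
    // new mapreduce classes won't work.
    JobConf conf = new JobConf(MyJoinJob.class);
    conf.setInputFormat(CompositeInputFormat.class);   // org.apache.hadoop.mapred.join
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", SequenceFileInputFormat.class, pathToA, pathToB));
    conf.setMapperClass(JoinMapper.class);             // a mapred.Mapper; joined values
                                                       // arrive as TupleWritable
    FileOutputFormat.setOutputPath(conf, outputPath);
    JobClient.runJob(conf);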

  -jake

Re: [jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Sean Owen <sr...@gmail.com>.
Ah righty -- this exists in the old API doesn't it... even in 0.20.x
But it's deprecated. But it's not deprecated in 0.21+.

Yes I think there's a strong argument to make use of that even if it
is deprecated.

I had in mind the unnecessary use of old .mapred. APIs for simple
Mappers and Reducers.

On Sun, May 22, 2011 at 9:41 PM, Jake Mannix <ja...@gmail.com> wrote:
> Wait, are you saying that we should force things like matrix multiplication
> to become a 3-job process, instead of the current 1-job process?
>
> I thought we've already discussed and decided that moving to 0.20 APIs
> where possible should be done, but where it removes functionality and
> efficiency, we would allow the old API?
>
>  -jake
>

Re: [jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Jake Mannix <ja...@gmail.com>.
On Sun, May 22, 2011 at 12:35 PM, Sean Owen <sr...@gmail.com> wrote:

> I think you'll have to push that to 1.0 for now, then; 0.20.x doesn't
> have map-side joins. Yes that is a blocker for what you're trying to
> do and what Sebastian is trying to do for recommendations. I've
> already reimplemented recommenders separately with these things and it
> simplifies and speeds up the pipeline.
>

Wait, are you saying that we should force things like matrix multiplication
to become a 3-job process, instead of the current 1-job process?

I thought we've already discussed and decided that moving to 0.20 APIs
where possible should be done, but where it removes functionality and
efficiency, we would allow the old API?

  -jake

Re: [jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Ted Dunning <te...@gmail.com>.
It is also possible to extend the input format so that it handles some files
one way and other files another.  The key can be any common supertype of the
keys from the inputs (at worst Writable).
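
A rough sketch of that idea, in the old API (illustrative and untested; the
file-name test and record handling are placeholders):

    // An input format that hands different files to different record readers.
    public class PathDispatchingInputFormat
        extends SequenceFileInputFormat<Writable, Writable> {
      @Override
      public RecordReader<Writable, Writable> getRecordReader(
          InputSplit split, JobConf conf, Reporter reporter) throws IOException {
        Path path = ((FileSplit) split).getPath();
        if (path.getName().startsWith("matrixA-")) {
          // ...wrap the reader so A-files are decoded one way...
        }
        // ...and B-files another; both sides emit a common supertype key.
        return super.getRecordReader(split, conf, reporter);
      }
    }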

On Sun, May 22, 2011 at 12:35 PM, Sean Owen <sr...@gmail.com> wrote:

> One solution is to create an "XOrYWritable" which holds either an X or
> a Y. Then the jobs that output an X or a Y both output one same value
> type, XOrYWritable. See VectorOrPrefWritable for instance.
>
> The Reducer can then check each value to pick out an X or a Y and get both.
>
>
> In some cases you have to know the ordering, whether you'll get an X
> or Y first. In this case you need some cleverness with the key.
> Instead of a VarLongWritable for a key, you need something like
> "EntityJoinKey" which contains a long value (the ID) but also a
> boolean or integer that indicates an ordering. Maybe it adds a boolean
> called "before".
>
> It needs to implement WritableComparable and order by the ID value,
> but then by the before/after flag.
> It also needs to specify a Partitioner which maps keys to the same
> reducer if they have the same ID, regardless of before/after flag.
>
> This is fairly convenient because you have a clearer picture of which
> values are coming in on "before" keys and then which are coming after.
>
>
> It's definitely more complex, but it's doable.
>

Re: [jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Sean Owen <sr...@gmail.com>.
I think you'll have to push that to 1.0 for now, then; 0.20.x doesn't
have map-side joins. Yes that is a blocker for what you're trying to
do and what Sebastian is trying to do for recommendations. I've
already reimplemented recommenders separately with these things and it
simplifies and speeds up the pipeline.


I'd be more against sticking to 0.20.x except that there's already
evidently some issue even getting *on* to 0.20.x in the code, which is
more important to address. And the jump to 0.21.x is a moderate
increase in functionality. To take advantage of it still requires
rewriting everything. Maybe we should wait for an even bigger leap
forward to rewrite everything.


Here's a summary of my recipe for dealing with this in 0.20.x.

First, while you can't have multiple mappers, you can have multiple
input paths. So, you can join two different inputs keyed by the same
keys without trouble, typically with an identity Mapper. Of course,
they have to have the same value class. This is a problem if you want
to join Xs and Ys keyed by the same key.
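
In code, the multiple-paths part is just this (new-API sketch; the paths are
whatever the X and Y jobs wrote):

    // One job, one (identity) mapper, two input directories.
    FileInputFormat.addInputPath(job, xOutputPath);
    FileInputFormat.addInputPath(job, yOutputPath);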

One solution is to create an "XOrYWritable" which holds either an X or
a Y. Then the jobs that output an X or a Y both output one same value
type, XOrYWritable. See VectorOrPrefWritable for instance.

The Reducer can then check each value to pick out an X or a Y and get both.
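
A minimal sketch of that tagged-union value (illustrative; X and Y stand in
for whatever Writable types are being joined):

    // Holds either an X or a Y, with a flag saying which.
    public class XOrYWritable implements Writable {
      private boolean isX;
      private X x;  // hypothetical Writable type
      private Y y;  // hypothetical Writable type

      public void write(DataOutput out) throws IOException {
        out.writeBoolean(isX);
        if (isX) { x.write(out); } else { y.write(out); }
      }

      public void readFields(DataInput in) throws IOException {
        isX = in.readBoolean();
        if (isX) { x = new X(); x.readFields(in); }
        else     { y = new Y(); y.readFields(in); }
      }
    }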


In some cases you have to know the ordering, whether you'll get an X
or Y first. In this case you need some cleverness with the key.
Instead of a VarLongWritable for a key, you need something like
"EntityJoinKey" which contains a long value (the ID) but also a
boolean or integer that indicates an ordering. Maybe it adds a boolean
called "before".

It needs to implement WritableComparable and order by the ID value,
but then by the before/after flag.
It also needs to specify a Partitioner which maps keys to the same
reducer if they have the same ID, regardless of before/after flag.

This is fairly convenient because you have a clearer picture of which
values are coming in on "before" keys and then which are coming after.


It's definitely more complex, but it's doable.
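
To make that concrete, the key and partitioner might look like this (a sketch
using the hypothetical names above; in practice you'd typically also set a
grouping comparator that compares the ID alone, so one reduce call sees both
sides):

    // Composite key: sorts by ID, then puts "before" keys first.
    public class EntityJoinKey implements WritableComparable<EntityJoinKey> {
      private long id;
      private boolean before;

      public long getId() { return id; }

      public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeBoolean(before);
      }

      public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        before = in.readBoolean();
      }

      public int compareTo(EntityJoinKey other) {
        if (id != other.id) { return id < other.id ? -1 : 1; }
        if (before == other.before) { return 0; }
        return before ? -1 : 1;  // "before" values sort ahead
      }
    }

    // Routes keys by ID only, so both flags of an ID reach the same reducer.
    public class EntityJoinPartitioner extends Partitioner<EntityJoinKey, Writable> {
      @Override
      public int getPartition(EntityJoinKey key, Writable value, int numPartitions) {
        return (int) ((key.getId() & Long.MAX_VALUE) % numPartitions);
      }
    }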



On Sun, May 22, 2011 at 8:20 PM, Shannon Quinn <sq...@gatech.edu> wrote:
> What did you have in mind, then, for making matrix multiplication work
> without map-side joins (or at least, in the simple format available in
> 0.18)?

Re: [jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Shannon Quinn <sq...@gatech.edu>.
What did you have in mind, then, for making matrix multiplication work 
without map-side joins (or at least, in the simple format available in 
0.18)?

On 5/22/2011 3:16 PM, Sean Owen wrote:
> Okay, let's stick to 0.20.x for now. I will push the JIRAs that concern
> upgrading out to another release.
>
> However I do think we still need to get onto 0.20.203, and off of the
> old deprecated APIs, by the next release.
>
> On Sun, May 22, 2011 at 7:04 PM, Grant Ingersoll<gs...@apache.org>  wrote:
>> The release notes for 0.21 weren't exactly inspirational when it comes to adoption:
>> "It has not undergone testing at scale and should not be considered stable or suitable for production." - -- http://hadoop.apache.org/common/releases.html
>>
>> -G
>>


Re: [jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Sean Owen <sr...@gmail.com>.
Okay, let's stick to 0.20.x for now. I will push the JIRAs that concern
upgrading out to another release.

However I do think we still need to get onto 0.20.203, and off of the
old deprecated APIs, by the next release.

On Sun, May 22, 2011 at 7:04 PM, Grant Ingersoll <gs...@apache.org> wrote:
> The release notes for 0.21 weren't exactly inspirational when it comes to adoption:
> "It has not undergone testing at scale and should not be considered stable or suitable for production." - -- http://hadoop.apache.org/common/releases.html
>
> -G
>

Re: [jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Grant Ingersoll <gs...@apache.org>.
The release notes for 0.21 weren't exactly inspirational when it comes to adoption: 
"It has not undergone testing at scale and should not be considered stable or suitable for production." - -- http://hadoop.apache.org/common/releases.html

-G

On May 21, 2011, at 2:43 PM, Ted Dunning wrote:

> StumbleUpon and TrendMicro are on 0.20, I think.
> 
> Yahoo might have some 0.21 stuff going.
> 
> FB is 0.20 for the hbase stuff.
> 
> On Sat, May 21, 2011 at 2:39 PM, Jake Mannix <ja...@gmail.com> wrote:
> 
>> On Sat, May 21, 2011 at 2:16 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> 
>>> Actually, I don't know of many who are even using 0.21.  Pretty much
>>> everybody I know is using 0.20.3
>>> 
>> 
>> Yep, I think we're actually on some 0.20.1 variant, still.  And isn't
>> Facebook
>> on 0.20-append (or some variation thereof)?  I don't know anyone big on 0.21
>> or higher.
>> 
>> On Sat, May 21, 2011 at 1:44 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>>>> Unless, of course, everyone has been using 0.21 or even 0.22 as you
>> have
>>> :)
>>>> 
>>> 
>> 
>>  -jake
>> 



Re: [jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Ted Dunning <te...@gmail.com>.
StumbleUpon and TrendMicro are on 0.20, I think.

Yahoo might have some 0.21 stuff going.

FB is 0.20 for the hbase stuff.

On Sat, May 21, 2011 at 2:39 PM, Jake Mannix <ja...@gmail.com> wrote:

> On Sat, May 21, 2011 at 2:16 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Actually, I don't know of many who are even using 0.21.  Pretty much
> > everybody I know is using 0.20.3
> >
>
> Yep, I think we're actually on some 0.20.1 variant, still.  And isn't
> Facebook
> on 0.20-append (or some variation thereof)?  I don't know anyone big on 0.21
> or higher.
>
> On Sat, May 21, 2011 at 1:44 PM, Shannon Quinn <sq...@gatech.edu> wrote:
> > > Unless, of course, everyone has been using 0.21 or even 0.22 as you
> have
> > :)
> > >
> >
>
>   -jake
>

Re: [jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Jake Mannix <ja...@gmail.com>.
On Sat, May 21, 2011 at 2:16 PM, Ted Dunning <te...@gmail.com> wrote:

> Actually, I don't know of many who are even using 0.21.  Pretty much
> everybody I know is using 0.20.3
>

Yep, I think we're actually on some 0.20.1 variant, still.  And isn't
Facebook
on 0.20-append (or some variation thereof)?  I don't know anyone big on 0.21
or higher.

On Sat, May 21, 2011 at 1:44 PM, Shannon Quinn <sq...@gatech.edu> wrote:
> > Unless, of course, everyone has been using 0.21 or even 0.22 as you have
> :)
> >
>

  -jake

Re: [jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Ted Dunning <te...@gmail.com>.
Actually, I don't know of many who are even using 0.21.  Pretty much
everybody I know is using 0.20.3

On Sat, May 21, 2011 at 1:44 PM, Shannon Quinn <sq...@gatech.edu> wrote:

> Unless, of course, everyone has been using 0.21 or even 0.22 as you have :)
>

Re: [jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Sean Owen <sr...@gmail.com>.
I suggest that everything move to 0.21. It is not 100% API compatible,
either way, so you can't support both at the same time.

Anything that isn't updated (the Bayes code is the notorious culprit) will
be deprecated and removed IMHO.

Sean

On Sat, May 21, 2011 at 9:44 PM, Shannon Quinn <sq...@gatech.edu> wrote:

> More than happy to do so; the only caveat is that we're effectively
> bringing
> DistributedRowMatrix up from 0.18 to 0.21, while the rest of Mahout is at
> 0.20. From what I can tell, 0.21 doesn't really remove anything, so the
> migration should be fairly painless...but if this is a Mahout-wide goal for
> a 0.6 release then we may want to create a new issue for that;
> DistributedRowMatrix is just one cog of the entire wheel.
>
> Unless, of course, everyone has been using 0.21 or even 0.22 as you have :)
>
> On Sat, May 21, 2011 at 2:52 AM, Sean Owen (JIRA) <ji...@apache.org> wrote:
>
> >
> >    [
> >
> https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037295#comment-13037295
> ]
> >
> > Sean Owen commented on MAHOUT-537:
> > ----------------------------------
> >
> > I could, though honestly, I think the better solution at this point is to
> > move to Hadoop 0.21 as part of the next release. It is the current
> release
> > and nearly superseded by 0.22. It has some features we need to move
> forward.
> > It is closer to what many are using in CDH3/4. The only drawback I see is
> > that Amazon EMR is on 0.20.2. However we're releasing 0.5 now for 0.20.2.
> > And it is 6 months until we would put out a release needing 0.21, after
> > which time I imagine 0.22 is out and EMR makes available 0.21 -- or if it
> > doesn't, we'll have to leave that support behind.
> >
> > So let me open an item for that, and I suggest you can proceed using 0.21
> > features here.
> > (That is what I am doing for personal projects and it really simplified
> > things. I'm on 0.22 now myself.)
> >
> > > Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> > > -------------------------------------------------------------
> > >
> > >                 Key: MAHOUT-537
> > >                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
> > >             Project: Mahout
> > >          Issue Type: Improvement
> > >          Components: Math
> > >    Affects Versions: 0.4, 0.5
> > >            Reporter: Shannon Quinn
> > >            Assignee: Shannon Quinn
> > >             Fix For: 0.6
> > >
> > >         Attachments: MAHOUT-537.patch, MAHOUT-537.patch,
> > MAHOUT-537.patch, MAHOUT-537.patch
> > >
> > >
> > > Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2
> > API, in particular eliminate dependence on the deprecated JobConf, using
> > instead the separate Job and Configuration objects.
> >
> > --
> > This message is automatically generated by JIRA.
> > For more information on JIRA, see:
> http://www.atlassian.com/software/jira
> >
>

Re: [jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Shannon Quinn <sq...@gatech.edu>.
More than happy to do so; the only caveat is that we're effectively bringing
DistributedRowMatrix up from 0.18 to 0.21, while the rest of Mahout is at
0.20. From what I can tell, 0.21 doesn't really remove anything, so the
migration should be fairly painless...but if this is a Mahout-wide goal for
a 0.6 release then we may want to create a new issue for that;
DistributedRowMatrix is just one cog of the entire wheel.

Unless, of course, everyone has been using 0.21 or even 0.22 as you have :)

On Sat, May 21, 2011 at 2:52 AM, Sean Owen (JIRA) <ji...@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037295#comment-13037295]
>
> Sean Owen commented on MAHOUT-537:
> ----------------------------------
>
> I could, though honestly, I think the better solution at this point is to
> move to Hadoop 0.21 as part of the next release. It is the current release
> and nearly superseded by 0.22. It has some features we need to move forward.
> It is closer to what many are using in CDH3/4. The only drawback I see is
> that Amazon EMR is on 0.20.2. However we're releasing 0.5 now for 0.20.2.
> And it is 6 months until we would put out a release needing 0.21, after
> which time I imagine 0.22 is out and EMR makes available 0.21 -- or if it
> doesn't, we'll have to leave that support behind.
>
> So let me open an item for that, and I suggest you can proceed using 0.21
> features here.
> (That is what I am doing for personal projects and it really simplified
> things. I'm on 0.22 now myself.)
>
> > Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> > -------------------------------------------------------------
> >
> >                 Key: MAHOUT-537
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
> >             Project: Mahout
> >          Issue Type: Improvement
> >          Components: Math
> >    Affects Versions: 0.4, 0.5
> >            Reporter: Shannon Quinn
> >            Assignee: Shannon Quinn
> >             Fix For: 0.6
> >
> >         Attachments: MAHOUT-537.patch, MAHOUT-537.patch,
> MAHOUT-537.patch, MAHOUT-537.patch
> >
> >
> > Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2
> API, in particular eliminate dependence on the deprecated JobConf, using
> instead the separate Job and Configuration objects.
>
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>

[jira] [Updated] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shannon Quinn updated MAHOUT-537:
---------------------------------

    Attachment: MAHOUT-537_hack.patch

Ok, this is absolutely a total hack job, but I wanted to see if it would work: taking the 0.21 mapreduce.lib.join* package, tweaking it slightly to make it 0.20-compatible, and installing it directly in Mahout to make DistributedRowMatrix 0.20-compliant.

It and the associated tests compile, but I've run into a problem of failing tests, the cause of which seems to be that it won't write files to DistributedCache, HDFS, etc. I tried writing to DistributedCache and immediately reading it back--which worked fine--but otherwise I'm stuck and could use some help.

If this isn't an avenue worth pursuing, that's also fine. I had the idea and wanted to give it a shot before throwing in the towel and waiting for 0.22.
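
(For reference, the round trip described above is essentially of this shape --
an illustrative sketch, not the patch code; the path is made up:)

    // Driver side: register a file before submitting the job.
    Configuration conf = new Configuration();
    DistributedCache.addCacheFile(new Path("/tmp/drm/part-00000").toUri(), conf);

    // Mapper side, e.g. in setup(): locate the local copies.
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());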

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>             Fix For: 0.6
>
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537_hack.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058246#comment-13058246 ] 

Sean Owen commented on MAHOUT-537:
----------------------------------

I think it's a great effort. Looks like you had to copy in a number of Hadoop classes and still are facing some problems. It may be a hard road to go down. 

We're "officially" on 0.20.203.0 at the moment, and in my mind, the essence of this issue is seeing if there's any way to use the .mapreduce. rather than deprecated .mapred. APIs in 0.20.x at least, or, reuse more of AbstractJob for consistency (improving it as needed).

Do you see any scope for those types of changes? If the informed opinion is just that this isn't going to be meaningfully possible, I say close this issue.
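
For reference, the AbstractJob pattern being suggested looks roughly like this
(a sketch against the 0.5-era class; MyMapper/MyReducer are placeholders, and
the exact prepareJob signature should be checked):

    // Rough shape of a Mahout job built on AbstractJob.
    public class MyDrmJob extends AbstractJob {
      @Override
      public int run(String[] args) throws Exception {
        addInputOption();
        addOutputOption();
        if (parseArguments(args) == null) {
          return -1;  // bad arguments; usage was printed
        }
        Job job = prepareJob(getInputPath(), getOutputPath(),
            SequenceFileInputFormat.class,
            MyMapper.class, IntWritable.class, VectorWritable.class,
            MyReducer.class, IntWritable.class, VectorWritable.class,
            SequenceFileOutputFormat.class);
        return job.waitForCompletion(true) ? 0 : 1;
      }

      public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new MyDrmJob(), args);
      }
    }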

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>             Fix For: 0.6
>
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537_hack.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057945#comment-13057945 ] 

Shannon Quinn edited comment on MAHOUT-537 at 6/30/11 5:39 PM:
---------------------------------------------------------------

Ok, this is absolutely a total hack job, but I wanted to see if it would work: taking the 0.21 mapreduce.lib.join* package, tweaking it slightly to make it 0.20-compatible, and installing it directly in Mahout to make DistributedRowMatrix 0.20-compliant.

It and the associated tests compile, but I've run into a problem of failing tests, the cause of which seems to be that it won't write files to DistributedCache, HDFS, etc. I tried writing to DistributedCache and immediately reading it back, which worked fine, but that didn't exactly inform me as to why it can't be read within the Mapper. So otherwise I'm stuck and could use some help.

If this isn't an avenue worth pursuing, that's also fine. I had the idea and wanted to give it a shot before throwing in the towel and waiting for 0.22.

      was (Author: magsol):
    Ok, this is absolutely a total hack job, but I wanted to see if it would work: taking the 0.21 mapreduce.lib.join* package, tweaking it slightly to make it 0.20-compatible, and installing it directly in Mahout to make DistributedRowMatrix 0.20-compliant.

It and the associated tests compile, but I've run into a problem of failing tests, the cause of which seems to be that it won't write files to DistributedCache, HDFS, etc. I tried writing to DistributedCache and immediately reading it back, which worked fine, but otherwise I'm stuck and could use some help.

If this isn't an avenue worth pursuing, that's also fine. I had the idea and wanted to give it a shot before throwing in the towel and waiting for 0.22.
  
> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>             Fix For: 0.6
>
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537_hack.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044018#comment-13044018 ] 

Sean Owen commented on MAHOUT-537:
----------------------------------

The people have spoken! Forget 0.21. At best I think we will wait and see on 0.22. 

I think the much larger concern in my mind is being as consistent as possible across the project in how implementations are approached. Within 0.20.x we can stand to be more consistent -- ideally, not using the deprecated APIs, but using them if there's a very good reason.

The clustering code, for example, is still needlessly different, and is going to be deprecated as a result.

So I think the outcome from this issue is... try to make it as similar to the other M/R jobs in the project as possible? Most everything tries to use the imperfect AbstractJob thing, which is a good rallying point. I am not sure how realistic that is here, but it would be great to standardize more.

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>             Fix For: 0.6
>
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-537.
------------------------------

       Resolution: Won't Fix
    Fix Version/s:     (was: 0.6)

Likewise, I think it's time to give up on this one. It doesn't seem too feasible to change this code.

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537_hack.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: [jira] Updated: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Shannon Quinn <sq...@gatech.edu>.
Ah, and the reason I never encountered this problem in my own code is
that I've been dealing exclusively with symmetric matrices...thanks very
much for the primer, I'll need it in my continuing work!

On 1/6/11 9:11 PM, Jake Mannix wrote:
> Hey Shannon,
>
>    I'm replying via phone, so apologies in advance for brevity:
>
>    If you have a DRM (A) which is n rows by m columns, and another DRM (B)
> which is m rows by p columns, there is *no single method* on DRM which
> computes A*B (a sensible matrix with n rows by p columns).  To compute this,
> you would run A.transpose().times(B).
>
>    On the other hand, if you already have a matrix (call it At) with m rows
> by n columns, then At.times(B) will compute a matrix with n rows and p
> columns in one method call (and one MR pass) whose entries are exactly the
> same as taking the true matrix multiplication of the transpose of At times
> B.
>
>    Any time you use DRM.times(), you are required to have both DRM instances
> have the same number of rows (*not* number of columns of the first equals
> the number of rows of the second).  In fact, as Dmitriy points out, they have
> to have the same number of InputSplits as well (which is easily achieved by
> having both be created in MR jobs with the same # of reducers).
>
>    -jake
>
> On Jan 6, 2011 1:53 PM, "Shannon Quinn"<sq...@gatech.edu>  wrote:
>
>>    Matrix A has N rows (each of which has cardinality M_A), and Matrix B
>> has N rows (each of whi...
> I suppose this is where I get confused. I thought, by definition, matrix A
> has dimensions (n by m), and matrix B has dimensions (m by p), and the
> resulting matrix is (n by p). I saw in the implementation that it cleverly
> uses the transpose of A such that just the row vectors are needed, but my
> confusion comes from the fact that I don't see an explicit transpose before
> the times() job gets going.
>
> So, in a toy example, A = [3 by 2], B = [2 by 2], it looks to me as if the
> three rows of A are being sent to the MR job with the two rows of B, which
> doesn't make any sense. I know there should be a transpose of A somewhere
> but I don't see it.
>
> Unless the assumption is that the user calls transpose() before calling
> times()? Which doesn't make any sense either since I've used this job just
> fine. I know I'm missing something simple...thanks for your help.
>
> Also: I'll shelve the general DRM rewrite patch, then, for the time being.
> You make good points, and there are other patches I should work on in the
> meantime :) (though I could just experiment with 0.21 to see how well that
> works)
>
> Shannon
>
>>    There are thus N pairs of vectors {A_i, B_i}, and if you take
>> MatrixSum_{i=1,N} (A_i^T x B_i...
>


Re: [jira] Updated: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Jake Mannix <ja...@gmail.com>.
Hey Shannon,

  I'm replying via phone, so apologies in advance for brevity:

  If you have a DRM (A) which is n rows by m columns, and another DRM (B)
which is m rows by p columns, there is *no single method* on DRM which
computes A*B (a sensible matrix with n rows by p columns).  To compute this,
you would run A.transpose().times(B).

  On the other hand, if you already have a matrix (call it At) with m rows
by n columns, then At.times(B) will compute a matrix with n rows and p
columns in one method call (and one MR pass) whose entries are exactly the
same as taking the true matrix multiplication of the transpose of At times
B.

  Any time you use DRM.times(), you are required to have both DRM instances
have the same number of rows (*not* number of columns of the first equals
the number of rows of the second).  In fact, as Dmitriy points out, they have
to have the same number of InputSplits as well (which is easily achieved by
having both be created in MR jobs with the same # of reducers).
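
In code, that contract reads something like this (a sketch; constructor
arguments abbreviated, and check the exact signatures for your version):

    // Both inputs must have the same number of rows (and InputSplits).
    // a is numRows x mA, b is numRows x mB.
    DistributedRowMatrix a = new DistributedRowMatrix(aPath, tmpPath, numRows, mA);
    DistributedRowMatrix b = new DistributedRowMatrix(bPath, tmpPath, numRows, mB);
    a.setConf(conf);
    b.setConf(conf);
    DistributedRowMatrix aTb = a.times(b);  // mA x mB result, equal to a^T * b
    // For a true A*B with A (n x m) and B (m x p), run A.transpose().times(B).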

  -jake

On Jan 6, 2011 1:53 PM, "Shannon Quinn" <sq...@gatech.edu> wrote:

>   Matrix A has N rows (each of which has cardinality M_A), and Matrix B
> has N rows (each of whi...
I suppose this is where I get confused. I thought, by definition, matrix A
has dimensions (n by m), and matrix B has dimensions (m by p), and the
resulting matrix is (n by p). I saw in the implementation that it cleverly
uses the transpose of A such that just the row vectors are needed, but my
confusion comes from the fact that I don't see an explicit transpose before
the times() job gets going.

So, in a toy example, A = [3 by 2], B = [2 by 2], it looks to me as if the
three rows of A are being sent to the MR job with the two rows of B, which
doesn't make any sense. I know there should be a transpose of A somewhere
but I don't see it.

Unless the assumption is that the user calls transpose() before calling
times()? Which doesn't make any sense either since I've used this job just
fine. I know I'm missing something simple...thanks for your help.

Also: I'll shelve the general DRM rewrite patch, then, for the time being.
You make good points, and there are other patches I should work on in the
meantime :) (though I could just experiment with 0.21 to see how well that
works)

Shannon

>   There are thus N pairs of vectors {A_i, B_i}, and if you take
> MatrixSum_{i=1,N} (A_i^T x B_i...

Re: [jira] Updated: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Shannon Quinn <sq...@gatech.edu>.
>    Matrix A has N rows (each of which has cardinality M_A), and Matrix B has
> N rows (each of which has cardinality M_B).
I suppose this is where I get confused. I thought, by definition, matrix 
A has dimensions (n by m), and matrix B has dimensions (m by p), and the 
resulting matrix is (n by p). I saw in the implementation that it 
cleverly uses the transpose of A such that just the row vectors are 
needed, but my confusion comes from the fact that I don't see an 
explicit transpose before the times() job gets going.

So, in a toy example, A = [3 by 2], B = [2 by 2], it looks to me as if 
the three rows of A are being sent to the MR job with the two rows of B, 
which doesn't make any sense. I know there should be a transpose of A 
somewhere but I don't see it.

Unless the assumption is that the user calls transpose() before calling 
times()? Which doesn't make any sense either since I've used this job 
just fine. I know I'm missing something simple...thanks for your help.

Also: I'll shelve the general DRM rewrite patch, then, for the time 
being. You make good points, and there are other patches I should work 
on in the meantime :) (though I could just experiment with 0.21 to see 
how well that works)

Shannon

>    There are thus N pairs of
> vectors {A_i, B_i}, and if you take MatrixSum_{i=1,N} (A_i^T x B_i), you get
> a matrix with M_A rows, each of which has cardinality M_B, and this matrix
> is exactly A^T * B.
>
> *You take the transpose on the vectors, row at a time*, from the first of
> the two matrices.
>
>    -jake
>
>
>> I want to understand this little bit so I adequately replicate it in the
>> new patch. Thanks!
>>
>> Shannon
>>
>> Apologies for the brevity, this was sent from my iPhone
>>
>> On Dec 29, 2010, at 1:06, "Shannon Quinn (JIRA)"<ji...@apache.org>  wrote:
>>
>>>      [
>> https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>>> Shannon Quinn updated MAHOUT-537:
>>> ---------------------------------
>>>
>>>     Attachment: MAHOUT-537.patch
>>>
>>> Updated patch. Fixes from previous patch are included, this time merged
>> with unrelated changes to the related files. Also removed all the
>> commented-out old code, and even caught and fixed a few bugs. Fully
>> implemented timesSquared(). All that remains is the times(DRM) job. Will
>> update on this very soon.
>>> (regarding the previous comments on this ticket: I'm using Hadoop 0.20.2)
>>>
>>>> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
>>>> -------------------------------------------------------------
>>>>
>>>>                 Key: MAHOUT-537
>>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>>>>             Project: Mahout
>>>>          Issue Type: Improvement
>>>>    Affects Versions: 0.4
>>>>            Reporter: Shannon Quinn
>>>>            Assignee: Shannon Quinn
>>>>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch
>>>>
>>>>
>>>> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2
>> API, in particular eliminate dependence on the deprecated JobConf, using
>> instead the separate Job and Configuration objects.
>>> --
>>> This message is automatically generated by JIRA.
>>> -
>>> You can reply to this email to add a comment to the issue online.
>>>


Re: [jira] Updated: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Jake Mannix <ja...@gmail.com>.
Hi Shannon, sorry to have been absent too much in this thread!

On Thu, Dec 30, 2010 at 2:16 PM, Shannon Quinn <sq...@gatech.edu> wrote:

> I'm just about finished with this patch (though I'm road tripping at the
> moment), but I wanted to seek some clarification on the mechanics behind
> DRM's matrix multiplication.
>
> I see upon closer inspection that what is actually used is the transpose of
> the multiplicand (matrix A^T in A*B), thereby using only matrix rows (how
> DRMs are organized across HDFS). However, I didn't see any explicit
> transpose operation within the times() method. How is this carried out?
>

The transpose operation is a side effect of the fact that a DRM just
consists of a list of vectors, which you could view either as a row-based
matrix or as a column-based matrix.  The matrix multiplication works like so:

  Matrix A has N rows (each of which has cardinality M_A), and Matrix B has
N rows (each of which has cardinality M_B).  There are thus N pairs of
vectors {A_i, B_i}, and if you take MatrixSum_{i=1,N} (A_i^T x B_i), you get
a matrix with M_A rows, each of which has cardinality M_B, and this matrix
is exactly A^T * B.

*You take the transpose on the vectors, row at a time*, from the first of
the two matrices.
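
For illustration, a minimal sketch of that outer-product accumulation in
plain Java (bare arrays just to show the arithmetic; the real job works
on Mahout Vectors, and none of this is lifted from the actual code):

    // C = A^T * B as a running sum of outer products of row pairs.
    // A is n x mA and B is n x mB, so the result C is mA x mB.
    static double[][] transposeTimes(double[][] a, double[][] b) {
      int n = a.length;                  // shared row count
      int mA = a[0].length;
      int mB = b[0].length;
      double[][] c = new double[mA][mB];
      for (int i = 0; i < n; i++) {      // one row pair {A_i, B_i} per step
        for (int j = 0; j < mA; j++) {   // outer product A_i^T x B_i,
          for (int k = 0; k < mB; k++) { // accumulated straight into C
            c[j][k] += a[i][j] * b[i][k];
          }
        }
      }
      return c;
    }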

  -jake


> I want to understand this little bit so I adequately replicate it in the
> new patch. Thanks!
>
> Shannon
>
> Apologies for the brevity, this was sent from my iPhone
>
> On Dec 29, 2010, at 1:06, "Shannon Quinn (JIRA)" <ji...@apache.org> wrote:
>
> >
> >     [
> https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
> >
> > Shannon Quinn updated MAHOUT-537:
> > ---------------------------------
> >
> >    Attachment: MAHOUT-537.patch
> >
> > Updated patch. Fixes from previous patch are included, this time merged
> with unrelated changes to the related files. Also removed all the
> commented-out old code, and even caught and fixed a few bugs. Fully
> implemented timesSquared(). All that remains is the times(DRM) job. Will
> update on this very soon.
> >
> > (regarding the previous comments on this ticket: I'm using Hadoop 0.20.2)
> >
> >> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> >> -------------------------------------------------------------
> >>
> >>                Key: MAHOUT-537
> >>                URL: https://issues.apache.org/jira/browse/MAHOUT-537
> >>            Project: Mahout
> >>         Issue Type: Improvement
> >>   Affects Versions: 0.4
> >>           Reporter: Shannon Quinn
> >>           Assignee: Shannon Quinn
> >>        Attachments: MAHOUT-537.patch, MAHOUT-537.patch
> >>
> >>
> >> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2
> API, in particular eliminate dependence on the deprecated JobConf, using
> instead the separate Job and Configuration objects.
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
>

Re: [jira] Updated: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Shannon Quinn <sq...@gatech.edu>.
I'm just about finished with this patch (though I'm road tripping at the moment), but I wanted to seek some clarification on the mechanics behind DRM's matrix multiplication. 

I see upon closer inspection that what is actually used is the transpose of the multiplicand (matrix A^T in A*B), thereby using only matrix rows (how DRMs are organized across HDFS). However, I didn't see any explicit transpose operation within the times() method. How is this carried out?

I want to understand this little bit so I adequately replicate it in the new patch. Thanks!

Shannon

Apologies for the brevity, this was sent from my iPhone

On Dec 29, 2010, at 1:06, "Shannon Quinn (JIRA)" <ji...@apache.org> wrote:

> 
>     [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Shannon Quinn updated MAHOUT-537:
> ---------------------------------
> 
>    Attachment: MAHOUT-537.patch
> 
> Updated patch. Fixes from previous patch are included, this time merged with unrelated changes to the related files. Also removed all the commented-out old code, and even caught and fixed a few bugs. Fully implemented timesSquared(). All that remains is the times(DRM) job. Will update on this very soon.
> 
> (regarding the previous comments on this ticket: I'm using Hadoop 0.20.2)
> 
>> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
>> -------------------------------------------------------------
>> 
>>                Key: MAHOUT-537
>>                URL: https://issues.apache.org/jira/browse/MAHOUT-537
>>            Project: Mahout
>>         Issue Type: Improvement
>>   Affects Versions: 0.4
>>           Reporter: Shannon Quinn
>>           Assignee: Shannon Quinn
>>        Attachments: MAHOUT-537.patch, MAHOUT-537.patch
>> 
>> 
>> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 

[jira] Updated: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shannon Quinn updated MAHOUT-537:
---------------------------------

    Attachment: MAHOUT-537.patch

Updated patch. Fixes from previous patch are included, this time merged with unrelated changes to the related files. Also removed all the commented-out old code, and even caught and fixed a few bugs. Fully implemented timesSquared(). All that remains is the times(DRM) job. Will update on this very soon.

(regarding the previous comments on this ticket: I'm using Hadoop 0.20.2)

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.4
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] [Issue Comment Edited] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044010#comment-13044010 ] 

Dmitriy Lyubimov edited comment on MAHOUT-537 at 6/3/11 8:14 PM:
-----------------------------------------------------------------

Second Jake. 
bq.  I think the better solution at this point is to move to Hadoop 0.21 as part of the next release. 

-1 on this yet. (If I recollect correctly, Ted had concerns about this move as well.)

At the risk of sounding like a stuck record, nobody I know is using 0.21. 0.21 is not production grade, which was recognized even by the Hadoop team.

It is true that 0.21 is a superset of CDH, but it potentially has stuff CDH doesn't have, so using 0.21 does not guarantee everything will work with CDH, and it almost certainly guarantees nothing will work for bulk stuff on EMR.

We use both EMR and CDH. If you puff up the dependencies, as things are now, it will absolutely preclude us from using further versions of Mahout. I could probably maneuver some code that we use with CDH to verify it still works with CDH, but not en masse. If I really wanted to use some of the migrated algorithms and take advantage of various fixes, I would have to create massive private hacks to keep things working (similar to what Cloudera does), which we probably don't have the capacity to do, *so I'll just have to drop using trunk or future Mahout distributions until better times.*

*I know for sure we will never use 0.21 the way it is released.*

There's probably more hope for the new generation of Hadoop that would combine the ability to run old MR, new MR, or something else. In fact, I am looking forward to porting to and using that future Hadoop generation, as it would allow us to scrap many unnecessary limitations of MR for parallel use that are holding back performance on many algorithms (esp. lin alg algorithms).

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>             Fix For: 0.6
>
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Ted Dunning <te...@gmail.com>.
The top level pom is where all the versions are specified.

On Thu, Nov 18, 2010 at 9:48 AM, Shannon Quinn <sq...@gatech.edu> wrote:

> That's what I found in my perusals; I was just curious whether there was
> also a repository referenced somewhere that indicates where exactly it
> pulls the particular version of Hadoop from.
>
> On Thu, Nov 18, 2010 at 12:09 PM, Sean Owen <sr...@gmail.com> wrote:
>
> > Sure just change this in the top pom.xml file:
> >
> >    <hadoop.version>0.20.2</hadoop.version>
> >
> >
> > On Thu, Nov 18, 2010 at 4:42 PM, Jeff Eastman <je...@narus.com>
> wrote:
> >
> > > (Where is Sean when we need him). I assume you've tried searching all
> the
> > > .pom files for a suitable entry to tweak?
> >
>

Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Shannon Quinn <sq...@gatech.edu>.
That's what I found in my perusals; I was just curious whether there was
also a repository referenced somewhere that indicates where exactly it
pulls the particular version of Hadoop from.

On Thu, Nov 18, 2010 at 12:09 PM, Sean Owen <sr...@gmail.com> wrote:

> Sure just change this in the top pom.xml file:
>
>    <hadoop.version>0.20.2</hadoop.version>
>
>
> On Thu, Nov 18, 2010 at 4:42 PM, Jeff Eastman <je...@narus.com> wrote:
>
> > (Where is Sean when we need him). I assume you've tried searching all the
> > .pom files for a suitable entry to tweak?
>

Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Sean Owen <sr...@gmail.com>.
Sure just change this in the top pom.xml file:

    <hadoop.version>0.20.2</hadoop.version>
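
For completeness: that property lives in the <properties> block and feeds
the Hadoop dependency, roughly like so (a sketch, not a verbatim copy of
Mahout's pom):

    <properties>
      <hadoop.version>0.20.2</hadoop.version>
    </properties>

    <!-- elsewhere in the pom -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>${hadoop.version}</version>
    </dependency>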


On Thu, Nov 18, 2010 at 4:42 PM, Jeff Eastman <je...@narus.com> wrote:

> (Where is Sean when we need him). I assume you've tried searching all the
> .pom files for a suitable entry to tweak?

RE: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Jeff Eastman <je...@Narus.com>.
(Where is Sean when we need him). I assume you've tried searching all the .pom files for a suitable entry to tweak? 

-----Original Message-----
From: Shannon Quinn [mailto:squinn.squinn@gmail.com] On Behalf Of Shannon Quinn
Sent: Wednesday, November 17, 2010 5:49 PM
To: dev@mahout.apache.org
Subject: Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

I do have a backup plan, and I can certainly take that path from here. 
But just to sate my curiosity... :P how would I go about changing the 
Maven build to include a different version of Hadoop?

I'll submit an updated patch (hopefully this weekend) with a more-hacky 
approach to bringing this up to 0.20.2.

Shannon

On 11/17/2010 6:49 PM, Jeff Eastman wrote:
> I understand you just want to experiment, but I think it is unlikely that we will switch to 0.21, given what has been reported about this release. Given this situation, is it possible to stick with the current implementation until the later releases improve?
>
> -----Original Message-----
> From: Shannon Quinn [mailto:squinn.squinn@gmail.com] On Behalf Of Shannon Quinn
> Sent: Wednesday, November 17, 2010 3:54 AM
> To: dev@mahout.apache.org
> Subject: Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
>
> I wanted to keep the old code as a quick-and-dirty reference, but I will delete it on the next patch. Not sure when that will be, as we are nearing the end of the semester...
>
> The unit tests that are failing should be the ones that depend on DRM.times(), as well as any that invoke its constructor (which has changed). Those should be fixed once I finish the conversion of times().
>
> However, I still haven't figured out how to use a different version of Hadoop in my build; Maven pulls version 0.20 in by default and I can't seem to find where to modify this, and even if I could I don't know what I'd modify it to. I understand there's a whole other discussion about whether or not a different version of Hadoop is the answer, but I want to see if this works at all first. If not, I have another (but slightly more hacky) approach I could use. Anyone know how to do this?
>
> Shannon
>
> Apologies for the brevity, this was sent from my iPhone
>
> On Nov 16, 2010, at 23:42, "Jeff Eastman (JIRA)"<ji...@apache.org>  wrote:
>
>>     [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932808#action_12932808 ]
>>
>> Jeff Eastman commented on MAHOUT-537:
>> -------------------------------------
>>
>> Tried out the patch. It applied cleanly and, after adding a throws declaration, compiles. It still has some commented out old code that could be removed but otherwise looks reasonable. You are making progress on a tough problem. This is still a WIP of course and a few unit tests are failing. It will be interesting to hear of your 0.21 experiments.
>>
>>> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
>>> -------------------------------------------------------------
>>>
>>>                 Key: MAHOUT-537
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>>>             Project: Mahout
>>>          Issue Type: Improvement
>>>    Affects Versions: 0.4
>>>            Reporter: Shannon Quinn
>>>            Assignee: Shannon Quinn
>>>         Attachments: MAHOUT-537.patch
>>>
>>>
>>> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.
>> -- 
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>


Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Shannon Quinn <sq...@gatech.edu>.
I do have a backup plan, and I can certainly take that path from here. 
But just to sate my curiosity... :P how would I go about changing the 
Maven build to include a different version of Hadoop?

I'll submit an updated patch (hopefully this weekend) with a more-hacky 
approach to bringing this up to 0.20.2.

Shannon

On 11/17/2010 6:49 PM, Jeff Eastman wrote:
> I understand you just want to experiment, but I think it is unlikely that we will switch to 0.21, given what has been reported about this release. Given this situation, is it possible to stick with the current implementation until the later releases improve?
>
> -----Original Message-----
> From: Shannon Quinn [mailto:squinn.squinn@gmail.com] On Behalf Of Shannon Quinn
> Sent: Wednesday, November 17, 2010 3:54 AM
> To: dev@mahout.apache.org
> Subject: Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
>
> I wanted to keep the old code as a quick-and-dirty reference, but I will delete it on the next patch. Not sure when that will be, as we are nearing the end of the semester...
>
> The unit tests that are failing should be the ones that depend on DRM.times(), as well as any that invoke its constructor (which has changed). Those should be fixed once I finish the conversion of times().
>
> However, I still haven't figured out how to use a different version of Hadoop in my build; Maven pulls version 0.20 in by default and I can't seem to find where to modify this, and even if I could I don't know what I'd modify it to. I understand there's a whole other discussion about whether or not a different version of Hadoop is the answer, but I want to see if this works at all first. If not, I have another (but slightly more hacky) approach I could use. Anyone know how to do this?
>
> Shannon
>
> Apologies for the brevity, this was sent from my iPhone
>
> On Nov 16, 2010, at 23:42, "Jeff Eastman (JIRA)"<ji...@apache.org>  wrote:
>
>>     [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932808#action_12932808 ]
>>
>> Jeff Eastman commented on MAHOUT-537:
>> -------------------------------------
>>
>> Tried out the patch. It applied cleanly and, after adding a throws declaration, compiles. It still has some commented out old code that could be removed but otherwise looks reasonable. You are making progress on a tough problem. This is still a WIP of course and a few unit tests are failing. It will be interesting to hear of your 0.21 experiments.
>>
>>> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
>>> -------------------------------------------------------------
>>>
>>>                 Key: MAHOUT-537
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>>>             Project: Mahout
>>>          Issue Type: Improvement
>>>    Affects Versions: 0.4
>>>            Reporter: Shannon Quinn
>>>            Assignee: Shannon Quinn
>>>         Attachments: MAHOUT-537.patch
>>>
>>>
>>> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.
>> -- 
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>


RE: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Jeff Eastman <je...@Narus.com>.
I understand you just want to experiment, but I think it is unlikely that we will switch to 0.21, given what has been reported about this release. Given this situation, is it possible to stick with the current implementation until the later releases improve?

-----Original Message-----
From: Shannon Quinn [mailto:squinn.squinn@gmail.com] On Behalf Of Shannon Quinn
Sent: Wednesday, November 17, 2010 3:54 AM
To: dev@mahout.apache.org
Subject: Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

I wanted to keep the old code as a quick-and-dirty reference, but I will delete it on the next patch. Not sure when that will be, as we are nearing the end of the semester...

The unit tests that are failing should be the ones that depend on DRM.times(), as well as any that invoke its constructor (which has changed). Those should be fixed once I finish the conversion of times(). 

However, I still haven't figured out how to use a different version of Hadoop in my build; Maven pulls version 0.20 in by default and I can't seem to find where to modify this, and even if I could I don't know what I'd modify it to. I understand there's a whole other discussion about whether or not a different version of Hadoop is the answer, but I want to see if this works at all first. If not, I have another (but slightly more hacky) approach I could use. Anyone know how to do this?

Shannon

Apologies for the brevity, this was sent from my iPhone

On Nov 16, 2010, at 23:42, "Jeff Eastman (JIRA)" <ji...@apache.org> wrote:

> 
>    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932808#action_12932808 ] 
> 
> Jeff Eastman commented on MAHOUT-537:
> -------------------------------------
> 
> Tried out the patch. It applied cleanly and, after adding a throws declaration, compiles. It still has some commented out old code that could be removed but otherwise looks reasonable. You are making progress on a tough problem. This is still a WIP of course and a few unit tests are failing. It will be interesting to hear of your 0.21 experiments.
> 
>> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
>> -------------------------------------------------------------
>> 
>>                Key: MAHOUT-537
>>                URL: https://issues.apache.org/jira/browse/MAHOUT-537
>>            Project: Mahout
>>         Issue Type: Improvement
>>   Affects Versions: 0.4
>>           Reporter: Shannon Quinn
>>           Assignee: Shannon Quinn
>>        Attachments: MAHOUT-537.patch
>> 
>> 
>> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 

Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Shannon Quinn <sq...@gatech.edu>.
I wanted to keep the old code as a quick-and-dirty reference, but I will delete it on the next patch. Not sure when that will be, as we are nearing the end of the semester...

The unit tests that are failing should be the ones that depend on DRM.times(), as well as any that invoke its constructor (which has changed). Those should be fixed once I finish the conversion of times(). 

However, I still haven't figured out how to use a different version of Hadoop in my build; Maven pulls version 0.20 in by default and I can't seem to find where to modify this, and even if I could I don't know what I'd modify it to. I understand there's a whole other discussion about whether or not a different version of Hadoop is the answer, but I want to see if this works at all first. If not, I have another (but slightly more hacky) approach I could use. Anyone know how to do this?

Shannon

Apologies for the brevity, this was sent from my iPhone

On Nov 16, 2010, at 23:42, "Jeff Eastman (JIRA)" <ji...@apache.org> wrote:

> 
>    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932808#action_12932808 ] 
> 
> Jeff Eastman commented on MAHOUT-537:
> -------------------------------------
> 
> Tried out the patch. It applied cleanly and, after adding a throws declaration, compiles. It still has some commented out old code that could be removed but otherwise looks reasonable. You are making progress on a tough problem. This is still a WIP of course and a few unit tests are failing. It will be interesting to hear of your 0.21 experiments.
> 
>> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
>> -------------------------------------------------------------
>> 
>>                Key: MAHOUT-537
>>                URL: https://issues.apache.org/jira/browse/MAHOUT-537
>>            Project: Mahout
>>         Issue Type: Improvement
>>   Affects Versions: 0.4
>>           Reporter: Shannon Quinn
>>           Assignee: Shannon Quinn
>>        Attachments: MAHOUT-537.patch
>> 
>> 
>> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 

[jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932808#action_12932808 ] 

Jeff Eastman commented on MAHOUT-537:
-------------------------------------

Tried out the patch. It applied cleanly and, after adding a throws declaration, compiles. It still has some commented out old code that could be removed but otherwise looks reasonable. You are making progress on a tough problem. This is still a WIP of course and a few unit tests are failing. It will be interesting to hear of your 0.21 experiments.

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.4
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>         Attachments: MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shannon Quinn updated MAHOUT-537:
---------------------------------

    Attachment: MAHOUT-537.patch

Matrix-matrix multiplication is now implemented, via somewhat of a hack job with the NamedVector class and corresponding m/r job. This hasn't been tested yet (that is the next step). It does compile, so in theory all that remains is to adjust the DRM unit tests to accommodate Hadoop 0.20.2.

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.4
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044010#comment-13044010 ] 

Dmitriy Lyubimov commented on MAHOUT-537:
-----------------------------------------

Second Jake. -1 on this yet. (If I recollect correctly, Ted had concerns about this move as well.)

At the risk of sounding like a stuck record, nobody I know is using 0.21. 0.21 is not production grade, which was recognized even by the Hadoop team.

It is true that 0.21 is a superset of CDH, but it potentially has stuff CDH doesn't have, so using 0.21 does not guarantee everything will work with CDH, and it almost certainly guarantees nothing will work for bulk stuff on EMR.

We use both EMR and CDH. If you puff up the dependencies, as things are now, it will absolutely preclude us from using further versions of Mahout. I could probably maneuver some code that we use with CDH to verify it still works with CDH, but not en masse. If I really wanted to use some of the migrated algorithms and take advantage of various fixes, I would have to create massive private hacks to keep things working (similar to what Cloudera does), which we probably don't have the capacity to do, *so I'll just have to drop using trunk or future Mahout distributions until better times.*

*I know for sure we will never use 0.21 the way it is released.*

There's probably more hope for the new generation of Hadoop that would combine the ability to run old MR, new MR, or something else. In fact, I am looking forward to porting to and using that future Hadoop generation, as it would allow us to scrap many unnecessary limitations of MR for parallel use that are holding back performance on many algorithms (esp. lin alg algorithms).

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>             Fix For: 0.6
>
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057945#comment-13057945 ] 

Shannon Quinn edited comment on MAHOUT-537 at 6/30/11 5:36 PM:
---------------------------------------------------------------

Ok, this is absolutely a total hack job, but I wanted to see if it would work: taking the 0.21 mapreduce.lib.join* package, tweaking it slightly to make it 0.20-compatible, and installing it directly in Mahout to make DistributedRowMatrix 0.20-compliant.

It and the associated tests compile, but I've run into a problem of failing tests, the cause of which seems to be that it won't write files to DistributedCache, HDFS, etc. I tried writing to DistributedCache and immediately reading it back, which worked fine, but otherwise I'm stuck and could use some help.

If this isn't an avenue worth pursuing, that's also fine. I had the idea and wanted to give it a shot before throwing in the towel and waiting for 0.22.

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>             Fix For: 0.6
>
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537_hack.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037270#comment-13037270 ] 

Shannon Quinn commented on MAHOUT-537:
--------------------------------------

The patch has been ready to go since I posted it, but our original consensus based on the limitations of 0.20 (which haven't changed) is what kept this patch in limbo: namely, that 0.20 conveniently leaves out a crucial data type, the absence of which requires 3 M/R passes to do the matrix-matrix multiplication, whereas 0.18 and 0.21--where this type is present--require only 1 pass.

In your last post, however, you alluded to some cleverness in doing joins and customizing the partitioner that I never did get the details on. Would you mind expounding on that? I scoured through every 0.20 format type and type manager I could find and didn't see anything promising, so your more experienced perspective would be most helpful. 

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>             Fix For: 0.6
>
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058563#comment-13058563 ] 

Shannon Quinn commented on MAHOUT-537:
--------------------------------------

I thought it would be simpler. Granted I know very little of how HDFS works, so I'm not sure what's causing the problems or how to fix it. The fact that nothing written can be read back later (tests come back with 0 values or empty lists, or files simply don't exist where the Configuration says they should) seems like it should be an easy fix, but I don't know where to start.

The next best thing to this approach was to more or less mimic the bare necessities of these dependencies in custom implementations, something I don't have the expertise for just yet. I was hoping this would serve only as a holdover until 0.22+ when the dependencies are officially re-included, but in the meantime would enable us to move entirely off 0.18.

Again, it was just a wild idea and I wanted to see if it would work. Still want to see if it will work, in fact.

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>             Fix For: 0.6
>
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537_hack.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044015#comment-13044015 ] 

Dmitriy Lyubimov commented on MAHOUT-537:
-----------------------------------------

bq. Does 0.21 bring back map-side joins and multiple outputs? We don't use 0.21 in production at Twitter, and I know tons of other places that haven't migrated up yet either.

Yes, I believe it does, but not in a way quite compatible with the old spec, and they are not in CDH.

They also dropped some support for good, I think (such as MultipleOutputFormat), and incorporated those capabilities into MultipleOutputs, which makes some code upgrades a little more intensive than a simple class name change.
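
For reference, the consolidated usage looks roughly like this under the
new API (a sketch against 0.21-era MultipleOutputs; the reducer and the
"extra" output name are made up):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
    import org.apache.mahout.math.VectorWritable;

    public class ExtraOutputReducer
        extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {

      // Declared once at job setup time, elsewhere:
      //   MultipleOutputs.addNamedOutput(job, "extra",
      //       SequenceFileOutputFormat.class, IntWritable.class, VectorWritable.class);

      private MultipleOutputs<IntWritable, VectorWritable> mos;

      @Override
      protected void setup(Context context) {
        mos = new MultipleOutputs<IntWritable, VectorWritable>(context);
      }

      @Override
      protected void reduce(IntWritable key, Iterable<VectorWritable> values,
          Context context) throws IOException, InterruptedException {
        for (VectorWritable value : values) {
          mos.write("extra", key, value);  // routed to the named output
        }
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        mos.close();                       // flush the named outputs
      }
    }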

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>             Fix For: 0.6
>
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037295#comment-13037295 ] 

Sean Owen commented on MAHOUT-537:
----------------------------------

I could, though honestly I think the better solution at this point is to move to Hadoop 0.21 as part of the next release. It is the current release and is nearly superseded by 0.22. It has some features we need to move forward. It is closer to what many are using in CDH3/4. The only drawback I see is that Amazon EMR is on 0.20.2. However, we're releasing 0.5 now for 0.20.2. And it is 6 months until we would put out a release needing 0.21, by which time I imagine 0.22 will be out and EMR will make 0.21 available -- or if it doesn't, we will have to leave that support behind.

So let me open an item for that, and I suggest you can proceed using 0.21 features here.
(That is what I am doing for personal projects and it really simplified things. I'm on 0.22 now myself.)

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>             Fix For: 0.6
>
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037459#comment-13037459 ] 

Jake Mannix commented on MAHOUT-537:
------------------------------------

Does 0.21 bring back map-side joins and multiple outputs?  We don't use 0.21 in production at Twitter, and I know tons of other places that haven't migrated up yet either. 

I think we should probably have a more in-depth discussion about which Hadoop releases we support.

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>             Fix For: 0.6
>
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shannon Quinn updated MAHOUT-537:
---------------------------------

    Attachment: MAHOUT-537.patch

Attached is the patch without the custom Writable I wrote, instead using NamedVector.

It seems (to me) that there are two options for eliminating the two extra M/R tasks I had to create in lieu of the CompositeInputFormat's joins:

1) Have each row of a DistributedRowMatrix labeled when it is first created. Since DRM isn't much more than a glorified wrapper, its constructor can't implement something like this, so this would be infeasible from a scope perspective.
2) Guarantee the ordering of two given rows in the Iterable object of a Combiner/Reducer, so we know one of them belongs to the multiplicand, the other to the multiplier.

Option #2 seems the most technically feasible; however, my limited understanding of the inner workings of Hadoop prevents me from knowing where to start. I've taken a look at Partitioner, RecordReader, and various InputFormats, and they haven't given me any intuition. Any thoughts on how to do this? Or another method entirely?

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.4
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Shannon Quinn <sq...@gatech.edu>.
On 1/3/11 8:30 AM, Sean Owen wrote:
> If you need to control whether B or C comes first it gets tougher
> since you need a custom wrapper key for A. But it's not terrible.
That's the rub. Since this is matrix multiplication, the operation isn't 
commutative, so after the identity mapper I have to know explicitly 
which element is B and which is C, so I can multiply them in the correct 
sequence. That's effectively what I implemented through NamedVector (I 
went through and removed all the references to the writable I wrote and 
used just the NamedVector, thanks for the heads-up on that one), except 
it comes with the cost of having to preprocess both matrices to assign 
the necessary labels, which results in the two additional m/r jobs.
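
Concretely, the tagging amounts to something like this in the 
preprocessing step (a sketch; the row values and labels are made up, and 
the types come from org.apache.mahout.math):

    // Wrap each row with a source label before writing it out;
    // VectorWritable round-trips NamedVector, so no custom Writable
    // is needed.
    Vector row = new DenseVector(new double[] {1.0, 2.0, 3.0});
    NamedVector tagged = new NamedVector(row, "A");    // "B" for the multiplier
    VectorWritable value = new VectorWritable(tagged); // emitted as (rowIndex, value)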

In the implementation I've drawn up, the Mapper is the identity mapper 
you mentioned, and the combiner is just what the mapper was (and is, in 
the current Mahout release). The Reducer is more or less the same.

Also: there's a comment in the javadocs for the current matrix 
multiplication job that says "this.transpose.times(other)", but I don't 
see any explicit call to transpose() in the current job. How does this 
happen?

Shannon

Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Sean Owen <sr...@gmail.com>.
On Sun, Jan 2, 2011 at 8:48 AM, Shannon Quinn <sq...@gatech.edu> wrote:
> Ah, well if VectorWritable can already support this, then there definitely
> isn't any need for another writable. I took a look at VW awhile back and
> didn't see anything that could help; is there some sort of a label I could
> use?

Yep have a look at VectorWritable.write() for example. It does handle
NamedVector.


> Yes, the issue is joins. I'm effectively trying to replace this one line of
> code:
>
>    conf.set("mapred.join.expr", CompositeInputFormat.compose(
>          "inner", SequenceFileInputFormat.class, aPath, bPath));

This may not be 100% what you are talking about, but this is my general recipe.

First you can specify the multiple input paths with
FileInputFormat.setInputPaths().
Say one path has (A,B) and the other has (A,C). You are trying to join
into (A,(B,C)).
What I do is create a "BOrCWritable" which either has a B or a C
inside. Then you need to have already output your input as (A,BOrC) in
both paths. This is the real messy part, but in practice has not been
terrible in the contexts I've needed it.
Then your mapper is an identity mapper and the reducer will receive B
and C for each A, each inside a BOrC.

If you need to control whether B or C comes first it gets tougher
since you need a custom wrapper key for A. But it's not terrible.
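
A bare-bones sketch of such a wrapper (the VectorWritable payload is just
for illustration):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;
    import org.apache.mahout.math.VectorWritable;

    // Tagged union: carries either a B or a C, plus one bit saying which.
    public class BOrCWritable implements Writable {
      private boolean isB;
      private VectorWritable payload = new VectorWritable();

      public void set(boolean isB, VectorWritable payload) {
        this.isB = isB;
        this.payload = payload;
      }

      public boolean isB() { return isB; }
      public VectorWritable get() { return payload; }

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeBoolean(isB);   // provenance bit first...
        payload.write(out);      // ...then the value itself
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        isB = in.readBoolean();
        payload.readFields(in);
      }
    }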

Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Shannon Quinn <sq...@gatech.edu>.
> NamedVector is already supported in VectorWritable, do we need a new Writable?
Ah, well if VectorWritable can already support this, then there 
definitely isn't any need for another writable. I took a look at VW 
awhile back and didn't see anything that could help; is there some sort 
of a label I could use?

> Is the issue that you are doing joins? Without CompositeInputFormat it's still possible, and we use the pattern elsewhere. You need some cleverness with a custom key and partitioner that will send key x from source A and key x from source B to the same reducer while maintaining inside a bit that indicates whether it's from A or B.
>
Yes, the issue is joins. I'm effectively trying to replace this one line 
of code:

     conf.set("mapred.join.expr", CompositeInputFormat.compose(
           "inner", SequenceFileInputFormat.class, aPath, bPath));

If this can be done without CompositeInputFormat, or the partitioner can 
be modified to definitively assign specific/custom keys and values to 
specific nodes, then that would be perfect. Should I look into Hadoop's 
Partitioner/MapPartitioner/MapTask classes for this, or is there 
somewhere else I should look?

Thanks for the feedback!

Shannon

[jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976244#action_12976244 ] 

Sean Owen commented on MAHOUT-537:
----------------------------------

I think this is a great effort. I think it's essential that the project remain attached to 0.20.2 at the moment because I believe many people will want to use it with Amazon EMR which is on 0.20.2. We still have some stuff written for 0.19.x and it's higher priority to move off that than onto 0.21.x I think. Complicating this is the fact that 0.21.x is not backward compatible with 0.20.x.

NamedVector is already supported in VectorWritable, do we need a new Writable?

Is the issue that you are doing joins? Without CompositeInputFormat it's still possible, and we use the pattern elsewhere. You need some cleverness with a custom key and partitioner that will send key x from source A and key x from source B to the same reducer while maintaining inside a bit that indicates whether it's from A or B.
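
Sketched out, that cleverness looks roughly like this (names are made up;
hashCode/equals and the grouping comparator, which would compare only the
index, are omitted for brevity):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.mahout.math.VectorWritable;

    // Composite key: the join index plus a source tag. The sort order puts
    // A's row ahead of B's for the same index; partitioning and grouping
    // look only at the index.
    public class TaggedKey implements WritableComparable<TaggedKey> {
      int index;
      byte source;  // 0 = source A, 1 = source B

      public void write(DataOutput out) throws IOException {
        out.writeInt(index);
        out.writeByte(source);
      }

      public void readFields(DataInput in) throws IOException {
        index = in.readInt();
        source = in.readByte();
      }

      public int compareTo(TaggedKey o) {
        if (index != o.index) {
          return index < o.index ? -1 : 1;
        }
        return source - o.source;
      }
    }

    // Partitioner that ignores the tag, so key x from A and key x from B
    // land on the same reducer.
    class IndexPartitioner extends Partitioner<TaggedKey, VectorWritable> {
      public int getPartition(TaggedKey key, VectorWritable value, int parts) {
        return (key.index & Integer.MAX_VALUE) % parts;
      }
    }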

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.4
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Sean Owen <sr...@gmail.com>.
Yah... I actually tend to agree, since it's pretty useful and is
apparently making a comeback. I personally could go with that. It'd
be great to have more standardization and all that, but that can come
later as Hadoop more easily permits it.

That said, there are some aspects of the old API I think we can stop
using, and I suppose we should update where possible. I am operating
under the assumption that .mapreduce. is still going to supersede
.mapred. at some point. Is that our opinion?

On Thu, Jan 6, 2011 at 9:36 PM, Jake Mannix <ja...@gmail.com> wrote:
> I'm going to dive in and finally add my $0.02 on this whole "0.20 API" issue
> in DistributedRowMatrix:
>
> I very strongly feel that we should *not* constrain ourselves to use the
> new apis in the case of functionality which is *missing* in the new API,
> in particular: map-side joins.  As has been mentioned by Dmitriy and
> others:
>
> On Sun, Jan 2, 2011 at 9:32 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> Second remark is that blockwise multiplication is also pointless for
>> sufficiently sparse matrices. Indeed, a sum of outer products of columns and
>> rows with intermediate reduction in combiners is by far the most promising in
>> terms of shuffle/sort I/O.
>
>
> Doing a matrix multiplication in one MR pass is HUGE in comparison to
> having to do reduce-side joins and go through second (or third!) shuffle
> phases.  When you consider doing this K times during Lanczos iteration,
> switching to reduce-side matrix multiplication is a non-starter for me.
>
> In addition, this particular operation (matrix multiplication) is just one
> instance of a fairly general action (LDA would scale better if it also
> did a join of the topic/word parameter matrix and the corpus on each
> iteration, so the entire matrix wasn't loaded into memory on every
> mapper), and doing joins in the reducer means you often have to make
> extra passes, as opposed to joining in the mapper and getting a full
> shuffle-reduce step after to do more work.
>
> So yeah, that's me just saying all this again:
>
>
>> On another note, if the input is similarly partitioned (not always the case),
>> then map-side multiplication will always be I/O superior to reduce-side
>> multiplication, since the I/O is less, and especially less in the keyset
>> cardinality going through the sorters. The power of map-side operations comes
>> from the notion that yes, we require a lot from the input, but no, it's not a
>> lot if the input is already part of a bigger MR pipeline.
>>
>
> In general, until feature parity is achieved on the new APIs in a Hadoop
> distribution that is industry standard, I don't think we should constrain
> ourselves to *removing* functionality for the sake of getting rid of
> deprecation warnings.
>
>  -jake
>

Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Jake Mannix <ja...@gmail.com>.
I'm going to finally dive in and add my $0.02 on this whole "0.20 API" issue
in DistributedRowMatrix:

I very strongly feel that we should *not* constrain ourselves to use the
new APIs in the case of functionality which is *missing* in the new API,
in particular: map-side joins.  As has been mentioned by Dmitriy and
others:

On Sun, Jan 2, 2011 at 9:32 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> My second remark is that blockwise multiplication is also pointless for
> sufficiently sparse matrices. Indeed, a sum of outer products of columns and
> rows, with intermediate reduction in combiners, is by far the most promising
> approach in terms of shuffle/sort I/O.


Doing a matrix multiplication in one MR pass is HUGE in comparison to
having to do reduce-side joins and go through a second (or third!) shuffle
phase.  When you consider doing this K times during Lanczos iteration,
switching to reduce-side matrix multiplication is a non-starter for me.

In addition, this particular operation (matrix multiplication) is just one
instance of a fairly general action (LDA would scale better if it also
did a join of the topic/word parameter matrix and the corpus on each
iteration, so the entire matrix wasn't loaded into memory on every
mapper), and doing joins in the reducer means you often have to make
extra passes, as opposed to joining in the mapper and getting a full
shuffle-reduce step afterward to do more work.
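
To make concrete what we'd be giving up, here's roughly what the old-API
map-side join looks like (just a sketch, not the exact Mahout code; the
job class and the two Path variables are placeholders):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    // Configure a map-side inner join of two row-wise matrices, keyed by
    // row index. Only the old (mapred) API ships CompositeInputFormat.
    JobConf conf = new JobConf(MatrixMultiplicationJob.class);
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", SequenceFileInputFormat.class, pathToA, pathToB));
    // Each map() call then receives a TupleWritable pairing row i of A
    // with row i of B, so the whole multiply fits in one MR pass.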

So yeah, that's me just saying all this again:


> On another note, if the input is similarly partitioned (not always the
> case), then map-side multiplication will always be superior to reduce-side
> multiplication in I/O terms, since its I/O is lower, especially in the
> keyset cardinality going through the sorters. The power of map-side
> operations comes from the notion that yes, we require a lot from the input,
> but no, it's not a lot if the input is already part of a bigger MR pipeline.
>

In general, until feature parity is achieved on the new APIs in a Hadoop
distribution that is industry standard, I don't think we should constrain
ourselves to *removing* functionality for the sake of getting rid of
deprecation warnings.

  -jake

Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Andrew Hitchcock <ad...@gmail.com>.
If there are specific patches you would like applied to Elastic
MapReduce, I would recommend asking for them on our forums:

https://forums.aws.amazon.com/forum.jspa?forumID=52

We are fairly receptive when it comes to customer feedback about patches.

Regards,
Andrew

On Sun, Jan 2, 2011 at 11:52 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> On another note, Sean is absolutely correct: Amazon ElasticMR indeed seems
> to be stuck with 0.20 (or, rather, stuck with a particular Hadoop setup
> without much flexibility here). I guess moving ahead with APIs in Mahout
> would indeed create problems for whoever is using EMR (I don't).
>
> On Sun, Jan 2, 2011 at 9:32 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> I would think blockwise multiplication (which, by the way, has a standard
>> algorithm in Matrix Computations by Golub and Van Loan) is pretty pointless
>> with Mahout, since there's no blockwise matrix format presently, and even
>> if there were, no existing algorithms support it. All the prep utils only
>> produce row-wise format. We could write a routine to "block" it, but that
>> would seem to be an exercise in futility.
>>
>> My second remark is that blockwise multiplication is also pointless for
>> sufficiently sparse matrices. Indeed, a sum of outer products of columns
>> and rows, with intermediate reduction in combiners, is by far the most
>> promising approach in terms of shuffle/sort I/O. The outer products, when
>> further split into columns or rows, would also be quite sparse and hence
>> small in size, while the reduction in keyset cardinality is just gigantic
>> compared to blockwise multiplications. (That said, I never ran a
>> comparison benchmark of the two.)
>>
>> Note that what the authors are essentially suggesting (even in strategy 4)
>> entails explosive growth of shuffle-and-sort keyset I/O, and what's more,
>> they say they never tried it in distributed mode(!). Imagine hundreds of
>> machines sending a copy of their input to a lot of other machines in the
>> cluster. Summing outer products avoids broadcasting the input to multiple
>> reducers.
>>
>> On another note, if the input is similarly partitioned (not always the
>> case), then map-side multiplication will always be superior to reduce-side
>> multiplication in I/O terms, since its I/O is lower, especially in the
>> keyset cardinality going through the sorters. The power of map-side
>> operations comes from the notion that yes, we require a lot from the
>> input, but no, it's not a lot if the input is already part of a bigger MR
>> pipeline.
>>
>> Finally, back to the 0.20/0.21 issue... I said before in this thread that
>> migrating to 0.21 would render Mahout incompatible with the majority of
>> production frameworks out there. But after working with the ssvd code, I
>> came to think of a compromise: since most production environments are
>> running the Cloudera distribution, many 0.21 things are supported there,
>> and there's a lot of code around that's written for the new API, which is
>> backported in Cloudera. It's difficult for me to judge how much of what is
>> in 0.21 Cloudera's implementation covers (in fact, I did come across a
>> couple of 0.21 things still missing in CDH), but in terms of Hadoop
>> compatibility, I think the Mahout project would be best served if it
>> indeed moved on to the new API (i.e. 0.21) but did not get ahead of what
>> is supported in CDH3. That would keep it on the edge of what's currently
>> practical and out there. Continuing to sit on the old API is, IMO,
>> definitely a drag. My stochastic SVD code uses the new API in CDH3, and I
>> would very much not want to backport it to the old API; it would not be
>> practical, as everyone out there is on CDH more so than on 0.20.2.
>>
>> -Dmitriy
>>
>>
>>
>>>  Some more general remarks: I think the matrix multiplication can be
>>>> implemented more efficiently. I've done a matrix multiplication of a sparse
>>>> 500kx15k matrix with around 35 million elements on a quite powerful cluster
>>>> of 10 nodes, and this took around 30 minutes. I have no idea of the
>>>> performance of the implementation described at
>>>> http://homepage.mac.com/j.norstad/matrix-multiply/index.html, so I can't
>>>> really compare. But IMHO this can be improved (though it's possible that
>>>> the poor performance was due to mistakes on my part).
>>>>
>>> I will definitely investigate these methods over the coming days; these
>>> look fantastic.
>>>
>>> Shannon
>>>
>>
>>
>

Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On another note, Sean is absolutely correct: Amazon ElasticMR indeed seems
to be stuck with 0.20 (or, rather, stuck with a particular Hadoop setup
without much flexibility here). I guess moving ahead with APIs in Mahout
would indeed create problems for whoever is using EMR (I don't).

On Sun, Jan 2, 2011 at 9:32 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I would think blockwise multiplication (which, by the way, has a standard
> algorithm in Matrix Computations by Golub and Van Loan) is pretty pointless
> with Mahout, since there's no blockwise matrix format presently, and even
> if there were, no existing algorithms support it. All the prep utils only
> produce row-wise format. We could write a routine to "block" it, but that
> would seem to be an exercise in futility.
>
> My second remark is that blockwise multiplication is also pointless for
> sufficiently sparse matrices. Indeed, a sum of outer products of columns
> and rows, with intermediate reduction in combiners, is by far the most
> promising approach in terms of shuffle/sort I/O. The outer products, when
> further split into columns or rows, would also be quite sparse and hence
> small in size, while the reduction in keyset cardinality is just gigantic
> compared to blockwise multiplications. (That said, I never ran a
> comparison benchmark of the two.)
>
> Note that what the authors are essentially suggesting (even in strategy 4)
> entails explosive growth of shuffle-and-sort keyset I/O, and what's more,
> they say they never tried it in distributed mode(!). Imagine hundreds of
> machines sending a copy of their input to a lot of other machines in the
> cluster. Summing outer products avoids broadcasting the input to multiple
> reducers.
>
> On another note, if the input is similarly partitioned (not always the
> case), then map-side multiplication will always be superior to reduce-side
> multiplication in I/O terms, since its I/O is lower, especially in the
> keyset cardinality going through the sorters. The power of map-side
> operations comes from the notion that yes, we require a lot from the input,
> but no, it's not a lot if the input is already part of a bigger MR pipeline.
>
> Finally, back to the 0.20/0.21 issue... I said before in this thread that
> migrating to 0.21 would render Mahout incompatible with the majority of
> production frameworks out there. But after working with the ssvd code, I
> came to think of a compromise: since most production environments are
> running the Cloudera distribution, many 0.21 things are supported there,
> and there's a lot of code around that's written for the new API, which is
> backported in Cloudera. It's difficult for me to judge how much of what is
> in 0.21 Cloudera's implementation covers (in fact, I did come across a
> couple of 0.21 things still missing in CDH), but in terms of Hadoop
> compatibility, I think the Mahout project would be best served if it indeed
> moved on to the new API (i.e. 0.21) but did not get ahead of what is
> supported in CDH3. That would keep it on the edge of what's currently
> practical and out there. Continuing to sit on the old API is, IMO,
> definitely a drag. My stochastic SVD code uses the new API in CDH3, and I
> would very much not want to backport it to the old API; it would not be
> practical, as everyone out there is on CDH more so than on 0.20.2.
>
> -Dmitriy
>
>
>
>>  Some more general remarks: I think the matrix multiplication can be
>>> implemented more efficiently. I've done a matrix multiplication of a sparse
>>> 500kx15k matrix with around 35 million elements on a quite powerful cluster
>>> of 10 nodes, and this took around 30 minutes. I have no idea of the
>>> performance of the implementation described at
>>> http://homepage.mac.com/j.norstad/matrix-multiply/index.html, so I can't
>>> really compare. But IMHO this can be improved (though it's possible that
>>> the poor performance was due to mistakes on my part).
>>>
>> I will definitely investigate these methods over the coming days; these
>> look fantastic.
>>
>> Shannon
>>
>
>

Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I would think blockwise multiplication (which, by the way, has a standard
algorithm in Matrix Computations by Golub and Van Loan) is pretty pointless
with Mahout, since there's no blockwise matrix format presently, and even if
there were, no existing algorithms support it. All the prep utils only
produce row-wise format. We could write a routine to "block" it, but that
would seem to be an exercise in futility.

My second remark is that blockwise multiplication is also pointless for
sufficiently sparse matrices. Indeed, a sum of outer products of columns and
rows, with intermediate reduction in combiners, is by far the most promising
approach in terms of shuffle/sort I/O. The outer products, when further split
into columns or rows, would also be quite sparse and hence small in size,
while the reduction in keyset cardinality is just gigantic compared to
blockwise multiplications. (That said, I never ran a comparison benchmark of
the two.)
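
To make this concrete, here is a sketch of the kind of mapper I have in
mind (class names are hypothetical; it assumes the map-side join delivers
row i of A paired with row i of B, and computes C = A' * B):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.join.TupleWritable;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class OuterProductMapper extends MapReduceBase
        implements Mapper<IntWritable, TupleWritable, IntWritable, VectorWritable> {

      public void map(IntWritable rowIndex, TupleWritable rows,
                      OutputCollector<IntWritable, VectorWritable> out,
                      Reporter reporter) throws IOException {
        Vector aRow = ((VectorWritable) rows.get(0)).get();
        Vector bRow = ((VectorWritable) rows.get(1)).get();
        // Outer product: each nonzero a_i[j] contributes a_i[j] * b_i to
        // output row j of C.
        Iterator<Vector.Element> nonZeros = aRow.iterateNonZero();
        while (nonZeros.hasNext()) {
          Vector.Element e = nonZeros.next();
          out.collect(new IntWritable(e.index()),
                      new VectorWritable(bRow.times(e.get())));
        }
      }
    }

The combiner and the reducer are then the same trivial vector-summing
class, which is where the keyset-cardinality savings come from.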

Note that what the authors are essentially suggesting (even in strategy 4)
entails explosive growth of shuffle-and-sort keyset I/O, and what's more,
they say they never tried it in distributed mode(!). Imagine hundreds of
machines sending a copy of their input to a lot of other machines in the
cluster. Summing outer products avoids broadcasting the input to multiple
reducers.

On another note, if the input is similarly partitioned (not always the case),
then map-side multiplication will always be superior to reduce-side
multiplication in I/O terms, since its I/O is lower, especially in the keyset
cardinality going through the sorters. The power of map-side operations comes
from the notion that yes, we require a lot from the input, but no, it's not a
lot if the input is already part of a bigger MR pipeline.

Finally, back to the 0.20/0.21 issue... I said before in this thread that
migrating to 0.21 would render Mahout incompatible with the majority of
production frameworks out there. But after working with the ssvd code, I came
to think of a compromise: since most production environments are running the
Cloudera distribution, many 0.21 things are supported there, and there's a
lot of code around that's written for the new API, which is backported in
Cloudera. It's difficult for me to judge how much of what is in 0.21
Cloudera's implementation covers (in fact, I did come across a couple of 0.21
things still missing in CDH), but in terms of Hadoop compatibility, I think
the Mahout project would be best served if it indeed moved on to the new API
(i.e. 0.21) but did not get ahead of what is supported in CDH3. That would
keep it on the edge of what's currently practical and out there. Continuing
to sit on the old API is, IMO, definitely a drag. My stochastic SVD code uses
the new API in CDH3, and I would very much not want to backport it to the old
API; it would not be practical, as everyone out there is on CDH more so than
on 0.20.2.

-Dmitriy



>  Some more general remarks: I think the matrix multiplication can be
>> implemented more efficiently. I've done a matrix multiplication of a sparse
>> 500kx15k matrix with around 35 million elements on a quite powerful cluster
>> of 10 nodes, and this took around 30 minutes. I have no idea of the
>> performance of the implementation described at
>> http://homepage.mac.com/j.norstad/matrix-multiply/index.html, so I can't
>> really compare. But IMHO this can be improved (though it's possible that
>> the poor performance was due to mistakes on my part).
>>
> I will definitely investigate these methods over the coming days; these
> look fantastic.
>
> Shannon
>

Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Shannon Quinn <sq...@gatech.edu>.
> The matrix-matrix multiplication seems like an ugly hack to me; I'm actually in favor of keeping the old API until we can switch to 0.21.
I am more than willing to admit it's an ugly hack; however, I started an 
email thread back in November during ApacheCon regarding testing Mahout 
with Hadoop 0.21, and the general consensus was to avoid 0.21 until 
the next version of Hadoop was released (it seems even the Hadoop folks 
don't care much for 0.21). I'm more than happy to pick up those 
experiments and relay the results if the sentiment has changed.
> 2) This implementation uses 3 M/R jobs where the original one has only 1. I agree that the first two jobs are very basic operations, but still, for performance's sake it's better to keep the number of jobs low. I'm almost 100% certain that this implementation will be slower than the original one (though I have no idea how much slower; it would be interesting to know).
I completely agree; I just wasn't sure how to do the join operation (see 
my emails with Sean Owen) in the absence of CompositeInputFormat. From 
Sean's reply, however, it sounds like this is still possible; I just need 
to do my research.
> 3) Every row of the DRM now has an extra String variable to store and send. Certainly when the matrix is very sparse this will result in a substantial overhead.
No arguments here. From the same email thread with Sean, though, it 
sounds as though VectorWritable might have what we need without having 
to resort to what is effectively a Writable wrapper 
(NamedVectorWritable), so I'll do that.
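
If that pans out, the change would be small -- something like this sketch
(worth verifying that VectorWritable really round-trips the name; numCols
is a placeholder):

    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    // Tag each row with its source matrix via NamedVector; VectorWritable
    // serializes the name alongside the vector's values.
    Vector row = new RandomAccessSparseVector(numCols);
    VectorWritable writable = new VectorWritable(new NamedVector(row, "A"));
    // On the receiving side:
    //   String source = ((NamedVector) writable.get()).getName();  // "A"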
> 4) The MatrixMultiplicationReducer receives a NamedVectorWritable, but there's no reason for this. It would be better to use a plain VectorWritable.
I noticed this while I was doing the patch, but since the input and 
output types of the Combiner have to be the same, I didn't see an 
alternative (unless there's a way around this?).
> If we insist on compliance with 0.20.2, it might be interesting to have a look at:
> http://homepage.mac.com/j.norstad/matrix-multiply/index.html
> This implementation avoids the use of CompositeInputFormat by checking the current input path in setup().
>
This is an awesome webpage! I'll read over this more carefully soon, but 
what you mentioned was my original strategy: to check the input path 
being read from within the Mappers/Reducers. Unfortunately, I couldn't 
find a way to do this, as the only "Path" I could check was the 
"currentWorkingDirectory", which just turned out to be MAHOUT_HOME (at 
least on my dev machine). If there's a way of doing this, please do let 
me know.
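
One avenue I haven't tried yet is asking the context for its input split
rather than for the working directory. An untested sketch (the boolean
field and the "matrixA.path" configuration key are made up for
illustration):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Inside a new-API Mapper subclass:
    private boolean readingMatrixA;

    @Override
    protected void setup(Context context)
        throws IOException, InterruptedException {
      // The path of the file backing the current input split:
      Path inputPath = ((FileSplit) context.getInputSplit()).getPath();
      String aPath = context.getConfiguration().get("matrixA.path");
      readingMatrixA = inputPath.toString().contains(aPath);
    }

If that works, each mapper could tell which matrix it is reading without
any extra per-row tagging.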
> Some more general remarks: I think the matrix multiplication can be implemented more efficiently. I've done a matrix multiplication of a sparse 500kx15k matrix with around 35 million elements on a quite powerful cluster of 10 nodes, and this took around 30 minutes. I have no idea of the performance of the implementation described at http://homepage.mac.com/j.norstad/matrix-multiply/index.html, so I can't really compare. But IMHO this can be improved (though it's possible that the poor performance was due to mistakes on my part).
I will definitely investigate these methods over the coming days; these 
look fantastic.

Shannon

[jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Joris Geessels (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976204#action_12976204 ] 

Joris Geessels commented on MAHOUT-537:
---------------------------------------

The matrix-matrix multiplication seems like an ugly hack to me; I'm actually in favor of keeping the old API until we can switch to 0.21.
Some remarks: 
1) I didn't test the code either, but couldn't spot any obvious errors. So it seems to me that it should work.
2) This implementation uses 3 M/R jobs where the original one has only 1. I agree that the first two jobs are very basic operations, but still, for performance's sake it's better to keep the number of jobs low. I'm almost 100% certain that this implementation will be slower than the original one (though I have no idea how much slower; it would be interesting to know).
3) Every row of the DRM now has an extra String variable to store and send. Certainly when the matrix is very sparse this will result in a substantial overhead.
4) The MatrixMultiplicationReducer receives a NamedVectorWritable, but there's no reason for this. It would be better to use a plain VectorWritable.

If we insist on compliance with 0.20.2, it might be interesting to have a look at:
http://homepage.mac.com/j.norstad/matrix-multiply/index.html
This implementation avoids the use of CompositeInputFormat by checking the current input path in setup().

Some more general remarks: I think the matrix multiplication can be implemented more efficiently. I've done a matrix multiplication of a sparse 500kx15k matrix with around 35 million elements on a quite powerful cluster of 10 nodes, and this took around 30 minutes. I have no idea of the performance of the implementation described at http://homepage.mac.com/j.norstad/matrix-multiply/index.html, so I can't really compare. But IMHO this can be improved (though it's possible that the poor performance was due to mistakes on my part).

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.4
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shannon Quinn updated MAHOUT-537:
---------------------------------

    Attachment: MAHOUT-537.patch

This first patch fixes most of the DistributedRowMatrix compliance issues, bringing TimesJob, TimesSquaredJob, and TransposeJob to work with the new Hadoop 0.20.2 API, most notably by eliminating dependence on the deprecated JobConf object, using instead Job and Configuration. DistributedRowMatrix has also been modified to accept a Configuration object in its configure() method. All that remains is to fix the times(DRM) method, which will be tricky given its dependence on CompositeInputFormat, which conspicuously lacks any analogous type in the latest Hadoop release. Will update.
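
For reference, the shape of the conversion looks roughly like this (a
schematic sketch, not the patch itself; the input/output Path variables
are placeholders):

    // Before: the deprecated JobConf couples configuration and submission
    // (classes from org.apache.hadoop.mapred).
    JobConf jobConf = new JobConf(TransposeJob.class);
    jobConf.setJobName("transpose");
    jobConf.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(jobConf, inputPath);
    FileOutputFormat.setOutputPath(jobConf, outputPath);
    JobClient.runJob(jobConf);

    // After: a Configuration carries the settings and a Job handles
    // submission (classes from org.apache.hadoop.conf and .mapreduce).
    Configuration conf = new Configuration();
    Job job = new Job(conf, "transpose");
    job.setJarByClass(TransposeJob.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, inputPath);
    FileOutputFormat.setOutputPath(job, outputPath);
    job.waitForCompletion(true);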

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.4
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>         Attachments: MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928314#action_12928314 ] 

Shannon Quinn commented on MAHOUT-537:
--------------------------------------

Something worth discussing: Hadoop just released version 0.21.0, which re-includes the updated CompositeInputFormat that was missing in 0.20.2 and deprecated in 0.18. I'm going to install v0.21 and see if tests pass on the trunk, but provided they do, I'm wondering whether I should go ahead and implement this patch using Hadoop 0.21. Any thoughts?
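
If the tests do pass, the times(DRM) join would presumably move over to
the new join package -- something like this untested sketch, based on a
quick look at the 0.21 API (details may be off; pathToA and pathToB are
placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;

    // In 0.21 the join framework reappears under
    // org.apache.hadoop.mapreduce.lib.join and is configured on the
    // Configuration rather than on a JobConf.
    Configuration conf = new Configuration();
    conf.set("mapreduce.join.expr", CompositeInputFormat.compose(
        "inner", SequenceFileInputFormat.class, pathToA, pathToB));
    Job job = new Job(conf, "matrixmultiply");
    job.setInputFormatClass(CompositeInputFormat.class);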

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.4
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>         Attachments: MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] [Updated] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-537:
-----------------------------

          Component/s: Math
             Due Date: 26/Jul/13
    Affects Version/s: 0.5
        Fix Version/s: 0.6

This seems to me like part of one of the most crucial tasks for the next release: deprecating, and then removing or fixing, anything not using the newer Hadoop APIs. Shannon, did you reach a point where you can commit? I'd strongly encourage you to finish this migration; it's great and important work.

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>             Fix For: 0.6
>
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira