Posted to dev@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2010/05/25 16:17:07 UTC

Moving to new Hadoop APIs

Just to state what seems to be in progress -- looks like we are agreed
we should move to the new Hadoop APIs. Some code is already using it;
most of the part that isn't is the recommender which was due to some
strange bugs deep in Hadoop in prior versions. It's time to try it
again. I'm going to work on porting everything forward now.

The other argument against this was that Amazon EMR runs 0.18.3. I
think Jeff already established that what we're doing has already
broken compatibility with 0.18.x. We can point those users to release
0.3 and say they can try to back-port that code to 0.18.x
compatibility. But 0.4 onwards is for 0.20.x+ and you can run your own
cluster using AWS and hopefully EMR updates soon.

Sean

Re: Moving to new Hadoop APIs

Posted by Jake Mannix <ja...@gmail.com>.
Ugly!  MultipleOutputs and MultipleInputs are so nice sometimes...

  -jake

On May 27, 2010 9:47 AM, "Sean Owen" <sr...@gmail.com> wrote:

In reality, it's not so hard to work around. You do have to make two
(short) additional jobs. Each one runs one of the mappers, and each
just uses an identity reducer. Then, both those outputs are fed into a
new job, via an identity mapper, into the reducer. It's not terrible.

On Thu, May 27, 2010 at 5:45 PM, ankur goel <an...@yahoo.com> wrote:
> AFAIK an equivalent does NOT exist. However, the problem is tracked by
> JIRA - https://issues.apache.org/jira/browse/MAPREDUCE-1170. Please vote
> there if we need a fix in Hadoop 0.20.2, as the patch hasn't gone anywhere.

Re: Moving to new Hadoop APIs

Posted by Sean Owen <sr...@gmail.com>.
In reality, it's not so hard to work around. You do have to make two
(short) additional jobs. Each one runs one of the mappers, and each
just uses an identity reducer. Then, both those outputs are fed into a
new job, via an identity mapper, into the reducer. It's not terrible.
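A minimal sketch of that chaining under the new (org.apache.hadoop.mapreduce)
API. FirstMapper, SecondMapper and JoiningReducer below are hypothetical
stand-ins with placeholder logic, and Text keys/values are assumed for
illustration; note that the base Mapper and Reducer classes in the new API
already pass pairs through unchanged, so they serve as the identity steps:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public final class TwoMapperWorkaround {

  // Hypothetical stand-ins for the two real mappers; placeholder logic only.
  static class FirstMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text("A"), value);
    }
  }

  static class SecondMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text("B"), value);
    }
  }

  // Hypothetical stand-in for the real reducer that sees both inputs.
  static class JoiningReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      for (Text value : values) {
        ctx.write(key, value);
      }
    }
  }

  // One short job: run the given mapper, then an identity reduce
  // (the base Reducer class just passes key-value pairs through).
  private static void runMapperJob(Configuration conf,
      Class<? extends Mapper> mapper, Path in, Path out) throws Exception {
    Job job = new Job(conf, "identity-reduce-" + out.getName());
    job.setJarByClass(TwoMapperWorkaround.class);
    job.setMapperClass(mapper);
    job.setReducerClass(Reducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileInputFormat.addInputPath(job, in);
    FileOutputFormat.setOutputPath(job, out);
    job.waitForCompletion(true);
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path tmp1 = new Path("tmp/first");
    Path tmp2 = new Path("tmp/second");

    // Two (short) additional jobs, one per mapper.
    runMapperJob(conf, FirstMapper.class, new Path(args[0]), tmp1);
    runMapperJob(conf, SecondMapper.class, new Path(args[1]), tmp2);

    // Final job: identity map over both intermediate outputs (the base
    // Mapper class is the identity), then the real reducer.
    Job join = new Job(conf, "join");
    join.setJarByClass(TwoMapperWorkaround.class);
    join.setMapperClass(Mapper.class);
    join.setReducerClass(JoiningReducer.class);
    join.setOutputKeyClass(Text.class);
    join.setOutputValueClass(Text.class);
    join.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(join, tmp1);
    FileInputFormat.addInputPath(join, tmp2);
    FileOutputFormat.setOutputPath(join, new Path(args[2]));
    join.waitForCompletion(true);
  }
}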

On Thu, May 27, 2010 at 5:45 PM, ankur goel <an...@yahoo.com> wrote:
> AFAIK an equivalent does NOT exist. However, the problem is tracked by JIRA
> - https://issues.apache.org/jira/browse/MAPREDUCE-1170. Please vote there if
> we need a fix in Hadoop 0.20.2, as the patch hasn't gone anywhere.

Re: Moving to new Hadoop APIs

Posted by Robin Anil <ro...@gmail.com>.
On Wed, May 26, 2010 at 1:39 AM, Robin Anil <ro...@gmail.com> wrote:

> How about adding a Mahout Mapper/Reducer layer? It has lots of cons, like
> maintainability, but it will ease a lot of dependency problems and it may
> help adoption in non-Hadoop frameworks.
>
> It will clean up/reduce a lot of junk job setup code, will be easy to
> plug in with Drew's Document stuff, and will make it easy to fetch
> counters. I am talking about very clean, lightweight Mapper/Reducer
> wrapper classes.

Re: Moving to new Hadoop APIs

Posted by Sean Owen <sr...@gmail.com>.
This wouldn't solve my particular problem, but that's not the question, I imagine.

I suppose I'd have to see a patch to understand what this might
meaningfully simplify. I think I use most of the Mapper / Reducer API
as it is. Now, the whole Job creation process -- simplifiable.

On Tue, May 25, 2010 at 9:22 PM, Robin Anil <ro...@gmail.com> wrote:
> It's a bit of work, but once complete, implementing K-means should be less
> than 10 lines of code.
>
> Robin
>

Re: Moving to new Hadoop APIs

Posted by Robin Anil <ro...@gmail.com>.
It's a bit of work, but once complete, implementing K-means should be less
than 10 lines of code.

Robin

Re: Moving to new Hadoop APIs

Posted by Jake Mannix <ja...@gmail.com>.
And if you do this, you could in fact make the MahoutMapper/Reducer
classes serializable/writable, and do fun things like have classes
(something like DistributedDataSet) with methods like
dataSet.mapReduce(myMahoutMapper, myMahoutReducer), which returns a
transformed DistributedDataSet.  Configuration data that would be too big
to fit inside a Configuration object could be side-channelled via the
DistributedCache...
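Roughly the shape being suggested, as a sketch only -- none of these types
exist in Mahout, and the names are taken straight from the message above:

import java.io.Serializable;

// Hypothetical interfaces sketching the idea above. The mapper/reducer are
// Serializable so instances can be shipped to the cluster (with large side
// data side-channelled via the DistributedCache).
interface Collector<K, V> {
  void collect(K key, V value);
}

interface MahoutMapper<K1, V1, K2, V2> extends Serializable {
  void map(K1 key, V1 value, Collector<K2, V2> out);
}

interface MahoutReducer<K2, V2, K3, V3> extends Serializable {
  void reduce(K2 key, Iterable<V2> values, Collector<K3, V3> out);
}

interface DistributedDataSet<K, V> {
  // Returns a new, transformed data set backed by a map-reduce job.
  <K2, V2, K3, V3> DistributedDataSet<K3, V3> mapReduce(
      MahoutMapper<K, V, K2, V2> mapper,
      MahoutReducer<K2, V2, K3, V3> reducer);
}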

On Tue, May 25, 2010 at 1:18 PM, Robin Anil <ro...@gmail.com> wrote:

> Algorithms implement Mahout Mapper/Reducer interfaces. AbstractJob sets up
> jobs by calling Hadoop functions. There will be only one Hadoop
> Mapper/Reducer class, which will take a Mahout Mapper/Reducer during setup
> and call the map function inside it.
>
> Robin
>

Re: Moving to new Hadoop APIs

Posted by Robin Anil <ro...@gmail.com>.
Algorithms implement Mahout Mapper/Reducer interfaces. AbstractJob sets up
jobs by calling Hadoop functions. There will be only one Hadoop
Mapper/Reducer class, which will take a Mahout Mapper/Reducer during setup
and call the map function inside it.

Robin
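A minimal sketch of that single delegating Hadoop Mapper, assuming the
hypothetical MahoutMapper/Collector interfaces sketched earlier in the
thread; the "mahout.mapper.class" configuration key is likewise illustrative:

import java.io.IOException;

import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.ReflectionUtils;

// The one Hadoop Mapper: looks up a Mahout mapper from the job
// configuration during setup and delegates each map() call to it.
public class DelegatingMapper<K1, V1, K2, V2> extends Mapper<K1, V1, K2, V2> {

  private MahoutMapper<K1, V1, K2, V2> delegate;

  @Override
  @SuppressWarnings("unchecked")
  protected void setup(Context context) {
    // "mahout.mapper.class" is an illustrative key, not an existing one.
    Class<?> mapperClass =
        context.getConfiguration().getClass("mahout.mapper.class", null);
    delegate = (MahoutMapper<K1, V1, K2, V2>)
        ReflectionUtils.newInstance(mapperClass, context.getConfiguration());
  }

  @Override
  protected void map(K1 key, V1 value, final Context context)
      throws IOException, InterruptedException {
    // Adapt the Hadoop Context to the simple Collector the algorithms see.
    delegate.map(key, value, new Collector<K2, V2>() {
      public void collect(K2 k, V2 v) {
        try {
          context.write(k, v);
        } catch (Exception e) {
          throw new IllegalStateException(e);
        }
      }
    });
  }
}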

Re: Moving to new Hadoop APIs

Posted by Ted Dunning <te...@gmail.com>.
Not sure what you mean by this.

On Tue, May 25, 2010 at 1:09 PM, Robin Anil <ro...@gmail.com> wrote:

> How about adding a Mahout Mapper/Reducer layer? It has lots of cons, like
> maintainability, but it will ease a lot of dependency problems and it may
> help adoption in non-Hadoop frameworks.
>
> On Wed, May 26, 2010 at 1:28 AM, Sean Owen <sr...@gmail.com> wrote:
>
> > That's a good perspective. I don't think moving to the new APIs is
> > that burdensome, especially as we're doing a better and better job of
> > centralizing and standardizing our approach to interacting with
> > Hadoop.
> >
> > I somehow sense we're going to need to run 'close to the metal' with
> > Hadoop to squeeze maximum performance out of it. Hence maybe we aren't
> > the casual users that would most benefit from a layer in between. But
> > that's just a hunch.
> >
> > This is, so far, an isolated issue that I'm attacking. Maybe it is
> > just bad practice to have two mappers with different inputs but I
> > think it's a handy trick.
> >
> > For now, I may punt and just use the old code for this particular
> > mechanism, since it's only deprecated. When/if it goes away, then we
> > can really deal with it. By that time maybe new solutions are
> > available.
> >
> >
> >
> > On Tue, May 25, 2010 at 8:45 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> > > I presume that Robin's rework addresses this, right?
> > >
> > > ---------- Forwarded message ----------
> > > From: Chris K Wensel <ch...@wensel.net>
> > > Date: Tue, May 25, 2010 at 12:43 PM
> > > Subject: Re: Moving to new Hadoop APIs
> > > To: Ted Dunning <te...@gmail.com>
> > >
> > >
> > > you guys need to update the mail list page. still sends to
> > lucene.apache.org
> > >
> > > On May 25, 2010, at 12:40 PM, Ted Dunning wrote:
> > >
> > >
> > > Thanks.
> > >
> > > On Tue, May 25, 2010 at 12:20 PM, Chris K Wensel <ch...@wensel.net>
> > wrote:
> > >
> > >> I'm not on the list.
> > >>
> > >> here is my opinion on the new apis
> > >>
> > >>
> > >> http://groups.google.com/group/cascading-user/browse_thread/thread/4dc26b68401bbc0f#
> > >>
> > >> and here
> > >>
> > >> http://stackoverflow.com/questions/2855167/which-hadoop-api-version-should-i-use/2859863#2859863
> > >>
> > >> I can reply to the list, but am rushing out of the office. let me
> know.
> > >>
> > >> ckw
> > >>
> >
>

Re: Moving to new Hadoop APIs

Posted by Robin Anil <ro...@gmail.com>.
How about adding a Mahout Mapper/Reducer layer? It has lots of cons, like
maintainability, but it will ease a lot of dependency problems and it may
help adoption in non-Hadoop frameworks.

On Wed, May 26, 2010 at 1:28 AM, Sean Owen <sr...@gmail.com> wrote:

> That's a good perspective. I don't think moving to the new APIs is
> that burdensome, especially as we're doing a better and better job of
> centralizing and standardizing our approach to interacting with
> Hadoop.
>
> I somehow sense we're going to need to run 'close to the metal' with
> Hadoop to squeeze maximum performance out of it. Hence maybe we aren't
> the casual users that would most benefit from a layer in between. But
> that's just a hunch.
>
> This is, so far, an isolated issue that I'm attacking. Maybe it is
> just bad practice to have two mappers with different inputs but I
> think it's a handy trick.
>
> For now, I may punt and just use the old code for this particular
> mechanism, since it's only deprecated. When/if it goes away, then we
> can really deal with it. By that time maybe new solutions are
> available.
>
>
>
> On Tue, May 25, 2010 at 8:45 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > I presume that Robin's rework addresses this, right?
> >
> > ---------- Forwarded message ----------
> > From: Chris K Wensel <ch...@wensel.net>
> > Date: Tue, May 25, 2010 at 12:43 PM
> > Subject: Re: Moving to new Hadoop APIs
> > To: Ted Dunning <te...@gmail.com>
> >
> >
> > you guys need to update the mail list page. still sends to
> lucene.apache.org
> >
> > On May 25, 2010, at 12:40 PM, Ted Dunning wrote:
> >
> >
> > Thanks.
> >
> > On Tue, May 25, 2010 at 12:20 PM, Chris K Wensel <ch...@wensel.net>
> wrote:
> >
> >> I'm not on the list.
> >>
> >> here is my opinion on the new apis
> >>
> >> http://groups.google.com/group/cascading-user/browse_thread/thread/4dc26b68401bbc0f#
> >>
> >> and here
> >>
> >> http://stackoverflow.com/questions/2855167/which-hadoop-api-version-should-i-use/2859863#2859863
> >>
> >> I can reply to the list, but am rushing out of the office. let me know.
> >>
> >> ckw
> >>
>

Re: Moving to new Hadoop APIs

Posted by Sean Owen <sr...@gmail.com>.
That's a good perspective. I don't think moving to the new APIs is
that burdensome, especially as we're doing a better and better job of
centralizing and standardizing our approach to interacting with
Hadoop.

I somehow sense we're going to need to run 'close to the metal' with
Hadoop to squeeze maximum performance out of it. Hence maybe we aren't
the casual users that would most benefit from a layer in between. But
that's just a hunch.

This is, so far, an isolated issue that I'm attacking. Maybe it is
just bad practice to have two mappers with different inputs but I
think it's a handy trick.

For now, I may punt and just use the old code for this particular
mechanism, since it's only deprecated. When/if it goes away, then we
can really deal with it. By that time maybe new solutions are
available.



On Tue, May 25, 2010 at 8:45 PM, Ted Dunning <te...@gmail.com> wrote:
> I presume that Robin's rework addresses this, right?
>
> ---------- Forwarded message ----------
> From: Chris K Wensel <ch...@wensel.net>
> Date: Tue, May 25, 2010 at 12:43 PM
> Subject: Re: Moving to new Hadoop APIs
> To: Ted Dunning <te...@gmail.com>
>
>
> you guys need to update the mail list page. still sends to lucene.apache.org
>
> On May 25, 2010, at 12:40 PM, Ted Dunning wrote:
>
>
> Thanks.
>
> On Tue, May 25, 2010 at 12:20 PM, Chris K Wensel <ch...@wensel.net> wrote:
>
>> I'm not on the list.
>>
>> here is my opinion on the new apis
>>
>> http://groups.google.com/group/cascading-user/browse_thread/thread/4dc26b68401bbc0f#
>>
>> and here
>>
>> http://stackoverflow.com/questions/2855167/which-hadoop-api-version-should-i-use/2859863#2859863
>>
>> I can reply to the list, but am rushing out of the office. let me know.
>>
>> ckw
>>

Fwd: Moving to new Hadoop APIs

Posted by Ted Dunning <te...@gmail.com>.
I presume that Robin's rework addresses this, right?

---------- Forwarded message ----------
From: Chris K Wensel <ch...@wensel.net>
Date: Tue, May 25, 2010 at 12:43 PM
Subject: Re: Moving to new Hadoop APIs
To: Ted Dunning <te...@gmail.com>


you guys need to update the mail list page. still sends to lucene.apache.org

On May 25, 2010, at 12:40 PM, Ted Dunning wrote:


Thanks.

On Tue, May 25, 2010 at 12:20 PM, Chris K Wensel <ch...@wensel.net> wrote:

> I'm not on the list.
>
> here is my opinion on the new apis
>
> http://groups.google.com/group/cascading-user/browse_thread/thread/4dc26b68401bbc0f#
>
> and here
>
> http://stackoverflow.com/questions/2855167/which-hadoop-api-version-should-i-use/2859863#2859863
>
> I can reply to the list, but am rushing out of the office. let me know.
>
> ckw
>
> On May 25, 2010, at 11:45 AM, Ted Dunning wrote:
>
>
> Chris,
>
> Can you comment on this?
>
> ---------- Forwarded message ----------
> From: Sean Owen <sr...@gmail.com>
> Date: Tue, May 25, 2010 at 11:37 AM
> Subject: Re: Moving to new Hadoop APIs
> To: Mahout Dev List <de...@mahout.apache.org>, Ankur Goel <
> ankur_goel79@yahoo.com>
>
>
> First snag I've hit -- the new APIs don't include an equivalent of
> MultipleInputs, which is the primary support for mapping several input
> paths. Anyone happen to know the equivalent (esp. Ankur)? This could
> be a bear to reproduce on our own.
>
> Sean
>
> On Tue, May 25, 2010 at 3:17 PM, Sean Owen <sr...@gmail.com> wrote:
> > Just to state what seems to be in progress -- looks like we are agreed
> > we should move to the new Hadoop APIs. Some code is already using it;
> > most of the part that isn't is the recommender which was due to some
> > strange bugs deep in Hadoop in prior versions. It's time to try it
> > again. I'm going to work on porting everything forward now.
> >
> > The other argument against this was that Amazon EMR runs 0.18.3. I
> > think Jeff already established that what we're doing has already
> > broken compatibility with 0.18.x. We can point those users to release
> > 0.3 and say they can try to back-port that code to 0.18.x
> > compatibility. But 0.4 onwards is for 0.20.x+ and you can run your own
> > cluster using AWS and hopefully EMR updates soon.
> >
> > Sean
> >
>
>
>  --
> Chris K Wensel
> chris@concurrentinc.com
> http://www.concurrentinc.com
>
>

  --
Chris K Wensel
chris@concurrentinc.com
http://www.concurrentinc.com

Re: Moving to new Hadoop APIs

Posted by Sean Owen <sr...@gmail.com>.
First snag I've hit -- the new APIs don't include an equivalent of
MultipleInputs, which is the primary support for mapping several input
paths. Anyone happen to know the equivalent (esp. Ankur)? This could
be a bear to reproduce on our own.

Sean
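For reference, this is what the old (org.apache.hadoop.mapred) MultipleInputs
made easy -- one job, a different mapper per input path, one shared reducer.
The mapper and reducer names and paths here are hypothetical:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public final class OldApiTwoInputs {
  public static JobConf configureJob() {
    JobConf conf = new JobConf(OldApiTwoInputs.class);
    // Each input path gets its own mapper; one shared shuffle and reduce.
    MultipleInputs.addInputPath(conf, new Path("in/first"),
        TextInputFormat.class, FirstMapper.class);    // hypothetical
    MultipleInputs.addInputPath(conf, new Path("in/second"),
        TextInputFormat.class, SecondMapper.class);   // hypothetical
    conf.setReducerClass(JoiningReducer.class);       // hypothetical
    return conf;
  }
}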

On Tue, May 25, 2010 at 3:17 PM, Sean Owen <sr...@gmail.com> wrote:
> Just to state what seems to be in progress -- looks like we are agreed
> we should move to the new Hadoop APIs. Some code is already using it;
> most of the part that isn't is the recommender which was due to some
> strange bugs deep in Hadoop in prior versions. It's time to try it
> again. I'm going to work on porting everything forward now.
>
> The other argument against this was that Amazon EMR runs 0.18.3. I
> think Jeff already established that what we're doing has already
> broken compatibility with 0.18.x. We can point those users to release
> 0.3 and say they can try to back-port that code to 0.18.x
> compatibility. But 0.4 onwards is for 0.20.x+ and you can run your own
> cluster using AWS and hopefully EMR updates soon.
>
> Sean
>

Re: Moving to new Hadoop APIs

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
+1 on both. I'm going sailing for the weekend but will restart work on
MAHOUT-167 now.

On 5/25/10 8:33 AM, Robin Anil wrote:
> +1 on moving forward and on using AbstractJob across the codebase.
>
>
> On Tue, May 25, 2010 at 7:47 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> Just to state what seems to be in progress -- looks like we are agreed
>> we should move to the new Hadoop APIs. Some code is already using it;
>> most of the part that isn't is the recommender which was due to some
>> strange bugs deep in Hadoop in prior versions. It's time to try it
>> again. I'm going to work on porting everything forward now.
>>
>> The other argument against this was that Amazon EMR runs 0.18.3. I
>> think Jeff already established that what we're doing has already
>> broken compatibility with 0.18.x. We can point those users to release
>> 0.3 and say they can try to back-port that code to 0.18.x
>> compatibility. But 0.4 onwards is for 0.20.x+ and you can run your own
>> cluster using AWS and hopefully EMR updates soon.
>>
>> Sean
>>


Re: Moving to new Hadoop APIs

Posted by Sean Owen <sr...@gmail.com>.
Yah that's what I expected, and that's what we'd settled on to date.

I recently heard maybe that wasn't the case, and want to make sure the
project doesn't get stuck using a bit of both for long. So now I'm on
the new APIs, but checking in since that move seems to decrease
consistency rather than increase it.

Robin, is there a big change in the wings that updates all this that I don't see?

Re: Moving to new Hadoop APIs

Posted by Jake Mannix <ja...@gmail.com>.
I made sure to write all the matrix and decomposer stuff in the old mapred.*
hierarchy, so that is not on 0.20+ yet.  But I don't know about the rest of
it; I've seen lots of 0.18-based code as I dug around.

On Wed, May 26, 2010 at 1:10 AM, Sean Owen <sr...@gmail.com> wrote:

> So, I converted to use the new APIs since Robin had mentioned that
> most of the other code uses them.
>
> But I took a glance, and actually I don't see anything using the new
> APIs at all except LDA and one FPM implementation.
>
> Are we talking about the same thing?
> Basically we are talking about using nothing in org.apache.hadoop.mapred.*
>
> On Tue, May 25, 2010 at 4:33 PM, Robin Anil <ro...@gmail.com> wrote:
> > +1 on moving forward and on using AbstractJob across the codebase.
> >
> >
> > On Tue, May 25, 2010 at 7:47 PM, Sean Owen <sr...@gmail.com> wrote:
> >
> >> Just to state what seems to be in progress -- looks like we are agreed
> >> we should move to the new Hadoop APIs. Some code is already using it;
> >> most of the part that isn't is the recommender which was due to some
> >> strange bugs deep in Hadoop in prior versions. It's time to try it
> >> again. I'm going to work on porting everything forward now.
> >>
> >> The other argument against this was that Amazon EMR runs 0.18.3. I
> >> think Jeff already established that what we're doing has already
> >> broken compatibility with 0.18.x. We can point those users to release
> >> 0.3 and say they can try to back-port that code to 0.18.x
> >> compatibility. But 0.4 onwards is for 0.20.x+ and you can run your own
> >> cluster using AWS and hopefully EMR updates soon.
> >>
> >> Sean
> >>
> >
>

Re: Moving to new Hadoop APIs

Posted by Sean Owen <sr...@gmail.com>.
So, I converted to use the new APIs since Robin had mentioned that
most of the other code uses them.

But I took a glance, and actually I don't see anything using the new
APIs at all except LDA and one FPM implementation.

Are we talking about the same thing?
Basically we are talking about using nothing in org.apache.hadoop.mapred.*
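For anyone following along, the distinction in a nutshell: the same trivial
mapper written against each package. The word-count-style types are just for
illustration:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Old API (org.apache.hadoop.mapred): interfaces, OutputCollector, Reporter.
class OldStyleMapper extends org.apache.hadoop.mapred.MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
      org.apache.hadoop.mapred.OutputCollector<Text, IntWritable> output,
      org.apache.hadoop.mapred.Reporter reporter) throws IOException {
    output.collect(value, new IntWritable(1));
  }
}

// New API (org.apache.hadoop.mapreduce): abstract class, one Context object.
class NewStyleMapper
    extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(value, new IntWritable(1));
  }
}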

On Tue, May 25, 2010 at 4:33 PM, Robin Anil <ro...@gmail.com> wrote:
> +1 on moving forward and on using AbstractJob across the codebase.
>
>
> On Tue, May 25, 2010 at 7:47 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> Just to state what seems to be in progress -- looks like we are agreed
>> we should move to the new Hadoop APIs. Some code is already using it;
>> most of the part that isn't is the recommender which was due to some
>> strange bugs deep in Hadoop in prior versions. It's time to try it
>> again. I'm going to work on porting everything forward now.
>>
>> The other argument against this was that Amazon EMR runs 0.18.3. I
>> think Jeff already established that what we're doing has already
>> broken compatibility with 0.18.x. We can point those users to release
>> 0.3 and say they can try to back-port that code to 0.18.x
>> compatibility. But 0.4 onwards is for 0.20.x+ and you can run your own
>> cluster using AWS and hopefully EMR updates soon.
>>
>> Sean
>>
>

Re: Moving to new Hadoop APIs

Posted by Robin Anil <ro...@gmail.com>.
+1 on moving forward and on using AbstractJob across the codebase.


On Tue, May 25, 2010 at 7:47 PM, Sean Owen <sr...@gmail.com> wrote:

> Just to state what seems to be in progress -- looks like we are agreed
> we should move to the new Hadoop APIs. Some code is already using it;
> most of the part that isn't is the recommender which was due to some
> strange bugs deep in Hadoop in prior versions. It's time to try it
> again. I'm going to work on porting everything forward now.
>
> The other argument against this was that Amazon EMR runs 0.18.3. I
> think Jeff already established that what we're doing has already
> broken compatibility with 0.18.x. We can point those users to release
> 0.3 and say they can try to back-port that code to 0.18.x
> compatibility. But 0.4 onwards is for 0.20.x+ and you can run your own
> cluster using AWS and hopefully EMR updates soon.
>
> Sean
>

Re: Moving to new Hadoop APIs

Posted by Sean Owen <sr...@gmail.com>.
I'm happy to report I converted all the recommender-related jobs to
Hadoop 0.20.x and sorted out the issue I had before, simply by
reworking the jobs so that no single job needs two mapper inputs.

I don't think I broke anything, but I can't be 100% sure since the
tests aren't exhaustive.

I suppose I'd encourage everyone to check in their upgrades as soon as
is realistic, too. Then I can, for instance, take on updating anything
left.

On Tue, May 25, 2010 at 3:17 PM, Sean Owen <sr...@gmail.com> wrote:
> Just to state what seems to be in progress -- looks like we are agreed
> we should move to the new Hadoop APIs. Some code is already using it;
> most of the part that isn't is the recommender which was due to some
> strange bugs deep in Hadoop in prior versions. It's time to try it
> again. I'm going to work on porting everything forward now.
>
> The other argument against this was that Amazon EMR runs 0.18.3. I
> think Jeff already established that what we're doing has already
> broken compatibility with 0.18.x. We can point those users to release
> 0.3 and say they can try to back-port that code to 0.18.x
> compatibility. But 0.4 onwards is for 0.20.x+ and you can run your own
> cluster using AWS and hopefully EMR updates soon.
>
> Sean
>

Re: Moving to new Hadoop APIs

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
NVM, I think I've got DistanceMeasures sorted out. Still working on
creating proper MockContexts to feed to the mapper and reducer tests.
I'll post my patch, in whatever state it's in, at the end of today.

On 5/26/10 1:35 PM, Jeff Eastman wrote:
> I've got most of Canopy converted but am exploding my brain trying to 
> figure out how best to coax DistanceMeasures to support a 
> configure(Configuration) method. There's some subtle inheritance in 
> the parameters package and I can't decide just where to touch it. I'm 
> running out of time before I head out tomorrow so I was hoping 
> somebody can sort it out while I'm gone.
>
> Jeff
>
> PS: I'm going sailing for a few days with my cousin Peter and his son 
> Trout (http://www.coastalview.com/center.asp?article=12251). Should be 
> back to a computer late Tuesday.
>
>
> On 5/25/10 7:17 AM, Sean Owen wrote:
>> Just to state what seems to be in progress -- looks like we are agreed
>> we should move to the new Hadoop APIs. Some code is already using it;
>> most of the part that isn't is the recommender which was due to some
>> strange bugs deep in Hadoop in prior versions. It's time to try it
>> again. I'm going to work on porting everything forward now.
>>
>> The other argument against this was that Amazon EMR runs 0.18.3. I
>> think Jeff already established that what we're doing has already
>> broken compatibility with 0.18.x. We can point those users to release
>> 0.3 and say they can try to back-port that code to 0.18.x
>> compatibility. But 0.4 onwards is for 0.20.x+ and you can run your own
>> cluster using AWS and hopefully EMR updates soon.
>>
>> Sean
>>
>


Re: Moving to new Hadoop APIs

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I've got most of Canopy converted but am exploding my brain trying to 
figure out how best to coax DistanceMeasures to support a 
configure(Configuration) method. There's some subtle inheritance in the 
parameters package and I can't decide just where to touch it. I'm 
running out of time before I head out tomorrow so I was hoping somebody 
can sort it out while I'm gone.

Jeff

PS: I'm going sailing for a few days with my cousin Peter and his son 
Trout (http://www.coastalview.com/center.asp?article=12251). Should be 
back to a computer late Tuesday.
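For what it's worth, one possible shape for that change, sketched under the
assumption that a measure should read its parameters from the new-API
Configuration. This is not the MAHOUT-167 patch; the interface, class, and
configuration key below are all illustrative:

import org.apache.hadoop.conf.Configuration;

// Hypothetical: a distance measure that configures itself from the
// new-API Configuration instead of the old JobConf.
interface ConfigurableDistanceMeasure {
  void configure(Configuration conf);
  double distance(double[] v1, double[] v2);
}

class WeightedEuclideanDistanceMeasure implements ConfigurableDistanceMeasure {

  private double weight = 1.0;

  @Override
  public void configure(Configuration conf) {
    // "mahout.distance.weight" is an illustrative key.
    weight = conf.getFloat("mahout.distance.weight", 1.0f);
  }

  @Override
  public double distance(double[] v1, double[] v2) {
    double sum = 0.0;
    for (int i = 0; i < v1.length; i++) {
      double d = v1[i] - v2[i];
      sum += weight * d * d;
    }
    return Math.sqrt(sum);
  }
}

A mapper's or reducer's setup() would then instantiate the measure (e.g. via
ReflectionUtils) and call measure.configure(context.getConfiguration()).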


On 5/25/10 7:17 AM, Sean Owen wrote:
> Just to state what seems to be in progress -- looks like we are agreed
> we should move to the new Hadoop APIs. Some code is already using it;
> most of the part that isn't is the recommender which was due to some
> strange bugs deep in Hadoop in prior versions. It's time to try it
> again. I'm going to work on porting everything forward now.
>
> The other argument against this was that Amazon EMR runs 0.18.3. I
> think Jeff already established that what we're doing has already
> broken compatibility with 0.18.x. We can point those users to release
> 0.3 and say they can try to back-port that code to 0.18.x
> compatibility. But 0.4 onwards is for 0.20.x+ and you can run your own
> cluster using AWS and hopefully EMR updates soon.
>
> Sean
>
>