Posted to dev@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2012/02/22 15:37:45 UTC

Hadoop Utils

We've collected a fair bit of Hadoop utils over the years.  I am finding them generally useful in other projects.  Would it make sense to either split them out to a standalone jar and/or donate them upstream to Hadoop itself?

I'm thinking of things like:
Seq File iterators and potentially the SeqFileDumper too
AbstractJob and related

My gut preference is that we maintain ownership of them but publish them in a separate JAR.

Thoughts?

-Grant
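
For readers who haven't used these utilities, the sketch below shows the general shape of the sequence-file iterator wrappers Grant mentions: something that turns a Hadoop SequenceFile into a plain Iterable of key/value pairs. It is written directly against Hadoop's SequenceFile.Reader API; the class name, the Pair holder, and the overall structure are invented for illustration and are not Mahout's actual implementation.

import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

/** Illustrative only: exposes a SequenceFile as an Iterable of key/value pairs. */
public final class SimpleSeqFileIterable implements Iterable<SimpleSeqFileIterable.Pair> {

  /** Simple holder; note that the underlying Writable instances are reused. */
  public static final class Pair {
    public final Writable key;
    public final Writable value;
    Pair(Writable key, Writable value) {
      this.key = key;
      this.value = value;
    }
  }

  private final Path path;
  private final Configuration conf;

  public SimpleSeqFileIterable(Path path, Configuration conf) {
    this.path = path;
    this.conf = conf;
  }

  @Override
  public Iterator<Pair> iterator() {
    try {
      final SequenceFile.Reader reader =
          new SequenceFile.Reader(FileSystem.get(conf), path, conf);
      final Writable key =
          (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      final Writable value =
          (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

      return new Iterator<Pair>() {
        private Boolean hasNextCached;   // null = not yet read ahead

        @Override
        public boolean hasNext() {
          if (hasNextCached == null) {
            try {
              hasNextCached = reader.next(key, value);   // reads into the shared instances
              if (!hasNextCached) {
                reader.close();                          // exhausted; release the file handle
              }
            } catch (IOException e) {
              throw new IllegalStateException(e);
            }
          }
          return hasNextCached;
        }

        @Override
        public Pair next() {
          if (!hasNext()) {
            throw new NoSuchElementException();
          }
          hasNextCached = null;            // force a fresh read on the next hasNext()
          return new Pair(key, value);     // contents valid until the iterator advances
        }

        @Override
        public void remove() {
          throw new UnsupportedOperationException();
        }
      };
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }
}

With something like that in a small, stable JAR, reading a sequence file becomes a plain for-each loop over key/value pairs, which is the kind of convenience that is useful well beyond Mahout.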

Re: Hadoop Utils

Posted by Grant Ingersoll <gs...@apache.org>.
On Feb 22, 2012, at 9:52 AM, Sean Owen wrote:

> I think it's fine to let them live in integration here rather than a new
> module. The iterators could be useful upstream yes and maybe a few more
> bits.



> The AbstractJob might still be a little too app specific.

I've been reusing some of it, although I don't much need the default options (other than in/out).  Lately, instead of extending AbstractJob, I construct a command-line object that uses the CLI-processing pieces of AbstractJob and then feed the results in as needed via regular API calls.  It separates out the command-line processing a bit and feels cleaner to me.

I'll see if I can do some refactoring to clean up and show what I mean.
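
One way that split could look, purely as a sketch: the option handling below uses plain Apache Commons CLI rather than the CLI code actually inside AbstractJob, and the driver class, option names, and runJob() signature are all invented for the example.

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.GnuParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;

import org.apache.hadoop.fs.Path;

/**
 * Hypothetical driver: all command-line handling happens up front,
 * and the job itself is driven through an ordinary method call.
 */
public final class MyJobDriver {

  public static void main(String[] args) {
    // 1. Option parsing lives in one place.
    Options options = new Options();
    options.addOption("i", "input", true, "input path");
    options.addOption("o", "output", true, "output path");
    options.addOption("k", "numClusters", true, "number of clusters (default 10)");

    CommandLine cmd;
    try {
      cmd = new GnuParser().parse(options, args);
    } catch (ParseException e) {
      System.err.println(e.getMessage());
      new HelpFormatter().printHelp("MyJobDriver", options);
      return;
    }
    if (!cmd.hasOption("input") || !cmd.hasOption("output")) {
      new HelpFormatter().printHelp("MyJobDriver", options);
      return;
    }

    // 2. The job API never sees the command line at all; it just takes values.
    runJob(new Path(cmd.getOptionValue("input")),
           new Path(cmd.getOptionValue("output")),
           Integer.parseInt(cmd.getOptionValue("numClusters", "10")));
  }

  private static void runJob(Path input, Path output, int numClusters) {
    // job configuration and submission would go here
  }
}

The point is only that the job entry point takes ordinary typed arguments, so it can be called from tests or other drivers without going through option parsing at all.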


> On Feb 22, 2012 2:37 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
> 
>> We've collected a fair bit of Hadoop utils over the years.  I am finding
>> them generally useful in other projects.  Would it make sense to either
>> split them out to a standalone jar and/or donate them upstream to Hadoop
>> itself?
>> 
>> I'm thinking of things like:
>> Seq File iterators and potentially the SeqFileDumper too
>> AbstractJob and related
>> 
>> My gut preference is that we maintain ownership of them but publish them in a
>> separate JAR.
>> 
>> Thoughts?
>> 
>> -Grant



Re: Hadoop Utils

Posted by Ted Dunning <te...@gmail.com>.
But is integration published as a separate jar?

Sent from my iPhone

On Feb 22, 2012, at 6:52 AM, Sean Owen <sr...@gmail.com> wrote:

> I think it's fine to let them live in integration here rather than a new
> module. The iterators could be useful upstream yes and maybe a few more
> bits. The AbstractJob might still be a little too app specific.
> On Feb 22, 2012 2:37 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
> 
>> We've collected a fair bit of Hadoop utils over the years.  I am finding
>> them generally useful in other projects.  Would it make sense to either
>> split them out to a standalone jar and/or donate them upstream to Hadoop
>> itself?
>> 
>> I'm thinking of things like:
>> Seq File iterators and potentially the SeqFileDumper too
>> AbstractJob and related
>> 
>> My gut preference is that we maintain ownership of them but publish them in a
>> separate JAR.
>> 
>> Thoughts?
>> 
>> -Grant

Re: Hadoop Utils

Posted by Sean Owen <sr...@gmail.com>.
I think it's fine to let them live in integration here rather than a new
module. The iterators could be useful upstream yes and maybe a few more
bits. The AbstractJob might still be a little too app specific.
On Feb 22, 2012 2:37 PM, "Grant Ingersoll" <gs...@apache.org> wrote:

> We've collected a fair bit of Hadoop utils over the years.  I am finding
> them generally useful in other projects.  Would it make sense to either
> split them out to a standalone jar and/or donate them upstream to Hadoop
> itself?
>
> I'm thinking of things like:
> Seq File iterators and potentially the SeqFileDumper too
> AbstractJob and related
>
> My gut preference is that we maintain ownership of them but publish them in a
> separate JAR.
>
> Thoughts?
>
> -Grant

Re: Hadoop Utils

Posted by Isabel Drost <is...@apache.org>.
On 22.02.2012 Jake Mannix wrote:
> Separate jar means separate maven artifact, right?  I think that breaking
> things up a little has a few negatives (requires people to depend on more
> things, often),

Assuming most of our users depend on us from Maven projects: couldn't we
provide an additional, purely aggregating artifact that simply depends on all
submodules, so users can pull in everything-Mahout at once?

Isabel

Re: Hadoop Utils

Posted by Jake Mannix <ja...@gmail.com>.
On Wed, Feb 22, 2012 at 8:19 AM, Ted Dunning <te...@gmail.com> wrote:

> Separate jar does mean separate maven artifact but the dependency
> mechanism should handle that and the new artifacts should be very stable.
>

agreed.


>
> Sent from my iPhone
>
> On Feb 22, 2012, at 6:54 AM, Jake Mannix <ja...@gmail.com> wrote:
>
> > On Wed, Feb 22, 2012 at 6:37 AM, Grant Ingersoll <gsingers@apache.org> wrote:
> >
> >> We've collected a fair bit of Hadoop utils over the years.  I am finding
> >> them generally useful in other projects.  Would it make sense to either
> >> split them out to a standalone jar and/or donate them upstream to Hadoop
> >> itself?
> >>
> >> I'm thinking of things like:
> >> Seq File iterators and potentially the SeqFileDumper too
> >> AbstractJob and related
> >>
> >> My gut preference is that we maintain ownership of them but publish them in a
> >> separate JAR.
> >>
> >
> > +1
> >
> > And as many of the non-business-logic-coupled *Writables, as we've
> > discussed before (I think there's even a ticket open for this part).
> >
> > Separate jar means separate maven artifact, right?  I think that breaking
> > things up a little has a few negatives (requires people to depend on more
> > things, often), but the positives outweigh them (people can depend on only the
> > things they need, and code gets shared more widely, more adoption, etc...).
> >
> >  -jake
>

Re: Hadoop Utils

Posted by Ted Dunning <te...@gmail.com>.
Separate jar does mean separate maven artifact but the dependency mechanism should handle that and the new artifacts should be very stable. 

Sent from my iPhone

On Feb 22, 2012, at 6:54 AM, Jake Mannix <ja...@gmail.com> wrote:

> On Wed, Feb 22, 2012 at 6:37 AM, Grant Ingersoll <gs...@apache.org> wrote:
> 
>> We've collected a fair bit of Hadoop utils over the years.  I am finding
>> them generally useful in other projects.  Would it make sense to either
>> split them out to a standalone jar and/or donate them upstream to Hadoop
>> itself?
>> 
>> I'm thinking of things like:
>> Seq File iterators and potentially the SeqFileDumper too
>> AbstractJob and related
>> 
>> My gut preference is that we maintain ownership of them but publish them in a
>> separate JAR.
>> 
> 
> +1
> 
> And as many of the non-business-logic-coupled *Writables, as we've
> discussed before (I think there's even a ticket open for this part).
> 
> Separate jar means separate maven artifact, right?  I think that breaking
> things up a little has a few negatives (requires people to depend on more
> things, often), but the positives outweigh them (people can depend on only the
> things they need, and code gets shared more widely, more adoption, etc...).
> 
>  -jake

Re: Hadoop Utils

Posted by Jake Mannix <ja...@gmail.com>.
On Wed, Feb 22, 2012 at 6:37 AM, Grant Ingersoll <gs...@apache.org> wrote:

> We've collected a fair bit of Hadoop utils over the years.  I am finding
> them generally useful in other projects.  Would it make sense to either
> split them out to a standalone jar and/or donate them upstream to Hadoop
> itself?
>
> I'm thinking of things like:
> Seq File iterators and potentially the SeqFileDumper too
> AbstractJob and related
>
> My gut preference is that we maintain ownership of them but publish them in a
> separate JAR.
>

+1

And as many of the non-business-logic-coupled *Writables, as we've
discussed before (I think there's even a ticket open for this part).
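
"Non-business-logic-coupled *Writables" here are Writables that carry generic data and have no application-specific dependencies, which is what makes them easy to share. A bare-bones example of what such a standalone Writable looks like (the class and its name are invented for illustration, not one of Mahout's actual Writables):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

/** Toy example of a self-contained Writable with no application-specific dependencies. */
public final class ExamplePairWritable implements Writable {

  private int first;
  private int second;

  public ExamplePairWritable() {
    // Hadoop requires a no-argument constructor for deserialization
  }

  public ExamplePairWritable(int first, int second) {
    this.first = first;
    this.second = second;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(first);
    out.writeInt(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    first = in.readInt();
    second = in.readInt();
  }

  public int getFirst() {
    return first;
  }

  public int getSecond() {
    return second;
  }
}

Classes like this depend only on Hadoop's io package, which is why they are good candidates for a shared utility JAR.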

Separate jar means separate maven artifact, right?  I think that breaking
things up a little has a few negatives (requires people to depend on more
things, often), but the positives outweigh them (people can depend on only the
things they need, and code gets shared more widely, more adoption, etc...).

  -jake