Posted to dev@crunch.apache.org by Josh Wills <jw...@cloudera.com> on 2013/03/22 17:37:19 UTC

Crunch, Mahout, and HCatalog

Hey all,

I'm working on some tools for doing data integration and building machine
learning models w/Crunch, Mahout, and (soon!) HCatalog, and I wrote about
what I'm up to here:

http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/

and the code is here: https://github.com/cloudera/ml

I wanted to answer a couple of questions preemptively, if you don't mind:

Q: Why?
A: I started planning out the next version of my data science course, and I
was concerned that my students were going to spend too much time on data
integration tasks (e.g., converting CSVs to Vectors) that really should be
automated. I obviously enjoy writing my Java MR stuff in Crunch, and I
thought it would be a good idea to open source the tools to showcase how
awesome Crunch can be.

Q: Why not do this as part of the Crunch or Mahout projects?
A: Dependency management. Crunch doesn't depend on Mahout, and Mahout
doesn't depend on Crunch, and I think that for the sanity of the developers
of both projects, it should stay that way. Dependency management is already
enough of a nightmare for Hadoop projects that I didn't want to do anything
to make it worse. I will contribute anything from the toolkit back to
Crunch that is deemed useful by the community (e.g., the reservoir sampling
stuff in CRUNCH-178) and doesn't introduce any new dependencies.
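
For readers who haven't run into reservoir sampling before, here is a minimal
sketch of the basic idea in plain Java. This is the textbook "Algorithm R",
not the code from CRUNCH-178 or the toolkit; the class and method names
(ReservoirSample, sample) are made up for illustration, and it runs on a
single in-memory iterator rather than a Crunch PCollection.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; not part of Crunch or cloudera/ml.
final class ReservoirSample {

    // Returns up to k items chosen uniformly at random from a stream whose
    // length is not known in advance, using a single pass (Algorithm R).
    static <T> List<T> sample(Iterator<T> items, int k, Random rng) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be non-negative");
        }
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (items.hasNext()) {
            T item = items.next();
            seen++;
            if (reservoir.size() < k) {
                // The first k items fill the reservoir directly.
                reservoir.add(item);
            } else {
                // Pick a slot in [0, seen); the new item is kept only if the
                // slot falls inside the reservoir, i.e. with probability k / seen.
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, item);
                }
            }
        }
        return reservoir;
    }
}
```

Calling ReservoirSample.sample(records.iterator(), 10, new Random()) would
return 10 of the records, or all of them if there are fewer than 10. The
appeal for MapReduce-style jobs is that it needs only one pass and O(k) memory.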

Q: Where is this going?
A: I'm going to be co-developing the tools and the coursework for the
class, so I have a reasonably good idea of what features I need to add,
with HCatalog integration and ensemble models being the two major items on
the TODO list. I'm not looking to build a tool for every ML algorithm ever
invented, just a small set of core models that are easy to use, easy
to tune, and thus easy for new data scientists to get started with.

If there's anything else folks are curious about, please just let me know
and I'd be happy to answer.

Josh

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Crunch, Mahout, and HCatalog

Posted by Josh Wills <jw...@cloudera.com>.
On Sun, Mar 24, 2013 at 9:59 AM, Matthias Friedrich <ma...@mafr.de> wrote:

> On Friday, 2013-03-22, Josh Wills wrote:
> > I'm working on some tools for doing data integration and building machine
> > learning models w/Crunch, Mahout, and (soon!) HCatalog, and I wrote about
> > what I'm up to here:
> >
> > http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/
> >
> > and the code is here: https://github.com/cloudera/ml
>
> Cool thing, thanks for open sourcing it!
>
> [...]
> > Q: Why not do this as part of the Crunch or Mahout projects?
> > A: Dependency management. Crunch doesn't depend on Mahout, and Mahout
> > doesn't depend on Crunch, and I think that for the sanity of the developers
> > of both projects, it should stay that way. Dependency management is already
> > enough of a nightmare for Hadoop projects that I didn't want to do anything
> > to make it worse. I will contribute anything from the toolkit back to
> > Crunch that is deemed useful by the community (e.g., the reservoir sampling
> > stuff in CRUNCH-178) and doesn't introduce any new dependencies.
>
> This is really sad - but most probably the best decision for now. Do
> you happen to know if there is any work planned on the Hadoop side to
> clean up this situation?
>

Nothing that I'm aware of, but I copied Roman, who is more knowledgeable on
this topic than I am.


>
> Regards,
>   Matthias
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Crunch, Mahout, and HCatalog

Posted by Matthias Friedrich <ma...@mafr.de>.
On Friday, 2013-03-22, Josh Wills wrote:
> I'm working on some tools for doing data integration and building machine
> learning models w/Crunch, Mahout, and (soon!) HCatalog, and I wrote about
> what I'm up to here:
> 
> http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/
> 
> and the code is here: https://github.com/cloudera/ml

Cool thing, thanks for open sourcing it!

[...]
> Q: Why not do this as part of the Crunch or Mahout projects?
> A: Dependency management. Crunch doesn't depend on Mahout, and Mahout
> doesn't depend on Crunch, and I think that for the sanity of the developers
> of both projects, it should stay that way. Dependency management is already
> enough of a nightmare for Hadoop projects that I didn't want to do anything
> to make it worse. I will contribute anything from the toolkit back to
> Crunch that is deemed useful by the community (e.g., the reservoir sampling
> stuff in CRUNCH-178) and doesn't introduce any new dependencies.

This is really sad - but most probably the best decision for now. Do
you happen to know if there is any work planned on the Hadoop side to
clean up this situation?

Regards,
  Matthias