Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2014/12/12 18:38:34 UTC

The next time someone wants to help

The next time someone wants to get into contributing to Mahout, wouldn’t it be nice to prune dependencies?

For instance, the spark module depends on math-scala, which depends on math (at least ideally; in reality the dependencies also include mr-legacy). If some things were refactored into math we might have a much more streamlined dependency tree. Some things in math could also be replaced with newer Scala libs, and so could be moved out to a java-common module or something similar that would not be required by the Scala code.
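
In the meantime a downstream build can prune some of this by hand. Purely as a sketch (the artifact IDs here are illustrative, not necessarily the published ones; check the real POMs), an sbt user could exclude the legacy module from the transitive tree:

    // Hypothetical sbt sketch: take math-scala but keep the legacy
    // MapReduce module out of the transitive dependency tree.
    libraryDependencies += ("org.apache.mahout" % "mahout-math-scala" % "1.0-SNAPSHOT")
      .exclude("org.apache.mahout", "mahout-mrlegacy")

Maven users can do the same with an <exclusions> block, but neither beats fixing the module boundaries at the source.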

If people are going to use the V1 version of Mahout, it would be nice if that choice didn’t force them to drag along all the legacy code when it isn’t being used.

Re: The next time someone wants to help

Posted by Pat Ferrel <pa...@occamsmachete.com>.
You guys know the mrlegacy code much better than I do. I was just thinking that since it is pretty big, it would be nice to prune some of it from the Scala deps.

Pruning would also help with another issue: creating either an “all deps” artifact or a managed libs module. I had another weird bug (that’s 2 now) that seems related to dependencies being supplied on the classpath instead of in the jars. So it runs on one machine (dev) but not on another (cluster). This will drive users crazy if it happens very often.
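
One way to chase this kind of thing down (just a diagnostic sketch, nothing we ship) is to ask the JVM on each machine where it actually loaded a suspect class from:

    // Hypothetical diagnostic: print the jar or directory each named class
    // was loaded from, to diff a dev box against a cluster node.
    object WhichJar {
      def main(args: Array[String]): Unit = args.foreach { name =>
        val src = Class.forName(name).getProtectionDomain.getCodeSource
        println(s"$name -> ${if (src == null) "bootstrap classpath" else src.getLocation}")
      }
    }

Running it with, say, org.apache.mahout.math.VectorWritable on both machines shows immediately whether the class resolves to the same artifact in both places.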

On Dec 12, 2014, at 2:31 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

On Fri, Dec 12, 2014 at 2:24 PM, Ted Dunning <te...@gmail.com> wrote:
> 
> Hadoop dependencies are a quagmire.
> 
> It would be far preferable to rewrite the necessary serialization to avoid
> Hadoop dependencies entirely.
> 
> If we are dropping the MR code, why do we need to reference the VectorWritable
> class at all?
> 

Yes, this is the only form of serialization right now. And yes, it would be
much preferable to rewrite it without going through Writable, e.g. in Kryo
terms.

Given the amount of activity in that domain lately, though, I am just being
realistic here.

But yes, I support getting rid of Writable-based serialization.

We do need the SequenceFile format, though.

Also keep in mind that Spark brings in Hadoop dependencies as well, which is
sort of both a blessing and a curse.

A blessing because we don't have to declare a particular Hadoop dependency
any longer.

A curse because the actual Hadoop version of course depends on the
parameters of the Spark build, not on what the POM and Maven tell us. So we
are constrained to only those pieces that are "forever" compatible across
Hadoop history.


> 


Re: The next time someone wants to help

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Fri, Dec 12, 2014 at 2:24 PM, Ted Dunning <te...@gmail.com> wrote:
>
> Hadoop dependencies are a quagmire.
>
> It would be far preferable to rewrite the necessary serialization to avoid
> Hadoop dependencies entirely.
>
> If we are dropping the MR code, why do we need to reference the VectorWritable
> class at all?
>

Yes, this is the only form of serialization right now. And yes, it would be
much preferable to rewrite it without going through Writable, e.g. in Kryo
terms.

Given the amount of activity in that domain lately, though, I am just being
realistic here.

But yes, I support getting rid of Writable-based serialization.

We do need the SequenceFile format, though.
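
For what the Kryo route might look like on the Spark side, here is a minimal sketch, assuming Spark's registerKryoClasses hook (added in Spark 1.2) and the mahout-math vector classes:

    import org.apache.spark.SparkConf
    import org.apache.mahout.math.{DenseVector, RandomAccessSparseVector, SequentialAccessSparseVector}

    // Sketch: switch Spark to Kryo and register the mahout-math vector
    // classes so they serialize without touching Writable at all.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(
        classOf[DenseVector],
        classOf[RandomAccessSparseVector],
        classOf[SequentialAccessSparseVector]))

Registration is optional with Kryo, but it keeps the serialized form compact, since otherwise the full class name travels with every record.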

Also keep in mind that Spark brings in Hadoop dependencies as well, which is
sort of both a blessing and a curse.

A blessing because we don't have to declare a particular Hadoop dependency
any longer.

A curse because the actual Hadoop version of course depends on the
parameters of the Spark build, not on what the POM and Maven tell us. So we
are constrained to only those pieces that are "forever" compatible across
Hadoop history.
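
And since the effective version is a build-time property of Spark, the only reliable check is at runtime; a one-liner sketch using Hadoop's own version API:

    import org.apache.hadoop.util.VersionInfo

    // Prints whichever Hadoop actually arrived on the classpath,
    // which may not match what the POM suggests.
    println(s"Hadoop at runtime: ${VersionInfo.getVersion}")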


>

Re: The next time someone wants to help

Posted by Ted Dunning <te...@gmail.com>.
Hadoop dependencies are a quagmire.

It would be far preferable to rewrite the necessary serialization to avoid
Hadoop dependencies entirely.

If we are dropping the MR code, why do we need to reference the VectorWritable
class at all?

Even in the worst case, we could simply recode the binary layer from
scratch, without the heinous dependencies.
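
Strictly as an illustration of how small that binary layer could be (this is not Mahout's VectorWritable wire format, just a sketch for dense vectors):

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

    // Hypothetical dependency-free codec: a length prefix followed by the
    // raw doubles, using nothing outside java.io.
    object DenseVectorCodec {
      def write(v: Array[Double]): Array[Byte] = {
        val buf = new ByteArrayOutputStream()
        val out = new DataOutputStream(buf)
        out.writeInt(v.length)
        v.foreach(out.writeDouble)
        out.flush()
        buf.toByteArray
      }

      def read(bytes: Array[Byte]): Array[Double] = {
        val in = new DataInputStream(new ByteArrayInputStream(bytes))
        Array.fill(in.readInt())(in.readDouble())
      }
    }

Sparse vectors, named vectors, and matrices would each need a tag byte and their own layout, but none of that requires Hadoop.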



On Fri, Dec 12, 2014 at 10:06 AM, Dmitriy Lyubimov <dl...@gmail.com>
wrote:
>
> A bit more detail on what needs to happen here IMO:
>
> Likely, the Hadoop-related things we still need for spark etc., like
> VectorWritable, need to be factored out into a (new) module, mahout-hadoop or
> something. The important notion here is that we only want to depend on
> hadoop-common, which in theory should be common to both the new and old
> Hadoop MR APIs. We may find that we need hdfs there as well, e.g.
> perhaps for reading sequence file headers, not sure; but we definitely do
> not need anything from mapreduce.
>
> Math still cannot depend on that mahout-hadoop, since math must not depend
> on anything Hadoop; that has been the premise since the beginning.
> Mahout-math is an in-core-ops-only, lightweight, self-contained thing.
>
> More likely, the spark module (and maybe some others, if they use that) will
> have to depend on the Hadoop serialization for vectors and matrices directly,
> i.e. on mahout-hadoop.
>
> The mrlegacy stuff of course needs to be completely isolated (nobody else
> depends on it) and made dependent on mahout-hadoop as well.
>
> On Fri, Dec 12, 2014 at 9:38 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> >
> > The next time someone wants to get into contributing to Mahout, wouldn’t
> > it be nice to prune dependencies?
> >
> > For instance, the spark module depends on math-scala, which depends on
> > math (at least ideally; in reality the dependencies also include
> > mr-legacy). If some things were refactored into math we might have a much
> > more streamlined dependency tree. Some things in math could also be
> > replaced with newer Scala libs, and so could be moved out to a java-common
> > module or something similar that would not be required by the Scala code.
> >
> > If people are going to use the V1 version of Mahout, it would be nice if
> > that choice didn’t force them to drag along all the legacy code when it
> > isn’t being used.
>

Re: The next time someone wants to help

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
A bit more detail on what needs to happen here IMO:

Likely, the Hadoop-related things we still need for spark etc., like
VectorWritable, need to be factored out into a (new) module, mahout-hadoop or
something. The important notion here is that we only want to depend on
hadoop-common, which in theory should be common to both the new and old
Hadoop MR APIs. We may find that we need hdfs there as well, e.g.
perhaps for reading sequence file headers, not sure; but we definitely do
not need anything from mapreduce.

Math still cannot depend on that mahout-hadoop, since math must not depend
on anything Hadoop; that has been the premise since the beginning.
Mahout-math is an in-core-ops-only, lightweight, self-contained thing.

More likely, the spark module (and maybe some others, if they use that) will
have to depend on the Hadoop serialization for vectors and matrices directly,
i.e. on mahout-hadoop.

The mrlegacy stuff of course needs to be completely isolated (nobody else
depends on it) and made dependent on mahout-hadoop as well.
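
To make the seam concrete, here is a sketch of the kind of reader such a mahout-hadoop module would own (the module split is exactly what is being proposed, so treat the placement as tentative):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, SequenceFile}
    import org.apache.mahout.math.VectorWritable

    // Sketch: read a SequenceFile of (IntWritable, VectorWritable). The
    // Hadoop classes used here all live in hadoop-common; an hdfs:// path
    // would additionally pull in the hdfs artifact, but nothing from mapreduce.
    def readVectors(path: String): Unit = {
      val reader = new SequenceFile.Reader(
        new Configuration(), SequenceFile.Reader.file(new Path(path)))
      try {
        val key = new IntWritable()
        val value = new VectorWritable()
        while (reader.next(key, value))
          println(s"row ${key.get}: vector of size ${value.get.size}")
      } finally reader.close()
    }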

On Fri, Dec 12, 2014 at 9:38 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> The next time someone wants to get into contributing to Mahout, wouldn’t
> it be nice to prune dependencies?
>
> For instance, the spark module depends on math-scala, which depends on
> math (at least ideally; in reality the dependencies also include
> mr-legacy). If some things were refactored into math we might have a much
> more streamlined dependency tree. Some things in math could also be
> replaced with newer Scala libs, and so could be moved out to a java-common
> module or something similar that would not be required by the Scala code.
>
> If people are going to use the V1 version of Mahout, it would be nice if
> that choice didn’t force them to drag along all the legacy code when it
> isn’t being used.

Re: The next time someone wants to help

Posted by Andrew Musselman <an...@gmail.com>.
Yes +1

On Fri, Dec 12, 2014 at 9:38 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> The next time someone wants to get into contributing to Mahout, wouldn’t
> it be nice to prune dependencies?
>
> For instance, the spark module depends on math-scala, which depends on
> math (at least ideally; in reality the dependencies also include
> mr-legacy). If some things were refactored into math we might have a much
> more streamlined dependency tree. Some things in math could also be
> replaced with newer Scala libs, and so could be moved out to a java-common
> module or something similar that would not be required by the Scala code.
>
> If people are going to use the V1 version of Mahout, it would be nice if
> that choice didn’t force them to drag along all the legacy code when it
> isn’t being used.