Posted to dev@mahout.apache.org by Andrew Palumbo <ap...@outlook.com> on 2016/09/06 20:33:46 UTC

Mahout distro Size

The current apache-mahout-distribution-0.12.2.tar.gz<http://mirror.stjschools.org/public/apache/mahout/0.12.2/apache-mahout-distribution-0.12.2.tar.gz> is 224M. We need to look for ways to get this size down.

  1.  A few possibilities:

  2.  Drop h2o (binary only) from Distro? (18M - unused)

  3.  MAHOUT-1865<https://issues.apache.org/jira/browse/MAHOUT-1865>: Remove Hadoop 1 support. Could save us some space.

  4.  MAHOUT-1706<https://issues.apache.org/jira/browse/MAHOUT-1706>: Remove dependency jars from /lib in mahout binary distribution. Should also save space.

  5.  Having dropped support for MAHOUT_LOCAL, we can now likely set a lot of dependencies to <provided> scope. We can revisit MAHOUT-1705<https://issues.apache.org/jira/browse/MAHOUT-1705>: Verify dependencies in job jar for mahout-examples.

     *   16M    ./lib/hadoop

     *   85M    ./lib/

        *   Many of the jars in /lib/ and possibly /lib/hadoop are already packaged into the mahout-examples jar, so adding them to the classpath from /lib/ is redundant. Many may also be provided.
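
One rough way to check the redundancy described in the last bullet is to compare the class entries of each jar under /lib/ against the mahout-examples job jar. The following is a minimal Python sketch, not part of Mahout: the function names and the example paths are hypothetical, and it only compares class names, so a match does not prove the bundled copy is the same version.

```python
import zipfile
from pathlib import Path


def class_entries(jar_path):
    """Return the set of .class entry names in a jar (a jar is a zip archive)."""
    with zipfile.ZipFile(jar_path) as jar:
        return {name for name in jar.namelist() if name.endswith(".class")}


def redundant_jars(lib_dir, job_jar):
    """Return (jar name, size in bytes) for every jar under lib_dir
    whose classes are all present in job_jar."""
    bundled = class_entries(job_jar)
    job_jar_path = Path(job_jar).resolve()
    hits = []
    for jar_path in sorted(Path(lib_dir).rglob("*.jar")):
        if jar_path.resolve() == job_jar_path:
            continue  # do not compare the job jar with itself
        classes = class_entries(jar_path)
        # Jars without any classes (resources only) are skipped.
        if classes and classes <= bundled:
            hits.append((jar_path.name, jar_path.stat().st_size))
    return hits


# Hypothetical usage against an unpacked distribution (paths are assumptions):
#   hits = redundant_jars("lib", "mahout-examples-0.12.2-job.jar")
#   print(sum(size for _, size in hits) / 2**20, "MB look redundant")
```

Anything this reports would still need testing before removal, since jars on the classpath can also be picked up by scripts that do not use the job jar.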

Re: Mahout distro Size

Posted by Andrew Palumbo <ap...@outlook.com>.
Actually I think I remember Dr. Cos saying, around the time we started working with Bigtop, that he went through their POMs with a fine-toothed comb and helped them get everything in order. Maybe we could ask him to help us out.

________________________________
From: Dmitriy Lyubimov <dl...@gmail.com>
Sent: Tuesday, September 6, 2016 8:24:29 PM
To: dev@mahout.apache.org
Subject: Re: Mahout distro Size

I dunno. They build a shaded assembly artifact, it seems, and are happy with
this approach. It would seem we'd just need the legacy deps in a similar
case.


Re: Mahout distro Size

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I dunno. They build a shaded assembly artifact, it seems, and are happy with
this approach. It would seem we'd just need the legacy deps in a similar
case.

On Tue, Sep 6, 2016 at 4:48 PM, Andrew Palumbo <ap...@outlook.com> wrote:

> bq.
>
> 4: Other projects do something too. Spark (at least it used to) produces
> tons of lib-managed deps as the result of its build; they probably still
> do?
>
>
> Do you mean using something like Spark's dependency resolver?

Re: Mahout distro Size

Posted by Andrew Palumbo <ap...@outlook.com>.
bq.

4: Other projects do something too. Spark (at least it used to) produces
tons of lib-managed deps as the result of its build; they probably still
do?


Do you mean using something like Spark's dependency resolver?

________________________________
From: Dmitriy Lyubimov <dl...@gmail.com>
Sent: Tuesday, September 6, 2016 4:46:24 PM
To: dev@mahout.apache.org
Subject: Re: Mahout distro Size

2 + 1
3 + 1

4: Other projects do something too. Spark (at least it used to) produces
tons of lib-managed deps as the result of its build; they probably still
do?

On the other hand, the Samsara-only dependencies are really light. Backends
are really always "provided", and the rest of it is small enough not
to be an issue either way. But we probably definitely should drop local
support for MR stuff (MR local mode didn't work correctly anyway, last time
I checked).


Re: Mahout distro Size

Posted by Andrew Palumbo <ap...@outlook.com>.
I'm uncertainly sure that each is understood [?]

________________________________
From: Suneel Marthi <sm...@apache.org>
Sent: Tuesday, September 6, 2016 4:54:09 PM
To: mahout
Subject: Re: Mahout distro Size

On Tue, Sep 6, 2016 at 4:47 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> PS I probably should not say "probably definitely" next to each other.
> Definitely just definitely :)
>

That's fine.

 "Openly Closed" is now officially part of the Apache Lexicon, so why not add
"Definitely Probable"?



Re: Mahout distro Size

Posted by Suneel Marthi <sm...@apache.org>.
On Tue, Sep 6, 2016 at 4:47 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> PS I probably should not say "probably definitely" next to each other.
> Definitely just definitely :)
>

That's fine.

 "Openly Closed" is now officially part of the Apache Lexicon, so why not add
"Definitely Probable"?



Re: Mahout distro Size

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PS I probably should not say "probably definitely" next to each other.
Definitely just definitely :)

On Tue, Sep 6, 2016 at 1:46 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> 2 + 1
> 3 + 1
>
> 4: Other projects do something too. Spark (at least it used to) produces
> tons of lib-managed deps as the result of its build; they probably still
> do?
>
> On the other hand, the Samsara-only dependencies are really light.
> Backends are really always "provided", and the rest of it is small
> enough not to be an issue either way. But we probably definitely should
> drop local support for MR stuff (MR local mode didn't work correctly
> anyway, last time I checked).

Re: Mahout distro Size

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
2 + 1
3 + 1

4: Other projects do something too. Spark (at least it used to) produces
tons of lib-managed deps as the result of its build; they probably still
do?

On the other hand, the Samsara-only dependencies are really light. Backends
are really always "provided", and the rest of it is small enough not
to be an issue either way. But we probably definitely should drop local
support for MR stuff (MR local mode didn't work correctly anyway, last time
I checked).


Re: Mahout distro Size

Posted by Andrew Palumbo <ap...@outlook.com>.
Ok sounds good, I agree with all as well.



I think that I have a PR that I could resurrect for #4 (MAHOUT-1706). I'd tried it just before a release, and then pulled it at the last minute. I think that it was relatively simple: https://github.com/apache/mahout/pull/129. I only pulled it because I did not have time to test it well enough. With some minor updates, this should take care of it.

MAHOUT-1706: remove dependency jars from /lib in the binary distribution by andrewpalumbo · Pull Request #129 · apache/mahout<https://github.com/apache/mahout/pull/129>
The mahout distribution currently is shipping ~56 MB of dependency jars in the /lib directory of the distribution. These are only added to the classpath by /bin/mahout in the binary distribution. ...

+1 to #5, which is covered by MAHOUT-1705 and needs to be reopened. This will take a bit of work and, I'm sure, a good amount of testing.

As far as MAHOUT_LOCAL goes, it is already in the process of being phased out. It has been removed from all of the examples.

Here's my +1 to dropping it altogether.

________________________________
From: Suneel Marthi <su...@gmail.com>
Sent: Tuesday, September 6, 2016 4:55:10 PM
To: mahout
Subject: Re: Mahout distro Size

+1 to all of them. 2 and 3 are very trivial to do.  Definitely consider
doing #5.



Re: Mahout distro Size

Posted by Suneel Marthi <su...@gmail.com>.
+1 to all of them. 2 and 3 are very trivial to do.  Definitely consider
doing #5.

