Posted to dev@mahout.apache.org by "Ted Dunning (Created) (JIRA)" <ji...@apache.org> on 2011/11/05 01:05:52 UTC

[jira] [Created] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Extract Writables into a separate module to allow smaller dependencies
----------------------------------------------------------------------

                 Key: MAHOUT-874
                 URL: https://issues.apache.org/jira/browse/MAHOUT-874
             Project: Mahout
          Issue Type: Improvement
            Reporter: Ted Dunning


The theory is that we can have a smaller jar if we only include writable classes and their exact dependencies.

I have a prototype, but it has some funky characteristics which I would like to discuss.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172935#comment-13172935 ] 

Ted Dunning commented on MAHOUT-874:
------------------------------------

On Mon, Dec 19, 2011 at 8:41 PM, Jake Mannix (Commented) (JIRA) wrote:

There are also EntityCountWritable, EntityEntityWritable,
EntityPrefWritable, EntityPrefWritableArrayWritable,
RecommendedItemsWritable, PrefAndSimilarityColumnWritable,
VectorAndPrefsWritable, VectorOrPrefWritable.

The dependency on hadoop is huge, yes, but if we're running on hadoop

I think that including core but not hadoop might do the trick even so.
Suddenly it occurs to me that the right way to deal with this is to use
Maven's provided scope.

I only used the jar size in MB as a measure of how large the transitive
dependencies actually are.

                

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Jake Mannix (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144821#comment-13144821 ] 

Jake Mannix commented on MAHOUT-874:
------------------------------------

If Cluster is bringing in too much, maybe in this first pass, we don't move it over?  Keep this new jar/module small for now, and leave as a future JIRA ticket to find a way to extract Cluster out of core and get it into the writable module.

Realistically, we could keep this jar super tiny to start with (the taste/*Writables, o.a.m.common.*Writable, and o.a.m.math.*Writable), and only pull in more complicated stuff that isn't properly decoupled later.
                

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172942#comment-13172942 ] 

Ted Dunning commented on MAHOUT-874:
------------------------------------

OK.  Putting hadoop in as "provided" reduces the size of all of the
dependencies to 3.8MB.  Eliminating slf4j drops this to 3.7MB.  Eliminating
mahout-math drops this to 60KB.

Ergo, mahout-math is by far the tall pole and roughly 3.8MB is the
reasonable minimum for the transitive dependencies.  This is not all that
bad and is a lot better than the 20MB that we started with.
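
A minimal sketch of the provided-scope declaration described above, using the hadoop-core coordinates from the dependency list Ted posted on this issue. Where exactly this would sit in the prototype's pom is an assumption:

```xml
<!-- Assumed placement: the <dependencies> section of the writables module's pom.xml -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.204.0</version>
  <!-- provided: on the compile classpath, expected from the cluster at run time,
       and not passed on to downstream projects as a transitive dependency -->
  <scope>provided</scope>
</dependency>
```

With this scope Hadoop stays available for compilation but is left out of the bundled jar, which is consistent with the size drop reported above.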

                

[jira] [Issue Comment Edited] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Grant Ingersoll (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144980#comment-13144980 ] 

Grant Ingersoll edited comment on MAHOUT-874 at 11/6/11 11:33 AM:
------------------------------------------------------------------

Why is Cluster even dependent on VectorWritable?  Shouldn't it just be dependent on Vector?  Seems to me that VectorWritable should only ever be instantiated inside of a Map/Reduce job.  All the core stuff should just take Vector.

Stuff like:
{code}
@Override
public void observe(VectorWritable x) {
  observe(x.get());
}
{code}

just seems silly.  We already have observe(Vector).

Not that it necessarily solves the problem just yet, but it still strikes me as not needed.  Perhaps the same is also true for Model?  In fact, could Model be moved to Math?  Seems fairly generic and perhaps useful outside of clustering?  Then, we could have ModelWritable which takes care of the Writable part of it.
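
To make the suggestion concrete, here is a self-contained sketch of the split being proposed: core logic accepts only a plain vector, and the writable wrapper appears only at the serialization boundary. Vec, VecWritable and RunningSum are hypothetical stand-ins invented for this illustration; they are not the Mahout Vector, VectorWritable or Cluster classes, and the byte layout is not Mahout's wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class VectorBoundarySketch {

  /** Hypothetical stand-in for a math-only vector type: no serialization code. */
  static final class Vec {
    final double[] values;

    Vec(double... values) {
      this.values = values.clone();
    }
  }

  /** Hypothetical stand-in for a writable wrapper: serialization only. */
  static final class VecWritable {
    private Vec vec;

    VecWritable() { }

    VecWritable(Vec vec) {
      this.vec = vec;
    }

    Vec get() {
      return vec;
    }

    void write(DataOutput out) throws IOException {
      out.writeInt(vec.values.length);
      for (double v : vec.values) {
        out.writeDouble(v);
      }
    }

    void readFields(DataInput in) throws IOException {
      double[] values = new double[in.readInt()];
      for (int i = 0; i < values.length; i++) {
        values[i] = in.readDouble();
      }
      vec = new Vec(values);
    }
  }

  /** Core-side model: observe() takes only Vec, so it needs no writable classes. */
  static final class RunningSum {
    private double total;
    private int count;

    void observe(Vec x) {
      for (double v : x.values) {
        total += v;
      }
      count++;
    }

    double total() {
      return total;
    }

    int count() {
      return count;
    }
  }

  /** Writes a vector to bytes and reads it back, as job code would at its I/O boundary. */
  static Vec roundTrip(Vec input) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      new VecWritable(input).write(new DataOutputStream(bytes));
      VecWritable read = new VecWritable();
      read.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return read.get();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    RunningSum model = new RunningSum();
    // The caller unwraps once; the model never sees the writable type.
    model.observe(roundTrip(new Vec(1.0, 2.0, 3.0)));
    System.out.println(model.count() + " observation(s), total " + model.total());
  }
}
```

In this arrangement only the job code that reads and writes records would need the writable module on its classpath; the model class compiles against the math type alone.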
                

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Jake Mannix (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172729#comment-13172729 ] 

Jake Mannix commented on MAHOUT-874:
------------------------------------

Yes, the primary problem is that of jar-hell, and transitive dependencies.  Mahout-math really needs very little of what it depends on (other than guava) - both commons-math and uncommons-math are only used in a few places, and can be <exclude>'ed from ivy/maven imports for most apps.  Once you go to mahout-core, the list of dependencies grows pretty huge, and keeping track of how long your exclude list is can be unwieldy.

So it's not the size, per se, but the stuff that gets pulled in.  Any maven artifact which can be included with just a few <exclude>hadoop</exclude> bits and yet still only bring in just a few things would make it much easier to convince other teams to pull this in.
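
For reference, Maven spells this as an <exclusions> block on the dependency rather than a bare <exclude> element. A sketch of what a downstream pom might declare, assuming the mahout-core artifact at version 0.6-SNAPSHOT and that hadoop-core is the only transitive dependency being dropped (both are illustrative assumptions, not taken from an actual consumer):

```xml
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <!-- version assumed to match the mahout-math snapshot mentioned on this issue -->
  <version>0.6-SNAPSHOT</version>
  <exclusions>
    <!-- each unwanted transitive dependency needs its own <exclusion> entry -->
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Every additional unwanted dependency adds another entry to this list in each consuming project, which is the maintenance cost being described.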
                

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Jake Mannix (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172818#comment-13172818 ] 

Jake Mannix commented on MAHOUT-874:
------------------------------------

I'm not proposing that mahout-math depend on mahout-core.   Where did I say that?  mahout-core depends on mahout-math depends on mahout-collections.  I'm suggesting we have mahout-core depend on both mahout-writables and mahout-math which depend on mahout-collections.

So in theory, yes, putting in a bunch of <exclude>s for every dep in core that isn't used can work.  But it is ugly, and the writable package, if it existed, could be depended on by other open source projects which wanted to be wire compatible with us.  Case in point: elephant-bird is one of Twitter's open source hadoop utils projects.  It doesn't want to depend on all of mahout, but would like to be able to load mahout VectorWritables etc., and then turn those into, say, a pig script.
                

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172736#comment-13172736 ] 

Sean Owen commented on MAHOUT-874:
----------------------------------

Separating out a few classes won't change what they depend on, and won't cause you to need any more or fewer classes at runtime. Your jar hell is the same.
Is the issue Maven packaging all the transitive dependencies? If that's your issue then again, a run through Proguard (with properly configured entry points) will strip out not just the Mahout code you don't use but anything else you don't use. I think that is maybe the better solution to the particular issue you face? These things otherwise seem pretty "core" and live where they should live for the general user.
                

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172829#comment-13172829 ] 

Sean Owen commented on MAHOUT-874:
----------------------------------

That's not what I meant -- you were drawing a comparison to mahout-math vs mahout-core. I was saying it didn't seem like quite the same thing, since as I understand the change, the new module still depends on core. Do I misunderstand, since if true, this really wouldn't change anything? I thought Ted was pointing out that to actually make headway, and cut the pointer to core, there is additional code surgery needed around Cluster.

I guess I am still missing what's wrong with "depending on all of mahout-core". Have you seen the tree that Hadoop brings in -- has it ever mattered?
I know I am asking a dumb question, but I am still not clear: is it the size of a jarred up file of all transitive dependencies that is at issue? But forget the question of whether it matters; it doesn't matter to me but wouldn't mean I would object to such a change if even a few people wanted it.

My real question is just whether this is solving the problem it's supposed to solve. If the question is one of run-time dependencies, this change will not make any difference, so I would not see a reason to make it. If it's a question of Maven/compile-time dependency, then as I understand this still doesn't solve something due to a lingering dependence on core via cluster. (I may misunderstand.) In which case I would merely say there needs to be ground-work done, that hasn't been done, and that's what should be posted as a patch and discussed next!
                

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144474#comment-13144474 ] 

Ted Dunning commented on MAHOUT-874:
------------------------------------

With a quick slash, the dependencies are down to this
{code}
org.apache.mahout:mahout-math:0.6-SNAPSHOT
org.apache.hadoop:hadoop-core:0.20.204.0
commons-cli:commons-cli:1.2
commons-httpclient:commons-httpclient:3.0.1
commons-codec:commons-codec:1.4
commons-configuration:commons-configuration:1.6
org.codehaus.jackson:jackson-core-asl:1.8.2
org.codehaus.jackson:jackson-mapper-asl:1.8.2
org.slf4j:slf4j-api:1.6.1
org.slf4j:slf4j-jcl:1.6.1
junit:junit:4.8.2
{code}

The number of classes I had to move was a bit surprising.  This will result in some ugliness in coding because different pieces of packages will need to be in different places.

The complete list of classes in the writable jar is this:

{code}
src/main/java/org/apache/mahout/cf/taste/hadoop/EntityCountWritable.java
src/main/java/org/apache/mahout/cf/taste/hadoop/EntityEntityWritable.java
src/main/java/org/apache/mahout/cf/taste/hadoop/EntityPrefWritable.java
src/main/java/org/apache/mahout/cf/taste/hadoop/EntityPrefWritableArrayWritable.java
src/main/java/org/apache/mahout/cf/taste/hadoop/item/PrefAndSimilarityColumnWritable.java
src/main/java/org/apache/mahout/cf/taste/hadoop/item/VectorAndPrefsWritable.java
src/main/java/org/apache/mahout/cf/taste/hadoop/item/VectorOrPrefWritable.java
src/main/java/org/apache/mahout/cf/taste/hadoop/RecommendedItemsWritable.java
src/main/java/org/apache/mahout/cf/taste/impl/recommender/GenericRecommendedItem.java
src/main/java/org/apache/mahout/cf/taste/recommender/RecommendedItem.java
src/main/java/org/apache/mahout/classifier/sgd/PolymorphicWritable.java
src/main/java/org/apache/mahout/clustering/AbstractCluster.java
src/main/java/org/apache/mahout/clustering/Cluster.java
src/main/java/org/apache/mahout/clustering/ClusterObservations.java
src/main/java/org/apache/mahout/clustering/Model.java
src/main/java/org/apache/mahout/clustering/spectral/common/IntDoublePairWritable.java
src/main/java/org/apache/mahout/clustering/spectral/common/VertexWritable.java
src/main/java/org/apache/mahout/clustering/WeightedPropertyVectorWritable.java
src/main/java/org/apache/mahout/clustering/WeightedVectorWritable.java
src/main/java/org/apache/mahout/common/ClassUtils.java
src/main/java/org/apache/mahout/common/IntPairWritable.java
src/main/java/org/apache/mahout/common/parameters/Parameter.java
src/main/java/org/apache/mahout/common/parameters/Parametered.java
src/main/java/org/apache/mahout/graph/linkanalysis/VectorElementWritable.java
src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/DenseBlockWritable.java
src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SparseRowBlockWritable.java
src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SplitPartitionedWritable.java
src/main/java/org/apache/mahout/math/MatrixWritable.java
src/main/java/org/apache/mahout/math/MultiLabelVectorWritable.java
src/main/java/org/apache/mahout/math/Varint.java
src/main/java/org/apache/mahout/math/VarIntWritable.java
src/main/java/org/apache/mahout/math/VarLongWritable.java
src/main/java/org/apache/mahout/math/VectorWritable.java
{code}

The disappointment is really with the Cluster class.  It had to move and that pulled a bunch of other things across.

What is the sense about this?

                

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Jake Mannix (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172477#comment-13172477 ] 

Jake Mannix commented on MAHOUT-874:
------------------------------------

Hey Ted,

  Is there a way we can revive this / get this in shape?  This issue is blocking getting Mahout integrated with some projects we have (that don't want all of mahout-core's baggage, but want writables + math).  Is this patch out of date?
                

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172489#comment-13172489 ] 

Sean Owen commented on MAHOUT-874:
----------------------------------

Is this purely an issue of the size of your resulting jar? You can do this more effectively with a one liner with proguard in your build. I imagine it's convoluted to pull out the Writables and is going to make everyone else need two jars where there was one.
                

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172792#comment-13172792 ] 

Sean Owen commented on MAHOUT-874:
----------------------------------

What all classes in core depend on doesn't matter, if you are only using the Writables. Then, it only matters what the Writable classes depend on; unused classes are never loaded and have no effect. But then, it means depending on a jar of just the Writables doesn't change what you need at runtime, so what does this help for your use case? I assume it's not a runtime issue then.

It's something to do with Maven output? But what real problem is that causing... the availability of dependencies doesn't harm anything. It makes the job file bigger. But are you deploying in a case where a couple megs in a jar file matters? The only case I've seen where it matters is mobile apps, these days, and you say you don't want to use Proguard. Ted's indicating it doesn't save much.

Why is <exclude> so bad? This seems like what it's for. core has 12 third-party dependencies, and won't move much. That's not so bad even if you wanted to exclude each one. You could create your own (internal) artifact that is just "core, stripping the dependencies we don't want" that everyone can depend on.

mahout-math doesn't depend on mahout-core. I think you're proposing a circular dependency here (?). Which is possible. But that is symptomatic of the difference. I suppose you can start looking at severing more dependencies and breaking out even more sub-modules; now users need to figure out which of 3, 4, 5 jars are needed.

I don't doubt this solves your problem, just asking whether it solves a more general need, since it is going to create small additional work for all other consumers of core.

Or: don't we actually need some code surgery around Cluster to actually accomplish what you want anyway? Or else it ends up depending on core anyway.
                

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145119#comment-13145119 ] 

Ted Dunning commented on MAHOUT-874:
------------------------------------

Dependency on *Writable isn't a problem.

The problem is that Clusters are writable or that certain writables depend on them.

See Model.
                

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172913#comment-13172913 ] 

Ted Dunning commented on MAHOUT-874:
------------------------------------

Let me see what might be done.  Currently, I have the lazy way of doing
things and build the jar-with-dependencies without an explicit assembly.
 As it is, therefore, there is less expressivity available than there might
be with respect to things like excludes, but I don't know the assembly
plugin all that well.

I will see what is easy to do.

                

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Jake Mannix (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172915#comment-13172915 ] 

Jake Mannix commented on MAHOUT-874:
------------------------------------

Yes, I was meaning that mahout-writables would *not* depend on mahout-core.  If that requires further headway around Cluster (or just leaving the ClusterWritables back in core, and pulling them out later when the surgery is complete), then so be it.

The dependency on hadoop is huge, yes, but if we're running on hadoop (which would be the case if you have mahout-writable as the package in question), then you already depend on that, that's a given.

Jar size in MB is not what matters here. What matters is runtime dependencies, and I guess we're misunderstanding each other because I'm not pushing on the original git branch Ted made, but instead on the *end goal* of what would happen once Cluster was removed.  In my view, the work that should be posted as a patch next is what you get if you pull out all of the *easy* *Writables (i.e. everything except Cluster, I guess?) as a first pass, leaving Cluster back in core.

I would personally consider that a positive first step: a) it creates a place for Writables to go, moving forward, and b) it provides a dependency which knows how to deal with many of the common serialized objects of Mahout.  Step 2 would be further iterations on getting all remaining Writables out of core and into this new package.

I don't think Steps 1 and 2 need to be done at the same time, however.
                

        

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Lance Norskog (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144616#comment-13144616 ] 

Lance Norskog commented on MAHOUT-874:
--------------------------------------

If you're going to unify clustering and classification, then some cluster-only classes will disappear, right? Perhaps only core writables should be pushed out?

This is a poster child for using a few weakly typed data structures instead of many strongly typed structures. A cluster is a graph, so use graph-oriented structures instead of custom ones.
                

        

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144904#comment-13144904 ] 

Ted Dunning commented on MAHOUT-874:
------------------------------------

Jake, that is reasonable, but I think I would attack it the other way round: move Cluster to the writables jar now, and then move it back later when possible.  The size of the jar is tiny and Cluster doesn't affect the dependencies.


                

        

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172522#comment-13172522 ] 

Ted Dunning commented on MAHOUT-874:
------------------------------------

I am sure that the patch is out of date and my git repo is a much easier
place to get a coherent change.

My problem is that this drops the Mahout jar size to a few tens of KB, but it
doesn't get rid of the dependencies, which bloat the package back to about
10MB.  See this, for instance:

$ pwd
/Users/tdunning/Apache/mahout/writables
$ du -sh target/*.jar
9.8M target/mahout-writables-0.6-SNAPSHOT-jar-with-dependencies.jar
 48K target/mahout-writables-0.6-SNAPSHOT-sources.jar
 60K target/mahout-writables-0.6-SNAPSHOT.jar


Would this even make a difference to you?

On Mon, Dec 19, 2011 at 10:37 AM, Jake Mannix (Commented) (JIRA) <


                

        

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173069#comment-13173069 ] 

Sean Owen commented on MAHOUT-874:
----------------------------------

Ah, there's a 'provided' scope? That would be great since no use case we support involves "bringing our own" Hadoop. It's needed to compile only. That would probably reduce the job jar size too, and probably avoid some problems.

So just that one change to core makes it that much smaller? Surely 3.8MB and a few transitive dependencies is pretty OK to bring in to anything?
                

        

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Jake Mannix (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172754#comment-13172754 ] 

Jake Mannix commented on MAHOUT-874:
------------------------------------

The *Writables depend on very little other than core Mahout classes (internal) and Hadoop.  The runtime jar hell is kept to a minimum.  Compare the list of dependencies the *Writables would pull in with what the mahout-core package pulls in.

Just as mahout-math is totally "core" to what we do, it's still really nice that it's in its own jar with very minimal external dependencies.  mahout-writables *could* be equally slim and non-dependent, and allow for a jar which lets people read/write wire-compatible data with us without depending on everything that mahout-core pulls in.
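
To illustrate what reading and writing wire-compatible data involves, here is a minimal sketch that uses only java.io. The write(DataOutput) / readFields(DataInput) pair mirrors the contract of Hadoop's Writable interface, but the DenseVectorRecord class and its byte layout are hypothetical, invented for illustration; they are not Mahout's actual VectorWritable format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class WritableSketch {
  /** Hypothetical record; the layout is an int length followed by that many doubles. */
  static final class DenseVectorRecord {
    double[] values = new double[0];

    // Same method shape as Hadoop's Writable.write(DataOutput).
    void write(DataOutput out) throws IOException {
      out.writeInt(values.length);
      for (double v : values) {
        out.writeDouble(v);
      }
    }

    // Same method shape as Hadoop's Writable.readFields(DataInput).
    void readFields(DataInput in) throws IOException {
      int n = in.readInt();
      values = new double[n];
      for (int i = 0; i < n; i++) {
        values[i] = in.readDouble();
      }
    }
  }

  static byte[] serialize(DenseVectorRecord record) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      record.write(out);
    }
    return bytes.toByteArray();
  }

  static DenseVectorRecord deserialize(byte[] data) throws IOException {
    DenseVectorRecord record = new DenseVectorRecord();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      record.readFields(in);
    }
    return record;
  }

  public static void main(String[] args) throws IOException {
    DenseVectorRecord original = new DenseVectorRecord();
    original.values = new double[] {1.5, -2.0, 3.25};
    byte[] wire = serialize(original);
    DenseVectorRecord copy = deserialize(wire);
    System.out.println(wire.length + " bytes, round trip ok: "
        + Arrays.equals(original.values, copy.values));
  }
}
```

The point of the sketch is that a reader only needs the class that defines the byte layout plus the DataInput/DataOutput plumbing; nothing else from core has to be on the classpath for the round trip to work.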

Re: ProGuard: I don't think I have much say in changing the way our build system works.  We use Ivy, and I can depend on stuff from Maven repos and put in exclude statements, but that's about it.  This is very similar to other places I've worked, as this is a pretty common issue.
                

        

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144980#comment-13144980 ] 

Grant Ingersoll commented on MAHOUT-874:
----------------------------------------

Why is Cluster even dependent on VectorWritable?  Shouldn't it just be dependent on Vector?  Seems to me that VectorWritable should only ever be instantiated inside of a Map/Reduce job.  All the core stuff should just take Vector.

Stuff like:
{code}
@Override
public void observe(VectorWritable x) {
  observe(x.get());
}
{code}

just seems silly.  We already have observe(Vector).
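
A minimal sketch of the separation described here, under stated assumptions: the Vector and VectorWritable classes below are simplified stand-ins written for this example, not Mahout's real classes, and RunningMean is a hypothetical model. The model accepts only Vector, and the wrapper is unwrapped once at the I/O boundary.

```java
import java.util.Arrays;
import java.util.List;

public class ObserveSketch {
  /** Stand-in for a math vector; not Mahout's Vector. */
  static final class Vector {
    final double[] values;
    Vector(double... values) { this.values = values; }
  }

  /** Stand-in for a serialization wrapper; not Mahout's VectorWritable. */
  static final class VectorWritable {
    private final Vector vector;
    VectorWritable(Vector vector) { this.vector = vector; }
    Vector get() { return vector; }
  }

  /** Hypothetical model that knows nothing about the wrapper type. */
  static final class RunningMean {
    private double[] sum;
    private int count;

    void observe(Vector x) {
      if (sum == null) {
        sum = new double[x.values.length];
      }
      for (int i = 0; i < sum.length; i++) {
        sum[i] += x.values[i];
      }
      count++;
    }

    double[] mean() {
      if (sum == null) {
        return new double[0]; // nothing observed yet
      }
      double[] result = new double[sum.length];
      for (int i = 0; i < sum.length; i++) {
        result[i] = sum[i] / count;
      }
      return result;
    }
  }

  /** Boundary code (for example a mapper) unwraps once, so the model only sees Vector. */
  static double[] meanOf(List<VectorWritable> records) {
    RunningMean model = new RunningMean();
    for (VectorWritable record : records) {
      model.observe(record.get());
    }
    return model.mean();
  }

  public static void main(String[] args) {
    List<VectorWritable> records = Arrays.asList(
        new VectorWritable(new Vector(1.0, 2.0)),
        new VectorWritable(new Vector(3.0, 4.0)));
    System.out.println(Arrays.toString(meanOf(records)));
  }
}
```

With this shape, the model class has no compile-time reference to the wrapper, so it could stay in core while the wrapper moves to a separate module.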
                

        

[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172524#comment-13172524 ] 

Ted Dunning commented on MAHOUT-874:
------------------------------------

Hmm....

Looking at this again, the biggest dependency is Hadoop.  Presumably, that
will be available in your cluster.  :-)



                