Posted to dev@mahout.apache.org by Dmitriy Lyubimov <dl...@gmail.com> on 2015/02/03 01:54:09 UTC

Re: Codebase refactoring proposal

Bottom line: compile-time dependencies are satisfied with no extra stuff
from mr-legacy or its transitives. This is proven by virtue of successful
compilation with no dependency on mr-legacy in the tree.

Runtime sufficiency without extra dependencies is proven by running the
shell or embedded tests (unit tests), which are successful too. This covers
the embedding and shell APIs.

The issue with Guava is a typical one. If it were a real problem, I wouldn't
be able to compile and/or run this stuff. Now, the question is what we do if
drivers want extra stuff that is not found in Spark.

It is so nice not to depend on anything extra that I am hesitant to offer
anything here. Either shading or a lib directory with an opt-in dependency
policy would suffice, though, since it doesn't look like we'd have to have
tons of extras for drivers.



On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> I vaguely remember there being a Guava version problem where the version
> had to be rolled back in one of the hadoop modules. The math-scala
> IndexedDataset shouldn’t care about version.
>
> BTW it seems pretty easy to take out the option parser and replace it with
> match and tuples, especially if we can extend the Scala App class. It might
> actually simplify things, since I can then use several case classes to hold
> options (scopt needed one object), which in turn takes out all those ugly
> casts. I’ll take a look next time I’m in there.
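
For illustration, a rough sketch of what that could look like: a case class for the options plus a recursive match over the argument list (all names below are made up for the example, they are not existing Mahout classes).

// Hypothetical sketch: parse driver args with pattern matching instead of scopt.
case class DriverOptions(
    input: String = "",
    output: String = "",
    master: String = "local")

object ExampleDriver extends App {

  // Fold the argument list into the options case class, one flag at a time.
  def parse(rest: List[String], opts: DriverOptions): DriverOptions = rest match {
    case "--input" :: value :: tail  => parse(tail, opts.copy(input = value))
    case "--output" :: value :: tail => parse(tail, opts.copy(output = value))
    case "--master" :: value :: tail => parse(tail, opts.copy(master = value))
    case Nil                         => opts
    case unknown :: _                => sys.error("Unrecognized option: " + unknown)
  }

  val opts = parse(args.toList, DriverOptions())
  require(opts.input.nonEmpty && opts.output.nonEmpty, "--input and --output are required")
  // ... run the job with opts ...
}

Several case classes (one per driver) would replace the single scopt options object, and the ugly casts go away because each field is already typed.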
>
> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> In the 'spark' module it is overridden by the spark dependency, which, as it
> happens, comes in at the same version, so it should be fine with 1.1.x.
>
> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-spark_2.10 ---
> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
> [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> [INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
> [INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
> [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
> [INFO] |  |  |  +- commons-logging:commons-logging:jar:1.1.3:compile
> [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
> [INFO] |  |  |  +- commons-configuration:commons-configuration:jar:1.6:compile
> [INFO] |  |  |  |  +- commons-collections:commons-collections:jar:3.2.1:compile
> [INFO] |  |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
> [INFO] |  |  |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
> [INFO] |  |  |  |  \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
> [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
> [INFO] |  |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
> [INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
> [INFO] |  |  |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
> [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
> [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
> [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
> [INFO] |  |  |  +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
> [INFO] |  |  |  |  +- org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
> [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
> [INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
> [INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
> [INFO] |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
> [INFO] |  |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
> [INFO] |  |  |  |  |  |  |  +- javax.servlet:javax.servlet-api:jar:3.0.1:compile
> [INFO] |  |  |  |  |  |  |  \- com.sun.jersey:jersey-client:jar:1.9:compile
> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile
> [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
> [INFO] |  |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
> [INFO] |  |  |  |  |  |     |     \- org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
> [INFO] |  |  |  |  |  |     |        \- org.glassfish.external:management-api:jar:3.0.0-b012:compile
> [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
> [INFO] |  |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
> [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
> [INFO] |  |  |  |  |  |     \- org.glassfish:javax.servlet:jar:3.1:compile
> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-core:jar:1.9:compile
> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
> [INFO] |  |  |  |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
> [INFO] |  |  |  |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
> [INFO] |  |  |  |  |  |  |  \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
> [INFO] |  |  |  |  |  |  |     \- javax.activation:activation:jar:1.1:compile
> [INFO] |  |  |  |  |  |  +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
> [INFO] |  |  |  |  |  |  \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
> [INFO] |  |  |  |  |  \- com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
> [INFO] |  |  |  |  \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
> [INFO] |  |  |  \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
> [INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
> [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
> [INFO] |  |  |  \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
> [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
> [INFO] |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
> [INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
> [INFO] |  |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
> [INFO] |  |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
> [INFO] |  |     \- jline:jline:jar:0.9.94:compile
> [INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
> [INFO] |  |  +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
> [INFO] |  |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
> [INFO] |  |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
> [INFO] |  |  |  \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
> [INFO] |  |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
> [INFO] |  |     \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
> [INFO] |  |        \- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
> [INFO] |  +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
> [INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
> [INFO] |  +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
> [INFO] |  |  +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
> [INFO] |  |  +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
> [INFO] |  |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
> [INFO] |  |     \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
> [INFO] |  +- com.google.guava:guava:jar:16.0:compile
>
> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > Looks like it is also requested by mahout-math; wonder what is using it
> > there.
> >
> > At the very least, it needs to be synchronized with the one currently used
> > by spark.
> >
> > [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop ---
> > [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
> > *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
> > [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
> > *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
> > [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
> > [INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
> > [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> > [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> >
> >
> > On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> >
> >> Looks like Guava is in Spark.
> >>
> >> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> >>
> >> IndexedDataset uses Guava. Can’t tell for sure, but it sounds like this
> >> would not be included since I think it was taken from the mrlegacy jar.
> >>
> >> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >>
> >> ---------- Forwarded message ----------
> >> From: "Pat Ferrel" <pa...@occamsmachete.com>
> >> Date: Jan 25, 2015 9:39 AM
> >> Subject: Re: Codebase refactoring proposal
> >> To: <de...@mahout.apache.org>
> >> Cc:
> >>
> >>> When you get a chance a PR would be good.
> >>
> >> Yes, it would. And not just for that.
> >>
> >>> As I understand it you are putting some class jars somewhere in the
> >> classpath. Where? How?
> >>>
> >>
> >> /bin/mahout
> >>
> >> (Computes 2 different classpaths. See  'bin/mahout classpath' vs.
> >> 'bin/mahout -spark'.)
> >>
> >> If I interpret the current shell code there correctly, the legacy path
> >> tries to use the examples assemblies if not packaged, or /lib if
> >> packaged. The true motivation of that significantly predates 2010 and I
> >> suspect only Benson knows the whole true intent there.
> >>
> >> The spark path, which is really a quick hack of the script, tries to get
> >> only selected mahout jars and the locally installed spark classpath,
> >> which I guess is just the shaded spark jar in recent spark releases. It
> >> also apparently tries to include /libs/*, which is never compiled in the
> >> unpackaged version, and now I think it is a bug that it is included,
> >> because /libs/* is apparently legacy packaging and shouldn't be used in
> >> spark jobs with a wildcard. I can't believe how lazy I am; I still did
> >> not find time to understand the mahout build in all cases.
> >>
> >> I am not even sure if packaged mahout will work with spark, honestly,
> >> because of the /lib. Never tried that, since i mostly use application
> >> embedding techniques.
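
(For context, the embedding use referred to here is roughly the following. A minimal sketch, assuming only the Spark classpath plus the mahout-spark, mahout-math(-scala) and mahout-hadoop jars, and using the Spark bindings entry points as I recall them, so treat exact signatures as approximate.)

// Minimal embedding sketch: R-like distributed algebra over Spark, no CLI or assembly.
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

object EmbeddedAlgebraExample extends App {

  // Distributed context backed by Spark; mahout jars are shipped from the classpath.
  implicit val ctx = mahoutSparkContext(masterUrl = "local[2]", appName = "embedded-algebra")

  // A small in-core matrix, parallelized into a DRM.
  val drmA = drmParallelize(dense((1, 2, 3), (3, 4, 5), (5, 6, 7)), numPartitions = 2)

  // Distributed A'A, collected back in core.
  val inCoreAtA = (drmA.t %*% drmA).collect
  println(inCoreAtA)

  ctx.close()
}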
> >>
> >> The same solution may apply to adding external dependencies and removing
> >> the assembly in the Spark module. Which would leave only one major build
> >> issue afaik.
> >>>
> >>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <dl...@gmail.com>
> >> wrote:
> >>>
> >>> No, no PR. Only an experiment in private. But I believe I sufficiently
> >>> defined what I want to do in order to gauge if we may want to advance it
> >>> some time later. The goal is a much lighter dependency footprint for the
> >>> spark code: eliminate everything that is not a compile-time dependency
> >>> (and a lot of it comes in through legacy MR code which we of course
> >>> don't use).
> >>>
> >>> Can't say I understand the remaining issues you are talking about,
> >>> though.
> >>>
> >>> If you are talking about compiling lib or a shaded assembly, no, this
> >>> doesn't do anything about it. Although the point is, as it stands, the
> >>> algebra and shell don't have any external dependencies but spark and
> >>> these 4 (5?) mahout jars, so they technically don't even need an
> >>> assembly (as demonstrated).
> >>>
> >>> As i said, it seems driver code is the only one that may need some
> >> external
> >>> dependencies, but that's a different scenario from those i am talking
> >>> about. But i am relatively happy with having the first two working
> >> nicely
> >>> at this point.
> >>>
> >>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <pa...@occamsmachete.com>
> >> wrote:
> >>>
> >>>> +1
> >>>>
> >>>> Is there a PR? You mention a "tiny mahout-hadoop” module. It would be
> >> nice
> >>>> to see how you’ve structured that in case we can use the same model to
> >>>> solve the two remaining refactoring issues.
> >>>> 1) external dependencies in the spark module
> >>>> 2) no spark or h2o in the release artifacts.
> >>>>
> >>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn <sq...@gatech.edu> wrote:
> >>>>
> >>>> Also +1
> >>>>
> >>>> iPhone'd
> >>>>
> >>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo <ap...@outlook.com>
> wrote:
> >>>>>
> >>>>> +1
> >>>>>
> >>>>>
> >>>>> Sent from my Verizon Wireless 4G LTE smartphone
> >>>>>
> >>>>> -------- Original message --------
> >>>>> From: Dmitriy Lyubimov <dl...@gmail.com>
> >>>>> Date: 01/23/2015 6:06 PM (GMT-05:00)
> >>>>> To: dev@mahout.apache.org
> >>>>> Subject: Codebase refactoring proposal
> >>>>> So right now mahout-spark depends on mr-legacy.
> >>>>> I did a quick refactoring and it turns out it only _irrevocably_
> >>>>> depends on the following classes there:
> >>>>>
> >>>>> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable,
> and
> >>>> ...
> >>>>> *sigh* o.a.m.common.Pair
> >>>>>
> >>>>> So I just dropped those five classes into a new tiny mahout-hadoop
> >>>>> module (to signify stuff that is directly relevant to serializing
> >>>>> things to the DFS API) and completely removed mrlegacy and its
> >>>>> transitives from the spark and spark-shell dependencies.
> >>>>>
> >>>>> So non-cli applications (shell scripts and embedded api use) actually
> >>>> only
> >>>>> need spark dependencies (which come from SPARK_HOME classpath, of
> >> course)
> >>>>> and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
> >>>>> optionally mahout-spark-shell (for running shell)).
> >>>>>
> >>>>> This of course still doesn't address drivers that want to throw more
> >>>>> stuff onto the front-end classpath (such as a cli parser), but at
> >>>>> least it renders the transitive luggage of mr-legacy (and the size of
> >>>>> worker-shipped jars) much more tolerable.
> >>>>>
> >>>>> How does that sound?
> >>>>
> >>>>
> >>>
> >>
> >>
> >>
> >
>
>

Re: Codebase refactoring proposal

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
But also keep in mind that Flink folks are eager to allocate resources for
ML work. So maybe that's the way to work it -- create a DataFrame-based
seq2sparse port and then just hand it off to them to add to either Flink
directly (but with DRM output), or as a part of Mahout.

On Wed, Feb 4, 2015 at 2:07 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Spark's DataFrame is obviously not agnostic.
>
> I don't believe there's a good way to abstract it. Unfortunately. I think
> getting too much into distributed operation abstraction is a bit dangerous.
>
> I think MLI was one project that attempted to do that -- but it did not
> take off i guess. or at least there were 0 commits in like 18 months there
> if i am not mistaken, and it never made it into spark tree.
>
> So it is a good question: if we need a dataframe in flink, what do we do?
> I am open to suggestions. I very much don't want to do "yet another
> abstract language-integrated Spark SQL" feature.
>
> Given resources, IMO it'd be better to take on fewer goals but make them
> shine. So i'd do spark-based seq2sparse version first and that'd give some
> ideas how to create ports/abstractions of that work to Flink.
>
>
>
> On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>
>>
>> On 02/04/2015 03:37 PM, Dmitriy Lyubimov wrote:
>>
>>> Re: Gokhan's PR post: here are my thoughts but i did not want to post it
>>> there since they are going beyond the scope of that PR's work to chase
>>> the
>>> root of the issue.
>>>
>>> on quasi-algebraic methods
>>> ========================
>>>
>>> What is the dilemma here? don't see any.
>>>
>>> I already explained that no more than 25% of algorithms are truly 100%
>>> algebraic. But about 80% cannot avoid using some algebra and close to 95%
>>> could benefit from using algebra (even stochastic and monte carlo stuff).
>>>
>>> So we are building system that allows us to cut developer's work by at
>>> least 60% and make his work also more readable by 3000%. As far as I am
>>> concerned, that fulfills the goal. And I am perfectly happy writing a mix
>>> of engine-specific primitives and algebra.
>>>
>>> That's why i am a bit skeptical about attempts to abstract non-algebraic
>>> primitives such as row-wise aggregators in one of the pull requests.
>>> Engine-specific primitives and algebra can perfectly co-exist in the
>>> guts.
> >>> And that's how i am doing my stuff in practice, except i now can skip 80%
> >>> of the effort on algebra and bridging incompatible inputs and outputs.
>>>
>> I am **definitely** not advocating messing with the algebraic optimizer.
> >> That was what I saw as the plus side to Gokhan's PR: a separate engine
> >> abstraction for quasi/non-algebraic distributed methods. I didn't comment
>> on the PR either because admittedly I did not have a chance to spend a lot
>> of time on it.  But my quick takeaway was  that we could take some very
>> useful and hopefully (close to) ubiquitous distributed operators and pass
>> them through to the engine "guts".
>>
>> I briefly looked through some of the flink and h2o code and noticed
>> Flink's aggregateOperator [1]
>> and h2o's MapReduce API and [2] my thought was that we could write pass
>> through operators for some of the more useful operations from math-scala
> >> and then implement them fully in their respective packages. Though I am
> >> not sure how this would work in either case w.r.t. partitioning, e.g. on
> >> h2o's distributed DataFrame, or flink for that matter. Again, I haven't had
> >> a lot of time to look at these and see if this would work at all.
>>
> >> My thought was not to bring primitive engine-specific aggregators,
> >> combiners, etc. into math-scala.
>>
> >> I had thought though that we were trying to develop a fully
> >> engine-agnostic algorithm library on top of the R-like distributed BLAS.
>>
>>
> >> So would the idea be to implement, e.g., seq2sparse fully in the spark
> >> module?  It would seem to fracture the project a bit.
>>
>>
>> Or to implement algorithms sequentially if mapBlock() will not suffice
>> and then optimize them in their respective modules?
>>
>>
>>
>>
> >>> None of that means that R-like algebra cannot be engine agnostic. So
> >>> people are unhappy about not being able to write the whole thing in a
> >>> totally agnostic way?
>>> And so they (falsely) infer the pieces of their work cannot be helped by
>>> agnosticism individually, or the tools are not being as good as they
>>> might
>>> be without backend agnosticism? Sorry, but I fail to see the logic there.
>>>
>>> We proved algebra can be agnostic. I don't think this notion should be
>>> disputed.
>>>
> >>> And even if there were a shred of real benefit in making algebra tools
> >>> un-agnostic, it would not ever outweigh the tons of good we could get for
> >>> the project by integrating with e.g. the Flink folks. This is one of the
> >>> points MLLib will never be able to overcome -- being a truly shared ML
> >>> platform where people could create and share ML, not just a bunch of
> >>> ad-hoc spaghetti of distributed api calls and Spark-nailed black boxes.
>>>
> >>> Well yes, methodology implementations will still have native distributed
> >>> calls. Just not nearly as many as they otherwise would, and they will be
> >>> much easier to support on another back-end using Strategy patterns. E.g.
>>> implicit feedback problem that i originally wrote as quasi-method for
>>> Spark
>>> only, would've taken just an hour or so to add strategy for flink, since
>>> it
>>> retains all in-core and distributed algebra work as is.
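
A rough sketch of that Strategy idea, purely illustrative (no such trait exists in the codebase today; all names are invented):

// Hypothetical per-engine strategy registration for the non-algebraic part of a method.
import org.apache.mahout.math.drm.{DistributedContext, DrmLike}

// Only the engine-specific, non-algebraic piece hides behind the strategy;
// the algebra around it stays engine-agnostic.
trait ImplicitFeedbackStrategy {
  def preprocess(drmRatings: DrmLike[Int]): DrmLike[Int]
}

object ImplicitFeedbackStrategy {
  private var registry = Map.empty[String, ImplicitFeedbackStrategy]

  // Each backend registers its strategy keyed by its context class name.
  def register(engineContextClass: String, s: ImplicitFeedbackStrategy): Unit =
    registry += engineContextClass -> s

  def forContext(ctx: DistributedContext): ImplicitFeedbackStrategy =
    registry.getOrElse(ctx.getClass.getName,
      sys.error("No strategy registered for " + ctx.getClass.getName))
}

Adding a Flink strategy then means implementing one trait and registering it, while the shared algebraic pipeline stays untouched.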
>>>
>>> Not to mention benefit of single type pipelining.
>>>
>>> And once we add hardware-accelerated bindings for in-core stuff, all
>>> these
>>> methods would immediately benefit from it.
>>>
>>> On MLLib interoperability issues,
>>> =========================
>>>
>>> well, let me ask you this: what it means to be MLLib-interoperable? is
>>> MLLib even interoperable within itself?
>>>
>>> E.g. i remember there was one most frequent request on the list here: how
>>> can we cluster dimensionally-reduced data?
>>>
> >>> Let's look at what it takes to do this in MLLib: first, we run tf-idf, which
>>> produces collection of vectors (and where did our document ids go? not
>>> sure); then we'd have to run svd or pca, both of which would accept
>>> RowMatrix (bummer! but we have collection of vectors); which would
>>> produce
>>> RowMatrix as well but kmeans training takes RDD of vectors (bummer
>>> again!).
>>>
>>> Not directly pluggable, although semi-trivially or trivially convertible.
>>> Plus strips off information that we potentially already have computed
>>> earlier in the pipeline, so we'd need to compute it again. I think
>>> problem
>>> is well demonstrated.
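
To make the friction concrete, the glue looks roughly like this against MLlib 1.1 (a from-memory, untested sketch; note how the document ids are already gone after the first step):

// Sketch of the tf-idf -> SVD -> kmeans glue in MLlib 1.1.
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.rdd.RDD

def clusterDocs(docs: RDD[Seq[String]], k: Int): KMeansModel = {
  val tf: RDD[Vector] = new HashingTF().transform(docs)     // collection of vectors, ids lost
  val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)  // still a bare collection of vectors
  val mat = new RowMatrix(tfidf)                            // convert for svd/pca
  val svd = mat.computeSVD(20, computeU = true)             // U comes back as a RowMatrix
  KMeans.train(svd.U.rows, k, 10)                           // convert back to RDD[Vector] for kmeans
}

Each conversion is trivial, but the row identity and any precomputed geometry are dropped on the floor at every step.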
>>>
>>> Or, say, ALS stuff (implicit als in particular) is really an algebraic
>>> problem. Should be taking input in form of matrices (that my feature
>>> extraction algebraic pipeline perhaps has just prepared) but really takes
>>> POJOs. Bummer again.
>>>
>>> So what it is exactly we should be interoperable with in this picture if
>>> MLLib itself is not consistent?
>>>
>>> Let's look at the type system in flux there:
>>>
>>> we have
>>> (1) collection of vectors,
>>> (2) matrix of known dimensions for collection of vectors (row matrix),
>>> (3) indexedRowMatrix which is matrix of known dimension with keys that
>>> can
>>> be _only_ long; and
>>> (4) unknown but not infinitesimal amount of POJO-oriented approaches.
>>>
>>> But ok, let's constrain ourselves to matrix types only.
>>>
>>> Multitude of matrix types creates problems for tasks that require
>>> consistent key propagation (like  SVD or PCA or tf-idf, well demonstrated
>>> in the case of mllib). In the aforementioned case of dimensionality
>>> reduction over document collection, there's simply no way to propagate
>>> document ids to the rows of dimensionally-reduced data. As in none at
>>> all.
>>> as in hard no-work-around-exists stop.
>>>
>>> So. There's truly no need for multiple incompatible matrix types. There
>>> has
>>> to be just single matrix type. Just flexible one. And everything
>>> algebraic
>>> needs to use it.
>>>
> >>> And if geometry is needed, then it could be either already known or
> >>> lazily computed, but if it is not needed, nobody bothers to compute it
> >>> (i.e. there is truly no need to). And this knowledge should not be lost
> >>> just because we have to convert between types.
>>>
>>> And if we want to express complex row keys such as for cluster
>>> assignments
>>> for example (my real case) then we could have a type with keys like
>>> Tuple2(rowKeyType, cluster-string).
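
A tiny sketch of that (assuming drmWrap accepts any ClassTag-able key type the way its generic signature suggests; illustrative only):

// Sketch: a DRM whose row keys carry both the original key and a cluster label.
import org.apache.mahout.math.Vector
import org.apache.mahout.math.drm.DrmLike
import org.apache.mahout.sparkbindings._
import org.apache.spark.rdd.RDD

// rows: ((originalKey, clusterLabel), rowVector); the composite key travels with the matrix.
def labeledDrm(rows: RDD[((Int, String), Vector)]): DrmLike[(Int, String)] =
  drmWrap(rows)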
>>>
> >>> And nobody really cares whether intermediate results are row- or
> >>> column-partitioned.
>>>
>>> All within single type of things.
>>>
>>> Bottom line, "interoperability" with mllib is both hard and trivial.
>>>
>>> Trivial is because whenever you need to convert, it is one line of code
>>> and
>>> also a trivial distributed map fusion element. (I do have pipelines
>>> streaming mllib methods within DRM-based pipelines, not just
>>> speculating).
>>>
>>> Hard is because there are so many types you may need/want to convert
>>> between, so there's not much point to even try to write converters for
>>> all
>>> possible cases but rather go on need-to-do basis.
>>>
> >>> It is also hard because their type system obviously continues evolving as
> >>> we speak. So there is no point chasing a rabbit that is still in the making.
>>>
>>> Epilogue
>>> =======
>>> There's no problem with the philosophy of the distributed and
>>> non-distributed algebra approach. It is incredibly useful in practice
>>> and I
>>> have proven it continuously (what is in public domain is just tip of the
>>> iceberg).
>>>
>>> Rather, there's organizational anemia in the project. Like corporate
>>> legal
>>> interests (that includes me not being able to do quick turnaround of
>>> fixes), and not having been able to tap into university resources. But i
>>> don't believe in any technical philosophy problem.
>>>
> >>> So given that aforementioned resource/logistical anemia, it will likely
> >>> take some time, during which it may seem it gets worse before it gets
> >>> better. But afaik there are multiple efforts going on behind the curtains
> >>> to break the red tape, so i'd just wait a bit.
>>>
>>>
>>
>> [1] https://github.com/apache/flink/blob/master/flink-java/
>> src/main/java/org/apache/flink/api/java/operators/AggregateOperator.java
>> [2] http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/
>> 5/docs-website/developuser/java.html
>>
>>
>>
>

Re: Codebase refactoring proposal

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Spark's DataFrame is obviously not agnostic.

I don't believe there's a good way to abstract it. Unfortunately. I think
getting too much into distributed operation abstraction is a bit dangerous.

I think MLI was one project that attempted to do that -- but it did not
take off i guess. or at least there were 0 commits in like 18 months there
if i am not mistaken, and it never made it into spark tree.

So it is a good question: if we need a dataframe in flink, what do we do? I
am open to suggestions. I very much don't want to do "yet another abstract
language-integrated Spark SQL" feature.

Given resources, IMO it'd be better to take on fewer goals but make them
shine. So i'd do spark-based seq2sparse version first and that'd give some
ideas how to create ports/abstractions of that work to Flink.



On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo <ap...@outlook.com> wrote:

>
> On 02/04/2015 03:37 PM, Dmitriy Lyubimov wrote:
>
>> Re: Gokhan's PR post: here are my thoughts but i did not want to post it
>> there since they are going beyond the scope of that PR's work to chase the
>> root of the issue.
>>
>> on quasi-algebraic methods
>> ========================
>>
>> What is the dilemma here? don't see any.
>>
>> I already explained that no more than 25% of algorithms are truly 100%
>> algebraic. But about 80% cannot avoid using some algebra and close to 95%
>> could benefit from using algebra (even stochastic and monte carlo stuff).
>>
>> So we are building system that allows us to cut developer's work by at
>> least 60% and make his work also more readable by 3000%. As far as I am
>> concerned, that fulfills the goal. And I am perfectly happy writing a mix
>> of engine-specific primitives and algebra.
>>
>> That's why i am a bit skeptical about attempts to abstract non-algebraic
>> primitives such as row-wise aggregators in one of the pull requests.
>> Engine-specific primitives and algebra can perfectly co-exist in the guts.
>> And that's how i am doing my stuff in practice, except i now can skip 80%
>> of the effort on algebra and bridging incompatible inputs and outputs.
>>
> I am **definitely** not advocating messing with the algebraic optimizer.
> That was what I saw as the plus side to Gokhan's PR: a separate engine
> abstraction for quasi/non-algebraic distributed methods. I didn't comment
> on the PR either because admittedly I did not have a chance to spend a lot
> of time on it.  But my quick takeaway was  that we could take some very
> useful and hopefully (close to) ubiquitous distributed operators and pass
> them through to the engine "guts".
>
> I briefly looked through some of the flink and h2o code and noticed
> Flink's aggregateOperator [1]
> and h2o's MapReduce API and [2] my thought was that we could write pass
> through operators for some of the more useful operations from math-scala
> and then implement them fully in their respective packages.  Though I am
> not sure how this would work in either case w.r.t. partitioning, e.g. on
> h2o's distributed DataFrame, or flink for that matter. Again, I haven't had
> a lot of time to look at these and see if this would work at all.
>
> My thought was not to bring primitive engine-specific aggregators,
> combiners, etc. into math-scala.
>
> I had thought though that we were trying to develop a fully
> engine-agnostic algorithm library on top of the R-like distributed BLAS.
>
>
> So would the idea be to implement, e.g., seq2sparse fully in the spark
> module?  It would seem to fracture the project a bit.
>
>
> Or to implement algorithms sequentially if mapBlock() will not suffice and
> then optimize them in their respective modules?
>
>
>
>
>> None of that means that R-like algebra cannot be engine agnostic. So
>> people are unhappy about not being able to write the whole thing in a
>> totally agnostic way?
>> And so they (falsely) infer the pieces of their work cannot be helped by
>> agnosticism individually, or the tools are not being as good as they might
>> be without backend agnosticism? Sorry, but I fail to see the logic there.
>>
>> We proved algebra can be agnostic. I don't think this notion should be
>> disputed.
>>
>> And even if there were a shred of real benefit in making algebra tools
>> un-agnostic, it would not ever outweigh the tons of good we could get for
>> the project by integrating with e.g. the Flink folks. This is one of the
>> points MLLib will never be able to overcome -- being a truly shared ML
>> platform where people could create and share ML, not just a bunch of
>> ad-hoc spaghetti of distributed api calls and Spark-nailed black boxes.
>>
>> Well yes, methodology implementations will still have native distributed
>> calls. Just not nearly as many as they otherwise would, and they will be
>> much easier to support on another back-end using Strategy patterns. E.g.
>> implicit feedback problem that i originally wrote as quasi-method for
>> Spark
>> only, would've taken just an hour or so to add strategy for flink, since
>> it
>> retains all in-core and distributed algebra work as is.
>>
>> Not to mention benefit of single type pipelining.
>>
>> And once we add hardware-accelerated bindings for in-core stuff, all these
>> methods would immediately benefit from it.
>>
>> On MLLib interoperability issues,
>> =========================
>>
>> well, let me ask you this: what it means to be MLLib-interoperable? is
>> MLLib even interoperable within itself?
>>
>> E.g. i remember there was one most frequent request on the list here: how
>> can we cluster dimensionally-reduced data?
>>
>> Let's look at what it takes to do this in MLLib: first, we run tf-idf, which
>> produces collection of vectors (and where did our document ids go? not
>> sure); then we'd have to run svd or pca, both of which would accept
>> RowMatrix (bummer! but we have collection of vectors); which would produce
>> RowMatrix as well but kmeans training takes RDD of vectors (bummer
>> again!).
>>
>> Not directly pluggable, although semi-trivially or trivially convertible.
>> Plus strips off information that we potentially already have computed
>> earlier in the pipeline, so we'd need to compute it again. I think problem
>> is well demonstrated.
>>
>> Or, say, ALS stuff (implicit als in particular) is really an algebraic
>> problem. Should be taking input in form of matrices (that my feature
>> extraction algebraic pipeline perhaps has just prepared) but really takes
>> POJOs. Bummer again.
>>
>> So what it is exactly we should be interoperable with in this picture if
>> MLLib itself is not consistent?
>>
>> Let's look at the type system in flux there:
>>
>> we have
>> (1) collection of vectors,
>> (2) matrix of known dimensions for collection of vectors (row matrix),
>> (3) indexedRowMatrix which is matrix of known dimension with keys that can
>> be _only_ long; and
>> (4) unknown but not infinitesimal amount of POJO-oriented approaches.
>>
>> But ok, let's constrain ourselves to matrix types only.
>>
>> Multitude of matrix types creates problems for tasks that require
>> consistent key propagation (like  SVD or PCA or tf-idf, well demonstrated
>> in the case of mllib). In the aforementioned case of dimensionality
>> reduction over document collection, there's simply no way to propagate
>> document ids to the rows of dimensionally-reduced data. As in none at all.
>> as in hard no-work-around-exists stop.
>>
>> So. There's truly no need for multiple incompatible matrix types. There
>> has
>> to be just single matrix type. Just flexible one. And everything algebraic
>> needs to use it.
>>
>> And if geometry is needed, then it could be either already known or lazily
>> computed, but if it is not needed, nobody bothers to compute it (i.e. there
>> is truly no need to). And this knowledge should not be lost just because we
>> have to convert between types.
>>
>> And if we want to express complex row keys such as for cluster assignments
>> for example (my real case) then we could have a type with keys like
>> Tuple2(rowKeyType, cluster-string).
>>
>> And nobody really cares whether intermediate results are row- or
>> column-partitioned.
>>
>> All within single type of things.
>>
>> Bottom line, "interoperability" with mllib is both hard and trivial.
>>
>> Trivial is because whenever you need to convert, it is one line of code
>> and
>> also a trivial distributed map fusion element. (I do have pipelines
>> streaming mllib methods within DRM-based pipelines, not just speculating).
>>
>> Hard is because there are so many types you may need/want to convert
>> between, so there's not much point to even try to write converters for all
>> possible cases but rather go on need-to-do basis.
>>
>> It is also hard because their type system obviously continues evolving as
>> we speak. So there is no point chasing a rabbit that is still in the making.
>>
>> Epilogue
>> =======
>> There's no problem with the philosophy of the distributed and
>> non-distributed algebra approach. It is incredibly useful in practice and
>> I
>> have proven it continuously (what is in public domain is just tip of the
>> iceberg).
>>
>> Rather, there's organizational anemia in the project. Like corporate legal
>> interests (that includes me not being able to do quick turnaround of
>> fixes), and not having been able to tap into university resources. But i
>> don't believe in any technical philosophy problem.
>>
>> So given that aforementioned resource/logistical anemia, it will likely
>> take some time, during which it may seem it gets worse before it gets
>> better. But afaik there are multiple efforts going on behind the curtains
>> to break the red tape, so i'd just wait a bit.
>>
>>
>
> [1] https://github.com/apache/flink/blob/master/flink-java/
> src/main/java/org/apache/flink/api/java/operators/AggregateOperator.java
> [2] http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/
> 5/docs-website/developuser/java.html
>
>
>

Re: Codebase refactoring proposal

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I don't know why. I said I didn't see either as a problem, as far as I am
concerned. I had encountered both needs in the past and did not even notice
it was a problem. Both are not relevant to this thread. Not sure; I'd
suggest starting a separate thread.

Speaking of my priorities, the two biggest problems I see are in-core
performance and tons of archaic dependencies, but only one belongs here.
The 3rd biggest problem is general bugs and code tidiness.
On Feb 8, 2015 8:22 PM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:

> OK, well perhaps those two lines of code (actually I agree, there’s not
> much more) can be also applied to TF-IDF and several other algorithms to
> get a much higher level of interoperability and keep us from reinventing
> things when not necessary. Funny we have type conversions for so many
> things *but* MLlib. I’ve been arguing about what an uneven state MLlib is in
> but it does solve problems we don’t need to reinvent. Frankly adopting the
> best of MLlib makes Mahout a superset along with all its other virtues.
>
> And yes, I forgot to also praise the DSL’s optimizer—now rectified.
>
> Why do we spend more time on engine-agnostic decisions than on these more
> pragmatic ones?
>
>
> On Feb 8, 2015, at 7:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> The conversion from DRM to rdd of vectors for kmeans is one line. Kmeans
> application and conversion back is another line. I actually did that some
> time ago. I am sure you  can figure the details.
>
> Whether it is worth retaining some commonality: no, it is not worth it
> until there's commonality across mllib.
>
> At which point we may just include conversions for those who are interested.
> Until then all we can do is to maintain commonality with mllib kmeans
> specifically, but not mllib as a whole.
> On Feb 8, 2015 7:45 PM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
>
> > I completely understand that MLlib lacks anything like the completeness
> of
> > Mahout's DSL, I know of no other scalable solution to match.  I don’t
> know
> > how many times this has to be said. This is something we can all get
> behind
> > as *unique* to Mahout.
> >
> > But I stand by the statement that there should also be some lower level
> > data commonality. There is too much similarity to dismiss and go
> > completely non-overlapping ways. Even if you can argue for maintaining
> > separate parallel ways, let’s have some type conversions (I hesitate to
> > say easy to use). They shouldn’t be all that hard.
> >
> > A conversion of DRM of o.a.m.Vector to rdd of MLlib Vector and back would
> > solve my Kmeans use case. You know MLlib better than I so choose the best
> > level to perform type conversions or inheritance splicing. The point is
> to
> > make the two as seamless as possible. Doesn’t this seem a worthy goal?
> >
> > On Feb 8, 2015, at 4:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> >
> > Pat,
> >
> > I *just* made a case in this thread explaining that mllib does not have a
> > single distributed matrix types and that its own methodologies do not
> > interoperate within itself for that reason. Therefore, it is
> fundamentally
> > impossible to be interoperable with mllib since nobody really can define
> > what it means in terms of distributed types.
> >
> > You are in fact referring  to their in-core type, not a distributed type.
> > But there's no linear algebra operation support to speak of there either.
> > It is, simply, not algebra, at the moment. The types in this hierarchy
> are
> > just memory storage models, and private scope converters to breeze
> storage
> > models, but they are not true linalg apis nor providers of such.
> >
> > One might conceivably want to standardize on Breeze apis since those are
> > both linalg api and providers, but not the type you've been mentioning.
> >
> > However, it is not a very happy path either. Breeze is somewhat more
> > interesting substrate to build in-core operations on, but if you read
> spark
> > forum of late, even spark developers express a whiff of dissatisfaction
> > with it in favor of BIDMat (me too btw). But while they say Bidmat would
> be
> > a better choice for in-core operatros, they also recognize the fact that
> > they are too invested into breeze api by now and such move would not be
> > cheap across the board.
> >
> > And that demonstrates another problem in the in-core mllib architecture
> > there:
> > on one side, they don't have sufficient public in-core dsl or api to
> speak
> > of; but they also do not have a sufficiently abstract api for in-core
> blas
> > plugins either to be truly agnostic of the available in-core
> methodologies.
> >
> > So what you are talking about, is simply not possible with current state
> of
> > things there. But if it were, i'd just suggest you to try to port
> algebraic
> > things you like in Mahout, to mllib.
> >
> > My guess however is that you'd find that porting algebraic optimizer with
> > proper level of consistency with in-core operations will not be easy for
> > reasons including, but not limited to, the ones i just mentioned;
> although
> > individual blas  like matrix square you've mentioned would be fairly easy
> > to do for one of the distributed matrix types in mllib. But that of
> course
> > would not be an R like environment and not an optimizer.
> >
> > I like bidmat a lot though; but it is not truly hybrid and self-adjusting
> > environment for in-core operations either (and its dsl is neither Rlike
> nor
> > matlab like, so it takes a bit of adjusting to). For that reason even
> > Bidmat linalg types and dsl are not truly versatile enough for our (well,
> > my anyway) purposes (which are to find the best hardware or software
> > subroutine automatically given current hardware and software platform
> > architecture and parameters of the requested operation).
> > On Feb 8, 2015 9:05 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
> >
> >> Why aren’t we using linalg.Vector and its siblings? The same could be
> >> asked for linalg.Matrix. If we want to prune dependencies this would
> help
> >> and would also significantly increase interoperability.
> >>
> >> Case-now: I have a real need to cluster items in a CF type input matrix.
> >> The input matrix A’ has rows of items. I need to drop this into a
> sequence
> >> file and use Mahout’s hadoop KMeans. Ugh. Or I need to convert A’ into
> an
> >> RDD of linalg.Vectors and use MLlib Kmeans. The conversion is not too
> bad
> >> and maybe could be helped with some implicit conversions mahout.Vector
> > <->
> >> linalg.Vector (maybe mahout.DRM <-> linalg.Matrix, though not needed for
> >> Kmeans).
> >>
> >> Case-possible: If we adopted linalg.Vector as the native format and
> >> perhaps even linalg.Matrix this would give immediate interoperability in
> >> some areas including my specific need. It would significantly pare down
> >> dependencies not provided by the environment (Mahout-math). It would
> also
> >> support creating distributed computation methods that would work on
> MLlib
> >> and Mahout datasets addressing Gokhan’s question.
> >>
> >> I looked at another “Case-now” possibility, which was to go all MLlib
> > with
> >> item similarity. I found that MLlib doesn’t have a transpose—“transpose,
> >> why would you want to do that?” Not even in the multiply form A’A, A’B,
> >> AA’, all used in item and row similarity. That stopped me from looking
> >> deeper.
> >>
> >> The strength and unique value of Mahout is the completeness of its
> >> generalized linear algebra DSL. But insistence on using Mahout specific
> >> data types is also a barrier for Spark people adopting the DSL. Not
> > having
> >> lower level interoperability is a barrier both ways to mixing Mahout and
> >> MLlib—creating unnecessary either/or choices for devs.
> >>
> >> On Feb 5, 2015, at 1:32 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> >>
> >> On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan <gk...@gmail.com> wrote:
> >>
> >>> What I am saying is that for certain algorithms including both
> >>> engine-specific (such as aggregation) and DSL stuff, what is the best
> > way
> >>> of handling them?
> >>>
> >>> i) should we add the distributed operations to Mahout codebase as it is
> >>> proposed in #62?
> >>>
> >>
> >> Imo this can't go very well and very far (because of the engine
> > specifics)
> >> but i'd be willing to see an experiment with simple things like map and
> >> reduce.
> >>
> >> Bigger questions are where exactly we'll have to stop (we can't abstract
> >> all capabilities out there because of "common denominator" issues), and
> >> what percentage of methods will it truly allow to migrate to full
> backend
> >> portability.
> >>
> >> And if after doing all this, we will still find ourselves writing engine
> >> specific mixes, why bother. Wouldn't it be better to find a good,
> >> easy-to-replicate, incrementally-developed pattern to register and apply
> >> engine-specific strategies for every method?
> >>
> >>
> >>>
> >>> ii) should we have [engine]-ml modules (like spark-bindings and
> >>> h2o-bindings) where we can mix the DSL and engine-specific stuff?
> >>>
> >>
> >> This is not quite what i am proposing. Rather, engine-ml modules holding
> >> engine-specific _parts_ of algorithm.
> >>
> >> However, this really needs a POC over a guinea pig (similarly to how we
> >> POC'd algebra in the first place with ssvd and spca).
> >>
> >>
> >>>
> >>>
> >>
> >>
> >
> >
>
>

Re: Codebase refactoring proposal

Posted by Pat Ferrel <pa...@occamsmachete.com>.
OK, well perhaps those two lines of code (actually I agree, there’s not much more) can be also applied to TF-IDF and several other algorithms to get a much higher level of interoperability and keep us from reinventing things when not necessary. Funny we have type conversions for so many things *but* MLlib. I’ve been arguing about what an uneven state MLlib is in but it does solve problems we don’t need to reinvent. Frankly adopting the best of MLlib makes Mahout a superset along with all its other virtues.

And yes, I forgot to also praise the DSL’s optimizer—now rectified.

Why do we spend more time on engine-agnostic decisions than on these more pragmatic ones?

 
On Feb 8, 2015, at 7:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

The conversion from DRM to rdd of vectors for kmeans is one line. Kmeans
application and conversion back is another line. I actually did that some
time ago. I am sure you  can figure the details.

Whether it is worth retaining some commonality: no, it is not worth it
until there's commonality across mllib.

At which point we may just include conversions for those who are interested.
Until then all we can do is to maintain commonality with mllib kmeans
specifically, but not mllib as a whole.
On Feb 8, 2015 7:45 PM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:

> I completely understand that MLlib lacks anything like the completeness of
> Mahout's DSL, I know of no other scalable solution to match.  I don’t know
> how many times this has to be said. This is something we can all get behind
> as *unique* to Mahout.
> 
> But I stand by the statement that there should also be some lower level
> data commonality. There is too much similarity to dismiss and go completely
> non-overlapping ways. Even if you can argue for maintaining separate
> parallel ways, let’s have some type conversions (I hesitate to say easy to
> use). They shouldn’t be all that hard.
> 
> A conversion of DRM of o.a.m.Vector to rdd of MLlib Vector and back would
> solve my Kmeans use case. You know MLlib better than I so choose the best
> level to perform type conversions or inheritance splicing. The point is to
> make the two as seamless as possible. Doesn’t this seem a worthy goal?
> 
> On Feb 8, 2015, at 4:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> Pat,
> 
> I *just* made a case in this thread explaining that mllib does not have a
> single distributed matrix types and that its own methodologies do not
> interoperate within itself for that reason. Therefore, it is fundamentally
> impossible to be interoperable with mllib since nobody really can define
> what it means in terms of distributed types.
> 
> You are in fact referring  to their in-core type, not a distributed type.
> But there's no linear algebra operation support to speak of there either.
> It is, simply, not algebra, at the moment. The types in this hierarchy are
> just memory storage models, and private scope converters to breeze storage
> models, but they are not true linalg apis nor providers of such.
> 
> One might conceivably want to standardize on Breeze apis since those are
> both linalg api and providers, but not the type you've been mentioning.
> 
> However, it is not a very happy path either. Breeze is somewhat more
> interesting substrate to build in-core operations on, but if you read spark
> forum of late, even spark developers express a whiff of dissatisfaction
> with it in favor of BIDMat (me too btw). But while they say Bidmat would be
> a better choice for in-core operators, they also recognize the fact that
> they are too invested into breeze api by now and such move would not be
> cheap across the board.
> 
> And that demonstrates another problem in the in-core mllib architecture there:
> on one side, they don't have sufficient public in-core dsl or api to speak
> of; but they also do not have a sufficiently abstract api for in-core blas
> plugins either to be truly agnostic of the available in-core methodologies.
> 
> So what you are talking about, is simply not possible with current state of
> things there. But if it were, i'd just suggest you to try to port algebraic
> things you like in Mahout, to mllib.
> 
> My guess however is that you'd find that porting algebraic optimizer with
> proper level of consistency with in-core operations will not be easy for
> reasons including, but not limited to, the ones i just mentioned; although
> individual blas  like matrix square you've mentioned would be fairly easy
> to do for one of the distributed matrix types in mllib. But that of course
> would not be an R like environment and not an optimizer.
> 
> I like bidmat a lot though; but it is not truly hybrid and self-adjusting
> environment for in-core operations either (and its dsl is neither Rlike nor
> matlab like, so it takes a bit of adjusting to). For that reason even
> Bidmat linalg types and dsl are not truly versatile enough for our (well,
> my anyway) purposes (which are to find the best hardware or software
> subroutine automatically given current hardware and software platform
> architecture and parameters of the requested operation).
> On Feb 8, 2015 9:05 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
> 
>> Why aren’t we using linalg.Vector and its siblings? The same could be
>> asked for linalg.Matrix. If we want to prune dependencies this would help
>> and would also significantly increase interoperability.
>> 
>> Case-now: I have a real need to cluster items in a CF type input matrix.
>> The input matrix A’ has rows of items. I need to drop this into a sequence
>> file and use Mahout’s hadoop KMeans. Ugh. Or I need to convert A’ into an
>> RDD of linalg.Vectors and use MLlib Kmeans. The conversion is not too bad
>> and maybe could be helped with some implicit conversions mahout.Vector
> <->
>> linalg.Vector (maybe mahout.DRM <-> linalg.Matrix, though not needed for
>> Kmeans).
>> 
>> Case-possible: If we adopted linalg.Vector as the native format and
>> perhaps even linalg.Matrix this would give immediate interoperability in
>> some areas including my specific need. It would significantly pare down
>> dependencies not provided by the environment (Mahout-math). It would also
>> support creating distributed computation methods that would work on MLlib
>> and Mahout datasets addressing Gokhan’s question.
>> 
>> I looked at another “Case-now” possibility, which was to go all MLlib
> with
>> item similarity. I found that MLlib doesn’t have a transpose—“transpose,
>> why would you want to do that?” Not even in the multiply form A’A, A’B,
>> AA’, all used in item and row similarity. That stopped me from looking
>> deeper.
>> 
>> The strength and unique value of Mahout is the completeness of its
>> generalized linear algebra DSL. But insistence on using Mahout specific
>> data types is also a barrier for Spark people adopting the DSL. Not
> having
>> lower level interoperability is a barrier both ways to mixing Mahout and
>> MLlib—creating unnecessary either/or choices for devs.
>> 
>> On Feb 5, 2015, at 1:32 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> 
>> On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan <gk...@gmail.com> wrote:
>> 
>>> What I am saying is that for certain algorithms including both
>>> engine-specific (such as aggregation) and DSL stuff, what is the best
> way
>>> of handling them?
>>> 
>>> i) should we add the distributed operations to Mahout codebase as it is
>>> proposed in #62?
>>> 
>> 
>> Imo this can't go very well and very far (because of the engine
> specifics)
>> but i'd be willing to see an experiment with simple things like map and
>> reduce.
>> 
>> Bigger questions are where exactly we'll have to stop (we can't abstract
>> all capabilities out there because of "common denominator" issues), and
>> what percentage of methods will it truly allow to migrate to full backend
>> portability.
>> 
>> And if after doing all this, we will still find ourselves writing engine
>> specific mixes, why bother. Wouldn't it be better to find a good,
>> easy-to-replicate, incrementally-developed pattern to register and apply
>> engine-specific strategies for every method?
>> 
>> 
>>> 
>>> ii) should we have [engine]-ml modules (like spark-bindings and
>>> h2o-bindings) where we can mix the DSL and engine-specific stuff?
>>> 
>> 
>> This is not quite what i am proposing. Rather, engine-ml modules holding
>> engine-specific _parts_ of algorithm.
>> 
>> However, this really needs a POC over a guinea pig (similarly to how we
>> POC'd algebra in the first place with ssvd and spca).
>> 
>> 
>>> 
>>> 
>> 
>> 
> 
> 


Re: Codebase refactoring proposal

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
The conversion from DRM to rdd of vectors for kmeans is one line. Kmeans
application and conversion back is another line. I actually did that some
time ago. I am sure you  can figure the details.
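
Roughly, the two lines in question look like this (a from-memory sketch; it assumes the Spark-backed checkpointed DRM exposes its underlying row RDD, shown here as .rdd, and the exact accessor may differ):

// Sketch: DRM -> RDD[mllib Vector] -> KMeans -> assignments keyed by the original row key.
import org.apache.mahout.math.drm.CheckpointedDrm
import org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vectors => MLVectors}

def kmeansOnDrm(drmA: CheckpointedDrm[Int], k: Int) = {
  // line 1: strip keys and densify each Mahout vector into an mllib vector
  val vecs = drmA.asInstanceOf[CheckpointedDrmSpark[Int]].rdd
    .map { case (key, v) => key -> MLVectors.dense(Array.tabulate(v.size)(v.get)) }
    .cache()

  val model = KMeans.train(vecs.values, k, 10)

  // line 2: cluster assignment per row, keyed back by the original row key
  vecs.mapValues(v => model.predict(v))
}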

Whether it is worth retaining some commonality: no, it is not worth it
until there's commonality across mllib.

At which point we may just include conversions for those who are interested.
Until then all we can do is to maintain commonality with mllib kmeans
specifically, but not mllib as a whole.
On Feb 8, 2015 7:45 PM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:

> I completely understand that MLlib lacks anything like the completeness of
> Mahout's DSL, I know of no other scalable solution to match.  I don’t know
> how many times this has to be said. This is something we can all get behind
> as *unique* to Mahout.
>
> But I stand by the statement that there should also be some lower level
> data commonality. There is too much similarity to dismiss and go completely
> non-overlapping ways. Even if you can argue for maintaining separate
> parallel ways, let’s have some type conversions (I hesitate to say easy to
> use). They shouldn’t be all that hard.
>
> A conversion of DRM of o.a.m.Vector to rdd of MLlib Vector and back would
> solve my Kmeans use case. You know MLlib better than I so choose the best
> level to perform type conversions or inheritance splicing. The point is to
> make the two as seamless as possible. Doesn’t this seem a worthy goal?
>
> On Feb 8, 2015, at 4:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> Pat,
>
> I *just* made a case in this thread explaining that mllib does not have a
> single distributed matrix types and that its own methodologies do not
> interoperate within itself for that reason. Therefore, it is fundamentally
> impossible to be interoperable with mllib since nobody really can define
> what it means in terms of distributed types.
>
> You are in fact referring  to their in-core type, not a distributed type.
> But there's no linear algebra operation support to speak of there either.
> It is, simply, not algebra, at the moment. The types in this hierarchy are
> just memory storage models, and private scope converters to breeze storage
> models, but they are not true linalg apis nor providers of such.
>
> One might conceivably want to standardize on Breeze apis since those are
> both linalg api and providers, but not the type you've been mentioning.
>
> However, it is not a very happy path either. Breeze is somewhat more
> interesting substrate to build in-core operations on, but if you read spark
> forum of late, even spark developers express a whiff of dissatisfaction
> with it in favor of BIDMat (me too btw). But while they say Bidmat would be
> a better choice for in-core operators, they also recognize the fact that
> they are too invested into breeze api by now and such move would not be
> cheap across the board.
>
> And that demonstrates another problem in the in-core mllib architecture there:
> on one side, they don't have sufficient public in-core dsl or api to speak
> of; but they also do not have a sufficiently abstract api for in-core blas
> plugins either to be truly agnostic of the available in-core methodologies.
>
> So what you are talking about, is simply not possible with current state of
> things there. But if it were, i'd just suggest you to try to port algebraic
> things you like in Mahout, to mllib.
>
> My guess however is that you'd find that porting algebraic optimizer with
> proper level of consistency with in-core operations will not be easy for
> reasons including, but not limited to, the ones i just mentioned; although
> individual blas  like matrix square you've mentioned would be fairly easy
> to do for one of the distributed matrix types in mllib. But that of course
> would not be an R like environment and not an optimizer.
>
> I like bidmat a lot though; but it is not truly hybrid and self-adjusting
> environment for in-core operations either (and its dsl is neither R-like nor
> matlab-like, so it takes a bit of adjusting to). For that reason even
> Bidmat linalg types and dsl are not truly versatile enough for our (well,
> my anyway) purposes (which are to find the best hardware or software
> subroutine automatically given current hardware and software platform
> architecture and parameters of the requested operation).
> On Feb 8, 2015 9:05 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
>
> > Why aren’t we using linalg.Vector and its siblings? The same could be
> > asked for linalg.Matrix. If we want to prune dependencies this would help
> > and would also significantly increase interoperability.
> >
> > Case-now: I have a real need to cluster items in a CF type input matrix.
> > The input matrix A’ has rows of items. I need to drop this into a sequence
> > file and use Mahout’s hadoop KMeans. Ugh. Or I need to convert A’ into an
> > RDD of linalg.Vectors and use MLlib Kmeans. The conversion is not too bad
> > and maybe could be helped with some implicit conversions mahout.Vector
> <->
> > linalg.Vector (maybe mahout.DRM <-> linalg.Matrix, though not needed for
> > Kmeans).
> >
> > Case-possible: If we adopted linalg.Vector as the native format and
> > perhaps even linalg.Matrix this would give immediate interoperability in
> > some areas including my specific need. It would significantly pare down
> > dependencies not provided by the environment (Mahout-math). It would also
> > support creating distributed computation methods that would work on MLlib
> > and Mahout datasets addressing Gokhan’s question.
> >
> > I looked at another “Case-now” possibility, which was to go all MLlib
> with
> > item similarity. I found that MLlib doesn’t have a transpose—“transpose,
> > why would you want to do that?” Not even in the multiply form A’A, A’B,
> > AA’, all used in item and row similarity. That stopped me from looking
> > deeper.
> >
> > The strength and unique value of Mahout is the completeness of its
> > generalized linear algebra DSL. But insistence on using Mahout specific
> > data types is also a barrier for Spark people adopting the DSL. Not
> having
> > lower level interoperability is a barrier both ways to mixing Mahout and
> > MLlib—creating unnecessary either/or choices for devs.
> >
> > On Feb 5, 2015, at 1:32 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> >
> > On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan <gk...@gmail.com> wrote:
> >
> >> What I am saying is that for certain algorithms including both
> >> engine-specific (such as aggregation) and DSL stuff, what is the best
> way
> >> of handling them?
> >>
> >> i) should we add the distributed operations to Mahout codebase as it is
> >> proposed in #62?
> >>
> >
> > Imo this can't go very well and very far (because of the engine
> specifics)
> > but i'd be willing to see an experiment with simple things like map and
> > reduce.
> >
> > Bigger questions are, where exactly we'll have to stop (we can't abstract
> > all capabilities out there because of "common denominator" issues), and
> > what percentage of methods will it truly allow to migrate to full backend
> > portability.
> >
> > And if after doing all this, we will still find ourselves writing engine
> > specific mixes, why bother. Wouldn't it be better to find a good,
> > easy-to-replicate, incrementally-developed pattern to register and apply
> > engine-specific strategies for every method?
> >
> >
> >>
> >> ii) should we have [engine]-ml modules (like spark-bindings and
> >> h2o-bindings) where we can mix the DSL and engine-specific stuff?
> >>
> >
> > This is not quite what i am proposing. Rather, engine-ml modules holding
> > engine-specific _parts_ of an algorithm.
> >
> > However, this really needs a POC over a guinea pig (similarly to how we
> > POC'd algebra in the first place with ssvd and spca).
> >
> >
> >>
> >>
> >
> >
>
>

Re: Codebase refactoring proposal

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I completely understand that MLlib lacks anything like the completeness of Mahout's DSL; I know of no other scalable solution to match.  I don’t know how many times this has to be said. This is something we can all get behind as *unique* to Mahout.

But I stand by the statement that there should also be some lower level data commonality. There is too much similarity to dismiss and go completely non-overlapping ways. Even if you can argue for maintaining separate parallel ways, let’s have some type conversions (I hesitate to say easy to use). They shouldn’t be all that hard.

A conversion of a DRM of o.a.m.Vector to an rdd of MLlib Vector and back would solve my Kmeans use case. You know MLlib better than I do, so choose the best level to perform type conversions or inheritance splicing. The point is to make the two as seamless as possible. Doesn’t this seem a worthy goal?
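
At the vector level the implicit conversions could look roughly like this (a sketch only, using nothing beyond the public o.a.m.math and MLlib linalg types; untested):

  import scala.language.implicitConversions
  import scala.collection.JavaConversions._
  import org.apache.mahout.math.{DenseVector, RandomAccessSparseVector, Vector => MahoutVector}
  import org.apache.spark.mllib.linalg.{DenseVector => MLDense, SparseVector => MLSparse, Vector => MLVector, Vectors}

  object VectorConversions {
    // mahout -> mllib: copy the non-zero elements (mllib wants increasing indices)
    implicit def mahout2mllib(v: MahoutVector): MLVector = {
      val nz = v.nonZeroes().map(e => (e.index(), e.get())).toSeq.sortBy(_._1)
      Vectors.sparse(v.size(), nz.map(_._1).toArray, nz.map(_._2).toArray)
    }
    // mllib -> mahout: dense stays dense, sparse goes to a random-access sparse vector
    implicit def mllib2mahout(v: MLVector): MahoutVector = v match {
      case d: MLDense  => new DenseVector(d.values)
      case s: MLSparse =>
        val mv = new RandomAccessSparseVector(s.size)
        s.indices.zip(s.values).foreach { case (i, x) => mv.setQuick(i, x) }
        mv
    }
  }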
 
On Feb 8, 2015, at 4:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

Pat,

I *just* made a case in this thread explaining that mllib does not have a
single distributed matrix type and that its own methodologies do not
interoperate within itself for that reason. Therefore, it is fundamentally
impossible to be interoperable with mllib since nobody really can define
what it means in terms of distributed types.

You are in fact referring  to their in-core type, not a distributed type.
But there's no linear algebra operation support to speak of there either.
It is, simply, not algebra, at the moment. The types in this hierarchy are
just memory storage models, and private scope converters to breeze storage
models, but they are not true linalg apis nor providers of such.

One might conceivably want to standardize on Breeze apis since those are
both linalg api and providers, but not the type you've been mentioning.

However, it is not a very happy path either. Breeze is somewhat more
interesting substrate to build in-core operations on, but if you read spark
forum of late, even spark developers express a whiff of dissatisfaction
with it in favor of BIDMat (me too btw). But while they say Bidmat would be
a better choice for in-core operators, they also recognize the fact that
they are too invested in the breeze api by now and such a move would not be
cheap across the board.

And that demonstrates another problem with in-core mllib architecture there:
on one side, they don't have sufficient public in-core dsl or api to speak
of; but they also do not have a sufficiently abstract api for in-core blas
plugins either to be truly agnostic of the available in-core methodologies.

So what you are talking about, is simply not possible with current state of
things there. But if it were, i'd just suggest you to try to port algebraic
things you like in Mahout, to mllib.

My guess however is that you'd find that porting algebraic optimizer with
proper level of consistency with in-core operations will not be easy for
reasons including, but not limited to, the ones i just mentioned; although
individual blas  like matrix square you've mentioned would be fairly easy
to do for one of the distributed matrix types in mllib. But that of course
would not be an R like environment and not an optimizer.

I like bidmat a lot though; but it is not truly hybrid and self-adjusting
environment for in-core operations either (and its dsl is neither R-like nor
matlab-like, so it takes a bit of adjusting to). For that reason even
Bidmat linalg types and dsl are not truly versatile enough for our (well,
my anyway) purposes (which are to find the best hardware or software
subroutine automatically given current hardware and software platform
architecture and parameters of the requested operation).
On Feb 8, 2015 9:05 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:

> Why aren’t we using linalg.Vector and its siblings? The same could be
> asked for linalg.Matrix. If we want to prune dependencies this would help
> and would also significantly increase interoperability.
> 
> Case-now: I have a real need to cluster items in a CF type input matrix.
> The input matrix A’ has rows of items. I need to drop this into a sequence
> file and use Mahout’s hadoop KMeans. Ugh. Or I need to convert A’ into an
> RDD of linalg.Vectors and use MLlib Kmeans. The conversion is not too bad
> and maybe could be helped with some implicit conversions mahout.Vector <->
> linalg.Vector (maybe mahout.DRM <-> linalg.Matrix, though not needed for
> Kmeans).
> 
> Case-possible: If we adopted linalg.Vector as the native format and
> perhaps even linalg.Matrix this would give immediate interoperability in
> some areas including my specific need. It would significantly pare down
> dependencies not provided by the environment (Mahout-math). It would also
> support creating distributed computation methods that would work on MLlib
> and Mahout datasets addressing Gokhan’s question.
> 
> I looked at another “Case-now” possibility, which was to go all MLlib with
> item similarity. I found that MLlib doesn’t have a transpose—“transpose,
> why would you want to do that?” Not even in the multiply form A’A, A’B,
> AA’, all used in item and row similarity. That stopped me from looking
> deeper.
> 
> The strength and unique value of Mahout is the completeness of its
> generalized linear algebra DSL. But insistence on using Mahout specific
> data types is also a barrier for Spark people adopting the DSL. Not having
> lower level interoperability is a barrier both ways to mixing Mahout and
> MLlib—creating unnecessary either/or choices for devs.
> 
> On Feb 5, 2015, at 1:32 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan <gk...@gmail.com> wrote:
> 
>> What I am saying is that for certain algorithms including both
>> engine-specific (such as aggregation) and DSL stuff, what is the best way
>> of handling them?
>> 
>> i) should we add the distributed operations to Mahout codebase as it is
>> proposed in #62?
>> 
> 
> Imo this can't go very well and very far (because of the engine specifics)
> but i'd be willing to see an experiment with simple things like map and
> reduce.
> 
> Bigger questions are, where exactly we'll have to stop (we can't abstract
> all capabilities out there because of "common denominator" issues), and
> what percentage of methods will it truly allow to migrate to full backend
> portability.
> 
> And if after doing all this, we will still find ourselves writing engine
> specific mixes, why bother. Wouldn't it be better to find a good,
> easy-to-replicate, incrementally-developed pattern to register and apply
> engine-specific strategies for every method?
> 
> 
>> 
>> ii) should we have [engine]-ml modules (like spark-bindings and
>> h2o-bindings) where we can mix the DSL and engine-specific stuff?
>> 
> 
> This is not quite what i am proposing. Rather, engine-ml modules holding
> engine-specific _parts_ of an algorithm.
> 
> However, this really needs a POC over a guinea pig (similarly to how we
> POC'd algebra in the first place with ssvd and spca).
> 
> 
>> 
>> 
> 
> 


Re: Codebase refactoring proposal

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Pat,

I *just* made a case in this thread explaining that mllib does not have a
single distributed matrix type and that its own methodologies do not
interoperate within itself for that reason. Therefore, it is fundamentally
impossible to be interoperable with mllib since nobody really can define
what it means in terms of distributed types.

You are in fact referring  to their in-core type, not a distributed type.
But there's no linear algebra operation support to speak of there either.
It is, simply, not algebra, at the moment. The types in this hierarchy are
just memory storage models, and private scope converters to breeze storage
models, but they are not true linalg apis nor providers of such.

One might conceivably want to standardize on Breeze apis since those are
both linalg api and providers, but not the type you've been mentioning.

However, it is not a very happy path either. Breeze is somewhat more
interesting substrate to build in-core operations on, but if you read spark
forum of late, even spark developers express a whiff of dissatisfaction
with it in favor of BIDMat (me too btw). But while they say Bidmat would be
a better choice for in-core operators, they also recognize the fact that
they are too invested in the breeze api by now and such a move would not be
cheap across the board.

And that demonstrates another problem with in-core mllib architecture there:
on one side, they don't have sufficient public in-core dsl or api to speak
of; but they also do not have a sufficiently abstract api for in-core blas
plugins either to be truly agnostic of the available in-core methodologies.

So what you are talking about, is simply not possible with current state of
things there. But if it were, i'd just suggest you to try to port algebraic
things you like in Mahout, to mllib.

My guess however is that you'd find that porting algebraic optimizer with
proper level of consistency with in-core operations will not be easy for
reasons including, but not limited to, the ones i just mentioned; although
individual blas  like matrix square you've mentioned would be fairly easy
to do for one of the distributed matrix types in mllib. But that of course
would not be an R like environment and not an optimizer.

I like bidmat a lot though; but it is not truly hybrid and self-adjusting
environment for in-core operations either (and its dsl is neither R-like nor
matlab-like, so it takes a bit of adjusting to). For that reason even
Bidmat linalg types and dsl are not truly versatile enough for our (well,
my anyway) purposes (which are to find the best hardware or software
subroutine automatically given current hardware and software platform
architecture and parameters of the requested operation).
On Feb 8, 2015 9:05 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:

> Why aren’t we using linalg.Vector and its siblings? The same could be
> asked for linalg.Matrix. If we want to prune dependencies this would help
> and would also significantly increase interoperability.
>
> Case-now: I have a real need to cluster items in a CF type input matrix.
> The input matrix A’ has rows of items. I need to drop this into a sequence
> file and use Mahout’s hadoop KMeans. Ugh. Or I need to convert A’ into an
> RDD of linalg.Vectors and use MLlib Kmeans. The conversion is not too bad
> and maybe could be helped with some implicit conversions mahout.Vector <->
> linalg.Vector (maybe mahout.DRM <-> linalg.Matrix, though not needed for
> Kmeans).
>
> Case-possible: If we adopted linalg.Vector as the native format and
> perhaps even linalg.Matrix this would give immediate interoperability in
> some areas including my specific need. It would significantly pare down
> dependencies not provided by the environment (Mahout-math). It would also
> support creating distributed computation methods that would work on MLlib
> and Mahout datasets addressing Gokhan’s question.
>
> I looked at another “Case-now” possibility, which was to go all MLlib with
> item similarity. I found that MLlib doesn’t have a transpose—“transpose,
> why would you want to do that?” Not even in the multiply form A’A, A’B,
> AA’, all used in item and row similarity. That stopped me from looking
> deeper.
>
> The strength and unique value of Mahout is the completeness of its
> generalized linear algebra DSL. But insistence on using Mahout specific
> data types is also a barrier for Spark people adopting the DSL. Not having
> lower level interoperability is a barrier both ways to mixing Mahout and
> MLlib—creating unnecessary either/or choices for devs.
>
> On Feb 5, 2015, at 1:32 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan <gk...@gmail.com> wrote:
>
> > What I am saying is that for certain algorithms including both
> > engine-specific (such as aggregation) and DSL stuff, what is the best way
> > of handling them?
> >
> > i) should we add the distributed operations to Mahout codebase as it is
> > proposed in #62?
> >
>
> Imo this can't go very well and very far (because of the engine specifics)
> but i'd be willing to see an experiment with simple things like map and
> reduce.
>
> Bigger questions are, where exactly we'll have to stop (we can't abstract
> all capabilities out there because of "common denominator" issues), and
> what percentage of methods will it truly allow to migrate to full backend
> portability.
>
> And if after doing all this, we will still find ourselves writing engine
> specific mixes, why bother. Wouldn't it be better to find a good,
> easy-to-replicate, incrementally-developed pattern to register and apply
> engine-specific strategies for every method?
>
>
> >
> > ii) should we have [engine]-ml modules (like spark-bindings and
> > h2o-bindings) where we can mix the DSL and engine-specific stuff?
> >
>
> This is not quite what i am proposing. Rather, engine-ml modules holding
> engine-specific _parts_ of an algorithm.
>
> However, this really needs a POC over a guinea pig (similarly to how we
> POC'd algebra in the first place with ssvd and spca).
>
>
> >
> >
>
>

Re: Codebase refactoring proposal

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Why aren’t we using linalg.Vector and its siblings? The same could be asked for linalg.Matrix. If we want to prune dependencies this would help and would also significantly increase interoperability.

Case-now: I have a real need to cluster items in a CF type input matrix. The input matrix A’ has rows of items. I need to drop this into a sequence file and use Mahout’s hadoop KMeans. Ugh. Or I need to convert A’ into an RDD of linalg.Vectors and use MLlib Kmeans. The conversion is not too bad and maybe could be helped with some implicit conversions mahout.Vector <-> linalg.Vector (maybe mahout.DRM <-> linalg.Matrix, though not needed for Kmeans).

Case-possible: If we adopted linalg.Vector as the native format and perhaps even linalg.Matrix this would give immediate interoperability in some areas including my specific need. It would significantly pare down dependencies not provided by the environment (Mahout-math). It would also support creating distributed computation methods that would work on MLlib and Mahout datasets addressing Gokhan’s question.

I looked at another “Case-now” possibility, which was to go all MLlib with item similarity. I found that MLlib doesn’t have a transpose—“transpose, why would you want to do that?” Not even in the multiply form A’A, A’B, AA’, all used in item and row similarity. That stopped me from looking deeper.

The strength and unique value of Mahout is the completeness of its generalized linear algebra DSL. But insistence on using Mahout-specific data types is also a barrier for Spark people adopting the DSL. Not having lower level interoperability is a barrier both ways to mixing Mahout and MLlib—creating unnecessary either/or choices for devs.
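
(For reference, those transpose products are already one-liners in the Mahout DSL, and the optimizer is supposed to rewrite them into fused operators rather than materializing the transpose; a quick sketch:)

  import org.apache.mahout.math.drm._
  import org.apache.mahout.math.drm.RLikeDrmOps._

  def products(drmA: DrmLike[Int], drmB: DrmLike[Int]) = {
    val ata = drmA.t %*% drmA   // A'A
    val atb = drmA.t %*% drmB   // A'B
    val aat = drmA %*% drmA.t   // AA'
    (ata, atb, aat)             // lazy expressions, evaluated by the optimizer
  }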

On Feb 5, 2015, at 1:32 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan <gk...@gmail.com> wrote:

> What I am saying is that for certain algorithms including both
> engine-specific (such as aggregation) and DSL stuff, what is the best way
> of handling them?
> 
> i) should we add the distributed operations to Mahout codebase as it is
> proposed in #62?
> 

Imo this can't go very well and very far (because of the engine specifics)
but i'd be willing to see an experiment with simple things like map and
reduce.

Bigger questions are, where exactly we'll have to stop (we can't abstract
all capabilities out there because of "common denominator" issues), and
what percentage of methods will it truly allow to migrate to full backend
portability.

And if after doing all this, we will still find ourselves writing engine
specific mixes, why bother. Wouldn't it be better to find a good,
easy-to-replicate, incrementally-developed pattern to register and apply
engine-specific strategies for every method?


> 
> ii) should we have [engine]-ml modules (like spark-bindings and
> h2o-bindings) where we can mix the DSL and engine-specific stuff?
> 

This is not quite what i am proposing. Rather, engine-ml modules holding
engine-specific _parts_ of an algorithm.

However, this really needs a POC over a guinea pig (similarly to how we
POC'd algebra in the first place with ssvd and spca).


> 
> 


Re: Codebase refactoring proposal

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan <gk...@gmail.com> wrote:

> What I am saying is that for certain algorithms including both
> engine-specific (such as aggregation) and DSL stuff, what is the best way
> of handling them?
>
> i) should we add the distributed operations to Mahout codebase as it is
> proposed in #62?
>

Imo this can't go very well and very far (because of the engine specifics)
but i'd be willing to see an experiment with simple things like map and
reduce.

Bigger questions are, where exactly we'll have to stop (we can't abstract
all capabilities out there because of "common denominator" issues), and
what percentage of methods will it truly allow to migrate to full backend
portability.

And if after doing all this, we will still find ourselves writing engine
specific mixes, why bother. Wouldn't it be better to find a good,
easy-to-replicate, incrementally-developed pattern to register and apply
engine-specific strategies for every method?


>
> ii) should we have [engine]-ml modules (like spark-bindings and
> h2o-bindings) where we can mix the DSL and engine-specific stuff?
>

This is not quite what i am proposing. Rather, engine-ml modules holding
engine-specific _parts_ of an algorithm.

However, this really needs a POC over a guinea pig (similarly to how we
POC'd algebra in the first place with ssvd and spca).


>
>

Re: Codebase refactoring proposal

Posted by Pat Ferrel <pa...@occamsmachete.com>.
From my own perspective:

I’m not aware of any rule to make all operations agnostic. In fact several engine specific exceptions are discussed in this long email. We’ve talked about reduce or join operations that would be difficult to make agnostic without a lot of knowledge of ALL other engines. Unless or until we get contributors from those engines reviewing commits, why put this burden on all of us?

The agnostic DSL was for linear algebra ops, not all distributed computation methods. We aren’t building a generic engine, only engine-agnostic algebra.

You have added stubs in H2O for the distributed aggregations. This seems fine but I wouldn’t vote to require that. If GSGD requires further use of Spark-specific operations, so be it. This means that GSGD may live in the Spark module with any required algebra bits added to math-scala. Does anyone have a problem with that?

My vote on #62—ship it.

On the point of interoperability with MLlib we still need talk about that but another email.


On Feb 5, 2015, at 1:14 AM, Gokhan Capan <gk...@gmail.com> wrote:

What I am saying is that for certain algorithms including both
engine-specific (such as aggregation) and DSL stuff, what is the best way
of handling them?

i) should we add the distributed operations to Mahout codebase as it is
proposed in #62?

ii) should we have [engine]-ml modules (like spark-bindings and
h2o-bindings) where we can mix the DSL and engine-specific stuff?

Picking i. has the advantage of writing an ML-algorithm once and then it
can be run on alternative engines, but it requires wrapping/duplicating
existing distributed operations.

Picking ii. has the advantage of avoiding writing distributed operations,
but since we're mixing the DSL and the engine-specific stuff, an
ML-algorithm written for an engine would not be available for the others.

I just wanted to hear some opinions.

Gokhan

On Thu, Feb 5, 2015 at 4:11 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I took it Gokhan had objections himself, based on his comments. if we are
> talking about #62.
> 
> He also expressed concerns about computing GSGD but i suspect it can still
> be algebraically computed.
> 
> On Wed, Feb 4, 2015 at 5:52 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
>> BTW Ted and Andrew have both expressed interest in the distributed
>> aggregation stuff. It sounds like we are agreeing that
>> non-algebra—computation method type things can be engine specific.
>> 
>> So does anyone have an objection to Gokhan pushing his PR?
>> 
>> On Feb 4, 2015, at 2:20 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> 
>> On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo <ap...@outlook.com>
> wrote:
>> 
>>> 
>>> 
>>> 
>>> My thought was not to bring primitive engine-specific aggregators,
>>> combiners, etc. into math-scala.
>>> 
>> 
>> Yeah. +1. I would like to support that as an experiment, see where it
> goes.
>> Clearly some distributed use cases are simple enough while also pervasive
>> enough.
>> 
>> 
> 


Re: Codebase refactoring proposal

Posted by Gokhan Capan <gk...@gmail.com>.
What I am saying is that for certain algorithms including both
engine-specific (such as aggregation) and DSL stuff, what is the best way
of handling them?

i) should we add the distributed operations to Mahout codebase as it is
proposed in #62?

ii) should we have [engine]-ml modules (like spark-bindings and
h2o-bindings) where we can mix the DSL and engine-specific stuff?

Picking i. has the advantage of writing an ML-algorithm once and then it
can be run on alternative engines, but it requires wrapping/duplicating
existing distributed operations.

Picking ii. has the advantage of avoiding writing distributed operations,
but since we're mixing the DSL and the engine-specific stuff, an
ML-algorithm written for an engine would not be available for the others.
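
For concreteness, option i. would mean adding something along these lines to the engine-agnostic API (names here are purely hypothetical for illustration, not the actual #62 code), with each engine supplying its own implementation:

  import org.apache.mahout.math.Matrix
  import org.apache.mahout.math.drm.DrmLike

  // hypothetical engine-agnostic aggregation, for discussion only
  trait DistributedAggregations {
    // map each vertical block of a DRM to a value, then reduce the values
    def aggregateBlocks[K, T](drm: DrmLike[K])(map: Matrix => T)(reduce: (T, T) => T): T
  }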

I just wanted to hear some opinions.

Gokhan

On Thu, Feb 5, 2015 at 4:11 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I took it Gokhan had objections himself, based on his comments. if we are
> talking about #62.
>
> He also expressed concerns about computing GSGD but i suspect it can still
> be algebraically computed.
>
> On Wed, Feb 4, 2015 at 5:52 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> > BTW Ted and Andrew have both expressed interest in the distributed
> > aggregation stuff. It sounds like we are agreeing that
> > non-algebra—computation method type things can be engine specific.
> >
> > So does anyone have an objection to Gokhan pushing his PR?
> >
> > On Feb 4, 2015, at 2:20 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> >
> > On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo <ap...@outlook.com>
> wrote:
> >
> > >
> > >
> > >
> > > My thought was not to bring primitive engine-specific aggregators,
> > > combiners, etc. into math-scala.
> > >
> >
> > Yeah. +1. I would like to support that as an experiment, see where it
> goes.
> > Clearly some distributed use cases are simple enough while also pervasive
> > enough.
> >
> >
>

Re: Codebase refactoring proposal

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I took it Gokhan had objections himself, based on his comments. if we are
talking about #62.

He also expressed concerns about computing GSGD but i suspect it can still
be algebraically computed.

On Wed, Feb 4, 2015 at 5:52 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> BTW Ted and Andrew have both expressed interest in the distributed
> aggregation stuff. It sounds like we are agreeing that
> non-algebra—computation method type things can be engine specific.
>
> So does anyone have an objection to Gokhan pushing his PR?
>
> On Feb 4, 2015, at 2:20 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>
> >
> >
> >
> > My thought was not to bring primitive engine-specific aggregators,
> > combiners, etc. into math-scala.
> >
>
> Yeah. +1. I would like to support that as an experiment, see where it goes.
> Clearly some distributed use cases are simple enough while also pervasive
> enough.
>
>

Re: Codebase refactoring proposal

Posted by Pat Ferrel <pa...@occamsmachete.com>.
BTW Ted and Andrew have both expressed interest in the distributed aggregation stuff. It sounds like we are agreeing that non-algebra—computation method type things can be engine specific.

So does anyone have an objection to Gokhan pushing his PR?

On Feb 4, 2015, at 2:20 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo <ap...@outlook.com> wrote:

> 
> 
> 
> My thought was not to bring primitive engine-specific aggregators,
> combiners, etc. into math-scala.
> 

Yeah. +1. I would like to support that as an experiment, see where it goes.
Clearly some distributed use cases are simple enough while also pervasive
enough.


Re: Codebase refactoring proposal

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo <ap...@outlook.com> wrote:

>
>
>
> My thought was not to bring primitive engine-specific aggregators,
> combiners, etc. into math-scala.
>

Yeah. +1. I would like to support that as an experiment, see where it goes.
Clearly some distributed use cases are simple enough while also pervasive
enough.

Re: Codebase refactoring proposal

Posted by Andrew Palumbo <ap...@outlook.com>.
On 02/04/2015 03:37 PM, Dmitriy Lyubimov wrote:
> Re: Gokhan's PR post: here are my thoughts but i did not want to post it
> there since they are going beyond the scope of that PR's work to chase the
> root of the issue.
>
> on quasi-algebraic methods
> ========================
>
> What is the dilemma here? don't see any.
>
> I already explained that no more than 25% of algorithms are truly 100%
> algebraic. But about 80% cannot avoid using some algebra and close to 95%
> could benefit from using algebra (even stochastic and monte carlo stuff).
>
> So we are building a system that allows us to cut a developer's work by at
> least 60% and make his work also more readable by 3000%. As far as I am
> concerned, that fulfills the goal. And I am perfectly happy writing a mix
> of engine-specific primitives and algebra.
>
> That's why i am a bit skeptical about attempts to abstract non-algebraic
> primitives such as row-wise aggregators in one of the pull requests.
> Engine-specific primitives and algebra can perfectly co-exist in the guts.
> And that's how i am doing my stuff in practice, except i now can skip 80%
> effort on algebra and bridging incompatible inputs-outputs.
I am **definitely** not advocating messing with the algebraic 
optimizer.  That was what I saw as the plus side to Gokhan's PR: a 
separate engine abstraction for quasi/non-algebraic distributed methods. 
I didn't comment on the PR either because admittedly I did not have a 
chance to spend a lot of time on it.  But my quick takeaway was that we 
could take some very useful and hopefully (close to) ubiquitous 
distributed operators and pass them through to the engine "guts".

I briefly looked through some of the flink and h2o code and noticed 
Flink's aggregateOperator [1] and h2o's MapReduce API [2].  My thought 
was that we could write pass-through operators for some of the more 
useful operations from math-scala and then implement them fully in their 
respective packages.  Though I am not sure how this would work in either 
case w.r.t. partitioning, e.g. on h2o's distributed DataFrame, or flink 
for that matter.  Again, I haven't had a lot of time to look at these 
and see if this would work at all.

My thought was not to bring primitive engine-specific aggregators, 
combiners, etc. into math-scala.

I had thought though that we were trying to develop a fully 
engine-agnostic algorithm library on top of the R-like distributed BLAS.


So would the idea be to implement, e.g., seq2sparse fully in the spark 
module?  It would seem to fracture the project a bit.


Or to implement algorithms sequentially if mapBlock() will not suffice 
and then optimize them in their respective modules?


>
> None of that means that R-like algebra cannot be engine agnostic. So people
> are unhappy about not being able to write the whole thing in a totally agnostic way?
> And so they (falsely) infer the pieces of their work cannot be helped by
> agnosticism individually, or the tools are not being as good as they might
> be without backend agnosticism? Sorry, but I fail to see the logic there.
>
> We proved algebra can be agnostic. I don't think this notion should be
> disputed.
>
> And even if there were a shred of real benefit by making algebra tools
> un-agnostic, it would not ever outweigh tons of good we could get for the
> project by integrating with e.g. Flink folks. This one one the points MLLib
> will never be able to overcome -- to be truly shared ML platform where
> people could create and share ML, but not just a bunch of ad-hoc spaghetty
> of distributed api calls and Spark-nailed black boxes.
>
> Well yes methodology implementations will still have native distributed
> calls. Just not nearly as many as they otherwise would, and will be much
> easier to support on another back-end using Strategy patterns. E.g.
> implicit feedback problem that i originally wrote as quasi-method for Spark
> only, would've taken just an hour or so to add strategy for flink, since it
> retains all in-core and distributed algebra work as is.
>
> Not to mention benefit of single type pipelining.
>
> And once we add hardware-accelerated bindings for in-core stuff, all these
> methods would immediately benefit from it.
>
> On MLLib interoperability issues,
> =========================
>
> well, let me ask you this: what it means to be MLLib-interoperable? is
> MLLib even interoperable within itself?
>
> E.g. i remember there was one most frequent request on the list here: how
> can we cluster dimensionally-reduced data?
>
> Let's look what it takes to do this in MLLib: First, we run tf-idf, which
> produces collection of vectors (and where did our document ids go? not
> sure); then we'd have to run svd or pca, both of which would accept
> RowMatrix (bummer! but we have collection of vectors); which would produce
> RowMatrix as well but kmeans training takes RDD of vectors (bummer again!).
>
> Not directly pluggable, although semi-trivially or trivially convertible.
> Plus strips off information that we potentially already have computed
> earlier in the pipeline, so we'd need to compute it again. I think problem
> is well demonstrated.
>
> Or, say, ALS stuff (implicit als in particular) is really an algebraic
> problem. Should be taking input in form of matrices (that my feature
> extraction algebraic pipeline perhaps has just prepared) but really takes
> POJOs. Bummer again.
>
> So what it is exactly we should be interoperable with in this picture if
> MLLib itself is not consistent?
>
> Let's look at the type system in flux there:
>
> we have
> (1) collection of vectors,
> (2) matrix of known dimensions for collection of vectors (row matrix),
> (3) indexedRowMatrix which is matrix of known dimension with keys that can
> be _only_ long; and
> (4) unknown but not infinitesimal amount of POJO-oriented approaches.
>
> But ok, let's constrain ourselves to matrix types only.
>
> Multitude of matrix types creates problems for tasks that require
> consistent key propagation (like  SVD or PCA or tf-idf, well demonstrated
> in the case of mllib). In the aforementioned case of dimensionality
> reduction over document collection, there's simply no way to propagate
> document ids to the rows of dimensionally-reduced data. As in none at all.
> as in hard no-work-around-exists stop.
>
> So. There's truly no need for multiple incompatible matrix types. There has
> to be just a single matrix type. Just a flexible one. And everything algebraic
> needs to use it.
>
> And if geometry is needed, then it could be either already known or lazily
> computed, but if it is not needed, nobody bothers to compute it (i.e. there's
> truly no need). And this knowledge should not be lost just because we have to
> convert between types.
>
> And if we want to express complex row keys such as for cluster assignments
> for example (my real case) then we could have a type with keys like
> Tuple2(rowKeyType, cluster-string).
>
> And that nobody really cares if intermediate results are really row- or
> column-partitioned.
>
> All within single type of things.
>
> Bottom line, "interoperability" with mllib is both hard and trivial.
>
> Trivial is because whenever you need to convert, it is one line of code and
> also a trivial distributed map fusion element. (I do have pipelines
> streaming mllib methods within DRM-based pipelines, not just speculating).
>
> Hard is because there are so many types you may need/want to convert
> between, so there's not much point to even try to write converters for all
> possible cases but rather go on need-to-do basis.
>
> It is also hard because their type system obviously continues evolving as
> we speak. So no point chase the rabbit in the making.
>
> Epilogue
> =======
> There's no problem with the philosophy of the distributed and
> non-distributed algebra approach. It is incredibly useful in practice and I
> have proven it continuously (what is in public domain is just tip of the
> iceberg).
>
> Rather, there's organizational anemia in the project. Like corporate legal
> interests (that includes me not being able to do quick turnaround of
> fixes), and not having been able to tap into university resources. But i
> don't believe in any technical philosophy problem.
>
> So given that aforementioned resource/logistical anemia, it will likely
> take some time when it would seem it gets worse before it gets better. But
> afaik there are multiple efforts going on behind the curtains to break red
> tape. so i'd just wait a bit.
>


[1] 
https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/operators/AggregateOperator.java
[2] 
http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/developuser/java.html



Re: Codebase refactoring proposal

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Re: Gokhan's PR post: here are my thoughts but i did not want to post it
there since they are going beyond the scope of that PR's work to chase the
root of the issue.

on quasi-algebraic methods
========================

What is the dilemma here? don't see any.

I already explained that no more than 25% of algorithms are truly 100%
algebraic. But about 80% cannot avoid using some algebra and close to 95%
could benefit from using algebra (even stochastic and monte carlo stuff).

So we are building a system that allows us to cut a developer's work by at
least 60% and make his work also more readable by 3000%. As far as I am
concerned, that fulfills the goal. And I am perfectly happy writing a mix
of engine-specific primitives and algebra.

That's why i am a bit skeptical about attempts to abstract non-algebraic
primitives such as row-wise aggregators in one of the pull requests.
Engine-specific primitives and algebra can perfectly co-exist in the guts.
And that's how i am doing my stuff in practice, except i now can skip 80%
effort on algebra and bridging incompatible inputs-outputs.

None of that means that R-like algebra cannot be engine agnostic. So people
are unhappy about not being able to write the whole thing in a totally agnostic way?
And so they (falsely) infer the pieces of their work cannot be helped by
agnosticism individually, or the tools are not being as good as they might
be without backend agnosticism? Sorry, but I fail to see the logic there.

We proved algebra can be agnostic. I don't think this notion should be
disputed.

And even if there were a shred of real benefit by making algebra tools
un-agnostic, it would not ever outweigh tons of good we could get for the
project by integrating with e.g. Flink folks. This is one of the points MLLib
will never be able to overcome -- to be a truly shared ML platform where
people could create and share ML, but not just a bunch of ad-hoc spaghetti
of distributed api calls and Spark-nailed black boxes.

Well yes methodology implementations will still have native distributed
calls. Just not nearly as many as they otherwise would, and will be much
easier to support on another back-end using Strategy patterns. E.g.
implicit feedback problem that i originally wrote as quasi-method for Spark
only, would've taken just an hour or so to add strategy for flink, since it
retains all in-core and distributed algebra work as is.

Not to mention benefit of single type pipelining.

And once we add hardware-accelerated bindings for in-core stuff, all these
methods would immediately benefit from it.

On MLLib interoperability issues,
=========================

well, let me ask you this: what it means to be MLLib-interoperable? is
MLLib even interoperable within itself?

E.g. i remember there was one most frequent request on the list here: how
can we cluster dimensionally-reduced data?

Let's look what it takes to do this in MLLib: First, we run tf-idf, which
produces collection of vectors (and where did our document ids go? not
sure); then we'd have to run svd or pca, both of which would accept
RowMatrix (bummer! but we have collection of vectors); which would produce
RowMatrix as well but kmeans training takes RDD of vectors (bummer again!).

Not directly pluggable, although semi-trivially or trivially convertible.
Plus strips off information that we potentially already have computed
earlier in the pipeline, so we'd need to compute it again. I think problem
is well demonstrated.
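
Roughly, with the 1.x mllib api (a sketch from memory, may not be exact) the glue ends up like this, and the document ids are gone by the first line:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.mllib.linalg.distributed.RowMatrix
  import org.apache.spark.mllib.clustering.KMeans

  def reduceAndCluster(tfidf: RDD[Vector], rank: Int, k: Int) = {
    val mat = new RowMatrix(tfidf)                  // tf-idf gave a bare RDD[Vector]; wrap it for PCA
    val pc = mat.computePrincipalComponents(rank)   // local matrix of principal components
    val reduced: RowMatrix = mat.multiply(pc)       // still a RowMatrix...
    KMeans.train(reduced.rows, k, 20)               // ...but kmeans wants an RDD[Vector] again
  }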

Or, say, ALS stuff (implicit als in particular) is really an algebraic
problem. Should be taking input in form of matrices (that my feature
extraction algebraic pipeline perhaps has just prepared) but really takes
POJOs. Bummer again.

So what it is exactly we should be interoperable with in this picture if
MLLib itself is not consistent?

Let's look at the type system in flux there:

we have
(1) collection of vectors,
(2) matrix of known dimensions for collection of vectors (row matrix),
(3) indexedRowMatrix which is matrix of known dimension with keys that can
be _only_ long; and
(4) unknown but not infinitesimal amount of POJO-oriented approaches.

But ok, let's constrain ourselves to matrix types only.

Multitude of matrix types creates problems for tasks that require
consistent key propagation (like  SVD or PCA or tf-idf, well demonstrated
in the case of mllib). In the aforementioned case of dimensionality
reduction over document collection, there's simply no way to propagate
document ids to the rows of dimensionally-reduced data. As in none at all.
as in hard no-work-around-exists stop.

So. There's truly no need for multiple incompatible matrix types. There has
to be just a single matrix type. Just a flexible one. And everything algebraic
needs to use it.

And if geometry is needed, then it could be either already known or lazily
computed, but if it is not needed, nobody bothers to compute it (i.e. there's
truly no need). And this knowledge should not be lost just because we have to
convert between types.

And if we want to express complex row keys such as for cluster assignments
for example (my real case) then we could have a type with keys like
Tuple2(rowKeyType, cluster-string).
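
e.g. with the spark bindings, a DRM keyed by arbitrary row ids is just a wrap over the keyed rdd (a sketch; assumes a Mahout distributed context is in scope, as in the shell):

  import org.apache.spark.rdd.RDD
  import org.apache.mahout.math.{Vector => MahoutVector}
  import org.apache.mahout.math.drm.DrmLike
  import org.apache.mahout.sparkbindings._

  // rows keyed by document id; the keys ride along through the algebraic pipeline
  def asDrm(rows: RDD[(String, MahoutVector)]): DrmLike[String] = drmWrap(rows)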

And that nobody really cares if intermediate results are really row- or
column-partitioned.

All within single type of things.

Bottom line, "interoperability" with mllib is both hard and trivial.

Trivial is because whenever you need to convert, it is one line of code and
also a trivial distributed map fusion element. (I do have pipelines
streaming mllib methods within DRM-based pipelines, not just speculating).

Hard is because there are so many types you may need/want to convert
between, so there's not much point to even try to write converters for all
possible cases but rather go on need-to-do basis.

It is also hard because their type system obviously continues evolving as
we speak. So no point chase the rabbit in the making.

Epilogue
=======
There's no problem with the philosophy of the distributed and
non-distributed algebra approach. It is incredibly useful in practice and I
have proven it continuously (what is in public domain is just tip of the
iceberg).

Rather, there's organizational anemia in the project. Like corporate legal
interests (that includes me not being able to do quick turnaround of
fixes), and not having been able to tap into university resources. But i
don't believe in any technical philosophy problem.

So given that aforementioned resource/logistical anemia, it will likely
take some time when it would seem it gets worse before it gets better. But
afaik there are multiple efforts going on behind the curtains to break red
tape. so i'd just wait a bit.

Re: Codebase refactoring proposal

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Just looked at Gokhan’s PR and am quoting him below (since Andrew would like the PR). We really need to support interchangeability of data and algorithms with Spark/MLlib/SparkQL, even if this breaks engine-neutrality and we adopt the lower level integration of data types. Why can’t we address this and become a better ML engine in the process?

=================================================================================================

The status is that I need to revise the code based on reviews.

But I have some concerns, summarized below:

Here is the story.

I'm going to contribute my recent work on distributed implementation of stochastic optimization to some open source library, and for me, the only reason that accumulating blocks matters is that I require it for averaging-based distributed stochastic gradient descent (DSGD).

I was an advocate of having Mahout as the ML and Matrix Computations core for distributed processing engines, and was thinking that the Matrix DSL would be sufficient for implementing such algorithms (such as DSGD) in an engine-agnostic way.

It seems that for implementing most optimization algorithms and ML models, one requires other-than-DSL operations. And those operations are highly engine-specific.

Repeating the aggregating operation in Mahout is duplicate work, just like MLlib's having some of Mahout's Matrix DSL capabilities duplicated in uglier ways. Plus, having an algorithm in Mahout but not in MLlib (or vice versa) really bothers me because other's users could not benefit.

Considering your recent codebase refactoring effort, @dlyubimov, I imagine the best way to use the DSL is by utilizing it inside MLlib (or whatever your favorite ML library is). That is, MLlib depends on Mahout Matrix-DSL implementation, Matrix I/O and computations are handled in Mahout, ML algorithms are handled in MLlib and/or other libraries.

Can we just slow this down and think about what should be contributed to where, and reconsider the ideal Mahout-Spark integration?

On Feb 4, 2015, at 10:37 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

btw good seq2sparse and seqdirectory ports are the only thing that
separates us from having a bigram/trigram based LSA tutorial.

On Wed, Feb 4, 2015 at 10:35 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> i think they are debating the details now, not the idea. Like how "NA" is
> different from "null" in classic dataframe representation etc.
> 
> On Wed, Feb 4, 2015 at 8:18 AM, Suneel Marthi <su...@gmail.com>
> wrote:
> 
>> I believe they are still debating about renaming SchemaRDD -> DataFrame.  I
>> must admit Dmitriy had suggested this to me a few months ago, reusing
>> SchemaRDD if possible. Dmitriy was right: "U told us".
>> 
>> On Wed, Feb 4, 2015 at 11:09 AM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>> 
>>> This sounds like a great idea but I wonder if we can get rid of Mahout
>> DRM
>>> as a native format. If we have DataFrames (have they actually renamed
>>> SchemaRDD?) backed DRMs we ideally don’t need Mahout native DRMs or
>>> IndexedDatasets, right? This would be a huge step! If we get data
>>> interchangeability with MLlib its a win. If we get general row and
>> column
>>> IDs that follow the data through math, its a win. Need to think through
>> how
>>> to use a DataFrame in a streaming case, probably through some
>> checkpointing
>>> of the window DStream—hmm.
>>> 
>>> On Feb 4, 2015, at 7:37 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>>> 
>>> 
>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>>> I'd suggest to consider this: remember all this talk about
>>>> language-integrated spark ql being basically dataframe manipulation
>> DSL?
>>>> 
>>>> so now Spark devs are noticing this generality as well and are
>> actually
>>>> proposing to rename SchemaRDD into DataFrame and make it mainstream
>> data
>>>> structure. (my "told you so" moment of sorts :)
>>>> 
>>>> What i am getting at, i'd suggest to make DRM and Spark's newly
>> renamed
>>>> DataFrame our two major structures. In particular, standardize on
>> using
>>>> DataFrame for things that may include non-numerical data and require
>> more
>>>> grace about column naming and manipulation. Maybe relevant to TF-IDF
>> work
>>>> when it deals with non-matrix content.
>>> Sounds like a worthy effort to me.  We'd be basically implementing an
>> API
>>> at the math-scala level for SchemaRDD/DataFrame data structures, correct?
>>> 
>>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>>>> Seems like seq2sparse would be really easy to replace since it takes
>>> text
>>>>> files to start with, then the whole pipeline could be kept in rdds.
>> The
>>>>> dictionaries and counts could be either in-memory maps or rdds for
>> use
>>> with
>>>>> joins? This would get rid of sequence files completely from the
>>> pipeline.
>>>>> Item similarity uses in-memory maps but the plan is to make it more
>>>>> scalable using joins as an alternative with the same API allowing the
>>> user
>>>>> to trade-off footprint for speed.
>>> 
>>> I think you're right- should be relatively easy.  I've been looking at
>>> porting seq2sparse to the DSL for a bit now and the stopper at the DSL
>> level
>>> is that we don't have a distributed data structure for strings.  Seems
>> like
>>> getting a DataFrame implemented as Dmitriy mentioned above would take
>> care
>>> of this problem.
>>> 
>>> The other issue i'm a little fuzzy on is the distributed collocation
>>> mapping - it's a part of the seq2sparse code that I've not spent too
>> much
>>> time in.
>>> 
>>> I think that this would be a very worthy effort as well - I believe
>>> seq2sparse is a particularly strong mahout feature.
>>> 
>>> I'll start another thread since we're now way off topic from the
>>> refactoring proposal.
>>>>> 
>>>>> My use for TF-IDF is for row similarity and would take a DRM
>> (actually
>>>>> IndexedDataset) and calculate row/doc similarities. It works now but
>>> only
>>>>> using LLR. This is OK when thinking of the items as tags or metadata
>> but
>>>>> for text tokens something like cosine may be better.
>>>>> 
>>>>> I’d imagine a downsampling phase that would precede TF-IDF using LLR
>> a
>>> lot
>>>>> like how CF preferences are downsampled. This would produce a
>>> sparsified
>>>>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight
>> the
>>>>> terms before row similarity uses cosine. This is not so good for
>> search
>>> but
>>>>> should produce much better similarities than Solr’s “moreLikeThis”
>> and
>>> does
>>>>> it for all pairs rather than one at a time.
>>>>> 
>>>>> In any case it can be used to create a personalized
>> content-based
>>>>> recommender or augment a CF recommender with one more indicator type.
>>>>> 
>>>>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com>
>> wrote:
>>>>> 
>>>>> 
>>>>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>>>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>>>>> Some issues WRT lower level Spark integration:
>>>>>>> 1) interoperability with Spark data. TF-IDF is one example I
>> actually
>>>>> looked at. There may be other things we can pick up from their
>>> committers
>>>>> since they have an abundance.
>>>>>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated
>> to
>>>>> me when someone on the Spark list asked about matrix transpose and an
>>> MLlib
>>>>> committer’s answer was something like “why would you want to do
>> that?”.
>>>>> Usually you don’t actually execute the transpose but they don’t even
>>>>> support A’A, AA’, or A’B, which are core to what I work on. At
>> present
>>> you
>>>>> pretty much have to choose between MLlib or Mahout for sparse matrix
>>> stuff.
>>>>> Maybe a half-way measure is some implicit conversions (ugh, I know).
>> If
>>> the
>>>>> DSL could interchange datasets with MLlib, people would be pointed to
>>> the
>>>>> DSL for all of a bunch of “why would you want to do that?” features.
>>> MLlib
>>>>> seems to be algorithms, not math.
>>>>>>> 3) integration of Streaming. DStreams support most of the RDD
>>>>> interface. Doing a batch recalc on a moving time window would nearly
>>> fall
>>>>> out of DStream backed DRMs. This isn’t the same as incremental
>> updates
>>> on
>>>>> streaming but it’s a start.
>>>>>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
>>>>> faster compute engines. So we jumped. Now the need is for streaming
>> and
>>>>> especially incrementally updated streaming. Seems like we need to
>>> address
>>>>> this.
>>>>>>> Andrew, regardless of the above having TF-IDF would be super
>>>>> helpful—row similarity for content/text would benefit greatly.
>>>>>>  I will put a PR up soon.
>>>>> Just to clarify, I'll be porting over the (very simple) TF, TFIDF
>>> classes
>>>>> and Weight interface over from mr-legacy to math-scala. They're
>>> available
>>>>> now in spark-shell but won't be after this refactoring.  These still
>>>>> require dictionary and a frequency count maps to vectorize incoming
>>> text-
>>>>> so they're more for use with the old MR seq2sparse and I don't think
>>> they
>>>>> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>>>>> Hopefully they'll be of some use.
>>>>> 
>>>>> On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>>>>>>> But first I need to do massive fixes and improvements to the
>>> distributed
>>>>>>> optimizer itself. Still waiting on green light for that.
>>>>>>> On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dl...@gmail.com>
>> wrote:
>>>>>>> 
>>>>>>>> On Feb 3, 2015 7:20 AM, "Pat Ferrel" <pa...@occamsmachete.com>
>> wrote:
>>>>>>>>> BTW what level of difficulty would making the DSL run on MLlib
>>> Vectors
>>>>>>>> and RowMatrix be? Looking at using their hashing TF-IDF but it
>> raises
>>>>>>>> impedance mismatch between DRM and MLlib RowMatrix. This would
>>> further
>>>>>>>> reduce artifact size by a bunch.
>>>>>>>> 
>>>>>>>> Short answer, if it were possible, I'd not bother with Mahout code
>>>>> base at
>>>>>>>> all. The problem is it lacks sufficient flexibility semantics and
>>>>>>>> abstraction. Breeze is indefinitely better in that department but
>> at
>>>>> the
>>>>>>>> time it was sufficiently worse on abstracting interoperability of
>>>>> matrices
>>>>>>>> with different structures. And mllib does not expose breeze.
>>>>>>>> 
>>>>>>>> Looking forward toward hardware accelerated bolt-on work I just
>> must
>>>>> say
>>>>>>>> after reading breeze code for some time I still have much clearer
>>> plan
>>>>> how
>>>>>>>> such back hybridization and cost calibration might work with
>> current
>>>>> Mahout
>>>>>>>> math abstractions than with breeze. It is also more in line with
>> my
>>>>> current
>>>>>>>> work tasks.
>>>>>>>> 
>>>>>>>>> Also backing something like a DRM with DStreams. Periodic model
>>> recalc
>>>>>>>> with streams is maybe the first step towards truly streaming
>> algos.
>>>>> Looking
>>>>>>>> at DStream -> DRM conversion for A’A, A’B, and AA’ in item and row
>>>>>>>> similarity. Attach Kafka and get evergreen models, if not
>>> incrementally
>>>>>>>> updating models.
>>>>>>>>> On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>>>> wrote:
>>>>>>>>> bottom line compile-time dependencies are satisfied with no extra
>>>>> stuff
>>>>>>>>> from mr-legacy or its transitives. This is proven by virtue of
>>>>>>>> successful
>>>>>>>>> compilation with no dependency on mr-legacy on the tree.
>>>>>>>>> 
>>>>>>>>> Runtime sufficiency for no extra dependency is proven via running
>>>>> shell
>>>>>>>> or
>>>>>>>>> embedded tests (unit tests) which are successful too. This
>> implies
>>>>>>>>> embedding and shell apis.
>>>>>>>>> 
>>>>>>>>> Issue with guava is typical one. if it were an issue, i wouldn't
>> be
>>>>> able
>>>>>>>> to
>>>>>>>>> compile and/or run stuff. Now, question is what do we do if
>> drivers
>>>>> want
>>>>>>>>> extra stuff that is not found in Spark.
>>>>>>>>> 
>>>>>>>>> Now, It is so nice not to depend on anything extra so i am
>> hesitant
>>> to
>>>>>>>>> offer anything  here. either shading or lib with opt-in
>> dependency
>>>>> policy
>>>>>>>>> would suffice though, since it doesn't look like we'd have to
>> have
>>>>> tons
>>>>>>>> of
>>>>>>>>> extra for drivers.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <
>> pat@occamsmachete.com
>>>> 
>>>>>>>> wrote:
>>>>>>>>>> I vaguely remember there being a Guava version problem where the
>>>>>>>> version
>>>>>>>>>> had to be rolled back in one of the hadoop modules. The
>> math-scala
>>>>>>>>>> IndexedDataset shouldn’t care about version.
>>>>>>>>>> 
>>>>>>>>>> BTW It seems pretty easy to take out the option parser and
>> replace
>>>>> with
>>>>>>>>>> match and tuples especially if we can extend the Scala App
>> class.
>>> It
>>>>>>>> might
>>>>>>>>>> actually simplify things since I can then use several case
>> classes
>>> to
>>>>>>>> hold
>>>>>>>>>> options (scopt needed one object), which in turn takes out all
>>> those
>>>>>>>> ugly
>>>>>>>>>> casts. I’ll take a look next time I’m in there.
>>>>>>>>>> 
>>>>>>>>>> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <
>> dlieu.7@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>> in 'spark' module it is overwritten with spark dependency, which
>>> also
>>>>>>>> comes
>>>>>>>>>> at the same version so happens. so should be fine with 1.1.x
>>>>>>>>>> 
>>>>>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
>>>>>>>>>> mahout-spark_2.10 ---
>>>>>>>>>> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
>>>>>>>>>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
>>>>>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  +-
>> org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
>>>>>>>>>> [INFO] |  |  |  +-
>> org.apache.commons:commons-math:jar:2.1:compile
>>>>>>>>>> [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
>>>>>>>>>> [INFO] |  |  |  +-
>>> commons-logging:commons-logging:jar:1.1.3:compile
>>>>>>>>>> [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
>>>>>>>>>> [INFO] |  |  |  +-
>>>>>>>>>> commons-configuration:commons-configuration:jar:1.6:compile
>>>>>>>>>> [INFO] |  |  |  |  +-
>>>>>>>>>> commons-collections:commons-collections:jar:3.2.1:compile
>>>>>>>>>> [INFO] |  |  |  |  +-
>>>>> commons-digester:commons-digester:jar:1.8:compile
>>>>>>>>>> [INFO] |  |  |  |  |  \-
>>>>>>>>>> commons-beanutils:commons-beanutils:jar:1.7.0:compile
>>>>>>>>>> [INFO] |  |  |  |  \-
>>>>>>>>>> commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
>>>>>>>>>> [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
>>>>>>>>>> [INFO] |  |  |  +-
>>>>> com.google.protobuf:protobuf-java:jar:2.5.0:compile
>>>>>>>>>> [INFO] |  |  |  +-
>> org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  |  \-
>>>>>>>> org.apache.commons:commons-compress:jar:1.4.1:compile
>>>>>>>>>> [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
>>>>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  +-
>>>>>>>>>> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  |  +-
>>>>>>>>>> 
>> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  |  |  +-
>>>>>>>>>> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  +-
>> javax.inject:javax.inject:jar:1:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  \-
>> aopalliance:aopalliance:jar:1.0:compile
>>>>>>>>>> [INFO] |  |  |  |  |  +-
>>>>>>>>>> 
>>>>>>>>>> 
>>>>> 
>>> 
>> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  +-
>>>>>>>>>> 
>>>>>>>>>> 
>>>>> 
>>> 
>> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  |  +-
>>>>>>>>>> javax.servlet:javax.servlet-api:jar:3.0.1:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  |  \-
>>>>>>>> com.sun.jersey:jersey-client:jar:1.9:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  \-
>>>>>>>> com.sun.jersey:jersey-grizzly2:jar:1.9:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |     +-
>>>>>>>>>> org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |     |  \-
>>>>>>>>>> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |     |     \-
>>>>>>>>>> org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |     |        \-
>>>>>>>>>> org.glassfish.external:management-api:jar:3.0.0-b012:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |     +-
>>>>>>>>>> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |     |  \-
>>>>>>>>>> org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |     +-
>>>>>>>>>> org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |     \-
>>>>>>>> org.glassfish:javax.servlet:jar:3.1:compile
>>>>>>>>>> [INFO] |  |  |  |  |  +-
>>> com.sun.jersey:jersey-server:jar:1.9:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  \-
>>>>> com.sun.jersey:jersey-core:jar:1.9:compile
>>>>>>>>>> [INFO] |  |  |  |  |  +-
>> com.sun.jersey:jersey-json:jar:1.9:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  +-
>>>>>>>> org.codehaus.jettison:jettison:jar:1.1:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  +-
>>>>>>>> com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  |  \-
>>>>>>>> javax.xml.bind:jaxb-api:jar:2.2.2:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  |     \-
>>>>>>>>>> javax.activation:activation:jar:1.1:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  +-
>>>>>>>>>> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
>>>>>>>>>> [INFO] |  |  |  |  |  |  \-
>>>>>>>>>> org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
>>>>>>>>>> [INFO] |  |  |  |  |  \-
>>>>>>>>>> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
>>>>>>>>>> [INFO] |  |  |  |  \-
>>>>>>>>>> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  |  \-
>>>>>>>>>> 
>> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  +-
>> org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  +-
>>>>>>>>>> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  |  \-
>>>>>>>> org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  +-
>>>>>>>>>> 
>>> org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  |  \-
>>>>> org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
>>>>>>>>>> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
>>>>>>>>>> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
>>>>>>>>>> [INFO] |  |  \-
>>> commons-httpclient:commons-httpclient:jar:3.1:compile
>>>>>>>>>> [INFO] |  +-
>> org.apache.curator:curator-recipes:jar:2.4.0:compile
>>>>>>>>>> [INFO] |  |  +-
>>>>> org.apache.curator:curator-framework:jar:2.4.0:compile
>>>>>>>>>> [INFO] |  |  |  \-
>>>>> org.apache.curator:curator-client:jar:2.4.0:compile
>>>>>>>>>> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
>>>>>>>>>> [INFO] |  |     \- jline:jline:jar:0.9.94:compile
>>>>>>>>>> [INFO] |  +-
>>>>> org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  |  +-
>>>>>>>>>> 
>>>>> 
>>> 
>> org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
>>>>>>>>>> [INFO] |  |  +-
>>>>>>>> org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  |  |  +-
>>>>>>>> org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  |  |  \-
>>>>>>>>>> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  |  \-
>>>>>>>> org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  |     \-
>>>>>>>>>> 
>>>>>>>>>> 
>>>>> 
>>> 
>> org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
>>>>>>>>>> [INFO] |  |        \-
>>>>>>>>>> 
>>>>> 
>> org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
>>>>>>>>>> [INFO] |  +-
>>>>>>>> org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  +-
>>>>> org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  +-
>>>>>>>> org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  |  +-
>>>>>>>>>> 
>>> org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
>>>>>>>>>> [INFO] |  |  +-
>>>>>>>>>> 
>> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  |  \-
>>>>>>>> org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  |     \-
>>>>>>>> org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
>>>>>>>>>> [INFO] |  +- com.google.guava:guava:jar:16.0:compile
>>>>>>>>>> d
>>>>>>>>>> 
>>>>>>>>>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <
>>> dlieu.7@gmail.com
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> looks like it is also requested by mahout-math, wonder what is
>>> using
>>>>>>>> it
>>>>>>>>>>> there.
>>>>>>>>>>> 
>>>>>>>>>>> At very least, it needs to be synchronized to the one currently
>>> used
>>>>>>>> by
>>>>>>>>>>> spark.
>>>>>>>>>>> 
>>>>>>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
>>>>>>>> mahout-hadoop
>>>>>>>>>>> ---
>>>>>>>>>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
>>>>>>>>>>> *[INFO] +-
>> org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
>>>>>>>>>>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
>>>>>>>>>>> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
>>>>>>>>>>> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
>>>>>>>>>>> [INFO] +-
>>>>>>>> org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
>>>>>>>>>>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>>>>>>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <
>>> pat@occamsmachete.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>> Looks like Guava is in Spark.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <
>> pat@occamsmachete.com>
>>>>>>>> wrote:
>>>>>>>>>>>> IndexedDataset uses Guava. Can’t tell from sure but it sounds
>>> like
>>>>>>>> this
>>>>>>>>>>>> would not be included since I think it was taken from the
>>> mrlegacy
>>>>>>>> jar.
>>>>>>>>>>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <
>>> dlieu.7@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>>>>> From: "Pat Ferrel" <pa...@occamsmachete.com>
>>>>>>>>>>>> Date: Jan 25, 2015 9:39 AM
>>>>>>>>>>>> Subject: Re: Codebase refactoring proposal
>>>>>>>>>>>> To: <de...@mahout.apache.org>
>>>>>>>>>>>> Cc:
>>>>>>>>>>>> 
>>>>>>>>>>>>> When you get a chance a PR would be good.
>>>>>>>>>>>> Yes, it would. And not just for that.
>>>>>>>>>>>> 
>>>>>>>>>>>>> As I understand it you are putting some class jars somewhere
>> in
>>>>> the
>>>>>>>>>>>> classpath. Where? How?
>>>>>>>>>>>> /bin/mahout
>>>>>>>>>>>> 
>>>>>>>>>>>> (Computes 2 different classpaths. See  'bin/mahout classpath'
>> vs.
>>>>>>>>>>>> 'bin/mahout -spark'.)
>>>>>>>>>>>> 
>>>>>>>>>>>> If i interpret current shell code there correctky, legacy path
>>>>> tries
>>>>>>>> to
>>>>>>>>>>>> use
>>>>>>>>>>>> examples assemblies if not packaged, or /lib if packaged. True
>>>>>>>>>> motivation
>>>>>>>>>>>> of that significantly predates 2010 and i suspect only Benson
>>> knows
>>>>>>>>>> whole
>>>>>>>>>>>> true intent there.
>>>>>>>>>>>> 
>>>>>>>>>>>> The spark path, which is really a quick hack of the script,
>> tries
>>>>> to
>>>>>>>> get
>>>>>>>>>>>> only selected mahout jars and locally instlalled spark
>> classpath
>>>>>>>> which i
>>>>>>>>>>>> guess is just the shaded spark jar in recent spark releases.
>> It
>>>>> also
>>>>>>>>>>>> apparently tries to include /libs/*, which is never compiled
>> in
>>>>>>>>>> unpackaged
>>>>>>>>>>>> version, and now i think it is a bug it is included  because
>>>>> /libs/*
>>>>>>>> is
>>>>>>>>>>>> apparently legacy packaging, and shouldnt be used  in spark
>> jobs
>>>>>>>> with a
>>>>>>>>>>>> wildcard. I cant beleive how lazy i am, i still did not find
>> time
>>>>> to
>>>>>>>>>>>> understand mahout build in all cases.
>>>>>>>>>>>> 
>>>>>>>>>>>> I am not even sure if packaged mahout will work with spark,
>>>>> honestly,
>>>>>>>>>>>> because of the /lib. Never tried that, since i mostly use
>>>>> application
>>>>>>>>>>>> embedding techniques.
>>>>>>>>>>>> 
>>>>>>>>>>>> The same solution may apply to adding external dependencies
>> and
>>>>>>>> removing
>>>>>>>>>>>> the assembly in the Spark module. Which would leave only one
>>> major
>>>>>>>> build
>>>>>>>>>>>> issue afaik.
>>>>>>>>>>>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <
>>> dlieu.7@gmail.com
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> No, no PR. Only experiment on private. But i believe i
>>>>> sufficiently
>>>>>>>>>>>> defined
>>>>>>>>>>>>> what i want to do in order to gauge if we may want to
>> advance it
>>>>>>>> some
>>>>>>>>>>>> time
>>>>>>>>>>>>> later. Goal is much lighter dependency for spark code.
>> Eliminate
>>>>>>>>>>>> everything
>>>>>>>>>>>>> that is not compile-time dependent. (and a lot of it is thru
>>>>> legacy
>>>>>>>> MR
>>>>>>>>>>>> code
>>>>>>>>>>>>> which we of course don't use).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cant say i understand the remaining issues you are talking
>> about
>>>>>>>>>> though.
>>>>>>>>>>>>> If you are talking about compiling lib or shaded assembly,
>> no,
>>>>> this
>>>>>>>>>>>> doesn't
>>>>>>>>>>>>> do anything about it. Although point is, as it stands, the
>>> algebra
>>>>>>>> and
>>>>>>>>>>>>> shell don't have any external dependencies but spark and
>> these 4
>>>>>>>> (5?)
>>>>>>>>>>>>> mahout jars so they technically don't even need an assembly
>> (as
>>>>>>>>>>>>> demonstrated).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> As i said, it seems driver code is the only one that may need
>>> some
>>>>>>>>>>>> external
>>>>>>>>>>>>> dependencies, but that's a different scenario from those i am
>>>>>>>> talking
>>>>>>>>>>>>> about. But i am relatively happy with having the first two
>>> working
>>>>>>>>>>>> nicely
>>>>>>>>>>>>> at this point.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <
>>>>> pat@occamsmachete.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Is there a PR? You mention a "tiny mahout-hadoop” module. It
>>>>> would
>>>>>>>> be
>>>>>>>>>>>> nice
>>>>>>>>>>>>>> to see how you’ve structured that in case we can use the
>> same
>>>>>>>> model to
>>>>>>>>>>>>>> solve the two remaining refactoring issues.
>>>>>>>>>>>>>> 1) external dependencies in the spark module
>>>>>>>>>>>>>> 2) no spark or h2o in the release artifacts.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn <
>> squinn@gatech.edu>
>>>>>>>> wrote:
>>>>>>>>>>>>>> Also +1
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> iPhone'd
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo <
>> ap.dev@outlook.com
>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> <div>-------- Original message --------</div><div>From:
>>> Dmitriy
>>>>>>>>>>>> Lyubimov
>>>>>>>>>>>>>> <dl...@gmail.com> </div><div>Date:01/23/2015  6:06 PM
>>>>>>>> (GMT-05:00)
>>>>>>>>>>>>>> </div><div>To: dev@mahout.apache.org </div><div>Subject:
>>>>> Codebase
>>>>>>>>>>>>>> refactoring proposal </div><div>
>>>>>>>>>>>>>>> </div>
>>>>>>>>>>>>>>> So right now mahout-spark depends on mr-legacy.
>>>>>>>>>>>>>>> I did quick refactoring and it turns out it only
>> _irrevocably_
>>>>>>>>>> depends
>>>>>>>>>>>> on
>>>>>>>>>>>>>>> the following classes there:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> MatrixWritable, VectorWritable, Varint/Varlong and
>>>>> VarintWritable,
>>>>>>>>>> and
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>> *sigh* o.a.m.common.Pair
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> So  I just dropped those five classes into new a new tiny
>>>>>>>>>>>> mahout-hadoop
>>>>>>>>>>>>>>> module (to signify stuff that is directly relevant to
>>>>> serializing
>>>>>>>>>>>> thigns
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> DFS API) and completely removed mrlegacy and its transients
>>> from
>>>>>>>>>> spark
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> spark-shell dependencies.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> So non-cli applications (shell scripts and embedded api
>> use)
>>>>>>>> actually
>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>> need spark dependencies (which come from SPARK_HOME
>> classpath,
>>>>> of
>>>>>>>>>>>> course)
>>>>>>>>>>>>>>> and mahout jars (mahout-spark, mahout-math(-scala),
>>>>> mahout-hadoop
>>>>>>>> and
>>>>>>>>>>>>>>> optionally mahout-spark-shell (for running shell)).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This of course still doesn't address driver problems that
>> want
>>>>> to
>>>>>>>>>>>> throw
>>>>>>>>>>>>>>> more stuff into front-end classpath (such as cli parser)
>> but
>>> at
>>>>>>>> least
>>>>>>>>>>>> it
>>>>>>>>>>>>>>> renders transitive luggage of mr-legacy (and the size of
>>>>>>>>>>>> worker-shipped
>>>>>>>>>>>>>>> jars) much more tolerable.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> How does that sound?
>>>>>>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>>> 
>> 
> 
> 


Re: Codebase refactoring proposal

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
btw good seq2sparse and seqdirectory ports are the only thing that
separates us from having a bigram/trigram-based LSA tutorial.
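Once those ports exist, the LSA step itself is already covered by the DSL. A rough sketch, assuming the dssvd signature from the Scala bindings and a hypothetical drmTfidf matrix coming out of the ported pipeline:

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.decompositions._

    // drmTfidf: docs x n-gram tf-idf matrix from a seq2sparse-style port
    def lsa(drmTfidf: DrmLike[Int], k: Int = 100) = {
      // distributed stochastic SVD: U is the document embedding, V the term space
      val (drmU, drmV, s) = dssvd(drmTfidf, k = k, q = 1)
      (drmU, drmV, s)
    }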

On Wed, Feb 4, 2015 at 10:35 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> i think they are debating the details now, not the idea. Like how "NA" is
> different from "null" in classic dataframe representation etc.
>
> On Wed, Feb 4, 2015 at 8:18 AM, Suneel Marthi <su...@gmail.com>
> wrote:
>
>> I believe they are still debating renaming SchemaRDD -> DataFrame. I
>> must admit Dmitriy had suggested this to me a few months ago, reusing
>> SchemaRDD if possible. Dmitriy was right: "you told us".
>>
>> On Wed, Feb 4, 2015 at 11:09 AM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>
>> > This sounds like a great idea but I wonder if we can get rid of Mahout
>> DRM
>> > as a native format. If we have DataFrames (have they actually renamed
>> > SchemaRDD?) backed DRMs we ideally don’t need Mahout native DRMs or
>> > IndexedDatasets, right? This would be a huge step! If we get data
>> > interchangeability with MLlib its a win. If we get general row and
>> column
>> > IDs that follow the data through math, its a win. Need to think through
>> how
>> > to use a DataFrame in a streaming case, probably through some
>> checkpointing
>> > of the window DStream—hmm.
>> >
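As a rough illustration of that impedance mismatch, a bridge from the row-wise RDD[(Int, Vector)] form a Spark-backed DRM boils down to, over to an MLlib RowMatrix, could look like the sketch below (assumed inputs and names, not an existing Mahout API). Note the row keys get dropped on the way, which is exactly the metadata a DataFrame-backed DRM would keep:

    import scala.collection.JavaConversions._
    import org.apache.mahout.math.{Vector => MahoutVector}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    def toRowMatrix(drmRows: RDD[(Int, MahoutVector)], ncol: Int): RowMatrix = {
      val mllibRows = drmRows.map { case (_, v) =>
        // copy the non-zero elements into an MLlib sparse vector
        Vectors.sparse(ncol, v.nonZeroes().map(e => (e.index, e.get)).toSeq)
      }
      new RowMatrix(mllibRows)   // row keys (e.g. doc IDs) are lost here
    }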
>> > On Feb 4, 2015, at 7:37 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>> >
>> >
>> > On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>> > > I'd suggest to consider this: remember all this talk about
>> > > language-integrated spark ql being basically dataframe manipulation
>> DSL?
>> > >
>> > > so now Spark devs are noticing this generality as well and are
>> actually
>> > > proposing to rename SchemaRDD into DataFrame and make it mainstream
>> data
>> > > structure. (my "told you so" moment of sorts :)
>> > >
>> > > What i am getting at, i'd suggest to make DRM and Spark's newly
>> renamed
>> > > DataFrame our two major structures. In particular, standardize on
>> using
>> > > DataFrame for things that may include non-numerical data and require
>> more
>> > > grace about column naming and manipulation. Maybe relevant to TF-IDF
>> work
>> > > when it deals with non-matrix content.
>> > Sounds like a worthy effort to me.  We'd be basically implementing an
>> API
>> > at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>> >
>> > On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>> > >> Seems like seq2sparse would be really easy to replace since it takes
>> > text
>> > >> files to start with, then the whole pipeline could be kept in rdds.
>> The
>> > >> dictionaries and counts could be either in-memory maps or rdds for
>> use
>> > with
>> > >> joins? This would get rid of sequence files completely from the
>> > pipeline.
>> > >> Item similarity uses in-memory maps but the plan is to make it more
>> > >> scalable using joins as an alternative with the same API allowing the
>> > user
>> > >> to trade-off footprint for speed.
>> >
>> > I think you're right- should be relatively easy.  I've been looking at
>> > porting seq2sparse to the DSL for a bit now and the stopper at the DSL
>> level
>> > is that we don't have a distributed data structure for strings. Seems
>> like
>> > getting a DataFrame implemented as Dmitriy mentioned above would take
>> care
>> > of this problem.
>> >
>> > The other issue i'm a little fuzzy on  is the distributed collocation
>> > mapping-  it's a part of the seq2sparse code that I've not spent too
>> much
>> > time in.
>> >
>> > I think that this would be a very worthy effort as well - I believe
>> > seq2sparse is a particularly strong Mahout feature.
>> >
>> > I'll start another thread since we're now way off topic from the
>> > refactoring proposal.
>> > >>
>> > >> My use for TF-IDF is for row similarity and would take a DRM
>> (actually
>> > >> IndexedDataset) and calculate row/doc similarities. It works now but
>> > only
>> > >> using LLR. This is OK when thinking of the items as tags or metadata
>> but
>> > >> for text tokens something like cosine may be better.
>> > >>
>> > >> I’d imagine a downsampling phase that would precede TF-IDF using LLR
>> a
>> > lot
>> > >> like how CF preferences are downsampled. This would produce a
>> > sparsified
>> > >> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight
>> the
>> > >> terms before row similarity uses cosine. This is not so good for
>> search
>> > but
>> > >> should produce much better similarities than Solr’s “moreLikeThis”
>> and
>> > does
>> > >> it for all pairs rather than one at a time.
>> > >>
>> > >> In any case it can be used to create a personalized
>> content-based
>> > >> recommender or augment a CF recommender with one more indicator type.
>> > >>
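For the last step, assuming drmTfidf already carries the re-weighted terms after the LLR downsampling pass, all-pairs cosine row similarity in the DSL is just A %*% A.t over length-normalized rows; a sketch:

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    def cosineRowSimilarity(drmTfidf: DrmLike[Int]): DrmLike[Int] = {
      // normalize every row to unit length, block by block
      val drmNorm = drmTfidf.mapBlock() { case (keys, block) =>
        for (r <- 0 until block.rowSize()) {
          val len = block(r, ::).norm(2)
          if (len > 0) block(r, ::) := block(r, ::) / len
        }
        keys -> block
      }
      drmNorm %*% drmNorm.t   // S(i, j) = cosine similarity of docs i and j
    }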
>> > >> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com>
>> wrote:
>> > >>
>> > >>
>> > >> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>> > >>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>> > >>>> Some issues WRT lower level Spark integration:
>> > >>>> 1) interoperability with Spark data. TF-IDF is one example I
>> actually
>> > >> looked at. There may be other things we can pick up from their
>> > committers
>> > >> since they have an abundance.
>> > >>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated
>> to
>> > >> me when someone on the Spark list asked about matrix transpose and an
>> > MLlib
>> > >> committer’s answer was something like “why would you want to do
>> that?”.
>> > >> Usually you don’t actually execute the transpose but they don’t even
>> > >> support A’A, AA’, or A’B, which are core to what I work on. At
>> present
>> > you
>> > >> pretty much have to choose between MLlib or Mahout for sparse matrix
>> > stuff.
>> > >> Maybe a half-way measure is some implicit conversions (ugh, I know).
>> If
>> > the
>> > >> DSL could interchange datasets with MLlib, people would be pointed to
>> > the
>> > >> DSL for all of a bunch of “why would you want to do that?” features.
>> > MLlib
>> > >> seems to be algorithms, not math.
>> > >>>> 3) integration of Streaming. DStreams support most of the RDD
>> > >> interface. Doing a batch recalc on a moving time window would nearly
>> > fall
>> > >> out of DStream backed DRMs. This isn’t the same as incremental
>> updates
>> > on
>> > >> streaming but it’s a start.
>> > >>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
>> > >> faster compute engines. So we jumped. Now the need is for streaming
>> and
>> > >> especially incrementally updated streaming. Seems like we need to
>> > address
>> > >> this.
>> > >>>> Andrew, regardless of the above having TF-IDF would be super
>> > >> helpful—row similarity for content/text would benefit greatly.
>> > >>>   I will put a PR up soon.
>> > >> Just to clarify, I'll be porting over the (very simple) TF, TFIDF
>> > classes
>> > >> and Weight interface over from mr-legacy to math-scala. They're
>> > available
>> > >> now in spark-shell but won't be after this refactoring.  These still
>> > >> require dictionary and a frequency count maps to vectorize incoming
>> > text-
>> > >> so they're more for use with the old MR seq2sparse and I don't think
>> > they
>> > >> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>> > >> Hopefully they'll be of some use.
>> > >>
>> > >> On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> > >>>> But first I need to do massive fixes and improvements to the
>> > distributed
>> > >>>> optimizer itself. Still waiting on green light for that.
>> > >>>> On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dl...@gmail.com>
>> wrote:
>> > >>>>
>> > >>>>> On Feb 3, 2015 7:20 AM, "Pat Ferrel" <pa...@occamsmachete.com>
>> wrote:
>> > >>>>>> BTW what level of difficulty would making the DSL run on MLlib
>> > Vectors
>> > >>>>> and RowMatrix be? Looking at using their hashing TF-IDF but it
>> raises
>> > >>>>> impedance mismatch between DRM and MLlib RowMatrix. This would
>> > further
>> > >>>>> reduce artifact size by a bunch.
>> > >>>>>
>> > >>>>> Short answer, if it were possible, I'd not bother with Mahout code
>> > >> base at
>> > >>>>> all. The problem is it lacks sufficient flexibility semantics and
>> > >>>>> abstruction. Breeze is indefinitely better in that department but
>> at
>> > >> the
>> > >>>>> time it was sufficiently worse on abstracting interoperability of
>> > >> matrices
>> > >>>>> with different structures. And mllib does not expose breeze.
>> > >>>>>
>> > >>>>> Looking forward toward hardware acellerated bolt-on work I just
>> must
>> > >> say
>> > >>>>> after reading breeze code for some time I still have much clearer
>> > plan
>> > >> how
>> > >>>>> such back hybridization and cost calibration might work with
>> current
>> > >> Mahout
>> > >>>>> math abstractions than with breeze. It is also more in line with
>> my
>> > >> current
>> > >>>>> work tasks.
>> > >>>>>
>> > >>>>>> Also backing something like a DRM with DStreams. Periodic model
>> > recalc
>> > >>>>> with streams is maybe the first step towards truly streaming
>> algos.
>> > >> Looking
>> > >>>>> at DStream -> DRM conversion for A’A, A’B, and AA’ in item and row
>> > >>>>> similarity. Attach Kafka and get evergreen models, if not
>> > incrementally
>> > >>>>> updating models.
>> > >>>>>> On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> > >> wrote:
>> > >>>>>> bottom line compile-time dependencies are satisfied with no extra
>> > >> stuff
>> > >>>>>> from mr-legacy or its transitives. This is proven by virtue of
>> > >>>>> successful
>> > >>>>>> compilation with no dependency on mr-legacy on the tree.
>> > >>>>>>
>> > >>>>>> Runtime sufficiency for no extra dependency is proven via running
>> > >> shell
>> > >>>>> or
>> > >>>>>> embedded tests (unit tests) which are successful too. This
>> implies
>> > >>>>>> embedding and shell apis.
>> > >>>>>>
>> > >>>>>> Issue with guava is typical one. if it were an issue, i wouldn't
>> be
>> > >> able
>> > >>>>> to
>> > >>>>>> compile and/or run stuff. Now, question is what do we do if
>> drivers
>> > >> want
>> > >>>>>> extra stuff that is not found in Spark.
>> > >>>>>>
>> > >>>>>> Now, It is so nice not to depend on anything extra so i am
>> hesitant
>> > to
>> > >>>>>> offer anything  here. either shading or lib with opt-in
>> dependency
>> > >> policy
>> > >>>>>> would suffice though, since it doesn't look like we'd have to
>> have
>> > >> tons
>> > >>>>> of
>> > >>>>>> extra for drivers.
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <
>> pat@occamsmachete.com
>> > >
>> > >>>>> wrote:
>> > >>>>>>> I vaguely remember there being a Guava version problem where the
>> > >>>>> version
>> > >>>>>>> had to be rolled back in one of the hadoop modules. The
>> math-scala
>> > >>>>>>> IndexedDataset shouldn’t care about version.
>> > >>>>>>>
>> > >>>>>>> BTW It seems pretty easy to take out the option parser and
>> replace
>> > >> with
>> > >>>>>>> match and tuples especially if we can extend the Scala App
>> class.
>> > It
>> > >>>>> might
>> > >>>>>>> actually simplify things since I can then use several case
>> classes
>> > to
>> > >>>>> hold
>> > >>>>>>> options (scopt needed one object), which in turn takes out all
>> > those
>> > >>>>> ugly
>> > >>>>>>> casts. I’ll take a look next time I’m in there.
>> > >>>>>>>
>> > >>>>>>> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <
>> dlieu.7@gmail.com>
>> > >>>>> wrote:
>> > >>>>>>> in 'spark' module it is overwritten with spark dependency, which
>> > also
>> > >>>>> comes
>> > >>>>>>> at the same version so happens. so should be fine with 1.1.x
>> > >>>>>>>
>> > >>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
>> > >>>>>>> mahout-spark_2.10 ---
>> > >>>>>>> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
>> > >>>>>>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
>> > >>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>> > >>>>>>> [INFO] |  |  +-
>> org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>> > >>>>>>> [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
>> > >>>>>>> [INFO] |  |  |  +-
>> org.apache.commons:commons-math:jar:2.1:compile
>> > >>>>>>> [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
>> > >>>>>>> [INFO] |  |  |  +-
>> > commons-logging:commons-logging:jar:1.1.3:compile
>> > >>>>>>> [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
>> > >>>>>>> [INFO] |  |  |  +-
>> > >>>>>>> commons-configuration:commons-configuration:jar:1.6:compile
>> > >>>>>>> [INFO] |  |  |  |  +-
>> > >>>>>>> commons-collections:commons-collections:jar:3.2.1:compile
>> > >>>>>>> [INFO] |  |  |  |  +-
>> > >> commons-digester:commons-digester:jar:1.8:compile
>> > >>>>>>> [INFO] |  |  |  |  |  \-
>> > >>>>>>> commons-beanutils:commons-beanutils:jar:1.7.0:compile
>> > >>>>>>> [INFO] |  |  |  |  \-
>> > >>>>>>> commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
>> > >>>>>>> [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
>> > >>>>>>> [INFO] |  |  |  +-
>> > >> com.google.protobuf:protobuf-java:jar:2.5.0:compile
>> > >>>>>>> [INFO] |  |  |  +-
>> org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
>> > >>>>>>> [INFO] |  |  |  \-
>> > >>>>> org.apache.commons:commons-compress:jar:1.4.1:compile
>> > >>>>>>> [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
>> > >>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
>> > >>>>>>> [INFO] |  |  +-
>> > >>>>>>> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
>> > >>>>>>> [INFO] |  |  |  +-
>> > >>>>>>>
>> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
>> > >>>>>>> [INFO] |  |  |  |  +-
>> > >>>>>>> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
>> > >>>>>>> [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |  +-
>> javax.inject:javax.inject:jar:1:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |  \-
>> aopalliance:aopalliance:jar:1.0:compile
>> > >>>>>>> [INFO] |  |  |  |  |  +-
>> > >>>>>>>
>> > >>>>>>>
>> > >>
>> >
>> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |  +-
>> > >>>>>>>
>> > >>>>>>>
>> > >>
>> >
>> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |  |  +-
>> > >>>>>>> javax.servlet:javax.servlet-api:jar:3.0.1:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |  |  \-
>> > >>>>> com.sun.jersey:jersey-client:jar:1.9:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |  \-
>> > >>>>> com.sun.jersey:jersey-grizzly2:jar:1.9:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |     +-
>> > >>>>>>> org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |     |  \-
>> > >>>>>>> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |     |     \-
>> > >>>>>>> org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |     |        \-
>> > >>>>>>> org.glassfish.external:management-api:jar:3.0.0-b012:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |     +-
>> > >>>>>>> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |     |  \-
>> > >>>>>>> org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |     +-
>> > >>>>>>> org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |     \-
>> > >>>>> org.glassfish:javax.servlet:jar:3.1:compile
>> > >>>>>>> [INFO] |  |  |  |  |  +-
>> > com.sun.jersey:jersey-server:jar:1.9:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |  \-
>> > >> com.sun.jersey:jersey-core:jar:1.9:compile
>> > >>>>>>> [INFO] |  |  |  |  |  +-
>> com.sun.jersey:jersey-json:jar:1.9:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |  +-
>> > >>>>> org.codehaus.jettison:jettison:jar:1.1:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |  +-
>> > >>>>> com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |  |  \-
>> > >>>>> javax.xml.bind:jaxb-api:jar:2.2.2:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |  |     \-
>> > >>>>>>> javax.activation:activation:jar:1.1:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |  +-
>> > >>>>>>> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
>> > >>>>>>> [INFO] |  |  |  |  |  |  \-
>> > >>>>>>> org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
>> > >>>>>>> [INFO] |  |  |  |  |  \-
>> > >>>>>>> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
>> > >>>>>>> [INFO] |  |  |  |  \-
>> > >>>>>>> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
>> > >>>>>>> [INFO] |  |  |  \-
>> > >>>>>>>
>> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
>> > >>>>>>> [INFO] |  |  +-
>> org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
>> > >>>>>>> [INFO] |  |  +-
>> > >>>>>>> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
>> > >>>>>>> [INFO] |  |  |  \-
>> > >>>>> org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
>> > >>>>>>> [INFO] |  |  +-
>> > >>>>>>>
>> > org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
>> > >>>>>>> [INFO] |  |  \-
>> > >> org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
>> > >>>>>>> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
>> > >>>>>>> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
>> > >>>>>>> [INFO] |  |  \-
>> > commons-httpclient:commons-httpclient:jar:3.1:compile
>> > >>>>>>> [INFO] |  +-
>> org.apache.curator:curator-recipes:jar:2.4.0:compile
>> > >>>>>>> [INFO] |  |  +-
>> > >> org.apache.curator:curator-framework:jar:2.4.0:compile
>> > >>>>>>> [INFO] |  |  |  \-
>> > >> org.apache.curator:curator-client:jar:2.4.0:compile
>> > >>>>>>> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
>> > >>>>>>> [INFO] |  |     \- jline:jline:jar:0.9.94:compile
>> > >>>>>>> [INFO] |  +-
>> > >> org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
>> > >>>>>>> [INFO] |  |  +-
>> > >>>>>>>
>> > >>
>> >
>> org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
>> > >>>>>>> [INFO] |  |  +-
>> > >>>>> org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
>> > >>>>>>> [INFO] |  |  |  +-
>> > >>>>> org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
>> > >>>>>>> [INFO] |  |  |  \-
>> > >>>>>>> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
>> > >>>>>>> [INFO] |  |  \-
>> > >>>>> org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
>> > >>>>>>> [INFO] |  |     \-
>> > >>>>>>>
>> > >>>>>>>
>> > >>
>> >
>> org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
>> > >>>>>>> [INFO] |  |        \-
>> > >>>>>>>
>> > >>
>> org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
>> > >>>>>>> [INFO] |  +-
>> > >>>>> org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
>> > >>>>>>> [INFO] |  +-
>> > >> org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
>> > >>>>>>> [INFO] |  +-
>> > >>>>> org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
>> > >>>>>>> [INFO] |  |  +-
>> > >>>>>>>
>> > org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
>> > >>>>>>> [INFO] |  |  +-
>> > >>>>>>>
>> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
>> > >>>>>>> [INFO] |  |  \-
>> > >>>>> org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
>> > >>>>>>> [INFO] |  |     \-
>> > >>>>> org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
>> > >>>>>>> [INFO] |  +- com.google.guava:guava:jar:16.0:compile
>> > >>>>>>> d
>> > >>>>>>>
>> > >>>>>>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <
>> > dlieu.7@gmail.com
>> > >>>>>>> wrote:
>> > >>>>>>>
>> > >>>>>>>> looks like it is also requested by mahout-math, wonder what is
>> > using
>> > >>>>> it
>> > >>>>>>>> there.
>> > >>>>>>>>
>> > >>>>>>>> At very least, it needs to be synchronized to the one currently
>> > used
>> > >>>>> by
>> > >>>>>>>> spark.
>> > >>>>>>>>
>> > >>>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
>> > >>>>> mahout-hadoop
>> > >>>>>>>> ---
>> > >>>>>>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
>> > >>>>>>>> *[INFO] +-
>> org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
>> > >>>>>>>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
>> > >>>>>>>> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
>> > >>>>>>>> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
>> > >>>>>>>> [INFO] +-
>> > >>>>> org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
>> > >>>>>>>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>> > >>>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <
>> > pat@occamsmachete.com>
>> > >>>>>>> wrote:
>> > >>>>>>>>> Looks like Guava is in Spark.
>> > >>>>>>>>>
>> > >>>>>>>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <
>> pat@occamsmachete.com>
>> > >>>>> wrote:
>> > >>>>>>>>> IndexedDataset uses Guava. Can’t tell from sure but it sounds
>> > like
>> > >>>>> this
>> > >>>>>>>>> would not be included since I think it was taken from the
>> > mrlegacy
>> > >>>>> jar.
>> > >>>>>>>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <
>> > dlieu.7@gmail.com>
>> > >>>>>>> wrote:
>> > >>>>>>>>> ---------- Forwarded message ----------
>> > >>>>>>>>> From: "Pat Ferrel" <pa...@occamsmachete.com>
>> > >>>>>>>>> Date: Jan 25, 2015 9:39 AM
>> > >>>>>>>>> Subject: Re: Codebase refactoring proposal
>> > >>>>>>>>> To: <de...@mahout.apache.org>
>> > >>>>>>>>> Cc:
>> > >>>>>>>>>
>> > >>>>>>>>>> When you get a chance a PR would be good.
>> > >>>>>>>>> Yes, it would. And not just for that.
>> > >>>>>>>>>
>> > >>>>>>>>>> As I understand it you are putting some class jars somewhere
>> in
>> > >> the
>> > >>>>>>>>> classpath. Where? How?
>> > >>>>>>>>> /bin/mahout
>> > >>>>>>>>>
>> > >>>>>>>>> (Computes 2 different classpaths. See  'bin/mahout classpath'
>> vs.
>> > >>>>>>>>> 'bin/mahout -spark'.)
>> > >>>>>>>>>
>> > >>>>>>>>> If i interpret current shell code there correctky, legacy path
>> > >> tries
>> > >>>>> to
>> > >>>>>>>>> use
>> > >>>>>>>>> examples assemblies if not packaged, or /lib if packaged. True
>> > >>>>>>> motivation
>> > >>>>>>>>> of that significantly predates 2010 and i suspect only Benson
>> > knows
>> > >>>>>>> whole
>> > >>>>>>>>> true intent there.
>> > >>>>>>>>>
>> > >>>>>>>>> The spark path, which is really a quick hack of the script,
>> tries
>> > >> to
>> > >>>>> get
>> > >>>>>>>>> only selected mahout jars and locally instlalled spark
>> classpath
>> > >>>>> which i
>> > >>>>>>>>> guess is just the shaded spark jar in recent spark releases.
>> It
>> > >> also
>> > >>>>>>>>> apparently tries to include /libs/*, which is never compiled
>> in
>> > >>>>>>> unpackaged
>> > >>>>>>>>> version, and now i think it is a bug it is included  because
>> > >> /libs/*
>> > >>>>> is
>> > >>>>>>>>> apparently legacy packaging, and shouldnt be used  in spark
>> jobs
>> > >>>>> with a
>> > >>>>>>>>> wildcard. I cant beleive how lazy i am, i still did not find
>> time
>> > >> to
>> > >>>>>>>>> understand mahout build in all cases.
>> > >>>>>>>>>
>> > >>>>>>>>> I am not even sure if packaged mahout will work with spark,
>> > >> honestly,
>> > >>>>>>>>> because of the /lib. Never tried that, since i mostly use
>> > >> application
>> > >>>>>>>>> embedding techniques.
>> > >>>>>>>>>
>> > >>>>>>>>> The same solution may apply to adding external dependencies
>> and
>> > >>>>> removing
>> > >>>>>>>>> the assembly in the Spark module. Which would leave only one
>> > major
>> > >>>>> build
>> > >>>>>>>>> issue afaik.
>> > >>>>>>>>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <
>> > dlieu.7@gmail.com
>> > >>>>>>>>> wrote:
>> > >>>>>>>>>> No, no PR. Only experiment on private. But i believe i
>> > >> sufficiently
>> > >>>>>>>>> defined
>> > >>>>>>>>>> what i want to do in order to gauge if we may want to
>> advance it
>> > >>>>> some
>> > >>>>>>>>> time
>> > >>>>>>>>>> later. Goal is much lighter dependency for spark code.
>> Eliminate
>> > >>>>>>>>> everything
>> > >>>>>>>>>> that is not compile-time dependent. (and a lot of it is thru
>> > >> legacy
>> > >>>>> MR
>> > >>>>>>>>> code
>> > >>>>>>>>>> which we of course don't use).
>> > >>>>>>>>>>
>> > >>>>>>>>>> Cant say i understand the remaining issues you are talking
>> about
>> > >>>>>>> though.
>> > >>>>>>>>>> If you are talking about compiling lib or shaded assembly,
>> no,
>> > >> this
>> > >>>>>>>>> doesn't
>> > >>>>>>>>>> do anything about it. Although point is, as it stands, the
>> > algebra
>> > >>>>> and
>> > >>>>>>>>>> shell don't have any external dependencies but spark and
>> these 4
>> > >>>>> (5?)
>> > >>>>>>>>>> mahout jars so they technically don't even need an assembly
>> (as
>> > >>>>>>>>>> demonstrated).
>> > >>>>>>>>>>
>> > >>>>>>>>>> As i said, it seems driver code is the only one that may need
>> > some
>> > >>>>>>>>> external
>> > >>>>>>>>>> dependencies, but that's a different scenario from those i am
>> > >>>>> talking
>> > >>>>>>>>>> about. But i am relatively happy with having the first two
>> > working
>> > >>>>>>>>> nicely
>> > >>>>>>>>>> at this point.
>> > >>>>>>>>>>
>> > >>>>>>>>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <
>> > >> pat@occamsmachete.com>
>> > >>>>>>>>> wrote:
>> > >>>>>>>>>>> +1
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> Is there a PR? You mention a "tiny mahout-hadoop” module. It
>> > >> would
>> > >>>>> be
>> > >>>>>>>>> nice
>> > >>>>>>>>>>> to see how you’ve structured that in case we can use the
>> same
>> > >>>>> model to
>> > >>>>>>>>>>> solve the two remaining refactoring issues.
>> > >>>>>>>>>>> 1) external dependencies in the spark module
>> > >>>>>>>>>>> 2) no spark or h2o in the release artifacts.
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn <
>> squinn@gatech.edu>
>> > >>>>> wrote:
>> > >>>>>>>>>>> Also +1
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> iPhone'd
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo <
>> ap.dev@outlook.com
>> > >
>> > >>>>>>> wrote:
>> > >>>>>>>>>>>> +1
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> <div>-------- Original message --------</div><div>From:
>> > Dmitriy
>> > >>>>>>>>> Lyubimov
>> > >>>>>>>>>>> <dl...@gmail.com> </div><div>Date:01/23/2015  6:06 PM
>> > >>>>> (GMT-05:00)
>> > >>>>>>>>>>> </div><div>To: dev@mahout.apache.org </div><div>Subject:
>> > >> Codebase
>> > >>>>>>>>>>> refactoring proposal </div><div>
>> > >>>>>>>>>>>> </div>
>> > >>>>>>>>>>>> So right now mahout-spark depends on mr-legacy.
>> > >>>>>>>>>>>> I did quick refactoring and it turns out it only
>> _irrevocably_
>> > >>>>>>> depends
>> > >>>>>>>>> on
>> > >>>>>>>>>>>> the following classes there:
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> MatrixWritable, VectorWritable, Varint/Varlong and
>> > >> VarintWritable,
>> > >>>>>>> and
>> > >>>>>>>>>>> ...
>> > >>>>>>>>>>>> *sigh* o.a.m.common.Pair
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> So  I just dropped those five classes into new a new tiny
>> > >>>>>>>>> mahout-hadoop
>> > >>>>>>>>>>>> module (to signify stuff that is directly relevant to
>> > >> serializing
>> > >>>>>>>>> thigns
>> > >>>>>>>>>>> to
>> > >>>>>>>>>>>> DFS API) and completely removed mrlegacy and its transients
>> > from
>> > >>>>>>> spark
>> > >>>>>>>>>>> and
>> > >>>>>>>>>>>> spark-shell dependencies.
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> So non-cli applications (shell scripts and embedded api
>> use)
>> > >>>>> actually
>> > >>>>>>>>>>> only
>> > >>>>>>>>>>>> need spark dependencies (which come from SPARK_HOME
>> classpath,
>> > >> of
>> > >>>>>>>>> course)
>> > >>>>>>>>>>>> and mahout jars (mahout-spark, mahout-math(-scala),
>> > >> mahout-hadoop
>> > >>>>> and
>> > >>>>>>>>>>>> optionally mahout-spark-shell (for running shell)).
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> This of course still doesn't address driver problems that
>> want
>> > >> to
>> > >>>>>>>>> throw
>> > >>>>>>>>>>>> more stuff into front-end classpath (such as cli parser)
>> but
>> > at
>> > >>>>> least
>> > >>>>>>>>> it
>> > >>>>>>>>>>>> renders transitive luggage of mr-legacy (and the size of
>> > >>>>>>>>> worker-shipped
>> > >>>>>>>>>>>> jars) much more tolerable.
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> How does that sound?
>> > >>>>>>>>>
>> > >>
>> > >>
>> >
>> >
>> >
>>
>
>

Re: Codebase refactoring proposal

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
i think they are debating the details now, not the idea. Like how "NA" is
different from "null" in classic dataframe representation etc.

On Wed, Feb 4, 2015 at 8:18 AM, Suneel Marthi <su...@gmail.com>
wrote:

> I believe they are still debating renaming SchemaRDD -> DataFrame. I
> must admit Dmitriy had suggested this to me a few months ago, reusing
> SchemaRDD if possible. Dmitriy was right: "you told us".
>
> On Wed, Feb 4, 2015 at 11:09 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> > This sounds like a great idea but I wonder if we can get rid of Mahout DRM
> > as a native format. If we have DataFrames (have they actually renamed
> > SchemaRDD?) backed DRMs we ideally don’t need Mahout native DRMs or
> > IndexedDatasets, right? This would be a huge step! If we get data
> > interchangeability with MLlib its a win. If we get general row and column
> > IDs that follow the data through math, its a win. Need to think through
> how
> > to use a DataFrame in a streaming case, probably through some
> checkpointing
> > of the window DStream—hmm.
> >
> > On Feb 4, 2015, at 7:37 AM, Andrew Palumbo <ap...@outlook.com> wrote:
> >
> >
> > On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
> > > I'd suggest to consider this: remember all this talk about
> > > language-integrated spark ql being basically dataframe manipulation
> DSL?
> > >
> > > so now Spark devs are noticing this generality as well and are actually
> > > proposing to rename SchemaRDD into DataFrame and make it mainstream
> data
> > > structure. (my "told you so" moment of sorts :)
> > >
> > > What i am getting at, i'd suggest to make DRM and Spark's newly renamed
> > > DataFrame our two major structures. In particular, standardize on using
> > > DataFrame for things that may include non-numerical data and require
> more
> > > grace about column naming and manipulation. Maybe relevant to TF-IDF
> work
> > > when it deals with non-matrix content.
> > Sounds like a worthy effort to me.  We'd be basically implementing an API
> > at the math-scala level for SchemaRDD/Dataframe datastructures correct?
> >
> > On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> > >> Seems like seq2sparse would be really easy to replace since it takes
> > text
> > >> files to start with, then the whole pipeline could be kept in rdds.
> The
> > >> dictionaries and counts could be either in-memory maps or rdds for use
> > with
> > >> joins? This would get rid of sequence files completely from the
> > pipeline.
> > >> Item similarity uses in-memory maps but the plan is to make it more
> > >> scalable using joins as an alternative with the same API allowing the
> > user
> > >> to trade-off footprint for speed.
> >
> > I think you're right- should be relatively easy.  I've been looking at
> > porting seq2sparse to the DSL for a bit now and the stopper at the DSL
> level
> > is that we don't have a distributed data structure for strings. Seems
> like
> > getting a DataFrame implemented as Dmitriy mentioned above would take
> care
> > of this problem.
> >
> > The other issue i'm a little fuzzy on  is the distributed collocation
> > mapping-  it's a part of the seq2sparse code that I've not spent too much
> > time in.
> >
> > I think that this would be a very worthy effort as well - I believe
> > seq2sparse is a particularly strong Mahout feature.
> >
> > I'll start another thread since we're now way off topic from the
> > refactoring proposal.
> > >>
> > >> My use for TF-IDF is for row similarity and would take a DRM (actually
> > >> IndexedDataset) and calculate row/doc similarities. It works now but
> > only
> > >> using LLR. This is OK when thinking of the items as tags or metadata
> but
> > >> for text tokens something like cosine may be better.
> > >>
> > >> I’d imagine a downsampling phase that would precede TF-IDF using LLR a
> > lot
> > >> like how CF preferences are downsampled. This would produce a
> > sparsified
> > >> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight
> the
> > >> terms before row similarity uses cosine. This is not so good for
> search
> > but
> > >> should produce much better similarities than Solr’s “moreLikeThis” and
> > does
> > >> it for all pairs rather than one at a time.
> > >>
> > >> In any case it can be used to create a personalized content-based
> > >> recommender or augment a CF recommender with one more indicator type.
> > >>
> > >> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com>
> wrote:
> > >>
> > >>
> > >> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
> > >>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
> > >>>> Some issues WRT lower level Spark integration:
> > >>>> 1) interoperability with Spark data. TF-IDF is one example I
> actually
> > >> looked at. There may be other things we can pick up from their
> > committers
> > >> since they have an abundance.
> > >>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated
> to
> > >> me when someone on the Spark list asked about matrix transpose and an
> > MLlib
> > >> committer’s answer was something like “why would you want to do
> that?”.
> > >> Usually you don’t actually execute the transpose but they don’t even
> > >> support A’A, AA’, or A’B, which are core to what I work on. At present
> > you
> > >> pretty much have to choose between MLlib or Mahout for sparse matrix
> > stuff.
> > >> Maybe a half-way measure is some implicit conversions (ugh, I know).
> If
> > the
> > >> DSL could interchange datasets with MLlib, people would be pointed to
> > the
> > >> DSL for all of a bunch of “why would you want to do that?” features.
> > MLlib
> > >> seems to be algorithms, not math.
> > >>>> 3) integration of Streaming. DStreams support most of the RDD
> > >> interface. Doing a batch recalc on a moving time window would nearly
> > fall
> > >> out of DStream backed DRMs. This isn’t the same as incremental updates
> > on
> > >> streaming but it’s a start.
> > >>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
> > >> faster compute engines. So we jumped. Now the need is for streaming
> and
> > >> especially incrementally updated streaming. Seems like we need to
> > address
> > >> this.
> > >>>> Andrew, regardless of the above having TF-IDF would be super
> > >> helpful—row similarity for content/text would benefit greatly.
> > >>>   I will put a PR up soon.
> > >> Just to clarify, I'll be porting over the (very simple) TF, TFIDF
> > classes
> > >> and Weight interface over from mr-legacy to math-scala. They're
> > available
> > >> now in spark-shell but won't be after this refactoring.  These still
> > >> require dictionary and a frequency count maps to vectorize incoming
> > text-
> > >> so they're more for use with the old MR seq2sparse and I don't think
> > they
> > >> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
> > >> Hopefully they'll be of some use.
> > >>
> > >> On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> > >>>> But first I need to do massive fixes and improvements to the
> > distributed
> > >>>> optimizer itself. Still waiting on green light for that.
> > >>>> On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dl...@gmail.com>
> wrote:
> > >>>>
> > >>>>> On Feb 3, 2015 7:20 AM, "Pat Ferrel" <pa...@occamsmachete.com>
> wrote:
> > >>>>>> BTW what level of difficulty would making the DSL run on MLlib
> > Vectors
> > >>>>> and RowMatrix be? Looking at using their hashing TF-IDF but it
> raises
> > >>>>> impedance mismatch between DRM and MLlib RowMatrix. This would
> > further
> > >>>>> reduce artifact size by a bunch.
> > >>>>>
> > >>>>> Short answer, if it were possible, I'd not bother with Mahout code
> > >> base at
> > >>>>> all. The problem is it lacks sufficient flexibility semantics and
> > >>>>> abstruction. Breeze is indefinitely better in that department but
> at
> > >> the
> > >>>>> time it was sufficiently worse on abstracting interoperability of
> > >> matrices
> > >>>>> with different structures. And mllib does not expose breeze.
> > >>>>>
> > >>>>> Looking forward toward hardware acellerated bolt-on work I just
> must
> > >> say
> > >>>>> after reading breeze code for some time I still have much clearer
> > plan
> > >> how
> > >>>>> such back hybridization and cost calibration might work with
> current
> > >> Mahout
> > >>>>> math abstractions than with breeze. It is also more in line with my
> > >> current
> > >>>>> work tasks.
> > >>>>>
> > >>>>>> Also backing something like a DRM with DStreams. Periodic model
> > recalc
> > >>>>> with streams is maybe the first step towards truly streaming algos.
> > >> Looking
> > >>>>> at DStream -> DRM conversion for A’A, A’B, and AA’ in item and row
> > >>>>> similarity. Attach Kafka and get evergreen models, if not
> > incrementally
> > >>>>> updating models.
> > >>>>>> On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > >> wrote:
> > >>>>>> bottom line compile-time dependencies are satisfied with no extra
> > >> stuff
> > >>>>>> from mr-legacy or its transitives. This is proven by virtue of
> > >>>>> successful
> > >>>>>> compilation with no dependency on mr-legacy on the tree.
> > >>>>>>
> > >>>>>> Runtime sufficiency for no extra dependency is proven via running
> > >> shell
> > >>>>> or
> > >>>>>> embedded tests (unit tests) which are successful too. This implies
> > >>>>>> embedding and shell apis.
> > >>>>>>
> > >>>>>> Issue with guava is typical one. if it were an issue, i wouldn't
> be
> > >> able
> > >>>>> to
> > >>>>>> compile and/or run stuff. Now, question is what do we do if
> drivers
> > >> want
> > >>>>>> extra stuff that is not found in Spark.
> > >>>>>>
> > >>>>>> Now, It is so nice not to depend on anything extra so i am
> hesitant
> > to
> > >>>>>> offer anything  here. either shading or lib with opt-in dependency
> > >> policy
> > >>>>>> would suffice though, since it doesn't look like we'd have to have
> > >> tons
> > >>>>> of
> > >>>>>> extra for drivers.
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <
> pat@occamsmachete.com
> > >
> > >>>>> wrote:
> > >>>>>>> I vaguely remember there being a Guava version problem where the
> > >>>>> version
> > >>>>>>> had to be rolled back in one of the hadoop modules. The
> math-scala
> > >>>>>>> IndexedDataset shouldn’t care about version.
> > >>>>>>>
> > >>>>>>> BTW It seems pretty easy to take out the option parser and
> replace
> > >> with
> > >>>>>>> match and tuples especially if we can extend the Scala App class.
> > It
> > >>>>> might
> > >>>>>>> actually simplify things since I can then use several case
> classes
> > to
> > >>>>> hold
> > >>>>>>> options (scopt needed one object), which in turn takes out all
> > those
> > >>>>> ugly
> > >>>>>>> casts. I’ll take a look next time I’m in there.
> > >>>>>>>
> > >>>>>>> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >
> > >>>>> wrote:
> > >>>>>>> in 'spark' module it is overwritten with spark dependency, which
> > also
> > >>>>> comes
> > >>>>>>> at the same version so happens. so should be fine with 1.1.x
> > >>>>>>>
> > >>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
> > >>>>>>> mahout-spark_2.10 ---
> > >>>>>>> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
> > >>>>>>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
> > >>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> > >>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> > >>>>>>> [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
> > >>>>>>> [INFO] |  |  |  +-
> org.apache.commons:commons-math:jar:2.1:compile
> > >>>>>>> [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
> > >>>>>>> [INFO] |  |  |  +-
> > commons-logging:commons-logging:jar:1.1.3:compile
> > >>>>>>> [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
> > >>>>>>> [INFO] |  |  |  +-
> > >>>>>>> commons-configuration:commons-configuration:jar:1.6:compile
> > >>>>>>> [INFO] |  |  |  |  +-
> > >>>>>>> commons-collections:commons-collections:jar:3.2.1:compile
> > >>>>>>> [INFO] |  |  |  |  +-
> > >> commons-digester:commons-digester:jar:1.8:compile
> > >>>>>>> [INFO] |  |  |  |  |  \-
> > >>>>>>> commons-beanutils:commons-beanutils:jar:1.7.0:compile
> > >>>>>>> [INFO] |  |  |  |  \-
> > >>>>>>> commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
> > >>>>>>> [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
> > >>>>>>> [INFO] |  |  |  +-
> > >> com.google.protobuf:protobuf-java:jar:2.5.0:compile
> > >>>>>>> [INFO] |  |  |  +-
> org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
> > >>>>>>> [INFO] |  |  |  \-
> > >>>>> org.apache.commons:commons-compress:jar:1.4.1:compile
> > >>>>>>> [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
> > >>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
> > >>>>>>> [INFO] |  |  +-
> > >>>>>>> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
> > >>>>>>> [INFO] |  |  |  +-
> > >>>>>>>
> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
> > >>>>>>> [INFO] |  |  |  |  +-
> > >>>>>>> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
> > >>>>>>> [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
> > >>>>>>> [INFO] |  |  |  |  |  |  +-
> javax.inject:javax.inject:jar:1:compile
> > >>>>>>> [INFO] |  |  |  |  |  |  \-
> aopalliance:aopalliance:jar:1.0:compile
> > >>>>>>> [INFO] |  |  |  |  |  +-
> > >>>>>>>
> > >>>>>>>
> > >>
> >
> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
> > >>>>>>> [INFO] |  |  |  |  |  |  +-
> > >>>>>>>
> > >>>>>>>
> > >>
> >
> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
> > >>>>>>> [INFO] |  |  |  |  |  |  |  +-
> > >>>>>>> javax.servlet:javax.servlet-api:jar:3.0.1:compile
> > >>>>>>> [INFO] |  |  |  |  |  |  |  \-
> > >>>>> com.sun.jersey:jersey-client:jar:1.9:compile
> > >>>>>>> [INFO] |  |  |  |  |  |  \-
> > >>>>> com.sun.jersey:jersey-grizzly2:jar:1.9:compile
> > >>>>>>> [INFO] |  |  |  |  |  |     +-
> > >>>>>>> org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
> > >>>>>>> [INFO] |  |  |  |  |  |     |  \-
> > >>>>>>> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
> > >>>>>>> [INFO] |  |  |  |  |  |     |     \-
> > >>>>>>> org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
> > >>>>>>> [INFO] |  |  |  |  |  |     |        \-
> > >>>>>>> org.glassfish.external:management-api:jar:3.0.0-b012:compile
> > >>>>>>> [INFO] |  |  |  |  |  |     +-
> > >>>>>>> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
> > >>>>>>> [INFO] |  |  |  |  |  |     |  \-
> > >>>>>>> org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
> > >>>>>>> [INFO] |  |  |  |  |  |     +-
> > >>>>>>> org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
> > >>>>>>> [INFO] |  |  |  |  |  |     \-
> > >>>>> org.glassfish:javax.servlet:jar:3.1:compile
> > >>>>>>> [INFO] |  |  |  |  |  +-
> > com.sun.jersey:jersey-server:jar:1.9:compile
> > >>>>>>> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
> > >>>>>>> [INFO] |  |  |  |  |  |  \-
> > >> com.sun.jersey:jersey-core:jar:1.9:compile
> > >>>>>>> [INFO] |  |  |  |  |  +-
> com.sun.jersey:jersey-json:jar:1.9:compile
> > >>>>>>> [INFO] |  |  |  |  |  |  +-
> > >>>>> org.codehaus.jettison:jettison:jar:1.1:compile
> > >>>>>>> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
> > >>>>>>> [INFO] |  |  |  |  |  |  +-
> > >>>>> com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
> > >>>>>>> [INFO] |  |  |  |  |  |  |  \-
> > >>>>> javax.xml.bind:jaxb-api:jar:2.2.2:compile
> > >>>>>>> [INFO] |  |  |  |  |  |  |     \-
> > >>>>>>> javax.activation:activation:jar:1.1:compile
> > >>>>>>> [INFO] |  |  |  |  |  |  +-
> > >>>>>>> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
> > >>>>>>> [INFO] |  |  |  |  |  |  \-
> > >>>>>>> org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
> > >>>>>>> [INFO] |  |  |  |  |  \-
> > >>>>>>> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
> > >>>>>>> [INFO] |  |  |  |  \-
> > >>>>>>> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
> > >>>>>>> [INFO] |  |  |  \-
> > >>>>>>>
> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
> > >>>>>>> [INFO] |  |  +-
> org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
> > >>>>>>> [INFO] |  |  +-
> > >>>>>>> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
> > >>>>>>> [INFO] |  |  |  \-
> > >>>>> org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
> > >>>>>>> [INFO] |  |  +-
> > >>>>>>>
> > org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
> > >>>>>>> [INFO] |  |  \-
> > >> org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
> > >>>>>>> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
> > >>>>>>> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
> > >>>>>>> [INFO] |  |  \-
> > commons-httpclient:commons-httpclient:jar:3.1:compile
> > >>>>>>> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
> > >>>>>>> [INFO] |  |  +-
> > >> org.apache.curator:curator-framework:jar:2.4.0:compile
> > >>>>>>> [INFO] |  |  |  \-
> > >> org.apache.curator:curator-client:jar:2.4.0:compile
> > >>>>>>> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
> > >>>>>>> [INFO] |  |     \- jline:jline:jar:0.9.94:compile
> > >>>>>>> [INFO] |  +-
> > >> org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
> > >>>>>>> [INFO] |  |  +-
> > >>>>>>>
> > >>
> > org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
> > >>>>>>> [INFO] |  |  +-
> > >>>>> org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
> > >>>>>>> [INFO] |  |  |  +-
> > >>>>> org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
> > >>>>>>> [INFO] |  |  |  \-
> > >>>>>>> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
> > >>>>>>> [INFO] |  |  \-
> > >>>>> org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
> > >>>>>>> [INFO] |  |     \-
> > >>>>>>>
> > >>>>>>>
> > >>
> >
> org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
> > >>>>>>> [INFO] |  |        \-
> > >>>>>>>
> > >>
> org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
> > >>>>>>> [INFO] |  +-
> > >>>>> org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
> > >>>>>>> [INFO] |  +-
> > >> org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
> > >>>>>>> [INFO] |  +-
> > >>>>> org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
> > >>>>>>> [INFO] |  |  +-
> > >>>>>>>
> > org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
> > >>>>>>> [INFO] |  |  +-
> > >>>>>>> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
> > >>>>>>> [INFO] |  |  \-
> > >>>>> org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
> > >>>>>>> [INFO] |  |     \-
> > >>>>> org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
> > >>>>>>> [INFO] |  +- com.google.guava:guava:jar:16.0:compile
> > >>>>>>> d
> > >>>>>>>
> > >>>>>>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <
> > dlieu.7@gmail.com
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> looks like it is also requested by mahout-math, wonder what is
> > using
> > >>>>> it
> > >>>>>>>> there.
> > >>>>>>>>
> > >>>>>>>> At very least, it needs to be synchronized to the one currently
> > used
> > >>>>> by
> > >>>>>>>> spark.
> > >>>>>>>>
> > >>>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
> > >>>>> mahout-hadoop
> > >>>>>>>> ---
> > >>>>>>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
> > >>>>>>>> *[INFO] +-
> org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
> > >>>>>>>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
> > >>>>>>>> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
> > >>>>>>>> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
> > >>>>>>>> [INFO] +-
> > >>>>> org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
> > >>>>>>>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> > >>>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <
> > pat@occamsmachete.com>
> > >>>>>>> wrote:
> > >>>>>>>>> Looks like Guava is in Spark.
> > >>>>>>>>>
> > >>>>>>>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <pat@occamsmachete.com
> >
> > >>>>> wrote:
> > >>>>>>>>> IndexedDataset uses Guava. Can’t tell from sure but it sounds
> > like
> > >>>>> this
> > >>>>>>>>> would not be included since I think it was taken from the
> > mrlegacy
> > >>>>> jar.
> > >>>>>>>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <
> > dlieu.7@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>>> ---------- Forwarded message ----------
> > >>>>>>>>> From: "Pat Ferrel" <pa...@occamsmachete.com>
> > >>>>>>>>> Date: Jan 25, 2015 9:39 AM
> > >>>>>>>>> Subject: Re: Codebase refactoring proposal
> > >>>>>>>>> To: <de...@mahout.apache.org>
> > >>>>>>>>> Cc:
> > >>>>>>>>>
> > >>>>>>>>>> When you get a chance a PR would be good.
> > >>>>>>>>> Yes, it would. And not just for that.
> > >>>>>>>>>
> > >>>>>>>>>> As I understand it you are putting some class jars somewhere
> in
> > >> the
> > >>>>>>>>> classpath. Where? How?
> > >>>>>>>>> /bin/mahout
> > >>>>>>>>>
> > >>>>>>>>> (Computes 2 different classpaths. See  'bin/mahout classpath'
> vs.
> > >>>>>>>>> 'bin/mahout -spark'.)
> > >>>>>>>>>
> > >>>>>>>>> If i interpret current shell code there correctky, legacy path
> > >> tries
> > >>>>> to
> > >>>>>>>>> use
> > >>>>>>>>> examples assemblies if not packaged, or /lib if packaged. True
> > >>>>>>> motivation
> > >>>>>>>>> of that significantly predates 2010 and i suspect only Benson
> > knows
> > >>>>>>> whole
> > >>>>>>>>> true intent there.
> > >>>>>>>>>
> > >>>>>>>>> The spark path, which is really a quick hack of the script,
> tries
> > >> to
> > >>>>> get
> > >>>>>>>>> only selected mahout jars and locally instlalled spark
> classpath
> > >>>>> which i
> > >>>>>>>>> guess is just the shaded spark jar in recent spark releases. It
> > >> also
> > >>>>>>>>> apparently tries to include /libs/*, which is never compiled in
> > >>>>>>> unpackaged
> > >>>>>>>>> version, and now i think it is a bug it is included  because
> > >> /libs/*
> > >>>>> is
> > >>>>>>>>> apparently legacy packaging, and shouldnt be used  in spark
> jobs
> > >>>>> with a
> > >>>>>>>>> wildcard. I cant beleive how lazy i am, i still did not find
> time
> > >> to
> > >>>>>>>>> understand mahout build in all cases.
> > >>>>>>>>>
> > >>>>>>>>> I am not even sure if packaged mahout will work with spark,
> > >> honestly,
> > >>>>>>>>> because of the /lib. Never tried that, since i mostly use
> > >> application
> > >>>>>>>>> embedding techniques.
> > >>>>>>>>>
> > >>>>>>>>> The same solution may apply to adding external dependencies and
> > >>>>> removing
> > >>>>>>>>> the assembly in the Spark module. Which would leave only one
> > major
> > >>>>> build
> > >>>>>>>>> issue afaik.
> > >>>>>>>>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <
> > dlieu.7@gmail.com
> > >>>>>>>>> wrote:
> > >>>>>>>>>> No, no PR. Only experiment on private. But i believe i
> > >> sufficiently
> > >>>>>>>>> defined
> > >>>>>>>>>> what i want to do in order to gauge if we may want to advance
> it
> > >>>>> some
> > >>>>>>>>> time
> > >>>>>>>>>> later. Goal is much lighter dependency for spark code.
> Eliminate
> > >>>>>>>>> everything
> > >>>>>>>>>> that is not compile-time dependent. (and a lot of it is thru
> > >> legacy
> > >>>>> MR
> > >>>>>>>>> code
> > >>>>>>>>>> which we of course don't use).
> > >>>>>>>>>>
> > >>>>>>>>>> Cant say i understand the remaining issues you are talking
> about
> > >>>>>>> though.
> > >>>>>>>>>> If you are talking about compiling lib or shaded assembly, no,
> > >> this
> > >>>>>>>>> doesn't
> > >>>>>>>>>> do anything about it. Although point is, as it stands, the
> > algebra
> > >>>>> and
> > >>>>>>>>>> shell don't have any external dependencies but spark and
> these 4
> > >>>>> (5?)
> > >>>>>>>>>> mahout jars so they technically don't even need an assembly
> (as
> > >>>>>>>>>> demonstrated).
> > >>>>>>>>>>
> > >>>>>>>>>> As i said, it seems driver code is the only one that may need
> > some
> > >>>>>>>>> external
> > >>>>>>>>>> dependencies, but that's a different scenario from those i am
> > >>>>> talking
> > >>>>>>>>>> about. But i am relatively happy with having the first two
> > working
> > >>>>>>>>> nicely
> > >>>>>>>>>> at this point.
> > >>>>>>>>>>
> > >>>>>>>>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <
> > >> pat@occamsmachete.com>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>> +1
> > >>>>>>>>>>>
> > >>>>>>>>>>> Is there a PR? You mention a "tiny mahout-hadoop” module. It
> > >> would
> > >>>>> be
> > >>>>>>>>> nice
> > >>>>>>>>>>> to see how you’ve structured that in case we can use the same
> > >>>>> model to
> > >>>>>>>>>>> solve the two remaining refactoring issues.
> > >>>>>>>>>>> 1) external dependencies in the spark module
> > >>>>>>>>>>> 2) no spark or h2o in the release artifacts.
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn <
> squinn@gatech.edu>
> > >>>>> wrote:
> > >>>>>>>>>>> Also +1
> > >>>>>>>>>>>
> > >>>>>>>>>>> iPhone'd
> > >>>>>>>>>>>
> > >>>>>>>>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo <
> ap.dev@outlook.com
> > >
> > >>>>>>> wrote:
> > >>>>>>>>>>>> +1
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> <div>-------- Original message --------</div><div>From:
> > Dmitriy
> > >>>>>>>>> Lyubimov
> > >>>>>>>>>>> <dl...@gmail.com> </div><div>Date:01/23/2015  6:06 PM
> > >>>>> (GMT-05:00)
> > >>>>>>>>>>> </div><div>To: dev@mahout.apache.org </div><div>Subject:
> > >> Codebase
> > >>>>>>>>>>> refactoring proposal </div><div>
> > >>>>>>>>>>>> </div>
> > >>>>>>>>>>>> So right now mahout-spark depends on mr-legacy.
> > >>>>>>>>>>>> I did quick refactoring and it turns out it only
> _irrevocably_
> > >>>>>>> depends
> > >>>>>>>>> on
> > >>>>>>>>>>>> the following classes there:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> MatrixWritable, VectorWritable, Varint/Varlong and
> > >> VarintWritable,
> > >>>>>>> and
> > >>>>>>>>>>> ...
> > >>>>>>>>>>>> *sigh* o.a.m.common.Pair
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> So  I just dropped those five classes into new a new tiny
> > >>>>>>>>> mahout-hadoop
> > >>>>>>>>>>>> module (to signify stuff that is directly relevant to
> > >> serializing
> > >>>>>>>>> thigns
> > >>>>>>>>>>> to
> > >>>>>>>>>>>> DFS API) and completely removed mrlegacy and its transients
> > from
> > >>>>>>> spark
> > >>>>>>>>>>> and
> > >>>>>>>>>>>> spark-shell dependencies.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> So non-cli applications (shell scripts and embedded api use)
> > >>>>> actually
> > >>>>>>>>>>> only
> > >>>>>>>>>>>> need spark dependencies (which come from SPARK_HOME
> classpath,
> > >> of
> > >>>>>>>>> course)
> > >>>>>>>>>>>> and mahout jars (mahout-spark, mahout-math(-scala),
> > >> mahout-hadoop
> > >>>>> and
> > >>>>>>>>>>>> optionally mahout-spark-shell (for running shell)).
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> This of course still doesn't address driver problems that
> want
> > >> to
> > >>>>>>>>> throw
> > >>>>>>>>>>>> more stuff into front-end classpath (such as cli parser) but
> > at
> > >>>>> least
> > >>>>>>>>> it
> > >>>>>>>>>>>> renders transitive luggage of mr-legacy (and the size of
> > >>>>>>>>> worker-shipped
> > >>>>>>>>>>>> jars) much more tolerable.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> How does that sound?
> > >>>>>>>>>
> > >>
> > >>
> >
> >
> >
>

Re: Codebase refactoring proposal

Posted by Suneel Marthi <su...@gmail.com>.
I believe they are still debating renaming SchemaRDD -> DataFrame. I must
admit Dmitriy had suggested this to me a few months ago, reusing SchemaRDD
if possible. Dmitriy was right: "you told us".

On Wed, Feb 4, 2015 at 11:09 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> This sound like a great idea but I wonder is we can get rid of Mahout DRM
> as a native format. If we have DataFrames (have they actually renamed
> SchemaRDD?) backed DRMs we ideally don’t need Mahout native DRMs or
> IndexedDatasets, right? This would be a huge step! If we get data
> interchangeability with MLlib its a win. If we get general row and column
> IDs that follow the data through math, its a win. Need to think through how
> to use a DataFrame in a streaming case, probably through some checkpointing
> of the window DStream—hmm.
>
> On Feb 4, 2015, at 7:37 AM, Andrew Palumbo <ap...@outlook.com> wrote:
>
>
> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
> > I'd suggest to consider this: remember all this talk about
> > language-integrated spark ql being basically dataframe manipulation DSL?
> >
> > so now Spark devs are noticing this generality as well and are actually
> > proposing to rename SchemaRDD into DataFrame and make it mainstream data
> > structure. (my "told you so" moment of sorts :)
> >
> > What i am getting at, i'd suggest to make DRM and Spark's newly renamed
> > DataFrame our two major structures. In particular, standardize on using
> > DataFrame for things that may include non-numerical data and require more
> > grace about column naming and manipulation. Maybe relevant to TF-IDF work
> > when it deals with non-matrix content.
> Sounds like a worthy effort to me.  We'd be basically implementing an API
> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>
> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> >> Seems like seq2sparse would be really easy to replace since it takes
> text
> >> files to start with, then the whole pipeline could be kept in rdds. The
> >> dictionaries and counts could be either in-memory maps or rdds for use
> with
> >> joins? This would get rid of sequence files completely from the
> pipeline.
> >> Item similarity uses in-memory maps but the plan is to make it more
> >> scalable using joins as an alternative with the same API allowing the
> user
> >> to trade-off footprint for speed.
>
> I think you're right- should be relatively easy.  I've been looking at
> porting seq2sparse  to the DSL for bit now and the stopper at the DSL level
> is that we don't have a distributed data structure for strings..Seems like
> getting a DataFrame implemented as Dmitriy mentioned above would take care
> of this problem.
>
> The other issue i'm a little fuzzy on  is the distributed collocation
> mapping-  it's a part of the seq2sparse code that I've not spent too much
> time in.
>
> I think that this would be very worthy effort as well-  I believe
> seq2sparse is a particular strong mahout feature.
>
> I'll start another thread since we're now way off topic from the
> refactoring proposal.
> >>
> >> My use for TF-IDF is for row similarity and would take a DRM (actually
> >> IndexedDataset) and calculate row/doc similarities. It works now but
> only
> >> using LLR. This is OK when thinking of the items as tags or metadata but
> >> for text tokens something like cosine may be better.
> >>
> >> I’d imagine a downsampling phase that would precede TF-IDF using LLR a
> lot
> >> like how CF preferences are downsampled. This would produce an
> sparsified
> >> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
> >> terms before row similarity uses cosine. This is not so good for search
> but
> >> should produce much better similarities than Solr’s “moreLikeThis” and
> does
> >> it for all pairs rather than one at a time.
> >>
> >> In any case it can be used to do a create a personalized content-based
> >> recommender or augment a CF recommender with one more indicator type.
> >>
> >> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:
> >>
> >>
> >> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
> >>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
> >>>> Some issues WRT lower level Spark integration:
> >>>> 1) interoperability with Spark data. TF-IDF is one example I actually
> >> looked at. There may be other things we can pick up from their
> committers
> >> since they have an abundance.
> >>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
> >> me when someone on the Spark list asked about matrix transpose and an
> MLlib
> >> committer’s answer was something like “why would you want to do that?”.
> >> Usually you don’t actually execute the transpose but they don’t even
> >> support A’A, AA’, or A’B, which are core to what I work on. At present
> you
> >> pretty much have to choose between MLlib or Mahout for sparse matrix
> stuff.
> >> Maybe a half-way measure is some implicit conversions (ugh, I know). If
> the
> >> DSL could interchange datasets with MLlib, people would be pointed to
> the
> >> DSL for all of a bunch of “why would you want to do that?” features.
> MLlib
> >> seems to be algorithms, not math.
> >>>> 3) integration of Streaming. DStreams support most of the RDD
> >> interface. Doing a batch recalc on a moving time window would nearly
> fall
> >> out of DStream backed DRMs. This isn’t the same as incremental updates
> on
> >> streaming but it’s a start.
> >>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
> >> faster compute engines. So we jumped. Now the need is for streaming and
> >> especially incrementally updated streaming. Seems like we need to
> address
> >> this.
> >>>> Andrew, regardless of the above having TF-IDF would be super
> >> helpful—row similarity for content/text would benefit greatly.
> >>>   I will put a PR up soon.
> >> Just to clarify, I'll be porting over the (very simple) TF, TFIDF
> classes
> >> and Weight interface over from mr-legacy to math-scala. They're
> available
> >> now in spark-shell but won't be after this refactoring.  These still
> >> require dictionary and a frequency count maps to vectorize incoming
> text-
> >> so they're more for use with the old MR seq2sparse and I don't think
> they
> >> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
> >> Hopefully they'll be of some use.
> >>
> >> On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> >>>> But first I need to do massive fixes and improvements to the
> distributed
> >>>> optimizer itself. Still waiting on green light for that.
> >>>> On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dl...@gmail.com> wrote:
> >>>>
> >>>>> On Feb 3, 2015 7:20 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
> >>>>>> BTW what level of difficulty would making the DSL run on MLlib
> Vectors
> >>>>> and RowMatrix be? Looking at using their hashing TF-IDF but it raises
> >>>>> impedance mismatch between DRM and MLlib RowMatrix. This would
> further
> >>>>> reduce artifact size by a bunch.
> >>>>>
> >>>>> Short answer, if it were possible, I'd not bother with Mahout code
> >> base at
> >>>>> all. The problem is it lacks sufficient flexibility semantics and
> >>>>> abstruction. Breeze is indefinitely better in that department but at
> >> the
> >>>>> time it was sufficiently worse on abstracting interoperability of
> >> matrices
> >>>>> with different structures. And mllib does not expose breeze.
> >>>>>
> >>>>> Looking forward toward hardware acellerated bolt-on work I just must
> >> say
> >>>>> after reading breeze code for some time I still have much clearer
> plan
> >> how
> >>>>> such back hybridization and cost calibration might work with current
> >> Mahout
> >>>>> math abstractions than with breeze. It is also more in line with my
> >> current
> >>>>> work tasks.
> >>>>>
> >>>>>> Also backing something like a DRM with DStreams. Periodic model
> recalc
> >>>>> with streams is maybe the first step towards truly streaming algos.
> >> Looking
> >>>>> at DStream -> DRM conversion for A’A, A’B, and AA’ in item and row
> >>>>> similarity. Attach Kafka and get evergreen models, if not
> incrementally
> >>>>> updating models.
> >>>>>> On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com>
> >> wrote:
> >>>>>> bottom line compile-time dependencies are satisfied with no extra
> >> stuff
> >>>>>> from mr-legacy or its transitives. This is proven by virtue of
> >>>>> successful
> >>>>>> compilation with no dependency on mr-legacy on the tree.
> >>>>>>
> >>>>>> Runtime sufficiency for no extra dependency is proven via running
> >> shell
> >>>>> or
> >>>>>> embedded tests (unit tests) which are successful too. This implies
> >>>>>> embedding and shell apis.
> >>>>>>
> >>>>>> Issue with guava is typical one. if it were an issue, i wouldn't be
> >> able
> >>>>> to
> >>>>>> compile and/or run stuff. Now, question is what do we do if drivers
> >> want
> >>>>>> extra stuff that is not found in Spark.
> >>>>>>
> >>>>>> Now, It is so nice not to depend on anything extra so i am hesitant
> to
> >>>>>> offer anything  here. either shading or lib with opt-in dependency
> >> policy
> >>>>>> would suffice though, since it doesn't look like we'd have to have
> >> tons
> >>>>> of
> >>>>>> extra for drivers.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <pat@occamsmachete.com
> >
> >>>>> wrote:
> >>>>>>> I vaguely remember there being a Guava version problem where the
> >>>>> version
> >>>>>>> had to be rolled back in one of the hadoop modules. The math-scala
> >>>>>>> IndexedDataset shouldn’t care about version.
> >>>>>>>
> >>>>>>> BTW It seems pretty easy to take out the option parser and replace
> >> with
> >>>>>>> match and tuples especially if we can extend the Scala App class.
> It
> >>>>> might
> >>>>>>> actually simplify things since I can then use several case classes
> to
> >>>>> hold
> >>>>>>> options (scopt needed one object), which in turn takes out all
> those
> >>>>> ugly
> >>>>>>> casts. I’ll take a look next time I’m in there.
> >>>>>>>
> >>>>>>> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dl...@gmail.com>
> >>>>> wrote:
> >>>>>>> in 'spark' module it is overwritten with spark dependency, which
> also
> >>>>> comes
> >>>>>>> at the same version so happens. so should be fine with 1.1.x
> >>>>>>>
> >>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
> >>>>>>> mahout-spark_2.10 ---
> >>>>>>> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
> >>>>>>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
> >>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
> >>>>>>> [INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
> >>>>>>> [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
> >>>>>>> [INFO] |  |  |  +-
> commons-logging:commons-logging:jar:1.1.3:compile
> >>>>>>> [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
> >>>>>>> [INFO] |  |  |  +-
> >>>>>>> commons-configuration:commons-configuration:jar:1.6:compile
> >>>>>>> [INFO] |  |  |  |  +-
> >>>>>>> commons-collections:commons-collections:jar:3.2.1:compile
> >>>>>>> [INFO] |  |  |  |  +-
> >> commons-digester:commons-digester:jar:1.8:compile
> >>>>>>> [INFO] |  |  |  |  |  \-
> >>>>>>> commons-beanutils:commons-beanutils:jar:1.7.0:compile
> >>>>>>> [INFO] |  |  |  |  \-
> >>>>>>> commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
> >>>>>>> [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
> >>>>>>> [INFO] |  |  |  +-
> >> com.google.protobuf:protobuf-java:jar:2.5.0:compile
> >>>>>>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  |  \-
> >>>>> org.apache.commons:commons-compress:jar:1.4.1:compile
> >>>>>>> [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
> >>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  +-
> >>>>>>> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  |  +-
> >>>>>>> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  |  |  +-
> >>>>>>> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
> >>>>>>> [INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
> >>>>>>> [INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
> >>>>>>> [INFO] |  |  |  |  |  +-
> >>>>>>>
> >>>>>>>
> >>
> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
> >>>>>>> [INFO] |  |  |  |  |  |  +-
> >>>>>>>
> >>>>>>>
> >>
> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
> >>>>>>> [INFO] |  |  |  |  |  |  |  +-
> >>>>>>> javax.servlet:javax.servlet-api:jar:3.0.1:compile
> >>>>>>> [INFO] |  |  |  |  |  |  |  \-
> >>>>> com.sun.jersey:jersey-client:jar:1.9:compile
> >>>>>>> [INFO] |  |  |  |  |  |  \-
> >>>>> com.sun.jersey:jersey-grizzly2:jar:1.9:compile
> >>>>>>> [INFO] |  |  |  |  |  |     +-
> >>>>>>> org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
> >>>>>>> [INFO] |  |  |  |  |  |     |  \-
> >>>>>>> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
> >>>>>>> [INFO] |  |  |  |  |  |     |     \-
> >>>>>>> org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
> >>>>>>> [INFO] |  |  |  |  |  |     |        \-
> >>>>>>> org.glassfish.external:management-api:jar:3.0.0-b012:compile
> >>>>>>> [INFO] |  |  |  |  |  |     +-
> >>>>>>> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
> >>>>>>> [INFO] |  |  |  |  |  |     |  \-
> >>>>>>> org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
> >>>>>>> [INFO] |  |  |  |  |  |     +-
> >>>>>>> org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
> >>>>>>> [INFO] |  |  |  |  |  |     \-
> >>>>> org.glassfish:javax.servlet:jar:3.1:compile
> >>>>>>> [INFO] |  |  |  |  |  +-
> com.sun.jersey:jersey-server:jar:1.9:compile
> >>>>>>> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
> >>>>>>> [INFO] |  |  |  |  |  |  \-
> >> com.sun.jersey:jersey-core:jar:1.9:compile
> >>>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
> >>>>>>> [INFO] |  |  |  |  |  |  +-
> >>>>> org.codehaus.jettison:jettison:jar:1.1:compile
> >>>>>>> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
> >>>>>>> [INFO] |  |  |  |  |  |  +-
> >>>>> com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
> >>>>>>> [INFO] |  |  |  |  |  |  |  \-
> >>>>> javax.xml.bind:jaxb-api:jar:2.2.2:compile
> >>>>>>> [INFO] |  |  |  |  |  |  |     \-
> >>>>>>> javax.activation:activation:jar:1.1:compile
> >>>>>>> [INFO] |  |  |  |  |  |  +-
> >>>>>>> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
> >>>>>>> [INFO] |  |  |  |  |  |  \-
> >>>>>>> org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
> >>>>>>> [INFO] |  |  |  |  |  \-
> >>>>>>> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
> >>>>>>> [INFO] |  |  |  |  \-
> >>>>>>> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  |  \-
> >>>>>>> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  +-
> >>>>>>> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  |  \-
> >>>>> org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  +-
> >>>>>>>
> org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  \-
> >> org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
> >>>>>>> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
> >>>>>>> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
> >>>>>>> [INFO] |  |  \-
> commons-httpclient:commons-httpclient:jar:3.1:compile
> >>>>>>> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
> >>>>>>> [INFO] |  |  +-
> >> org.apache.curator:curator-framework:jar:2.4.0:compile
> >>>>>>> [INFO] |  |  |  \-
> >> org.apache.curator:curator-client:jar:2.4.0:compile
> >>>>>>> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
> >>>>>>> [INFO] |  |     \- jline:jline:jar:0.9.94:compile
> >>>>>>> [INFO] |  +-
> >> org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  |  +-
> >>>>>>>
> >>
> org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
> >>>>>>> [INFO] |  |  +-
> >>>>> org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  |  |  +-
> >>>>> org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  |  |  \-
> >>>>>>> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  |  \-
> >>>>> org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  |     \-
> >>>>>>>
> >>>>>>>
> >>
> org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
> >>>>>>> [INFO] |  |        \-
> >>>>>>>
> >> org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
> >>>>>>> [INFO] |  +-
> >>>>> org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  +-
> >> org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  +-
> >>>>> org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  |  +-
> >>>>>>>
> org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
> >>>>>>> [INFO] |  |  +-
> >>>>>>> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  |  \-
> >>>>> org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  |     \-
> >>>>> org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  +- com.google.guava:guava:jar:16.0:compile
> >>>>>>> d
> >>>>>>>
> >>>>>>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <
> dlieu.7@gmail.com
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> looks like it is also requested by mahout-math, wonder what is
> using
> >>>>> it
> >>>>>>>> there.
> >>>>>>>>
> >>>>>>>> At very least, it needs to be synchronized to the one currently
> used
> >>>>> by
> >>>>>>>> spark.
> >>>>>>>>
> >>>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
> >>>>> mahout-hadoop
> >>>>>>>> ---
> >>>>>>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
> >>>>>>>> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
> >>>>>>>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
> >>>>>>>> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
> >>>>>>>> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
> >>>>>>>> [INFO] +-
> >>>>> org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
> >>>>>>>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> >>>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <
> pat@occamsmachete.com>
> >>>>>>> wrote:
> >>>>>>>>> Looks like Guava is in Spark.
> >>>>>>>>>
> >>>>>>>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <pa...@occamsmachete.com>
> >>>>> wrote:
> >>>>>>>>> IndexedDataset uses Guava. Can’t tell for sure but it sounds
> like
> >>>>> this
> >>>>>>>>> would not be included since I think it was taken from the
> mrlegacy
> >>>>> jar.
> >>>>>>>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <
> dlieu.7@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>> ---------- Forwarded message ----------
> >>>>>>>>> From: "Pat Ferrel" <pa...@occamsmachete.com>
> >>>>>>>>> Date: Jan 25, 2015 9:39 AM
> >>>>>>>>> Subject: Re: Codebase refactoring proposal
> >>>>>>>>> To: <de...@mahout.apache.org>
> >>>>>>>>> Cc:
> >>>>>>>>>
> >>>>>>>>>> When you get a chance a PR would be good.
> >>>>>>>>> Yes, it would. And not just for that.
> >>>>>>>>>
> >>>>>>>>>> As I understand it you are putting some class jars somewhere in
> >> the
> >>>>>>>>> classpath. Where? How?
> >>>>>>>>> /bin/mahout
> >>>>>>>>>
> >>>>>>>>> (Computes 2 different classpaths. See  'bin/mahout classpath' vs.
> >>>>>>>>> 'bin/mahout -spark'.)
> >>>>>>>>>
> >>>>>>>>> If i interpret current shell code there correctly, legacy path
> >> tries
> >>>>> to
> >>>>>>>>> use
> >>>>>>>>> examples assemblies if not packaged, or /lib if packaged. True
> >>>>>>> motivation
> >>>>>>>>> of that significantly predates 2010 and i suspect only Benson
> knows
> >>>>>>> whole
> >>>>>>>>> true intent there.
> >>>>>>>>>
> >>>>>>>>> The spark path, which is really a quick hack of the script, tries
> >> to
> >>>>> get
> >>>>>>>>> only selected mahout jars and locally installed spark classpath
> >>>>> which i
> >>>>>>>>> guess is just the shaded spark jar in recent spark releases. It
> >> also
> >>>>>>>>> apparently tries to include /libs/*, which is never compiled in
> >>>>>>> unpackaged
> >>>>>>>>> version, and now i think it is a bug it is included  because
> >> /libs/*
> >>>>> is
> >>>>>>>>> apparently legacy packaging, and shouldnt be used  in spark jobs
> >>>>> with a
> >>>>>>>>> wildcard. I can't believe how lazy i am, i still did not find time
> >> to
> >>>>>>>>> understand mahout build in all cases.
> >>>>>>>>>
> >>>>>>>>> I am not even sure if packaged mahout will work with spark,
> >> honestly,
> >>>>>>>>> because of the /lib. Never tried that, since i mostly use
> >> application
> >>>>>>>>> embedding techniques.
> >>>>>>>>>
> >>>>>>>>> The same solution may apply to adding external dependencies and
> >>>>> removing
> >>>>>>>>> the assembly in the Spark module. Which would leave only one
> major
> >>>>> build
> >>>>>>>>> issue afaik.
> >>>>>>>>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <
> dlieu.7@gmail.com
> >>>>>>>>> wrote:
> >>>>>>>>>> No, no PR. Only experiment on private. But i believe i
> >> sufficiently
> >>>>>>>>> defined
> >>>>>>>>>> what i want to do in order to gauge if we may want to advance it
> >>>>> some
> >>>>>>>>> time
> >>>>>>>>>> later. Goal is much lighter dependency for spark code. Eliminate
> >>>>>>>>> everything
> >>>>>>>>>> that is not compile-time dependent. (and a lot of it is thru
> >> legacy
> >>>>> MR
> >>>>>>>>> code
> >>>>>>>>>> which we of course don't use).
> >>>>>>>>>>
> >>>>>>>>>> Cant say i understand the remaining issues you are talking about
> >>>>>>> though.
> >>>>>>>>>> If you are talking about compiling lib or shaded assembly, no,
> >> this
> >>>>>>>>> doesn't
> >>>>>>>>>> do anything about it. Although point is, as it stands, the
> algebra
> >>>>> and
> >>>>>>>>>> shell don't have any external dependencies but spark and these 4
> >>>>> (5?)
> >>>>>>>>>> mahout jars so they technically don't even need an assembly (as
> >>>>>>>>>> demonstrated).
> >>>>>>>>>>
> >>>>>>>>>> As i said, it seems driver code is the only one that may need
> some
> >>>>>>>>> external
> >>>>>>>>>> dependencies, but that's a different scenario from those i am
> >>>>> talking
> >>>>>>>>>> about. But i am relatively happy with having the first two
> working
> >>>>>>>>> nicely
> >>>>>>>>>> at this point.
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <
> >> pat@occamsmachete.com>
> >>>>>>>>> wrote:
> >>>>>>>>>>> +1
> >>>>>>>>>>>
> >>>>>>>>>>> Is there a PR? You mention a "tiny mahout-hadoop” module. It
> >> would
> >>>>> be
> >>>>>>>>> nice
> >>>>>>>>>>> to see how you’ve structured that in case we can use the same
> >>>>> model to
> >>>>>>>>>>> solve the two remaining refactoring issues.
> >>>>>>>>>>> 1) external dependencies in the spark module
> >>>>>>>>>>> 2) no spark or h2o in the release artifacts.
> >>>>>>>>>>>
> >>>>>>>>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn <sq...@gatech.edu>
> >>>>> wrote:
> >>>>>>>>>>> Also +1
> >>>>>>>>>>>
> >>>>>>>>>>> iPhone'd
> >>>>>>>>>>>
> >>>>>>>>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo <ap.dev@outlook.com
> >
> >>>>>>> wrote:
> >>>>>>>>>>>> +1
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
> >>>>>>>>>>>>
> >>>>>>>>>>>> <div>-------- Original message --------</div><div>From:
> Dmitriy
> >>>>>>>>> Lyubimov
> >>>>>>>>>>> <dl...@gmail.com> </div><div>Date:01/23/2015  6:06 PM
> >>>>> (GMT-05:00)
> >>>>>>>>>>> </div><div>To: dev@mahout.apache.org </div><div>Subject:
> >> Codebase
> >>>>>>>>>>> refactoring proposal </div><div>
> >>>>>>>>>>>> </div>
> >>>>>>>>>>>> So right now mahout-spark depends on mr-legacy.
> >>>>>>>>>>>> I did quick refactoring and it turns out it only _irrevocably_
> >>>>>>> depends
> >>>>>>>>> on
> >>>>>>>>>>>> the following classes there:
> >>>>>>>>>>>>
> >>>>>>>>>>>> MatrixWritable, VectorWritable, Varint/Varlong and
> >> VarintWritable,
> >>>>>>> and
> >>>>>>>>>>> ...
> >>>>>>>>>>>> *sigh* o.a.m.common.Pair
> >>>>>>>>>>>>
> >>>>>>>>>>>> So I just dropped those five classes into a new tiny
> >>>>>>>>> mahout-hadoop
> >>>>>>>>>>>> module (to signify stuff that is directly relevant to
> >> serializing
> >>>>>>>>> things
> >>>>>>>>>>> to
> >>>>>>>>>>>> DFS API) and completely removed mrlegacy and its transients
> from
> >>>>>>> spark
> >>>>>>>>>>> and
> >>>>>>>>>>>> spark-shell dependencies.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So non-cli applications (shell scripts and embedded api use)
> >>>>> actually
> >>>>>>>>>>> only
> >>>>>>>>>>>> need spark dependencies (which come from SPARK_HOME classpath,
> >> of
> >>>>>>>>> course)
> >>>>>>>>>>>> and mahout jars (mahout-spark, mahout-math(-scala),
> >> mahout-hadoop
> >>>>> and
> >>>>>>>>>>>> optionally mahout-spark-shell (for running shell)).
> >>>>>>>>>>>>
> >>>>>>>>>>>> This of course still doesn't address driver problems that want
> >> to
> >>>>>>>>> throw
> >>>>>>>>>>>> more stuff into front-end classpath (such as cli parser) but
> at
> >>>>> least
> >>>>>>>>> it
> >>>>>>>>>>>> renders transitive luggage of mr-legacy (and the size of
> >>>>>>>>> worker-shipped
> >>>>>>>>>>>> jars) much more tolerable.
> >>>>>>>>>>>>
> >>>>>>>>>>>> How does that sound?
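
To make the DFS-serialization point concrete, a minimal sketch of the kind of code that only needs those writables (the path and values are made up; the SequenceFile.Writer option API assumed here is the Hadoop 2.x one):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, SequenceFile}
    import org.apache.mahout.math.{DenseVector, VectorWritable}

    // Write (row key, vector) pairs in the usual DRM-on-DFS layout; only these writables are needed.
    val conf = new Configuration()
    val writer = SequenceFile.createWriter(
      conf,
      SequenceFile.Writer.file(new Path("/tmp/drm-sketch/part-00000")),
      SequenceFile.Writer.keyClass(classOf[IntWritable]),
      SequenceFile.Writer.valueClass(classOf[VectorWritable]))
    try {
      val vw = new VectorWritable()
      for (row <- 0 until 3) {
        vw.set(new DenseVector(Array(row * 1.0, row * 2.0)))
        writer.append(new IntWritable(row), vw)
      }
    } finally {
      writer.close()
    }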
> >>>>>>>>>
> >>
> >>
>
>
>

Re: Codebase refactoring proposal

Posted by Pat Ferrel <pa...@occamsmachete.com>.
This sounds like a great idea but I wonder if we can get rid of Mahout DRM as a native format. If we have DataFrames (have they actually renamed SchemaRDD?) backed DRMs we ideally don’t need Mahout native DRMs or IndexedDatasets, right? This would be a huge step! If we get data interchangeability with MLlib it's a win. If we get general row and column IDs that follow the data through math, it's a win. Need to think through how to use a DataFrame in a streaming case, probably through some checkpointing of the window DStream—hmm.
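
As a rough sketch of the interchange direction (assuming the drmWrap helper from the spark bindings and a made-up triplet schema (docId, termIndex, weight); none of this is an existing API), rows pulled out of a SchemaRDD/DataFrame could be wrapped into a DRM like this:

    import org.apache.spark.SparkContext._
    import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
    import org.apache.mahout.math.drm.DrmLike
    import org.apache.mahout.sparkbindings._

    // cells: (docId, termIndex, weight) triples pulled out of a SchemaRDD/DataFrame,
    // e.g. sql("select docId, termIndex, weight from tfidf")
    //        .map(r => (r.getInt(0), r.getInt(1), r.getDouble(2)))
    def cellsToDrm(cells: org.apache.spark.rdd.RDD[(Int, Int, Double)], ncol: Int): DrmLike[Int] = {
      val drmRdd = cells
        .map { case (docId, termIndex, w) => docId -> (termIndex -> w) }
        .groupByKey()
        .map { case (docId, row) =>
          val v: Vector = new RandomAccessSparseVector(ncol)
          row.foreach { case (termIndex, w) => v.setQuick(termIndex, w) }
          docId -> v
        }
      drmWrap(drmRdd, ncol = ncol)   // from here on it is a normal DRM: A.t %*% A, A %*% A.t, ...
    }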

On Feb 4, 2015, at 7:37 AM, Andrew Palumbo <ap...@outlook.com> wrote:


On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
> I'd suggest to consider this: remember all this talk about
> language-integrated spark ql being basically dataframe manipulation DSL?
> 
> so now Spark devs are noticing this generality as well and are actually
> proposing to rename SchemaRDD into DataFrame and make it mainstream data
> structure. (my "told you so" moment of sorts :)
> 
> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
> DataFrame our two major structures. In particular, standardize on using
> DataFrame for things that may include non-numerical data and require more
> grace about column naming and manipulation. Maybe relevant to TF-IDF work
> when it deals with non-matrix content.
Sounds like a worthy effort to me. We'd basically be implementing an API at the math-scala level for SchemaRDD/DataFrame data structures, correct?

On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> Seems like seq2sparse would be really easy to replace since it takes text
>> files to start with, then the whole pipeline could be kept in rdds. The
>> dictionaries and counts could be either in-memory maps or rdds for use with
>> joins? This would get rid of sequence files completely from the pipeline.
>> Item similarity uses in-memory maps but the plan is to make it more
>> scalable using joins as an alternative with the same API allowing the user
>> to trade-off footprint for speed.

I think you're right; it should be relatively easy. I've been looking at porting seq2sparse to the DSL for a bit now, and the stopper at the DSL level is that we don't have a distributed data structure for strings. Seems like getting a DataFrame implemented, as Dmitriy mentioned above, would take care of this problem.
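
A very rough sketch of that shape of pipeline in plain Spark (sc is the usual SparkContext, the path is made up, and the split-on-non-word tokenizer is a stand-in for the Lucene analyzer seq2sparse uses), just to show the dictionary and df counts staying in RDDs for joins:

    import org.apache.spark.SparkContext._

    // (docPath, text) pairs read straight from text files; no sequence files anywhere.
    val docs = sc.wholeTextFiles("hdfs:///tmp/docs")

    // Stand-in tokenizer.
    val tokenized = docs.mapValues(_.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq)

    // Dictionary term -> column index, kept as an RDD (or collect to a map to trade footprint for speed).
    val dictionary = tokenized.flatMap(_._2).distinct().zipWithIndex()

    // Document frequencies for a later TF-IDF pass.
    val df = tokenized.flatMap { case (_, terms) => terms.distinct.map(_ -> 1) }.reduceByKey(_ + _)

    // Per-doc term counts joined with the dictionary: (doc, termIndex, tf) triples, ready to vectorize.
    val tf = tokenized
      .flatMap { case (doc, terms) => terms.map(t => (t, doc)) }
      .join(dictionary)                                   // (term, (doc, termIndex))
      .map { case (_, (doc, termIndex)) => ((doc, termIndex), 1) }
      .reduceByKey(_ + _)
      .map { case ((doc, termIndex), n) => (doc, termIndex, n) }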

The other issue I'm a little fuzzy on is the distributed collocation mapping; it's a part of the seq2sparse code that I've not spent too much time in.

I think that this would be a very worthy effort as well; I believe seq2sparse is a particularly strong Mahout feature.

I'll start another thread since we're now way off topic from the refactoring proposal.
>> 
>> My use for TF-IDF is for row similarity and would take a DRM (actually
>> IndexedDataset) and calculate row/doc similarities. It works now but only
>> using LLR. This is OK when thinking of the items as tags or metadata but
>> for text tokens something like cosine may be better.
>> 
>> I’d imagine a downsampling phase that would precede TF-IDF using LLR a lot
>> like how CF preferences are downsampled. This would produce a sparsified
>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>> terms before row similarity uses cosine. This is not so good for search but
>> should produce much better similarities than Solr’s “moreLikeThis” and does
>> it for all pairs rather than one at a time.
>> 
>> In any case it can be used to create a personalized content-based
>> recommender or augment a CF recommender with one more indicator type.
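
For the cosine step, a rough DSL sketch (drmA is assumed to be the downsampled, TF-IDF weighted doc matrix; the row normalization is written out by hand here rather than using any existing helper):

    import org.apache.mahout.math.scalabindings.RLikeOps._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // Scale each row to unit length so that A %*% A.t gives cosine similarities directly.
    val drmNormed = drmA.mapBlock() { case (keys, block) =>
      for (r <- 0 until block.nrow) {
        val v = block(r, ::)
        val norm = math.sqrt(v dot v)
        if (norm > 0.0) block(r, ::) := v / norm
      }
      keys -> block
    }

    // All-pairs doc-doc cosine similarity; in practice the LLR downsampling keeps this sparse enough.
    val drmSim = drmNormed %*% drmNormed.t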
>> 
>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>> 
>> 
>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>> Some issues WRT lower level Spark integration:
>>>> 1) interoperability with Spark data. TF-IDF is one example I actually
>> looked at. There may be other things we can pick up from their committers
>> since they have an abundance.
>>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
>> me when someone on the Spark list asked about matrix transpose and an MLlib
>> committer’s answer was something like “why would you want to do that?”.
>> Usually you don’t actually execute the transpose but they don’t even
>> support A’A, AA’, or A’B, which are core to what I work on. At present you
>> pretty much have to choose between MLlib or Mahout for sparse matrix stuff.
>> Maybe a half-way measure is some implicit conversions (ugh, I know). If the
>> DSL could interchange datasets with MLlib, people would be pointed to the
>> DSL for all of a bunch of “why would you want to do that?” features. MLlib
>> seems to be algorithms, not math.
>>>> 3) integration of Streaming. DStreams support most of the RDD
>> interface. Doing a batch recalc on a moving time window would nearly fall
>> out of DStream backed DRMs. This isn’t the same as incremental updates on
>> streaming but it’s a start.
>>>> Last year we were looking at Hadoop MapReduce vs Spark, H2O, and Flink as
>> faster compute engines. So we jumped. Now the need is for streaming and
>> especially incrementally updated streaming. Seems like we need to address
>> this.
>>>> Andrew, regardless of the above having TF-IDF would be super
>> helpful—row similarity for content/text would benefit greatly.
>>>   I will put a PR up soon.
>> Just to clarify, I'll be porting over the (very simple) TF, TFIDF classes
>> and Weight interface over from mr-legacy to math-scala. They're available
>> now in spark-shell but won't be after this refactoring.  These still
>> require a dictionary and frequency count maps to vectorize incoming text,
>> so they're more for use with the old MR seq2sparse and I don't think they
>> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>> Hopefully they'll be of some use.
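
Roughly, the shape of what is being ported looks like the sketch below (the signatures follow the old mr-legacy Weight interface from memory, and the TFIDF formula shown is the textbook one rather than the exact Lucene-style weighting the real class uses, so treat it as illustrative):

    // Rough Scala shape of the mr-legacy vectorizer weighting (illustrative only).
    trait Weight {
      /** tf: term count in the doc, df: docs containing the term, length: doc length, numDocs: corpus size. */
      def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double
    }

    class TF extends Weight {
      // Plain term frequency; df, length and numDocs are ignored.
      def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double = tf.toDouble
    }

    class TFIDF extends Weight {
      // Textbook tf * log(N / df); the real class defers to a Lucene-style similarity.
      def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double =
        tf * math.log(numDocs.toDouble / df)
    }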
>> 
>> On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>> But first I need to do massive fixes and improvements to the distributed
>>>> optimizer itself. Still waiting on green light for that.
>>>> On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dl...@gmail.com> wrote:
>>>> 
>>>>> On Feb 3, 2015 7:20 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
>>>>>> BTW what level of difficulty would making the DSL run on MLlib Vectors
>>>>> and RowMatrix be? Looking at using their hashing TF-IDF but it raises
>>>>> impedance mismatch between DRM and MLlib RowMatrix. This would further
>>>>> reduce artifact size by a bunch.
>>>>> 
>>>>> Short answer, if it were possible, I'd not bother with Mahout code
>> base at
>>>>> all. The problem is it lacks sufficient flexibility semantics and
>>>>> abstraction. Breeze is infinitely better in that department but at
>> the
>>>>> time it was sufficiently worse on abstracting interoperability of
>> matrices
>>>>> with different structures. And mllib does not expose breeze.
>>>>> 
>>>>> Looking forward toward hardware accelerated bolt-on work I just must
>> say
>>>>> after reading breeze code for some time I still have a much clearer plan of
>> how
>>>>> such back hybridization and cost calibration might work with current
>> Mahout
>>>>> math abstractions than with breeze. It is also more in line with my
>> current
>>>>> work tasks.
>>>>> 
>>>>>> Also backing something like a DRM with DStreams. Periodic model recalc
>>>>> with streams is maybe the first step towards truly streaming algos.
>> Looking
>>>>> at DStream -> DRM conversion for A’A, A’B, and AA’ in item and row
>>>>> similarity. Attach Kafka and get evergreen models, if not incrementally
>>>>> updating models.
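
A rough sketch of that first step (the interactions DStream, the window sizes and the Kafka plumbing are all assumed; drmWrap comes from the spark bindings):

    import org.apache.spark.streaming.Seconds
    import org.apache.mahout.math.Vector
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._

    // interactions: DStream[(Int, Vector)] of interaction rows, e.g. assembled from a Kafka stream.
    interactions.window(Seconds(3600), Seconds(600)).foreachRDD { rdd =>
      if (rdd.take(1).nonEmpty) {
        val drmA   = drmWrap(rdd)                      // this window's batch as a DRM
        val drmAtA = (drmA.t %*% drmA).checkpoint()    // periodic A'A recalc; A'B and AA' work the same way
        // ... hand drmAtA off to the item/row similarity pipeline, write out the refreshed model, etc.
      }
    }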
>>>>>> On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>>>>>> bottom line compile-time dependencies are satisfied with no extra
>> stuff
>>>>>> from mr-legacy or its transitives. This is proven by virtue of
>>>>> successful
>>>>>> compilation with no dependency on mr-legacy on the tree.
>>>>>> 
>>>>>> Runtime sufficiency for no extra dependency is proven via running
>> shell
>>>>> or
>>>>>> embedded tests (unit tests) which are successful too. This implies
>>>>>> embedding and shell apis.
>>>>>> 
>>>>>> Issue with guava is typical one. if it were an issue, i wouldn't be
>> able
>>>>> to
>>>>>> compile and/or run stuff. Now, question is what do we do if drivers
>> want
>>>>>> extra stuff that is not found in Spark.
>>>>>> 
>>>>>> Now, It is so nice not to depend on anything extra so i am hesitant to
>>>>>> offer anything  here. either shading or lib with opt-in dependency
>> policy
>>>>>> would suffice though, since it doesn't look like we'd have to have
>> tons
>>>>> of
>>>>>> extra for drivers.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <pa...@occamsmachete.com>
>>>>> wrote:
>>>>>>> I vaguely remember there being a Guava version problem where the
>>>>> version
>>>>>>> had to be rolled back in one of the hadoop modules. The math-scala
>>>>>>> IndexedDataset shouldn’t care about version.
>>>>>>> 
>>>>>>> BTW It seems pretty easy to take out the option parser and replace
>> with
>>>>>>> match and tuples especially if we can extend the Scala App class. It
>>>>> might
>>>>>>> actually simplify things since I can then use several case classes to
>>>>> hold
>>>>>>> options (scopt needed one object), which in turn takes out all those
>>>>> ugly
>>>>>>> casts. I’ll take a look next time I’m in there.
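
Something along these lines, for the record (the option names and the driver object are invented, not the actual flags), showing why the casts go away once each flag lands in a typed case class field:

    // Invented option names, just to show the pattern: one case class per driver, no casts.
    case class SimilarityOpts(input: String = "", output: String = "", maxPrefs: Int = 500)

    def parse(args: List[String], opts: SimilarityOpts = SimilarityOpts()): SimilarityOpts = args match {
      case Nil                       => opts
      case "--input"    :: v :: rest => parse(rest, opts.copy(input = v))
      case "--output"   :: v :: rest => parse(rest, opts.copy(output = v))
      case "--maxPrefs" :: v :: rest => parse(rest, opts.copy(maxPrefs = v.toInt))
      case unknown :: _              => sys.error("Unrecognized option: " + unknown)
    }

    object ItemSimilarityDriver extends App {
      val opts = parse(args.toList)  // opts.input, opts.output, opts.maxPrefs are already typed
      // ... run the job with opts
    }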
>>>>>>> 
>>>>>>> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>>>> wrote:
>>>>>>> in 'spark' module it is overwritten with spark dependency, which also
>>>>> comes
>>>>>>> at the same version so happens. so should be fine with 1.1.x
>>>>>>> 
>>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
>>>>>>> mahout-spark_2.10 ---
>>>>>>> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
>>>>>>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
>>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
>>>>>>> [INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
>>>>>>> [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
>>>>>>> [INFO] |  |  |  +- commons-logging:commons-logging:jar:1.1.3:compile
>>>>>>> [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
>>>>>>> [INFO] |  |  |  +-
>>>>>>> commons-configuration:commons-configuration:jar:1.6:compile
>>>>>>> [INFO] |  |  |  |  +-
>>>>>>> commons-collections:commons-collections:jar:3.2.1:compile
>>>>>>> [INFO] |  |  |  |  +-
>> commons-digester:commons-digester:jar:1.8:compile
>>>>>>> [INFO] |  |  |  |  |  \-
>>>>>>> commons-beanutils:commons-beanutils:jar:1.7.0:compile
>>>>>>> [INFO] |  |  |  |  \-
>>>>>>> commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
>>>>>>> [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
>>>>>>> [INFO] |  |  |  +-
>> com.google.protobuf:protobuf-java:jar:2.5.0:compile
>>>>>>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  \-
>>>>> org.apache.commons:commons-compress:jar:1.4.1:compile
>>>>>>> [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
>>>>>>> [INFO] |  |  +-
>>>>>>> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  +-
>>>>>>> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  |  +-
>>>>>>> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
>>>>>>> [INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
>>>>>>> [INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
>>>>>>> [INFO] |  |  |  |  |  +-
>>>>>>> 
>>>>>>> 
>> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |  +-
>>>>>>> 
>>>>>>> 
>> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |  |  +-
>>>>>>> javax.servlet:javax.servlet-api:jar:3.0.1:compile
>>>>>>> [INFO] |  |  |  |  |  |  |  \-
>>>>> com.sun.jersey:jersey-client:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |  \-
>>>>> com.sun.jersey:jersey-grizzly2:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |     +-
>>>>>>> org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
>>>>>>> [INFO] |  |  |  |  |  |     |  \-
>>>>>>> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
>>>>>>> [INFO] |  |  |  |  |  |     |     \-
>>>>>>> org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
>>>>>>> [INFO] |  |  |  |  |  |     |        \-
>>>>>>> org.glassfish.external:management-api:jar:3.0.0-b012:compile
>>>>>>> [INFO] |  |  |  |  |  |     +-
>>>>>>> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
>>>>>>> [INFO] |  |  |  |  |  |     |  \-
>>>>>>> org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
>>>>>>> [INFO] |  |  |  |  |  |     +-
>>>>>>> org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
>>>>>>> [INFO] |  |  |  |  |  |     \-
>>>>> org.glassfish:javax.servlet:jar:3.1:compile
>>>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
>>>>>>> [INFO] |  |  |  |  |  |  \-
>> com.sun.jersey:jersey-core:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |  +-
>>>>> org.codehaus.jettison:jettison:jar:1.1:compile
>>>>>>> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
>>>>>>> [INFO] |  |  |  |  |  |  +-
>>>>> com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
>>>>>>> [INFO] |  |  |  |  |  |  |  \-
>>>>> javax.xml.bind:jaxb-api:jar:2.2.2:compile
>>>>>>> [INFO] |  |  |  |  |  |  |     \-
>>>>>>> javax.activation:activation:jar:1.1:compile
>>>>>>> [INFO] |  |  |  |  |  |  +-
>>>>>>> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
>>>>>>> [INFO] |  |  |  |  |  |  \-
>>>>>>> org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
>>>>>>> [INFO] |  |  |  |  |  \-
>>>>>>> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  \-
>>>>>>> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  \-
>>>>>>> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
>>>>>>> [INFO] |  |  +-
>>>>>>> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  \-
>>>>> org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
>>>>>>> [INFO] |  |  +-
>>>>>>> org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
>>>>>>> [INFO] |  |  \-
>> org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
>>>>>>> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
>>>>>>> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
>>>>>>> [INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
>>>>>>> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
>>>>>>> [INFO] |  |  +-
>> org.apache.curator:curator-framework:jar:2.4.0:compile
>>>>>>> [INFO] |  |  |  \-
>> org.apache.curator:curator-client:jar:2.4.0:compile
>>>>>>> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
>>>>>>> [INFO] |  |     \- jline:jline:jar:0.9.94:compile
>>>>>>> [INFO] |  +-
>> org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  +-
>>>>>>> 
>> org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
>>>>>>> [INFO] |  |  +-
>>>>> org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  |  +-
>>>>> org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  |  \-
>>>>>>> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  \-
>>>>> org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |     \-
>>>>>>> 
>>>>>>> 
>> org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
>>>>>>> [INFO] |  |        \-
>>>>>>> 
>> org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
>>>>>>> [INFO] |  +-
>>>>> org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  +-
>> org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  +-
>>>>> org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  +-
>>>>>>> org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
>>>>>>> [INFO] |  |  +-
>>>>>>> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  \-
>>>>> org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |     \-
>>>>> org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  +- com.google.guava:guava:jar:16.0:compile
>>>>>>> d
>>>>>>> 
>>>>>>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> looks like it is also requested by mahout-math, wonder what is using
>>>>> it
>>>>>>>> there.
>>>>>>>> 
>>>>>>>> At very least, it needs to be synchronized to the one currently used
>>>>> by
>>>>>>>> spark.
>>>>>>>> 
>>>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
>>>>> mahout-hadoop
>>>>>>>> ---
>>>>>>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
>>>>>>>> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
>>>>>>>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
>>>>>>>> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
>>>>>>>> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
>>>>>>>> [INFO] +-
>>>>> org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
>>>>>>>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>>>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <pa...@occamsmachete.com>
>>>>>>> wrote:
>>>>>>>>> Looks like Guava is in Spark.
>>>>>>>>> 
>>>>>>>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <pa...@occamsmachete.com>
>>>>> wrote:
>>>>>>>>> IndexedDataset uses Guava. Can’t tell from sure but it sounds like
>>>>> this
>>>>>>>>> would not be included since I think it was taken from the mrlegacy
>>>>> jar.
>>>>>>>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <dl...@gmail.com>
>>>>>>> wrote:
>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>> From: "Pat Ferrel" <pa...@occamsmachete.com>
>>>>>>>>> Date: Jan 25, 2015 9:39 AM
>>>>>>>>> Subject: Re: Codebase refactoring proposal
>>>>>>>>> To: <de...@mahout.apache.org>
>>>>>>>>> Cc:
>>>>>>>>> 
>>>>>>>>>> When you get a chance a PR would be good.
>>>>>>>>> Yes, it would. And not just for that.
>>>>>>>>> 
>>>>>>>>>> As I understand it you are putting some class jars somewhere in
>> the
>>>>>>>>> classpath. Where? How?
>>>>>>>>> /bin/mahout
>>>>>>>>> 
>>>>>>>>> (Computes 2 different classpaths. See  'bin/mahout classpath' vs.
>>>>>>>>> 'bin/mahout -spark'.)
>>>>>>>>> 
>>>>>>>>> If i interpret current shell code there correctky, legacy path
>> tries
>>>>> to
>>>>>>>>> use
>>>>>>>>> examples assemblies if not packaged, or /lib if packaged. True
>>>>>>> motivation
>>>>>>>>> of that significantly predates 2010 and i suspect only Benson knows
>>>>>>> whole
>>>>>>>>> true intent there.
>>>>>>>>> 
>>>>>>>>> The spark path, which is really a quick hack of the script, tries
>> to
>>>>> get
>>>>>>>>> only selected mahout jars and locally instlalled spark classpath
>>>>> which i
>>>>>>>>> guess is just the shaded spark jar in recent spark releases. It
>> also
>>>>>>>>> apparently tries to include /libs/*, which is never compiled in
>>>>>>> unpackaged
>>>>>>>>> version, and now i think it is a bug it is included  because
>> /libs/*
>>>>> is
>>>>>>>>> apparently legacy packaging, and shouldnt be used  in spark jobs
>>>>> with a
>>>>>>>>> wildcard. I cant beleive how lazy i am, i still did not find time
>> to
>>>>>>>>> understand mahout build in all cases.
>>>>>>>>> 
>>>>>>>>> I am not even sure if packaged mahout will work with spark,
>> honestly,
>>>>>>>>> because of the /lib. Never tried that, since i mostly use
>> application
>>>>>>>>> embedding techniques.
>>>>>>>>> 
>>>>>>>>> The same solution may apply to adding external dependencies and
>>>>> removing
>>>>>>>>> the assembly in the Spark module. Which would leave only one major
>>>>> build
>>>>>>>>> issue afaik.
>>>>>>>>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>>>>>>>> wrote:
>>>>>>>>>> No, no PR. Only experiment on private. But i believe i
>> sufficiently
>>>>>>>>> defined
>>>>>>>>>> what i want to do in order to gauge if we may want to advance it
>>>>> some
>>>>>>>>> time
>>>>>>>>>> later. Goal is much lighter dependency for spark code. Eliminate
>>>>>>>>> everything
>>>>>>>>>> that is not compile-time dependent. (and a lot of it is thru
>> legacy
>>>>> MR
>>>>>>>>> code
>>>>>>>>>> which we of course don't use).
>>>>>>>>>> 
>>>>>>>>>> Cant say i understand the remaining issues you are talking about
>>>>>>> though.
>>>>>>>>>> If you are talking about compiling lib or shaded assembly, no,
>> this
>>>>>>>>> doesn't
>>>>>>>>>> do anything about it. Although point is, as it stands, the algebra
>>>>> and
>>>>>>>>>> shell don't have any external dependencies but spark and these 4
>>>>> (5?)
>>>>>>>>>> mahout jars so they technically don't even need an assembly (as
>>>>>>>>>> demonstrated).
>>>>>>>>>> 
>>>>>>>>>> As i said, it seems driver code is the only one that may need some
>>>>>>>>> external
>>>>>>>>>> dependencies, but that's a different scenario from those i am
>>>>> talking
>>>>>>>>>> about. But i am relatively happy with having the first two working
>>>>>>>>> nicely
>>>>>>>>>> at this point.
>>>>>>>>>> 
>>>>>>>>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <
>> pat@occamsmachete.com>
>>>>>>>>> wrote:
>>>>>>>>>>> +1
>>>>>>>>>>> 
>>>>>>>>>>> Is there a PR? You mention a "tiny mahout-hadoop” module. It
>> would
>>>>> be
>>>>>>>>> nice
>>>>>>>>>>> to see how you’ve structured that in case we can use the same
>>>>> model to
>>>>>>>>>>> solve the two remaining refactoring issues.
>>>>>>>>>>> 1) external dependencies in the spark module
>>>>>>>>>>> 2) no spark or h2o in the release artifacts.
>>>>>>>>>>> 
>>>>>>>>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn <sq...@gatech.edu>
>>>>> wrote:
>>>>>>>>>>> Also +1
>>>>>>>>>>> 
>>>>>>>>>>> iPhone'd
>>>>>>>>>>> 
>>>>>>>>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo <ap...@outlook.com>
>>>>>>> wrote:
>>>>>>>>>>>> +1
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>>>>>>>> 
>>>>>>>>>>>> -------- Original message --------
>>>>>>>>>>>> From: Dmitriy Lyubimov <dl...@gmail.com>
>>>>>>>>>>>> Date: 01/23/2015 6:06 PM (GMT-05:00)
>>>>>>>>>>>> To: dev@mahout.apache.org
>>>>>>>>>>>> Subject: Codebase refactoring proposal
>>>>>>>>>>>> So right now mahout-spark depends on mr-legacy.
>>>>>>>>>>>> I did quick refactoring and it turns out it only _irrevocably_
>>>>>>> depends
>>>>>>>>> on
>>>>>>>>>>>> the following classes there:
>>>>>>>>>>>> 
>>>>>>>>>>>> MatrixWritable, VectorWritable, Varint/Varlong and
>> VarintWritable,
>>>>>>> and
>>>>>>>>>>> ...
>>>>>>>>>>>> *sigh* o.a.m.common.Pair
>>>>>>>>>>>> 
>>>>>>>>>>>> So  I just dropped those five classes into new a new tiny
>>>>>>>>> mahout-hadoop
>>>>>>>>>>>> module (to signify stuff that is directly relevant to
>> serializing
>>>>>>>>> thigns
>>>>>>>>>>> to
>>>>>>>>>>>> DFS API) and completely removed mrlegacy and its transients from
>>>>>>> spark
>>>>>>>>>>> and
>>>>>>>>>>>> spark-shell dependencies.
>>>>>>>>>>>> 
>>>>>>>>>>>> So non-cli applications (shell scripts and embedded api use)
>>>>> actually
>>>>>>>>>>> only
>>>>>>>>>>>> need spark dependencies (which come from SPARK_HOME classpath,
>> of
>>>>>>>>> course)
>>>>>>>>>>>> and mahout jars (mahout-spark, mahout-math(-scala),
>> mahout-hadoop
>>>>> and
>>>>>>>>>>>> optionally mahout-spark-shell (for running shell)).
>>>>>>>>>>>> 
>>>>>>>>>>>> This of course still doesn't address driver problems that want
>> to
>>>>>>>>> throw
>>>>>>>>>>>> more stuff into front-end classpath (such as cli parser) but at
>>>>> least
>>>>>>>>> it
>>>>>>>>>>>> renders transitive luggage of mr-legacy (and the size of
>>>>>>>>> worker-shipped
>>>>>>>>>>>> jars) much more tolerable.
>>>>>>>>>>>> 
>>>>>>>>>>>> How does that sound?
>>>>>>>>> 
>> 
>> 



Re: Codebase refactoring proposal

Posted by Andrew Palumbo <ap...@outlook.com>.
On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
> I'd suggest to consider this: remember all this talk about
> language-integrated spark ql being basically dataframe manipulation DSL?
>
> so now Spark devs are noticing this generality as well and are actually
> proposing to rename SchemaRDD into DataFrame and make it mainstream data
> structure. (my "told you so" moment of sorts :)
>
> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
> DataFrame our two major structures. In particular, standardize on using
> DataFrame for things that may include non-numerical data and require more
> grace about column naming and manipulation. Maybe relevant to TF-IDF work
> when it deals with non-matrix content.
Sounds like a worthy effort to me. We'd basically be implementing an
API at the math-scala level for SchemaRDD/DataFrame data structures, correct?

> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> Seems like seq2sparse would be really easy to replace since it takes text
>> files to start with, then the whole pipeline could be kept in rdds. The
>> dictionaries and counts could be either in-memory maps or rdds for use with
>> joins? This would get rid of sequence files completely from the pipeline.
>> Item similarity uses in-memory maps but the plan is to make it more
>> scalable using joins as an alternative with the same API allowing the user
>> to trade-off footprint for speed.

I think you're right - it should be relatively easy. I've been looking at
porting seq2sparse to the DSL for a bit now, and the stopper at the DSL
level is that we don't have a distributed data structure for strings.
Seems like getting a DataFrame implemented as Dmitriy mentioned above
would take care of this problem.
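
To make that concrete, here is a rough sketch of the kind of distributed
"strings" structure I mean and a dictionary built straight from it - plain
Spark RDD calls only, assuming a SparkContext sc (e.g. the shell's); the
paths and the naive tokenizer are made up:

    import org.apache.spark.SparkContext._   // pair-RDD implicits
    import org.apache.spark.rdd.RDD

    // documents kept distributed as (docId, text) instead of sequence files
    val docs: RDD[(Long, String)] =
      sc.wholeTextFiles("text-corpus/").values.zipWithIndex().map(_.swap)

    // term dictionary derived without ever leaving the cluster
    val dictionary: Map[String, Int] = docs
      .flatMap { case (_, text) => text.toLowerCase.split("\\W+").filter(_.nonEmpty) }
      .distinct()
      .zipWithIndex()          // (term, termId)
      .mapValues(_.toInt)
      .collect()
      .toMap                   // small enough to collect here, or keep as an RDD for joins

A DataFrame would just be a nicer, named-column version of the same thing.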

The other issue I'm a little fuzzy on is the distributed collocation
mapping - it's a part of the seq2sparse code that I've not spent too
much time in.

I think that this would be a very worthy effort as well - I believe
seq2sparse is a particularly strong Mahout feature.

I'll start another thread since we're now way off topic from the 
refactoring proposal.
>>
>> My use for TF-IDF is for row similarity and would take a DRM (actually
>> IndexedDataset) and calculate row/doc similarities. It works now but only
>> using LLR. This is OK when thinking of the items as tags or metadata but
>> for text tokens something like cosine may be better.
>>
>> I’d imagine a downsampling phase that would precede TF-IDF using LLR a lot
>> like how CF preferences are downsampled. This would produce an sparsified
>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>> terms before row similarity uses cosine. This is not so good for search but
>> should produce much better similarities than Solr’s “moreLikeThis” and does
>> it for all pairs rather than one at a time.
>>
>> In any case it can be used to do a create a personalized content-based
>> recommender or augment a CF recommender with one more indicator type.
>>
>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>>
>>
>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>> Some issues WRT lower level Spark integration:
>>>> 1) interoperability with Spark data. TF-IDF is one example I actually
>> looked at. There may be other things we can pick up from their committers
>> since they have an abundance.
>>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
>> me when someone on the Spark list asked about matrix transpose and an MLlib
>> committer’s answer was something like “why would you want to do that?”.
>> Usually you don’t actually execute the transpose but they don’t even
>> support A’A, AA’, or A’B, which are core to what I work on. At present you
>> pretty much have to choose between MLlib or Mahout for sparse matrix stuff.
>> Maybe a half-way measure is some implicit conversions (ugh, I know). If the
>> DSL could interchange datasets with MLlib, people would be pointed to the
>> DSL for all of a bunch of “why would you want to do that?” features. MLlib
>> seems to be algorithms, not math.
>>>> 3) integration of Streaming. DStreams support most of the RDD
>> interface. Doing a batch recalc on a moving time window would nearly fall
>> out of DStream backed DRMs. This isn’t the same as incremental updates on
>> streaming but it’s a start.
>>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
>> faster compute engines. So we jumped. Now the need is for streaming and
>> especially incrementally updated streaming. Seems like we need to address
>> this.
>>>> Andrew, regardless of the above having TF-IDF would be super
>> helpful—row similarity for content/text would benefit greatly.
>>>    I will put a PR up soon.
>> Just to clarify, I'll be porting over the (very simple) TF, TFIDF classes
>> and Weight interface over from mr-legacy to math-scala. They're available
>> now in spark-shell but won't be after this refactoring.  These still
>> require dictionary and a frequency count maps to vectorize incoming text-
>> so they're more for use with the old MR seq2sparse and I don't think they
>> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
>> Hopefully they'll be of some use.
>>
>> On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>> But first I need to do massive fixes and improvements to the distributed
>>>> optimizer itself. Still waiting on green light for that.
>>>> On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dl...@gmail.com> wrote:
>>>>
>>>>> On Feb 3, 2015 7:20 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
>>>>>> BTW what level of difficulty would making the DSL run on MLlib Vectors
>>>>> and RowMatrix be? Looking at using their hashing TF-IDF but it raises
>>>>> impedance mismatch between DRM and MLlib RowMatrix. This would further
>>>>> reduce artifact size by a bunch.
>>>>>
>>>>> Short answer, if it were possible, I'd not bother with Mahout code
>> base at
>>>>> all. The problem is it lacks sufficient flexibility semantics and
>>>>> abstruction. Breeze is indefinitely better in that department but at
>> the
>>>>> time it was sufficiently worse on abstracting interoperability of
>> matrices
>>>>> with different structures. And mllib does not expose breeze.
>>>>>
>>>>> Looking forward toward hardware acellerated bolt-on work I just must
>> say
>>>>> after reading breeze code for some time I still have much clearer plan
>> how
>>>>> such back hybridization and cost calibration might work with current
>> Mahout
>>>>> math abstractions than with breeze. It is also more in line with my
>> current
>>>>> work tasks.
>>>>>
>>>>>> Also backing something like a DRM with DStreams. Periodic model recalc
>>>>> with streams is maybe the first step towards truly streaming algos.
>> Looking
>>>>> at DStream -> DRM conversion for A’A, A’B, and AA’ in item and row
>>>>> similarity. Attach Kafka and get evergreen models, if not incrementally
>>>>> updating models.
>>>>>> On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>>>>>> bottom line compile-time dependencies are satisfied with no extra
>> stuff
>>>>>> from mr-legacy or its transitives. This is proven by virtue of
>>>>> successful
>>>>>> compilation with no dependency on mr-legacy on the tree.
>>>>>>
>>>>>> Runtime sufficiency for no extra dependency is proven via running
>> shell
>>>>> or
>>>>>> embedded tests (unit tests) which are successful too. This implies
>>>>>> embedding and shell apis.
>>>>>>
>>>>>> Issue with guava is typical one. if it were an issue, i wouldn't be
>> able
>>>>> to
>>>>>> compile and/or run stuff. Now, question is what do we do if drivers
>> want
>>>>>> extra stuff that is not found in Spark.
>>>>>>
>>>>>> Now, It is so nice not to depend on anything extra so i am hesitant to
>>>>>> offer anything  here. either shading or lib with opt-in dependency
>> policy
>>>>>> would suffice though, since it doesn't look like we'd have to have
>> tons
>>>>> of
>>>>>> extra for drivers.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <pa...@occamsmachete.com>
>>>>> wrote:
>>>>>>> I vaguely remember there being a Guava version problem where the
>>>>> version
>>>>>>> had to be rolled back in one of the hadoop modules. The math-scala
>>>>>>> IndexedDataset shouldn’t care about version.
>>>>>>>
>>>>>>> BTW It seems pretty easy to take out the option parser and replace
>> with
>>>>>>> match and tuples especially if we can extend the Scala App class. It
>>>>> might
>>>>>>> actually simplify things since I can then use several case classes to
>>>>> hold
>>>>>>> options (scopt needed one object), which in turn takes out all those
>>>>> ugly
>>>>>>> casts. I’ll take a look next time I’m in there.
>>>>>>>
>>>>>>> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>>>> wrote:
>>>>>>> in 'spark' module it is overwritten with spark dependency, which also
>>>>> comes
>>>>>>> at the same version so happens. so should be fine with 1.1.x
>>>>>>>
>>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
>>>>>>> mahout-spark_2.10 ---
>>>>>>> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
>>>>>>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
>>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
>>>>>>> [INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
>>>>>>> [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
>>>>>>> [INFO] |  |  |  +- commons-logging:commons-logging:jar:1.1.3:compile
>>>>>>> [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
>>>>>>> [INFO] |  |  |  +-
>>>>>>> commons-configuration:commons-configuration:jar:1.6:compile
>>>>>>> [INFO] |  |  |  |  +-
>>>>>>> commons-collections:commons-collections:jar:3.2.1:compile
>>>>>>> [INFO] |  |  |  |  +-
>> commons-digester:commons-digester:jar:1.8:compile
>>>>>>> [INFO] |  |  |  |  |  \-
>>>>>>> commons-beanutils:commons-beanutils:jar:1.7.0:compile
>>>>>>> [INFO] |  |  |  |  \-
>>>>>>> commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
>>>>>>> [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
>>>>>>> [INFO] |  |  |  +-
>> com.google.protobuf:protobuf-java:jar:2.5.0:compile
>>>>>>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  \-
>>>>> org.apache.commons:commons-compress:jar:1.4.1:compile
>>>>>>> [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
>>>>>>> [INFO] |  |  +-
>>>>>>> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  +-
>>>>>>> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  |  +-
>>>>>>> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
>>>>>>> [INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
>>>>>>> [INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
>>>>>>> [INFO] |  |  |  |  |  +-
>>>>>>>
>>>>>>>
>> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |  +-
>>>>>>>
>>>>>>>
>> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |  |  +-
>>>>>>> javax.servlet:javax.servlet-api:jar:3.0.1:compile
>>>>>>> [INFO] |  |  |  |  |  |  |  \-
>>>>> com.sun.jersey:jersey-client:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |  \-
>>>>> com.sun.jersey:jersey-grizzly2:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |     +-
>>>>>>> org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
>>>>>>> [INFO] |  |  |  |  |  |     |  \-
>>>>>>> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
>>>>>>> [INFO] |  |  |  |  |  |     |     \-
>>>>>>> org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
>>>>>>> [INFO] |  |  |  |  |  |     |        \-
>>>>>>> org.glassfish.external:management-api:jar:3.0.0-b012:compile
>>>>>>> [INFO] |  |  |  |  |  |     +-
>>>>>>> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
>>>>>>> [INFO] |  |  |  |  |  |     |  \-
>>>>>>> org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
>>>>>>> [INFO] |  |  |  |  |  |     +-
>>>>>>> org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
>>>>>>> [INFO] |  |  |  |  |  |     \-
>>>>> org.glassfish:javax.servlet:jar:3.1:compile
>>>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
>>>>>>> [INFO] |  |  |  |  |  |  \-
>> com.sun.jersey:jersey-core:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |  +-
>>>>> org.codehaus.jettison:jettison:jar:1.1:compile
>>>>>>> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
>>>>>>> [INFO] |  |  |  |  |  |  +-
>>>>> com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
>>>>>>> [INFO] |  |  |  |  |  |  |  \-
>>>>> javax.xml.bind:jaxb-api:jar:2.2.2:compile
>>>>>>> [INFO] |  |  |  |  |  |  |     \-
>>>>>>> javax.activation:activation:jar:1.1:compile
>>>>>>> [INFO] |  |  |  |  |  |  +-
>>>>>>> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
>>>>>>> [INFO] |  |  |  |  |  |  \-
>>>>>>> org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
>>>>>>> [INFO] |  |  |  |  |  \-
>>>>>>> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  \-
>>>>>>> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  \-
>>>>>>> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
>>>>>>> [INFO] |  |  +-
>>>>>>> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  \-
>>>>> org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
>>>>>>> [INFO] |  |  +-
>>>>>>> org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
>>>>>>> [INFO] |  |  \-
>> org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
>>>>>>> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
>>>>>>> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
>>>>>>> [INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
>>>>>>> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
>>>>>>> [INFO] |  |  +-
>> org.apache.curator:curator-framework:jar:2.4.0:compile
>>>>>>> [INFO] |  |  |  \-
>> org.apache.curator:curator-client:jar:2.4.0:compile
>>>>>>> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
>>>>>>> [INFO] |  |     \- jline:jline:jar:0.9.94:compile
>>>>>>> [INFO] |  +-
>> org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  +-
>>>>>>>
>> org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
>>>>>>> [INFO] |  |  +-
>>>>> org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  |  +-
>>>>> org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  |  \-
>>>>>>> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  \-
>>>>> org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |     \-
>>>>>>>
>>>>>>>
>> org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
>>>>>>> [INFO] |  |        \-
>>>>>>>
>> org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
>>>>>>> [INFO] |  +-
>>>>> org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  +-
>> org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  +-
>>>>> org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  +-
>>>>>>> org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
>>>>>>> [INFO] |  |  +-
>>>>>>> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  \-
>>>>> org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |     \-
>>>>> org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  +- com.google.guava:guava:jar:16.0:compile
>>>>>>> d
>>>>>>>
>>>>>>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>>>>>> wrote:
>>>>>>>
>>>>>>>> looks like it is also requested by mahout-math, wonder what is using
>>>>> it
>>>>>>>> there.
>>>>>>>>
>>>>>>>> At very least, it needs to be synchronized to the one currently used
>>>>> by
>>>>>>>> spark.
>>>>>>>>
>>>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
>>>>> mahout-hadoop
>>>>>>>> ---
>>>>>>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
>>>>>>>> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
>>>>>>>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
>>>>>>>> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
>>>>>>>> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
>>>>>>>> [INFO] +-
>>>>> org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
>>>>>>>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>>>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <pa...@occamsmachete.com>
>>>>>>> wrote:
>>>>>>>>> Looks like Guava is in Spark.
>>>>>>>>>
>>>>>>>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <pa...@occamsmachete.com>
>>>>> wrote:
>>>>>>>>> IndexedDataset uses Guava. Can’t tell from sure but it sounds like
>>>>> this
>>>>>>>>> would not be included since I think it was taken from the mrlegacy
>>>>> jar.
>>>>>>>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <dl...@gmail.com>
>>>>>>> wrote:
>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>> From: "Pat Ferrel" <pa...@occamsmachete.com>
>>>>>>>>> Date: Jan 25, 2015 9:39 AM
>>>>>>>>> Subject: Re: Codebase refactoring proposal
>>>>>>>>> To: <de...@mahout.apache.org>
>>>>>>>>> Cc:
>>>>>>>>>
>>>>>>>>>> When you get a chance a PR would be good.
>>>>>>>>> Yes, it would. And not just for that.
>>>>>>>>>
>>>>>>>>>> As I understand it you are putting some class jars somewhere in
>> the
>>>>>>>>> classpath. Where? How?
>>>>>>>>> /bin/mahout
>>>>>>>>>
>>>>>>>>> (Computes 2 different classpaths. See  'bin/mahout classpath' vs.
>>>>>>>>> 'bin/mahout -spark'.)
>>>>>>>>>
>>>>>>>>> If i interpret current shell code there correctky, legacy path
>> tries
>>>>> to
>>>>>>>>> use
>>>>>>>>> examples assemblies if not packaged, or /lib if packaged. True
>>>>>>> motivation
>>>>>>>>> of that significantly predates 2010 and i suspect only Benson knows
>>>>>>> whole
>>>>>>>>> true intent there.
>>>>>>>>>
>>>>>>>>> The spark path, which is really a quick hack of the script, tries
>> to
>>>>> get
>>>>>>>>> only selected mahout jars and locally instlalled spark classpath
>>>>> which i
>>>>>>>>> guess is just the shaded spark jar in recent spark releases. It
>> also
>>>>>>>>> apparently tries to include /libs/*, which is never compiled in
>>>>>>> unpackaged
>>>>>>>>> version, and now i think it is a bug it is included  because
>> /libs/*
>>>>> is
>>>>>>>>> apparently legacy packaging, and shouldnt be used  in spark jobs
>>>>> with a
>>>>>>>>> wildcard. I cant beleive how lazy i am, i still did not find time
>> to
>>>>>>>>> understand mahout build in all cases.
>>>>>>>>>
>>>>>>>>> I am not even sure if packaged mahout will work with spark,
>> honestly,
>>>>>>>>> because of the /lib. Never tried that, since i mostly use
>> application
>>>>>>>>> embedding techniques.
>>>>>>>>>
>>>>>>>>> The same solution may apply to adding external dependencies and
>>>>> removing
>>>>>>>>> the assembly in the Spark module. Which would leave only one major
>>>>> build
>>>>>>>>> issue afaik.
>>>>>>>>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>>>>>>>> wrote:
>>>>>>>>>> No, no PR. Only experiment on private. But i believe i
>> sufficiently
>>>>>>>>> defined
>>>>>>>>>> what i want to do in order to gauge if we may want to advance it
>>>>> some
>>>>>>>>> time
>>>>>>>>>> later. Goal is much lighter dependency for spark code. Eliminate
>>>>>>>>> everything
>>>>>>>>>> that is not compile-time dependent. (and a lot of it is thru
>> legacy
>>>>> MR
>>>>>>>>> code
>>>>>>>>>> which we of course don't use).
>>>>>>>>>>
>>>>>>>>>> Cant say i understand the remaining issues you are talking about
>>>>>>> though.
>>>>>>>>>> If you are talking about compiling lib or shaded assembly, no,
>> this
>>>>>>>>> doesn't
>>>>>>>>>> do anything about it. Although point is, as it stands, the algebra
>>>>> and
>>>>>>>>>> shell don't have any external dependencies but spark and these 4
>>>>> (5?)
>>>>>>>>>> mahout jars so they technically don't even need an assembly (as
>>>>>>>>>> demonstrated).
>>>>>>>>>>
>>>>>>>>>> As i said, it seems driver code is the only one that may need some
>>>>>>>>> external
>>>>>>>>>> dependencies, but that's a different scenario from those i am
>>>>> talking
>>>>>>>>>> about. But i am relatively happy with having the first two working
>>>>>>>>> nicely
>>>>>>>>>> at this point.
>>>>>>>>>>
>>>>>>>>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <
>> pat@occamsmachete.com>
>>>>>>>>> wrote:
>>>>>>>>>>> +1
>>>>>>>>>>>
>>>>>>>>>>> Is there a PR? You mention a "tiny mahout-hadoop” module. It
>> would
>>>>> be
>>>>>>>>> nice
>>>>>>>>>>> to see how you’ve structured that in case we can use the same
>>>>> model to
>>>>>>>>>>> solve the two remaining refactoring issues.
>>>>>>>>>>> 1) external dependencies in the spark module
>>>>>>>>>>> 2) no spark or h2o in the release artifacts.
>>>>>>>>>>>
>>>>>>>>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn <sq...@gatech.edu>
>>>>> wrote:
>>>>>>>>>>> Also +1
>>>>>>>>>>>
>>>>>>>>>>> iPhone'd
>>>>>>>>>>>
>>>>>>>>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo <ap...@outlook.com>
>>>>>>> wrote:
>>>>>>>>>>>> +1
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>>>>>>>>
>>>>>>>>>>>> <div>-------- Original message --------</div><div>From: Dmitriy
>>>>>>>>> Lyubimov
>>>>>>>>>>> <dl...@gmail.com> </div><div>Date:01/23/2015  6:06 PM
>>>>> (GMT-05:00)
>>>>>>>>>>> </div><div>To: dev@mahout.apache.org </div><div>Subject:
>> Codebase
>>>>>>>>>>> refactoring proposal </div><div>
>>>>>>>>>>>> </div>
>>>>>>>>>>>> So right now mahout-spark depends on mr-legacy.
>>>>>>>>>>>> I did quick refactoring and it turns out it only _irrevocably_
>>>>>>> depends
>>>>>>>>> on
>>>>>>>>>>>> the following classes there:
>>>>>>>>>>>>
>>>>>>>>>>>> MatrixWritable, VectorWritable, Varint/Varlong and
>> VarintWritable,
>>>>>>> and
>>>>>>>>>>> ...
>>>>>>>>>>>> *sigh* o.a.m.common.Pair
>>>>>>>>>>>>
>>>>>>>>>>>> So  I just dropped those five classes into new a new tiny
>>>>>>>>> mahout-hadoop
>>>>>>>>>>>> module (to signify stuff that is directly relevant to
>> serializing
>>>>>>>>> thigns
>>>>>>>>>>> to
>>>>>>>>>>>> DFS API) and completely removed mrlegacy and its transients from
>>>>>>> spark
>>>>>>>>>>> and
>>>>>>>>>>>> spark-shell dependencies.
>>>>>>>>>>>>
>>>>>>>>>>>> So non-cli applications (shell scripts and embedded api use)
>>>>> actually
>>>>>>>>>>> only
>>>>>>>>>>>> need spark dependencies (which come from SPARK_HOME classpath,
>> of
>>>>>>>>> course)
>>>>>>>>>>>> and mahout jars (mahout-spark, mahout-math(-scala),
>> mahout-hadoop
>>>>> and
>>>>>>>>>>>> optionally mahout-spark-shell (for running shell)).
>>>>>>>>>>>>
>>>>>>>>>>>> This of course still doesn't address driver problems that want
>> to
>>>>>>>>> throw
>>>>>>>>>>>> more stuff into front-end classpath (such as cli parser) but at
>>>>> least
>>>>>>>>> it
>>>>>>>>>>>> renders transitive luggage of mr-legacy (and the size of
>>>>>>>>> worker-shipped
>>>>>>>>>>>> jars) much more tolerable.
>>>>>>>>>>>>
>>>>>>>>>>>> How does that sound?
>>>>>>>>>
>>
>>


Re: Codebase refactoring proposal

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I'd suggest considering this: remember all this talk about
language-integrated Spark QL being basically a dataframe manipulation DSL?

So now Spark devs are noticing this generality as well and are actually
proposing to rename SchemaRDD into DataFrame and make it a mainstream data
structure. (My "told you so" moment of sorts :)

What I am getting at: I'd suggest we make DRM and Spark's newly renamed
DataFrame our two major structures. In particular, standardize on using
DataFrame for things that may include non-numerical data and require more
grace about column naming and manipulation. Maybe relevant to the TF-IDF work
when it deals with non-matrix content.
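
To illustrate, a rough sketch only - this assumes the post-rename DataFrame
API (still SchemaRDD in released Spark), our spark bindings' drmWrap, and a
shell-style sc and sqlContext; the schema, column names and data below are
invented for illustration:

    import org.apache.mahout.math.{RandomAccessSparseVector, Vector => MVector}
    import org.apache.mahout.sparkbindings._

    case class Doc(id: Int, title: String, tokens: Seq[String])

    // named / non-numeric columns live in the DataFrame
    val docsDf = sqlContext.createDataFrame(sc.parallelize(Seq(
      Doc(0, "first doc",  Seq("mahout", "samsara")),
      Doc(1, "second doc", Seq("spark", "dataframe")))))

    // the purely numeric side becomes a DRM
    val dict = Map("mahout" -> 0, "samsara" -> 1, "spark" -> 2, "dataframe" -> 3)
    val drmA = drmWrap(docsDf.rdd.map { row =>
      val v: MVector = new RandomAccessSparseVector(dict.size)
      row.getAs[Seq[String]](2).foreach(t => v.setQuick(dict(t), v.getQuick(dict(t)) + 1))
      row.getAs[Int](0) -> v
    }, ncol = dict.size)

Column naming, selection and the like stay on the DataFrame side; algebra stays on the DRM side.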

On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Seems like seq2sparse would be really easy to replace since it takes text
> files to start with, then the whole pipeline could be kept in rdds. The
> dictionaries and counts could be either in-memory maps or rdds for use with
> joins? This would get rid of sequence files completely from the pipeline.
> Item similarity uses in-memory maps but the plan is to make it more
> scalable using joins as an alternative with the same API allowing the user
> to trade-off footprint for speed.
>
> My use for TF-IDF is for row similarity and would take a DRM (actually
> IndexedDataset) and calculate row/doc similarities. It works now but only
> using LLR. This is OK when thinking of the items as tags or metadata but
> for text tokens something like cosine may be better.
>
> I’d imagine a downsampling phase that would precede TF-IDF using LLR a lot
> like how CF preferences are downsampled. This would produce an sparsified
> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
> terms before row similarity uses cosine. This is not so good for search but
> should produce much better similarities than Solr’s “moreLikeThis” and does
> it for all pairs rather than one at a time.
>
> In any case it can be used to do a create a personalized content-based
> recommender or augment a CF recommender with one more indicator type.
>
> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:
>
>
> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
> >
> > On 02/03/2015 12:22 PM, Pat Ferrel wrote:
> >> Some issues WRT lower level Spark integration:
> >> 1) interoperability with Spark data. TF-IDF is one example I actually
> looked at. There may be other things we can pick up from their committers
> since they have an abundance.
> >> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
> me when someone on the Spark list asked about matrix transpose and an MLlib
> committer’s answer was something like “why would you want to do that?”.
> Usually you don’t actually execute the transpose but they don’t even
> support A’A, AA’, or A’B, which are core to what I work on. At present you
> pretty much have to choose between MLlib or Mahout for sparse matrix stuff.
> Maybe a half-way measure is some implicit conversions (ugh, I know). If the
> DSL could interchange datasets with MLlib, people would be pointed to the
> DSL for all of a bunch of “why would you want to do that?” features. MLlib
> seems to be algorithms, not math.
> >> 3) integration of Streaming. DStreams support most of the RDD
> interface. Doing a batch recalc on a moving time window would nearly fall
> out of DStream backed DRMs. This isn’t the same as incremental updates on
> streaming but it’s a start.
> >>
> >> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
> faster compute engines. So we jumped. Now the need is for streaming and
> especially incrementally updated streaming. Seems like we need to address
> this.
> >>
> >> Andrew, regardless of the above having TF-IDF would be super
> helpful—row similarity for content/text would benefit greatly.
> >
> >   I will put a PR up soon.
>
> Just to clarify, I'll be porting over the (very simple) TF, TFIDF classes
> and Weight interface over from mr-legacy to math-scala. They're available
> now in spark-shell but won't be after this refactoring.  These still
> require dictionary and a frequency count maps to vectorize incoming text-
> so they're more for use with the old MR seq2sparse and I don't think they
> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
> Hopefully they'll be of some use.
>
> On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> >>
> >> But first I need to do massive fixes and improvements to the distributed
> >> optimizer itself. Still waiting on green light for that.
> >> On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dl...@gmail.com> wrote:
> >>
> >>> On Feb 3, 2015 7:20 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
> >>>> BTW what level of difficulty would making the DSL run on MLlib Vectors
> >>> and RowMatrix be? Looking at using their hashing TF-IDF but it raises
> >>> impedance mismatch between DRM and MLlib RowMatrix. This would further
> >>> reduce artifact size by a bunch.
> >>>
> >>> Short answer, if it were possible, I'd not bother with Mahout code
> base at
> >>> all. The problem is it lacks sufficient flexibility semantics and
> >>> abstruction. Breeze is indefinitely better in that department but at
> the
> >>> time it was sufficiently worse on abstracting interoperability of
> matrices
> >>> with different structures. And mllib does not expose breeze.
> >>>
> >>> Looking forward toward hardware acellerated bolt-on work I just must
> say
> >>> after reading breeze code for some time I still have much clearer plan
> how
> >>> such back hybridization and cost calibration might work with current
> Mahout
> >>> math abstractions than with breeze. It is also more in line with my
> current
> >>> work tasks.
> >>>
> >>>> Also backing something like a DRM with DStreams. Periodic model recalc
> >>> with streams is maybe the first step towards truly streaming algos.
> Looking
> >>> at DStream -> DRM conversion for A’A, A’B, and AA’ in item and row
> >>> similarity. Attach Kafka and get evergreen models, if not incrementally
> >>> updating models.
> >>>> On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >>>>
> >>>> bottom line compile-time dependencies are satisfied with no extra
> stuff
> >>>> from mr-legacy or its transitives. This is proven by virtue of
> >>> successful
> >>>> compilation with no dependency on mr-legacy on the tree.
> >>>>
> >>>> Runtime sufficiency for no extra dependency is proven via running
> shell
> >>> or
> >>>> embedded tests (unit tests) which are successful too. This implies
> >>>> embedding and shell apis.
> >>>>
> >>>> Issue with guava is typical one. if it were an issue, i wouldn't be
> able
> >>> to
> >>>> compile and/or run stuff. Now, question is what do we do if drivers
> want
> >>>> extra stuff that is not found in Spark.
> >>>>
> >>>> Now, It is so nice not to depend on anything extra so i am hesitant to
> >>>> offer anything  here. either shading or lib with opt-in dependency
> policy
> >>>> would suffice though, since it doesn't look like we'd have to have
> tons
> >>> of
> >>>> extra for drivers.
> >>>>
> >>>>
> >>>>
> >>>> On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <pa...@occamsmachete.com>
> >>> wrote:
> >>>>> I vaguely remember there being a Guava version problem where the
> >>> version
> >>>>> had to be rolled back in one of the hadoop modules. The math-scala
> >>>>> IndexedDataset shouldn’t care about version.
> >>>>>
> >>>>> BTW It seems pretty easy to take out the option parser and replace
> with
> >>>>> match and tuples especially if we can extend the Scala App class. It
> >>> might
> >>>>> actually simplify things since I can then use several case classes to
> >>> hold
> >>>>> options (scopt needed one object), which in turn takes out all those
> >>> ugly
> >>>>> casts. I’ll take a look next time I’m in there.
> >>>>>
> >>>>> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dl...@gmail.com>
> >>> wrote:
> >>>>> in 'spark' module it is overwritten with spark dependency, which also
> >>> comes
> >>>>> at the same version so happens. so should be fine with 1.1.x
> >>>>>
> >>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
> >>>>> mahout-spark_2.10 ---
> >>>>> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
> >>>>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
> >>>>> [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> >>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> >>>>> [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
> >>>>> [INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
> >>>>> [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
> >>>>> [INFO] |  |  |  +- commons-logging:commons-logging:jar:1.1.3:compile
> >>>>> [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
> >>>>> [INFO] |  |  |  +-
> >>>>> commons-configuration:commons-configuration:jar:1.6:compile
> >>>>> [INFO] |  |  |  |  +-
> >>>>> commons-collections:commons-collections:jar:3.2.1:compile
> >>>>> [INFO] |  |  |  |  +-
> commons-digester:commons-digester:jar:1.8:compile
> >>>>> [INFO] |  |  |  |  |  \-
> >>>>> commons-beanutils:commons-beanutils:jar:1.7.0:compile
> >>>>> [INFO] |  |  |  |  \-
> >>>>> commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
> >>>>> [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
> >>>>> [INFO] |  |  |  +-
> com.google.protobuf:protobuf-java:jar:2.5.0:compile
> >>>>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
> >>>>> [INFO] |  |  |  \-
> >>> org.apache.commons:commons-compress:jar:1.4.1:compile
> >>>>> [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
> >>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
> >>>>> [INFO] |  |  +-
> >>>>> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
> >>>>> [INFO] |  |  |  +-
> >>>>> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
> >>>>> [INFO] |  |  |  |  +-
> >>>>> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
> >>>>> [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
> >>>>> [INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
> >>>>> [INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
> >>>>> [INFO] |  |  |  |  |  +-
> >>>>>
> >>>>>
> >>>
> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
> >>>>> [INFO] |  |  |  |  |  |  +-
> >>>>>
> >>>>>
> >>>
> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
> >>>>> [INFO] |  |  |  |  |  |  |  +-
> >>>>> javax.servlet:javax.servlet-api:jar:3.0.1:compile
> >>>>> [INFO] |  |  |  |  |  |  |  \-
> >>> com.sun.jersey:jersey-client:jar:1.9:compile
> >>>>> [INFO] |  |  |  |  |  |  \-
> >>> com.sun.jersey:jersey-grizzly2:jar:1.9:compile
> >>>>> [INFO] |  |  |  |  |  |     +-
> >>>>> org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
> >>>>> [INFO] |  |  |  |  |  |     |  \-
> >>>>> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
> >>>>> [INFO] |  |  |  |  |  |     |     \-
> >>>>> org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
> >>>>> [INFO] |  |  |  |  |  |     |        \-
> >>>>> org.glassfish.external:management-api:jar:3.0.0-b012:compile
> >>>>> [INFO] |  |  |  |  |  |     +-
> >>>>> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
> >>>>> [INFO] |  |  |  |  |  |     |  \-
> >>>>> org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
> >>>>> [INFO] |  |  |  |  |  |     +-
> >>>>> org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
> >>>>> [INFO] |  |  |  |  |  |     \-
> >>> org.glassfish:javax.servlet:jar:3.1:compile
> >>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
> >>>>> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
> >>>>> [INFO] |  |  |  |  |  |  \-
> com.sun.jersey:jersey-core:jar:1.9:compile
> >>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
> >>>>> [INFO] |  |  |  |  |  |  +-
> >>> org.codehaus.jettison:jettison:jar:1.1:compile
> >>>>> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
> >>>>> [INFO] |  |  |  |  |  |  +-
> >>> com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
> >>>>> [INFO] |  |  |  |  |  |  |  \-
> >>> javax.xml.bind:jaxb-api:jar:2.2.2:compile
> >>>>> [INFO] |  |  |  |  |  |  |     \-
> >>>>> javax.activation:activation:jar:1.1:compile
> >>>>> [INFO] |  |  |  |  |  |  +-
> >>>>> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
> >>>>> [INFO] |  |  |  |  |  |  \-
> >>>>> org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
> >>>>> [INFO] |  |  |  |  |  \-
> >>>>> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
> >>>>> [INFO] |  |  |  |  \-
> >>>>> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
> >>>>> [INFO] |  |  |  \-
> >>>>> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
> >>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
> >>>>> [INFO] |  |  +-
> >>>>> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
> >>>>> [INFO] |  |  |  \-
> >>> org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
> >>>>> [INFO] |  |  +-
> >>>>> org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
> >>>>> [INFO] |  |  \-
> org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
> >>>>> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
> >>>>> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
> >>>>> [INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
> >>>>> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
> >>>>> [INFO] |  |  +-
> org.apache.curator:curator-framework:jar:2.4.0:compile
> >>>>> [INFO] |  |  |  \-
> org.apache.curator:curator-client:jar:2.4.0:compile
> >>>>> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
> >>>>> [INFO] |  |     \- jline:jline:jar:0.9.94:compile
> >>>>> [INFO] |  +-
> org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
> >>>>> [INFO] |  |  +-
> >>>>>
> >>>
> org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
> >>>>> [INFO] |  |  +-
> >>> org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
> >>>>> [INFO] |  |  |  +-
> >>> org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
> >>>>> [INFO] |  |  |  \-
> >>>>> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
> >>>>> [INFO] |  |  \-
> >>> org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
> >>>>> [INFO] |  |     \-
> >>>>>
> >>>>>
> >>>
> org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
> >>>>> [INFO] |  |        \-
> >>>>>
> >>>
> org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
> >>>>> [INFO] |  +-
> >>> org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
> >>>>> [INFO] |  +-
> org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
> >>>>> [INFO] |  +-
> >>> org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
> >>>>> [INFO] |  |  +-
> >>>>> org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
> >>>>> [INFO] |  |  +-
> >>>>> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
> >>>>> [INFO] |  |  \-
> >>> org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
> >>>>> [INFO] |  |     \-
> >>> org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
> >>>>> [INFO] |  +- com.google.guava:guava:jar:16.0:compile
> >>>>> d
> >>>>>
> >>>>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >
> >>>>> wrote:
> >>>>>
> >>>>>> looks like it is also requested by mahout-math, wonder what is using
> >>> it
> >>>>>> there.
> >>>>>>
> >>>>>> At very least, it needs to be synchronized to the one currently used
> >>> by
> >>>>>> spark.
> >>>>>>
> >>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
> >>> mahout-hadoop
> >>>>>> ---
> >>>>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
> >>>>>> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
> >>>>>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
> >>>>>> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
> >>>>>> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
> >>>>>> [INFO] +-
> >>> org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
> >>>>>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> >>>>>> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <pa...@occamsmachete.com>
> >>>>> wrote:
> >>>>>>> Looks like Guava is in Spark.
> >>>>>>>
> >>>>>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <pa...@occamsmachete.com>
> >>> wrote:
> >>>>>>> IndexedDataset uses Guava. Can’t tell from sure but it sounds like
> >>> this
> >>>>>>> would not be included since I think it was taken from the mrlegacy
> >>> jar.
> >>>>>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <dl...@gmail.com>
> >>>>> wrote:
> >>>>>>> ---------- Forwarded message ----------
> >>>>>>> From: "Pat Ferrel" <pa...@occamsmachete.com>
> >>>>>>> Date: Jan 25, 2015 9:39 AM
> >>>>>>> Subject: Re: Codebase refactoring proposal
> >>>>>>> To: <de...@mahout.apache.org>
> >>>>>>> Cc:
> >>>>>>>
> >>>>>>>> When you get a chance a PR would be good.
> >>>>>>> Yes, it would. And not just for that.
> >>>>>>>
> >>>>>>>> As I understand it you are putting some class jars somewhere in
> the
> >>>>>>> classpath. Where? How?
> >>>>>>> /bin/mahout
> >>>>>>>
> >>>>>>> (Computes 2 different classpaths. See  'bin/mahout classpath' vs.
> >>>>>>> 'bin/mahout -spark'.)
> >>>>>>>
> >>>>>>> If i interpret current shell code there correctky, legacy path
> tries
> >>> to
> >>>>>>> use
> >>>>>>> examples assemblies if not packaged, or /lib if packaged. True
> >>>>> motivation
> >>>>>>> of that significantly predates 2010 and i suspect only Benson knows
> >>>>> whole
> >>>>>>> true intent there.
> >>>>>>>
> >>>>>>> The spark path, which is really a quick hack of the script, tries
> to
> >>> get
> >>>>>>> only selected mahout jars and locally instlalled spark classpath
> >>> which i
> >>>>>>> guess is just the shaded spark jar in recent spark releases. It
> also
> >>>>>>> apparently tries to include /libs/*, which is never compiled in
> >>>>> unpackaged
> >>>>>>> version, and now i think it is a bug it is included  because
> /libs/*
> >>> is
> >>>>>>> apparently legacy packaging, and shouldnt be used  in spark jobs
> >>> with a
> >>>>>>> wildcard. I cant beleive how lazy i am, i still did not find time
> to
> >>>>>>> understand mahout build in all cases.
> >>>>>>>
> >>>>>>> I am not even sure if packaged mahout will work with spark,
> honestly,
> >>>>>>> because of the /lib. Never tried that, since i mostly use
> application
> >>>>>>> embedding techniques.
> >>>>>>>
> >>>>>>> The same solution may apply to adding external dependencies and
> >>> removing
> >>>>>>> the assembly in the Spark module. Which would leave only one major
> >>> build
> >>>>>>> issue afaik.
> >>>>>>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >
> >>>>>>> wrote:
> >>>>>>>> No, no PR. Only experiment on private. But i believe i
> sufficiently
> >>>>>>> defined
> >>>>>>>> what i want to do in order to gauge if we may want to advance it
> >>> some
> >>>>>>> time
> >>>>>>>> later. Goal is much lighter dependency for spark code. Eliminate
> >>>>>>> everything
> >>>>>>>> that is not compile-time dependent. (and a lot of it is thru
> legacy
> >>> MR
> >>>>>>> code
> >>>>>>>> which we of course don't use).
> >>>>>>>>
> >>>>>>>> Cant say i understand the remaining issues you are talking about
> >>>>> though.
> >>>>>>>> If you are talking about compiling lib or shaded assembly, no,
> this
> >>>>>>> doesn't
> >>>>>>>> do anything about it. Although point is, as it stands, the algebra
> >>> and
> >>>>>>>> shell don't have any external dependencies but spark and these 4
> >>> (5?)
> >>>>>>>> mahout jars so they technically don't even need an assembly (as
> >>>>>>>> demonstrated).
> >>>>>>>>
> >>>>>>>> As i said, it seems driver code is the only one that may need some
> >>>>>>> external
> >>>>>>>> dependencies, but that's a different scenario from those i am
> >>> talking
> >>>>>>>> about. But i am relatively happy with having the first two working
> >>>>>>> nicely
> >>>>>>>> at this point.
> >>>>>>>>
> >>>>>>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <
> pat@occamsmachete.com>
> >>>>>>> wrote:
> >>>>>>>>> +1
> >>>>>>>>>
> >>>>>>>>> Is there a PR? You mention a "tiny mahout-hadoop” module. It
> would
> >>> be
> >>>>>>> nice
> >>>>>>>>> to see how you’ve structured that in case we can use the same
> >>> model to
> >>>>>>>>> solve the two remaining refactoring issues.
> >>>>>>>>> 1) external dependencies in the spark module
> >>>>>>>>> 2) no spark or h2o in the release artifacts.
> >>>>>>>>>
> >>>>>>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn <sq...@gatech.edu>
> >>> wrote:
> >>>>>>>>> Also +1
> >>>>>>>>>
> >>>>>>>>> iPhone'd
> >>>>>>>>>
> >>>>>>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo <ap...@outlook.com>
> >>>>> wrote:
> >>>>>>>>>> +1
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
> >>>>>>>>>>
> >>>>>>>>>> -------- Original message --------
> >>>>>>>>>> From: Dmitriy Lyubimov <dl...@gmail.com>
> >>>>>>>>>> Date: 01/23/2015 6:06 PM (GMT-05:00)
> >>>>>>>>>> To: dev@mahout.apache.org
> >>>>>>>>>> Subject: Codebase refactoring proposal
> >>>>>>>>>> So right now mahout-spark depends on mr-legacy.
> >>>>>>>>>> I did quick refactoring and it turns out it only _irrevocably_
> >>>>> depends
> >>>>>>> on
> >>>>>>>>>> the following classes there:
> >>>>>>>>>>
> >>>>>>>>>> MatrixWritable, VectorWritable, Varint/Varlong and
> VarintWritable,
> >>>>> and
> >>>>>>>>> ...
> >>>>>>>>>> *sigh* o.a.m.common.Pair
> >>>>>>>>>>
> >>>>>>>>>> So I just dropped those five classes into a new tiny
> >>>>>>> mahout-hadoop
> >>>>>>>>>> module (to signify stuff that is directly relevant to
> serializing
> >>>>>>> things
> >>>>>>>>> to
> >>>>>>>>>> DFS API) and completely removed mrlegacy and its transients from
> >>>>> spark
> >>>>>>>>> and
> >>>>>>>>>> spark-shell dependencies.
> >>>>>>>>>>
> >>>>>>>>>> So non-cli applications (shell scripts and embedded api use)
> >>> actually
> >>>>>>>>> only
> >>>>>>>>>> need spark dependencies (which come from SPARK_HOME classpath,
> of
> >>>>>>> course)
> >>>>>>>>>> and mahout jars (mahout-spark, mahout-math(-scala),
> mahout-hadoop
> >>> and
> >>>>>>>>>> optionally mahout-spark-shell (for running shell)).
> >>>>>>>>>>
> >>>>>>>>>> This of course still doesn't address driver problems that want
> to
> >>>>>>> throw
> >>>>>>>>>> more stuff into front-end classpath (such as cli parser) but at
> >>> least
> >>>>>>> it
> >>>>>>>>>> renders transitive luggage of mr-legacy (and the size of
> >>>>>>> worker-shipped
> >>>>>>>>>> jars) much more tolerable.
> >>>>>>>>>>
> >>>>>>>>>> How does that sound?
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >
>
>
>

Re: Codebase refactoring proposal

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Seems like seq2sparse would be really easy to replace since it takes text files to start with; the whole pipeline could then be kept in RDDs. The dictionaries and counts could be either in-memory maps or RDDs for use with joins? This would get rid of sequence files completely from the pipeline. Item similarity uses in-memory maps, but the plan is to make it more scalable using joins as an alternative with the same API, allowing the user to trade off footprint for speed.
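
A rough sketch of what I mean, in Spark Scala (untested; all names here are placeholders, nothing in the codebase yet):

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._   // pair-RDD implicits for Spark 1.1
  import org.apache.spark.rdd.RDD

  // Read raw text, tokenize, and build the dictionary and doc-frequency
  // counts as RDDs, so no sequence files appear anywhere in the pipeline.
  def tokenize(sc: SparkContext, path: String): RDD[(String, Seq[String])] =
    sc.wholeTextFiles(path).map { case (docId, text) =>
      docId -> text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq
    }

  def dictionary(docs: RDD[(String, Seq[String])]): RDD[(String, Long)] =
    docs.flatMap(_._2).distinct().zipWithIndex()        // term -> column index

  def docFrequencies(docs: RDD[(String, Seq[String])]): RDD[(String, Long)] =
    docs.flatMap { case (_, terms) => terms.distinct.map(_ -> 1L) }
        .reduceByKey(_ + _)                             // term -> # docs containing the term

Either of the last two could be collected into an in-memory map or kept as an RDD and joined against, depending on vocabulary size.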

My use of TF-IDF is for row similarity: it would take a DRM (actually an IndexedDataset) and calculate row/doc similarities. It works now, but only using LLR. This is OK when thinking of the items as tags or metadata, but for text tokens something like cosine may be better.

I’d imagine a downsampling phase that would precede TF-IDF, using LLR a lot like how CF preferences are downsampled. This would produce a sparsified all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the terms before row similarity uses cosine. This is not so good for search but should produce much better similarities than Solr’s “moreLikeThis”, and does it for all pairs rather than one at a time.
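
By re-weight I mean the plain textbook formulation, something like this sketch (not necessarily the exact damping that seq2sparse/Lucene uses), with cosine computed over the re-weighted rows:

  import org.apache.mahout.math.Vector

  // Standard tf-idf weight for one term: tf from the doc, df and numDocs
  // from the saved counts.
  def tfidf(tf: Double, df: Double, numDocs: Double): Double =
    tf * math.log(numDocs / df)

  // Cosine similarity between two re-weighted document rows.
  def cosine(a: Vector, b: Vector): Double =
    a.dot(b) / (a.norm(2) * b.norm(2))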

In any case it can be used to create a personalized content-based recommender or to augment a CF recommender with one more indicator type.

On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap...@outlook.com> wrote:


On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
> 
> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>> Some issues WRT lower level Spark integration:
>> 1) interoperability with Spark data. TF-IDF is one example I actually looked at. There may be other things we can pick up from their committers since they have an abundance.
>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to me when someone on the Spark list asked about matrix transpose and an MLlib committer’s answer was something like “why would you want to do that?”. Usually you don’t actually execute the transpose but they don’t even support A’A, AA’, or A’B, which are core to what I work on. At present you pretty much have to choose between MLlib or Mahout for sparse matrix stuff. Maybe a half-way measure is some implicit conversions (ugh, I know). If the DSL could interchange datasets with MLlib, people would be pointed to the DSL for all of a bunch of “why would you want to do that?” features. MLlib seems to be algorithms, not math.
>> 3) integration of Streaming. DStreams support most of the RDD interface. Doing a batch recalc on a moving time window would nearly fall out of DStream backed DRMs. This isn’t the same as incremental updates on streaming but it’s a start.
>> 
>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink faster compute engines. So we jumped. Now the need is for streaming and especially incrementally updated streaming. Seems like we need to address this.
>> 
>> Andrew, regardless of the above having TF-IDF would be super helpful—row similarity for content/text would benefit greatly.
> 
>   I will put a PR up soon.

Just to clarify, I'll be porting the (very simple) TF and TFIDF classes and the Weight interface over from mr-legacy to math-scala. They're available now in spark-shell but won't be after this refactoring.  These still require a dictionary and a frequency-count map to vectorize incoming text, so they're more for use with the old MR seq2sparse and I don't think they can be used with Spark's HashingTF and IDF.  I'll put them up soon.  Hopefully they'll be of some use.

On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> 
>> But first I need to do massive fixes and improvements to the distributed
>> optimizer itself. Still waiting on green light for that.
>> On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dl...@gmail.com> wrote:
>> 
>>> On Feb 3, 2015 7:20 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
>>>> BTW what level of difficulty would making the DSL run on MLlib Vectors
>>> and RowMatrix be? Looking at using their hashing TF-IDF but it raises
>>> impedance mismatch between DRM and MLlib RowMatrix. This would further
>>> reduce artifact size by a bunch.
>>> 
>>> Short answer, if it were possible, I'd not bother with Mahout code base at
>>> all. The problem is it lacks sufficient flexibility semantics and
>>> abstraction. Breeze is indefinitely better in that department but at the
>>> time it was sufficiently worse on abstracting interoperability of matrices
>>> with different structures. And mllib does not expose breeze.
>>> 
>>> Looking forward toward hardware accelerated bolt-on work I just must say
>>> after reading breeze code for some time I still have much clearer plan how
>>> such back hybridization and cost calibration might work with current Mahout
>>> math abstractions than with breeze. It is also more in line with my current
>>> work tasks.
>>> 
>>>> Also backing something like a DRM with DStreams. Periodic model recalc
>>> with streams is maybe the first step towards truly streaming algos. Looking
>>> at DStream -> DRM conversion for A’A, A’B, and AA’ in item and row
>>> similarity. Attach Kafka and get evergreen models, if not incrementally
>>> updating models.
>>>> On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>> 
>>>> bottom line compile-time dependencies are satisfied with no extra stuff
>>>> from mr-legacy or its transitives. This is proven by virtue of
>>> successful
>>>> compilation with no dependency on mr-legacy on the tree.
>>>> 
>>>> Runtime sufficiency for no extra dependency is proven via running shell
>>> or
>>>> embedded tests (unit tests) which are successful too. This implies
>>>> embedding and shell apis.
>>>> 
>>>> Issue with guava is typical one. if it were an issue, i wouldn't be able
>>> to
>>>> compile and/or run stuff. Now, question is what do we do if drivers want
>>>> extra stuff that is not found in Spark.
>>>> 
>>>> Now, It is so nice not to depend on anything extra so i am hesitant to
>>>> offer anything  here. either shading or lib with opt-in dependency policy
>>>> would suffice though, since it doesn't look like we'd have to have tons
>>> of
>>>> extra for drivers.
>>>> 
>>>> 
>>>> 
>>>> On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <pa...@occamsmachete.com>
>>> wrote:
>>>>> I vaguely remember there being a Guava version problem where the
>>> version
>>>>> had to be rolled back in one of the hadoop modules. The math-scala
>>>>> IndexedDataset shouldn’t care about version.
>>>>> 
>>>>> BTW It seems pretty easy to take out the option parser and replace with
>>>>> match and tuples especially if we can extend the Scala App class. It
>>> might
>>>>> actually simplify things since I can then use several case classes to
>>> hold
>>>>> options (scopt needed one object), which in turn takes out all those
>>> ugly
>>>>> casts. I’ll take a look next time I’m in there.
>>>>> 
>>>>> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>>>> in 'spark' module it is overwritten with spark dependency, which also
>>> comes
>>>>> at the same version so happens. so should be fine with 1.1.x
>>>>> 
>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
>>>>> mahout-spark_2.10 ---
>>>>> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
>>>>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
>>>>> [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>>>>> [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
>>>>> [INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
>>>>> [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
>>>>> [INFO] |  |  |  +- commons-logging:commons-logging:jar:1.1.3:compile
>>>>> [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
>>>>> [INFO] |  |  |  +-
>>>>> commons-configuration:commons-configuration:jar:1.6:compile
>>>>> [INFO] |  |  |  |  +-
>>>>> commons-collections:commons-collections:jar:3.2.1:compile
>>>>> [INFO] |  |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
>>>>> [INFO] |  |  |  |  |  \-
>>>>> commons-beanutils:commons-beanutils:jar:1.7.0:compile
>>>>> [INFO] |  |  |  |  \-
>>>>> commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
>>>>> [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
>>>>> [INFO] |  |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
>>>>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
>>>>> [INFO] |  |  |  \-
>>> org.apache.commons:commons-compress:jar:1.4.1:compile
>>>>> [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
>>>>> [INFO] |  |  +-
>>>>> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
>>>>> [INFO] |  |  |  +-
>>>>> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
>>>>> [INFO] |  |  |  |  +-
>>>>> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
>>>>> [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
>>>>> [INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
>>>>> [INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
>>>>> [INFO] |  |  |  |  |  +-
>>>>> 
>>>>> 
>>> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile 
>>>>> [INFO] |  |  |  |  |  |  +-
>>>>> 
>>>>> 
>>> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile 
>>>>> [INFO] |  |  |  |  |  |  |  +-
>>>>> javax.servlet:javax.servlet-api:jar:3.0.1:compile
>>>>> [INFO] |  |  |  |  |  |  |  \-
>>> com.sun.jersey:jersey-client:jar:1.9:compile
>>>>> [INFO] |  |  |  |  |  |  \-
>>> com.sun.jersey:jersey-grizzly2:jar:1.9:compile
>>>>> [INFO] |  |  |  |  |  |     +-
>>>>> org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
>>>>> [INFO] |  |  |  |  |  |     |  \-
>>>>> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
>>>>> [INFO] |  |  |  |  |  |     |     \-
>>>>> org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
>>>>> [INFO] |  |  |  |  |  |     |        \-
>>>>> org.glassfish.external:management-api:jar:3.0.0-b012:compile
>>>>> [INFO] |  |  |  |  |  |     +-
>>>>> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
>>>>> [INFO] |  |  |  |  |  |     |  \-
>>>>> org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
>>>>> [INFO] |  |  |  |  |  |     +-
>>>>> org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
>>>>> [INFO] |  |  |  |  |  |     \-
>>> org.glassfish:javax.servlet:jar:3.1:compile
>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
>>>>> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
>>>>> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-core:jar:1.9:compile
>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
>>>>> [INFO] |  |  |  |  |  |  +-
>>> org.codehaus.jettison:jettison:jar:1.1:compile
>>>>> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
>>>>> [INFO] |  |  |  |  |  |  +-
>>> com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
>>>>> [INFO] |  |  |  |  |  |  |  \-
>>> javax.xml.bind:jaxb-api:jar:2.2.2:compile
>>>>> [INFO] |  |  |  |  |  |  |     \-
>>>>> javax.activation:activation:jar:1.1:compile
>>>>> [INFO] |  |  |  |  |  |  +-
>>>>> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
>>>>> [INFO] |  |  |  |  |  |  \-
>>>>> org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
>>>>> [INFO] |  |  |  |  |  \-
>>>>> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
>>>>> [INFO] |  |  |  |  \-
>>>>> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
>>>>> [INFO] |  |  |  \-
>>>>> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
>>>>> [INFO] |  |  +-
>>>>> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
>>>>> [INFO] |  |  |  \-
>>> org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
>>>>> [INFO] |  |  +-
>>>>> org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
>>>>> [INFO] |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
>>>>> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
>>>>> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
>>>>> [INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
>>>>> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
>>>>> [INFO] |  |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
>>>>> [INFO] |  |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
>>>>> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
>>>>> [INFO] |  |     \- jline:jline:jar:0.9.94:compile
>>>>> [INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  |  +-
>>>>> 
>>> org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile 
>>>>> [INFO] |  |  +-
>>> org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  |  |  +-
>>> org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  |  |  \-
>>>>> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  |  \-
>>> org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  |     \-
>>>>> 
>>>>> 
>>> org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile 
>>>>> [INFO] |  |        \-
>>>>> 
>>> org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile 
>>>>> [INFO] |  +-
>>> org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  +-
>>> org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  |  +-
>>>>> org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
>>>>> [INFO] |  |  +-
>>>>> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  |  \-
>>> org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  |     \-
>>> org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  +- com.google.guava:guava:jar:16.0:compile
>>>>> d
>>>>> 
>>>>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> looks like it is also requested by mahout-math, wonder what is using
>>> it
>>>>>> there.
>>>>>> 
>>>>>> At very least, it needs to be synchronized to the one currently used
>>> by
>>>>>> spark.
>>>>>> 
>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
>>> mahout-hadoop
>>>>>> ---
>>>>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
>>>>>> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
>>>>>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
>>>>>> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
>>>>>> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
>>>>>> [INFO] +-
>>> org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
>>>>>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>>>>>> 
>>>>>> 
>>>>>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <pa...@occamsmachete.com>
>>>>> wrote:
>>>>>>> Looks like Guava is in Spark.
>>>>>>> 
>>>>>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <pa...@occamsmachete.com>
>>> wrote:
>>>>>>> IndexedDataset uses Guava. Can’t tell for sure but it sounds like
>>> this
>>>>>>> would not be included since I think it was taken from the mrlegacy
>>> jar.
>>>>>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <dl...@gmail.com>
>>>>> wrote:
>>>>>>> ---------- Forwarded message ----------
>>>>>>> From: "Pat Ferrel" <pa...@occamsmachete.com>
>>>>>>> Date: Jan 25, 2015 9:39 AM
>>>>>>> Subject: Re: Codebase refactoring proposal
>>>>>>> To: <de...@mahout.apache.org>
>>>>>>> Cc:
>>>>>>> 
>>>>>>>> When you get a chance a PR would be good.
>>>>>>> Yes, it would. And not just for that.
>>>>>>> 
>>>>>>>> As I understand it you are putting some class jars somewhere in the
>>>>>>> classpath. Where? How?
>>>>>>> /bin/mahout
>>>>>>> 
>>>>>>> (Computes 2 different classpaths. See  'bin/mahout classpath' vs.
>>>>>>> 'bin/mahout -spark'.)
>>>>>>> 
>>>>>>> If i interpret current shell code there correctly, legacy path tries
>>> to
>>>>>>> use
>>>>>>> examples assemblies if not packaged, or /lib if packaged. True
>>>>> motivation
>>>>>>> of that significantly predates 2010 and i suspect only Benson knows
>>>>> whole
>>>>>>> true intent there.
>>>>>>> 
>>>>>>> The spark path, which is really a quick hack of the script, tries to
>>> get
>>>>>>> only selected mahout jars and locally installed spark classpath
>>> which i
>>>>>>> guess is just the shaded spark jar in recent spark releases. It also
>>>>>>> apparently tries to include /libs/*, which is never compiled in
>>>>> unpackaged
>>>>>>> version, and now i think it is a bug it is included  because /libs/*
>>> is
>>>>>>> apparently legacy packaging, and shouldnt be used  in spark jobs
>>> with a
>>>>>>> wildcard. I cant believe how lazy i am, i still did not find time to
>>>>>>> understand mahout build in all cases.
>>>>>>> 
>>>>>>> I am not even sure if packaged mahout will work with spark, honestly,
>>>>>>> because of the /lib. Never tried that, since i mostly use application
>>>>>>> embedding techniques.
>>>>>>> 
>>>>>>> The same solution may apply to adding external dependencies and
>>> removing
>>>>>>> the assembly in the Spark module. Which would leave only one major
>>> build
>>>>>>> issue afaik.
>>>>>>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>>>>>> wrote:
>>>>>>>> No, no PR. Only experiment on private. But i believe i sufficiently
>>>>>>> defined
>>>>>>>> what i want to do in order to gauge if we may want to advance it
>>> some
>>>>>>> time
>>>>>>>> later. Goal is much lighter dependency for spark code. Eliminate
>>>>>>> everything
>>>>>>>> that is not compile-time dependent. (and a lot of it is thru legacy
>>> MR
>>>>>>> code
>>>>>>>> which we of course don't use).
>>>>>>>> 
>>>>>>>> Cant say i understand the remaining issues you are talking about
>>>>> though.
>>>>>>>> If you are talking about compiling lib or shaded assembly, no, this
>>>>>>> doesn't
>>>>>>>> do anything about it. Although point is, as it stands, the algebra
>>> and
>>>>>>>> shell don't have any external dependencies but spark and these 4
>>> (5?)
>>>>>>>> mahout jars so they technically don't even need an assembly (as
>>>>>>>> demonstrated).
>>>>>>>> 
>>>>>>>> As i said, it seems driver code is the only one that may need some
>>>>>>> external
>>>>>>>> dependencies, but that's a different scenario from those i am
>>> talking
>>>>>>>> about. But i am relatively happy with having the first two working
>>>>>>> nicely
>>>>>>>> at this point.
>>>>>>>> 
>>>>>>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <pa...@occamsmachete.com>
>>>>>>> wrote:
>>>>>>>>> +1
>>>>>>>>> 
>>>>>>>>> Is there a PR? You mention a "tiny mahout-hadoop” module. It would
>>> be
>>>>>>> nice
>>>>>>>>> to see how you’ve structured that in case we can use the same
>>> model to
>>>>>>>>> solve the two remaining refactoring issues.
>>>>>>>>> 1) external dependencies in the spark module
>>>>>>>>> 2) no spark or h2o in the release artifacts.
>>>>>>>>> 
>>>>>>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn <sq...@gatech.edu>
>>> wrote:
>>>>>>>>> Also +1
>>>>>>>>> 
>>>>>>>>> iPhone'd
>>>>>>>>> 
>>>>>>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo <ap...@outlook.com>
>>>>> wrote:
>>>>>>>>>> +1
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>>>>>> 
>>>>>>>>>> -------- Original message --------
>>>>>>>>>> From: Dmitriy Lyubimov <dl...@gmail.com>
>>>>>>>>>> Date: 01/23/2015 6:06 PM (GMT-05:00)
>>>>>>>>>> To: dev@mahout.apache.org
>>>>>>>>>> Subject: Codebase refactoring proposal
>>>>>>>>>> So right now mahout-spark depends on mr-legacy.
>>>>>>>>>> I did quick refactoring and it turns out it only _irrevocably_
>>>>> depends
>>>>>>> on
>>>>>>>>>> the following classes there:
>>>>>>>>>> 
>>>>>>>>>> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable,
>>>>> and
>>>>>>>>> ...
>>>>>>>>>> *sigh* o.a.m.common.Pair
>>>>>>>>>> 
>>>>>>>>>> So I just dropped those five classes into a new tiny
>>>>>>> mahout-hadoop
>>>>>>>>>> module (to signify stuff that is directly relevant to serializing
>>>>>>> things
>>>>>>>>> to
>>>>>>>>>> DFS API) and completely removed mrlegacy and its transients from
>>>>> spark
>>>>>>>>> and
>>>>>>>>>> spark-shell dependencies.
>>>>>>>>>> 
>>>>>>>>>> So non-cli applications (shell scripts and embedded api use)
>>> actually
>>>>>>>>> only
>>>>>>>>>> need spark dependencies (which come from SPARK_HOME classpath, of
>>>>>>> course)
>>>>>>>>>> and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop
>>> and
>>>>>>>>>> optionally mahout-spark-shell (for running shell)).
>>>>>>>>>> 
>>>>>>>>>> This of course still doesn't address driver problems that want to
>>>>>>> throw
>>>>>>>>>> more stuff into front-end classpath (such as cli parser) but at
>>> least
>>>>>>> it
>>>>>>>>>> renders transitive luggage of mr-legacy (and the size of
>>>>>>> worker-shipped
>>>>>>>>>> jars) much more tolerable.
>>>>>>>>>> 
>>>>>>>>>> How does that sound?
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
> 



Re: Codebase refactoring proposal

Posted by Andrew Palumbo <ap...@outlook.com>.
On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>
> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>> Some issues WRT lower level Spark integration:
>> 1) interoperability with Spark data. TF-IDF is one example I actually 
>> looked at. There may be other things we can pick up from their 
>> committers since they have an abundance.
>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to 
>> me when someone on the Spark list asked about matrix transpose and an 
>> MLlib committer’s answer was something like “why would you want to do 
>> that?”. Usually you don’t actually execute the transpose but they 
>> don’t even support A’A, AA’, or A’B, which are core to what I work 
>> on. At present you pretty much have to choose between MLlib or Mahout 
>> for sparse matrix stuff. Maybe a half-way measure is some implicit 
>> conversions (ugh, I know). If the DSL could interchange datasets with 
>> MLlib, people would be pointed to the DSL for all of a bunch of “why 
>> would you want to do that?” features. MLlib seems to be algorithms, 
>> not math.
>> 3) integration of Streaming. DStreams support most of the RDD 
>> interface. Doing a batch recalc on a moving time window would nearly 
>> fall out of DStream backed DRMs. This isn’t the same as incremental 
>> updates on streaming but it’s a start.
>>
>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink 
>> faster compute engines. So we jumped. Now the need is for streaming 
>> and especially incrementally updated streaming. Seems like we need to 
>> address this.
>>
>> Andrew, regardless of the above having TF-IDF would be super 
>> helpful—row similarity for content/text would benefit greatly.
>
>    I will put a PR up soon.

Just to clarify, I'll be porting the (very simple) TF and TFIDF 
classes and the Weight interface over from mr-legacy to math-scala. 
They're available now in spark-shell but won't be after this 
refactoring.  These still require a dictionary and a frequency-count 
map to vectorize incoming text, so they're more for use with the old 
MR seq2sparse and I don't think they can be used with Spark's 
HashingTF and IDF.  I'll put them up soon.  Hopefully they'll be of 
some use.
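
If it helps, usage would look roughly like the sketch below. The 
calculate(tf, df, length, numDocs) shape follows the current mr-legacy 
Weight interface, but treat the exact signatures (and the stand-in 
weight formula here) as guesses until the PR is up:

  import org.apache.mahout.math.RandomAccessSparseVector

  // Hypothetical dictionary and doc-frequency maps, as produced by the
  // old seq2sparse pipeline.
  val dictionary = Map("fox" -> 0, "jumps" -> 1)   // term -> column index
  val dfCounts   = Map("fox" -> 20, "jumps" -> 5)  // term -> doc frequency
  val numDocs    = 100

  // Stand-in for TFIDF.calculate(tf, df, length, numDocs).
  def weight(tf: Int, df: Int, numDocs: Int): Double =
    tf * math.log(numDocs.toDouble / df)

  // Vectorize one tokenized document against the dictionary.
  val tokens = Seq("fox", "jumps", "fox")
  val v = new RandomAccessSparseVector(dictionary.size)
  tokens.groupBy(identity).foreach { case (term, occurrences) =>
    dictionary.get(term).foreach { col =>
      v.setQuick(col, weight(occurrences.size, dfCounts(term), numDocs))
    }
  }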

On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> But first I need to do massive fixes and improvements to the distributed
>> optimizer itself. Still waiting on green light for that.
>> On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dl...@gmail.com> wrote:
>>
>>> On Feb 3, 2015 7:20 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
>>>> BTW what level of difficulty would making the DSL run on MLlib Vectors
>>> and RowMatrix be? Looking at using their hashing TF-IDF but it raises
>>> impedance mismatch between DRM and MLlib RowMatrix. This would further
>>> reduce artifact size by a bunch.
>>>
>>> Short answer, if it were possible, I'd not bother with Mahout code 
>>> base at
>>> all. The problem is it lacks sufficient flexibility semantics and
>>> abstraction. Breeze is indefinitely better in that department but at 
>>> the
>>> time it was sufficiently worse on abstracting interoperability of 
>>> matrices
>>> with different structures. And mllib does not expose breeze.
>>>
>>> Looking forward toward hardware accelerated bolt-on work I just must 
>>> say
>>> after reading breeze code for some time I still have much clearer 
>>> plan how
>>> such back hybridization and cost calibration might work with current 
>>> Mahout
>>> math abstractions than with breeze. It is also more in line with my 
>>> current
>>> work tasks.
>>>
>>>> Also backing something like a DRM with DStreams. Periodic model recalc
>>> with streams is maybe the first step towards truly streaming algos. 
>>> Looking
>>> at DStream -> DRM conversion for A’A, A’B, and AA’ in item and row
>>> similarity. Attach Kafka and get evergreen models, if not incrementally
>>> updating models.
>>>> On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com> 
>>>> wrote:
>>>>
>>>> bottom line compile-time dependencies are satisfied with no extra 
>>>> stuff
>>>> from mr-legacy or its transitives. This is proven by virtue of
>>> successful
>>>> compilation with no dependency on mr-legacy on the tree.
>>>>
>>>> Runtime sufficiency for no extra dependency is proven via running 
>>>> shell
>>> or
>>>> embedded tests (unit tests) which are successful too. This implies
>>>> embedding and shell apis.
>>>>
>>>> Issue with guava is typical one. if it were an issue, i wouldn't be 
>>>> able
>>> to
>>>> compile and/or run stuff. Now, question is what do we do if drivers 
>>>> want
>>>> extra stuff that is not found in Spark.
>>>>
>>>> Now, It is so nice not to depend on anything extra so i am hesitant to
>>>> offer anything  here. either shading or lib with opt-in dependency 
>>>> policy
>>>> would suffice though, since it doesn't look like we'd have to have 
>>>> tons
>>> of
>>>> extra for drivers.
>>>>
>>>>
>>>>
>>>> On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <pa...@occamsmachete.com>
>>> wrote:
>>>>> I vaguely remember there being a Guava version problem where the
>>> version
>>>>> had to be rolled back in one of the hadoop modules. The math-scala
>>>>> IndexedDataset shouldn’t care about version.
>>>>>
>>>>> BTW It seems pretty easy to take out the option parser and replace 
>>>>> with
>>>>> match and tuples especially if we can extend the Scala App class. It
>>> might
>>>>> actually simplify things since I can then use several case classes to
>>> hold
>>>>> options (scopt needed one object), which in turn takes out all those
>>> ugly
>>>>> casts. I’ll take a look next time I’m in there.
>>>>>
>>>>> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>>>> in 'spark' module it is overwritten with spark dependency, which also
>>> comes
>>>>> at the same version so happens. so should be fine with 1.1.x
>>>>>
>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
>>>>> mahout-spark_2.10 ---
>>>>> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
>>>>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
>>>>> [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>>>>> [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
>>>>> [INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
>>>>> [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
>>>>> [INFO] |  |  |  +- commons-logging:commons-logging:jar:1.1.3:compile
>>>>> [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
>>>>> [INFO] |  |  |  +-
>>>>> commons-configuration:commons-configuration:jar:1.6:compile
>>>>> [INFO] |  |  |  |  +-
>>>>> commons-collections:commons-collections:jar:3.2.1:compile
>>>>> [INFO] |  |  |  |  +- 
>>>>> commons-digester:commons-digester:jar:1.8:compile
>>>>> [INFO] |  |  |  |  |  \-
>>>>> commons-beanutils:commons-beanutils:jar:1.7.0:compile
>>>>> [INFO] |  |  |  |  \-
>>>>> commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
>>>>> [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
>>>>> [INFO] |  |  |  +- 
>>>>> com.google.protobuf:protobuf-java:jar:2.5.0:compile
>>>>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
>>>>> [INFO] |  |  |  \-
>>> org.apache.commons:commons-compress:jar:1.4.1:compile
>>>>> [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
>>>>> [INFO] |  |  +-
>>>>> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
>>>>> [INFO] |  |  |  +-
>>>>> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
>>>>> [INFO] |  |  |  |  +-
>>>>> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
>>>>> [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
>>>>> [INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
>>>>> [INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
>>>>> [INFO] |  |  |  |  |  +-
>>>>>
>>>>>
>>> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile 
>>>
>>>>> [INFO] |  |  |  |  |  |  +-
>>>>>
>>>>>
>>> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile 
>>>
>>>>> [INFO] |  |  |  |  |  |  |  +-
>>>>> javax.servlet:javax.servlet-api:jar:3.0.1:compile
>>>>> [INFO] |  |  |  |  |  |  |  \-
>>> com.sun.jersey:jersey-client:jar:1.9:compile
>>>>> [INFO] |  |  |  |  |  |  \-
>>> com.sun.jersey:jersey-grizzly2:jar:1.9:compile
>>>>> [INFO] |  |  |  |  |  |     +-
>>>>> org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
>>>>> [INFO] |  |  |  |  |  |     |  \-
>>>>> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
>>>>> [INFO] |  |  |  |  |  |     |     \-
>>>>> org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
>>>>> [INFO] |  |  |  |  |  |     |        \-
>>>>> org.glassfish.external:management-api:jar:3.0.0-b012:compile
>>>>> [INFO] |  |  |  |  |  |     +-
>>>>> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
>>>>> [INFO] |  |  |  |  |  |     |  \-
>>>>> org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
>>>>> [INFO] |  |  |  |  |  |     +-
>>>>> org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
>>>>> [INFO] |  |  |  |  |  |     \-
>>> org.glassfish:javax.servlet:jar:3.1:compile
>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
>>>>> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
>>>>> [INFO] |  |  |  |  |  |  \- 
>>>>> com.sun.jersey:jersey-core:jar:1.9:compile
>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
>>>>> [INFO] |  |  |  |  |  |  +-
>>> org.codehaus.jettison:jettison:jar:1.1:compile
>>>>> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
>>>>> [INFO] |  |  |  |  |  |  +-
>>> com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
>>>>> [INFO] |  |  |  |  |  |  |  \-
>>> javax.xml.bind:jaxb-api:jar:2.2.2:compile
>>>>> [INFO] |  |  |  |  |  |  |     \-
>>>>> javax.activation:activation:jar:1.1:compile
>>>>> [INFO] |  |  |  |  |  |  +-
>>>>> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
>>>>> [INFO] |  |  |  |  |  |  \-
>>>>> org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
>>>>> [INFO] |  |  |  |  |  \-
>>>>> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
>>>>> [INFO] |  |  |  |  \-
>>>>> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
>>>>> [INFO] |  |  |  \-
>>>>> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
>>>>> [INFO] |  |  +-
>>>>> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
>>>>> [INFO] |  |  |  \-
>>> org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
>>>>> [INFO] |  |  +-
>>>>> org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
>>>>> [INFO] |  |  \- 
>>>>> org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
>>>>> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
>>>>> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
>>>>> [INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
>>>>> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
>>>>> [INFO] |  |  +- 
>>>>> org.apache.curator:curator-framework:jar:2.4.0:compile
>>>>> [INFO] |  |  |  \- 
>>>>> org.apache.curator:curator-client:jar:2.4.0:compile
>>>>> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
>>>>> [INFO] |  |     \- jline:jline:jar:0.9.94:compile
>>>>> [INFO] |  +- 
>>>>> org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  |  +-
>>>>>
>>> org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile 
>>>
>>>>> [INFO] |  |  +-
>>> org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  |  |  +-
>>> org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  |  |  \-
>>>>> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  |  \-
>>> org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  |     \-
>>>>>
>>>>>
>>> org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile 
>>>
>>>>> [INFO] |  |        \-
>>>>>
>>> org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile 
>>>
>>>>> [INFO] |  +-
>>> org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  +- 
>>>>> org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  +-
>>> org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  |  +-
>>>>> org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
>>>>> [INFO] |  |  +-
>>>>> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  |  \-
>>> org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  |     \-
>>> org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
>>>>> [INFO] |  +- com.google.guava:guava:jar:16.0:compile
>>>>> d
>>>>>
>>>>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> looks like it is also requested by mahout-math, wonder what is using
>>> it
>>>>>> there.
>>>>>>
>>>>>> At very least, it needs to be synchronized to the one currently used
>>> by
>>>>>> spark.
>>>>>>
>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
>>> mahout-hadoop
>>>>>> ---
>>>>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
>>>>>> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
>>>>>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
>>>>>> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
>>>>>> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
>>>>>> [INFO] +-
>>> org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
>>>>>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <pa...@occamsmachete.com>
>>>>> wrote:
>>>>>>> Looks like Guava is in Spark.
>>>>>>>
>>>>>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <pa...@occamsmachete.com>
>>> wrote:
>>>>>>> IndexedDataset uses Guava. Can’t tell for sure but it sounds like
>>> this
>>>>>>> would not be included since I think it was taken from the mrlegacy
>>> jar.
>>>>>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <dl...@gmail.com>
>>>>> wrote:
>>>>>>> ---------- Forwarded message ----------
>>>>>>> From: "Pat Ferrel" <pa...@occamsmachete.com>
>>>>>>> Date: Jan 25, 2015 9:39 AM
>>>>>>> Subject: Re: Codebase refactoring proposal
>>>>>>> To: <de...@mahout.apache.org>
>>>>>>> Cc:
>>>>>>>
>>>>>>>> When you get a chance a PR would be good.
>>>>>>> Yes, it would. And not just for that.
>>>>>>>
>>>>>>>> As I understand it you are putting some class jars somewhere in 
>>>>>>>> the
>>>>>>> classpath. Where? How?
>>>>>>> /bin/mahout
>>>>>>>
>>>>>>> (Computes 2 different classpaths. See  'bin/mahout classpath' vs.
>>>>>>> 'bin/mahout -spark'.)
>>>>>>>
>>>>>>> If i interpret current shell code there correctly, legacy path 
>>>>>>> tries
>>> to
>>>>>>> use
>>>>>>> examples assemblies if not packaged, or /lib if packaged. True
>>>>> motivation
>>>>>>> of that significantly predates 2010 and i suspect only Benson knows
>>>>> whole
>>>>>>> true intent there.
>>>>>>>
>>>>>>> The spark path, which is really a quick hack of the script, 
>>>>>>> tries to
>>> get
>>>>>>> only selected mahout jars and locally installed spark classpath
>>> which i
>>>>>>> guess is just the shaded spark jar in recent spark releases. It 
>>>>>>> also
>>>>>>> apparently tries to include /libs/*, which is never compiled in
>>>>> unpackaged
>>>>>>> version, and now i think it is a bug it is included  because 
>>>>>>> /libs/*
>>> is
>>>>>>> apparently legacy packaging, and shouldnt be used  in spark jobs
>>> with a
>>>>>>> wildcard. I cant believe how lazy i am, i still did not find 
>>>>>>> time to
>>>>>>> understand mahout build in all cases.
>>>>>>>
>>>>>>> I am not even sure if packaged mahout will work with spark, 
>>>>>>> honestly,
>>>>>>> because of the /lib. Never tried that, since i mostly use 
>>>>>>> application
>>>>>>> embedding techniques.
>>>>>>>
>>>>>>> The same solution may apply to adding external dependencies and
>>> removing
>>>>>>> the assembly in the Spark module. Which would leave only one major
>>> build
>>>>>>> issue afaik.
>>>>>>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>>>>>> wrote:
>>>>>>>> No, no PR. Only experiment on private. But i believe i 
>>>>>>>> sufficiently
>>>>>>> defined
>>>>>>>> what i want to do in order to gauge if we may want to advance it
>>> some
>>>>>>> time
>>>>>>>> later. Goal is much lighter dependency for spark code. Eliminate
>>>>>>> everything
>>>>>>>> that is not compile-time dependent. (and a lot of it is thru 
>>>>>>>> legacy
>>> MR
>>>>>>> code
>>>>>>>> which we of course don't use).
>>>>>>>>
>>>>>>>> Cant say i understand the remaining issues you are talking about
>>>>> though.
>>>>>>>> If you are talking about compiling lib or shaded assembly, no, 
>>>>>>>> this
>>>>>>> doesn't
>>>>>>>> do anything about it. Although point is, as it stands, the algebra
>>> and
>>>>>>>> shell don't have any external dependencies but spark and these 4
>>> (5?)
>>>>>>>> mahout jars so they technically don't even need an assembly (as
>>>>>>>> demonstrated).
>>>>>>>>
>>>>>>>> As i said, it seems driver code is the only one that may need some
>>>>>>> external
>>>>>>>> dependencies, but that's a different scenario from those i am
>>> talking
>>>>>>>> about. But i am relatively happy with having the first two working
>>>>>>> nicely
>>>>>>>> at this point.
>>>>>>>>
>>>>>>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel 
>>>>>>>> <pa...@occamsmachete.com>
>>>>>>> wrote:
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> Is there a PR? You mention a "tiny mahout-hadoop” module. It 
>>>>>>>>> would
>>> be
>>>>>>> nice
>>>>>>>>> to see how you’ve structured that in case we can use the same
>>> model to
>>>>>>>>> solve the two remaining refactoring issues.
>>>>>>>>> 1) external dependencies in the spark module
>>>>>>>>> 2) no spark or h2o in the release artifacts.
>>>>>>>>>
>>>>>>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn <sq...@gatech.edu>
>>> wrote:
>>>>>>>>> Also +1
>>>>>>>>>
>>>>>>>>> iPhone'd
>>>>>>>>>
>>>>>>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo <ap...@outlook.com>
>>>>> wrote:
>>>>>>>>>> +1
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>>>>>>
>>>>>>>>>> -------- Original message --------
>>>>>>>>>> From: Dmitriy Lyubimov <dl...@gmail.com>
>>>>>>>>>> Date: 01/23/2015 6:06 PM (GMT-05:00)
>>>>>>>>>> To: dev@mahout.apache.org
>>>>>>>>>> Subject: Codebase refactoring proposal
>>>>>>>>>> So right now mahout-spark depends on mr-legacy.
>>>>>>>>>> I did quick refactoring and it turns out it only _irrevocably_
>>>>> depends
>>>>>>> on
>>>>>>>>>> the following classes there:
>>>>>>>>>>
>>>>>>>>>> MatrixWritable, VectorWritable, Varint/Varlong and 
>>>>>>>>>> VarintWritable,
>>>>> and
>>>>>>>>> ...
>>>>>>>>>> *sigh* o.a.m.common.Pair
>>>>>>>>>>
>>>>>>>>>> So I just dropped those five classes into a new tiny
>>>>>>> mahout-hadoop
>>>>>>>>>> module (to signify stuff that is directly relevant to 
>>>>>>>>>> serializing
>>>>>>> things
>>>>>>>>> to
>>>>>>>>>> DFS API) and completely removed mrlegacy and its transients from
>>>>> spark
>>>>>>>>> and
>>>>>>>>>> spark-shell dependencies.
>>>>>>>>>>
>>>>>>>>>> So non-cli applications (shell scripts and embedded api use)
>>> actually
>>>>>>>>> only
>>>>>>>>>> need spark dependencies (which come from SPARK_HOME 
>>>>>>>>>> classpath, of
>>>>>>> course)
>>>>>>>>>> and mahout jars (mahout-spark, mahout-math(-scala), 
>>>>>>>>>> mahout-hadoop
>>> and
>>>>>>>>>> optionally mahout-spark-shell (for running shell)).
>>>>>>>>>>
>>>>>>>>>> This of course still doesn't address driver problems that 
>>>>>>>>>> want to
>>>>>>> throw
>>>>>>>>>> more stuff into front-end classpath (such as cli parser) but at
>>> least
>>>>>>> it
>>>>>>>>>> renders transitive luggage of mr-legacy (and the size of
>>>>>>> worker-shipped
>>>>>>>>>> jars) much more tolerable.
>>>>>>>>>>
>>>>>>>>>> How does that sound?
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>


Re: Codebase refactoring proposal

Posted by Andrew Palumbo <ap...@outlook.com>.
On 02/03/2015 12:22 PM, Pat Ferrel wrote:
> Some issues WRT lower level Spark integration:
> 1) interoperability with Spark data. TF-IDF is one example I actually looked at. There may be other things we can pick up from their committers since they have an abundance.
> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to me when someone on the Spark list asked about matrix transpose and an MLlib committer’s answer was something like “why would you want to do that?”. Usually you don’t actually execute the transpose but they don’t even support A’A, AA’, or A’B, which are core to what I work on. At present you pretty much have to choose between MLlib or Mahout for sparse matrix stuff. Maybe a half-way measure is some implicit conversions (ugh, I know). If the DSL could interchange datasets with MLlib, people would be pointed to the DSL for all of a bunch of “why would you want to do that?” features. MLlib seems to be algorithms, not math.
> 3) integration of Streaming. DStreams support most of the RDD interface. Doing a batch recalc on a moving time window would nearly fall out of DStream backed DRMs. This isn’t the same as incremental updates on streaming but it’s a start.
>
> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink faster compute engines. So we jumped. Now the need is for streaming and especially incrementally updated streaming. Seems like we need to address this.
>
> Andrew, regardless of the above having TF-IDF would be super helpful—row similarity for content/text would benefit greatly.

    I will put a PR up soon.

> On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> But first I need to do massive fixes and improvements to the distributed
> optimizer itself. Still waiting on green light for that.
> On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dl...@gmail.com> wrote:
>
>> On Feb 3, 2015 7:20 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
>>> BTW what level of difficulty would making the DSL run on MLlib Vectors
>> and RowMatrix be? Looking at using their hashing TF-IDF but it raises
>> impedance mismatch between DRM and MLlib RowMatrix. This would further
>> reduce artifact size by a bunch.
>>
>> Short answer, if it were possible, I'd not bother with Mahout code base at
>> all. The problem is it lacks sufficient flexibility semantics and
>> abstraction. Breeze is indefinitely better in that department but at the
>> time it was sufficiently worse on abstracting interoperability of matrices
>> with different structures. And mllib does not expose breeze.
>>
>> Looking forward toward hardware accelerated bolt-on work I just must say
>> after reading breeze code for some time I still have much clearer plan how
>> such back hybridization and cost calibration might work with current Mahout
>> math abstractions than with breeze. It is also more in line with my current
>> work tasks.
>>
>>> Also backing something like a DRM with DStreams. Periodic model recalc
>> with streams is maybe the first step towards truly streaming algos. Looking
>> at DStream -> DRM conversion for A’A, A’B, and AA’ in item and row
>> similarity. Attach Kafka and get evergreen models, if not incrementally
>> updating models.
>>> On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>
>>> bottom line compile-time dependencies are satisfied with no extra stuff
>>> from mr-legacy or its transitives. This is proven by virtue of
>> successful
>>> compilation with no dependency on mr-legacy on the tree.
>>>
>>> Runtime sufficiency for no extra dependency is proven via running shell
>> or
>>> embedded tests (unit tests) which are successful too. This implies
>>> embedding and shell apis.
>>>
>>> Issue with guava is typical one. if it were an issue, i wouldn't be able
>> to
>>> compile and/or run stuff. Now, question is what do we do if drivers want
>>> extra stuff that is not found in Spark.
>>>
>>> Now, It is so nice not to depend on anything extra so i am hesitant to
>>> offer anything  here. either shading or lib with opt-in dependency policy
>>> would suffice though, since it doesn't look like we'd have to have tons
>> of
>>> extra for drivers.
>>>
>>>
>>>
>>> On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>>> I vaguely remember there being a Guava version problem where the
>> version
>>>> had to be rolled back in one of the hadoop modules. The math-scala
>>>> IndexedDataset shouldn’t care about version.
>>>>
>>>> BTW It seems pretty easy to take out the option parser and replace with
>>>> match and tuples especially if we can extend the Scala App class. It
>> might
>>>> actually simplify things since I can then use several case classes to
>> hold
>>>> options (scopt needed one object), which in turn takes out all those
>> ugly
>>>> casts. I’ll take a look next time I’m in there.
>>>>
>>>> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>>>> in 'spark' module it is overwritten with spark dependency, which also
>> comes
>>>> at the same version so happens. so should be fine with 1.1.x
>>>>
>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
>>>> mahout-spark_2.10 ---
>>>> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
>>>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
>>>> [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>>>> [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
>>>> [INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
>>>> [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
>>>> [INFO] |  |  |  +- commons-logging:commons-logging:jar:1.1.3:compile
>>>> [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
>>>> [INFO] |  |  |  +-
>>>> commons-configuration:commons-configuration:jar:1.6:compile
>>>> [INFO] |  |  |  |  +-
>>>> commons-collections:commons-collections:jar:3.2.1:compile
>>>> [INFO] |  |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
>>>> [INFO] |  |  |  |  |  \-
>>>> commons-beanutils:commons-beanutils:jar:1.7.0:compile
>>>> [INFO] |  |  |  |  \-
>>>> commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
>>>> [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
>>>> [INFO] |  |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
>>>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
>>>> [INFO] |  |  |  \-
>> org.apache.commons:commons-compress:jar:1.4.1:compile
>>>> [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
>>>> [INFO] |  |  +-
>>>> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
>>>> [INFO] |  |  |  +-
>>>> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
>>>> [INFO] |  |  |  |  +-
>>>> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
>>>> [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
>>>> [INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
>>>> [INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
>>>> [INFO] |  |  |  |  |  +-
>>>>
>>>>
>> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
>>>> [INFO] |  |  |  |  |  |  +-
>>>>
>>>>
>> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
>>>> [INFO] |  |  |  |  |  |  |  +-
>>>> javax.servlet:javax.servlet-api:jar:3.0.1:compile
>>>> [INFO] |  |  |  |  |  |  |  \-
>> com.sun.jersey:jersey-client:jar:1.9:compile
>>>> [INFO] |  |  |  |  |  |  \-
>> com.sun.jersey:jersey-grizzly2:jar:1.9:compile
>>>> [INFO] |  |  |  |  |  |     +-
>>>> org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
>>>> [INFO] |  |  |  |  |  |     |  \-
>>>> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
>>>> [INFO] |  |  |  |  |  |     |     \-
>>>> org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
>>>> [INFO] |  |  |  |  |  |     |        \-
>>>> org.glassfish.external:management-api:jar:3.0.0-b012:compile
>>>> [INFO] |  |  |  |  |  |     +-
>>>> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
>>>> [INFO] |  |  |  |  |  |     |  \-
>>>> org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
>>>> [INFO] |  |  |  |  |  |     +-
>>>> org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
>>>> [INFO] |  |  |  |  |  |     \-
>> org.glassfish:javax.servlet:jar:3.1:compile
>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
>>>> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
>>>> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-core:jar:1.9:compile
>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
>>>> [INFO] |  |  |  |  |  |  +-
>> org.codehaus.jettison:jettison:jar:1.1:compile
>>>> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
>>>> [INFO] |  |  |  |  |  |  +-
>> com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
>>>> [INFO] |  |  |  |  |  |  |  \-
>> javax.xml.bind:jaxb-api:jar:2.2.2:compile
>>>> [INFO] |  |  |  |  |  |  |     \-
>>>> javax.activation:activation:jar:1.1:compile
>>>> [INFO] |  |  |  |  |  |  +-
>>>> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
>>>> [INFO] |  |  |  |  |  |  \-
>>>> org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
>>>> [INFO] |  |  |  |  |  \-
>>>> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
>>>> [INFO] |  |  |  |  \-
>>>> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
>>>> [INFO] |  |  |  \-
>>>> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
>>>> [INFO] |  |  +-
>>>> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
>>>> [INFO] |  |  |  \-
>> org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
>>>> [INFO] |  |  +-
>>>> org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
>>>> [INFO] |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
>>>> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
>>>> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
>>>> [INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
>>>> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
>>>> [INFO] |  |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
>>>> [INFO] |  |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
>>>> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
>>>> [INFO] |  |     \- jline:jline:jar:0.9.94:compile
>>>> [INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
>>>> [INFO] |  |  +-
>>>>
>> org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
>>>> [INFO] |  |  +-
>> org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
>>>> [INFO] |  |  |  +-
>> org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
>>>> [INFO] |  |  |  \-
>>>> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
>>>> [INFO] |  |  \-
>> org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
>>>> [INFO] |  |     \-
>>>>
>>>>
>> org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
>>>> [INFO] |  |        \-
>>>>
>> org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
>>>> [INFO] |  +-
>> org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
>>>> [INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
>>>> [INFO] |  +-
>> org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
>>>> [INFO] |  |  +-
>>>> org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
>>>> [INFO] |  |  +-
>>>> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
>>>> [INFO] |  |  \-
>> org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
>>>> [INFO] |  |     \-
>> org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
>>>> [INFO] |  +- com.google.guava:guava:jar:16.0:compile


Re: Codebase refactoring proposal

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Some issues WRT lower level Spark integration:
1) Interoperability with Spark data. TF-IDF is one example I actually looked at. There may be other things we can pick up from their committers since they have an abundance of them.
2) Wider acceptance of the Mahout DSL. The DSL’s power was illustrated to me when someone on the Spark list asked about matrix transpose and an MLlib committer’s answer was something like “why would you want to do that?”. Usually you don’t actually execute the transpose, but they don’t even support A’A, AA’, or A’B, which are core to what I work on (see the sketch below). At present you pretty much have to choose between MLlib and Mahout for sparse matrix work. Maybe a half-way measure is some implicit conversions (ugh, I know). If the DSL could interchange datasets with MLlib, people would be pointed to the DSL for that whole class of “why would you want to do that?” features. MLlib seems to be algorithms, not math.
3) Integration of streaming. DStreams support most of the RDD interface, so a batch recalc on a moving time window would nearly fall out of DStream-backed DRMs. This isn’t the same as incremental updates on a stream, but it’s a start.
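
To make point 2 concrete, here is a minimal sketch of the algebra the DSL already expresses. This is illustrative only; it assumes the standard math-scala and spark-bindings imports, and the local master, variable names, and toy matrices are just for the example.

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// Distributed context; a local master only for the sketch.
implicit val ctx = mahoutSparkContext(masterUrl = "local", appName = "dsl-sketch")

// Small DRMs built from in-core matrices.
val drmA = drmParallelize(dense((1, 2, 3), (3, 4, 5), (5, 6, 7)), numPartitions = 2)
val drmB = drmParallelize(dense((1, 0, 1), (0, 1, 0), (1, 1, 1)), numPartitions = 2)

// The "why would you want to do that?" operations, written as plain algebra.
// The optimizer rewrites these into fused jobs instead of materializing A'.
val drmAtA = drmA.t %*% drmA   // A'A
val drmAAt = drmA %*% drmA.t   // AA'
val drmAtB = drmA.t %*% drmB   // A'B

val inCoreAtA = drmAtA.collect // small product pulled back in-core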

Last year we were looking at faster compute engines than Hadoop MapReduce: Spark, H2O, Flink. So we jumped. Now the need is for streaming, and especially incrementally updated streaming models. Seems like we need to address this.
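
For point 3, a rough sketch of the shape a periodic recalc over a DStream window could take, assuming a stream of (key, Mahout Vector) rows has already been parsed from something like Kafka. Everything here is a sketch under that assumption rather than existing Mahout streaming API; drmWrap is the spark-bindings call for viewing an RDD of row tuples as a DRM, and the window sizes and function name are made up.

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream
import org.apache.mahout.math.Vector
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// Recompute A'A over a sliding 10-minute window, every minute.
def periodicAtA(rows: DStream[(Int, Vector)]): Unit = {
  rows.window(Seconds(600), Seconds(60)).foreachRDD { rdd =>
    if (rdd.count() > 0) {
      val drmA = drmWrap(rdd)                      // windowed rows viewed as a DRM
      val drmAtA = (drmA.t %*% drmA).checkpoint()  // batch recalc of A'A on the window
      // hand drmAtA to the row-similarity / model-update code here
    }
  }
}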

Andrew, regardless of the above, having TF-IDF would be super helpful; row similarity for content/text would benefit greatly.


On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

But first I need to do massive fixes and improvements to the distributed
optimizer itself. Still waiting on green light for that.


Re: Codebase refactoring proposal

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
But first I need to do massive fixes and improvements to the distributed
optimizer itself. Still waiting on green light for that.

Re: Codebase refactoring proposal

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Feb 3, 2015 7:20 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
>
> BTW what level of difficulty would making the DSL run on MLlib Vectors
and RowMatrix be? Looking at using their hashing TF-IDF but it raises
impedance mismatch between DRM and MLlib RowMatrix. This would further
reduce artifact size by a bunch.

Short answer: if it were possible, I'd not bother with the Mahout code base at
all. The problem is that MLlib's linear algebra lacks sufficiently flexible
semantics and abstraction. Breeze is infinitely better in that department, but
at the time it was sufficiently worse at abstracting interoperability of
matrices with different structures. And MLlib does not expose Breeze.
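
For what it's worth, a tiny in-core illustration of that structure abstraction in the math-scala bindings; this is just a sketch assuming the usual scalabindings imports, with made-up values.

import org.apache.mahout.math.scalabindings._
import RLikeOps._

val a = dense((1.0, 2.0, 0.0), (0.0, 3.0, 4.0))    // dense 2x3 matrix
val v = svec((0 -> 1.0) :: (2 -> 2.0) :: Nil)      // sparse vector of length 3
val av = a %*% v     // same operator regardless of the operands' backing structure
val g  = a %*% a.t   // Gram matrix via the same R-like operators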

Looking forward toward hardware-accelerated bolt-on work, I must say that after
reading the Breeze code for some time I still have a much clearer plan for how
such back-end hybridization and cost calibration might work with the current
Mahout math abstractions than with Breeze. It is also more in line with my
current work tasks.

>
> Also backing something like a DRM with DStreams. Periodic model recalc
with streams is maybe the first step towards truly streaming algos. Looking
at DStream -> DRM conversion for A’A, A’B, and AA’ in item and row
similarity. Attach Kafka and get evergreen models, if not incrementally
updating models.
>
> On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> bottom line compile-time dependencies are satisfied with no extra stuff
> from mr-legacy or its transitives. This is proven by virtue of  successful
> compilation with no dependency on mr-legacy on the tree.
>
> Runtime sufficiency for no extra dependency is proven via running shell or
> embedded tests (unit tests) which are successful too. This implies
> embedding and shell apis.
>
> Issue with guava is typical one. if it were an issue, i wouldn't be able
to
> compile and/or run stuff. Now, question is what do we do if drivers want
> extra stuff that is not found in Spark.
>
> Now, It is so nice not to depend on anything extra so i am hesitant to
> offer anything  here. either shading or lib with opt-in dependency policy
> would suffice though, since it doesn't look like we'd have to have tons of
> extra for drivers.
>
>
>
> On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <pa...@occamsmachete.com>
wrote:
>
> > I vaguely remember there being a Guava version problem where the version
> > had to be rolled back in one of the hadoop modules. The math-scala
> > IndexedDataset shouldn’t care about version.
> >
> > BTW It seems pretty easy to take out the option parser and replace with
> > match and tuples especially if we can extend the Scala App class. It
might
> > actually simplify things since I can then use several case classes to
hold
> > options (scopt needed one object), which in turn takes out all those
ugly
> > casts. I’ll take a look next time I’m in there.
> >
> > On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> >
> > in 'spark' module it is overwritten with spark dependency, which also
comes
> > at the same version so happens. so should be fine with 1.1.x
> >
> > [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
> > mahout-spark_2.10 ---
> > [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
> > [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
> > [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> > [INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> > [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
> > [INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
> > [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
> > [INFO] |  |  |  +- commons-logging:commons-logging:jar:1.1.3:compile
> > [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
> > [INFO] |  |  |  +-
> > commons-configuration:commons-configuration:jar:1.6:compile
> > [INFO] |  |  |  |  +-
> > commons-collections:commons-collections:jar:3.2.1:compile
> > [INFO] |  |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
> > [INFO] |  |  |  |  |  \-
> > commons-beanutils:commons-beanutils:jar:1.7.0:compile
> > [INFO] |  |  |  |  \-
> > commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
> > [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
> > [INFO] |  |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
> > [INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
> > [INFO] |  |  |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
> > [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
> > [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
> > [INFO] |  |  +-
> > org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
> > [INFO] |  |  |  +-
> > org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
> > [INFO] |  |  |  |  +-
> > org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
> > [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
> > [INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
> > [INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
> > [INFO] |  |  |  |  |  +-
> >
> >
com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
> > [INFO] |  |  |  |  |  |  +-
> >
> >
com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
> > [INFO] |  |  |  |  |  |  |  +-
> > javax.servlet:javax.servlet-api:jar:3.0.1:compile
> > [INFO] |  |  |  |  |  |  |  \-
com.sun.jersey:jersey-client:jar:1.9:compile
> > [INFO] |  |  |  |  |  |  \-
com.sun.jersey:jersey-grizzly2:jar:1.9:compile
> > [INFO] |  |  |  |  |  |     +-
> > org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
> > [INFO] |  |  |  |  |  |     |  \-
> > org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
> > [INFO] |  |  |  |  |  |     |     \-
> > org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
> > [INFO] |  |  |  |  |  |     |        \-
> > org.glassfish.external:management-api:jar:3.0.0-b012:compile
> > [INFO] |  |  |  |  |  |     +-
> > org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
> > [INFO] |  |  |  |  |  |     |  \-
> > org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
> > [INFO] |  |  |  |  |  |     +-
> > org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
> > [INFO] |  |  |  |  |  |     \-
org.glassfish:javax.servlet:jar:3.1:compile
> > [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
> > [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
> > [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-core:jar:1.9:compile
> > [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
> > [INFO] |  |  |  |  |  |  +-
org.codehaus.jettison:jettison:jar:1.1:compile
> > [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
> > [INFO] |  |  |  |  |  |  +-
com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
> > [INFO] |  |  |  |  |  |  |  \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
> > [INFO] |  |  |  |  |  |  |     \-
> > javax.activation:activation:jar:1.1:compile
> > [INFO] |  |  |  |  |  |  +-
> > org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
> > [INFO] |  |  |  |  |  |  \-
> > org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
> > [INFO] |  |  |  |  |  \-
> > com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
> > [INFO] |  |  |  |  \-
> > org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
> > [INFO] |  |  |  \-
> > org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
> > [INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
> > [INFO] |  |  +-
> > org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
> > [INFO] |  |  |  \-
org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
> > [INFO] |  |  +-
> > org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
> > [INFO] |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
> > [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
> > [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
> > [INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
> > [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
> > [INFO] |  |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
> > [INFO] |  |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
> > [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
> > [INFO] |  |     \- jline:jline:jar:0.9.94:compile
> > [INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
> > [INFO] |  |  +-
> >
org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
> > [INFO] |  |  +-
org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
> > [INFO] |  |  |  +-
org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
> > [INFO] |  |  |  \-
> > org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
> > [INFO] |  |  \-
org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
> > [INFO] |  |     \-
> >
> >
org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
> > [INFO] |  |        \-
> > org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
> > [INFO] |  +-
org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
> > [INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
> > [INFO] |  +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
> > [INFO] |  |  +-
> > org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
> > [INFO] |  |  +-
> > org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
> > [INFO] |  |  \-
org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
> > [INFO] |  |     \-
org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
> > [INFO] |  +- com.google.guava:guava:jar:16.0:compile
> > d
> >
> > On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> >> looks like it is also requested by mahout-math, wonder what is using it
> >> there.
> >>
> >> At very least, it needs to be synchronized to the one currently used by
> >> spark.
> >>
> >> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
mahout-hadoop
> >> ---
> >> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
> >> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
> >> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
> >> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
> >> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
> >> [INFO] +-
org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
> >> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> >> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> >>
> >>
> >> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <pa...@occamsmachete.com>
> > wrote:
> >>
> >>> Looks like Guava is in Spark.
> >>>
> >>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> >>>
> >>> IndexedDataset uses Guava. Can’t tell from sure but it sounds like
this
> >>> would not be included since I think it was taken from the mrlegacy
jar.
> >>>
> >>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >>>
> >>> ---------- Forwarded message ----------
> >>> From: "Pat Ferrel" <pa...@occamsmachete.com>
> >>> Date: Jan 25, 2015 9:39 AM
> >>> Subject: Re: Codebase refactoring proposal
> >>> To: <de...@mahout.apache.org>
> >>> Cc:
> >>>
> >>>> When you get a chance a PR would be good.
> >>>
> >>> Yes, it would. And not just for that.
> >>>
> >>>> As I understand it you are putting some class jars somewhere in the
> >>> classpath. Where? How?
> >>>>
> >>>
> >>> /bin/mahout
> >>>
> >>> (Computes 2 different classpaths. See  'bin/mahout classpath' vs.
> >>> 'bin/mahout -spark'.)
> >>>
> >>> If i interpret current shell code there correctky, legacy path tries
to
> >>> use
> >>> examples assemblies if not packaged, or /lib if packaged. True
> > motivation
> >>> of that significantly predates 2010 and i suspect only Benson knows
> > whole
> >>> true intent there.
> >>>
> >>> The spark path, which is really a quick hack of the script, tries to
get
> >>> only selected mahout jars and locally instlalled spark classpath
which i
> >>> guess is just the shaded spark jar in recent spark releases. It also
> >>> apparently tries to include /libs/*, which is never compiled in
> > unpackaged
> >>> version, and now i think it is a bug it is included  because /libs/*
is
> >>> apparently legacy packaging, and shouldnt be used  in spark jobs with
a
> >>> wildcard. I cant beleive how lazy i am, i still did not find time to
> >>> understand mahout build in all cases.
> >>>
> >>> I am not even sure if packaged mahout will work with spark, honestly,
> >>> because of the /lib. Never tried that, since i mostly use application
> >>> embedding techniques.
> >>>
> >>> The same solution may apply to adding external dependencies and
removing
> >>> the assembly in the Spark module. Which would leave only one major
build
> >>> issue afaik.
> >>>>
> >>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <dl...@gmail.com>
> >>> wrote:
> >>>>
> >>>> No, no PR. Only experiment on private. But i believe i sufficiently
> >>> defined
> >>>> what i want to do in order to gauge if we may want to advance it some
> >>> time
> >>>> later. Goal is much lighter dependency for spark code. Eliminate
> >>> everything
> >>>> that is not compile-time dependent. (and a lot of it is thru legacy
MR
> >>> code
> >>>> which we of course don't use).
> >>>>
> >>>> Cant say i understand the remaining issues you are talking about
> > though.
> >>>>
> >>>> If you are talking about compiling lib or shaded assembly, no, this
> >>> doesn't
> >>>> do anything about it. Although point is, as it stands, the algebra
and
> >>>> shell don't have any external dependencies but spark and these 4 (5?)
> >>>> mahout jars so they technically don't even need an assembly (as
> >>>> demonstrated).
> >>>>
> >>>> As i said, it seems driver code is the only one that may need some
> >>> external
> >>>> dependencies, but that's a different scenario from those i am talking
> >>>> about. But i am relatively happy with having the first two working
> >>> nicely
> >>>> at this point.
> >>>>
> >>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <pa...@occamsmachete.com>
> >>> wrote:
> >>>>
> >>>>> +1
> >>>>>
> >>>>> Is there a PR? You mention a "tiny mahout-hadoop” module. It would be
> >>>>> nice to see how you’ve structured that in case we can use the same model
> >>>>> to solve the two remaining refactoring issues.
> >>>>> 1) external dependencies in the spark module
> >>>>> 2) no spark or h2o in the release artifacts.
> >>>>>
> >>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn <sq...@gatech.edu>
wrote:
> >>>>>
> >>>>> Also +1
> >>>>>
> >>>>> iPhone'd
> >>>>>
> >>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo <ap...@outlook.com>
> > wrote:
> >>>>>>
> >>>>>> +1
> >>>>>>
> >>>>>>
> >>>>>> Sent from my Verizon Wireless 4G LTE smartphone
> >>>>>>
> >>>>>> -------- Original message --------
> >>>>>> From: Dmitriy Lyubimov <dl...@gmail.com>
> >>>>>> Date: 01/23/2015 6:06 PM (GMT-05:00)
> >>>>>> To: dev@mahout.apache.org
> >>>>>> Subject: Codebase refactoring proposal
> >>>>>> So right now mahout-spark depends on mr-legacy.
> >>>>>> I did quick refactoring and it turns out it only _irrevocably_
> > depends
> >>> on
> >>>>>> the following classes there:
> >>>>>>
> >>>>>> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable,
> > and
> >>>>> ...
> >>>>>> *sigh* o.a.m.common.Pair
> >>>>>>
> >>>>>> So I just dropped those five classes into a new tiny mahout-hadoop
> >>>>>> module (to signify stuff that is directly relevant to serializing things
> >>>>>> to DFS API) and completely removed mrlegacy and its transients from
> >>>>>> spark and spark-shell dependencies.
> >>>>>>
> >>>>>> So non-cli applications (shell scripts and embedded api use) actually
> >>>>>> only need spark dependencies (which come from SPARK_HOME classpath, of
> >>>>>> course) and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop
> >>>>>> and optionally mahout-spark-shell (for running shell)).
> >>>>>>
> >>>>>> This of course still doesn't address driver problems that want to throw
> >>>>>> more stuff into front-end classpath (such as cli parser) but at least it
> >>>>>> renders transitive luggage of mr-legacy (and the size of worker-shipped
> >>>>>> jars) much more tolerable.
> >>>>>>
> >>>>>> How does that sound?
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> >
> >
>

Re: Codebase refactoring proposal

Posted by Andrew Palumbo <ap...@outlook.com>.
Pat,
I don't know if this would be useful, but I was looking at porting our
TF-IDF classes over from MRLegacy. They're pretty simple, basically
just wrapper classes around a Lucene analyzer, but they require a Lucene
dependency. I'm not sure we want a Lucene dependency in math-scala.
This was actually something I wanted to check in on. I have a script
here that I've been playing with:

https://github.com/andrewpalumbo/mahout/blob/MAHOUT-1536-scala/examples/bin/spark/ClassifyNewNBFin.scala

This uses the MRLegacy TF and TF-IDF classes.
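
Just to make the shape of it concrete, here is a rough, untested sketch of what
a Lucene-backed tokenizer plus a plain in-memory TF-IDF could look like outside
of MRLegacy (the analyzer is passed in, so lucene-core would be the only extra
dependency; the names and the weighting formula are just illustrative, not the
actual MRLegacy classes):

import java.io.StringReader
import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import scala.collection.mutable

// Tokenize one document with whatever Lucene analyzer the caller supplies.
def tokenize(text: String, analyzer: Analyzer): Seq[String] = {
  val ts = analyzer.tokenStream("text", new StringReader(text))
  val term = ts.addAttribute(classOf[CharTermAttribute])
  ts.reset()
  val tokens = mutable.ArrayBuffer[String]()
  while (ts.incrementToken()) tokens += term.toString
  ts.end(); ts.close()
  tokens
}

// Plain TF-IDF over a small in-memory corpus; no Mahout classes involved.
def tfIdf(docs: Seq[String], analyzer: Analyzer): Seq[Map[String, Double]] = {
  val tokenized = docs.map(tokenize(_, analyzer))
  val df = mutable.Map[String, Int]().withDefaultValue(0)
  tokenized.foreach(_.distinct.foreach(t => df(t) += 1))
  val n = docs.size.toDouble
  tokenized.map { tokens =>
    val tf = tokens.groupBy(identity).mapValues(_.size.toDouble)
    tf.map { case (t, f) => t -> f * math.log(n / (1 + df(t))) }.toMap
  }
}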



On 02/03/2015 10:19 AM, Pat Ferrel wrote:
> BTW, how hard would it be to make the DSL run on MLlib Vectors and RowMatrix? I'm looking at using their hashing TF-IDF, but it raises an impedance mismatch between the DRM and MLlib's RowMatrix. This would further reduce artifact size by a bunch.
>
> Also: backing something like a DRM with DStreams. Periodic model recalc with streams is maybe the first step towards truly streaming algos. I'm looking at DStream -> DRM conversion for A’A, A’B, and AA’ in item and row similarity. Attach Kafka and you get evergreen models, if not incrementally updating ones.
>
> On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> bottom line compile-time dependencies are satisfied with no extra stuff
> from mr-legacy or its transitives. This is proven by virtue of  successful
> compilation with no dependency on mr-legacy on the tree.
>
> Runtime sufficiency for no extra dependency is proven via running shell or
> embedded tests (unit tests) which are successful too. This implies
> embedding and shell apis.
>
> Issue with guava is typical one. if it were an issue, i wouldn't be able to
> compile and/or run stuff. Now, question is what do we do if drivers want
> extra stuff that is not found in Spark.
>
> Now, It is so nice not to depend on anything extra so i am hesitant to
> offer anything  here. either shading or lib with opt-in dependency policy
> would suffice though, since it doesn't look like we'd have to have tons of
> extra for drivers.
>
>
>
> On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
>> I vaguely remember there being a Guava version problem where the version
>> had to be rolled back in one of the hadoop modules. The math-scala
>> IndexedDataset shouldn’t care about version.
>>
>> BTW It seems pretty easy to take out the option parser and replace with
>> match and tuples especially if we can extend the Scala App class. It might
>> actually simplify things since I can then use several case classes to hold
>> options (scopt needed one object), which in turn takes out all those ugly
>> casts. I’ll take a look next time I’m in there.
>>
>> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> in 'spark' module it is overwritten with spark dependency, which also comes
>> at the same version so happens. so should be fine with 1.1.x
>>
>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
>> mahout-spark_2.10 ---
>> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
>> [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>> [INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>> [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
>> [INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
>> [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
>> [INFO] |  |  |  +- commons-logging:commons-logging:jar:1.1.3:compile
>> [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
>> [INFO] |  |  |  +-
>> commons-configuration:commons-configuration:jar:1.6:compile
>> [INFO] |  |  |  |  +-
>> commons-collections:commons-collections:jar:3.2.1:compile
>> [INFO] |  |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
>> [INFO] |  |  |  |  |  \-
>> commons-beanutils:commons-beanutils:jar:1.7.0:compile
>> [INFO] |  |  |  |  \-
>> commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
>> [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
>> [INFO] |  |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
>> [INFO] |  |  |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
>> [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
>> [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
>> [INFO] |  |  +-
>> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
>> [INFO] |  |  |  +-
>> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
>> [INFO] |  |  |  |  +-
>> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
>> [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
>> [INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
>> [INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
>> [INFO] |  |  |  |  |  +-
>>
>> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
>> [INFO] |  |  |  |  |  |  +-
>>
>> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
>> [INFO] |  |  |  |  |  |  |  +-
>> javax.servlet:javax.servlet-api:jar:3.0.1:compile
>> [INFO] |  |  |  |  |  |  |  \- com.sun.jersey:jersey-client:jar:1.9:compile
>> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile
>> [INFO] |  |  |  |  |  |     +-
>> org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
>> [INFO] |  |  |  |  |  |     |  \-
>> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
>> [INFO] |  |  |  |  |  |     |     \-
>> org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
>> [INFO] |  |  |  |  |  |     |        \-
>> org.glassfish.external:management-api:jar:3.0.0-b012:compile
>> [INFO] |  |  |  |  |  |     +-
>> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
>> [INFO] |  |  |  |  |  |     |  \-
>> org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
>> [INFO] |  |  |  |  |  |     +-
>> org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
>> [INFO] |  |  |  |  |  |     \- org.glassfish:javax.servlet:jar:3.1:compile
>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
>> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
>> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-core:jar:1.9:compile
>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
>> [INFO] |  |  |  |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
>> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
>> [INFO] |  |  |  |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
>> [INFO] |  |  |  |  |  |  |  \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
>> [INFO] |  |  |  |  |  |  |     \-
>> javax.activation:activation:jar:1.1:compile
>> [INFO] |  |  |  |  |  |  +-
>> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
>> [INFO] |  |  |  |  |  |  \-
>> org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
>> [INFO] |  |  |  |  |  \-
>> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
>> [INFO] |  |  |  |  \-
>> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
>> [INFO] |  |  |  \-
>> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
>> [INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
>> [INFO] |  |  +-
>> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
>> [INFO] |  |  |  \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
>> [INFO] |  |  +-
>> org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
>> [INFO] |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
>> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
>> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
>> [INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
>> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
>> [INFO] |  |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
>> [INFO] |  |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
>> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
>> [INFO] |  |     \- jline:jline:jar:0.9.94:compile
>> [INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
>> [INFO] |  |  +-
>> org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
>> [INFO] |  |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
>> [INFO] |  |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
>> [INFO] |  |  |  \-
>> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
>> [INFO] |  |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
>> [INFO] |  |     \-
>>
>> org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
>> [INFO] |  |        \-
>> org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
>> [INFO] |  +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
>> [INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
>> [INFO] |  +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
>> [INFO] |  |  +-
>> org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
>> [INFO] |  |  +-
>> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
>> [INFO] |  |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
>> [INFO] |  |     \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
>> [INFO] |  +- com.google.guava:guava:jar:16.0:compile
>>
>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>>
>>> looks like it is also requested by mahout-math, wonder what is using it
>>> there.
>>>
>>> At very least, it needs to be synchronized to the one currently used by
>>> spark.
>>>
>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop
>>> ---
>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
>>> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
>>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
>>> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
>>> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
>>> [INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
>>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>>> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>>>
>>>
>>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>>> Looks like Guava is in Spark.
>>>>
>>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>>>
>>>> IndexedDataset uses Guava. Can’t tell for sure but it sounds like this
>>>> would not be included since I think it was taken from the mrlegacy jar.
>>>>
>>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>>>> ---------- Forwarded message ----------
>>>> From: "Pat Ferrel" <pa...@occamsmachete.com>
>>>> Date: Jan 25, 2015 9:39 AM
>>>> Subject: Re: Codebase refactoring proposal
>>>> To: <de...@mahout.apache.org>
>>>> Cc:
>>>>
>>>>> When you get a chance a PR would be good.
>>>> Yes, it would. And not just for that.
>>>>
>>>>> As I understand it you are putting some class jars somewhere in the
>>>> classpath. Where? How?
>>>> /bin/mahout
>>>>
>>>> (Computes 2 different classpaths. See  'bin/mahout classpath' vs.
>>>> 'bin/mahout -spark'.)
>>>>
>>>> If i interpret current shell code there correctly, legacy path tries to
>>>> use
>>>> examples assemblies if not packaged, or /lib if packaged. True
>> motivation
>>>> of that significantly predates 2010 and i suspect only Benson knows
>> whole
>>>> true intent there.
>>>>
>>>> The spark path, which is really a quick hack of the script, tries to get
>>>> only selected mahout jars and locally installed spark classpath which i
>>>> guess is just the shaded spark jar in recent spark releases. It also
>>>> apparently tries to include /libs/*, which is never compiled in
>> unpackaged
>>>> version, and now i think it is a bug that it is included, because /libs/* is
>>>> apparently legacy packaging, and shouldn't be used in spark jobs with a
>>>> wildcard. I can't believe how lazy i am, i still did not find time to
>>>> understand mahout build in all cases.
>>>>
>>>> I am not even sure if packaged mahout will work with spark, honestly,
>>>> because of the /lib. Never tried that, since i mostly use application
>>>> embedding techniques.
>>>>
>>>> The same solution may apply to adding external dependencies and removing
>>>> the assembly in the Spark module. Which would leave only one major build
>>>> issue afaik.
>>>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>>> wrote:
>>>>> No, no PR. Only experiment on private. But i believe i sufficiently
>>>> defined
>>>>> what i want to do in order to gauge if we may want to advance it some
>>>> time
>>>>> later. Goal is much lighter dependency for spark code. Eliminate
>>>> everything
>>>>> that is not compile-time dependent. (and a lot of it is thru legacy MR
>>>> code
>>>>> which we of course don't use).
>>>>>
>>>>> Can't say i understand the remaining issues you are talking about
>> though.
>>>>> If you are talking about compiling lib or shaded assembly, no, this
>>>> doesn't
>>>>> do anything about it. Although point is, as it stands, the algebra and
>>>>> shell don't have any external dependencies but spark and these 4 (5?)
>>>>> mahout jars so they technically don't even need an assembly (as
>>>>> demonstrated).
>>>>>
>>>>> As i said, it seems driver code is the only one that may need some
>>>> external
>>>>> dependencies, but that's a different scenario from those i am talking
>>>>> about. But i am relatively happy with having the first two working
>>>> nicely
>>>>> at this point.
>>>>>
>>>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <pa...@occamsmachete.com>
>>>> wrote:
>>>>>> +1
>>>>>>
>>>>>> Is there a PR? You mention a "tiny mahout-hadoop” module. It would be
>>>> nice
>>>>>> to see how you’ve structured that in case we can use the same model to
>>>>>> solve the two remaining refactoring issues.
>>>>>> 1) external dependencies in the spark module
>>>>>> 2) no spark or h2o in the release artifacts.
>>>>>>
>>>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>>>>>>
>>>>>> Also +1
>>>>>>
>>>>>> iPhone'd
>>>>>>
>>>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo <ap...@outlook.com>
>> wrote:
>>>>>>> +1
>>>>>>>
>>>>>>>
>>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>>>
>>>>>>> -------- Original message --------
>>>>>>> From: Dmitriy Lyubimov <dl...@gmail.com>
>>>>>>> Date: 01/23/2015 6:06 PM (GMT-05:00)
>>>>>>> To: dev@mahout.apache.org
>>>>>>> Subject: Codebase refactoring proposal
>>>>>>> So right now mahout-spark depends on mr-legacy.
>>>>>>> I did quick refactoring and it turns out it only _irrevocably_
>> depends
>>>> on
>>>>>>> the following classes there:
>>>>>>>
>>>>>>> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable,
>> and
>>>>>> ...
>>>>>>> *sigh* o.a.m.common.Pair
>>>>>>>
>>>>>>> So I just dropped those five classes into a new tiny
>>>> mahout-hadoop
>>>>>>> module (to signify stuff that is directly relevant to serializing
>>>> things
>>>>>> to
>>>>>>> DFS API) and completely removed mrlegacy and its transients from
>> spark
>>>>>> and
>>>>>>> spark-shell dependencies.
>>>>>>>
>>>>>>> So non-cli applications (shell scripts and embedded api use) actually
>>>>>> only
>>>>>>> need spark dependencies (which come from SPARK_HOME classpath, of
>>>> course)
>>>>>>> and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
>>>>>>> optionally mahout-spark-shell (for running shell)).
>>>>>>>
>>>>>>> This of course still doesn't address driver problems that want to
>>>> throw
>>>>>>> more stuff into front-end classpath (such as cli parser) but at least
>>>> it
>>>>>>> renders transitive luggage of mr-legacy (and the size of
>>>> worker-shipped
>>>>>>> jars) much more tolerable.
>>>>>>>
>>>>>>> How does that sound?
>>>>>>
>>>>
>>>>
>>


Re: Codebase refactoring proposal

Posted by Pat Ferrel <pa...@occamsmachete.com>.
BTW, how hard would it be to make the DSL run on MLlib Vectors and RowMatrix? I'm looking at using their hashing TF-IDF, but it raises an impedance mismatch between the DRM and MLlib's RowMatrix. This would further reduce artifact size by a bunch.
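
The row-wise conversion itself looks fairly mechanical. A rough sketch, assuming
drmWrap from the spark bindings keeps roughly its current shape (untested):

import org.apache.mahout.math.DenseVector
import org.apache.mahout.sparkbindings._
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Re-key MLlib rows by their position and convert each row to a Mahout vector,
// then wrap the resulting RDD[(Int, Vector)] as a DRM.
def rowMatrixToDrm(mat: RowMatrix) = {
  val rows = mat.rows.zipWithIndex().map { case (v, i) =>
    i.toInt -> (new DenseVector(v.toArray): org.apache.mahout.math.Vector)
  }
  drmWrap(rows)
}

Going the other direction (DRM to RowMatrix) is the part that would actually pay
for the artifact-size savings, but it would be the same kind of per-row mapping.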

Also: backing something like a DRM with DStreams. Periodic model recalc with streams is maybe the first step towards truly streaming algos. I'm looking at DStream -> DRM conversion for A’A, A’B, and AA’ in item and row similarity. Attach Kafka and you get evergreen models, if not incrementally updating ones.
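
For the DStream case, the periodic-recalc version could be as simple as
recomputing A'A on every micro-batch. A rough sketch, again assuming drmWrap and
the usual DSL imports (nothing incremental here, just a refresh per batch):

import org.apache.mahout.math.DenseVector
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
import org.apache.spark.streaming.dstream.DStream

// Recompute A'A from scratch on each micro-batch: a periodic model recalc,
// not an incremental update.
def periodicAtA(batches: DStream[(Int, Array[Double])]): Unit =
  batches.foreachRDD { rdd =>
    if (rdd.count() > 0) {
      val drmA = drmWrap(rdd.map { case (key, row) =>
        key -> (new DenseVector(row): org.apache.mahout.math.Vector)
      })
      val drmAtA = (drmA.t %*% drmA).checkpoint()
      // hand drmAtA (the refreshed model) off to whatever consumes it here
    }
  }

An incrementally updated model would need something smarter than this, but as a
first step towards evergreen models a per-batch refresh might be enough.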

On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

bottom line compile-time dependencies are satisfied with no extra stuff
from mr-legacy or its transitives. This is proven by virtue of  successful
compilation with no dependency on mr-legacy on the tree.

Runtime sufficiency for no extra dependency is proven via running shell or
embedded tests (unit tests) which are successful too. This implies
embedding and shell apis.

Issue with guava is typical one. if it were an issue, i wouldn't be able to
compile and/or run stuff. Now, question is what do we do if drivers want
extra stuff that is not found in Spark.

Now, It is so nice not to depend on anything extra so i am hesitant to
offer anything  here. either shading or lib with opt-in dependency policy
would suffice though, since it doesn't look like we'd have to have tons of
extra for drivers.



On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> I vaguely remember there being a Guava version problem where the version
> had to be rolled back in one of the hadoop modules. The math-scala
> IndexedDataset shouldn’t care about version.
> 
> BTW It seems pretty easy to take out the option parser and replace with
> match and tuples especially if we can extend the Scala App class. It might
> actually simplify things since I can then use several case classes to hold
> options (scopt needed one object), which in turn takes out all those ugly
> casts. I’ll take a look next time I’m in there.
> 
> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> in 'spark' module it is overwritten with spark dependency, which also comes
> at the same version so happens. so should be fine with 1.1.x
> 
> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
> mahout-spark_2.10 ---
> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
> [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> [INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
> [INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
> [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
> [INFO] |  |  |  +- commons-logging:commons-logging:jar:1.1.3:compile
> [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
> [INFO] |  |  |  +-
> commons-configuration:commons-configuration:jar:1.6:compile
> [INFO] |  |  |  |  +-
> commons-collections:commons-collections:jar:3.2.1:compile
> [INFO] |  |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
> [INFO] |  |  |  |  |  \-
> commons-beanutils:commons-beanutils:jar:1.7.0:compile
> [INFO] |  |  |  |  \-
> commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
> [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
> [INFO] |  |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
> [INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
> [INFO] |  |  |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
> [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
> [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
> [INFO] |  |  +-
> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
> [INFO] |  |  |  +-
> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
> [INFO] |  |  |  |  +-
> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
> [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
> [INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
> [INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
> [INFO] |  |  |  |  |  +-
> 
> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
> [INFO] |  |  |  |  |  |  +-
> 
> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
> [INFO] |  |  |  |  |  |  |  +-
> javax.servlet:javax.servlet-api:jar:3.0.1:compile
> [INFO] |  |  |  |  |  |  |  \- com.sun.jersey:jersey-client:jar:1.9:compile
> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile
> [INFO] |  |  |  |  |  |     +-
> org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
> [INFO] |  |  |  |  |  |     |  \-
> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
> [INFO] |  |  |  |  |  |     |     \-
> org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
> [INFO] |  |  |  |  |  |     |        \-
> org.glassfish.external:management-api:jar:3.0.0-b012:compile
> [INFO] |  |  |  |  |  |     +-
> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
> [INFO] |  |  |  |  |  |     |  \-
> org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
> [INFO] |  |  |  |  |  |     +-
> org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
> [INFO] |  |  |  |  |  |     \- org.glassfish:javax.servlet:jar:3.1:compile
> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-core:jar:1.9:compile
> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
> [INFO] |  |  |  |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
> [INFO] |  |  |  |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
> [INFO] |  |  |  |  |  |  |  \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
> [INFO] |  |  |  |  |  |  |     \-
> javax.activation:activation:jar:1.1:compile
> [INFO] |  |  |  |  |  |  +-
> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
> [INFO] |  |  |  |  |  |  \-
> org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
> [INFO] |  |  |  |  |  \-
> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
> [INFO] |  |  |  |  \-
> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
> [INFO] |  |  |  \-
> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
> [INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
> [INFO] |  |  +-
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
> [INFO] |  |  |  \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
> [INFO] |  |  +-
> org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
> [INFO] |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
> [INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
> [INFO] |  |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
> [INFO] |  |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
> [INFO] |  |     \- jline:jline:jar:0.9.94:compile
> [INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
> [INFO] |  |  +-
> org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
> [INFO] |  |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
> [INFO] |  |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
> [INFO] |  |  |  \-
> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
> [INFO] |  |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
> [INFO] |  |     \-
> 
> org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
> [INFO] |  |        \-
> org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
> [INFO] |  +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
> [INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
> [INFO] |  +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
> [INFO] |  |  +-
> org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
> [INFO] |  |  +-
> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
> [INFO] |  |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
> [INFO] |  |     \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
> [INFO] |  +- com.google.guava:guava:jar:16.0:compile
> 
> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> 
>> looks like it is also requested by mahout-math, wonder what is using it
>> there.
>> 
>> At very least, it needs to be synchronized to the one currently used by
>> spark.
>> 
>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop
>> ---
>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
>> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
>> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
>> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
>> [INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>> 
>> 
>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
>> 
>>> Looks like Guava is in Spark.
>>> 
>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>> 
>>> IndexedDataset uses Guava. Can’t tell for sure but it sounds like this
>>> would not be included since I think it was taken from the mrlegacy jar.
>>> 
>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>>> 
>>> ---------- Forwarded message ----------
>>> From: "Pat Ferrel" <pa...@occamsmachete.com>
>>> Date: Jan 25, 2015 9:39 AM
>>> Subject: Re: Codebase refactoring proposal
>>> To: <de...@mahout.apache.org>
>>> Cc:
>>> 
>>>> When you get a chance a PR would be good.
>>> 
>>> Yes, it would. And not just for that.
>>> 
>>>> As I understand it you are putting some class jars somewhere in the
>>> classpath. Where? How?
>>>> 
>>> 
>>> /bin/mahout
>>> 
>>> (Computes 2 different classpaths. See  'bin/mahout classpath' vs.
>>> 'bin/mahout -spark'.)
>>> 
>>> If i interpret current shell code there correctly, legacy path tries to
>>> use
>>> examples assemblies if not packaged, or /lib if packaged. True
> motivation
>>> of that significantly predates 2010 and i suspect only Benson knows
> whole
>>> true intent there.
>>> 
>>> The spark path, which is really a quick hack of the script, tries to get
>>> only selected mahout jars and locally installed spark classpath which i
>>> guess is just the shaded spark jar in recent spark releases. It also
>>> apparently tries to include /libs/*, which is never compiled in
> unpackaged
>>> version, and now i think it is a bug that it is included, because /libs/* is
>>> apparently legacy packaging, and shouldn't be used in spark jobs with a
>>> wildcard. I can't believe how lazy i am, i still did not find time to
>>> understand mahout build in all cases.
>>> 
>>> I am not even sure if packaged mahout will work with spark, honestly,
>>> because of the /lib. Never tried that, since i mostly use application
>>> embedding techniques.
>>> 
>>> The same solution may apply to adding external dependencies and removing
>>> the assembly in the Spark module. Which would leave only one major build
>>> issue afaik.
>>>> 
>>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>>> 
>>>> No, no PR. Only experiment on private. But i believe i sufficiently
>>> defined
>>>> what i want to do in order to gauge if we may want to advance it some
>>> time
>>>> later. Goal is much lighter dependency for spark code. Eliminate
>>> everything
>>>> that is not compile-time dependent. (and a lot of it is thru legacy MR
>>> code
>>>> which we of course don't use).
>>>> 
>>>> Can't say i understand the remaining issues you are talking about
> though.
>>>> 
>>>> If you are talking about compiling lib or shaded assembly, no, this
>>> doesn't
>>>> do anything about it. Although point is, as it stands, the algebra and
>>>> shell don't have any external dependencies but spark and these 4 (5?)
>>>> mahout jars so they technically don't even need an assembly (as
>>>> demonstrated).
>>>> 
>>>> As i said, it seems driver code is the only one that may need some
>>> external
>>>> dependencies, but that's a different scenario from those i am talking
>>>> about. But i am relatively happy with having the first two working
>>> nicely
>>>> at this point.
>>>> 
>>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <pa...@occamsmachete.com>
>>> wrote:
>>>> 
>>>>> +1
>>>>> 
>>>>> Is there a PR? You mention a "tiny mahout-hadoop” module. It would be
>>> nice
>>>>> to see how you’ve structured that in case we can use the same model to
>>>>> solve the two remaining refactoring issues.
>>>>> 1) external dependencies in the spark module
>>>>> 2) no spark or h2o in the release artifacts.
>>>>> 
>>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>>>>> 
>>>>> Also +1
>>>>> 
>>>>> iPhone'd
>>>>> 
>>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo <ap...@outlook.com>
> wrote:
>>>>>> 
>>>>>> +1
>>>>>> 
>>>>>> 
>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>> 
>>>>>> -------- Original message --------
>>>>>> From: Dmitriy Lyubimov <dl...@gmail.com>
>>>>>> Date: 01/23/2015 6:06 PM (GMT-05:00)
>>>>>> To: dev@mahout.apache.org
>>>>>> Subject: Codebase refactoring proposal
>>>>>> So right now mahout-spark depends on mr-legacy.
>>>>>> I did quick refactoring and it turns out it only _irrevocably_
> depends
>>> on
>>>>>> the following classes there:
>>>>>> 
>>>>>> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable,
> and
>>>>> ...
>>>>>> *sigh* o.a.m.common.Pair
>>>>>> 
>>>>>> So I just dropped those five classes into a new tiny
>>> mahout-hadoop
>>>>>> module (to signify stuff that is directly relevant to serializing
>>> things
>>>>> to
>>>>>> DFS API) and completely removed mrlegacy and its transients from
> spark
>>>>> and
>>>>>> spark-shell dependencies.
>>>>>> 
>>>>>> So non-cli applications (shell scripts and embedded api use) actually
>>>>> only
>>>>>> need spark dependencies (which come from SPARK_HOME classpath, of
>>> course)
>>>>>> and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
>>>>>> optionally mahout-spark-shell (for running shell)).
>>>>>> 
>>>>>> This of course still doesn't address driver problems that want to
>>> throw
>>>>>> more stuff into front-end classpath (such as cli parser) but at least
>>> it
>>>>>> renders transitive luggage of mr-legacy (and the size of
>>> worker-shipped
>>>>>> jars) much more tolerable.
>>>>>> 
>>>>>> How does that sound?
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
> 
>