Posted to dev@mahout.apache.org by Dmitriy Lyubimov <dl...@gmail.com> on 2014/04/06 19:40:04 UTC

management strategy discussion (was: Board Report)

Not to hijack the Board report thread.

I would like to thank Sebastian for support of this vision.

@Sean: Nice hearing from you again.

We have had this discussion before, and I don't think I want to repeat my
technology strategy arguments again.

This time I would like to offer a slightly different angle.

Let's forget for a second about our role as committers and contributors,
and let's assess the situation from a purely managerial perspective.

As managers, we manage budget, investors, and market niche. Our budget is
committers' and contributors' time, and our investors are largely
contributors. Users may act indirectly, by generating funding for
contributors.

Regardless of your data, the fact of life is that we've been dealing with
severely depleted data. Even the contributors who brought the best-of-breed
Mahout methods have either gone emeritus or are de facto no longer
contributing. I personally haven't touched Hadoop MapReduce for almost two
years now, so I am definitely not an investor in that area any more. Similar
things are happening to others; the capital is fleeing us.

Forgive me for this hyperbole, but what you are suggesting under these
circumstances is like saying "let's continue to build our Mars mission",
while we barely have the budget to buy a Cessna.

Our "mission to Mars" (i.e. our market niche) is, in my opinion, our
technological dependencies and all the bad things that ensue from them,
and, as you mentioned, being an incoherent library of things.

Our budget issues have become so bad that, as managers, we literally face
a dilemma: stay in the same market and become keepers of an entombed mummy
with no visitors and no relatives left alive, or try to probe new market
niches in the hope of reviving the revenue stream. We are kind of doing
the latter, because the alternative IMO is to simply close the doors.

Assuming we don't want to be keepers of the mummy or close the doors, we
explore market niches. What is differentiating? A new distributed engine?
Nah. There are a lot of those. The space is way too crowded. A library? We
already tried that. One of the ideas we are trying here is focusing on a
higher-level programming model and abstracting away from massively parallel
low-level primitives. By virtue of such an approach, we are also abstracting
away (hopefully, at least for some part of the code) from the dependency on
a particular underlying layer.

In that light, Spark is just a PR/ease-of-POC decision. But the philosophy
of an ML translation layer frees us strategically from a hard-wired
statement along the lines of "our algorithms run on Spark".

Finally, I would like to repeat one of the lines I said in past
discussions: the Mahout vision statement was never declared as bound to a
technology. It was bound to scalable ML only. In particular, I have been
hearing statements that Mahout is not married to Hadoop since as early as
2010. That's what has been so attractive to me, and still is: Mahout is not
focusing on technology so much. We are focusing on ML and things surrounding
it.

(Please forgive my hyperbolic exaggerations; they are just for illustration
purposes. In reality it is not that dramatic. Worst case, if we close the
doors, the personal consequences are not as dramatic as in real life,
of course. It's just another experiment, another day of tech weather.
Personally, we have practically nothing to lose. Bruised ego, perhaps, for
some, but that's about it.)

Have a nice day.
-d

On Sun, Apr 6, 2014 at 3:41 AM, Sebastian Schelter <ss...@apache.org> wrote:

> Hi Sean,
>
> Answers inline.
>
>
> On 04/06/2014 11:35 AM, Sean Owen wrote:
>
>> I agree it's worth pausing to ask what is going on. Recent tweets and
>> articles I've seen give the impression that the project is somehow
>> moving entirely to Spark (or even Stratosphere?), or entirely to H2O.
>> These are sweeping changes that sound very hard to reconcile.
>>
>
> What is going on is the process of finding the next direction for Mahout.
> This process has started only recently, is still going on and involves
> talking to people and projects outside of Mahout to find areas where
> collaboration might be beneficial. Apache projects ought to be community
> driven, and recent tweets and articles are meant to draw attention and
> elicit responses from the community to the proposed changes, so that we
> can validate whether we are going in the right direction.
>
> Reactions have been quite positive so far; there is interest in
> collaboration from the Spark, H2O and Stratosphere communities. And there
> was a crowded room with no chairs left at the Hadoop Summit Europe last
> week, when Ted, Suneel and I gave a short talk describing potential future
> directions for Mahout and had a lively discussion with the audience for the
> rest of the time.
>
> What is to be done now is to go through a process of discussion and
> experimentation.
>
>
>> The reality seems more like: someone wants to add some Spark-based
>> matrix stuff and someone else wants to add some H2O-based matrix
>> stuff. These are individually intriguing, and less hard to reconcile,
>> although they sound overlapping.
>>
>
> I think there is a big misconception here. It is not the case that
> "someone wants to add Spark-based matrix stuff". Dmitriy has been working
> for several months on a Scala DSL [1] for distributed linear algebraic
> operations which allows one to write algorithms in a concise, compact and
> beautiful way. A first prototype of this code is part of the codebase and
> looks very promising.
>
> The best aspect of this DSL is that it allows one to define algorithms on
> a *logical* level using a set of underlying logical operators. The benefit
> here is that this abstracts away the underlying execution system.
> Dmitriy already provides a prototypical runtime based on Apache Spark. It
> should be possible to integrate other systems like Stratosphere [2] by
> simply providing an implementation of the operators tailored to
> Stratosphere. In this way, users would be given the choice to run our
> algorithms on different systems without us having to maintain lots of
> different algorithm implementations.
>
>
>> But then, it's not clear what happens to the rest of the code base,
>> most of which is not related? Rewriting it seems far out of scope of
>> available effort, and not what anyone is suggesting. I assume deleting
>> it, while coherent, would be too extreme.
>>
>
> This is a point that needs to be discussed. With the latest release, we
> already deleted over 17,000 lines of code related to rarely used and
> unmaintained algorithms. Whether it is feasible to port the remaining
> distributed algorithms to a new platform depends on whether we can attract
> enough new faces to the project. That is one of the reasons why we talk to
> other projects and communities. From my personal experience I can say that
> implementing an algorithm in the new Scala DSL takes only a fraction of the
> time it takes to write it using MapReduce :)
>
>
>> Speaking as a downstream consumer now, the de facto plan emerging here
>> seems to be a plan to worsen, not address, the significant
>> inconsistencies and problems in the code already. There would be
>> undistributed, MR1, MR2, Spark, H2O code of differing flavors
>> scattered around. It sounds like a step away from 1.0-readiness at a
>> time when this seems to be advertised as coming soon.
>> In the context of a board report, I would think it's also important to
>> acknowledge this perspective, as it is almost certainly causing the
>> project to be removed from a major ecosystem distributor.
>>
>
> What I see is a lively, community-driven discussion ongoing that has yet
> to produce a de-facto plan. I urge you and the major ecosystem distributor
> to participate in this discussion so that we can together produce an
> outcome that matches our interests.
>
>
> Best,
> Sebastian
>
>
> [1] https://mahout.apache.org/users/sparkbindings/home.html
> [2] http://stratosphere.eu/
>

Re: management strategy discussion (was: Board Report)

Posted by Ted Dunning <te...@gmail.com>.

On Sun, Apr 6, 2014 at 7:40 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Finally, I would like to repeat one of the lines I said in past
> discussions: the Mahout vision statement was never declared as bound to a
> technology. It was bound to scalable ML only. In particular, I have been
> hearing statements that Mahout is not married to Hadoop since as early as
> 2010. That's what has been so attractive to me, and still is: Mahout is not
> focusing on technology so much. We are focusing on ML and things
> surrounding it.
>
>
> (Please forgive my hyperbolic exaggerations; they are just for illustration
> purposes. In reality it is not that dramatic. Worst case, if we close the
> doors, the personal consequences are not as dramatic as in real life,
> of course. It's just another experiment, another day of tech weather.
> Personally, we have practically nothing to lose. Bruised ego, perhaps, for
> some, but that's about it.)
>


+1.

Let's make this vision real.