Posted to dev@mahout.apache.org by Eric Charles <er...@apache.org> on 2016/06/01 16:00:29 UTC

Re: Future Mahout - Zeppelin work

Hi Suneel, an independent interpreter makes sense, as Mahout is supposed to
run on various backends, not only Spark.

Yes, I am following the Mahout mailing list (and I am not abroad this year;
this may change in the future).

On 30/05/16 05:47, Suneel Marthi wrote:
> Hi Eric,
>
> We r talking about the same PR which is a tweak of existing Spark-Zeppelin
> interpreter.
> What we r looking at is a specific Mahout-Spark-Zeppelin interpreter that
> is independent of above?
>
> BTW Eric, nice to see u on Mahout mailing lists, u didn't make it to
> Vancouver this time?
>
> On Sun, May 29, 2016 at 10:57 PM, Eric Charles <er...@apache.org> wrote:
>
>> Have you seen [ZEPPELIN-116] Add Mahout Support for Spark Interpreter?
>>
>> https://github.com/apache/incubator-zeppelin/pull/928
>>
>> It declares in the spark interpreter the mahout deps, and creates the sdc
>> (spark distributed context).
>>
>> On 29/05/16 19:16, Suneel Marthi wrote:
>>
>>> On Sun, May 29, 2016 at 12:07 PM, Trevor Grant <tr...@gmail.com>
>>> wrote:
>>>
>>> OK cool. Just wanted to make sure I wasn't stealing anyone's baby or
>>>> duplicating efforts.
>>>>
>>>> Two things:
>>>>
>>>> 1- The blog post referenced the linear-regression example notebook twice-
>>>> I've updated it to reference the ggplot integration. E.g. import this
>>>> note:
>>>>
>>>>
>>>> https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
>>>> (I still need to update it with a blurb about sampling; however, it is done
>>>> in that note...) So to any who tried the blog, a huge apology, because that
>>>> notebook is where all of the 'magic happened' (all of the screenshots /
>>>> gg-plots / etc. happened there).
>>>>
>>>> 2- I have a working prototype of the Zeppelin integration:
>>>> 'mahout-terp' branch of :
>>>> https://github.com/rawkintrevo/incubator-zeppelin
>>>> If you build, and set 'spark.mahout' to 'true' in the Spark Interpreter
>>>> properties, you have a Mahout interpreter. This is the minimally invasive
>>>> way to do it. I'll be opening a PR soon; we'll see what the gang over at
>>>> Zeppelin say.
>>>> I'll still need docs and an example notebook, but I'm waiting to make sure
>>>> I don't need to do a major refactor before I get carried away with those
>>>> activities.
>>>>
>>>> In essence, when 'spark-mahout' is 'true' you jump right in on the R-like
>>>> DSL and you have an sdc declared based on the underlying sc.
>>>>
>>>>
>>> I am not sure if messing with the very "sacrosanct" Zeppelin-Spark
>>> interpreter is gonna go down well with the Spark insanity.  I would prefer
>>> having a separate Mahout-Spark-Zeppelin interpreter under the Zeppelin
>>> project if that's acceptable to the Zeppelin folks, even though most of it
>>> might be repeated.
>>>
>>> What do others have to say?
>>>
>>>
>>> have a good holiday weekend,
>>>>
>>>> tg
>>>>
>>>>
>>>>
>>>> Trevor Grant
>>>> Data Scientist
>>>> https://github.com/rawkintrevo
>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>> http://trevorgrant.org
>>>>
>>>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>>>>
>>>>
>>>> On Sun, May 29, 2016 at 10:49 AM, Andrew Palumbo <ap...@outlook.com>
>>>> wrote:
>>>>
>>>>> Thx Trevor,
>>>>> Re: m-1854, it was something that we started when we were first discussing
>>>>> using the smile plots and trying to pipe them over to Zeppelin. As far as
>>>>> I know, no progress was made on it. I've unassigned it.
>>>>>
>>>>> Feel free to assign any Jiras to yourself.  I think that m-1854 is similar
>>>>> to the mahout-spark-shell, so I may be able to help out there.
>>>>>
>>>>>
>>>>> ________________________________________
>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>> Sent: Saturday, May 28, 2016 11:21:44 PM
>>>>> To: dev@mahout.apache.org
>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>
>>>>> Created a subtask on 1855 for tsv strings.
>>>>>
>>>>> Looking at 1854 assigned to Pat Ferrel, what's your progress to date? How
>>>>> can I help?
>>>>>
>>>>> tg
>>>>>
>>>>>
>>>>>
>>>>> Trevor Grant
>>>>> Data Scientist
>>>>> https://github.com/rawkintrevo
>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>> http://trevorgrant.org
>>>>>
>>>>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>>>>>
>>>>>
>>>>> On Thu, May 26, 2016 at 2:34 PM, Andrew Palumbo <ap...@outlook.com>
>>>>> wrote:
>>>>>
>>>>>> Great!
>>>>>>
>>>>>> When you free up and have the time, could you create some Jiras for these?
>>>>>>
>>>>>> We actually have MAHOUT-1852 open for Histograms already, and MAHOUT-1854
>>>>>> and MAHOUT-1855 (early Zeppelin integration Jiras).  I can close m-1854 and
>>>>>> m-1855 out and we can start new ones if they're not relevant anymore or we
>>>>>> can just go with those.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>>> Sent: Thursday, May 26, 2016 3:17:22 PM
>>>>>> To: dev@mahout.apache.org
>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>
>>>>>> Short answer: it is high priority. I think it will be a Mahout interpreter
>>>>>> in Zeppelin, and given that plans are on hold for a Flink-Mahout in the
>>>>>> short term, I think it should be a piggy-back Spark interpreter (e.g.
>>>>>> exposed through something like %spark.mahout).   So I have thoughts, but no
>>>>>> plan.  Been busy with a couple of other commitments.
>>>>>>
>>>>>> On the Mahout side we need:
>>>>>> A function that will convert small matrices into TSV strings
>>>>>> Convenience functions for sampling super-large matrices into things like
>>>>>> histograms, etc., that one would want to plot. I.e. histogram bucketing?
>>>>>> (less important for the moment)
>>>>>>
>>>>>> On the Zeppelin side we need:
>>>>>> an interpreter.
>>>>>>
>>>>>>
>>>>>> Trevor Grant
>>>>>> Data Scientist
>>>>>> https://github.com/rawkintrevo
>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>> http://trevorgrant.org
>>>>>>
>>>>>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>>>>>>
>>>>>>
>>>>>> On Thu, May 26, 2016 at 1:22 PM, Suneel Marthi <sm...@apache.org>
>>>>>>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>> While on this subject, do we have a plan yet for integrating Zeppelin into
>>>>>>> Mahout or, conversely, for having a Mahout-specific interpreter for
>>>>>>> Zeppelin?  I think that should be high priority in the short term.
>>>>>>>
>>>>>>> On Thu, May 26, 2016 at 1:17 PM, Trevor Grant <
>>>>>>>
>>>>>> trevor.d.grant@gmail.com>
>>>>>
>>>>>> wrote:
>>>>>>>
>>>>>>>> Ahh, like the "Sample From Matrix" paragraph in the notebook.
>>>>>>>>
>>>>>>>> Yea that seems like a good add. If not this afternoon, I'll include it
>>>>>>>> Saturday.
>>>>>>>>
>>>>>>>>
>>>>>>>> Trevor Grant
>>>>>>>> Data Scientist
>>>>>>>> https://github.com/rawkintrevo
>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>> http://trevorgrant.org
>>>>>>>>
>>>>>>>> *"Fortunate is he, who is able to know the causes of things."
>>>>>>>>
>>>>>>> -Virgil*
>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> On Thu, May 26, 2016 at 11:52 AM, Andrew Palumbo <
>>>>>>>>
>>>>>>> ap.dev@outlook.com
>>>>
>>>>>
>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Trevor, I was reading over your blog last night again, first time since
>>>>>>>>> you updated. It is great!
>>>>>>>>>
>>>>>>>>> I have one suggestion: adding a line of code showing how the sampling
>>>>>>>>> of the DRM -> in-core Matrix is done:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L148
>>>>>>>>>
>>>>>>>>> e.g. something like:
>>>>>>>>>
>>>>>>>>>       val mxSin = drmSampleKRows(drmSin, 1000, replacement = false)
>>>>>>>>>
>>>>>>>>> Maybe you omitted this intentionally?
>>>>>>>>>
>>>>>>>>> Andy
>>>>>>>>>
>>>>>>>>> ________________________________________
>>>>>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>>>>>> Sent: Friday, May 20, 2016 7:56:20 PM
>>>>>>>>> To: dev@mahout.apache.org
>>>>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>>>>
>>>>>>>>> Unfortunately Zeppelin dev has been so rapid, 0.6-SNAPSHOT as a version is
>>>>>>>>> uninformative to me. I'd say, if possible, your first troubleshooting
>>>>>>>>> measure would be to re-clone or do a "git fetch upstream" to get up to the
>>>>>>>>> very latest.
>>>>>>>>>
>>>>>>>>> Sorry for delayed reply
>>>>>>>>> Tg
>>>>>>>>> On May 20, 2016 5:36 PM, "Andrew Musselman" <
>>>>>>>>>
>>>>>>>> andrew.musselman@gmail.com>
>>>>>>>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Trevor, my zeppelin source is at this version:
>>>>>>>>>>
>>>>>>>>>>     <groupId>org.apache.zeppelin</groupId>
>>>>>>>>>>     <artifactId>zeppelin</artifactId>
>>>>>>>>>>     <packaging>pom</packaging>
>>>>>>>>>>     <version>0.6.0-incubating-SNAPSHOT</version>
>>>>>>>>>>     <name>Zeppelin</name>
>>>>>>>>>>     <description>Zeppelin project</description>
>>>>>>>>>>     <url>http://zeppelin.incubator.apache.org/</url>
>>>>>>>>>>
>>>>>>>>>> And yes you're right the artifacts weren't added to the dependencies; is
>>>>>>>>>> that a feature in more modern zep?
>>>>>>>>>>
>>>>>>>>>> On Fri, May 20, 2016 at 3:02 PM, Dmitriy Lyubimov <
>>>>>>>>>>
>>>>>>>>> dlieu.7@gmail.com
>>>>>>
>>>>>>>
>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> no parentheses.
>>>>>>>>>>>
>>>>>>>>>>> import org.apache.mahout.sparkbindings._
>>>>>>>>>>> ....
>>>>>>>>>>> val myRdd = myDrm.rdd
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi <
>>>>>>>>>>>
>>>>>>>>>> smarthi@apache.org
>>>>>>
>>>>>>>
>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <
>>>>>>>>>>>>
>>>>>>>>>>> trevor.d.grant@gmail.com>
>>>>>>>>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Pat,
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you spit out a TSV, you can import into pyspark / matplotlib from the
>>>>>>>>>>>>> resource pool in essentially the same way and use that plotting library if
>>>>>>>>>>>>> you prefer.  In fact you could import the tsv into pandas and use all of
>>>>>>>>>>>>> the pandas plotting as well (though I think it is, for the most part, also
>>>>>>>>>>>>> matplotlib with some convenience functions).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>> https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
>>>>
>>>>>
>>>>>>>>>>>>> In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql,
>>>>>>>>>>>>> and scala-spark all share the same spark context, so you can create RDDs
>>>>>>>>>>>>> in one language and access them / work on them in another (so I
>>>>>>>>>>>>> understand).
>>>>>>>>>>>>>
>>>>>>>>>>>>> So in Mahout can you "save" a matrix as a RDD? e.g. something like
>>>>>>>>>>>>>
>>>>>>>>>>>>> val myRDD = myDRM.asRDD()
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> val myRDD = myDRM.rdd()
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> And would 'myRDD' then exist in the spark context?
>>>>>>>>>>>>>
>>>>>>>>>>>>> yes it will be in sparkContext
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Trevor Grant
>>>>>>>>>>>>> Data Scientist
>>>>>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>>>>>> http://trevorgrant.org
>>>>>>>>>>>>>
>>>>>>>>>>>>> *"Fortunate is he, who is able to know the causes of
>>>>>>>>>>>>>
>>>>>>>>>>>> things."
>>>>>
>>>>>> -Virgil*
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <
>>>>>>>>>>>>>
>>>>>>>>>>>> pat@occamsmachete.com>
>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Agreed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> BTW I don't want to stall progress but, being the most ignorant of plot
>>>>>>>>>>>>>> libs, I'll ask if we should consider python and matplotlib. In another
>>>>>>>>>>>>>> project we use python because of the RDD support on Spark, though the
>>>>>>>>>>>>>> visualizations are extremely limited in our case. If we can pass an RDD to
>>>>>>>>>>>>>> pyspark it would allow custom reductions in python before plotting, even
>>>>>>>>>>>>>> though we will support many natively in Mahout. I'm guessing that this
>>>>>>>>>>>>>> would cross a context boundary and require a write to disk?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So 2 questions:
>>>>>>>>>>>>>> 1) what does the inter-language support look like with Spark python vs
>>>>>>>>>>>>>> SparkR, can we transfer RDDs?
>>>>>>>>>>>>>> 2) are the plot libs significantly different?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On May 20, 2016, at 9:54 AM, Trevor Grant <
>>>>>>>>>>>>>>
>>>>>>>>>>>>> trevor.d.grant@gmail.com>
>>>>>>>>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dmitriy really nailed it on the head in his reply to the post, which I'll
>>>>>>>>>>>>>> rebroadcast below. In essence, the whole reason you are (theoretically)
>>>>>>>>>>>>>> using Mahout is that the data is too big to fit in memory. If it's too big
>>>>>>>>>>>>>> to fit in memory, well, then it's probably too big to plot each point (e.g.
>>>>>>>>>>>>>> trillions of rows, you only have so many pixels).   For the example I
>>>>>>>>>>>>>> randomly sampled a matrix.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So as Dmitriy says, in Mahout we need to have functions that will
>>>>>>>>>>>>>> 'preprocess' the data into something plottable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For the Zeppelin-Plotting thing, we need to have a function that will spit
>>>>>>>>>>>>>> out a tsv-like string of the data we want plotted.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
>>>>>>>>>>>>>> There are a couple of ways to go about it. I opened up the discussion on
>>>>>>>>>>>>>> dev@Zeppelin and didn't get any replies. I'm going to take that to mean we
>>>>>>>>>>>>>> can do it in a way that makes the most sense to Mahout users...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> First steps are to include some methods in Mahout that will do that
>>>>>>>>>>>>>> preprocessing, and one that will turn something into a tsv string.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have some general ideas on possible approaches to making an honest-mahout
>>>>>>>>>>>>>> interpreter but I want to play in the code and look at the Flink-Mahout
>>>>>>>>>>>>>> shell a bit before I try to organize my thoughts and present them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ...(2) not sure what is the point of supporting distributed anything. It is
>>>>>>>>>>>>>> distributed presumably because it is hard to keep it in memory. Therefore,
>>>>>>>>>>>>>> plotting anything distributed potentially presents 2 problems: storage
>>>>>>>>>>>>>> space and overplotting due to number of points. The idea is that we have to
>>>>>>>>>>>>>> work out algorithms that condense big data information into small plottable
>>>>>>>>>>>>>> information (like density grids, for example, or histograms)....
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>>>> Trevor Grant
>>>>>>>>>>>>>> Data Scientist
>>>>>>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>>>>>>> http://trevorgrant.org
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *"Fortunate is he, who is able to know the causes of
>>>>>>>>>>>>>>
>>>>>>>>>>>>> things."
>>>>>>
>>>>>>> -Virgil*
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
>>>>>>>>>>>>>>
>>>>>>>>>>>>> pat@occamsmachete.com>
>>>>>>>>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Great job Trevor, we'll need this detail to smooth out the sharp edges, and
>>>>>>>>>>>>>>> any guidance from you or the Zeppelin community will be a big help.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On May 20, 2016, at 8:13 AM, Shannon Quinn <
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> squinn@gatech.edu>
>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Agreed, thoroughly enjoying the blog post.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 5/19/16 12:01 AM, Andrew Palumbo wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Well done, Trevor!  I've not yet had a chance to try this in zeppelin
>>>>>>>>>>>>>>>> but I just read the blog which is great!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -------- Original message --------
>>>>>>>>>>>>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>>>>>>>>>>>>> Date: 05/18/2016 2:44 PM (GMT-05:00)
>>>>>>>>>>>>>>>> To: dev@mahout.apache.org
>>>>>>>>>>>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ah thank you.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Fixing now.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Trevor Grant
>>>>>>>>>>>>>>>> Data Scientist
>>>>>>>>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>>>>>>>>> http://trevorgrant.org
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *"Fortunate is he, who is able to know the causes of
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> things."
>>>>>>>>
>>>>>>>>> -Virgil*
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ap.dev@outlook.com
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hey Trevor- Just refreshed your readme.  The jar that I mentioned is
>>>>>>>>>>>>>>>>> actually:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> /home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> rather than:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> /home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (In the spark module that is)
>>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>>>>>>>>>>>>>> Sent: Wednesday, May 18, 2016 11:02:43 AM
>>>>>>>>>>>>>>>>> To: dev@mahout.apache.org
>>>>>>>>>>>>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ah yes- I remember you pointing that out to me too.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I got sidetracked yesterday for most of the day on an adventure in getting
>>>>>>>>>>>>>>>>> Zeppelin to work right after I accidentally updated to the new snapshot
>>>>>>>>>>>>>>>>> (free hint: the secret was to clear my cache *face-palm*)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm going to add that dependency to the readme.md now.
>>>>>
>>>>>>
>>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>>> tg
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Trevor Grant
>>>>>>>>>>>>>>>>> Data Scientist
>>>>>>>>>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>>>>>>>>>> http://trevorgrant.org
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *"Fortunate is he, who is able to know the causes
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> of
>>>>
>>>>> things."
>>>>>>>>
>>>>>>>>> -Virgil*
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ap.dev@outlook.com>
>>>>>>>>>>>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Trevor this is very cool- I have not been able to look at it closely yet
>>>>>>>>>>>>>>>>>> but just a small point: I believe that you'll also need to add the
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For things like the classification stats, confusion matrix, and
>>>>>>>>>>>>>>>>>> t-digest.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>>>>>>>>>>>>>>> Sent: Wednesday, May 18, 2016 10:47:21 AM
>>>>>>>>>>>>>>>>>> To: dev@mahout.apache.org
>>>>>>>>>>>>>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I still need to update my readme/env per Pat's comments below, however
>>>>>>>>>>>>>>>>>> without further ado, I present two notebooks that integrate Mahout +
>>>>>>>>>>>>>>>>>> Spark + Zeppelin + ggplot2
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>
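
A note on the "condense, then emit TSV" idea from the quoted thread above: the
sketch below samples rows, buckets a column into a histogram, and joins the
result into a tab-separated string. It is plain Python using only the standard
library, not Mahout code. The names sample_rows, histogram, and to_tsv are
hypothetical helpers invented for illustration; the real functions would
presumably live on the Mahout side alongside drmSampleKRows, and nothing here
reflects an existing Mahout or Zeppelin API.

```python
import random


def sample_rows(rows, k, seed=0):
    """Return k rows drawn without replacement (all rows if there are at most k)."""
    rng = random.Random(seed)
    if len(rows) <= k:
        return list(rows)
    return rng.sample(rows, k)


def histogram(values, n_buckets):
    """Bucket values into n_buckets equal-width bins.

    Returns a list of (lower_edge, count) pairs, one per bucket.
    """
    lo, hi = min(values), max(values)
    # Fall back to width 1.0 when every value is identical (zero range).
    width = (hi - lo) / n_buckets or 1.0
    counts = [0] * n_buckets
    for v in values:
        # The maximum value is placed in the last bucket rather than a new one.
        idx = min(int((v - lo) / width), n_buckets - 1)
        counts[idx] += 1
    return [(lo + i * width, c) for i, c in enumerate(counts)]


def to_tsv(header, rows):
    """Join a header and rows into a tab-separated, newline-delimited string."""
    lines = ["\t".join(header)]
    lines += ["\t".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in for a large matrix column: 100,000 pseudo-random values.
    rng = random.Random(42)
    column = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

    # Condense first (sample, then bucket), and only then build the small string.
    sampled = sample_rows(column, 1000)
    print(to_tsv(["bucket_start", "count"], histogram(sampled, 10)))
```

The point of the ordering is the one Dmitriy makes in the thread: the string
handed to the plotting side stays small no matter how large the source is,
because sampling and bucketing happen before anything is serialized. As far as
I know, Zeppelin's table display uses the same tab/newline layout after a
%table prefix, so a string like this could be shown directly or passed on to
another interpreter.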

Re: Future Mahout - Zeppelin work

Posted by Eric Charles <er...@apache.org>.

+1, piggybacking sounds reasonable and like a quick win.

On 01/06/16 18:17, Trevor Grant wrote:
> Hey Eric,
>
> The 'piggyback' or 'patch' approach is a lot easier and less invasive to
> implement in practice, and has the Zeppelin community blessing.
>
> When the Flink version comes online, it will also be super easy to
> replicate the effort.  And even doing two (or more) 'piggybacks' will be
> easier to maintain than one stand-alone Mahout interpreter.  Also,
> 'piggybacking' opens up the possibility of sharing between contexts,
> minimizes user configuration, etc.
>
> The differential is about 20 new lines of code for a piggyback on any
> underlying engine, vs. about 300 lines of code for a stand-alone
> interpreter which must be kept up to date with its Spark/Flink
> counterparts.
>
> Philosophically the stand-alone makes sense, practically the piggyback
> does. *shruggie*
>
> It is possible that somewhere down the road we'll refactor the
> piggyback(s) into a stand-alone interpreter, at which point none of the
> current effort will be wasted; it will just be moving some code around.  So
> the other advantage to the piggyback is that it quickly fields a minimum
> viable product, without having to pay much for it later on down the road.
>
> This is in part due to the way Zeppelin implemented its interpreters which
> involves a lot of code repetition.
>
> I'm open to further discussion, but after playing in the Zeppelin code for
> a while and really grokking different approaches I think this one is best. I
> do invite critiques because I believe I have considered most angles and can
> properly defend the current path, and if there is something I haven't
> thought of, I'd rather it be brought to light sooner than later.
>
> tg
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Wed, Jun 1, 2016 at 11:00 AM, Eric Charles <er...@apache.org> wrote:
>
>> Hi Suneel, an independent interpreter makes sense, as Mahout is supposed to
>> run on various backends, not only Spark.
>>
>> Yes, I am following the Mahout mailing list (and I am not abroad this year;
>> this may change in the future).
>>
>> On 30/05/16 05:47, Suneel Marthi wrote:
>>
>>> Hi Eric,
>>>
>>> We r talking about the same PR which is a tweak of existing Spark-Zeppelin
>>> interpreter.
>>> What we r looking at is a specific Mahout-Spark-Zeppelin interpreter that
>>> is independent of above?
>>>
>>> BTW Eric, nice to see u on Mahout mailing lists, u didn't make it to
>>> Vancouver this time?
>>>
>>> On Sun, May 29, 2016 at 10:57 PM, Eric Charles <er...@apache.org> wrote:
>>>
>>> Have you seen [ZEPPELIN-116] Add Mahout Support for Spark Interpreter?
>>>>
>>>> https://github.com/apache/incubator-zeppelin/pull/928
>>>>
>>>> It declares in the spark interpreter the mahout deps, and creates the sdc
>>>> (spark distributed context).
>>>>
>>>> On 29/05/16 19:16, Suneel Marthi wrote:
>>>>
>>>> On Sun, May 29, 2016 at 12:07 PM, Trevor Grant <trevor.d.grant@gmail.com
>>>>>>
>>>>> wrote:
>>>>>
>>>>> OK cool. Just wanted to make sure I wasn't stealing anyone's baby or
>>>>>
>>>>>> duplicating efforts.
>>>>>>
>>>>>> Two things:
>>>>>>
>>>>>> 1- The blog post referenced the linear-regression example notebook
>>>>>> twice-
>>>>>> I've updated it to reference the ggplot integration. E.g. import this
>>>>>> note:
>>>>>>
>>>>>>
>>>>>>
>>>>>> https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
>>>>>> (I still need to update it with a blurb about sampling; however, it is done
>>>>>> in that note...) So to any who tried the blog, a huge apology, because that
>>>>>> notebook is where all of the 'magic happened' (all of the screenshots /
>>>>>> gg-plots / etc. happened there).
>>>>>>
>>>>>> 2- I have a working prototype of the Zeppelin integration:
>>>>>> 'mahout-terp' branch of :
>>>>>> https://github.com/rawkintrevo/incubator-zeppelin
>>>>>> If you build, and set 'spark.mahout' to 'true' in the Spark Interpreter
>>>>>> properties, you have a Mahout interpreter. This is the minimally invasive
>>>>>> way to do it. I'll be opening a PR soon; we'll see what the gang over at
>>>>>> Zeppelin say.
>>>>>> I'll still need docs and an example notebook, but I'm waiting to make sure
>>>>>> I don't need to do a major refactor before I get carried away with those
>>>>>> activities.
>>>>>>
>>>>>> In essence, when 'spark-mahout' is 'true' you jump right in on the R-like
>>>>>> DSL and you have an sdc declared based on the underlying sc.
>>>>>>
>>>>>>
>>>>> I am not sure if messing with the very "sacrosanct" Zeppelin-Spark
>>>>> interpreter is gonna go down well with the Spark insanity.  I would prefer
>>>>> having a separate Mahout-Spark-Zeppelin interpreter under the Zeppelin
>>>>> project if that's acceptable to the Zeppelin folks, even though most of it
>>>>> might be repeated.
>>>>>
>>>>> What do others have to say?
>>>>>
>>>>>
>>>>> have a good holiday weekend,
>>>>>
>>>>>>
>>>>>> tg
>>>>>>
>>>>>>
>>>>>>
>>>>>> Trevor Grant
>>>>>> Data Scientist
>>>>>> https://github.com/rawkintrevo
>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>> http://trevorgrant.org
>>>>>>
>>>>>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>>>>>>
>>>>>>
>>>>>> On Sun, May 29, 2016 at 10:49 AM, Andrew Palumbo <ap...@outlook.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thx Trevor,
>>>>>>> Re: m-1854, it was something that we started when we were first discussing
>>>>>>> using the smile plots and trying to pipe them over to Zeppelin. As far as
>>>>>>> I know, no progress was made on it. I've unassigned it.
>>>>>>>
>>>>>>> Feel free to assign any Jiras to yourself.  I think that m-1854 is similar
>>>>>>> to the mahout-spark-shell, so I may be able to help out there.
>>>>>>>
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>>>> Sent: Saturday, May 28, 2016 11:21:44 PM
>>>>>>> To: dev@mahout.apache.org
>>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>>
>>>>>>> Created a subtask on 1855 for tsv strings.
>>>>>>>
>>>>>>> Looking at 1854 assigned to Pat Ferrel, what's your progress to date? How
>>>>>>> can I help?
>>>>>>>
>>>>>>> tg
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Trevor Grant
>>>>>>> Data Scientist
>>>>>>> https://github.com/rawkintrevo
>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>> http://trevorgrant.org
>>>>>>>
>>>>>>> *"Fortunate is he, who is able to know the causes of things."
>>>>>>> -Virgil*
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 26, 2016 at 2:34 PM, Andrew Palumbo <ap...@outlook.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Great!
>>>>>>>
>>>>>>>>
>>>>>>>> When you free up and have the time, could you create some Jiras for
>>>>>>>>
>>>>>>>> these?
>>>>>>>
>>>>>>>
>>>>>>>> We actually have MAHOUT-1852 open for Histograms already, and
>>>>>>>>
>>>>>>>> MAHOUT-1854
>>>>>>>
>>>>>>
>>>>>> and MAHOUT-1855 (early Zeppelin integration Jiras).  I can close m-1854
>>>>>>>
>>>>>>>>
>>>>>>>> and
>>>>>>>
>>>>>>> m-1855 out and we can start new ones if they're not relevant anymore
>>>>>>>> or
>>>>>>>>
>>>>>>>> we
>>>>>>>
>>>>>>> can just go with those.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> ________________________________________
>>>>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>>>>> Sent: Thursday, May 26, 2016 3:17:22 PM
>>>>>>>> To: dev@mahout.apache.org
>>>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>>>
>>>>>>>> Short answer: it is high priority. I think it will be a Mahout
>>>>>>>>
>>>>>>>> interpreter
>>>>>>>
>>>>>>> into Zeppelin, and given that plans are on hold for a Flink-Mahout in
>>>>>>>>
>>>>>>>> the
>>>>>>>
>>>>>>
>>>>>> short term, I think it should be a piggy-back spark interpreter (e.g.
>>>>>>>
>>>>>>>> exposed through something like %spark.mahout).   So I have thoughts,
>>>>>>>>
>>>>>>>> but
>>>>>>>
>>>>>>
>>>>>> no
>>>>>>>
>>>>>>> plan.  Been busy with a couple of other commitments.
>>>>>>>>
>>>>>>>> On the Mahout side we need:
>>>>>>>> A function that will convert small matrices into TSV strings
>>>>>>>> Convenience functions for sampling super-large matrices into things
>>>>>>>>
>>>>>>>> like
>>>>>>>
>>>>>>
>>>>>> histograms, etc, that one would want to plot. I.e. histogram bucketing?
>>>>>>>
>>>>>>>> (less important for the moment)
>>>>>>>>
>>>>>>>> On the Zeppelin side we need:
>>>>>>>> an interpreter.
>>>>>>>>
>>>>>>>>
>>>>>>>> Trevor Grant
>>>>>>>> Data Scientist
>>>>>>>> https://github.com/rawkintrevo
>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>> http://trevorgrant.org
>>>>>>>>
>>>>>>>> *"Fortunate is he, who is able to know the causes of things."
>>>>>>>> -Virgil*
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, May 26, 2016 at 1:22 PM, Suneel Marthi <sm...@apache.org>
>>>>>>>>
>>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> While on this subject, do we have a plan yet of integrating Zeppelin
>>>>>>>>
>>>>>>>>>
>>>>>>>>> into
>>>>>>>>
>>>>>>>
>>>>>>> Mahout (or the converse) of having Mahout specific interpreter for
>>>>>>>>
>>>>>>>>> Zeppelin?  I think that should be high priority in the short term.
>>>>>>>>>
>>>>>>>>> On Thu, May 26, 2016 at 1:17 PM, Trevor Grant <
>>>>>>>>>
>>>>>>>>> trevor.d.grant@gmail.com>
>>>>>>>>
>>>>>>>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ahh, like the "Sample From Matrix" paragraph in the notebook.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yea that seems like a good add. If not this afternoon, I'll include
>>>>>>>>>>
>>>>>>>>>> it
>>>>>>>>>
>>>>>>>>
>>>>>>> Saturday.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Trevor Grant
>>>>>>>>>> Data Scientist
>>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>>> http://trevorgrant.org
>>>>>>>>>>
>>>>>>>>>> *"Fortunate is he, who is able to know the causes of things."
>>>>>>>>>>
>>>>>>>>>> -Virgil*
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>> On Thu, May 26, 2016 at 11:52 AM, Andrew Palumbo <
>>>>>>>>>>
>>>>>>>>>> ap.dev@outlook.com
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Trevor, I was reading over your blog last night again- first time
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> since
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> you updated. It is  great!
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I have one suggestion: adding in a code line on how the
>>>>>>>>>>>
>>>>>>>>>>> sampling
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> of the  DRM ->  in-core Matrix is done:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>> https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L148
>>>>>>
>>>>>>
>>>>>>> eg something like:
>>>>>>>>>>>
>>>>>>>>>>>        mxSin = drmSampleKRows(drmSin, 1000, replacement = false)
>>>>>>>>>>>
>>>>>>>>>>> Maybe you omitted this intentionally?
>>>>>>>>>>>
>>>>>>>>>>> Andy
>>>>>>>>>>>
>>>>>>>>>>> ________________________________________
>>>>>>>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>>>>>>>> Sent: Friday, May 20, 2016 7:56:20 PM
>>>>>>>>>>> To: dev@mahout.apache.org
>>>>>>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately Zeppelin dev has been so rapid, 0.6-SNAPSHOT as a
>>>>>>>>>>>
>>>>>>>>>>> version
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> is
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> uninformative to me. I'd say if possible, your first
>>>>>>>>>>>
>>>>>>>>>>> troubleshooting
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> measure would be to re clone or do a "git fetch upstream" to get
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> up
>>>>>>>>>>
>>>>>>>>>
>>>>>> to
>>>>>>>
>>>>>>>>
>>>>>>>> the
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> very latest
>>>>>>>>>>>
>>>>>>>>>>> Sorry for delayed reply
>>>>>>>>>>> Tg
>>>>>>>>>>> On May 20, 2016 5:36 PM, "Andrew Musselman" <
>>>>>>>>>>>
>>>>>>>>>>> andrew.musselman@gmail.com>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Trevor, my zeppelin source is at this version:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>      <groupId>org.apache.zeppelin</groupId>
>>>>>>>>>>>>      <artifactId>zeppelin</artifactId>
>>>>>>>>>>>>      <packaging>pom</packaging>
>>>>>>>>>>>>      <version>0.6.0-incubating-SNAPSHOT</version>
>>>>>>>>>>>>      <name>Zeppelin</name>
>>>>>>>>>>>>      <description>Zeppelin project</description>
>>>>>>>>>>>>      <url>http://zeppelin.incubator.apache.org/</url>
>>>>>>>>>>>>
>>>>>>>>>>>> And yes you're right the artifacts weren't added to the
>>>>>>>>>>>>
>>>>>>>>>>>> dependencies;
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> is
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> that a feature in more modern zep?
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 20, 2016 at 3:02 PM, Dmitriy Lyubimov <
>>>>>>>>>>>>
>>>>>>>>>>>> dlieu.7@gmail.com
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> no parenthesis.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> import o.a.m.sparkbindings._
>>>>>>>>>>>>> ....
>>>>>>>>>>>>> myRdd = myDrm.rdd
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi <
>>>>>>>>>>>>>
>>>>>>>>>>>>> smarthi@apache.org
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> trevor.d.grant@gmail.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Pat,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If you spit out a TSV - you can import into pyspark /
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> matplotlib
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>> from
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> the
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> resource pool in essentially the same way and use that
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> plotting
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>> library
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> if
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> you prefer.  In fact you could import the tsv into pandas
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>> use
>>>>>>>>
>>>>>>>>>
>>>>>>>>> all
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> of
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> the pandas plotting as well (though I think it is for the
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>> part,
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> also
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> matplotlib with some convenience functions).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>> https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
>>>>>>
>>>>>>
>>>>>>> In Zeppelin, unless you specify otherwise, pyspark,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> sparkr,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>> spark-sql,
>>>>>>>
>>>>>>>>
>>>>>>>>>>>> and
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> scala-spark all share the same spark context you can
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> create
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>> RDDs
>>>>>>>
>>>>>>>>
>>>>>>>>> in
>>>>>>>>>>
>>>>>>>>>> one
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> language and access them / work on them in another (so I
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> understand).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> So in Mahout can you "save" a matrix as a RDD? e.g.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>> like
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> val myRDD = myDRM.asRDD()
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> val myRDD = myDRM.rdd()
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And would 'myRDD' then exist in the spark context?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> yes it will be in sparkContext
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Trevor Grant
>>>>>>>>>>>>>>> Data Scientist
>>>>>>>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>>>>>>>> http://trevorgrant.org
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *"Fortunate is he, who is able to know the causes of
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> things."
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>> -Virgil*
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> pat@occamsmachete.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Agreed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BTW I don't want to stall progress but being the most
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ignorant
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>> of
>>>>>>>>>>
>>>>>>>>>> plot
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> libs, I'll ask if we should consider python and
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> matplotlib.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>> In
>>>>>>>>
>>>>>>>>>
>>>>>>>>> another
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> project we use python because of the RDD support on
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Spark
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>> though
>>>>>>>
>>>>>>>>
>>>>>>>>>> the
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> visualizations are extremely limited in our case. If we
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>> pass
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> an
>>>>>>>>>>>
>>>>>>>>>>> RDD
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> pyspark it would allow custom reductions in python
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> before
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>> plotting,
>>>>>>>
>>>>>>>>
>>>>>>>>>>> even
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> though we will support many natively in Mahout. I'm
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> guessing
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>> that
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> this
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> would cross a context boundary and require a write to
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> disk?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>
>>>>>>>> So 2 questions:
>>>>>>>>>>>>>>>> 1) what does the inter language support look like with
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Spark
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>> python
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> vs
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> SparkR, can we transfer RDDs?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2) are the plot libs significantly different?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On May 20, 2016, at 9:54 AM, Trevor Grant <
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> trevor.d.grant@gmail.com>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dmitriy really nailed it on the head in his reply to
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>> post
>>>>>>>
>>>>>>>>
>>>>>>>> which
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I'll
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> rebroadcast below. In essence the whole reason you are
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (theoretically)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> using Mahout is that the data is too big to fit in memory.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>> it's
>>>>>>>
>>>>>>>>
>>>>>>>> too
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> big
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> to
>>>>>>>>>>>>>
>>>>>>>>>>>>> fit
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> in memory, well then it's probably too big to plot each
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> point
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>> (e.g.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> trillions of rows, you only have so many pixels).   For
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>> example
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> I
>>>>>>>>>>>>
>>>>>>>>>>>> randomly sampled a matrix.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So as Dmitriy says, in Mahout we need to have functions
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>> will
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 'preprocess' the data into something plotable.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For the Zeppelin-Plotting thing, we need to have a
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> function
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>> that
>>>>>>>>
>>>>>>>>>
>>>>>>>>> will
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> spit
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> out a tsv like string of the data we wanted plotted.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree an honest Mahout interpreter in Zeppelin is
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> probably
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>> worth
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> doing.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are a couple of ways to go about it. I opened up
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>> discussion
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> on
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> dev@Zeppelin and didn't get any replies. I'm going to
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> take
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>> that
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> to
>>>>>>>>>>>
>>>>>>>>>>> mean
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> can do it in a way that makes the most sense to Mahout
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> users...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> First steps are to include some methods in Mahout that
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>> do
>>>>>>>>
>>>>>>>>>
>>>>>>>>> that
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> preprocessing, and one that will turn something into a
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> tsv
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>> string.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I have some general ideas on possible approaches to
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> making
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>> an
>>>>>>>>
>>>>>>>> honest-mahout
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>>>> interpreter but I want to play in the code and look at
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>> Flink-Mahout
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>>> shell a bit before I try to organize my thoughts and
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> present
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>> them.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> ...(2) not sure what is the point of supporting
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> distributed
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>> anything.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> It
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> distributed presumably because it is hard to keep it in
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> memory.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>> Therefore,
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>>> plotting anything distributed potentially presents 2
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> problems:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>> storage
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> space and overplotting due to number of points. The
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> idea
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>> is
>>>>>>>
>>>>>>> that
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> we
>>>>>>>>>>>
>>>>>>>>>>> have
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> work out algorithms that condense big data information
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> into
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>> small
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> plottable
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>>> information (like density grids, for example, or
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> histograms)....
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Trevor Grant
>>>>>>>>>>>>>>>> Data Scientist
>>>>>>>>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>>>>>>>>> http://trevorgrant.org
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *"Fortunate is he, who is able to know the causes of
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> things."
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>> -Virgil*
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> pat@occamsmachete.com>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Great job Trevor, we'll need this detail to smooth
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> out
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>> the
>>>>>>>
>>>>>>>>
>>>>>>>> sharp
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> edges
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> any guidance from you or the Zeppelin community will
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> be a
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>> big
>>>>>>>>
>>>>>>>>>
>>>>>>>>> help.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On May 20, 2016, at 8:13 AM, Shannon Quinn <
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> squinn@gatech.edu>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Agreed, thoroughly enjoying the blog post.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 5/19/16 12:01 AM, Andrew Palumbo wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Well done, Trevor!  I've not yet had a chance to try
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>> in
>>>>>>>>>
>>>>>>>>> zeppelin
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>> but I just read the blog which is great!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -------- Original message --------
>>>>>>>>>>>>>>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>>>>>>>>>>>>>>> Date: 05/18/2016 2:44 PM (GMT-05:00)
>>>>>>>>>>>>>>>>>> To: dev@mahout.apache.org
>>>>>>>>>>>>>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Ah thank you.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Fixing now.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Trevor Grant
>>>>>>>>>>>>>>>>>> Data Scientist
>>>>>>>>>>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>>>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>>>>>>>>>>> http://trevorgrant.org
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *"Fortunate is he, who is able to know the causes of
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> things."
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>> -Virgil*
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ap.dev@outlook.com
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hey Trevor- Just refreshed your readme.  The jar
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> that I
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>> mentioned
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> is
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> actually:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>> /home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
>>>>>>
>>>>>>
>>>>>>> rather than:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>> /home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
>>>>>>
>>>>>>
>>>>>>> (In the spark module that is)
>>>>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>>>>>>>>>>>>>>>> Sent: Wednesday, May 18, 2016 11:02:43 AM
>>>>>>>>>>>>>>>>>>> To: dev@mahout.apache.org
>>>>>>>>>>>>>>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ah yes- I remember you pointing that out to me too.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I got side tracked yesterday for most of the day on
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>> adventure
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> in
>>>>>>>>>>>>>
>>>>>>>>>>>>> getting
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Zeppelin to work right after I accidentally updated
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>> the
>>>>>>>
>>>>>>>>
>>>>>>>> new
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> snapshot
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>> (free
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> hint: the secret was to clear my cache *face-palm*)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm going to add that dependency to the readme.md
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> now.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>
>>>>>>>> thanks,
>>>>>>>>>>>>>>>>>>> tg
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Trevor Grant
>>>>>>>>>>>>>>>>>>> Data Scientist
>>>>>>>>>>>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>>>>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>>>>>>>>>>>> http://trevorgrant.org
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> *"Fortunate is he, who is able to know the causes
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>> things."
>>>>>>>
>>>>>>>>
>>>>>>>>>> -Virgil*
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ap.dev@outlook.com>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Trevor this is very cool- I have not been able to
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> look
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>> at
>>>>>>>>
>>>>>>>>>
>>>>>>>>> it
>>>>>>>>>>
>>>>>>>>>> closely
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>> yet
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> but just a small point: I believe that you'll also
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>> to
>>>>>>>>>
>>>>>>>>> add
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> the
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>
>>>>>>>>> For things like the classification stats,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> confusion
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>> matrix,
>>>>>>>
>>>>>>>>
>>>>>>>>>> and
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> t-digest.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>>>>>>>>>>>>>>>>> Sent: Wednesday, May 18, 2016 10:47:21 AM
>>>>>>>>>>>>>>>>>>>> To: dev@mahout.apache.org
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>

Re: Future Mahout - Zeppelin work

Posted by Trevor Grant <tr...@gmail.com>.

Hey Eric,

The 'piggyback' or 'patch' approach is a lot easier and less invasive to
implement in practice, and it has the Zeppelin community's blessing.

When the Flink version comes online, it will also be super easy to
replicate the effort.  Even doing two (or more) 'piggybacks' will be
easier to maintain than one stand-alone Mahout interpreter.  Also,
'piggybacking' opens up the possibility of sharing between contexts,
minimizes user configuration, etc.

The differential is about 20 new lines of code for a piggyback on any
underlying engine, vs. about 300 lines of code for a stand-alone
interpreter, which must be kept up to date with its Spark/Flink
counterparts.
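
To make the size of that difference concrete, here is a minimal sketch of
what a piggyback can amount to.  It is not the code from the 'mahout-terp'
branch or the Zeppelin PR: the object and method names are hypothetical, and
the only things taken from this thread are the 'spark.mahout' property and
the idea of declaring an sdc on top of the existing sc.  The strings are the
usual Mahout Spark-shell imports as far as I know; treat them as an
assumption and check them against your Mahout version.

```scala
// Hypothetical sketch; none of these names come from the Zeppelin code base.
object MahoutPiggyback {
  // Property name as mentioned in this thread.
  val MahoutFlag = "spark.mahout"

  // Lines the host Spark interpreter would feed to its own Scala REPL
  // after the SparkContext `sc` has been created (assumed, not verified).
  val initLines: Seq[String] = Seq(
    "import org.apache.mahout.math._",
    "import org.apache.mahout.math.scalabindings._",
    "import org.apache.mahout.math.drm._",
    "import org.apache.mahout.math.scalabindings.RLikeOps._",
    "import org.apache.mahout.math.drm.RLikeDrmOps._",
    "import org.apache.mahout.sparkbindings._",
    // Wrap the existing SparkContext in a Mahout distributed context.
    "implicit val sdc: SparkDistributedContext = sc2sdc(sc)"
  )

  // Returns the extra init lines only when the flag is set to "true".
  def extraInit(props: Map[String, String]): Seq[String] =
    if (props.get(MahoutFlag).exists(_.trim.equalsIgnoreCase("true"))) initLines
    else Seq.empty
}
```

A stand-alone interpreter would instead have to own the REPL, the context
creation and the configuration handling itself, which is where the extra
few hundred lines would come from.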

Philosophically the stand-alone makes sense, practically the piggyback
does. *shruggie*

It is possible that somewhere down the road we'll refactor the
piggyback(s) into a stand-alone interpreter, at which point none of the
current effort will be wasted; it will just be moving some code around.  So
the other advantage to the piggyback is that it quickly fields a minimum
viable product, without having to pay much for it later on down the road.

This is in part due to the way Zeppelin implemented its interpreters, which
involves a lot of code repetition.

I'm open to further discussion, but after playing in the Zeppelin code for
a while and really grokking different approaches I think this one is best. I
do invite critiques because I believe I have considered most angles and can
properly defend the current path, and if there is something I haven't
thought of, I'd rather it be brought to light sooner rather than later.

tg


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Wed, Jun 1, 2016 at 11:00 AM, Eric Charles <er...@apache.org> wrote:

> Hi Suneel, an independent makes sense as mahout is supposed to run on
> various backend, so not only spark.
>
> Yes, I am following mahout mailing list (and not abroad this year - this
> may change in the future).
>
> On 30/05/16 05:47, Suneel Marthi wrote:
>
>> Hi Eric,
>>
>> We r talking about the same PR which is a tweak of existing Spark-Zeppelin
>> interpreter.
>> What we r looking at is a specific Mahout-Spark-Zeppelin interpreter that
>> is independent of above?
>>
>> BTW Eric, nice to see u on Mahout mailing lists, u didn't make it to
>> Vancouver this time?
>>
>> On Sun, May 29, 2016 at 10:57 PM, Eric Charles <er...@apache.org> wrote:
>>
>> Have you seen [ZEPPELIN-116] Add Mahout Support for Spark Interpreter?
>>>
>>> https://github.com/apache/incubator-zeppelin/pull/928
>>>
>>> It declares in the spark interpreter the mahout deps, and creates the sdc
>>> (spark distributed context).
>>>
>>> On 29/05/16 19:16, Suneel Marthi wrote:
>>>
>>> On Sun, May 29, 2016 at 12:07 PM, Trevor Grant <trevor.d.grant@gmail.com
>>>> >
>>>> wrote:
>>>>
>>>> OK cool. Just wanted to make sure I wasn't stealing anyone's baby or
>>>>
>>>>> duplicating efforts.
>>>>>
>>>>> Two things:
>>>>>
>>>>> 1- The blog post referenced the linear-regression example notebook
>>>>> twice-
>>>>> I've updated it to reference the ggplot integration. E.g. import this
>>>>> note:
>>>>>
>>>>>
>>>>>
>>>>> https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
>>>>> (I still need to update with a blurb about sampling, however it is done
>>>>> in
>>>>> that note...) So to any who tried the blog, a huge apology because
>>>>> that
>>>>> notebook is where all of the 'magic happened', (all of the screen
>>>>> shots /
>>>>> gg-plots / etc happened there).
>>>>>
>>>>> 2- I have a working prototype of the Zeppelin integration:
>>>>> 'mahout-terp' branch of :
>>>>> https://github.com/rawkintrevo/incubator-zeppelin
>>>>> if you build, and set 'spark.mahout' to 'true' in the Spark
>>>>> Interpreter
>>>>> properties, you have a Mahout interpreter. This is the minimally
>>>>> invasive
>>>>> way to do it, I'll be opening a PR soon, we'll see what the gang over
>>>>> at
>>>>> Zeppelin say.
>>>>> I'll still need docs and an example notebook, but I'm waiting to make
>>>>> sure
>>>>> I don't need to do a major refactor before I get carried away with
>>>>> those
>>>>> activities.
>>>>>
>>>>> In essence when 'spark-mahout' is 'true' you jump right in on r-like
>>>>> dsl
>>>>> and you have a sdc declared based on the underlying sc.
>>>>>
>>>>>
>>>>> I am not sure if messing with the very "sacrosanct" Zeppelin-Spark
>>>> interpreter is gonna go down well with the Spark insanity.  I would
>>>> prefer
>>>> having a separate Mahout-Spark-Zeppelin interpreter under Zeppelin
>>>> project
>>>> if that's acceptable to the Zeppelin folks, even though most of it might
>>>> be
>>>> repeated.
>>>>
>>>> What do others have to say?
>>>>
>>>>
>>>> have a good holiday weekend,
>>>>
>>>>>
>>>>> tg
>>>>>
>>>>>
>>>>>
>>>>> Trevor Grant
>>>>> Data Scientist
>>>>> https://github.com/rawkintrevo
>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>> http://trevorgrant.org
>>>>>
>>>>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>>>>>
>>>>>
>>>>> On Sun, May 29, 2016 at 10:49 AM, Andrew Palumbo <ap...@outlook.com>
>>>>> wrote:
>>>>>
>>>>> Thx Trevor,
>>>>>
>>>>>> Re: m-1854, It was something that we started when we were first
>>>>>> discussing
>>>>>> using the smile plots for and trying to pipe them over to Zeppelin ..
>>>>>> As
>>>>>> far as I know there was not progress started on it.. I've unassigned
>>>>>> it.
>>>>>>
>>>>>> Feel free to Assign any Jiras to yourself.  I think that m-1854 is
>>>>>>
>>>>>> similar
>>>>>
>>>>> to the mahout-spark-shell, so I may be able to help out there.
>>>>>>
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>>> Sent: Saturday, May 28, 2016 11:21:44 PM
>>>>>> To: dev@mahout.apache.org
>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>
>>>>>> Created a subtask on 1855 for tsv strings.
>>>>>>
>>>>>> Looking at 1854 assigned to Pat Ferrel, what's your progress to date?
>>>>>>
>>>>>> How
>>>>>
>>>>> can I help?
>>>>>>
>>>>>> tg
>>>>>>
>>>>>>
>>>>>>
>>>>>> Trevor Grant
>>>>>> Data Scientist
>>>>>> https://github.com/rawkintrevo
>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>> http://trevorgrant.org
>>>>>>
>>>>>> *"Fortunate is he, who is able to know the causes of things."
>>>>>> -Virgil*
>>>>>>
>>>>>>
>>>>>> On Thu, May 26, 2016 at 2:34 PM, Andrew Palumbo <ap...@outlook.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Great!
>>>>>>>
>>>>>>> When you free up and have the time, could you create some Jiras for
>>>>>>> these?
>>>>>>>
>>>>>>> We actually have MAHOUT-1852 open for Histograms already, and
>>>>>>> MAHOUT-1854 and MAHOUT-1855 (early Zeppelin integration Jiras). I can
>>>>>>> close m-1854 and m-1855 out and we can start new ones if they're not
>>>>>>> relevant anymore, or we can just go with those.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>>>> Sent: Thursday, May 26, 2016 3:17:22 PM
>>>>>>> To: dev@mahout.apache.org
>>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>>
>>>>>>> Short answer: it is high priority. I think it will be a Mahout
>>>>>>> interpreter in Zeppelin, and given that plans for a Flink-Mahout are on
>>>>>>> hold in the short term, I think it should be a piggy-back Spark
>>>>>>> interpreter (e.g. exposed through something like %spark.mahout). So I
>>>>>>> have thoughts, but no plan. Been busy with a couple of other commitments.
>>>>>>>
>>>>>>> On the Mahout side we need:
>>>>>>> A function that will convert small matrices into TSV strings
>>>>>>> Convenience functions for sampling super-large matrices into things
>>>>>>> like histograms, etc., that one would want to plot, i.e. histogram
>>>>>>> bucketing? (less important for the moment)
>>>>>>>
>>>>>>> On the Zeppelin side we need:
>>>>>>> an interpreter.
>>>>>>>
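A minimal sketch of the "small matrix to TSV string" idea from the list above, written in plain Python rather than the Mahout Scala DSL so it runs on its own. The names `matrix_to_tsv` and `to_zeppelin_table` are hypothetical and are not part of Mahout or Zeppelin. The only Zeppelin behaviour assumed is its display system, which renders output starting with `%table` as a table, with tabs separating columns and newlines separating rows.

```python
def matrix_to_tsv(rows, header=None):
    """Render a small in-memory matrix (a list of row lists) as a TSV string.

    Hypothetical helper for illustration; not a Mahout or Zeppelin API.
    """
    lines = []
    if header is not None:
        lines.append("\t".join(str(h) for h in header))
    for row in rows:
        lines.append("\t".join(str(v) for v in row))
    return "\n".join(lines)


def to_zeppelin_table(rows, header):
    # The %table prefix asks Zeppelin's display system to render the TSV as a table
    return "%table " + matrix_to_tsv(rows, header)


if __name__ == "__main__":
    sample = [[0.0, 0.0], [0.5, 0.479], [1.0, 0.841]]
    print(to_zeppelin_table(sample, header=["x", "sin_x"]))
```

Only a matrix that already fits in memory should go through a helper like this, which is why the sampling step comes first.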
>>>>>>>
>>>>>>> Trevor Grant
>>>>>>> Data Scientist
>>>>>>> https://github.com/rawkintrevo
>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>> http://trevorgrant.org
>>>>>>>
>>>>>>> *"Fortunate is he, who is able to know the causes of things."
>>>>>>> -Virgil*
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 26, 2016 at 1:22 PM, Suneel Marthi <sm...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> While on this subject, do we have a plan yet for integrating Zeppelin
>>>>>>>> into Mahout, or (the converse) for having a Mahout-specific
>>>>>>>> interpreter for Zeppelin? I think that should be high priority in the
>>>>>>>> short term.
>>>>>>>>
>>>>>>>> On Thu, May 26, 2016 at 1:17 PM, Trevor Grant <trevor.d.grant@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Ahh, like the "Sample From Matrix" paragraph in the notebook.
>>>>>>>>>
>>>>>>>>> Yea, that seems like a good add. If not this afternoon, I'll include it
>>>>>>>>> Saturday.
>>>>>>>>>
>>>>>>>>> Trevor Grant
>>>>>>>>> Data Scientist
>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>> http://trevorgrant.org
>>>>>>>>>
>>>>>>>>> *"Fortunate is he, who is able to know the causes of things."
>>>>>>>>> -Virgil*
>>>>>>>>>
>>>>>>>>> On Thu, May 26, 2016 at 11:52 AM, Andrew Palumbo <ap.dev@outlook.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Trevor, I was reading over your blog again last night, for the first
>>>>>>>>>> time since you updated it. It is great!
>>>>>>>>>>
>>>>>>>>>> I have one suggestion: add a line of code showing how the sampling
>>>>>>>>>> of the DRM -> in-core Matrix is done:
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L148
>>>>>>>>>>
>>>>>>>>>> e.g. something like:
>>>>>>>>>>
>>>>>>>>>>       val mxSin = drmSampleKRows(drmSin, 1000, replacement = false)
>>>>>>>>>>
>>>>>>>>>> Maybe you omitted this intentionally?
>>>>>>>>>>
>>>>>>>>>> Andy
>>>>>>>>>>
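To illustrate what "sample k rows without replacement" means before plotting, here is a small stand-alone Python sketch. It is not the Mahout implementation of `drmSampleKRows`; `sample_k_rows` is a hypothetical helper that works on an in-memory list of rows, and its `seed` parameter is an assumption added for reproducibility.

```python
import random


def sample_k_rows(rows, k, replacement=False, seed=None):
    """Return k rows drawn uniformly at random from an in-memory matrix."""
    rng = random.Random(seed)
    if replacement:
        # Each draw is independent, so the same row may appear more than once
        return [rng.choice(rows) for _ in range(k)]
    # Without replacement we can return at most len(rows) distinct rows
    return rng.sample(rows, min(k, len(rows)))


if __name__ == "__main__":
    matrix = [[i, i * i] for i in range(10000)]
    sampled = sample_k_rows(matrix, 1000, seed=42)
    print(len(sampled))
```

The sampled rows are small enough to hand to a plotting library, whereas the full matrix may not be.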
>>>>>>>>>> ________________________________________
>>>>>>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>>>>>>> Sent: Friday, May 20, 2016 7:56:20 PM
>>>>>>>>>> To: dev@mahout.apache.org
>>>>>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>>>>>
>>>>>>>>>> Unfortunately Zeppelin dev has been so rapid that 0.6-SNAPSHOT as a
>>>>>>>>>> version is uninformative to me. I'd say, if possible, your first
>>>>>>>>>> troubleshooting measure would be to re-clone or do a "git fetch
>>>>>>>>>> upstream" to get up to the very latest.
>>>>>>>>>>
>>>>>>>>>> Sorry for delayed reply
>>>>>>>>>> Tg
>>>>>>>>>> On May 20, 2016 5:36 PM, "Andrew Musselman" <andrew.musselman@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Trevor, my zeppelin source is at this version:
>>>>>>>>>>>
>>>>>>>>>>>     <groupId>org.apache.zeppelin</groupId>
>>>>>>>>>>>     <artifactId>zeppelin</artifactId>
>>>>>>>>>>>     <packaging>pom</packaging>
>>>>>>>>>>>     <version>0.6.0-incubating-SNAPSHOT</version>
>>>>>>>>>>>     <name>Zeppelin</name>
>>>>>>>>>>>     <description>Zeppelin project</description>
>>>>>>>>>>>     <url>http://zeppelin.incubator.apache.org/</url>
>>>>>>>>>>>
>>>>>>>>>>> And yes, you're right, the artifacts weren't added to the
>>>>>>>>>>> dependencies; is that a feature in more modern zep?
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 20, 2016 at 3:02 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> No parentheses:
>>>>>>>>>>>>
>>>>>>>>>>>> import org.apache.mahout.sparkbindings._
>>>>>>>>>>>> ....
>>>>>>>>>>>> val myRdd = myDrm.rdd
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi <smarthi@apache.org>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <trevor.d.grant@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Pat,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you spit out a TSV, you can import it into pyspark / matplotlib
>>>>>>>>>>>>>> from the resource pool in essentially the same way and use that
>>>>>>>>>>>>>> plotting library if you prefer. In fact you could import the TSV
>>>>>>>>>>>>>> into pandas and use all of the pandas plotting as well (though I
>>>>>>>>>>>>>> think it is, for the most part, also matplotlib with some
>>>>>>>>>>>>>> convenience functions).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In Zeppelin, unless you specify otherwise, pyspark, sparkr,
>>>>>>>>>>>>>> spark-sql, and scala-spark all share the same Spark context, so
>>>>>>>>>>>>>> you can create RDDs in one language and access them / work on
>>>>>>>>>>>>>> them in another (so I understand).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So in Mahout can you "save" a matrix as an RDD? e.g. something
>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> val myRDD = myDRM.asRDD()
>>>>>>>>>>>>>
>>>>>>>>>>>>> val myRDD = myDRM.rdd()
>>>>>>>>>>>>>
>>>>>>>>>>>>>> And would 'myRDD' then exist in the spark context?
>>>>>>>>>>>>>
>>>>>>>>>>>>> yes it will be in sparkContext
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Trevor Grant
>>>>>>>>>>>>>> Data Scientist
>>>>>>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>>>>>>> http://trevorgrant.org
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *"Fortunate is he, who is able to know the causes of things."
>>>>>>>>>>>>>> -Virgil*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <pat@occamsmachete.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Agreed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> BTW I don’t want to stall progress but, being the most ignorant of
>>>>>>>>>>>>>>> plot libs, I’ll ask if we should consider python and matplotlib. In
>>>>>>>>>>>>>>> another project we use python because of the RDD support on Spark,
>>>>>>>>>>>>>>> though the visualizations are extremely limited in our case. If we
>>>>>>>>>>>>>>> can pass an RDD to pyspark it would allow custom reductions in
>>>>>>>>>>>>>>> python before plotting, even though we will support many natively
>>>>>>>>>>>>>>> in Mahout. I’m guessing that this would cross a context boundary
>>>>>>>>>>>>>>> and require a write to disk?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So 2 questions:
>>>>>>>>>>>>>>> 1) what does the inter-language support look like with Spark
>>>>>>>>>>>>>>> python vs SparkR, can we transfer RDDs?
>>>>>>>>>>>>>>> 2) are the plot libs significantly different?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On May 20, 2016, at 9:54 AM, Trevor Grant <trevor.d.grant@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dmitriy really nailed it on the head in his reply to the post,
>>>>>>>>>>>>>>> which I'll rebroadcast below. In essence, the whole reason you are
>>>>>>>>>>>>>>> (theoretically) using Mahout is that the data is too big to fit in
>>>>>>>>>>>>>>> memory. If it's too big to fit in memory, well then it's probably
>>>>>>>>>>>>>>> too big to plot each point (e.g. trillions of rows, you only have
>>>>>>>>>>>>>>> so many pixels). For the example I randomly sampled a matrix.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So as Dmitriy says, in Mahout we need to have functions that will
>>>>>>>>>>>>>>> 'preprocess' the data into something plottable.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For the Zeppelin-Plotting thing, we need to have a function that
>>>>>>>>>>>>>>> will spit out a TSV-like string of the data we want plotted.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I agree an honest Mahout interpreter in Zeppelin is probably worth
>>>>>>>>>>>>>>> doing. There are a couple of ways to go about it. I opened up the
>>>>>>>>>>>>>>> discussion on dev@Zeppelin and didn't get any replies. I'm going to
>>>>>>>>>>>>>>> take that to mean we can do it in a way that makes the most sense
>>>>>>>>>>>>>>> to Mahout users...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> First steps are to include some methods in Mahout that will do that
>>>>>>>>>>>>>>> preprocessing, and one that will turn something into a TSV string.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have some general ideas on possible approaches to making an
>>>>>>>>>>>>>>> honest-mahout interpreter, but I want to play in the code and look
>>>>>>>>>>>>>>> at the Flink-Mahout shell a bit before I try to organize my
>>>>>>>>>>>>>>> thoughts and present them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ...(2) not sure what is the point of supporting distributed
>>>>>>>>>>>>>>> anything. It is distributed presumably because it is hard to keep
>>>>>>>>>>>>>>> it in memory. Therefore, plotting anything distributed potentially
>>>>>>>>>>>>>>> presents 2 problems: storage space and overplotting due to number
>>>>>>>>>>>>>>> of points. The idea is that we have to work out algorithms that
>>>>>>>>>>>>>>> condense big data information into small plottable information
>>>>>>>>>>>>>>> (like density grids, for example, or histograms)....
>>>>>>>>>>>>>>>
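The "condense big data into small plottable information" idea can be illustrated with fixed-width histogram bucketing. The sketch below is plain Python over an in-memory list, not a Mahout or Spark API; `histogram_counts` and its parameters are hypothetical names. In a distributed setting, one plausible approach would be to compute the same per-value bucket index on each partition and sum the counts.

```python
def histogram_counts(values, num_buckets, lo=None, hi=None):
    """Condense many values into num_buckets (bucket_start, count) pairs."""
    values = list(values)
    if not values:
        return []
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    # Fall back to a width of 1.0 when all values are equal (zero range)
    width = (hi - lo) / num_buckets or 1.0
    counts = [0] * num_buckets
    for v in values:
        if v < lo or v > hi:
            continue  # ignore values outside the requested range
        # The maximum value falls into the last bucket instead of overflowing
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return [(lo + i * width, c) for i, c in enumerate(counts)]


if __name__ == "__main__":
    data = [x / 100.0 for x in range(1000)]
    # Emit one tab-separated line per bucket, ready for a TSV-based plot
    for start, count in histogram_counts(data, 5):
        print("%.2f\t%d" % (start, count))
```

However many input points there are, the output has only `num_buckets` rows, which addresses both the storage and the overplotting concern raised above.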
>>>>>>>>>>>>>>> Trevor Grant
>>>>>>>>>>>>>>> Data Scientist
>>>>>>>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>>>>>>>> http://trevorgrant.org
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *"Fortunate is he, who is able to know the causes of things."
>>>>>>>>>>>>>>> -Virgil*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <pat@occamsmachete.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Great job Trevor, we’ll need this detail to smooth out the sharp
>>>>>>>>>>>>>>>> edges, and any guidance from you or the Zeppelin community will
>>>>>>>>>>>>>>>> be a big help.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On May 20, 2016, at 8:13 AM, Shannon Quinn <squinn@gatech.edu>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Agreed, thoroughly enjoying the blog post.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 5/19/16 12:01 AM, Andrew Palumbo wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Well done, Trevor!  I've not yet had a chance to try this in
>>>>>>>>>>>>>>>>> Zeppelin, but I just read the blog, which is great!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -------- Original message --------
>>>>>>>>>>>>>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>>>>>>>>>>>>>> Date: 05/18/2016 2:44 PM (GMT-05:00)
>>>>>>>>>>>>>>>>> To: dev@mahout.apache.org
>>>>>>>>>>>>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ah thank you.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Fixing now.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Trevor Grant
>>>>>>>>>>>>>>>>> Data Scientist
>>>>>>>>>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>>>>>>>>>> http://trevorgrant.org
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *"Fortunate is he, who is able to know the causes of things."
>>>>>>>>>>>>>>>>> -Virgil*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <ap.dev@outlook.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hey Trevor- Just refreshed your readme.  The jar that I mentioned
>>>>>>>>>>>>>>>>>> is actually:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> /home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> rather than:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> /home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (In the spark module that is)
>>>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>>>>>>>>>>>>>>> Sent: Wednesday, May 18, 2016 11:02:43 AM
>>>>>>>>>>>>>>>>>> To: dev@mahout.apache.org
>>>>>>>>>>>>>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ah yes- I remember you pointing that out to me too.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I got sidetracked yesterday for most of the day on an adventure in
>>>>>>>>>>>>>>>>>> getting Zeppelin to work right after I accidentally updated to the
>>>>>>>>>>>>>>>>>> new snapshot (free hint: the secret was to clear my cache
>>>>>>>>>>>>>>>>>> *face-palm*)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm going to add that dependency to the readme.md now.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>>>> tg
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Trevor Grant
>>>>>>>>>>>>>>>>>> Data Scientist
>>>>>>>>>>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>>>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>>>>>>>>>>> http://trevorgrant.org
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *"Fortunate is he, who is able to know the causes of things."
>>>>>>>>>>>>>>>>>> -Virgil*
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <ap.dev@outlook.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Trevor this is very cool- I have not been able to look at it
>>>>>>>>>>>>>>>>>>> closely yet, but just a small point: I believe that you'll also
>>>>>>>>>>>>>>>>>>> need to add the
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> for things like the classification stats, confusion matrix, and
>>>>>>>>>>>>>>>>>>> t-digest.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>>>>> From: Trevor Grant <tr...@gmail.com>
>>>>>>>>>>>>>>>>>>> Sent: Wednesday, May 18, 2016 10:47:21 AM
>>>>>>>>>>>>>>>>>>> To: dev@mahout.apache.org
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>