Posted to dev@spark.apache.org by Holden Karau <ho...@pigscanfly.ca> on 2018/05/09 14:18:30 UTC

Revisiting Online serving of Spark models?

Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a
time as any to revisit the online serving situation in Spark ML. DB &
others have done some excellent work moving a lot of the necessary tools
into a local linear algebra package that doesn't depend on having a
SparkContext.

There are a few different commercial and non-commercial solutions around
this, but currently our individual transform/predict methods are private,
so they either need to copy or re-implement them (or put themselves in
org.apache.spark) to access them. How would folks feel about adding a new
trait for ML pipeline stages that exposes transformation of single-element
inputs (or local collections), optionally implemented by the stages which
support it? That way we'd have less copy-and-paste code that can get out
of sync with our model training.
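As a rough sketch of what such an optional trait might look like (all
names here are hypothetical illustrations, not an actual Spark API):

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class SingleElementTransformer(ABC):
    """Hypothetical optional trait: stages that can transform one element
    (or a small local collection) without any SparkContext."""

    @abstractmethod
    def transform_one(self, row: Dict[str, Any]) -> Dict[str, Any]:
        """Transform a single input row locally."""

    def transform_local(self, rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Default local-collection path: reuse the single-row logic, so
        # serving code shares it with training instead of copy-pasting.
        return [self.transform_one(r) for r in rows]

class Doubler(SingleElementTransformer):
    # A stage opts in to local serving simply by mixing the trait in.
    def transform_one(self, row: Dict[str, Any]) -> Dict[str, Any]:
        return {**row, "doubled": row["x"] * 2}

print(Doubler().transform_local([{"x": 1}, {"x": 2}]))
```

Stages that can't support single-element transforms would simply not mix
the trait in, leaving the existing DataFrame-based API untouched.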

I think continuing to have online serving grow in different projects is
probably the right path forward (folks have different needs), but I'd love
to see us make it simpler for other projects to build reliable serving
tools.

I realize this may put some of the folks with their own commercial
offerings in an awkward position, but hopefully if we make it easier for
everyone, the commercial vendors can benefit as well.

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau

Re: Revisiting Online serving of Spark models?

Posted by Maximiliano Felice <ma...@gmail.com>.
Hi!

To keep things ordered, I just sent an update on an older email thread
requesting an update on this, named: Spark model serving.

I propose we follow the discussion there. Or here, but let's not branch.

Bye!


On Tue., Jul 3, 2018 at 22:15, Matei Zaharia (<ma...@gmail.com>)
wrote:

> Just wondering, is there an update on this? I haven’t seen a summary of
> the offline discussion but maybe I’ve missed it.
>
> Matei
>
> > On Jun 11, 2018, at 8:51 PM, Holden Karau <ho...@gmail.com>
> wrote:
> >
> > So I kicked off a thread on user@ to collect people's feedback there,
> > but I'll summarize the offline results later this week too.
> >
> > On Tue, Jun 12, 2018, 5:03 AM Liang-Chi Hsieh <vi...@gmail.com> wrote:
> >
> > Hi,
> >
> > It'd be great if there can be any sharing of the offline discussion.
> Thanks!
> >
> >
> >
> > Holden Karau wrote
> > > We’re by the registration sign, going to start walking over at 4:05
> > >
> > > On Wed, Jun 6, 2018 at 2:43 PM Maximiliano Felice <maximilianofelice@...> wrote:
> > >
> > >> Hi!
> > >>
> > >> Do we meet at the entrance?
> > >>
> > >> See you
> > >>
> > >>
> > >> On Tue., Jun 5, 2018 at 3:07 PM, Nick Pentreath <nick.pentreath@...> wrote:
> > >>
> > >>> I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.
> > >>>
> > >>> On Sun, 3 Jun 2018 at 00:24 Holden Karau <holden@...> wrote:
> > >>>
> > >>>> On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <maximilianofelice@...> wrote:
> > >>>>
> > >>>>> Hi!
> > >>>>>
> > >>>>> We're already in San Francisco waiting for the summit. We even
> think
> > >>>>> that we spotted @holdenk this afternoon.
> > >>>>>
> > >>>> Unless you happened to be walking by my garage probably not super
> > >>>> likely, spent the day working on scooters/motorcycles (my style is a
> > >>>> little
> > >>>> less unique in SF :)). Also if you see me feel free to say hi
> unless I
> > >>>> look
> > >>>> like I haven't had my first coffee of the day, love chatting with
> folks
> > >>>> IRL
> > >>>> :)
> > >>>>
> > >>>>>
> > >>>>> @chris, we're really interested in the Meetup you're hosting. My
> > >>>>> team will probably join it from the beginning if you have room for
> > >>>>> us, and I'll join later after discussing the topics on this
> > >>>>> thread. I'll send you an email regarding this request.
> > >>>>>
> > >>>>> Thanks
> > >>>>>
> > >>>>> On Fri., Jun 1, 2018 at 7:26 AM, Saikat Kanjilal <sxk1969@...> wrote:
> > >>>>>
> > >>>>>> @Chris This sounds fantastic, please send summary notes for
> > >>>>>> Seattle folks
> > >>>>>>
> > >>>>>> @Felix I work in downtown Seattle, and am wondering if we should
> > >>>>>> host a tech meetup around model serving in Spark at my work or
> > >>>>>> somewhere close by, thoughts?  I’m actually in the midst of
> > >>>>>> building microservices to manage models, and when I say models I
> > >>>>>> mean much more than machine learning models (think OR and process
> > >>>>>> models as well)
> > >>>>>>
> > >>>>>> Regards
> > >>>>>>
> > >>>>>> Sent from my iPhone
> > >>>>>>
> > >>>>>> On May 31, 2018, at 10:32 PM, Chris Fregly <chris@...> wrote:
> > >>>>>>
> > >>>>>> Hey everyone!
> > >>>>>>
> > >>>>>> @Felix:  thanks for putting this together.  I sent some of you a
> > >>>>>> quick calendar event - mostly for me, so I don’t forget!  :)
> > >>>>>>
> > >>>>>> Coincidentally, this is the focus of June 6th's *Advanced Spark
> > >>>>>> and TensorFlow Meetup*
> > >>>>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/>
> > >>>>>> @5:30pm on June 6th (same night) here in SF!
> > >>>>>>
> > >>>>>> Everybody is welcome to come.  Here’s the link to the meetup that
> > >>>>>> includes the signup link:
> > >>>>>> *https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/*
> > >>>>>>
> > >>>>>> We have an awesome lineup of speakers covering a lot of deep,
> > >>>>>> technical ground.
> > >>>>>>
> > >>>>>> For those who can’t attend in person, we’ll be broadcasting live -
> > >>>>>> and
> > >>>>>> posting the recording afterward.
> > >>>>>>
> > >>>>>> All details are in the meetup link above…
> > >>>>>>
> > >>>>>> @holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more
> than
> > >>>>>> welcome to give a talk. I can move things around to make room.
> > >>>>>>
> > >>>>>> @joseph:  I’d personally like an update on the direction of the
> > >>>>>> Databricks proprietary ML Serving export format which is similar
> to
> > >>>>>> PMML
> > >>>>>> but not a standard in any way.
> > >>>>>>
> > >>>>>> Also, the Databricks ML Serving Runtime is only available to
> > >>>>>> Databricks customers.  This seems in conflict with the community
> > >>>>>> efforts
> > >>>>>> described here.  Can you comment on behalf of Databricks?
> > >>>>>>
> > >>>>>> Look forward to your response, joseph.
> > >>>>>>
> > >>>>>> See you all soon!
> > >>>>>>
> > >>>>>> —
> > >>>>>>
> > >>>>>>
> > >>>>>> *Chris Fregly *Founder @ *PipelineAI* <https://pipeline.ai/>
> > >>>>>> (100,000 Users)
> > >>>>>> Organizer @ *Advanced Spark and TensorFlow Meetup*
> > >>>>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/>
> > >>>>>> (85,000 Global Members)
> > >>>>>>
> > >>>>>> *San Francisco - Chicago - Austin -
> > >>>>>> Washington DC - London - Dusseldorf *
> > >>>>>> *Try our PipelineAI Community Edition with GPUs and TPUs!!
> > >>>>>> <http://community.pipeline.ai/>*
> > >>>>>>
> > >>>>>>
> > >>>>>> On May 30, 2018, at 9:32 AM, Felix Cheung <felixcheung_m@...> wrote:
> > >>>>>>
> > >>>>>> Hi!
> > >>>>>>
> > >>>>>> Thank you! Let’s meet then
> > >>>>>>
> > >>>>>> June 6 4pm
> > >>>>>>
> > >>>>>> Moscone West Convention Center
> > >>>>>> 800 Howard Street, San Francisco, CA 94103
> > >>>>>> <https://maps.google.com/?q=800+Howard+Street,+San+Francisco,+CA+94103&entry=gmail&source=g>
> > >>>>>>
> > >>>>>> Ground floor (outside of conference area - should be available for
> > >>>>>> all) - we will meet and decide where to go
> > >>>>>>
> > >>>>>> (I won't send a calendar invite because that would be too much
> > >>>>>> noise for dev@)
> > >>>>>>
> > >>>>>> To paraphrase Joseph, we will use this to kick off the discussion
> > >>>>>> and post notes after and follow up online. As for Seattle, I
> > >>>>>> would be very interested to meet in person later and discuss ;)
> > >>>>>>
> > >>>>>>
> > >>>>>> _____________________________
> > >>>>>> From: Saikat Kanjilal <sxk1969@...>
> > >>>>>> Sent: Tuesday, May 29, 2018 11:46 AM
> > >>>>>> Subject: Re: Revisiting Online serving of Spark models?
> > >>>>>> To: Maximiliano Felice <maximilianofelice@...>
> > >>>>>> Cc: Felix Cheung <felixcheung_m@...>, Holden Karau <holden@...>,
> > >>>>>> Joseph Bradley <joseph@...>, Leif Walsh <leif.walsh@...>,
> > >>>>>> dev <dev@.apache>
> > >>>>>>
> > >>>>>>
> > >>>>>> Would love to join but am in Seattle, thoughts on how to make this
> > >>>>>> work?
> > >>>>>>
> > >>>>>> Regards
> > >>>>>>
> > >>>>>> Sent from my iPhone
> > >>>>>>
> > >>>>>> On May 29, 2018, at 10:35 AM, Maximiliano Felice <maximilianofelice@...> wrote:
> > >>>>>>
> > >>>>>> Big +1 to a meeting with fresh air.
> > >>>>>>
> > >>>>>> Could anyone send the invites? I don't really know which is the
> place
> > >>>>>> Holden is talking about.
> > >>>>>>
> > >>>>>> 2018-05-29 14:27 GMT-03:00 Felix Cheung <felixcheung_m@...>:
> > >>>>>>
> > >>>>>>> You had me at blue bottle!
> > >>>>>>>
> > >>>>>>> _____________________________
> > >>>>>>> From: Holden Karau <holden@...>
> > >>>>>>> Sent: Tuesday, May 29, 2018 9:47 AM
> > >>>>>>> Subject: Re: Revisiting Online serving of Spark models?
> > >>>>>>> To: Felix Cheung <felixcheung_m@...>
> > >>>>>>> Cc: Saikat Kanjilal <sxk1969@...>, Maximiliano Felice
> > >>>>>>> <maximilianofelice@...>, Joseph Bradley <joseph@...>,
> > >>>>>>> Leif Walsh <leif.walsh@...>, dev <dev@.apache>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> I'm down for that, we could all go for a walk, maybe to the Mint
> > >>>>>>> Plaza Blue Bottle and grab coffee (and if the weather holds,
> > >>>>>>> have our design meeting outside :p)?
> > >>>>>>>
> > >>>>>>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <felixcheung_m@...> wrote:
> > >>>>>>>
> > >>>>>>>> Bump.
> > >>>>>>>>
> > >>>>>>>> ------------------------------
> > >>>>>>>> *From:* Felix Cheung <felixcheung_m@...>
> > >>>>>>>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
> > >>>>>>>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
> > >>>>>>>> *Cc:* Leif Walsh; Holden Karau; dev
> > >>>>>>>>
> > >>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
> > >>>>>>>>
> > >>>>>>>> Hi! How about we meet the community and discuss on June 6 4pm at
> > >>>>>>>> (near) the Summit?
> > >>>>>>>>
> > >>>>>>>> (I propose we meet at the venue entrance so we can accommodate
> > >>>>>>>> people who might not be in the conference)
> > >>>>>>>>
> > >>>>>>>> ------------------------------
> > >>>>>>>> *From:* Saikat Kanjilal <sxk1969@...>
> > >>>>>>>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
> > >>>>>>>> *To:* Maximiliano Felice
> > >>>>>>>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley;
> dev
> > >>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
> > >>>>>>>>
> > >>>>>>>> I’m in the same exact boat as Maximiliano and have use cases as
> > >>>>>>>> well
> > >>>>>>>> for model serving and would love to join this discussion.
> > >>>>>>>>
> > >>>>>>>> Sent from my iPhone
> > >>>>>>>>
> > >>>>>>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <maximilianofelice@...> wrote:
> > >>>>>>>>
> > >>>>>>>> Hi!
> > >>>>>>>>
> > >>>>>>>> I don't usually write a lot on this list, but I keep up to date
> > >>>>>>>> with the discussions and I'm a heavy user of Spark. This topic
> > >>>>>>>> caught my attention, as we're currently facing this issue at
> > >>>>>>>> work. I'm attending the summit and was wondering if it would be
> > >>>>>>>> possible for me to join that meeting. I might be able to share
> > >>>>>>>> some helpful use cases and ideas.
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Maximiliano Felice
> > >>>>>>>>
> > >>>>>>>> On Tue., May 22, 2018 at 9:14 AM, Leif Walsh <leif.walsh@...> wrote:
> > >>>>>>>>
> > >>>>>>>>> I’m with you on JSON being more readable than Parquet, but
> > >>>>>>>>> we’ve had success using pyarrow’s parquet reader and have been
> > >>>>>>>>> quite happy with it so far. If your target is Python (and
> > >>>>>>>>> probably, if not now then soon, R), you should look into it.
> > >>>>>>>>>
> > >>>>>>>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <joseph@...> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Regarding model reading and writing, I'll give quick thoughts
> > >>>>>>>>>> here:
> > >>>>>>>>>> * Our approach was to use the same format but write JSON
> instead
> > >>>>>>>>>> of Parquet.  It's easier to parse JSON without Spark, and
> using
> > >>>>>>>>>> the same
> > >>>>>>>>>> format simplifies architecture.  Plus, some people want to
> check
> > >>>>>>>>>> files into
> > >>>>>>>>>> version control, and JSON is nice for that.
> > >>>>>>>>>> * The reader/writer APIs could be extended to take format
> > >>>>>>>>>> parameters (just like DataFrame reader/writers) to handle JSON
> > >>>>>>>>>> (and maybe,
> > >>>>>>>>>> eventually, handle Parquet in the online serving setting).
> > >>>>>>>>>>
> > >>>>>>>>>> This would be a big project, so proposing a SPIP might be
> best.
> > >>>>>>>>>> If people are around at the Spark Summit, that could be a good
> > >>>>>>>>>> time to meet
> > >>>>>>>>>> up & then post notes back to the dev list.
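Joseph's two bullets above — same logical layout but JSON serialization,
and reader/writer APIs that accept a format parameter — could be sketched
roughly like this (illustrative only; `ModelWriter` and its `format`
method are invented for this sketch, not MLlib's actual API):

```python
import json
from typing import Any, Dict

class ModelWriter:
    """Hypothetical model writer mirroring the DataFrameWriter option
    pattern (df.write.format(...)) for model persistence."""

    def __init__(self, params: Dict[str, Any]):
        self.params = params
        self._format = "parquet"  # today's on-disk default for MLlib models

    def format(self, fmt: str) -> "ModelWriter":
        # Chainable, just like DataFrame reader/writers in Spark SQL.
        if fmt not in ("parquet", "json"):
            raise ValueError(f"unsupported format: {fmt}")
        self._format = fmt
        return self

    def save(self) -> str:
        if self._format == "json":
            # Same logical content as the Parquet path, but trivially
            # parseable without Spark and friendly to version control.
            return json.dumps(self.params, sort_keys=True)
        raise NotImplementedError("the Parquet path would reuse Spark's writer")

serialized = ModelWriter({"coefficients": [0.5, -1.2], "intercept": 0.1}).format("json").save()
print(serialized)
```

The key point is that both paths serialize the same parameters, so a
Spark-free reader only has to understand JSON, not the full Parquet stack.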
> > >>>>>>>>>>
> > >>>>>>>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <felixcheung_m@...> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Specifically I’d like to bring part of the discussion to
> > >>>>>>>>>>> Model and PipelineModel, and the various ModelReader and
> > >>>>>>>>>>> SharedReadWrite implementations that rely on SparkContext.
> > >>>>>>>>>>> This is a big blocker on reusing trained models outside of
> > >>>>>>>>>>> Spark for online serving.
> > >>>>>>>>>>>
> > >>>>>>>>>>> What’s the next step? Would folks be interested in getting
> > >>>>>>>>>>> together to discuss/get some feedback?
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> _____________________________
> > >>>>>>>>>>> From: Felix Cheung <felixcheung_m@...>
> > >>>>>>>>>>> Sent: Thursday, May 10, 2018 10:10 AM
> > >>>>>>>>>>> Subject: Re: Revisiting Online serving of Spark models?
> > >>>>>>>>>>> To: Holden Karau <holden@...>, Joseph Bradley <joseph@...>
> > >>>>>>>>>>> Cc: dev <dev@.apache>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Huge +1 on this!
> > >>>>>>>>>>>
> > >>>>>>>>>>> ------------------------------
> > >>>>>>>>>>> *From:* holden.karau@... <holden.karau@...> on behalf of
> > >>>>>>>>>>> Holden Karau <holden@...>
> > >>>>>>>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
> > >>>>>>>>>>> *To:* Joseph Bradley
> > >>>>>>>>>>> *Cc:* dev
> > >>>>>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <joseph@...> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Thanks for bringing this up Holden!  I'm a strong supporter
> of
> > >>>>>>>>>>>> this.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Awesome! I'm glad other folks think something like this
> belongs
> > >>>>>>>>>>> in Spark.
> > >>>>>>>>>>>
> > >>>>>>>>>>>> This was one of the original goals for mllib-local: to have
> > >>>>>>>>>>>> local versions of MLlib models which could be deployed
> without
> > >>>>>>>>>>>> the big
> > >>>>>>>>>>>> Spark JARs and without a SparkContext or SparkSession.
> There
> > >>>>>>>>>>>> are related
> > >>>>>>>>>>>> commercial offerings like this : ) but the overhead of
> > >>>>>>>>>>>> maintaining those
> > >>>>>>>>>>>> offerings is pretty high.  Building good APIs within MLlib
> to
> > >>>>>>>>>>>> avoid copying
> > >>>>>>>>>>>> logic across libraries will be well worth it.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> We've talked about this need at Databricks and have also
> been
> > >>>>>>>>>>>> syncing with the creators of MLeap.  It'd be great to get
> this
> > >>>>>>>>>>>> functionality into Spark itself.  Some thoughts:
> > >>>>>>>>>>>> * It'd be valuable to have this go beyond adding transform()
> > >>>>>>>>>>>> methods taking a Row to the current Models.  Instead, it
> would
> > >>>>>>>>>>>> be ideal to
> > >>>>>>>>>>>> have local, lightweight versions of models in mllib-local,
> > >>>>>>>>>>>> outside of the
> > >>>>>>>>>>>> main mllib package (for easier deployment with smaller &
> fewer
> > >>>>>>>>>>>> dependencies).
> > >>>>>>>>>>>> * Supporting Pipelines is important.  For this, it would be
> > >>>>>>>>>>>> ideal to utilize elements of Spark SQL, particularly Rows
> and
> > >>>>>>>>>>>> Types, which
> > >>>>>>>>>>>> could be moved into a local sql package.
> > >>>>>>>>>>>> * This architecture may require some awkward APIs currently
> to
> > >>>>>>>>>>>> have model prediction logic in mllib-local, local model
> classes
> > >>>>>>>>>>>> in
> > >>>>>>>>>>>> mllib-local, and regular (DataFrame-friendly) model classes
> in
> > >>>>>>>>>>>> mllib.  We
> > >>>>>>>>>>>> might find it helpful to break some DeveloperApis in Spark
> 3.0
> > >>>>>>>>>>>> to
> > >>>>>>>>>>>> facilitate this architecture while making it feasible for
> 3rd
> > >>>>>>>>>>>> party
> > >>>>>>>>>>>> developers to extend MLlib APIs (especially in Java).
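A minimal sketch of the "local, lightweight model" idea Joseph describes —
prediction logic that needs no SparkContext or big Spark JARs (the class
names here are invented for illustration, not actual mllib-local APIs):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LocalLinearModel:
    """Prediction-only model: just coefficients, no Spark dependencies."""
    coefficients: List[float]
    intercept: float

    def predict(self, features: List[float]) -> float:
        # A plain dot product; the heavy DataFrame machinery stays in the
        # training-side class in mllib proper.
        return sum(c * x for c, x in zip(self.coefficients, features)) + self.intercept

@dataclass
class LocalPipelineModel:
    """Simplified: a real pipeline would chain heterogeneous stages."""
    stages: List[LocalLinearModel]

    def predict(self, features: List[float]) -> float:
        return self.stages[-1].predict(features)

model = LocalLinearModel(coefficients=[2.0, -1.0], intercept=0.5)
print(model.predict([1.0, 3.0]))  # → -0.5
```

The serving process would load such a class from the persisted model
metadata, keeping the DataFrame-friendly classes in mllib for training.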
> > >>>>>>>>>>>>
> > >>>>>>>>>>> I agree this could be interesting, and feed into the other
> > >>>>>>>>>>> discussion around when (or if) we should be considering
> > >>>>>>>>>>> Spark 3.0. I _think_ we could probably do it with optional
> > >>>>>>>>>>> traits people could mix in to avoid breaking the current
> > >>>>>>>>>>> APIs, but I could be wrong on that point.
> > >>>>>>>>>>>
> > >>>>>>>>>>>> * It could also be worth discussing local DataFrames.  They
> > >>>>>>>>>>>> might not be as important as per-Row transformations, but
> they
> > >>>>>>>>>>>> would be
> > >>>>>>>>>>>> helpful for batching for higher throughput.
> > >>>>>>>>>>>>
> > >>>>>>>>>>> That could be interesting as well.
> > >>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I'll be interested to hear others' thoughts too!
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Joseph
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <holden@...> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi y'all,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> With the renewed interest in ML in Apache Spark now seems
> like
> > >>>>>>>>>>>>> a good a time as any to revisit the online serving
> situation
> > >>>>>>>>>>>>> in Spark ML.
> > >>>>>>>>>>>>> DB & other's have done some excellent working moving a lot
> of
> > >>>>>>>>>>>>> the necessary
> > >>>>>>>>>>>>> tools into a local linear algebra package that doesn't
> depend
> > >>>>>>>>>>>>> on having a
> > >>>>>>>>>>>>> SparkContext.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> There are a few different commercial and non-commercial
> > >>>>>>>>>>>>> solutions round this, but currently our individual
> > >>>>>>>>>>>>> transform/predict
> > >>>>>>>>>>>>> methods are private so they either need to copy or
> > >>>>>>>>>>>>> re-implement (or put
> > >>>>>>>>>>>>> them selves in org.apache.spark) to access them. How would
> > >>>>>>>>>>>>> folks feel about
> > >>>>>>>>>>>>> adding a new trait for ML pipeline stages to expose to do
> > >>>>>>>>>>>>> transformation of
> > >>>>>>>>>>>>> single element inputs (or local collections) that could be
> > >>>>>>>>>>>>> optionally
> > >>>>>>>>>>>>> implemented by stages which support this? That way we can
> have
> > >>>>>>>>>>>>> less copy
> > >>>>>>>>>>>>> and paste code possibly getting out of sync with our model
> > >>>>>>>>>>>>> training.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I think continuing to have on-line serving grow in
> different
> > >>>>>>>>>>>>> projects is probably the right path, forward (folks have
> > >>>>>>>>>>>>> different needs),
> > >>>>>>>>>>>>> but I'd love to see us make it simpler for other projects
> to
> > >>>>>>>>>>>>> build reliable
> > >>>>>>>>>>>>> serving tools.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I realize this maybe puts some of the folks in an awkward
> > >>>>>>>>>>>>> position with their own commercial offerings, but
> hopefully if
> > >>>>>>>>>>>>> we make it
> > >>>>>>>>>>>>> easier for everyone the commercial vendors can benefit as
> > >>>>>>>>>>>>> well.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Cheers,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Holden :)
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> --
> > >>>>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> --
> > >>>>>>>>>>>> Joseph Bradley
> > >>>>>>>>>>>> Software Engineer - Machine Learning
> > >>>>>>>>>>>> Databricks, Inc.
> > >>>>>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> --
> > >>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> --
> > >>>>>>>>>> Joseph Bradley
> > >>>>>>>>>> Software Engineer - Machine Learning
> > >>>>>>>>>> Databricks, Inc.
> > >>>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
> > >>>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> --
> > >>>>>>>>> Cheers,
> > >>>>>>>>> Leif
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Twitter: https://twitter.com/holdenkarau
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Twitter: https://twitter.com/holdenkarau
> > >>>>
> > >>> --
> > > Twitter: https://twitter.com/holdenkarau
> >
> >
> >
> >
> >
> > --
> > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: Revisiting Online serving of Spark models?

Posted by Matei Zaharia <ma...@gmail.com>.
Just wondering, is there an update on this? I haven’t seen a summary of the offline discussion but maybe I’ve missed it.

Matei 

> On Jun 11, 2018, at 8:51 PM, Holden Karau <ho...@gmail.com> wrote:
> 
> So I kicked of a thread on user@ to collect people's feedback there but I'll summarize the offline results later this week too.
> 
> On Tue, Jun 12, 2018, 5:03 AM Liang-Chi Hsieh <vi...@gmail.com> wrote:
> 
> Hi,
> 
> It'd be great if there can be any sharing of the offline discussion. Thanks!
> 
> 
> 
> Holden Karau wrote
> > We’re by the registration sign going to start walking over at 4:05
> > 
> > On Wed, Jun 6, 2018 at 2:43 PM Maximiliano Felice <
> 
> > maximilianofelice@
> 
> >> wrote:
> > 
> >> Hi!
> >>
> >> Do we meet at the entrance?
> >>
> >> See you
> >>
> >>
> >> El mar., 5 de jun. de 2018 3:07 PM, Nick Pentreath <
> >> 
> 
> > nick.pentreath@
> 
> >> escribió:
> >>
> >>> I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.
> >>>
> >>> On Sun, 3 Jun 2018 at 00:24 Holden Karau &lt;
> 
> > holden@
> 
> > &gt; wrote:
> >>>
> >>>> On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <
> >>>> 
> 
> > maximilianofelice@
> 
> >> wrote:
> >>>>
> >>>>> Hi!
> >>>>>
> >>>>> We're already in San Francisco waiting for the summit. We even think
> >>>>> that we spotted @holdenk this afternoon.
> >>>>>
> >>>> Unless you happened to be walking by my garage probably not super
> >>>> likely, spent the day working on scooters/motorcycles (my style is a
> >>>> little
> >>>> less unique in SF :)). Also if you see me feel free to say hi unless I
> >>>> look
> >>>> like I haven't had my first coffee of the day, love chatting with folks
> >>>> IRL
> >>>> :)
> >>>>
> >>>>>
> >>>>> @chris, we're really interested in the Meetup you're hosting. My team
> >>>>> will probably join it since the beginning of you have room for us, and
> >>>>> I'll
> >>>>> join it later after discussing the topics on this thread. I'll send
> >>>>> you an
> >>>>> email regarding this request.
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>> El vie., 1 de jun. de 2018 7:26 AM, Saikat Kanjilal <
> >>>>> 
> 
> > sxk1969@
> 
> >> escribió:
> >>>>>
> >>>>>> @Chris This sounds fantastic, please send summary notes for Seattle
> >>>>>> folks
> >>>>>>
> >>>>>> @Felix I work in downtown Seattle, am wondering if we should a tech
> >>>>>> meetup around model serving in spark at my work or elsewhere close,
> >>>>>> thoughts?  I’m actually in the midst of building microservices to
> >>>>>> manage
> >>>>>> models and when I say models I mean much more than machine learning
> >>>>>> models
> >>>>>> (think OR, process models as well)
> >>>>>>
> >>>>>> Regards
> >>>>>>
> >>>>>> Sent from my iPhone
> >>>>>>
> >>>>>> On May 31, 2018, at 10:32 PM, Chris Fregly &lt;
> 
> > chris@
> 
> > &gt; wrote:
> >>>>>>
> >>>>>> Hey everyone!
> >>>>>>
> >>>>>> @Felix:  thanks for putting this together.  i sent some of you a
> >>>>>> quick
> >>>>>> calendar event - mostly for me, so i don’t forget!  :)
> >>>>>>
> >>>>>> Coincidentally, this is the focus of June 6th's *Advanced Spark and
> >>>>>> TensorFlow Meetup*
> >>>>>> &lt;https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/&gt;
> >>>>>> @5:30pm
> >>>>>> on June 6th (same night) here in SF!
> >>>>>>
> >>>>>> Everybody is welcome to come.  Here’s the link to the meetup that
> >>>>>> includes the signup link:
> >>>>>> *https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/*
> >>>>>> &lt;https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/&gt;
> >>>>>>
> >>>>>> We have an awesome lineup of speakers covered a lot of deep,
> >>>>>> technical
> >>>>>> ground.
> >>>>>>
> >>>>>> For those who can’t attend in person, we’ll be broadcasting live -
> >>>>>> and
> >>>>>> posting the recording afterward.
> >>>>>>
> >>>>>> All details are in the meetup link above…
> >>>>>>
> >>>>>> @holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than
> >>>>>> welcome to give a talk. I can move things around to make room.
> >>>>>>
> >>>>>> @joseph:  I’d personally like an update on the direction of the
> >>>>>> Databricks proprietary ML Serving export format which is similar to
> >>>>>> PMML
> >>>>>> but not a standard in any way.
> >>>>>>
> >>>>>> Also, the Databricks ML Serving Runtime is only available to
> >>>>>> Databricks customers.  This seems in conflict with the community
> >>>>>> efforts
> >>>>>> described here.  Can you comment on behalf of Databricks?
> >>>>>>
> >>>>>> Look forward to your response, joseph.
> >>>>>>
> >>>>>> See you all soon!
> >>>>>>
> >>>>>> —
> >>>>>>
> >>>>>>
> >>>>>> *Chris Fregly *Founder @ *PipelineAI* &lt;https://pipeline.ai/&gt;
> >>>>>> (100,000
> >>>>>> Users)
> >>>>>> Organizer @ *Advanced Spark and TensorFlow Meetup*
> >>>>>> &lt;https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/&gt;
> >>>>>> (85,000
> >>>>>> Global Members)
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> *San Francisco - Chicago - Austin -
> >>>>>> Washington DC - London - Dusseldorf *
> >>>>>> *Try our PipelineAI Community Edition with GPUs and TPUs!!
> >>>>>> &lt;http://community.pipeline.ai/&gt;*
> >>>>>>
> >>>>>>
> >>>>>> On May 30, 2018, at 9:32 AM, Felix Cheung &lt;
> 
> > felixcheung_m@
> 
> > &gt;
> >>>>>> wrote:
> >>>>>>
> >>>>>> Hi!
> >>>>>>
> >>>>>> Thank you! Let’s meet then
> >>>>>>
> >>>>>> June 6 4pm
> >>>>>>
> >>>>>> Moscone West Convention Center
> >>>>>> 800 Howard Street, San Francisco, CA 94103
> >>>>>> &lt;https://maps.google.com/?q=800+Howard+Street,+San+Francisco,+CA+94103&amp;entry=gmail&amp;source=g&gt;
> >>>>>>
> >>>>>> Ground floor (outside of conference area - should be available for
> >>>>>> all) - we will meet and decide where to go
> >>>>>>
> >>>>>> (Would not send invite because that would be too much noise for dev@)
> >>>>>>
> >>>>>> To paraphrase Joseph, we will use this to kick off the discusssion
> >>>>>> and
> >>>>>> post notes after and follow up online. As for Seattle, I would be
> >>>>>> very
> >>>>>> interested to meet in person lateen and discuss ;)
> >>>>>>
> >>>>>>
> >>>>>> _____________________________
> >>>>>> From: Saikat Kanjilal &lt;
> 
> > sxk1969@
> 
> > &gt;
> >>>>>> Sent: Tuesday, May 29, 2018 11:46 AM
> >>>>>> Subject: Re: Revisiting Online serving of Spark models?
> >>>>>> To: Maximiliano Felice &lt;
> 
> > maximilianofelice@
> 
> > &gt;
> >>>>>> Cc: Felix Cheung &lt;
> 
> > felixcheung_m@
> 
> > &gt;, Holden Karau <
> >>>>>> 
> 
> > holden@
> 
> >>, Joseph Bradley &lt;
> 
> > joseph@
> 
> > &gt;, Leif
> >>>>>> Walsh &lt;
> 
> > leif.walsh@
> 
> > &gt;, dev &lt;
> 
> > dev@.apache
> 
> > &gt;
> >>>>>>
> >>>>>>
> >>>>>> Would love to join but am in Seattle, thoughts on how to make this
> >>>>>> work?
> >>>>>>
> >>>>>> Regards
> >>>>>>
> >>>>>> Sent from my iPhone
> >>>>>>
> >>>>>> On May 29, 2018, at 10:35 AM, Maximiliano Felice <
> >>>>>> 
> 
> > maximilianofelice@
> 
> >> wrote:
> >>>>>>
> >>>>>> Big +1 to a meeting with fresh air.
> >>>>>>
> >>>>>> Could anyone send the invites? I don't really know which is the place
> >>>>>> Holden is talking about.
> >>>>>>
> >>>>>> 2018-05-29 14:27 GMT-03:00 Felix Cheung &lt;
> 
> > felixcheung_m@
> 
> > &gt;:
> >>>>>>
> >>>>>>> You had me at blue bottle!
> >>>>>>>
> >>>>>>> _____________________________
> >>>>>>> From: Holden Karau &lt;
> 
> > holden@
> 
> > &gt;
> >>>>>>> Sent: Tuesday, May 29, 2018 9:47 AM
> >>>>>>> Subject: Re: Revisiting Online serving of Spark models?
> >>>>>>> To: Felix Cheung &lt;
> 
> > felixcheung_m@
> 
> > &gt;
> >>>>>>> Cc: Saikat Kanjilal &lt;
> 
> > sxk1969@
> 
> > &gt;, Maximiliano Felice <
> >>>>>>> 
> 
> > maximilianofelice@
> 
> >>, Joseph Bradley &lt;
> 
> > joseph@
> 
> > &gt;,
> >>>>>>> Leif Walsh &lt;
> 
> > leif.walsh@
> 
> > &gt;, dev &lt;
> 
> > dev@.apache
> 
> > &gt;
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> I'm down for that, we could all go for a walk maybe to the mint
> >>>>>>> plazaa blue bottle and grab coffee (if the weather holds have our
> >>>>>>> design
> >>>>>>> meeting outside :p)?
> >>>>>>>
> >>>>>>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <
> >>>>>>> 
> 
> > felixcheung_m@
> 
> >> wrote:
> >>>>>>>
> >>>>>>>> Bump.
> >>>>>>>>
> >>>>>>>> ------------------------------
> >>>>>>>> *From:* Felix Cheung &lt;
> 
> > felixcheung_m@
> 
> > &gt;
> >>>>>>>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
> >>>>>>>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
> >>>>>>>> *Cc:* Leif Walsh; Holden Karau; dev
> >>>>>>>>
> >>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
> >>>>>>>>
> >>>>>>>> Hi! How about we meet the community and discuss on June 6 4pm at
> >>>>>>>> (near) the Summit?
> >>>>>>>>
> >>>>>>>> (I propose we meet at the venue entrance so we could accommodate
> >>>>>>>> people might not be in the conference)
> >>>>>>>>
> >>>>>>>> ------------------------------
> >>>>>>>> *From:* Saikat Kanjilal <sxk1969@...>
> >>>>>>>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
> >>>>>>>> *To:* Maximiliano Felice
> >>>>>>>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
> >>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
> >>>>>>>>
> >>>>>>>> I’m in the same exact boat as Maximiliano and have use cases as
> >>>>>>>> well
> >>>>>>>> for model serving and would love to join this discussion.
> >>>>>>>>
> >>>>>>>> Sent from my iPhone
> >>>>>>>>
> >>>>>>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <maximilianofelice@...> wrote:
> >>>>>>>>
> >>>>>>>> Hi!
> >>>>>>>>
> >>>>>>>> I'm don't usually write a lot on this list but I keep up to date
> >>>>>>>> with the discussions and I'm a heavy user of Spark. This topic
> >>>>>>>> caught my
> >>>>>>>> attention, as we're currently facing this issue at work. I'm
> >>>>>>>> attending to
> >>>>>>>> the summit and was wondering if it would it be possible for me to
> >>>>>>>> join that
> >>>>>>>> meeting. I might be able to share some helpful usecases and ideas.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Maximiliano Felice
> >>>>>>>>
> >>>>>>>> On Tue., May 22, 2018 9:14 AM, Leif Walsh <leif.walsh@...> wrote:
> >>>>>>>>
> >>>>>>>>> I’m with you on json being more readable than parquet, but we’ve
> >>>>>>>>> had success using pyarrow’s parquet reader and have been quite
> >>>>>>>>> happy with
> >>>>>>>>> it so far. If your target is python (and probably if not now, then
> >>>>>>>>> soon,
> >>>>>>>>> R), you should look in to it.
> >>>>>>>>>
> >>>>>>>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <joseph@...> wrote:
> >>>>>>>>>
> >>>>>>>>>> Regarding model reading and writing, I'll give quick thoughts
> >>>>>>>>>> here:
> >>>>>>>>>> * Our approach was to use the same format but write JSON instead
> >>>>>>>>>> of Parquet.  It's easier to parse JSON without Spark, and using
> >>>>>>>>>> the same
> >>>>>>>>>> format simplifies architecture.  Plus, some people want to check
> >>>>>>>>>> files into
> >>>>>>>>>> version control, and JSON is nice for that.
> >>>>>>>>>> * The reader/writer APIs could be extended to take format
> >>>>>>>>>> parameters (just like DataFrame reader/writers) to handle JSON
> >>>>>>>>>> (and maybe,
> >>>>>>>>>> eventually, handle Parquet in the online serving setting).
> >>>>>>>>>>
> >>>>>>>>>> This would be a big project, so proposing a SPIP might be best.
> >>>>>>>>>> If people are around at the Spark Summit, that could be a good
> >>>>>>>>>> time to meet
> >>>>>>>>>> up & then post notes back to the dev list.
> >>>>>>>>>>
> >>>>>>>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <felixcheung_m@...> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Specifically I’d like bring part of the discussion to Model and
> >>>>>>>>>>> PipelineModel, and various ModelReader and SharedReadWrite
> >>>>>>>>>>> implementations
> >>>>>>>>>>> that rely on SparkContext. This is a big blocker on reusing 
> >>>>>>>>>>> trained models
> >>>>>>>>>>> outside of Spark for online serving.
> >>>>>>>>>>>
> >>>>>>>>>>> What’s the next step? Would folks be interested in getting
> >>>>>>>>>>> together to discuss/get some feedback?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> _____________________________
> >>>>>>>>>>> From: Felix Cheung <felixcheung_m@...>
> >>>>>>>>>>> Sent: Thursday, May 10, 2018 10:10 AM
> >>>>>>>>>>> Subject: Re: Revisiting Online serving of Spark models?
> >>>>>>>>>>> To: Holden Karau <holden@...>, Joseph Bradley <joseph@...>
> >>>>>>>>>>> Cc: dev <dev@spark.apache.org>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Huge +1 on this!
> >>>>>>>>>>>
> >>>>>>>>>>> ------------------------------
> >>>>>>>>>>> *From:* holden.karau@... <holden.karau@...> on behalf of Holden Karau <holden@...>
> >>>>>>>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
> >>>>>>>>>>> *To:* Joseph Bradley
> >>>>>>>>>>> *Cc:* dev
> >>>>>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <joseph@...> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks for bringing this up Holden!  I'm a strong supporter of
> >>>>>>>>>>>> this.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Awesome! I'm glad other folks think something like this belongs
> >>>>>>>>>>> in Spark.
> >>>>>>>>>>>
> >>>>>>>>>>>> This was one of the original goals for mllib-local: to have
> >>>>>>>>>>>> local versions of MLlib models which could be deployed without
> >>>>>>>>>>>> the big
> >>>>>>>>>>>> Spark JARs and without a SparkContext or SparkSession.  There
> >>>>>>>>>>>> are related
> >>>>>>>>>>>> commercial offerings like this : ) but the overhead of
> >>>>>>>>>>>> maintaining those
> >>>>>>>>>>>> offerings is pretty high.  Building good APIs within MLlib to
> >>>>>>>>>>>> avoid copying
> >>>>>>>>>>>> logic across libraries will be well worth it.
> >>>>>>>>>>>>
> >>>>>>>>>>>> We've talked about this need at Databricks and have also been
> >>>>>>>>>>>> syncing with the creators of MLeap.  It'd be great to get this
> >>>>>>>>>>>> functionality into Spark itself.  Some thoughts:
> >>>>>>>>>>>> * It'd be valuable to have this go beyond adding transform()
> >>>>>>>>>>>> methods taking a Row to the current Models.  Instead, it would
> >>>>>>>>>>>> be ideal to
> >>>>>>>>>>>> have local, lightweight versions of models in mllib-local,
> >>>>>>>>>>>> outside of the
> >>>>>>>>>>>> main mllib package (for easier deployment with smaller & fewer
> >>>>>>>>>>>> dependencies).
> >>>>>>>>>>>> * Supporting Pipelines is important.  For this, it would be
> >>>>>>>>>>>> ideal to utilize elements of Spark SQL, particularly Rows and
> >>>>>>>>>>>> Types, which
> >>>>>>>>>>>> could be moved into a local sql package.
> >>>>>>>>>>>> * This architecture may require some awkward APIs currently to
> >>>>>>>>>>>> have model prediction logic in mllib-local, local model classes
> >>>>>>>>>>>> in
> >>>>>>>>>>>> mllib-local, and regular (DataFrame-friendly) model classes in
> >>>>>>>>>>>> mllib.  We
> >>>>>>>>>>>> might find it helpful to break some DeveloperApis in Spark 3.0
> >>>>>>>>>>>> to
> >>>>>>>>>>>> facilitate this architecture while making it feasible for 3rd
> >>>>>>>>>>>> party
> >>>>>>>>>>>> developers to extend MLlib APIs (especially in Java).
> >>>>>>>>>>>>
> >>>>>>>>>>> I agree this could be interesting, and feed into the other
> >>>>>>>>>>> discussion around when (or if) we should be considering Spark
> >>>>>>>>>>> 3.0
> >>>>>>>>>>> I _think_ we could probably do it with optional traits people
> >>>>>>>>>>> could mix in to avoid breaking the current APIs but I could be
> >>>>>>>>>>> wrong on
> >>>>>>>>>>> that point.
> >>>>>>>>>>>
> >>>>>>>>>>>> * It could also be worth discussing local DataFrames.  They
> >>>>>>>>>>>> might not be as important as per-Row transformations, but they
> >>>>>>>>>>>> would be
> >>>>>>>>>>>> helpful for batching for higher throughput.
> >>>>>>>>>>>>
> >>>>>>>>>>> That could be interesting as well.
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'll be interested to hear others' thoughts too!
> >>>>>>>>>>>>
> >>>>>>>>>>>> Joseph
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <holden@...> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi y'all,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> With the renewed interest in ML in Apache Spark now seems like
> >>>>>>>>>>>>> a good a time as any to revisit the online serving situation
> >>>>>>>>>>>>> in Spark ML.
> >>>>>>>>>>>>> DB & other's have done some excellent working moving a lot of
> >>>>>>>>>>>>> the necessary
> >>>>>>>>>>>>> tools into a local linear algebra package that doesn't depend
> >>>>>>>>>>>>> on having a
> >>>>>>>>>>>>> SparkContext.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> There are a few different commercial and non-commercial
> >>>>>>>>>>>>> solutions round this, but currently our individual
> >>>>>>>>>>>>> transform/predict
> >>>>>>>>>>>>> methods are private so they either need to copy or
> >>>>>>>>>>>>> re-implement (or put
> >>>>>>>>>>>>> them selves in org.apache.spark) to access them. How would
> >>>>>>>>>>>>> folks feel about
> >>>>>>>>>>>>> adding a new trait for ML pipeline stages to expose to do
> >>>>>>>>>>>>> transformation of
> >>>>>>>>>>>>> single element inputs (or local collections) that could be
> >>>>>>>>>>>>> optionally
> >>>>>>>>>>>>> implemented by stages which support this? That way we can have
> >>>>>>>>>>>>> less copy
> >>>>>>>>>>>>> and paste code possibly getting out of sync with our model
> >>>>>>>>>>>>> training.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I think continuing to have on-line serving grow in different
> >>>>>>>>>>>>> projects is probably the right path, forward (folks have
> >>>>>>>>>>>>> different needs),
> >>>>>>>>>>>>> but I'd love to see us make it simpler for other projects to
> >>>>>>>>>>>>> build reliable
> >>>>>>>>>>>>> serving tools.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I realize this maybe puts some of the folks in an awkward
> >>>>>>>>>>>>> position with their own commercial offerings, but hopefully if
> >>>>>>>>>>>>> we make it
> >>>>>>>>>>>>> easier for everyone the commercial vendors can benefit as
> >>>>>>>>>>>>> well.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Holden :)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Joseph Bradley
> >>>>>>>>>>>> Software Engineer - Machine Learning
> >>>>>>>>>>>> Databricks, Inc.
> >>>>>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Joseph Bradley
> >>>>>>>>>> Software Engineer - Machine Learning
> >>>>>>>>>> Databricks, Inc.
> >>>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
> >>>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> --
> >>>>>>>>> Cheers,
> >>>>>>>>> Leif
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Twitter: https://twitter.com/holdenkarau
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Twitter: https://twitter.com/holdenkarau
> >>>>
> >>> --
> > Twitter: https://twitter.com/holdenkarau


---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Revisiting Online serving of Spark models?

Posted by Vadim Chelyshov <qt...@gmail.com>.
I've almost completed a library for speeding up serving of current Spark models:
https://github.com/Hydrospheredata/fastserving. It depends on Spark, but it
provides a way to turn the Spark logical plan (captured from a DataFrame sample
passed through a pipeline/transformer) into an alternative transformer that works
on a local data structure, yielding a significant performance speedup.

Looking ahead, I think introducing a DataFrame-like structure with an exposed
Catalyst-like AST, supporting different modes of interpretation (local/Spark),
could solve the current problems with "minimal" rewriting.
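To make that idea concrete, here is a minimal, hypothetical sketch (all names
invented — this is not fastserving's or Spark's actual API) of a tiny expression
AST that, like Catalyst, can be interpreted in more than one way; the interpreter
shown evaluates it directly against an in-memory row, with no SparkContext:

```scala
// A tiny expression AST over doubles. A Spark-side interpreter could walk
// the same tree to build a Catalyst plan; here we only show local evaluation.
sealed trait Expr
case class Col(name: String)     extends Expr
case class Lit(value: Double)    extends Expr
case class Mul(l: Expr, r: Expr) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

// "Local" interpreter: evaluate an expression against one in-memory row.
def evalLocal(e: Expr, row: Map[String, Double]): Double = e match {
  case Col(n)    => row(n)
  case Lit(v)    => v
  case Mul(l, r) => evalLocal(l, row) * evalLocal(r, row)
  case Add(l, r) => evalLocal(l, row) + evalLocal(r, row)
}

// e.g. a linear model's scoring expression: 0.5*x1 + (-1.0)*x2 + 2.0
val score = Add(Add(Mul(Lit(0.5), Col("x1")), Mul(Lit(-1.0), Col("x2"))), Lit(2.0))
val y = evalLocal(score, Map("x1" -> 2.0, "x2" -> 1.0))  // 1.0 - 1.0 + 2.0 = 2.0
```

The point of the sketch is the separation: the model is data (the AST), and each
runtime chooses its own interpreter, so serving needs no Spark dependency at all.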



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/



Re: Revisiting Online serving of Spark models?

Posted by Holden Karau <ho...@gmail.com>.
So I kicked off a thread on user@ to collect people's feedback there, and I'll
also summarize the offline results later this week.

On Tue, Jun 12, 2018, 5:03 AM Liang-Chi Hsieh <vi...@gmail.com> wrote:

>
> Hi,
>
> It'd be great if there can be any sharing of the offline discussion.
> Thanks!
>
>
>

Re: Revisiting Online serving of Spark models?

Posted by Liang-Chi Hsieh <vi...@gmail.com>.
Hi,

It'd be great if there could be some sharing of the offline discussion. Thanks!



Holden Karau wrote
> We’re by the registration sign going to start walking over at 4:05
>
> On Wed, Jun 6, 2018 at 2:43 PM Maximiliano Felice <maximilianofelice@> wrote:
>
>> Hi!
>>
>> Do we meet at the entrance?
>>
>> See you





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Revisiting Online serving of Spark models?

Posted by Holden Karau <ho...@pigscanfly.ca>.
We’re by the registration sign going to start walking over at 4:05

On Wed, Jun 6, 2018 at 2:43 PM Maximiliano Felice <
maximilianofelice@gmail.com> wrote:

> Hi!
>
> Do we meet at the entrance?
>
> See you
>
>
> El mar., 5 de jun. de 2018 3:07 PM, Nick Pentreath <
> nick.pentreath@gmail.com> escribió:
>
>> I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.
>>
>> On Sun, 3 Jun 2018 at 00:24 Holden Karau <ho...@pigscanfly.ca> wrote:
>>
>>> On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <
>>> maximilianofelice@gmail.com> wrote:
>>>
>>>> Hi!
>>>>
>>>> We're already in San Francisco waiting for the summit. We even think
>>>> that we spotted @holdenk this afternoon.
>>>>
>>> Unless you happened to be walking by my garage probably not super
>>> likely, spent the day working on scooters/motorcycles (my style is a little
>>> less unique in SF :)). Also if you see me feel free to say hi unless I look
>>> like I haven't had my first coffee of the day, love chatting with folks IRL
>>> :)
>>>
>>>>
>>>> @chris, we're really interested in the Meetup you're hosting. My team
>>>> will probably join it from the beginning if you have room for us, and I'll
>>>> join it later after discussing the topics on this thread. I'll send you an
>>>> email regarding this request.
>>>>
>>>> Thanks
>>>>
>>>> El vie., 1 de jun. de 2018 7:26 AM, Saikat Kanjilal <
>>>> sxk1969@hotmail.com> escribió:
>>>>
>>>>> @Chris This sounds fantastic, please send summary notes for Seattle
>>>>> folks
>>>>>
>>>>> @Felix I work in downtown Seattle, and am wondering if we should host a
>>>>> tech meetup around model serving in Spark at my work or elsewhere close by,
>>>>> thoughts?  I’m actually in the midst of building microservices to manage
>>>>> models and when I say models I mean much more than machine learning models
>>>>> (think OR, process models as well)
>>>>>
>>>>> Regards
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On May 31, 2018, at 10:32 PM, Chris Fregly <ch...@fregly.com> wrote:
>>>>>
>>>>> Hey everyone!
>>>>>
>>>>> @Felix:  thanks for putting this together.  i sent some of you a quick
>>>>> calendar event - mostly for me, so i don’t forget!  :)
>>>>>
>>>>> Coincidentally, this is the focus of June 6th's *Advanced Spark and
>>>>> TensorFlow Meetup*
>>>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/> @5:30pm
>>>>> on June 6th (same night) here in SF!
>>>>>
>>>>> Everybody is welcome to come.  Here’s the link to the meetup that
>>>>> includes the signup link:
>>>>> *https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/*
>>>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/>
>>>>>
>>>>> We have an awesome lineup of speakers covering a lot of deep, technical
>>>>> ground.
>>>>>
>>>>> For those who can’t attend in person, we’ll be broadcasting live - and
>>>>> posting the recording afterward.
>>>>>
>>>>> All details are in the meetup link above…
>>>>>
>>>>> @holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than
>>>>> welcome to give a talk. I can move things around to make room.
>>>>>
>>>>> @joseph:  I’d personally like an update on the direction of the
>>>>> Databricks proprietary ML Serving export format which is similar to PMML
>>>>> but not a standard in any way.
>>>>>
>>>>> Also, the Databricks ML Serving Runtime is only available to
>>>>> Databricks customers.  This seems in conflict with the community efforts
>>>>> described here.  Can you comment on behalf of Databricks?
>>>>>
>>>>> Look forward to your response, joseph.
>>>>>
>>>>> See you all soon!
>>>>>
>>>>> —
>>>>>
>>>>>
>>>>> *Chris Fregly *Founder @ *PipelineAI* <https://pipeline.ai/> (100,000
>>>>> Users)
>>>>> Organizer @ *Advanced Spark and TensorFlow Meetup*
>>>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/> (85,000
>>>>> Global Members)
>>>>>
>>>>>
>>>>>
>>>>> *San Francisco - Chicago - Austin -
>>>>> Washington DC - London - Dusseldorf *
>>>>> *Try our PipelineAI Community Edition with GPUs and TPUs!!
>>>>> <http://community.pipeline.ai/>*
>>>>>
>>>>>
>>>>> On May 30, 2018, at 9:32 AM, Felix Cheung <fe...@hotmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>> Thank you! Let’s meet then
>>>>>
>>>>> June 6 4pm
>>>>>
>>>>> Moscone West Convention Center
>>>>> 800 Howard Street, San Francisco, CA 94103
>>>>>
>>>>> Ground floor (outside of conference area - should be available for
>>>>> all) - we will meet and decide where to go
>>>>>
>>>>> (Would not send invite because that would be too much noise for dev@)
>>>>>
>>>>> To paraphrase Joseph, we will use this to kick off the discussion and
>>>>> post notes after and follow up online. As for Seattle, I would be very
>>>>> interested to meet in person later and discuss ;)
>>>>>
>>>>>
>>>>> _____________________________
>>>>> From: Saikat Kanjilal <sx...@hotmail.com>
>>>>> Sent: Tuesday, May 29, 2018 11:46 AM
>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>> To: Maximiliano Felice <ma...@gmail.com>
>>>>> Cc: Felix Cheung <fe...@hotmail.com>, Holden Karau <
>>>>> holden@pigscanfly.ca>, Joseph Bradley <jo...@databricks.com>, Leif
>>>>> Walsh <le...@gmail.com>, dev <de...@spark.apache.org>
>>>>>
>>>>>
>>>>> Would love to join but am in Seattle, thoughts on how to make this
>>>>> work?
>>>>>
>>>>> Regards
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On May 29, 2018, at 10:35 AM, Maximiliano Felice <
>>>>> maximilianofelice@gmail.com> wrote:
>>>>>
>>>>> Big +1 to a meeting with fresh air.
>>>>>
>>>>> Could anyone send the invites? I don't really know which is the place
>>>>> Holden is talking about.
>>>>>
>>>>> 2018-05-29 14:27 GMT-03:00 Felix Cheung <fe...@hotmail.com>:
>>>>>
>>>>>> You had me at blue bottle!
>>>>>>
>>>>>> _____________________________
>>>>>> From: Holden Karau <ho...@pigscanfly.ca>
>>>>>> Sent: Tuesday, May 29, 2018 9:47 AM
>>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>>> Cc: Saikat Kanjilal <sx...@hotmail.com>, Maximiliano Felice <
>>>>>> maximilianofelice@gmail.com>, Joseph Bradley <jo...@databricks.com>,
>>>>>> Leif Walsh <le...@gmail.com>, dev <de...@spark.apache.org>
>>>>>>
>>>>>>
>>>>>>
>>>>>> I'm down for that, we could all go for a walk maybe to the mint
>>>>>> plazaa blue bottle and grab coffee (if the weather holds have our design
>>>>>> meeting outside :p)?
>>>>>>
>>>>>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <
>>>>>> felixcheung_m@hotmail.com> wrote:
>>>>>>
>>>>>>> Bump.
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> *From:* Felix Cheung <fe...@hotmail.com>
>>>>>>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>>>>>>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>>>>>>> *Cc:* Leif Walsh; Holden Karau; dev
>>>>>>>
>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>>
>>>>>>> Hi! How about we meet the community and discuss on June 6 4pm at
>>>>>>> (near) the Summit?
>>>>>>>
>>>>>>> (I propose we meet at the venue entrance so we can accommodate
>>>>>>> people who might not be in the conference)
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> *From:* Saikat Kanjilal <sx...@hotmail.com>
>>>>>>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>>>>>>> *To:* Maximiliano Felice
>>>>>>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>>
>>>>>>> I’m in the same exact boat as Maximiliano and have use cases as well
>>>>>>> for model serving and would love to join this discussion.
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <
>>>>>>> maximilianofelice@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> I don't usually write a lot on this list, but I keep up to date
>>>>>>> with the discussions and I'm a heavy user of Spark. This topic caught my
>>>>>>> attention, as we're currently facing this issue at work. I'm attending
>>>>>>> the summit and was wondering if it would be possible for me to join that
>>>>>>> meeting. I might be able to share some helpful use cases and ideas.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Maximiliano Felice
>>>>>>>
>>>>>>> El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <
>>>>>>> leif.walsh@gmail.com> escribió:
>>>>>>>
>>>>>>>> I’m with you on JSON being more readable than Parquet, but we’ve
>>>>>>>> had success using pyarrow’s parquet reader and have been quite happy with
>>>>>>>> it so far. If your target is Python (and probably if not now, then soon,
>>>>>>>> R), you should look into it.
>>>>>>>>
>>>>>>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <jo...@databricks.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Regarding model reading and writing, I'll give quick thoughts
>>>>>>>>> here:
>>>>>>>>> * Our approach was to use the same format but write JSON instead
>>>>>>>>> of Parquet.  It's easier to parse JSON without Spark, and using the same
>>>>>>>>> format simplifies architecture.  Plus, some people want to check files into
>>>>>>>>> version control, and JSON is nice for that.
>>>>>>>>> * The reader/writer APIs could be extended to take format
>>>>>>>>> parameters (just like DataFrame reader/writers) to handle JSON (and maybe,
>>>>>>>>> eventually, handle Parquet in the online serving setting).
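To make the JSON point concrete, here is a minimal sketch of parsing model metadata outside Spark; the field names below mimic the spirit of Spark ML's saved metadata but are illustrative, not a guaranteed on-disk contract:

```python
import json

# Hypothetical metadata blob in the style a Spark ML writer might emit.
metadata_json = json.dumps({
    "class": "org.apache.spark.ml.regression.LinearRegressionModel",
    "sparkVersion": "2.3.0",
    "uid": "linReg_4a9f",
    "paramMap": {"regParam": 0.1, "elasticNetParam": 0.0},
})

# Parsing needs nothing but the standard library: no SparkContext, no Spark
# JARs, and the text diffs cleanly, which is what makes it version-control friendly.
metadata = json.loads(metadata_json)
print(metadata["paramMap"]["regParam"])  # -> 0.1
```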
>>>>>>>>>
>>>>>>>>> This would be a big project, so proposing a SPIP might be best.
>>>>>>>>> If people are around at the Spark Summit, that could be a good time to meet
>>>>>>>>> up & then post notes back to the dev list.
>>>>>>>>>
>>>>>>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <
>>>>>>>>> felixcheung_m@hotmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Specifically I’d like bring part of the discussion to Model and
>>>>>>>>>> PipelineModel, and various ModelReader and SharedReadWrite implementations
>>>>>>>>>> that rely on SparkContext. This is a big blocker on reusing trained models
>>>>>>>>>> outside of Spark for online serving.
>>>>>>>>>>
>>>>>>>>>> What’s the next step? Would folks be interested in getting
>>>>>>>>>> together to discuss/get some feedback?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _____________________________
>>>>>>>>>> From: Felix Cheung <fe...@hotmail.com>
>>>>>>>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>>>>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>>>>>>> To: Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <
>>>>>>>>>> joseph@databricks.com>
>>>>>>>>>> Cc: dev <de...@spark.apache.org>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Huge +1 on this!
>>>>>>>>>>
>>>>>>>>>> ------------------------------
>>>>>>>>>> *From:* holden.karau@gmail.com <ho...@gmail.com> on behalf
>>>>>>>>>> of Holden Karau <ho...@pigscanfly.ca>
>>>>>>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>>>>>>>> *To:* Joseph Bradley
>>>>>>>>>> *Cc:* dev
>>>>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <
>>>>>>>>>> joseph@databricks.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for bringing this up Holden!  I'm a strong supporter of
>>>>>>>>>>> this.
>>>>>>>>>>>
>>>>>>>>>>> Awesome! I'm glad other folks think something like this belongs
>>>>>>>>>> in Spark.
>>>>>>>>>>
>>>>>>>>>>> This was one of the original goals for mllib-local: to have
>>>>>>>>>>> local versions of MLlib models which could be deployed without the big
>>>>>>>>>>> Spark JARs and without a SparkContext or SparkSession.  There are related
>>>>>>>>>>> commercial offerings like this : ) but the overhead of maintaining those
>>>>>>>>>>> offerings is pretty high.  Building good APIs within MLlib to avoid copying
>>>>>>>>>>> logic across libraries will be well worth it.
>>>>>>>>>>>
>>>>>>>>>>> We've talked about this need at Databricks and have also been
>>>>>>>>>>> syncing with the creators of MLeap.  It'd be great to get this
>>>>>>>>>>> functionality into Spark itself.  Some thoughts:
>>>>>>>>>>> * It'd be valuable to have this go beyond adding transform()
>>>>>>>>>>> methods taking a Row to the current Models.  Instead, it would be ideal to
>>>>>>>>>>> have local, lightweight versions of models in mllib-local, outside of the
>>>>>>>>>>> main mllib package (for easier deployment with smaller & fewer
>>>>>>>>>>> dependencies).
>>>>>>>>>>> * Supporting Pipelines is important.  For this, it would be
>>>>>>>>>>> ideal to utilize elements of Spark SQL, particularly Rows and Types, which
>>>>>>>>>>> could be moved into a local sql package.
>>>>>>>>>>> * This architecture may require some awkward APIs currently to
>>>>>>>>>>> have model prediction logic in mllib-local, local model classes in
>>>>>>>>>>> mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We
>>>>>>>>>>> might find it helpful to break some DeveloperApis in Spark 3.0 to
>>>>>>>>>>> facilitate this architecture while making it feasible for 3rd party
>>>>>>>>>>> developers to extend MLlib APIs (especially in Java).
>>>>>>>>>>>
>>>>>>>>>> I agree this could be interesting, and feed into the other
>>>>>>>>>> discussion around when (or if) we should be considering Spark 3.0
>>>>>>>>>> I _think_ we could probably do it with optional traits people
>>>>>>>>>> could mix in to avoid breaking the current APIs but I could be wrong on
>>>>>>>>>> that point.
>>>>>>>>>>
>>>>>>>>>>> * It could also be worth discussing local DataFrames.  They
>>>>>>>>>>> might not be as important as per-Row transformations, but they would be
>>>>>>>>>>> helpful for batching for higher throughput.
>>>>>>>>>>>
>>>>>>>>>> That could be interesting as well.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'll be interested to hear others' thoughts too!
>>>>>>>>>>>
>>>>>>>>>>> Joseph
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <
>>>>>>>>>>> holden@pigscanfly.ca> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi y'all,
>>>>>>>>>>>>
>>>>>>>>>>>> With the renewed interest in ML in Apache Spark, now seems like
>>>>>>>>>>>> as good a time as any to revisit the online serving situation in Spark ML.
>>>>>>>>>>>> DB & others have done some excellent work moving a lot of the necessary
>>>>>>>>>>>> tools into a local linear algebra package that doesn't depend on having a
>>>>>>>>>>>> SparkContext.
>>>>>>>>>>>>
>>>>>>>>>>>> There are a few different commercial and non-commercial
>>>>>>>>>>>> solutions around this, but currently our individual transform/predict
>>>>>>>>>>>> methods are private, so they either need to copy or re-implement (or put
>>>>>>>>>>>> themselves in org.apache.spark) to access them. How would folks feel about
>>>>>>>>>>>> adding a new trait for ML pipeline stages that exposes transformation of
>>>>>>>>>>>> single-element inputs (or local collections), to be optionally
>>>>>>>>>>>> implemented by stages which support it? That way we can have less copy
>>>>>>>>>>>> and paste code possibly getting out of sync with our model training.
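A minimal, Spark-free sketch of the optional-trait idea in plain Python (all names here are hypothetical, not real Spark ML APIs): stages that support single-element transformation mix in one method, and the batch path reuses it, so the two cannot drift apart.

```python
from abc import ABC, abstractmethod

class SingleRowTransformer(ABC):
    """Hypothetical optional trait: stages that can transform one input at a time."""

    @abstractmethod
    def transform_one(self, row: dict) -> dict:
        ...

class ScalingStage(SingleRowTransformer):
    """Toy stage: multiplies the 'value' column by a fixed factor."""

    def __init__(self, factor: float):
        self.factor = factor

    def transform_one(self, row: dict) -> dict:
        return {**row, "scaled": row["value"] * self.factor}

    def transform(self, rows: list) -> list:
        # The batch path reuses the single-row logic, so serving-time and
        # training-time transforms stay in sync by construction.
        return [self.transform_one(r) for r in rows]

stage = ScalingStage(2.0)
print(stage.transform_one({"value": 3.0}))  # → {'value': 3.0, 'scaled': 6.0}
```

Stages that cannot meaningfully transform a single element simply don't mix the trait in, so no existing API needs to break.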
>>>>>>>>>>>>
>>>>>>>>>>>> I think continuing to have online serving grow in different
>>>>>>>>>>>> projects is probably the right path forward (folks have different needs),
>>>>>>>>>>>> but I'd love to see us make it simpler for other projects to build reliable
>>>>>>>>>>>> serving tools.
>>>>>>>>>>>>
>>>>>>>>>>>> I realize this may put some of the folks in an awkward
>>>>>>>>>>>> position with their own commercial offerings, but hopefully if we make it
>>>>>>>>>>>> easier for everyone, the commercial vendors can benefit as well.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>
>>>>>>>>>>>> Holden :)
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Joseph Bradley
>>>>>>>>>>> Software Engineer - Machine Learning
>>>>>>>>>>> Databricks, Inc.
>>>>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Joseph Bradley
>>>>>>>>> Software Engineer - Machine Learning
>>>>>>>>> Databricks, Inc.
>>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>>>
>>>>>>>> --
>>>>>>>> --
>>>>>>>> Cheers,
>>>>>>>> Leif
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>> --
Twitter: https://twitter.com/holdenkarau

Re: Revisiting Online serving of Spark models?

Posted by Maximiliano Felice <ma...@gmail.com>.
Hi!

Do we meet at the entrance?

See you

El mar., 5 de jun. de 2018 3:07 PM, Nick Pentreath <ni...@gmail.com>
escribió:

> I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.
>
> On Sun, 3 Jun 2018 at 00:24 Holden Karau <ho...@pigscanfly.ca> wrote:
>
>> On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <
>> maximilianofelice@gmail.com> wrote:
>>
>>> Hi!
>>>
>>> We're already in San Francisco waiting for the summit. We even think
>>> that we spotted @holdenk this afternoon.
>>>
>> Unless you happened to be walking by my garage, probably not super likely;
>> I spent the day working on scooters/motorcycles (my style is a little less
>> unique in SF :)). Also, if you see me, feel free to say hi unless I look like
>> I haven't had my first coffee of the day; I love chatting with folks IRL :)
>>
>>>
>>> @chris, we're really interested in the Meetup you're hosting. My team
>>> will probably join it from the beginning if you have room for us, and I'll
>>> join later, after discussing the topics on this thread. I'll send you an
>>> email regarding this request.
>>>
>>> Thanks
>>>
>>> El vie., 1 de jun. de 2018 7:26 AM, Saikat Kanjilal <sx...@hotmail.com>
>>> escribió:
>>>
>>>> @Chris This sounds fantastic, please send summary notes for Seattle
>>>> folks
>>>>
>>>> @Felix I work in downtown Seattle, and am wondering if we should have a
>>>> tech meetup around model serving in Spark at my work or somewhere close.
>>>> Thoughts?  I’m actually in the midst of building microservices to manage
>>>> models, and when I say models I mean much more than machine learning models
>>>> (think OR and process models as well)
>>>>
>>>> Regards
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On May 31, 2018, at 10:32 PM, Chris Fregly <ch...@fregly.com> wrote:
>>>>
>>>> Hey everyone!
>>>>
>>>> @Felix:  thanks for putting this together.  i sent some of you a quick
>>>> calendar event - mostly for me, so i don’t forget!  :)
>>>>
>>>> Coincidentally, this is the focus of June 6th's *Advanced Spark and
>>>> TensorFlow Meetup*
>>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/> @5:30pm
>>>> on June 6th (same night) here in SF!
>>>>
>>>> Everybody is welcome to come.  Here’s the link to the meetup that
>>>> includes the signup link:
>>>> *https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/*
>>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/>
>>>>
>>>> We have an awesome lineup of speakers covering a lot of deep, technical
>>>> ground.
>>>>
>>>> For those who can’t attend in person, we’ll be broadcasting live - and
>>>> posting the recording afterward.
>>>>
>>>> All details are in the meetup link above…
>>>>
>>>> @holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than
>>>> welcome to give a talk. I can move things around to make room.
>>>>
>>>> @joseph:  I’d personally like an update on the direction of the
>>>> Databricks proprietary ML Serving export format which is similar to PMML
>>>> but not a standard in any way.
>>>>
>>>> Also, the Databricks ML Serving Runtime is only available to Databricks
>>>> customers.  This seems in conflict with the community efforts described
>>>> here.  Can you comment on behalf of Databricks?
>>>>
>>>> Look forward to your response, joseph.
>>>>
>>>> See you all soon!
>>>>
>>>> —
>>>>
>>>>
>>>> *Chris Fregly *Founder @ *PipelineAI* <https://pipeline.ai/> (100,000
>>>> Users)
>>>> Organizer @ *Advanced Spark and TensorFlow Meetup*
>>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/> (85,000
>>>> Global Members)
>>>>
>>>>
>>>>
>>>> *San Francisco - Chicago - Austin -
>>>> Washington DC - London - Dusseldorf *
>>>> *Try our PipelineAI Community Edition with GPUs and TPUs!!
>>>> <http://community.pipeline.ai/>*
>>>>
>>>>
>>>> On May 30, 2018, at 9:32 AM, Felix Cheung <fe...@hotmail.com>
>>>> wrote:
>>>>
>>>> Hi!
>>>>
>>>> Thank you! Let’s meet then
>>>>
>>>> June 6 4pm
>>>>
>>>> Moscone West Convention Center
>>>> 800 Howard Street, San Francisco, CA 94103
>>>> <https://maps.google.com/?q=800+Howard+Street,+San+Francisco,+CA+94103&entry=gmail&source=g>
>>>>
>>>> Ground floor (outside of conference area - should be available for all)
>>>> - we will meet and decide where to go
>>>>
>>>> (Would not send invite because that would be too much noise for dev@)
>>>>
>>>> To paraphrase Joseph, we will use this to kick off the discussion and
>>>> post notes after and follow up online. As for Seattle, I would be very
>>>> interested to meet in person later and discuss ;)
>>>>
>>>>
>>>> _____________________________
>>>> From: Saikat Kanjilal <sx...@hotmail.com>
>>>> Sent: Tuesday, May 29, 2018 11:46 AM
>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>> To: Maximiliano Felice <ma...@gmail.com>
>>>> Cc: Felix Cheung <fe...@hotmail.com>, Holden Karau <
>>>> holden@pigscanfly.ca>, Joseph Bradley <jo...@databricks.com>, Leif
>>>> Walsh <le...@gmail.com>, dev <de...@spark.apache.org>
>>>>
>>>>
>>>> Would love to join but am in Seattle, thoughts on how to make this
>>>> work?
>>>>
>>>> Regards
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On May 29, 2018, at 10:35 AM, Maximiliano Felice <
>>>> maximilianofelice@gmail.com> wrote:
>>>>
>>>> Big +1 to a meeting with fresh air.
>>>>
>>>> Could anyone send the invites? I don't really know which is the place
>>>> Holden is talking about.
>>>>
>>>> 2018-05-29 14:27 GMT-03:00 Felix Cheung <fe...@hotmail.com>:
>>>>
>>>>> You had me at blue bottle!
>>>>>
>>>>> _____________________________
>>>>> From: Holden Karau <ho...@pigscanfly.ca>
>>>>> Sent: Tuesday, May 29, 2018 9:47 AM
>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>> Cc: Saikat Kanjilal <sx...@hotmail.com>, Maximiliano Felice <
>>>>> maximilianofelice@gmail.com>, Joseph Bradley <jo...@databricks.com>,
>>>>> Leif Walsh <le...@gmail.com>, dev <de...@spark.apache.org>
>>>>>
>>>>>
>>>>>
>>>>> I'm down for that; we could all go for a walk, maybe to the Mint Plaza
>>>>> Blue Bottle, and grab coffee (if the weather holds, have our design
>>>>> meeting outside :p)?
>>>>>
>>>>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <
>>>>> felixcheung_m@hotmail.com> wrote:
>>>>>
>>>>>> Bump.
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* Felix Cheung <fe...@hotmail.com>
>>>>>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>>>>>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>>>>>> *Cc:* Leif Walsh; Holden Karau; dev
>>>>>>
>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>
>>>>>> Hi! How about we meet the community and discuss on June 6 4pm at
>>>>>> (near) the Summit?
>>>>>>
>>>>>> (I propose we meet at the venue entrance so we can accommodate
>>>>>> people who might not be in the conference)
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* Saikat Kanjilal <sx...@hotmail.com>
>>>>>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>>>>>> *To:* Maximiliano Felice
>>>>>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>
>>>>>> I’m in the same exact boat as Maximiliano and have use cases as well
>>>>>> for model serving and would love to join this discussion.
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <
>>>>>> maximilianofelice@gmail.com> wrote:
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> I don't usually write a lot on this list, but I keep up to date with
>>>>>> the discussions and I'm a heavy user of Spark. This topic caught my
>>>>>> attention, as we're currently facing this issue at work. I'm attending
>>>>>> the summit and was wondering if it would be possible for me to join that
>>>>>> meeting. I might be able to share some helpful use cases and ideas.
>>>>>>
>>>>>> Thanks,
>>>>>> Maximiliano Felice
>>>>>>
>>>>>> El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <le...@gmail.com>
>>>>>> escribió:
>>>>>>
>>>>>>> I’m with you on json being more readable than parquet, but we’ve had
>>>>>>> success using pyarrow’s parquet reader and have been quite happy with it so
>>>>>>> far. If your target is python (and probably if not now, then soon, R), you
>>>>>>> should look into it.
>>>>>>>
>>>>>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <jo...@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Regarding model reading and writing, I'll give quick thoughts here:
>>>>>>>> * Our approach was to use the same format but write JSON instead of
>>>>>>>> Parquet.  It's easier to parse JSON without Spark, and using the same
>>>>>>>> format simplifies architecture.  Plus, some people want to check files into
>>>>>>>> version control, and JSON is nice for that.
>>>>>>>> * The reader/writer APIs could be extended to take format
>>>>>>>> parameters (just like DataFrame reader/writers) to handle JSON (and maybe,
>>>>>>>> eventually, handle Parquet in the online serving setting).
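As a rough illustration of why JSON is convenient here, a Spark-free reader/writer for a toy model's metadata might look like the following. The schema below only loosely mirrors MLlib's metadata layout; the class name and fields are illustrative, not the actual on-disk format.

```python
import json

def model_to_json(uid: str, param_map: dict) -> str:
    """Serialize toy model metadata to JSON; readable without any Spark JARs."""
    return json.dumps({
        "class": "org.example.ToyLogisticRegressionModel",  # made-up class name
        "uid": uid,
        "paramMap": param_map,
    }, sort_keys=True)  # stable key order keeps version-control diffs clean

def model_from_json(text: str) -> dict:
    """Parse metadata with nothing but the standard library: no SparkContext."""
    return json.loads(text)

saved = model_to_json("logreg_1234", {"regParam": 0.01, "maxIter": 100})
restored = model_from_json(saved)
print(restored["paramMap"]["regParam"])  # → 0.01
```

The same round-trip with Parquet would require pulling in a Parquet reader, which is exactly the dependency an online-serving process wants to avoid.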
>>>>>>>>
>>>>>>>> This would be a big project, so proposing a SPIP might be best.  If
>>>>>>>> people are around at the Spark Summit, that could be a good time to meet up
>>>>>>>> & then post notes back to the dev list.
>>>>>>>>
>>>>>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <
>>>>>>>> felixcheung_m@hotmail.com> wrote:
>>>>>>>>
>>>>>>>>> Specifically, I’d like to bring part of the discussion to Model and
>>>>>>>>> PipelineModel, and the various ModelReader and SharedReadWrite
>>>>>>>>> implementations that rely on SparkContext. This is a big blocker on
>>>>>>>>> reusing trained models outside of Spark for online serving.
>>>>>>>>>
>>>>>>>>> What’s the next step? Would folks be interested in getting
>>>>>>>>> together to discuss/get some feedback?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _____________________________
>>>>>>>>> From: Felix Cheung <fe...@hotmail.com>
>>>>>>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>>>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>>>>>> To: Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <
>>>>>>>>> joseph@databricks.com>
>>>>>>>>> Cc: dev <de...@spark.apache.org>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Huge +1 on this!
>>>>>>>>>
>>>>>>>>> ------------------------------
>>>>>>>>> *From:*holden.karau@gmail.com <ho...@gmail.com> on behalf
>>>>>>>>> of Holden Karau <ho...@pigscanfly.ca>
>>>>>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>>>>>>> *To:* Joseph Bradley
>>>>>>>>> *Cc:* dev
>>>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <
>>>>>>>>> joseph@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for bringing this up Holden!  I'm a strong supporter of
>>>>>>>>>> this.
>>>>>>>>>>
>>>>>>>>>> Awesome! I'm glad other folks think something like this belongs
>>>>>>>>> in Spark.
>>>>>>>>>
>>>>>>>>>> This was one of the original goals for mllib-local: to have local
>>>>>>>>>> versions of MLlib models which could be deployed without the big Spark JARs
>>>>>>>>>> and without a SparkContext or SparkSession.  There are related commercial
>>>>>>>>>> offerings like this : ) but the overhead of maintaining those offerings is
>>>>>>>>>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>>>>>>>>>> libraries will be well worth it.
>>>>>>>>>>
>>>>>>>>>> We've talked about this need at Databricks and have also been
>>>>>>>>>> syncing with the creators of MLeap.  It'd be great to get this
>>>>>>>>>> functionality into Spark itself.  Some thoughts:
>>>>>>>>>> * It'd be valuable to have this go beyond adding transform()
>>>>>>>>>> methods taking a Row to the current Models.  Instead, it would be ideal to
>>>>>>>>>> have local, lightweight versions of models in mllib-local, outside of the
>>>>>>>>>> main mllib package (for easier deployment with smaller & fewer
>>>>>>>>>> dependencies).
>>>>>>>>>> * Supporting Pipelines is important.  For this, it would be ideal
>>>>>>>>>> to utilize elements of Spark SQL, particularly Rows and Types, which could
>>>>>>>>>> be moved into a local sql package.
>>>>>>>>>> * This architecture may require some awkward APIs currently to
>>>>>>>>>> have model prediction logic in mllib-local, local model classes in
>>>>>>>>>> mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We
>>>>>>>>>> might find it helpful to break some DeveloperApis in Spark 3.0 to
>>>>>>>>>> facilitate this architecture while making it feasible for 3rd party
>>>>>>>>>> developers to extend MLlib APIs (especially in Java).
>>>>>>>>>>
>>>>>>>>> I agree this could be interesting, and feed into the other
>>>>>>>>> discussion around when (or if) we should be considering Spark 3.0.
>>>>>>>>> I _think_ we could probably do it with optional traits people
>>>>>>>>> could mix in to avoid breaking the current APIs, but I could be wrong on
>>>>>>>>> that point.
>>>>>>>>>
>>>>>>>>>> * It could also be worth discussing local DataFrames.  They might
>>>>>>>>>> not be as important as per-Row transformations, but they would be helpful
>>>>>>>>>> for batching for higher throughput.
>>>>>>>>>>
>>>>>>>>> That could be interesting as well.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'll be interested to hear others' thoughts too!
>>>>>>>>>>
>>>>>>>>>> Joseph
>>>>>>>>>>
>>>>>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <
>>>>>>>>>> holden@pigscanfly.ca> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi y'all,
>>>>>>>>>>>
>>>>>>>>>>> With the renewed interest in ML in Apache Spark, now seems like as
>>>>>>>>>>> good a time as any to revisit the online serving situation in Spark ML. DB
>>>>>>>>>>> & others have done some excellent work moving a lot of the necessary
>>>>>>>>>>> tools into a local linear algebra package that doesn't depend on having a
>>>>>>>>>>> SparkContext.
>>>>>>>>>>>
>>>>>>>>>>> There are a few different commercial and non-commercial
>>>>>>>>>>> solutions around this, but currently our individual transform/predict
>>>>>>>>>>> methods are private, so they either need to copy or re-implement (or put
>>>>>>>>>>> themselves in org.apache.spark) to access them. How would folks feel about
>>>>>>>>>>> adding a new trait for ML pipeline stages that exposes transformation of
>>>>>>>>>>> single-element inputs (or local collections), to be optionally
>>>>>>>>>>> implemented by stages which support it? That way we can have less copy
>>>>>>>>>>> and paste code possibly getting out of sync with our model training.
>>>>>>>>>>>
>>>>>>>>>>> I think continuing to have online serving grow in different
>>>>>>>>>>> projects is probably the right path forward (folks have different needs),
>>>>>>>>>>> but I'd love to see us make it simpler for other projects to build reliable
>>>>>>>>>>> serving tools.
>>>>>>>>>>>
>>>>>>>>>>> I realize this may put some of the folks in an awkward
>>>>>>>>>>> position with their own commercial offerings, but hopefully if we make it
>>>>>>>>>>> easier for everyone, the commercial vendors can benefit as well.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>> Holden :)
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Joseph Bradley
>>>>>>>>>> Software Engineer - Machine Learning
>>>>>>>>>> Databricks, Inc.
>>>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Joseph Bradley
>>>>>>>> Software Engineer - Machine Learning
>>>>>>>> Databricks, Inc.
>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>>
>>>>>>> --
>>>>>>> --
>>>>>>> Cheers,
>>>>>>> Leif
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>

Re: Revisiting Online serving of Spark models?

Posted by Nick Pentreath <ni...@gmail.com>.
I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.

On Sun, 3 Jun 2018 at 00:24 Holden Karau <ho...@pigscanfly.ca> wrote:

> On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <
> maximilianofelice@gmail.com> wrote:
>
>> Hi!
>>
>> We're already in San Francisco waiting for the summit. We even think that
>> we spotted @holdenk this afternoon.
>>
> Unless you happened to be walking by my garage, probably not super likely;
> I spent the day working on scooters/motorcycles (my style is a little less
> unique in SF :)). Also, if you see me, feel free to say hi unless I look like
> I haven't had my first coffee of the day; I love chatting with folks IRL :)
>
>>
>> @chris, we're really interested in the Meetup you're hosting. My team
>> will probably join it from the beginning if you have room for us, and I'll
>> join later, after discussing the topics on this thread. I'll send you an
>> email regarding this request.
>>
>> Thanks
>>
>> El vie., 1 de jun. de 2018 7:26 AM, Saikat Kanjilal <sx...@hotmail.com>
>> escribió:
>>
>>> @Chris This sounds fantastic, please send summary notes for Seattle
>>> folks
>>>
>>> @Felix I work in downtown Seattle, and am wondering if we should have a
>>> tech meetup around model serving in Spark at my work or somewhere close.
>>> Thoughts?  I’m actually in the midst of building microservices to manage
>>> models, and when I say models I mean much more than machine learning models
>>> (think OR and process models as well)
>>>
>>> Regards
>>>
>>> Sent from my iPhone
>>>
>>> On May 31, 2018, at 10:32 PM, Chris Fregly <ch...@fregly.com> wrote:
>>>
>>> Hey everyone!
>>>
>>> @Felix:  thanks for putting this together.  i sent some of you a quick
>>> calendar event - mostly for me, so i don’t forget!  :)
>>>
>>> Coincidentally, this is the focus of June 6th's *Advanced Spark and
>>> TensorFlow Meetup*
>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/> @5:30pm
>>> on June 6th (same night) here in SF!
>>>
>>> Everybody is welcome to come.  Here’s the link to the meetup that
>>> includes the signup link:
>>> *https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/*
>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/>
>>>
>>> We have an awesome lineup of speakers covering a lot of deep, technical
>>> ground.
>>>
>>> For those who can’t attend in person, we’ll be broadcasting live - and
>>> posting the recording afterward.
>>>
>>> All details are in the meetup link above…
>>>
>>> @holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than
>>> welcome to give a talk. I can move things around to make room.
>>>
>>> @joseph:  I’d personally like an update on the direction of the
>>> Databricks proprietary ML Serving export format which is similar to PMML
>>> but not a standard in any way.
>>>
>>> Also, the Databricks ML Serving Runtime is only available to Databricks
>>> customers.  This seems in conflict with the community efforts described
>>> here.  Can you comment on behalf of Databricks?
>>>
>>> Look forward to your response, joseph.
>>>
>>> See you all soon!
>>>
>>> —
>>>
>>>
>>> *Chris Fregly *Founder @ *PipelineAI* <https://pipeline.ai/> (100,000
>>> Users)
>>> Organizer @ *Advanced Spark and TensorFlow Meetup*
>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/> (85,000
>>> Global Members)
>>>
>>>
>>>
>>> *San Francisco - Chicago - Austin -  Washington DC - London - Dusseldorf
>>> *
>>> *Try our PipelineAI Community Edition with GPUs and TPUs!!
>>> <http://community.pipeline.ai/>*
>>>
>>>
>>> On May 30, 2018, at 9:32 AM, Felix Cheung <fe...@hotmail.com>
>>> wrote:
>>>
>>> Hi!
>>>
>>> Thank you! Let’s meet then
>>>
>>> June 6 4pm
>>>
>>> Moscone West Convention Center
>>> 800 Howard Street, San Francisco, CA 94103
>>> <https://maps.google.com/?q=800+Howard+Street,+San+Francisco,+CA+94103&entry=gmail&source=g>
>>>
>>> Ground floor (outside of conference area - should be available for all)
>>> - we will meet and decide where to go
>>>
>>> (Would not send invite because that would be too much noise for dev@)
>>>
>>> To paraphrase Joseph, we will use this to kick off the discussion and
>>> post notes after and follow up online. As for Seattle, I would be very
>>> interested to meet in person later and discuss ;)
>>>
>>>
>>> _____________________________
>>> From: Saikat Kanjilal <sx...@hotmail.com>
>>> Sent: Tuesday, May 29, 2018 11:46 AM
>>> Subject: Re: Revisiting Online serving of Spark models?
>>> To: Maximiliano Felice <ma...@gmail.com>
>>> Cc: Felix Cheung <fe...@hotmail.com>, Holden Karau <
>>> holden@pigscanfly.ca>, Joseph Bradley <jo...@databricks.com>, Leif
>>> Walsh <le...@gmail.com>, dev <de...@spark.apache.org>
>>>
>>>
>>> Would love to join but am in Seattle, thoughts on how to make this work?
>>>
>>> Regards
>>>
>>> Sent from my iPhone
>>>
>>> On May 29, 2018, at 10:35 AM, Maximiliano Felice <
>>> maximilianofelice@gmail.com> wrote:
>>>
>>> Big +1 to a meeting with fresh air.
>>>
>>> Could anyone send the invites? I don't really know which is the place
>>> Holden is talking about.
>>>
>>> 2018-05-29 14:27 GMT-03:00 Felix Cheung <fe...@hotmail.com>:
>>>
>>>> You had me at blue bottle!
>>>>
>>>> _____________________________
>>>> From: Holden Karau <ho...@pigscanfly.ca>
>>>> Sent: Tuesday, May 29, 2018 9:47 AM
>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>> To: Felix Cheung <fe...@hotmail.com>
>>>> Cc: Saikat Kanjilal <sx...@hotmail.com>, Maximiliano Felice <
>>>> maximilianofelice@gmail.com>, Joseph Bradley <jo...@databricks.com>,
>>>> Leif Walsh <le...@gmail.com>, dev <de...@spark.apache.org>
>>>>
>>>>
>>>>
>>>> I'm down for that; we could all go for a walk, maybe to the Mint Plaza
>>>> Blue Bottle, and grab coffee (if the weather holds, have our design
>>>> meeting outside :p)?
>>>>
>>>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <
>>>> felixcheung_m@hotmail.com> wrote:
>>>>
>>>>> Bump.
>>>>>
>>>>> ------------------------------
>>>>> *From:* Felix Cheung <fe...@hotmail.com>
>>>>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>>>>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>>>>> *Cc:* Leif Walsh; Holden Karau; dev
>>>>>
>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>
>>>>> Hi! How about we meet the community and discuss on June 6 4pm at
>>>>> (near) the Summit?
>>>>>
>>>>> (I propose we meet at the venue entrance so we can accommodate
>>>>> people who might not be in the conference)
>>>>>
>>>>> ------------------------------
>>>>> *From:* Saikat Kanjilal <sx...@hotmail.com>
>>>>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>>>>> *To:* Maximiliano Felice
>>>>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>
>>>>> I’m in the same exact boat as Maximiliano and have use cases as well
>>>>> for model serving and would love to join this discussion.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <
>>>>> maximilianofelice@gmail.com> wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>> I don't usually write a lot on this list, but I keep up to date with
>>>>> the discussions and I'm a heavy user of Spark. This topic caught my
>>>>> attention, as we're currently facing this issue at work. I'm attending
>>>>> the summit and was wondering if it would be possible for me to join that
>>>>> meeting. I might be able to share some helpful use cases and ideas.
>>>>>
>>>>> Thanks,
>>>>> Maximiliano Felice
>>>>>
>>>>> El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <le...@gmail.com>
>>>>> escribió:
>>>>>
>>>>>> I’m with you on json being more readable than parquet, but we’ve had
>>>>>> success using pyarrow’s parquet reader and have been quite happy with it so
>>>>>> far. If your target is python (and probably if not now, then soon, R), you
>>>>>> should look into it.
>>>>>>
>>>>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <jo...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Regarding model reading and writing, I'll give quick thoughts here:
>>>>>>> * Our approach was to use the same format but write JSON instead of
>>>>>>> Parquet.  It's easier to parse JSON without Spark, and using the same
>>>>>>> format simplifies architecture.  Plus, some people want to check files into
>>>>>>> version control, and JSON is nice for that.
>>>>>>> * The reader/writer APIs could be extended to take format parameters
>>>>>>> (just like DataFrame reader/writers) to handle JSON (and maybe, eventually,
>>>>>>> handle Parquet in the online serving setting).
>>>>>>>
>>>>>>> This would be a big project, so proposing a SPIP might be best.  If
>>>>>>> people are around at the Spark Summit, that could be a good time to meet up
>>>>>>> & then post notes back to the dev list.
>>>>>>>
>>>>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <
>>>>>>> felixcheung_m@hotmail.com> wrote:
>>>>>>>
>>>>>>>> Specifically, I’d like to bring part of the discussion to Model and
>>>>>>>> PipelineModel, and the various ModelReader and SharedReadWrite
>>>>>>>> implementations that rely on SparkContext. This is a big blocker on
>>>>>>>> reusing trained models outside of Spark for online serving.
>>>>>>>>
>>>>>>>> What’s the next step? Would folks be interested in getting together
>>>>>>>> to discuss/get some feedback?
>>>>>>>>
>>>>>>>>
>>>>>>>> _____________________________
>>>>>>>> From: Felix Cheung <fe...@hotmail.com>
>>>>>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>>>>> To: Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <
>>>>>>>> joseph@databricks.com>
>>>>>>>> Cc: dev <de...@spark.apache.org>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Huge +1 on this!
>>>>>>>>
>>>>>>>> ------------------------------
>>>>>>>> *From:*holden.karau@gmail.com <ho...@gmail.com> on behalf
>>>>>>>> of Holden Karau <ho...@pigscanfly.ca>
>>>>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>>>>>> *To:* Joseph Bradley
>>>>>>>> *Cc:* dev
>>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <
>>>>>>>> joseph@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks for bringing this up Holden!  I'm a strong supporter of
>>>>>>>>> this.
>>>>>>>>>
>>>>>>>>> Awesome! I'm glad other folks think something like this belongs in
>>>>>>>> Spark.
>>>>>>>>
>>>>>>>>> This was one of the original goals for mllib-local: to have local
>>>>>>>>> versions of MLlib models which could be deployed without the big Spark JARs
>>>>>>>>> and without a SparkContext or SparkSession.  There are related commercial
>>>>>>>>> offerings like this : ) but the overhead of maintaining those offerings is
>>>>>>>>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>>>>>>>>> libraries will be well worth it.
>>>>>>>>>
>>>>>>>>> We've talked about this need at Databricks and have also been
>>>>>>>>> syncing with the creators of MLeap.  It'd be great to get this
>>>>>>>>> functionality into Spark itself.  Some thoughts:
>>>>>>>>> * It'd be valuable to have this go beyond adding transform()
>>>>>>>>> methods taking a Row to the current Models.  Instead, it would be ideal to
>>>>>>>>> have local, lightweight versions of models in mllib-local, outside of the
>>>>>>>>> main mllib package (for easier deployment with smaller & fewer
>>>>>>>>> dependencies).
>>>>>>>>> * Supporting Pipelines is important.  For this, it would be ideal
>>>>>>>>> to utilize elements of Spark SQL, particularly Rows and Types, which could
>>>>>>>>> be moved into a local sql package.
>>>>>>>>> * This architecture may require some awkward APIs currently to
>>>>>>>>> have model prediction logic in mllib-local, local model classes in
>>>>>>>>> mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We
>>>>>>>>> might find it helpful to break some DeveloperApis in Spark 3.0 to
>>>>>>>>> facilitate this architecture while making it feasible for 3rd party
>>>>>>>>> developers to extend MLlib APIs (especially in Java).
>>>>>>>>>
>>>>>>>> I agree this could be interesting, and feed into the other
>>>>>>>> discussion around when (or if) we should be considering Spark 3.0.
>>>>>>>> I _think_ we could probably do it with optional traits people could
>>>>>>>> mix in to avoid breaking the current APIs, but I could be wrong on that
>>>>>>>> point.
>>>>>>>>
>>>>>>>>> * It could also be worth discussing local DataFrames.  They might
>>>>>>>>> not be as important as per-Row transformations, but they would be helpful
>>>>>>>>> for batching for higher throughput.
>>>>>>>>>
>>>>>>>> That could be interesting as well.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'll be interested to hear others' thoughts too!
>>>>>>>>>
>>>>>>>>> Joseph
>>>>>>>>>
>>>>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <holden@pigscanfly.ca
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> Hi y'all,
>>>>>>>>>>
>>>>>>>>>> With the renewed interest in ML in Apache Spark, now seems like as
>>>>>>>>>> good a time as any to revisit the online serving situation in Spark ML. DB
>>>>>>>>>> & others have done some excellent work moving a lot of the necessary
>>>>>>>>>> tools into a local linear algebra package that doesn't depend on having a
>>>>>>>>>> SparkContext.
>>>>>>>>>>
>>>>>>>>>> There are a few different commercial and non-commercial solutions
>>>>>>>>>> around this, but currently our individual transform/predict methods are
>>>>>>>>>> private, so they either need to copy or re-implement (or put themselves in
>>>>>>>>>> org.apache.spark) to access them. How would folks feel about adding a new
>>>>>>>>>> trait for ML pipeline stages that exposes transformation of single-element
>>>>>>>>>> inputs (or local collections), to be optionally implemented
>>>>>>>>>> by stages which support it? That way we can have less copy and paste code
>>>>>>>>>> possibly getting out of sync with our model training.
>>>>>>>>>>
>>>>>>>>>> I think continuing to have online serving grow in different
>>>>>>>>>> projects is probably the right path forward (folks have different needs),
>>>>>>>>>> but I'd love to see us make it simpler for other projects to build reliable
>>>>>>>>>> serving tools.
>>>>>>>>>>
>>>>>>>>>> I realize this may put some of the folks in an awkward
>>>>>>>>>> position with their own commercial offerings, but hopefully if we make it
>>>>>>>>>> easier for everyone, the commercial vendors can benefit as well.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Holden :)
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Joseph Bradley
>>>>>>>>> Software Engineer - Machine Learning
>>>>>>>>> Databricks, Inc.
>>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Joseph Bradley
>>>>>>> Software Engineer - Machine Learning
>>>>>>> Databricks, Inc.
>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>
>>>>>> --
>>>>>> --
>>>>>> Cheers,
>>>>>> Leif
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>

Re: Revisiting Online serving of Spark models?

Posted by Holden Karau <ho...@pigscanfly.ca>.
On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <
maximilianofelice@gmail.com> wrote:

> Hi!
>
> We're already in San Francisco waiting for the summit. We even think that
> we spotted @holdenk this afternoon.
>
Unless you happened to be walking by my garage, probably not super likely;
I spent the day working on scooters/motorcycles (my style is a little less
unique in SF :)). Also, if you see me, feel free to say hi unless I look like
I haven't had my first coffee of the day; I love chatting with folks IRL :)

>
> @chris, we're really interested in the Meetup you're hosting. My team will
> probably join it from the beginning if you have room for us, and I'll join
> it later after discussing the topics on this thread. I'll send you an email
> regarding this request.
>
> Thanks
>
> El vie., 1 de jun. de 2018 7:26 AM, Saikat Kanjilal <sx...@hotmail.com>
> escribió:
>
>> @Chris This sounds fantastic, please send summary notes for Seattle folks
>>
>> @Felix I work in downtown Seattle, and am wondering if we should have a tech
>> meetup around model serving in Spark at my work or elsewhere close,
>> thoughts?  I’m actually in the midst of building microservices to manage
>> models and when I say models I mean much more than machine learning models
>> (think OR, process models as well)
>>
>> Regards
>>
>> Sent from my iPhone
>>
>> On May 31, 2018, at 10:32 PM, Chris Fregly <ch...@fregly.com> wrote:
>>
>> Hey everyone!
>>
>> @Felix:  thanks for putting this together.  i sent some of you a quick
>> calendar event - mostly for me, so i don’t forget!  :)
>>
>> Coincidentally, this is the focus of June 6th's *Advanced Spark and
>> TensorFlow Meetup*
>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/> @5:30pm
>> (same night) here in SF!
>>
>> Everybody is welcome to come.  Here’s the link to the meetup that
>> includes the signup link:
>> *https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/*
>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/>
>>
>> We have an awesome lineup of speakers covering a lot of deep, technical
>> ground.
>>
>> For those who can’t attend in person, we’ll be broadcasting live - and
>> posting the recording afterward.
>>
>> All details are in the meetup link above…
>>
>> @holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than
>> welcome to give a talk. I can move things around to make room.
>>
>> @joseph:  I’d personally like an update on the direction of the
>> Databricks proprietary ML Serving export format which is similar to PMML
>> but not a standard in any way.
>>
>> Also, the Databricks ML Serving Runtime is only available to Databricks
>> customers.  This seems in conflict with the community efforts described
>> here.  Can you comment on behalf of Databricks?
>>
>> Look forward to your response, joseph.
>>
>> See you all soon!
>>
>> —
>>
>>
>> *Chris Fregly *Founder @ *PipelineAI* <https://pipeline.ai/> (100,000
>> Users)
>> Organizer @ *Advanced Spark and TensorFlow Meetup*
>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/> (85,000
>> Global Members)
>>
>>
>>
>> *San Francisco - Chicago - Austin -  Washington DC - London - Dusseldorf *
>> *Try our PipelineAI Community Edition with GPUs and TPUs!!
>> <http://community.pipeline.ai/>*
>>
>>
>> On May 30, 2018, at 9:32 AM, Felix Cheung <fe...@hotmail.com>
>> wrote:
>>
>> Hi!
>>
>> Thank you! Let’s meet then
>>
>> June 6 4pm
>>
>> Moscone West Convention Center
>> 800 Howard Street, San Francisco, CA 94103
>> <https://maps.google.com/?q=800+Howard+Street,+San+Francisco,+CA+94103&entry=gmail&source=g>
>>
>> Ground floor (outside of conference area - should be available for all) -
>> we will meet and decide where to go
>>
>> (Would not send invite because that would be too much noise for dev@)
>>
>> To paraphrase Joseph, we will use this to kick off the discussion and
>> post notes after and follow up online. As for Seattle, I would be very
>> interested to meet in person later and discuss ;)
>>
>>
>> _____________________________
>> From: Saikat Kanjilal <sx...@hotmail.com>
>> Sent: Tuesday, May 29, 2018 11:46 AM
>> Subject: Re: Revisiting Online serving of Spark models?
>> To: Maximiliano Felice <ma...@gmail.com>
>> Cc: Felix Cheung <fe...@hotmail.com>, Holden Karau <
>> holden@pigscanfly.ca>, Joseph Bradley <jo...@databricks.com>, Leif
>> Walsh <le...@gmail.com>, dev <de...@spark.apache.org>
>>
>>
>> Would love to join but am in Seattle, thoughts on how to make this work?
>>
>> Regards
>>
>> Sent from my iPhone
>>
>> On May 29, 2018, at 10:35 AM, Maximiliano Felice <
>> maximilianofelice@gmail.com> wrote:
>>
>> Big +1 to a meeting with fresh air.
>>
>> Could anyone send the invites? I don't really know which is the place
>> Holden is talking about.
>>
>> 2018-05-29 14:27 GMT-03:00 Felix Cheung <fe...@hotmail.com>:
>>
>>> You had me at blue bottle!
>>>
>>> _____________________________
>>> From: Holden Karau <ho...@pigscanfly.ca>
>>> Sent: Tuesday, May 29, 2018 9:47 AM
>>> Subject: Re: Revisiting Online serving of Spark models?
>>> To: Felix Cheung <fe...@hotmail.com>
>>> Cc: Saikat Kanjilal <sx...@hotmail.com>, Maximiliano Felice <
>>> maximilianofelice@gmail.com>, Joseph Bradley <jo...@databricks.com>,
>>> Leif Walsh <le...@gmail.com>, dev <de...@spark.apache.org>
>>>
>>>
>>>
>>> I'm down for that; we could all go for a walk, maybe to the Mint Plaza
>>> Blue Bottle, and grab coffee (if the weather holds, we can have our design
>>> meeting outside :p)?
>>>
>>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <felixcheung_m@hotmail.com
>>> > wrote:
>>>
>>>> Bump.
>>>>
>>>> ------------------------------
>>>> *From:* Felix Cheung <fe...@hotmail.com>
>>>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>>>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>>>> *Cc:* Leif Walsh; Holden Karau; dev
>>>>
>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>
>>>> Hi! How about we meet with the community and discuss on June 6 at 4pm at
>>>> (near) the Summit?
>>>>
>>>> (I propose we meet at the venue entrance so we could accommodate people
>>>> who might not be in the conference)
>>>>
>>>> ------------------------------
>>>> *From:* Saikat Kanjilal <sx...@hotmail.com>
>>>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>>>> *To:* Maximiliano Felice
>>>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>
>>>> I’m in the same exact boat as Maximiliano and have use cases as well
>>>> for model serving and would love to join this discussion.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <
>>>> maximilianofelice@gmail.com> wrote:
>>>>
>>>> Hi!
>>>>
>>>> I don't usually write a lot on this list, but I keep up to date with
>>>> the discussions and I'm a heavy user of Spark. This topic caught my
>>>> attention, as we're currently facing this issue at work. I'm attending
>>>> the summit and was wondering if it would be possible for me to join that
>>>> meeting. I might be able to share some helpful use cases and ideas.
>>>>
>>>> Thanks,
>>>> Maximiliano Felice
>>>>
>>>> El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <le...@gmail.com>
>>>> escribió:
>>>>
>>>>> I’m with you on json being more readable than parquet, but we’ve had
>>>>> success using pyarrow’s parquet reader and have been quite happy with it so
>>>>> far. If your target is Python (and probably if not now, then soon, R), you
>>>>> should look into it.
>>>>>
>>>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <jo...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Regarding model reading and writing, I'll give quick thoughts here:
>>>>>> * Our approach was to use the same format but write JSON instead of
>>>>>> Parquet.  It's easier to parse JSON without Spark, and using the same
>>>>>> format simplifies architecture.  Plus, some people want to check files into
>>>>>> version control, and JSON is nice for that.
>>>>>> * The reader/writer APIs could be extended to take format parameters
>>>>>> (just like DataFrame reader/writers) to handle JSON (and maybe, eventually,
>>>>>> handle Parquet in the online serving setting).
>>>>>>
>>>>>> This would be a big project, so proposing a SPIP might be best.  If
>>>>>> people are around at the Spark Summit, that could be a good time to meet up
>>>>>> & then post notes back to the dev list.
>>>>>>
>>>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <
>>>>>> felixcheung_m@hotmail.com> wrote:
>>>>>>
>>>>>>> Specifically, I’d like to bring part of the discussion to Model and
>>>>>>> PipelineModel, and the various ModelReader and SharedReadWrite
>>>>>>> implementations that rely on SparkContext. This is a big blocker on
>>>>>>> reusing trained models outside of Spark for online serving.
>>>>>>>
>>>>>>> What’s the next step? Would folks be interested in getting together
>>>>>>> to discuss/get some feedback?
>>>>>>>
>>>>>>>
>>>>>>> _____________________________
>>>>>>> From: Felix Cheung <fe...@hotmail.com>
>>>>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>>>> To: Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <
>>>>>>> joseph@databricks.com>
>>>>>>> Cc: dev <de...@spark.apache.org>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Huge +1 on this!
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> *From:*holden.karau@gmail.com <ho...@gmail.com> on behalf of
>>>>>>> Holden Karau <ho...@pigscanfly.ca>
>>>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>>>>> *To:* Joseph Bradley
>>>>>>> *Cc:* dev
>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <
>>>>>>> joseph@databricks.com> wrote:
>>>>>>>
>>>>>>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>>>>>>>
>>>>>>>> Awesome! I'm glad other folks think something like this belongs in
>>>>>>> Spark.
>>>>>>>
>>>>>>>> This was one of the original goals for mllib-local: to have local
>>>>>>>> versions of MLlib models which could be deployed without the big Spark JARs
>>>>>>>> and without a SparkContext or SparkSession.  There are related commercial
>>>>>>>> offerings like this : ) but the overhead of maintaining those offerings is
>>>>>>>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>>>>>>>> libraries will be well worth it.
>>>>>>>>
>>>>>>>> We've talked about this need at Databricks and have also been
>>>>>>>> syncing with the creators of MLeap.  It'd be great to get this
>>>>>>>> functionality into Spark itself.  Some thoughts:
>>>>>>>> * It'd be valuable to have this go beyond adding transform()
>>>>>>>> methods taking a Row to the current Models.  Instead, it would be ideal to
>>>>>>>> have local, lightweight versions of models in mllib-local, outside of the
>>>>>>>> main mllib package (for easier deployment with smaller & fewer
>>>>>>>> dependencies).
>>>>>>>> * Supporting Pipelines is important.  For this, it would be ideal
>>>>>>>> to utilize elements of Spark SQL, particularly Rows and Types, which could
>>>>>>>> be moved into a local sql package.
>>>>>>>> * This architecture may require some awkward APIs currently to have
>>>>>>>> model prediction logic in mllib-local, local model classes in mllib-local,
>>>>>>>> and regular (DataFrame-friendly) model classes in mllib.  We might find it
>>>>>>>> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>>>>>>>> architecture while making it feasible for 3rd party developers to extend
>>>>>>>> MLlib APIs (especially in Java).
>>>>>>>>
>>>>>>> I agree this could be interesting, and it feeds into the other
>>>>>>> discussion around when (or if) we should be considering Spark 3.0.
>>>>>>> I _think_ we could probably do it with optional traits people could
>>>>>>> mix in to avoid breaking the current APIs, but I could be wrong on that
>>>>>>> point.
>>>>>>>
>>>>>>>> * It could also be worth discussing local DataFrames.  They might
>>>>>>>> not be as important as per-Row transformations, but they would be helpful
>>>>>>>> for batching for higher throughput.
>>>>>>>>
>>>>>>> That could be interesting as well.
>>>>>>>
>>>>>>>>
>>>>>>>> I'll be interested to hear others' thoughts too!
>>>>>>>>
>>>>>>>> Joseph
>>>>>>>>
>>>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi y'all,
>>>>>>>>>
>>>>>>>>> With the renewed interest in ML in Apache Spark, now seems like as
>>>>>>>>> good a time as any to revisit the online serving situation in Spark ML. DB
>>>>>>>>> & others have done some excellent work moving a lot of the necessary
>>>>>>>>> tools into a local linear algebra package that doesn't depend on having a
>>>>>>>>> SparkContext.
>>>>>>>>>
>>>>>>>>> There are a few different commercial and non-commercial solutions
>>>>>>>>> around this, but currently our individual transform/predict methods are
>>>>>>>>> private, so they either need to copy or re-implement them (or put themselves
>>>>>>>>> in org.apache.spark) to access them. How would folks feel about adding a new
>>>>>>>>> trait for ML pipeline stages to expose transformation of single-element
>>>>>>>>> inputs (or local collections), which could be optionally implemented by
>>>>>>>>> stages that support this? That way we can have less copy-and-paste code
>>>>>>>>> possibly getting out of sync with our model training.
>>>>>>>>>
>>>>>>>>> I think continuing to have online serving grow in different
>>>>>>>>> projects is probably the right path forward (folks have different needs),
>>>>>>>>> but I'd love to see us make it simpler for other projects to build reliable
>>>>>>>>> serving tools.
>>>>>>>>>
>>>>>>>>> I realize this may put some of the folks in an awkward position
>>>>>>>>> with their own commercial offerings, but hopefully if we make it easier for
>>>>>>>>> everyone, the commercial vendors can benefit as well.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Holden :)
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Joseph Bradley
>>>>>>>> Software Engineer - Machine Learning
>>>>>>>> Databricks, Inc.
>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Joseph Bradley
>>>>>> Software Engineer - Machine Learning
>>>>>> Databricks, Inc.
>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>
>>>>> --
>>>>> --
>>>>> Cheers,
>>>>> Leif
>>>>>
>>>>
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>>
>>>
>>
>>
>>
>>


-- 
Twitter: https://twitter.com/holdenkarau
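
For concreteness, the optional single-element trait discussed in this thread could look roughly like the sketch below. It is written in Java (an interface with a default method standing in for a Scala trait), and every name here (LocalTransformer, transformOne, ScalerModel) is purely illustrative, not an actual Spark API:

```java
// Hypothetical sketch: an optional interface that ML pipeline stages could
// implement to support single-element (local) prediction without a SparkContext.
// All names here are illustrative, not actual Spark APIs.
import java.util.List;
import java.util.stream.Collectors;

interface LocalTransformer<I, O> {
    /** Transform a single input element locally, without Spark. */
    O transformOne(I input);

    /** Transform a local collection by reusing the per-element logic. */
    default List<O> transformLocal(List<I> inputs) {
        return inputs.stream().map(this::transformOne).collect(Collectors.toList());
    }
}

/** Toy stand-in for a trained model: scales each feature by a constant. */
class ScalerModel implements LocalTransformer<double[], double[]> {
    private final double factor;
    ScalerModel(double factor) { this.factor = factor; }

    @Override
    public double[] transformOne(double[] input) {
        double[] out = new double[input.length];
        for (int i = 0; i < input.length; i++) out[i] = input[i] * factor;
        return out;
    }
}

public class LocalServingSketch {
    public static void main(String[] args) {
        ScalerModel model = new ScalerModel(2.0);
        double[] out = model.transformOne(new double[]{1.0, 2.5});
        System.out.println(out[0] + "," + out[1]);  // prints "2.0,5.0"
    }
}
```

A real version would live alongside the existing PipelineStage hierarchy, so that stages which support local prediction could opt in without breaking the current APIs.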

Re: Revisiting Online serving of Spark models?

Posted by Maximiliano Felice <ma...@gmail.com>.
Hi!

We're already in San Francisco waiting for the summit. We even think that
we spotted @holdenk this afternoon.

@chris, we're really interested in the Meetup you're hosting. My team will
probably join it from the beginning if you have room for us, and I'll join
it later after discussing the topics on this thread. I'll send you an email
regarding this request.

Thanks

El vie., 1 de jun. de 2018 7:26 AM, Saikat Kanjilal <sx...@hotmail.com>
escribió:

> @Chris This sounds fantastic, please send summary notes for Seattle folks
>
> @Felix I work in downtown Seattle, and am wondering if we should have a tech
> meetup around model serving in Spark at my work or elsewhere close, thoughts?  I’m
> actually in the midst of building microservices to manage models and when I
> say models I mean much more than machine learning models (think OR, process
> models as well)
>
> Regards
>
> Sent from my iPhone
>
> On May 31, 2018, at 10:32 PM, Chris Fregly <ch...@fregly.com> wrote:
>
> Hey everyone!
>
> @Felix:  thanks for putting this together.  i sent some of you a quick
> calendar event - mostly for me, so i don’t forget!  :)
>
> Coincidentally, this is the focus of June 6th's *Advanced Spark and
> TensorFlow Meetup*
> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/> @5:30pm
> (same night) here in SF!
>
> Everybody is welcome to come.  Here’s the link to the meetup that includes
> the signup link:
> *https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/*
> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/>
>
> We have an awesome lineup of speakers covering a lot of deep, technical
> ground.
>
> For those who can’t attend in person, we’ll be broadcasting live - and
> posting the recording afterward.
>
> All details are in the meetup link above…
>
> @holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than
> welcome to give a talk. I can move things around to make room.
>
> @joseph:  I’d personally like an update on the direction of the Databricks
> proprietary ML Serving export format which is similar to PMML but not a
> standard in any way.
>
> Also, the Databricks ML Serving Runtime is only available to Databricks
> customers.  This seems in conflict with the community efforts described
> here.  Can you comment on behalf of Databricks?
>
> Look forward to your response, joseph.
>
> See you all soon!
>
> —
>
>
> *Chris Fregly *Founder @ *PipelineAI* <https://pipeline.ai/> (100,000
> Users)
> Organizer @ *Advanced Spark and TensorFlow Meetup*
> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/> (85,000
> Global Members)
>
>
>
> *San Francisco - Chicago - Austin -  Washington DC - London - Dusseldorf *
> *Try our PipelineAI Community Edition with GPUs and TPUs!!
> <http://community.pipeline.ai/>*
>
>
> On May 30, 2018, at 9:32 AM, Felix Cheung <fe...@hotmail.com>
> wrote:
>
> Hi!
>
> Thank you! Let’s meet then
>
> June 6 4pm
>
> Moscone West Convention Center
> 800 Howard Street, San Francisco, CA 94103
> <https://maps.google.com/?q=800+Howard+Street,+San+Francisco,+CA+94103&entry=gmail&source=g>
>
> Ground floor (outside of conference area - should be available for all) -
> we will meet and decide where to go
>
> (Would not send invite because that would be too much noise for dev@)
>
> To paraphrase Joseph, we will use this to kick off the discussion and
> post notes after and follow up online. As for Seattle, I would be very
> interested to meet in person later and discuss ;)
>
>
> _____________________________
> From: Saikat Kanjilal <sx...@hotmail.com>
> Sent: Tuesday, May 29, 2018 11:46 AM
> Subject: Re: Revisiting Online serving of Spark models?
> To: Maximiliano Felice <ma...@gmail.com>
> Cc: Felix Cheung <fe...@hotmail.com>, Holden Karau <
> holden@pigscanfly.ca>, Joseph Bradley <jo...@databricks.com>, Leif Walsh
> <le...@gmail.com>, dev <de...@spark.apache.org>
>
>
> Would love to join but am in Seattle, thoughts on how to make this work?
>
> Regards
>
> Sent from my iPhone
>
> On May 29, 2018, at 10:35 AM, Maximiliano Felice <
> maximilianofelice@gmail.com> wrote:
>
> Big +1 to a meeting with fresh air.
>
> Could anyone send the invites? I don't really know which is the place
> Holden is talking about.
>
> 2018-05-29 14:27 GMT-03:00 Felix Cheung <fe...@hotmail.com>:
>
>> You had me at blue bottle!
>>
>> _____________________________
>> From: Holden Karau <ho...@pigscanfly.ca>
>> Sent: Tuesday, May 29, 2018 9:47 AM
>> Subject: Re: Revisiting Online serving of Spark models?
>> To: Felix Cheung <fe...@hotmail.com>
>> Cc: Saikat Kanjilal <sx...@hotmail.com>, Maximiliano Felice <
>> maximilianofelice@gmail.com>, Joseph Bradley <jo...@databricks.com>,
>> Leif Walsh <le...@gmail.com>, dev <de...@spark.apache.org>
>>
>>
>>
>> I'm down for that; we could all go for a walk, maybe to the Mint Plaza
>> Blue Bottle, and grab coffee (if the weather holds, we can have our design
>> meeting outside :p)?
>>
>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <fe...@hotmail.com>
>> wrote:
>>
>>> Bump.
>>>
>>> ------------------------------
>>> *From:* Felix Cheung <fe...@hotmail.com>
>>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>>> *Cc:* Leif Walsh; Holden Karau; dev
>>>
>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>
>>> Hi! How about we meet with the community and discuss on June 6 at 4pm at
>>> (near) the Summit?
>>>
>>> (I propose we meet at the venue entrance so we could accommodate people
>>> who might not be in the conference)
>>>
>>> ------------------------------
>>> *From:* Saikat Kanjilal <sx...@hotmail.com>
>>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>>> *To:* Maximiliano Felice
>>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>
>>> I’m in the same exact boat as Maximiliano and have use cases as well for
>>> model serving and would love to join this discussion.
>>>
>>> Sent from my iPhone
>>>
>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <
>>> maximilianofelice@gmail.com> wrote:
>>>
>>> Hi!
>>>
>>> I don't usually write a lot on this list, but I keep up to date with
>>> the discussions and I'm a heavy user of Spark. This topic caught my
>>> attention, as we're currently facing this issue at work. I'm attending
>>> the summit and was wondering if it would be possible for me to join that
>>> meeting. I might be able to share some helpful use cases and ideas.
>>>
>>> Thanks,
>>> Maximiliano Felice
>>>
>>> El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <le...@gmail.com>
>>> escribió:
>>>
>>>> I’m with you on json being more readable than parquet, but we’ve had
>>>> success using pyarrow’s parquet reader and have been quite happy with it so
>>>> far. If your target is Python (and probably if not now, then soon, R), you
>>>> should look into it.
>>>>
>>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <jo...@databricks.com>
>>>> wrote:
>>>>
>>>>> Regarding model reading and writing, I'll give quick thoughts here:
>>>>> * Our approach was to use the same format but write JSON instead of
>>>>> Parquet.  It's easier to parse JSON without Spark, and using the same
>>>>> format simplifies architecture.  Plus, some people want to check files into
>>>>> version control, and JSON is nice for that.
>>>>> * The reader/writer APIs could be extended to take format parameters
>>>>> (just like DataFrame reader/writers) to handle JSON (and maybe, eventually,
>>>>> handle Parquet in the online serving setting).
>>>>>
>>>>> This would be a big project, so proposing a SPIP might be best.  If
>>>>> people are around at the Spark Summit, that could be a good time to meet up
>>>>> & then post notes back to the dev list.
>>>>>
>>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <
>>>>> felixcheung_m@hotmail.com> wrote:
>>>>>
>>>>>> Specifically, I’d like to bring part of the discussion to Model and
>>>>>> PipelineModel, and the various ModelReader and SharedReadWrite
>>>>>> implementations that rely on SparkContext. This is a big blocker on
>>>>>> reusing trained models outside of Spark for online serving.
>>>>>>
>>>>>> What’s the next step? Would folks be interested in getting together
>>>>>> to discuss/get some feedback?
>>>>>>
>>>>>>
>>>>>> _____________________________
>>>>>> From: Felix Cheung <fe...@hotmail.com>
>>>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>>> To: Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <
>>>>>> joseph@databricks.com>
>>>>>> Cc: dev <de...@spark.apache.org>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Huge +1 on this!
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:*holden.karau@gmail.com <ho...@gmail.com> on behalf of
>>>>>> Holden Karau <ho...@pigscanfly.ca>
>>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>>>> *To:* Joseph Bradley
>>>>>> *Cc:* dev
>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <
>>>>>> joseph@databricks.com> wrote:
>>>>>>
>>>>>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>>>>>>
>>>>>>> Awesome! I'm glad other folks think something like this belongs in
>>>>>> Spark.
>>>>>>
>>>>>>> This was one of the original goals for mllib-local: to have local
>>>>>>> versions of MLlib models which could be deployed without the big Spark JARs
>>>>>>> and without a SparkContext or SparkSession.  There are related commercial
>>>>>>> offerings like this : ) but the overhead of maintaining those offerings is
>>>>>>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>>>>>>> libraries will be well worth it.
>>>>>>>
>>>>>>> We've talked about this need at Databricks and have also been
>>>>>>> syncing with the creators of MLeap.  It'd be great to get this
>>>>>>> functionality into Spark itself.  Some thoughts:
>>>>>>> * It'd be valuable to have this go beyond adding transform() methods
>>>>>>> taking a Row to the current Models.  Instead, it would be ideal to have
>>>>>>> local, lightweight versions of models in mllib-local, outside of the main
>>>>>>> mllib package (for easier deployment with smaller & fewer dependencies).
>>>>>>> * Supporting Pipelines is important.  For this, it would be ideal to
>>>>>>> utilize elements of Spark SQL, particularly Rows and Types, which could be
>>>>>>> moved into a local sql package.
>>>>>>> * This architecture may require some awkward APIs currently to have
>>>>>>> model prediction logic in mllib-local, local model classes in mllib-local,
>>>>>>> and regular (DataFrame-friendly) model classes in mllib.  We might find it
>>>>>>> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>>>>>>> architecture while making it feasible for 3rd party developers to extend
>>>>>>> MLlib APIs (especially in Java).
>>>>>>>
>>>>>> I agree this could be interesting, and it feeds into the other discussion
>>>>>> around when (or if) we should be considering Spark 3.0. I _think_ we
>>>>>> could probably do it with optional traits people could mix in to avoid
>>>>>> breaking the current APIs, but I could be wrong on that point.
>>>>>>
>>>>>>> * It could also be worth discussing local DataFrames.  They might
>>>>>>> not be as important as per-Row transformations, but they would be helpful
>>>>>>> for batching for higher throughput.
>>>>>>>
>>>>>> That could be interesting as well.
>>>>>>
>>>>>>>
>>>>>>> I'll be interested to hear others' thoughts too!
>>>>>>>
>>>>>>> Joseph
>>>>>>>
>>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi y'all,
>>>>>>>>
>>>>>>>> With the renewed interest in ML in Apache Spark, now seems like as
>>>>>>>> good a time as any to revisit the online serving situation in Spark ML. DB
>>>>>>>> & others have done some excellent work moving a lot of the necessary
>>>>>>>> tools into a local linear algebra package that doesn't depend on having a
>>>>>>>> SparkContext.
>>>>>>>>
>>>>>>>> There are a few different commercial and non-commercial solutions
>>>>>>>> around this, but currently our individual transform/predict methods are
>>>>>>>> private, so they either need to copy or re-implement them (or put themselves
>>>>>>>> in org.apache.spark) to access them. How would folks feel about adding a new
>>>>>>>> trait for ML pipeline stages to expose transformation of single-element
>>>>>>>> inputs (or local collections), which could be optionally implemented by
>>>>>>>> stages that support this? That way we can have less copy-and-paste code
>>>>>>>> possibly getting out of sync with our model training.
>>>>>>>>
>>>>>>>> I think continuing to have online serving grow in different
>>>>>>>> projects is probably the right path forward (folks have different needs),
>>>>>>>> but I'd love to see us make it simpler for other projects to build reliable
>>>>>>>> serving tools.
>>>>>>>>
>>>>>>>> I realize this may put some of the folks in an awkward position
>>>>>>>> with their own commercial offerings, but hopefully if we make it easier for
>>>>>>>> everyone, the commercial vendors can benefit as well.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Holden :)
>>>>>>>>
>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Joseph Bradley
>>>>>>> Software Engineer - Machine Learning
>>>>>>> Databricks, Inc.
>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Joseph Bradley
>>>>> Software Engineer - Machine Learning
>>>>> Databricks, Inc.
>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>
>>>> --
>>>> --
>>>> Cheers,
>>>> Leif
>>>>
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>>
>>
>
>
>
>
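
To make the model read/write discussion in this thread concrete, here is a minimal sketch of the JSON-instead-of-Parquet idea: persisting model parameters as plain JSON so a serving process can load them without Spark. The file layout and field names here are hypothetical, not MLlib's actual save format:

```java
// Hypothetical sketch of the idea discussed above: persist model params as
// JSON (instead of Parquet) so a serving process can read them without Spark.
// The file layout and field names are illustrative, not the actual MLlib format.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class JsonModelExport {
    /** Write a minimal JSON metadata file for a toy linear model. */
    static Path save(Path dir, String uid, double intercept, double[] coefficients)
            throws IOException {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"class\":\"LinearModel\",\"uid\":\"").append(uid)
          .append("\",\"intercept\":").append(intercept).append(",\"coefficients\":[");
        for (int i = 0; i < coefficients.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(coefficients[i]);
        }
        sb.append("]}");
        Path file = dir.resolve("metadata.json");
        Files.writeString(file, sb.toString());
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("model");
        Path f = save(dir, "linreg_1", 0.5, new double[]{1.0, -2.0});
        System.out.println(Files.readString(f));
        // prints {"class":"LinearModel","uid":"linreg_1","intercept":0.5,"coefficients":[1.0,-2.0]}
    }
}
```

A text format like this is also friendly to version control, which was one of the motivations mentioned above; a production version would of course use a real JSON library rather than string building.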

Re: Revisiting Online serving of Spark models?

Posted by Saikat Kanjilal <sx...@hotmail.com>.
@Chris This sounds fantastic, please send summary notes for Seattle folks

@Felix I work in downtown Seattle, and am wondering if we should have a tech meetup around model serving in Spark at my work or elsewhere close, thoughts?  I’m actually in the midst of building microservices to manage models and when I say models I mean much more than machine learning models (think OR, process models as well)

Regards

Sent from my iPhone

On May 31, 2018, at 10:32 PM, Chris Fregly <ch...@fregly.com>> wrote:

Hey everyone!

@Felix:  thanks for putting this together.  i sent some of you a quick calendar event - mostly for me, so i don’t forget!  :)

Coincidentally, this is the focus of June 6th's Advanced Spark and TensorFlow Meetup<https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/> @5:30pm (same night) here in SF!

Everybody is welcome to come.  Here’s the link to the meetup that includes the signup link:  https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/

We have an awesome lineup of speakers covering a lot of deep, technical ground.

For those who can’t attend in person, we’ll be broadcasting live - and posting the recording afterward.

All details are in the meetup link above…

@holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than welcome to give a talk. I can move things around to make room.

@joseph:  I’d personally like an update on the direction of the Databricks proprietary ML Serving export format which is similar to PMML but not a standard in any way.

Also, the Databricks ML Serving Runtime is only available to Databricks customers.  This seems in conflict with the community efforts described here.  Can you comment on behalf of Databricks?

Look forward to your response, joseph.

See you all soon!

—

Chris Fregly
Founder @ PipelineAI<https://pipeline.ai/> (100,000 Users)
Organizer @ Advanced Spark and TensorFlow Meetup<https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/> (85,000 Global Members)

San Francisco - Chicago - Austin -
Washington DC - London - Dusseldorf

Try our PipelineAI Community Edition with GPUs and TPUs!!<http://community.pipeline.ai/>


On May 30, 2018, at 9:32 AM, Felix Cheung <fe...@hotmail.com>> wrote:

Hi!

Thank you! Let’s meet then

June 6 4pm

Moscone West Convention Center
800 Howard Street, San Francisco, CA 94103

Ground floor (outside of conference area - should be available for all) - we will meet and decide where to go

(Would not send invite because that would be too much noise for dev@)

To paraphrase Joseph, we will use this to kick off the discussion and post notes after and follow up online. As for Seattle, I would be very interested to meet in person later and discuss ;)


_____________________________
From: Saikat Kanjilal <sx...@hotmail.com>>
Sent: Tuesday, May 29, 2018 11:46 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Maximiliano Felice <ma...@gmail.com>>
Cc: Felix Cheung <fe...@hotmail.com>>, Holden Karau <ho...@pigscanfly.ca>>, Joseph Bradley <jo...@databricks.com>>, Leif Walsh <le...@gmail.com>>, dev <de...@spark.apache.org>>


Would love to join but am in Seattle, thoughts on how to make this work?

Regards

Sent from my iPhone

On May 29, 2018, at 10:35 AM, Maximiliano Felice <ma...@gmail.com>> wrote:

Big +1 to a meeting with fresh air.

Could anyone send the invites? I don't really know which is the place Holden is talking about.

2018-05-29 14:27 GMT-03:00 Felix Cheung <fe...@hotmail.com>>:
You had me at blue bottle!

_____________________________
From: Holden Karau <ho...@pigscanfly.ca>>
Sent: Tuesday, May 29, 2018 9:47 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Felix Cheung <fe...@hotmail.com>>
Cc: Saikat Kanjilal <sx...@hotmail.com>>, Maximiliano Felice <ma...@gmail.com>>, Joseph Bradley <jo...@databricks.com>>, Leif Walsh <le...@gmail.com>>, dev <de...@spark.apache.org>>



I'm down for that, we could all go for a walk maybe to the Mint Plaza Blue Bottle and grab coffee (if the weather holds, we can have our design meeting outside :p)?

On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <fe...@hotmail.com>> wrote:
Bump.

________________________________
From: Felix Cheung <fe...@hotmail.com>>
Sent: Saturday, May 26, 2018 1:05:29 PM
To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
Cc: Leif Walsh; Holden Karau; dev

Subject: Re: Revisiting Online serving of Spark models?

Hi! How about we meet the community and discuss on June 6 4pm at (near) the Summit?

(I propose we meet at the venue entrance so we could accommodate people who might not be in the conference)

________________________________
From: Saikat Kanjilal <sx...@hotmail.com>>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?

I’m in the same exact boat as Maximiliano and have use cases as well for model serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice <ma...@gmail.com>> wrote:

Hi!

I don't usually write a lot on this list but I keep up to date with the discussions and I'm a heavy user of Spark. This topic caught my attention, as we're currently facing this issue at work. I'm attending the summit and was wondering if it would be possible for me to join that meeting. I might be able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <le...@gmail.com>> escribió:
I’m with you on JSON being more readable than Parquet, but we’ve had success using pyarrow’s parquet reader and have been quite happy with it so far. If your target is Python (and probably if not now, then soon, R), you should look into it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley <jo...@databricks.com>> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  It's easier to parse JSON without Spark, and using the same format simplifies architecture.  Plus, some people want to check files into version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are around at the Spark Summit, that could be a good time to meet up & then post notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <fe...@hotmail.com>> wrote:
Specifically I’d like to bring part of the discussion to Model and PipelineModel, and the various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker to reusing trained models outside of Spark for online serving.

What’s the next step? Would folks be interested in getting together to discuss/get some feedback?


_____________________________
From: Felix Cheung <fe...@hotmail.com>>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <ho...@pigscanfly.ca>>, Joseph Bradley <jo...@databricks.com>>
Cc: dev <de...@spark.apache.org>>



Huge +1 on this!

________________________________
From:holden.karau@gmail.com<ma...@gmail.com> <ho...@gmail.com>> on behalf of Holden Karau <ho...@pigscanfly.ca>>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jo...@databricks.com>> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession.  There are related commercial offerings like this : ) but the overhead of maintaining those offerings is pretty high.  Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.

We've talked about this need at Databricks and have also been syncing with the creators of MLeap.  It'd be great to get this functionality into Spark itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models.  Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
* This architecture may require some awkward APIs currently to have model prediction logic in mllib-local, local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd party developers to extend MLlib APIs (especially in Java).
I agree this could be interesting, and feed into the other discussion around when (or if) we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
That could be interesting as well.
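To make the batching point concrete, here is a minimal sketch (hypothetical, in Python) where a "local DataFrame" is just a list of rows scored in fixed-size chunks, which amortizes per-call overhead compared to one-row-at-a-time serving:

```python
# Illustrative only: score a local collection of rows in fixed-size
# batches. In a real implementation the inner loop would be vectorized.
def predict_one(weights, row):
    # Toy linear model: dot product of weights and row features.
    return sum(w * x for w, x in zip(weights, row))

def predict_batched(weights, rows, batch_size=256):
    predictions = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        predictions.extend(predict_one(weights, row) for row in batch)
    return predictions
```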

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca>> wrote:
Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a time as any to revisit the online serving situation in Spark ML. DB & others have done some excellent work moving a lot of the necessary tools into a local linear algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions around this, but currently our individual transform/predict methods are private, so they either need to copy or re-implement them (or put themselves in org.apache.spark) to access them. How would folks feel about adding a new trait for ML pipeline stages to expose transformation of single-element inputs (or local collections), optionally implemented by stages which support this? That way we can have less copy-and-paste code possibly getting out of sync with our model training.
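For example, such an optional trait might look roughly like this (sketched in Python for brevity; the real proposal would presumably be a Scala trait in Spark itself, and every name here is invented for illustration):

```python
# Sketch of an opt-in mixin for pipeline stages that can score single
# elements (or small local collections) without a SparkContext.
# "LocalTransformer" and its method names are hypothetical.
from abc import ABC, abstractmethod

class LocalTransformer(ABC):
    """Mixed in only by stages that support local, single-element transform."""

    @abstractmethod
    def transform_one(self, row: dict) -> dict:
        """Transform a single input row."""

    def transform_local(self, rows):
        # Default path for local collections: reuse the single-row logic,
        # keeping serving code in sync with the stage's own transform.
        return [self.transform_one(r) for r in rows]

class LocalScaler(LocalTransformer):
    """Toy stage: multiply every feature by a constant factor."""
    def __init__(self, factor):
        self.factor = factor

    def transform_one(self, row):
        return {name: value * self.factor for name, value in row.items()}
```

Stages that cannot support local transformation would simply not mix the trait in, so nothing about the existing APIs would need to break.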

I think continuing to have online serving grow in different projects is probably the right path forward (folks have different needs), but I'd love to see us make it simpler for other projects to build reliable serving tools.

I realize this may put some of the folks in an awkward position with their own commercial offerings, but hopefully if we make it easier for everyone, the commercial vendors can benefit as well.

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau



--
Joseph Bradley
Software Engineer - Machine Learning
Databricks, Inc.
[http://databricks.com]<http://databricks.com/>



--
Twitter: https://twitter.com/holdenkarau





--
Joseph Bradley
Software Engineer - Machine Learning
Databricks, Inc.
[http://databricks.com]<http://databricks.com/>
--
--
Cheers,
Leif



--
Twitter: https://twitter.com/holdenkarau







Re: Revisiting Online serving of Spark models?

Posted by Chris Fregly <ch...@fregly.com>.
Hey everyone!

@Felix:  thanks for putting this together.  i sent some of you a quick calendar event - mostly for me, so i don’t forget!  :)

Coincidentally, this is the focus of the June 6th Advanced Spark and TensorFlow Meetup <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/> @5:30pm (same night) here in SF!

Everybody is welcome to come.  Here’s the link to the meetup that includes the signup link:  https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/ <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/>

We have an awesome lineup of speakers covering a lot of deep, technical ground.

For those who can’t attend in person, we’ll be broadcasting live - and posting the recording afterward.  

All details are in the meetup link above…

@holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than welcome to give a talk. I can move things around to make room.

@joseph:  I’d personally like an update on the direction of the Databricks proprietary ML Serving export format which is similar to PMML but not a standard in any way.

Also, the Databricks ML Serving Runtime is only available to Databricks customers.  This seems in conflict with the community efforts described here.  Can you comment on behalf of Databricks?

Look forward to your response, joseph.

See you all soon!

—

Chris Fregly
Founder @ PipelineAI <https://pipeline.ai/> (100,000 Users)
Organizer @ Advanced Spark and TensorFlow Meetup <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/> (85,000 Global Members)

San Francisco - Chicago - Austin - 
Washington DC - London - Dusseldorf

Try our PipelineAI Community Edition with GPUs and TPUs!! <http://community.pipeline.ai/>


> On May 30, 2018, at 9:32 AM, Felix Cheung <fe...@hotmail.com> wrote:
> 
> Hi!
> 
> Thank you! Let’s meet then
> 
> June 6 4pm
> 
> Moscone West Convention Center
> 800 Howard Street, San Francisco, CA 94103
> 
> Ground floor (outside of conference area - should be available for all) - we will meet and decide where to go
> 
> (Would not send invite because that would be too much noise for dev@)
> 
> To paraphrase Joseph, we will use this to kick off the discussion and post notes after and follow up online. As for Seattle, I would be very interested to meet in person later and discuss ;) 
> 
> 
> _____________________________
> From: Saikat Kanjilal <sx...@hotmail.com>
> Sent: Tuesday, May 29, 2018 11:46 AM
> Subject: Re: Revisiting Online serving of Spark models?
> To: Maximiliano Felice <ma...@gmail.com>
> Cc: Felix Cheung <fe...@hotmail.com>, Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <jo...@databricks.com>, Leif Walsh <le...@gmail.com>, dev <de...@spark.apache.org>
> 
> 
> Would love to join but am in Seattle, thoughts on how to make this work?
> 
> Regards
> 
> Sent from my iPhone
> 
> On May 29, 2018, at 10:35 AM, Maximiliano Felice <maximilianofelice@gmail.com <ma...@gmail.com>> wrote:
> 
>> Big +1 to a meeting with fresh air.
>> 
>> Could anyone send the invites? I don't really know which is the place Holden is talking about.
>> 
>> 2018-05-29 14:27 GMT-03:00 Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>:
>> You had me at blue bottle!
>> 
>> _____________________________
>> From: Holden Karau <holden@pigscanfly.ca <ma...@pigscanfly.ca>>
>> Sent: Tuesday, May 29, 2018 9:47 AM
>> Subject: Re: Revisiting Online serving of Spark models?
>> To: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
>> Cc: Saikat Kanjilal <sxk1969@hotmail.com <ma...@hotmail.com>>, Maximiliano Felice <maximilianofelice@gmail.com <ma...@gmail.com>>, Joseph Bradley <joseph@databricks.com <ma...@databricks.com>>, Leif Walsh <leif.walsh@gmail.com <ma...@gmail.com>>, dev <dev@spark.apache.org <ma...@spark.apache.org>>
>> 
>> 
>> 
>> I'm down for that, we could all go for a walk maybe to the Mint Plaza Blue Bottle and grab coffee (if the weather holds, we can have our design meeting outside :p)?
>> 
>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>> Bump.
>> 
>> From: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
>> Sent: Saturday, May 26, 2018 1:05:29 PM
>> To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>> Cc: Leif Walsh; Holden Karau; dev
>> 
>> Subject: Re: Revisiting Online serving of Spark models?
>>  
>> Hi! How about we meet the community and discuss on June 6 4pm at (near) the Summit?
>> 
>> (I propose we meet at the venue entrance so we could accommodate people who might not be in the conference)
>> 
>> From: Saikat Kanjilal <sxk1969@hotmail.com <ma...@hotmail.com>>
>> Sent: Tuesday, May 22, 2018 7:47:07 AM
>> To: Maximiliano Felice
>> Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>> Subject: Re: Revisiting Online serving of Spark models?
>>  
>> I’m in the same exact boat as Maximiliano and have use cases as well for model serving and would love to join this discussion.
>> 
>> Sent from my iPhone
>> 
>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <maximilianofelice@gmail.com <ma...@gmail.com>> wrote:
>> 
>>> Hi!
>>> 
>>> I don't usually write a lot on this list but I keep up to date with the discussions and I'm a heavy user of Spark. This topic caught my attention, as we're currently facing this issue at work. I'm attending the summit and was wondering if it would be possible for me to join that meeting. I might be able to share some helpful use cases and ideas.
>>> 
>>> Thanks,
>>> Maximiliano Felice
>>> 
>>> El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <leif.walsh@gmail.com <ma...@gmail.com>> escribió:
>>> I’m with you on JSON being more readable than Parquet, but we’ve had success using pyarrow’s parquet reader and have been quite happy with it so far. If your target is Python (and probably if not now, then soon, R), you should look into it. 
>>> 
>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <joseph@databricks.com <ma...@databricks.com>> wrote:
>>> Regarding model reading and writing, I'll give quick thoughts here:
>>> * Our approach was to use the same format but write JSON instead of Parquet.  It's easier to parse JSON without Spark, and using the same format simplifies architecture.  Plus, some people want to check files into version control, and JSON is nice for that.
>>> * The reader/writer APIs could be extended to take format parameters (just like DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).
>>> 
>>> This would be a big project, so proposing a SPIP might be best.  If people are around at the Spark Summit, that could be a good time to meet up & then post notes back to the dev list.
>>> 
>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>>> Specifically I’d like to bring part of the discussion to Model and PipelineModel, and the various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker to reusing trained models outside of Spark for online serving.
>>> 
>>> What’s the next step? Would folks be interested in getting together to discuss/get some feedback?
>>> 
>>> 
>>> _____________________________
>>> From: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
>>> Sent: Thursday, May 10, 2018 10:10 AM
>>> Subject: Re: Revisiting Online serving of Spark models?
>>> To: Holden Karau <holden@pigscanfly.ca <ma...@pigscanfly.ca>>, Joseph Bradley <joseph@databricks.com <ma...@databricks.com>>
>>> Cc: dev <dev@spark.apache.org <ma...@spark.apache.org>>
>>> 
>>> 
>>> 
>>> Huge +1 on this!
>>> 
>>> From:holden.karau@gmail.com <ma...@gmail.com> <holden.karau@gmail.com <ma...@gmail.com>> on behalf of Holden Karau <holden@pigscanfly.ca <ma...@pigscanfly.ca>>
>>> Sent: Thursday, May 10, 2018 9:39:26 AM
>>> To: Joseph Bradley
>>> Cc: dev
>>> Subject: Re: Revisiting Online serving of Spark models?
>>>  
>>> 
>>> 
>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <joseph@databricks.com <ma...@databricks.com>> wrote:
>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>> 
>>> Awesome! I'm glad other folks think something like this belongs in Spark.
>>> This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession.  There are related commercial offerings like this : ) but the overhead of maintaining those offerings is pretty high.  Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.
>>> 
>>> We've talked about this need at Databricks and have also been syncing with the creators of MLeap.  It'd be great to get this functionality into Spark itself.  Some thoughts:
>>> * It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models.  Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
>>> * Supporting Pipelines is important.  For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
>>> * This architecture may require some awkward APIs currently to have model prediction logic in mllib-local, local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd party developers to extend MLlib APIs (especially in Java).
>>> I agree this could be interesting, and feed into the other discussion around when (or if) we should be considering Spark 3.0
>>> I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs but I could be wrong on that point.
>>> * It could also be worth discussing local DataFrames.  They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
>>> That could be interesting as well. 
>>> 
>>> I'll be interested to hear others' thoughts too!
>>> 
>>> Joseph
>>> 
>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <holden@pigscanfly.ca <ma...@pigscanfly.ca>> wrote:
>>> Hi y'all,
>>> 
>>> With the renewed interest in ML in Apache Spark, now seems like as good a time as any to revisit the online serving situation in Spark ML. DB & others have done some excellent work moving a lot of the necessary tools into a local linear algebra package that doesn't depend on having a SparkContext.
>>> 
>>> There are a few different commercial and non-commercial solutions around this, but currently our individual transform/predict methods are private, so they either need to copy or re-implement them (or put themselves in org.apache.spark) to access them. How would folks feel about adding a new trait for ML pipeline stages to expose transformation of single-element inputs (or local collections), optionally implemented by stages which support this? That way we can have less copy-and-paste code possibly getting out of sync with our model training.
>>> 
>>> I think continuing to have online serving grow in different projects is probably the right path forward (folks have different needs), but I'd love to see us make it simpler for other projects to build reliable serving tools.
>>> 
>>> I realize this may put some of the folks in an awkward position with their own commercial offerings, but hopefully if we make it easier for everyone, the commercial vendors can benefit as well.
>>> 
>>> Cheers,
>>> 
>>> Holden :)
>>> 
>>> -- 
>>> Twitter: https://twitter.com/holdenkarau <https://twitter.com/holdenkarau>
>>> 
>>> 
>>> 
>>> --
>>> Joseph Bradley
>>> Software Engineer - Machine Learning
>>> Databricks, Inc.
>>>  <http://databricks.com/>
>>> 
>>> 
>>> 
>>> -- 
>>> Twitter: https://twitter.com/holdenkarau <https://twitter.com/holdenkarau>
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Joseph Bradley
>>> Software Engineer - Machine Learning
>>> Databricks, Inc.
>>>  <http://databricks.com/>
>>> -- 
>>> -- 
>>> Cheers,
>>> Leif
>> 
>> 
>> 
>> -- 
>> Twitter: https://twitter.com/holdenkarau <https://twitter.com/holdenkarau>
>> 
>> 
>> 
> 
> 


Re: Revisiting Online serving of Spark models?

Posted by Denny Lee <de...@gmail.com>.
I most likely will not be able to join SF next week but definitely up for a
session after Summit in Seattle to dive further into this, eh?!

On Wed, May 30, 2018 at 9:32 AM Felix Cheung <fe...@hotmail.com>
wrote:

> Hi!
>
> Thank you! Let’s meet then
>
> June 6 4pm
>
> Moscone West Convention Center
> 800 Howard Street, San Francisco, CA 94103
> <https://maps.google.com/?q=800+Howard+Street,+San+Francisco,+CA+94103&entry=gmail&source=g>
>
> Ground floor (outside of conference area - should be available for all) -
> we will meet and decide where to go
>
> (Would not send invite because that would be too much noise for dev@)
>
> To paraphrase Joseph, we will use this to kick off the discussion and
> post notes after and follow up online. As for Seattle, I would be very
> interested to meet in person later and discuss ;)
>
>
> _____________________________
> From: Saikat Kanjilal <sx...@hotmail.com>
> Sent: Tuesday, May 29, 2018 11:46 AM
>
> Subject: Re: Revisiting Online serving of Spark models?
> To: Maximiliano Felice <ma...@gmail.com>
> Cc: Felix Cheung <fe...@hotmail.com>, Holden Karau <
> holden@pigscanfly.ca>, Joseph Bradley <jo...@databricks.com>, Leif Walsh
> <le...@gmail.com>, dev <de...@spark.apache.org>
>
>
>
> Would love to join but am in Seattle, thoughts on how to make this work?
>
> Regards
>
> Sent from my iPhone
>
> On May 29, 2018, at 10:35 AM, Maximiliano Felice <
> maximilianofelice@gmail.com> wrote:
>
> Big +1 to a meeting with fresh air.
>
> Could anyone send the invites? I don't really know which is the place
> Holden is talking about.
>
> 2018-05-29 14:27 GMT-03:00 Felix Cheung <fe...@hotmail.com>:
>
>> You had me at blue bottle!
>>
>> _____________________________
>> From: Holden Karau <ho...@pigscanfly.ca>
>> Sent: Tuesday, May 29, 2018 9:47 AM
>> Subject: Re: Revisiting Online serving of Spark models?
>> To: Felix Cheung <fe...@hotmail.com>
>> Cc: Saikat Kanjilal <sx...@hotmail.com>, Maximiliano Felice <
>> maximilianofelice@gmail.com>, Joseph Bradley <jo...@databricks.com>,
>> Leif Walsh <le...@gmail.com>, dev <de...@spark.apache.org>
>>
>>
>>
>> I'm down for that, we could all go for a walk maybe to the Mint Plaza
>> Blue Bottle and grab coffee (if the weather holds, we can have our design
>> meeting outside :p)?
>>
>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <fe...@hotmail.com>
>> wrote:
>>
>>> Bump.
>>>
>>> ------------------------------
>>> *From:* Felix Cheung <fe...@hotmail.com>
>>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>>> *Cc:* Leif Walsh; Holden Karau; dev
>>>
>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>
>>> Hi! How about we meet the community and discuss on June 6 4pm at (near)
>>> the Summit?
>>>
>>> (I propose we meet at the venue entrance so we could accommodate people
>>> who might not be in the conference)
>>>
>>> ------------------------------
>>> *From:* Saikat Kanjilal <sx...@hotmail.com>
>>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>>> *To:* Maximiliano Felice
>>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>
>>> I’m in the same exact boat as Maximiliano and have use cases as well for
>>> model serving and would love to join this discussion.
>>>
>>> Sent from my iPhone
>>>
>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <
>>> maximilianofelice@gmail.com> wrote:
>>>
>>> Hi!
>>>
>>> I don't usually write a lot on this list but I keep up to date with
>>> the discussions and I'm a heavy user of Spark. This topic caught my
>>> attention, as we're currently facing this issue at work. I'm attending
>>> the summit and was wondering if it would be possible for me to join that
>>> meeting. I might be able to share some helpful use cases and ideas.
>>>
>>> Thanks,
>>> Maximiliano Felice
>>>
>>> El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <le...@gmail.com>
>>> escribió:
>>>
>>>> I’m with you on json being more readable than parquet, but we’ve had
>>>> success using pyarrow’s parquet reader and have been quite happy with it so
>>>> far. If your target is Python (and probably if not now, then soon, R), you
>>>> should look into it.
>>>>
>>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <jo...@databricks.com>
>>>> wrote:
>>>>
>>>>> Regarding model reading and writing, I'll give quick thoughts here:
>>>>> * Our approach was to use the same format but write JSON instead of
>>>>> Parquet.  It's easier to parse JSON without Spark, and using the same
>>>>> format simplifies architecture.  Plus, some people want to check files into
>>>>> version control, and JSON is nice for that.
>>>>> * The reader/writer APIs could be extended to take format parameters
>>>>> (just like DataFrame reader/writers) to handle JSON (and maybe, eventually,
>>>>> handle Parquet in the online serving setting).
>>>>>
>>>>> This would be a big project, so proposing a SPIP might be best.  If
>>>>> people are around at the Spark Summit, that could be a good time to meet up
>>>>> & then post notes back to the dev list.
>>>>>
>>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <
>>>>> felixcheung_m@hotmail.com> wrote:
>>>>>
>>>>>> Specifically I’d like to bring part of the discussion to Model and
>>>>>> PipelineModel, and the various ModelReader and SharedReadWrite
>>>>>> implementations that rely on SparkContext. This is a big blocker to
>>>>>> reusing trained models outside of Spark for online serving.
>>>>>>
>>>>>> What’s the next step? Would folks be interested in getting together
>>>>>> to discuss/get some feedback?
>>>>>>
>>>>>>
>>>>>> _____________________________
>>>>>> From: Felix Cheung <fe...@hotmail.com>
>>>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>>> To: Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <
>>>>>> joseph@databricks.com>
>>>>>> Cc: dev <de...@spark.apache.org>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Huge +1 on this!
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:*holden.karau@gmail.com <ho...@gmail.com> on behalf of
>>>>>> Holden Karau <ho...@pigscanfly.ca>
>>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>>>> *To:* Joseph Bradley
>>>>>> *Cc:* dev
>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <
>>>>>> joseph@databricks.com> wrote:
>>>>>>
>>>>>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>>>>>>
>>>>>>> Awesome! I'm glad other folks think something like this belongs in
>>>>>> Spark.
>>>>>>
>>>>>>> This was one of the original goals for mllib-local: to have local
>>>>>>> versions of MLlib models which could be deployed without the big Spark JARs
>>>>>>> and without a SparkContext or SparkSession.  There are related commercial
>>>>>>> offerings like this : ) but the overhead of maintaining those offerings is
>>>>>>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>>>>>>> libraries will be well worth it.
>>>>>>>
>>>>>>> We've talked about this need at Databricks and have also been
>>>>>>> syncing with the creators of MLeap.  It'd be great to get this
>>>>>>> functionality into Spark itself.  Some thoughts:
>>>>>>> * It'd be valuable to have this go beyond adding transform() methods
>>>>>>> taking a Row to the current Models.  Instead, it would be ideal to have
>>>>>>> local, lightweight versions of models in mllib-local, outside of the main
>>>>>>> mllib package (for easier deployment with smaller & fewer dependencies).
>>>>>>> * Supporting Pipelines is important.  For this, it would be ideal to
>>>>>>> utilize elements of Spark SQL, particularly Rows and Types, which could be
>>>>>>> moved into a local sql package.
>>>>>>> * This architecture may require some awkward APIs currently to have
>>>>>>> model prediction logic in mllib-local, local model classes in mllib-local,
>>>>>>> and regular (DataFrame-friendly) model classes in mllib.  We might find it
>>>>>>> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>>>>>>> architecture while making it feasible for 3rd party developers to extend
>>>>>>> MLlib APIs (especially in Java).
>>>>>>>
>>>>>> I agree this could be interesting, and feed into the other discussion
>>>>>> around when (or if) we should be considering Spark 3.0
>>>>>> I _think_ we could probably do it with optional traits people could
>>>>>> mix in to avoid breaking the current APIs but I could be wrong on that
>>>>>> point.
>>>>>>
>>>>>>> * It could also be worth discussing local DataFrames.  They might
>>>>>>> not be as important as per-Row transformations, but they would be helpful
>>>>>>> for batching for higher throughput.
>>>>>>>
>>>>>> That could be interesting as well.
>>>>>>
>>>>>>>
>>>>>>> I'll be interested to hear others' thoughts too!
>>>>>>>
>>>>>>> Joseph
>>>>>>>
>>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi y'all,
>>>>>>>>
>>>>>>>> With the renewed interest in ML in Apache Spark, now seems like as
>>>>>>>> good a time as any to revisit the online serving situation in Spark ML. DB
>>>>>>>> & others have done some excellent work moving a lot of the necessary
>>>>>>>> tools into a local linear algebra package that doesn't depend on having a
>>>>>>>> SparkContext.
>>>>>>>>
>>>>>>>> There are a few different commercial and non-commercial solutions
>>>>>>>> around this, but currently our individual transform/predict methods are
>>>>>>>> private, so they either need to copy or re-implement (or put themselves in
>>>>>>>> org.apache.spark) to access them. How would folks feel about adding a new
>>>>>>>> trait for ML pipeline stages to expose transformation of single
>>>>>>>> element inputs (or local collections) that could be optionally implemented
>>>>>>>> by stages which support this? That way we can have less copy and paste code
>>>>>>>> possibly getting out of sync with our model training.
>>>>>>>>
>>>>>>>> I think continuing to have on-line serving grow in different
>>>>>>>> projects is probably the right path forward (folks have different needs),
>>>>>>>> but I'd love to see us make it simpler for other projects to build reliable
>>>>>>>> serving tools.
>>>>>>>>
>>>>>>>> I realize this may put some of the folks in an awkward position
>>>>>>>> with their own commercial offerings, but hopefully if we make it easier for
>>>>>>>> everyone the commercial vendors can benefit as well.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Holden :)
>>>>>>>>
>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Joseph Bradley
>>>>>>>
>>>>>>> Software Engineer - Machine Learning
>>>>>>>
>>>>>>> Databricks, Inc.
>>>>>>>
>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Joseph Bradley
>>>>>
>>>>> Software Engineer - Machine Learning
>>>>>
>>>>> Databricks, Inc.
>>>>>
>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>
>>>> --
>>>> --
>>>> Cheers,
>>>> Leif
>>>>
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>>
>>
>
>
>

Re: Revisiting Online serving of Spark models?

Posted by Felix Cheung <fe...@hotmail.com>.
Hi!

Thank you! Let’s meet then

June 6 4pm

Moscone West Convention Center
800 Howard Street, San Francisco, CA 94103

Ground floor (outside of conference area - should be available for all) - we will meet and decide where to go

(Would not send invite because that would be too much noise for dev@)

To paraphrase Joseph, we will use this to kick off the discussion, post notes after, and follow up online. As for Seattle, I would be very interested to meet in person later and discuss ;)


_____________________________
From: Saikat Kanjilal <sx...@hotmail.com>
Sent: Tuesday, May 29, 2018 11:46 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Maximiliano Felice <ma...@gmail.com>
Cc: Felix Cheung <fe...@hotmail.com>, Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <jo...@databricks.com>, Leif Walsh <le...@gmail.com>, dev <de...@spark.apache.org>


Would love to join but am in Seattle, thoughts on how to make this work?

Regards

Sent from my iPhone

On May 29, 2018, at 10:35 AM, Maximiliano Felice <ma...@gmail.com>> wrote:

Big +1 to a meeting with fresh air.

Could anyone send the invites? I don't really know which is the place Holden is talking about.

2018-05-29 14:27 GMT-03:00 Felix Cheung <fe...@hotmail.com>>:
You had me at blue bottle!

_____________________________
From: Holden Karau <ho...@pigscanfly.ca>>
Sent: Tuesday, May 29, 2018 9:47 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Felix Cheung <fe...@hotmail.com>>
Cc: Saikat Kanjilal <sx...@hotmail.com>>, Maximiliano Felice <ma...@gmail.com>>, Joseph Bradley <jo...@databricks.com>>, Leif Walsh <le...@gmail.com>>, dev <de...@spark.apache.org>>



I'm down for that, we could all go for a walk, maybe to the Mint Plaza Blue Bottle, and grab coffee (if the weather holds we can have our design meeting outside :p)?

On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <fe...@hotmail.com>> wrote:
Bump.

________________________________
From: Felix Cheung <fe...@hotmail.com>>
Sent: Saturday, May 26, 2018 1:05:29 PM
To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
Cc: Leif Walsh; Holden Karau; dev

Subject: Re: Revisiting Online serving of Spark models?

Hi! How about we meet the community and discuss on June 6 4pm at (near) the Summit?

(I propose we meet at the venue entrance so we could accommodate people who might not be in the conference)

________________________________
From: Saikat Kanjilal <sx...@hotmail.com>>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?

I’m in the same exact boat as Maximiliano and have use cases as well for model serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice <ma...@gmail.com>> wrote:

Hi!

I don't usually write a lot on this list, but I keep up to date with the discussions and I'm a heavy user of Spark. This topic caught my attention, as we're currently facing this issue at work. I'm attending the summit and was wondering if it would be possible for me to join that meeting. I might be able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <le...@gmail.com>> escribió:
I’m with you on json being more readable than parquet, but we’ve had success using pyarrow’s parquet reader and have been quite happy with it so far. If your target is python (and probably if not now, then soon, R), you should look into it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley <jo...@databricks.com>> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  It's easier to parse JSON without Spark, and using the same format simplifies architecture.  Plus, some people want to check files into version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).
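[Editor's sketch: the JSON-persistence idea above, in miniature. This is a hypothetical illustration in Python, not Spark's actual persistence format or API; the class and field names are invented.]

```python
import json

class LocalLinearModel:
    """Minimal sketch of a model whose parameters round-trip through JSON,
    so it can be reloaded for serving without a SparkContext. All names
    here are hypothetical, not Spark's actual persistence layout."""

    def __init__(self, coefficients, intercept):
        self.coefficients = coefficients
        self.intercept = intercept

    def predict(self, features):
        # Single-row prediction: dot product of coefficients and features,
        # plus the intercept.
        return sum(c * x for c, x in zip(self.coefficients, features)) + self.intercept

    def save(self, path):
        # Plain JSON keeps the file human-readable and friendly to version
        # control, as noted above.
        with open(path, "w") as f:
            json.dump({"coefficients": self.coefficients,
                       "intercept": self.intercept}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            params = json.load(f)
        return cls(params["coefficients"], params["intercept"])
```

A format parameter on the reader/writer, as suggested in the second bullet, would let the same model choose between a JSON layout like this and Parquet.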

This would be a big project, so proposing a SPIP might be best.  If people are around at the Spark Summit, that could be a good time to meet up & then post notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <fe...@hotmail.com>> wrote:
Specifically, I’d like to bring part of the discussion to Model and PipelineModel, and the various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing trained models outside of Spark for online serving.

What’s the next step? Would folks be interested in getting together to discuss/get some feedback?


_____________________________
From: Felix Cheung <fe...@hotmail.com>>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <ho...@pigscanfly.ca>>, Joseph Bradley <jo...@databricks.com>>
Cc: dev <de...@spark.apache.org>>



Huge +1 on this!

________________________________
From:holden.karau@gmail.com<ma...@gmail.com> <ho...@gmail.com>> on behalf of Holden Karau <ho...@pigscanfly.ca>>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jo...@databricks.com>> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession.  There are related commercial offerings like this : ) but the overhead of maintaining those offerings is pretty high.  Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.

We've talked about this need at Databricks and have also been syncing with the creators of MLeap.  It'd be great to get this functionality into Spark itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models.  Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
* This architecture may require some awkward APIs currently to have model prediction logic in mllib-local, local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd party developers to extend MLlib APIs (especially in Java).
I agree this could be interesting, and feed into the other discussion around when (or if) we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs but I could be wrong on that point.
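[Editor's sketch: to make the "optional trait people could mix in" idea concrete, here is a rough Python analogue. The actual proposal concerns Scala traits in Spark ML; every name below is hypothetical and only illustrates the shape of the idea.]

```python
class SingleRowTransform:
    """Hypothetical opt-in capability: a pipeline stage that can transform
    one element at a time mixes this in; stages that can't simply don't,
    so existing APIs are left unbroken."""

    def transform_row(self, row):
        raise NotImplementedError

class Scaler:
    # Stand-in for an existing pipeline stage; its batch API is unchanged.
    def __init__(self, factor):
        self.factor = factor

class LocalScaler(Scaler, SingleRowTransform):
    # Opting in: the stage advertises single-row support by mixing in the
    # trait-like class and implementing transform_row.
    def transform_row(self, row):
        return [self.factor * x for x in row]

def serve(stage, row):
    # A serving layer checks for the capability instead of reaching into
    # private Spark internals or copy-pasting transform logic.
    if isinstance(stage, SingleRowTransform):
        return stage.transform_row(row)
    raise TypeError("stage does not support single-row transform")
```

Stages that never implement the mix-in behave exactly as before, which is why this approach could avoid breaking current APIs.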
* It could also be worth discussing local DataFrames.  They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
That could be interesting as well.
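[Editor's sketch: a minimal picture of the local-DataFrame batching idea, treating a local "DataFrame" as just a list of column-name-to-value dicts. Hypothetical, not a Spark API.]

```python
def transform_local_frame(rows, predict_fn):
    """Applies a single-row predict function over a small local batch,
    attaching the result as a new 'prediction' column. A real local
    DataFrame could vectorize this loop for higher throughput."""
    out = []
    for row in rows:
        new_row = dict(row)  # leave the input batch untouched
        new_row["prediction"] = predict_fn(row["features"])
        out.append(new_row)
    return out
```

Batching like this is what would let a serving layer amortize per-call overhead compared to strictly per-row transformations.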

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca>> wrote:
Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a time as any to revisit the online serving situation in Spark ML. DB & others have done some excellent work moving a lot of the necessary tools into a local linear algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions around this, but currently our individual transform/predict methods are private, so they either need to copy or re-implement (or put themselves in org.apache.spark) to access them. How would folks feel about adding a new trait for ML pipeline stages to expose transformation of single element inputs (or local collections) that could be optionally implemented by stages which support this? That way we can have less copy-and-paste code possibly getting out of sync with our model training.

I think continuing to have on-line serving grow in different projects is probably the right path forward (folks have different needs), but I'd love to see us make it simpler for other projects to build reliable serving tools.

I realize this may put some of the folks in an awkward position with their own commercial offerings, but hopefully if we make it easier for everyone the commercial vendors can benefit as well.

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]<http://databricks.com/>



--
Twitter: https://twitter.com/holdenkarau





--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]<http://databricks.com/>

--
--
Cheers,
Leif



--
Twitter: https://twitter.com/holdenkarau






Re: Revisiting Online serving of Spark models?

Posted by Saikat Kanjilal <sx...@hotmail.com>.
Would love to join but am in Seattle, thoughts on how to make this work?

Regards

Sent from my iPhone

On May 29, 2018, at 10:35 AM, Maximiliano Felice <ma...@gmail.com>> wrote:

Big +1 to a meeting with fresh air.

Could anyone send the invites? I don't really know which is the place Holden is talking about.

2018-05-29 14:27 GMT-03:00 Felix Cheung <fe...@hotmail.com>>:
You had me at blue bottle!

_____________________________
From: Holden Karau <ho...@pigscanfly.ca>>
Sent: Tuesday, May 29, 2018 9:47 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Felix Cheung <fe...@hotmail.com>>
Cc: Saikat Kanjilal <sx...@hotmail.com>>, Maximiliano Felice <ma...@gmail.com>>, Joseph Bradley <jo...@databricks.com>>, Leif Walsh <le...@gmail.com>>, dev <de...@spark.apache.org>>



I'm down for that, we could all go for a walk, maybe to the Mint Plaza Blue Bottle, and grab coffee (if the weather holds we can have our design meeting outside :p)?

On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <fe...@hotmail.com>> wrote:
Bump.

________________________________
From: Felix Cheung <fe...@hotmail.com>>
Sent: Saturday, May 26, 2018 1:05:29 PM
To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
Cc: Leif Walsh; Holden Karau; dev

Subject: Re: Revisiting Online serving of Spark models?

Hi! How about we meet the community and discuss on June 6 4pm at (near) the Summit?

(I propose we meet at the venue entrance so we could accommodate people who might not be in the conference)

________________________________
From: Saikat Kanjilal <sx...@hotmail.com>>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?

I’m in the same exact boat as Maximiliano and have use cases as well for model serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice <ma...@gmail.com>> wrote:

Hi!

I don't usually write a lot on this list, but I keep up to date with the discussions and I'm a heavy user of Spark. This topic caught my attention, as we're currently facing this issue at work. I'm attending the summit and was wondering if it would be possible for me to join that meeting. I might be able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <le...@gmail.com>> escribió:
I’m with you on json being more readable than parquet, but we’ve had success using pyarrow’s parquet reader and have been quite happy with it so far. If your target is python (and probably if not now, then soon, R), you should look into it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley <jo...@databricks.com>> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  It's easier to parse JSON without Spark, and using the same format simplifies architecture.  Plus, some people want to check files into version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are around at the Spark Summit, that could be a good time to meet up & then post notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <fe...@hotmail.com>> wrote:
Specifically, I’d like to bring part of the discussion to Model and PipelineModel, and the various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing trained models outside of Spark for online serving.

What’s the next step? Would folks be interested in getting together to discuss/get some feedback?


_____________________________
From: Felix Cheung <fe...@hotmail.com>>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <ho...@pigscanfly.ca>>, Joseph Bradley <jo...@databricks.com>>
Cc: dev <de...@spark.apache.org>>



Huge +1 on this!

________________________________
From:holden.karau@gmail.com<ma...@gmail.com> <ho...@gmail.com>> on behalf of Holden Karau <ho...@pigscanfly.ca>>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jo...@databricks.com>> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession.  There are related commercial offerings like this : ) but the overhead of maintaining those offerings is pretty high.  Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.

We've talked about this need at Databricks and have also been syncing with the creators of MLeap.  It'd be great to get this functionality into Spark itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models.  Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
* This architecture may require some awkward APIs currently to have model prediction logic in mllib-local, local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd party developers to extend MLlib APIs (especially in Java).
I agree this could be interesting, and feed into the other discussion around when (or if) we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
That could be interesting as well.

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca>> wrote:
Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a time as any to revisit the online serving situation in Spark ML. DB & others have done some excellent work moving a lot of the necessary tools into a local linear algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions around this, but currently our individual transform/predict methods are private, so they either need to copy or re-implement (or put themselves in org.apache.spark) to access them. How would folks feel about adding a new trait for ML pipeline stages to expose transformation of single element inputs (or local collections) that could be optionally implemented by stages which support this? That way we can have less copy-and-paste code possibly getting out of sync with our model training.

I think continuing to have on-line serving grow in different projects is probably the right path forward (folks have different needs), but I'd love to see us make it simpler for other projects to build reliable serving tools.

I realize this may put some of the folks in an awkward position with their own commercial offerings, but hopefully if we make it easier for everyone the commercial vendors can benefit as well.

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]<http://databricks.com/>



--
Twitter: https://twitter.com/holdenkarau





--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]<http://databricks.com/>

--
--
Cheers,
Leif



--
Twitter: https://twitter.com/holdenkarau




Re: Revisiting Online serving of Spark models?

Posted by Maximiliano Felice <ma...@gmail.com>.
Big +1 to a meeting with fresh air.

Could anyone send the invites? I don't really know which is the place
Holden is talking about.

2018-05-29 14:27 GMT-03:00 Felix Cheung <fe...@hotmail.com>:

> You had me at blue bottle!
>
> _____________________________
> From: Holden Karau <ho...@pigscanfly.ca>
> Sent: Tuesday, May 29, 2018 9:47 AM
> Subject: Re: Revisiting Online serving of Spark models?
> To: Felix Cheung <fe...@hotmail.com>
> Cc: Saikat Kanjilal <sx...@hotmail.com>, Maximiliano Felice <
> maximilianofelice@gmail.com>, Joseph Bradley <jo...@databricks.com>,
> Leif Walsh <le...@gmail.com>, dev <de...@spark.apache.org>
>
>
>
> I'm down for that, we could all go for a walk, maybe to the Mint Plaza
> Blue Bottle, and grab coffee (if the weather holds we can have our design
> meeting outside :p)?
>
> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <fe...@hotmail.com>
> wrote:
>
>> Bump.
>>
>> ------------------------------
>> *From:* Felix Cheung <fe...@hotmail.com>
>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>> *Cc:* Leif Walsh; Holden Karau; dev
>>
>> *Subject:* Re: Revisiting Online serving of Spark models?
>>
>> Hi! How about we meet the community and discuss on June 6 4pm at (near)
>> the Summit?
>>
>> (I propose we meet at the venue entrance so we could accommodate people
>> who might not be in the conference)
>>
>> ------------------------------
>> *From:* Saikat Kanjilal <sx...@hotmail.com>
>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>> *To:* Maximiliano Felice
>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>> *Subject:* Re: Revisiting Online serving of Spark models?
>>
>> I’m in the same exact boat as Maximiliano and have use cases as well for
>> model serving and would love to join this discussion.
>>
>> Sent from my iPhone
>>
>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <
>> maximilianofelice@gmail.com> wrote:
>>
>> Hi!
>>
>> I don't usually write a lot on this list, but I keep up to date with the
>> discussions and I'm a heavy user of Spark. This topic caught my attention,
>> as we're currently facing this issue at work. I'm attending the summit
>> and was wondering if it would be possible for me to join that meeting. I
>> might be able to share some helpful use cases and ideas.
>>
>> Thanks,
>> Maximiliano Felice
>>
>> El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <le...@gmail.com>
>> escribió:
>>
>>> I’m with you on json being more readable than parquet, but we’ve had
>>> success using pyarrow’s parquet reader and have been quite happy with it so
>>> far. If your target is python (and probably if not now, then soon, R), you
>>> should look into it.
>>>
>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <jo...@databricks.com>
>>> wrote:
>>>
>>>> Regarding model reading and writing, I'll give quick thoughts here:
>>>> * Our approach was to use the same format but write JSON instead of
>>>> Parquet.  It's easier to parse JSON without Spark, and using the same
>>>> format simplifies architecture.  Plus, some people want to check files into
>>>> version control, and JSON is nice for that.
>>>> * The reader/writer APIs could be extended to take format parameters
>>>> (just like DataFrame reader/writers) to handle JSON (and maybe, eventually,
>>>> handle Parquet in the online serving setting).
>>>>
>>>> This would be a big project, so proposing a SPIP might be best.  If
>>>> people are around at the Spark Summit, that could be a good time to meet up
>>>> & then post notes back to the dev list.
>>>>
>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <
>>>> felixcheung_m@hotmail.com> wrote:
>>>>
>>>>> Specifically, I’d like to bring part of the discussion to Model and
>>>>> PipelineModel, and various ModelReader and SharedReadWrite implementations
>>>>> that rely on SparkContext. This is a big blocker on reusing trained models
>>>>> outside of Spark for online serving.
>>>>>
>>>>> What’s the next step? Would folks be interested in getting together to
>>>>> discuss/get some feedback?
>>>>>
>>>>>
>>>>> _____________________________
>>>>> From: Felix Cheung <fe...@hotmail.com>
>>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>> To: Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <
>>>>> joseph@databricks.com>
>>>>> Cc: dev <de...@spark.apache.org>
>>>>>
>>>>>
>>>>>
>>>>> Huge +1 on this!
>>>>>
>>>>> ------------------------------
>>>>> *From:*holden.karau@gmail.com <ho...@gmail.com> on behalf of
>>>>> Holden Karau <ho...@pigscanfly.ca>
>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>>> *To:* Joseph Bradley
>>>>> *Cc:* dev
>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>
>>>>>
>>>>>
>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <joseph@databricks.com
>>>>> > wrote:
>>>>>
>>>>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>>>>>
>>>>>> Awesome! I'm glad other folks think something like this belongs in
>>>>> Spark.
>>>>>
>>>>>> This was one of the original goals for mllib-local: to have local
>>>>>> versions of MLlib models which could be deployed without the big Spark JARs
>>>>>> and without a SparkContext or SparkSession.  There are related commercial
>>>>>> offerings like this : ) but the overhead of maintaining those offerings is
>>>>>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>>>>>> libraries will be well worth it.
>>>>>>
>>>>>> We've talked about this need at Databricks and have also been syncing
>>>>>> with the creators of MLeap.  It'd be great to get this functionality into
>>>>>> Spark itself.  Some thoughts:
>>>>>> * It'd be valuable to have this go beyond adding transform() methods
>>>>>> taking a Row to the current Models.  Instead, it would be ideal to have
>>>>>> local, lightweight versions of models in mllib-local, outside of the main
>>>>>> mllib package (for easier deployment with smaller & fewer dependencies).
>>>>>> * Supporting Pipelines is important.  For this, it would be ideal to
>>>>>> utilize elements of Spark SQL, particularly Rows and Types, which could be
>>>>>> moved into a local sql package.
>>>>>> * This architecture may require some awkward APIs currently to have
>>>>>> model prediction logic in mllib-local, local model classes in mllib-local,
>>>>>> and regular (DataFrame-friendly) model classes in mllib.  We might find it
>>>>>> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>>>>>> architecture while making it feasible for 3rd party developers to extend
>>>>>> MLlib APIs (especially in Java).
>>>>>>
>>>>> I agree this could be interesting, and feed into the other discussion
>>>>> around when (or if) we should be considering Spark 3.0
>>>>> I _think_ we could probably do it with optional traits people could
>>>>> mix in to avoid breaking the current APIs but I could be wrong on that
>>>>> point.
>>>>>
>>>>>> * It could also be worth discussing local DataFrames.  They might not
>>>>>> be as important as per-Row transformations, but they would be helpful for
>>>>>> batching for higher throughput.
>>>>>>
>>>>> That could be interesting as well.
>>>>>
>>>>>>
>>>>>> I'll be interested to hear others' thoughts too!
>>>>>>
>>>>>> Joseph
>>>>>>
>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi y'all,
>>>>>>>
>>>>>>> With the renewed interest in ML in Apache Spark, now seems like as
>>>>>>> good a time as any to revisit the online serving situation in Spark ML. DB
>>>>>>> & others have done some excellent work moving a lot of the necessary
>>>>>>> tools into a local linear algebra package that doesn't depend on having a
>>>>>>> SparkContext.
>>>>>>>
>>>>>>> There are a few different commercial and non-commercial solutions
>>>>>>> around this, but currently our individual transform/predict methods are
>>>>>>> private, so they either need to copy or re-implement (or put themselves in
>>>>>>> org.apache.spark) to access them. How would folks feel about adding a new
>>>>>>> trait for ML pipeline stages to expose transformation of single
>>>>>>> element inputs (or local collections) that could be optionally implemented
>>>>>>> by stages which support this? That way we can have less copy and paste code
>>>>>>> possibly getting out of sync with our model training.
>>>>>>>
>>>>>>> I think continuing to have on-line serving grow in different
>>>>>>> projects is probably the right path forward (folks have different needs),
>>>>>>> but I'd love to see us make it simpler for other projects to build reliable
>>>>>>> serving tools.
>>>>>>>
>>>>>>> I realize this may put some of the folks in an awkward position
>>>>>>> with their own commercial offerings, but hopefully if we make it easier for
>>>>>>> everyone the commercial vendors can benefit as well.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Holden :)
>>>>>>>
>>>>>>> --
>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Joseph Bradley
>>>>>>
>>>>>> Software Engineer - Machine Learning
>>>>>>
>>>>>> Databricks, Inc.
>>>>>>
>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Joseph Bradley
>>>>
>>>> Software Engineer - Machine Learning
>>>>
>>>> Databricks, Inc.
>>>>
>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>
>>> --
>>> --
>>> Cheers,
>>> Leif
>>>
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>
>
>

Re: Revisiting Online serving of Spark models?

Posted by Felix Cheung <fe...@hotmail.com>.
You had me at blue bottle!

_____________________________
From: Holden Karau <ho...@pigscanfly.ca>
Sent: Tuesday, May 29, 2018 9:47 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Felix Cheung <fe...@hotmail.com>
Cc: Saikat Kanjilal <sx...@hotmail.com>, Maximiliano Felice <ma...@gmail.com>, Joseph Bradley <jo...@databricks.com>, Leif Walsh <le...@gmail.com>, dev <de...@spark.apache.org>


I'm down for that, we could all go for a walk, maybe to the Mint Plaza Blue Bottle, and grab coffee (if the weather holds we can have our design meeting outside :p)?

On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <fe...@hotmail.com>> wrote:
Bump.

________________________________
From: Felix Cheung <fe...@hotmail.com>>
Sent: Saturday, May 26, 2018 1:05:29 PM
To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
Cc: Leif Walsh; Holden Karau; dev

Subject: Re: Revisiting Online serving of Spark models?

Hi! How about we meet the community and discuss on June 6 4pm at (near) the Summit?

(I propose we meet at the venue entrance so we could accommodate people who might not be in the conference)

________________________________
From: Saikat Kanjilal <sx...@hotmail.com>>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?

I’m in the same exact boat as Maximiliano and have use cases as well for model serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice <ma...@gmail.com>> wrote:

Hi!

I don't usually write a lot on this list, but I keep up to date with the discussions and I'm a heavy user of Spark. This topic caught my attention, as we're currently facing this issue at work. I'm attending the summit and was wondering if it would be possible for me to join that meeting. I might be able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <le...@gmail.com>> escribió:
I’m with you on json being more readable than parquet, but we’ve had success using pyarrow’s parquet reader and have been quite happy with it so far. If your target is python (and probably if not now, then soon, R), you should look into it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley <jo...@databricks.com> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  It's easier to parse JSON without Spark, and using the same format simplifies architecture.  Plus, some people want to check files into version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).
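As a rough sketch of what such a format-parameterized writer could look like — all names and signatures here are hypothetical, not current MLlib API — in the spirit of DataFrameWriter.format(...):

```scala
// Hypothetical sketch: a model writer that accepts a format parameter.
// Not current MLlib API; names are illustrative only.
trait FormatAwareWriter {
  def format(source: String): FormatAwareWriter
  def save(path: String): String // returns the serialized payload for this demo
}

// Toy stand-in for a model's params/metadata.
case class ToyModelWriter(
    params: Map[String, Double],
    fmt: String = "parquet") extends FormatAwareWriter {
  def format(source: String): FormatAwareWriter = copy(fmt = source)
  def save(path: String): String = fmt match {
    // JSON can be parsed on the serving side without Spark.
    case "json" =>
      params.map { case (k, v) => s""""$k": $v""" }.mkString("{", ", ", "}")
    case other => s"<$other payload for $path>" // placeholder for binary formats
  }
}

val writer = ToyModelWriter(Map("intercept" -> 0.25))
val json = writer.format("json").save("/tmp/model")
```

The point of the sketch is only that the same writer object could be steered to JSON for serving-friendly output while keeping Parquet as the default.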

This would be a big project, so proposing a SPIP might be best.  If people are around at the Spark Summit, that could be a good time to meet up & then post notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <fe...@hotmail.com> wrote:
Specifically, I’d like to bring part of the discussion to Model and PipelineModel, and the various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing trained models outside of Spark for online serving.

What’s the next step? Would folks be interested in getting together to discuss/get some feedback?


_____________________________
From: Felix Cheung <fe...@hotmail.com>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <jo...@databricks.com>
Cc: dev <de...@spark.apache.org>



Huge +1 on this!

________________________________
From: holden.karau@gmail.com <ho...@gmail.com> on behalf of Holden Karau <ho...@pigscanfly.ca>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jo...@databricks.com> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession.  There are related commercial offerings like this : ) but the overhead of maintaining those offerings is pretty high.  Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.

We've talked about this need at Databricks and have also been syncing with the creators of MLeap.  It'd be great to get this functionality into Spark itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models.  Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
* This architecture may require some awkward APIs currently to have model prediction logic in mllib-local, local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd party developers to extend MLlib APIs (especially in Java).
I agree this could be interesting, and it feeds into the other discussion around when (or if) we should be considering Spark 3.0.
I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
That could be interesting as well.
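To make the mllib-local idea above concrete, here is a minimal sketch of a local, SparkContext-free model with both a per-example and a local-batch prediction path. The class and method names are made up for illustration; they are not existing MLlib API.

```scala
// Sketch of a "local" model: plain Scala, no SparkContext/SparkSession,
// deployable without the big Spark JARs. Names are hypothetical.
case class LocalLinearModel(coefficients: Array[Double], intercept: Double) {
  // Single-example prediction: dot product plus intercept.
  def predict(features: Array[Double]): Double = {
    require(features.length == coefficients.length, "dimension mismatch")
    var acc = intercept
    var i = 0
    while (i < features.length) { acc += features(i) * coefficients(i); i += 1 }
    acc
  }
  // Local-collection path: the batching-for-throughput case mentioned above.
  def predict(batch: Seq[Array[Double]]): Seq[Double] =
    batch.map(row => predict(row))
}

val model = LocalLinearModel(Array(2.0, -1.0), 0.5)
val one = model.predict(Array(1.0, 3.0))
val many = model.predict(Seq(Array(1.0, 3.0), Array(0.0, 0.0)))
```

A real version would presumably live in mllib-local and share its prediction logic with the DataFrame-friendly model class in mllib, rather than duplicating it.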

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca> wrote:
Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a time as any to revisit the online serving situation in Spark ML. DB & others have done some excellent work moving a lot of the necessary tools into a local linear algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions around this, but currently our individual transform/predict methods are private, so they either need to copy or re-implement them (or put themselves in org.apache.spark) to access them. How would folks feel about adding a new trait for ML pipeline stages to expose transformation of single-element inputs (or local collections), which could be optionally implemented by stages that support this? That way we can have less copy-and-paste code that might get out of sync with our model training.
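A very rough sketch of what such an optional trait could look like, assuming a real version would work in terms of Spark SQL Rows rather than the plain Map used here (all names are hypothetical):

```scala
// Hypothetical optional trait: stages that support single-element (and
// local-collection) transforms mix it in; other stages are unaffected.
trait LocalTransformSupport {
  def transformRow(row: Map[String, Double]): Map[String, Double]
  // Derived local-collection path, so stages only implement the row case.
  def transformLocal(rows: Seq[Map[String, Double]]): Seq[Map[String, Double]] =
    rows.map(transformRow)
}

// Toy stage mixing it in, standing in for e.g. a scaler-like transformer.
class ToyScaler(inputCol: String, outputCol: String, factor: Double)
    extends LocalTransformSupport {
  def transformRow(row: Map[String, Double]): Map[String, Double] =
    row + (outputCol -> row(inputCol) * factor)
}

val stage = new ToyScaler("amount", "amountScaled", 2.0)
val out = stage.transformRow(Map("amount" -> 3.0))
val outs = stage.transformLocal(Seq(Map("amount" -> 3.0), Map("amount" -> 1.0)))
```

Because the trait is mixed in rather than added to the base Transformer, existing stages and third-party implementations would compile unchanged.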

I think continuing to have online serving grow in different projects is probably the right path forward (folks have different needs), but I'd love to see us make it simpler for other projects to build reliable serving tools.

I realize this may put some folks in an awkward position with their own commercial offerings, but hopefully, if we make it easier for everyone, the commercial vendors can benefit as well.

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com/



--
Twitter: https://twitter.com/holdenkarau





--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com/

--
--
Cheers,
Leif



--
Twitter: https://twitter.com/holdenkarau



Re: Revisiting Online serving of Spark models?

Posted by Holden Karau <ho...@pigscanfly.ca>.
I'm down for that, we could all go for a walk maybe to the mint plazaa blue
bottle and grab coffee (if the weather holds have our design meeting
outside :p)?

On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <fe...@hotmail.com>
wrote:

> Bump.
>
> ------------------------------
> *From:* Felix Cheung <fe...@hotmail.com>
> *Sent:* Saturday, May 26, 2018 1:05:29 PM
> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
> *Cc:* Leif Walsh; Holden Karau; dev
>
> *Subject:* Re: Revisiting Online serving of Spark models?
>
> Hi! How about we meet the community and discuss on June 6 4pm at (near)
> the Summit?
>
> (I propose we meet at the venue entrance so we could accommodate people
> might not be in the conference)
>
> ------------------------------
> *From:* Saikat Kanjilal <sx...@hotmail.com>
> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
> *To:* Maximiliano Felice
> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
> *Subject:* Re: Revisiting Online serving of Spark models?
>
> I’m in the same exact boat as Maximiliano and have use cases as well for
> model serving and would love to join this discussion.
>
> Sent from my iPhone
>
> On May 22, 2018, at 6:39 AM, Maximiliano Felice <
> maximilianofelice@gmail.com> wrote:
>
> Hi!
>
> I'm don't usually write a lot on this list but I keep up to date with the
> discussions and I'm a heavy user of Spark. This topic caught my attention,
> as we're currently facing this issue at work. I'm attending to the summit
> and was wondering if it would it be possible for me to join that meeting. I
> might be able to share some helpful usecases and ideas.
>
> Thanks,
> Maximiliano Felice
>
> El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <le...@gmail.com>
> escribió:
>
>> I’m with you on json being more readable than parquet, but we’ve had
>> success using pyarrow’s parquet reader and have been quite happy with it so
>> far. If your target is python (and probably if not now, then soon, R), you
>> should look in to it.
>>
>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <jo...@databricks.com>
>> wrote:
>>
>>> Regarding model reading and writing, I'll give quick thoughts here:
>>> * Our approach was to use the same format but write JSON instead of
>>> Parquet.  It's easier to parse JSON without Spark, and using the same
>>> format simplifies architecture.  Plus, some people want to check files into
>>> version control, and JSON is nice for that.
>>> * The reader/writer APIs could be extended to take format parameters
>>> (just like DataFrame reader/writers) to handle JSON (and maybe, eventually,
>>> handle Parquet in the online serving setting).
>>>
>>> This would be a big project, so proposing a SPIP might be best.  If
>>> people are around at the Spark Summit, that could be a good time to meet up
>>> & then post notes back to the dev list.
>>>
>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <felixcheung_m@hotmail.com
>>> > wrote:
>>>
>>>> Specifically I’d like bring part of the discussion to Model and
>>>> PipelineModel, and various ModelReader and SharedReadWrite implementations
>>>> that rely on SparkContext. This is a big blocker on reusing  trained models
>>>> outside of Spark for online serving.
>>>>
>>>> What’s the next step? Would folks be interested in getting together to
>>>> discuss/get some feedback?
>>>>
>>>>
>>>> _____________________________
>>>> From: Felix Cheung <fe...@hotmail.com>
>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>> To: Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <
>>>> joseph@databricks.com>
>>>> Cc: dev <de...@spark.apache.org>
>>>>
>>>>
>>>>
>>>> Huge +1 on this!
>>>>
>>>> ------------------------------
>>>> *From:* holden.karau@gmail.com <ho...@gmail.com> on behalf of
>>>> Holden Karau <ho...@pigscanfly.ca>
>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>> *To:* Joseph Bradley
>>>> *Cc:* dev
>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>
>>>>
>>>>
>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jo...@databricks.com>
>>>> wrote:
>>>>
>>>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>>>>
>>>>> Awesome! I'm glad other folks think something like this belongs in
>>>> Spark.
>>>>
>>>>> This was one of the original goals for mllib-local: to have local
>>>>> versions of MLlib models which could be deployed without the big Spark JARs
>>>>> and without a SparkContext or SparkSession.  There are related commercial
>>>>> offerings like this : ) but the overhead of maintaining those offerings is
>>>>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>>>>> libraries will be well worth it.
>>>>>
>>>>> We've talked about this need at Databricks and have also been syncing
>>>>> with the creators of MLeap.  It'd be great to get this functionality into
>>>>> Spark itself.  Some thoughts:
>>>>> * It'd be valuable to have this go beyond adding transform() methods
>>>>> taking a Row to the current Models.  Instead, it would be ideal to have
>>>>> local, lightweight versions of models in mllib-local, outside of the main
>>>>> mllib package (for easier deployment with smaller & fewer dependencies).
>>>>> * Supporting Pipelines is important.  For this, it would be ideal to
>>>>> utilize elements of Spark SQL, particularly Rows and Types, which could be
>>>>> moved into a local sql package.
>>>>> * This architecture may require some awkward APIs currently to have
>>>>> model prediction logic in mllib-local, local model classes in mllib-local,
>>>>> and regular (DataFrame-friendly) model classes in mllib.  We might find it
>>>>> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>>>>> architecture while making it feasible for 3rd party developers to extend
>>>>> MLlib APIs (especially in Java).
>>>>>
>>>> I agree this could be interesting, and feed into the other discussion
>>>> around when (or if) we should be considering Spark 3.0
>>>> I _think_ we could probably do it with optional traits people could mix
>>>> in to avoid breaking the current APIs but I could be wrong on that point.
>>>>
>>>>> * It could also be worth discussing local DataFrames.  They might not
>>>>> be as important as per-Row transformations, but they would be helpful for
>>>>> batching for higher throughput.
>>>>>
>>>> That could be interesting as well.
>>>>
>>>>>
>>>>> I'll be interested to hear others' thoughts too!
>>>>>
>>>>> Joseph
>>>>>
>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> Hi y'all,
>>>>>>
>>>>>> With the renewed interest in ML in Apache Spark now seems like a good
>>>>>> a time as any to revisit the online serving situation in Spark ML. DB &
>>>>>> other's have done some excellent working moving a lot of the necessary
>>>>>> tools into a local linear algebra package that doesn't depend on having a
>>>>>> SparkContext.
>>>>>>
>>>>>> There are a few different commercial and non-commercial solutions
>>>>>> round this, but currently our individual transform/predict methods are
>>>>>> private so they either need to copy or re-implement (or put them selves in
>>>>>> org.apache.spark) to access them. How would folks feel about adding a new
>>>>>> trait for ML pipeline stages to expose to do transformation of single
>>>>>> element inputs (or local collections) that could be optionally implemented
>>>>>> by stages which support this? That way we can have less copy and paste code
>>>>>> possibly getting out of sync with our model training.
>>>>>>
>>>>>> I think continuing to have on-line serving grow in different projects
>>>>>> is probably the right path, forward (folks have different needs), but I'd
>>>>>> love to see us make it simpler for other projects to build reliable serving
>>>>>> tools.
>>>>>>
>>>>>> I realize this maybe puts some of the folks in an awkward position
>>>>>> with their own commercial offerings, but hopefully if we make it easier for
>>>>>> everyone the commercial vendors can benefit as well.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Holden :)
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Joseph Bradley
>>>>>
>>>>> Software Engineer - Machine Learning
>>>>>
>>>>> Databricks, Inc.
>>>>>
>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Joseph Bradley
>>>
>>> Software Engineer - Machine Learning
>>>
>>> Databricks, Inc.
>>>
>>> [image: http://databricks.com] <http://databricks.com/>
>>>
>> --
>> --
>> Cheers,
>> Leif
>>
>


-- 
Twitter: https://twitter.com/holdenkarau

Re: Revisiting Online serving of Spark models?

Posted by Felix Cheung <fe...@hotmail.com>.
Bump.

________________________________
From: Felix Cheung <fe...@hotmail.com>
Sent: Saturday, May 26, 2018 1:05:29 PM
To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
Cc: Leif Walsh; Holden Karau; dev
Subject: Re: Revisiting Online serving of Spark models?

Hi! How about we meet the community and discuss on June 6 4pm at (near) the Summit?

(I propose we meet at the venue entrance so we could accommodate people might not be in the conference)

________________________________
From: Saikat Kanjilal <sx...@hotmail.com>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?

I’m in the same exact boat as Maximiliano and have use cases as well for model serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice <ma...@gmail.com>> wrote:

Hi!

I'm don't usually write a lot on this list but I keep up to date with the discussions and I'm a heavy user of Spark. This topic caught my attention, as we're currently facing this issue at work. I'm attending to the summit and was wondering if it would it be possible for me to join that meeting. I might be able to share some helpful usecases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <le...@gmail.com>> escribió:
I’m with you on json being more readable than parquet, but we’ve had success using pyarrow’s parquet reader and have been quite happy with it so far. If your target is python (and probably if not now, then soon, R), you should look in to it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley <jo...@databricks.com>> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  It's easier to parse JSON without Spark, and using the same format simplifies architecture.  Plus, some people want to check files into version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are around at the Spark Summit, that could be a good time to meet up & then post notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <fe...@hotmail.com>> wrote:
Specifically I’d like bring part of the discussion to Model and PipelineModel, and various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing  trained models outside of Spark for online serving.

What’s the next step? Would folks be interested in getting together to discuss/get some feedback?


_____________________________
From: Felix Cheung <fe...@hotmail.com>>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <ho...@pigscanfly.ca>>, Joseph Bradley <jo...@databricks.com>>
Cc: dev <de...@spark.apache.org>>



Huge +1 on this!

________________________________
From: holden.karau@gmail.com<ma...@gmail.com> <ho...@gmail.com>> on behalf of Holden Karau <ho...@pigscanfly.ca>>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jo...@databricks.com>> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession.  There are related commercial offerings like this : ) but the overhead of maintaining those offerings is pretty high.  Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.

We've talked about this need at Databricks and have also been syncing with the creators of MLeap.  It'd be great to get this functionality into Spark itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models.  Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
* This architecture may require some awkward APIs currently to have model prediction logic in mllib-local, local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd party developers to extend MLlib APIs (especially in Java).
I agree this could be interesting, and feed into the other discussion around when (or if) we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
That could be interesting as well.

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca>> wrote:
Hi y'all,

With the renewed interest in ML in Apache Spark now seems like a good a time as any to revisit the online serving situation in Spark ML. DB & other's have done some excellent working moving a lot of the necessary tools into a local linear algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions round this, but currently our individual transform/predict methods are private so they either need to copy or re-implement (or put them selves in org.apache.spark) to access them. How would folks feel about adding a new trait for ML pipeline stages to expose to do transformation of single element inputs (or local collections) that could be optionally implemented by stages which support this? That way we can have less copy and paste code possibly getting out of sync with our model training.

I think continuing to have on-line serving grow in different projects is probably the right path, forward (folks have different needs), but I'd love to see us make it simpler for other projects to build reliable serving tools.

I realize this maybe puts some of the folks in an awkward position with their own commercial offerings, but hopefully if we make it easier for everyone the commercial vendors can benefit as well.

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]<http://databricks.com/>



--
Twitter: https://twitter.com/holdenkarau





--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]<http://databricks.com/>

--
--
Cheers,
Leif

Re: Revisiting Online serving of Spark models?

Posted by Felix Cheung <fe...@hotmail.com>.
Hi! How about we meet the community and discuss on June 6 4pm at (near) the Summit?

(I propose we meet at the venue entrance so we could accommodate people might not be in the conference)

________________________________
From: Saikat Kanjilal <sx...@hotmail.com>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?

I’m in the same exact boat as Maximiliano and have use cases as well for model serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice <ma...@gmail.com>> wrote:

Hi!

I'm don't usually write a lot on this list but I keep up to date with the discussions and I'm a heavy user of Spark. This topic caught my attention, as we're currently facing this issue at work. I'm attending to the summit and was wondering if it would it be possible for me to join that meeting. I might be able to share some helpful usecases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <le...@gmail.com>> escribió:
I’m with you on json being more readable than parquet, but we’ve had success using pyarrow’s parquet reader and have been quite happy with it so far. If your target is python (and probably if not now, then soon, R), you should look in to it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley <jo...@databricks.com>> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  It's easier to parse JSON without Spark, and using the same format simplifies architecture.  Plus, some people want to check files into version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are around at the Spark Summit, that could be a good time to meet up & then post notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <fe...@hotmail.com>> wrote:
Specifically I’d like bring part of the discussion to Model and PipelineModel, and various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing  trained models outside of Spark for online serving.

What’s the next step? Would folks be interested in getting together to discuss/get some feedback?


_____________________________
From: Felix Cheung <fe...@hotmail.com>>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <ho...@pigscanfly.ca>>, Joseph Bradley <jo...@databricks.com>>
Cc: dev <de...@spark.apache.org>>



Huge +1 on this!

________________________________
From: holden.karau@gmail.com<ma...@gmail.com> <ho...@gmail.com>> on behalf of Holden Karau <ho...@pigscanfly.ca>>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jo...@databricks.com>> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession.  There are related commercial offerings like this : ) but the overhead of maintaining those offerings is pretty high.  Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.

We've talked about this need at Databricks and have also been syncing with the creators of MLeap.  It'd be great to get this functionality into Spark itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models.  Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
* This architecture may require some awkward APIs currently to have model prediction logic in mllib-local, local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd party developers to extend MLlib APIs (especially in Java).
I agree this could be interesting, and feed into the other discussion around when (or if) we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
That could be interesting as well.

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca>> wrote:
Hi y'all,

With the renewed interest in ML in Apache Spark now seems like a good a time as any to revisit the online serving situation in Spark ML. DB & other's have done some excellent working moving a lot of the necessary tools into a local linear algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions round this, but currently our individual transform/predict methods are private so they either need to copy or re-implement (or put them selves in org.apache.spark) to access them. How would folks feel about adding a new trait for ML pipeline stages to expose to do transformation of single element inputs (or local collections) that could be optionally implemented by stages which support this? That way we can have less copy and paste code possibly getting out of sync with our model training.

I think continuing to have on-line serving grow in different projects is probably the right path, forward (folks have different needs), but I'd love to see us make it simpler for other projects to build reliable serving tools.

I realize this maybe puts some of the folks in an awkward position with their own commercial offerings, but hopefully if we make it easier for everyone the commercial vendors can benefit as well.

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]<http://databricks.com/>



--
Twitter: https://twitter.com/holdenkarau





--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]<http://databricks.com/>

--
--
Cheers,
Leif

Re: Revisiting Online serving of Spark models?

Posted by Saikat Kanjilal <sx...@hotmail.com>.
I’m in the same exact boat as Maximiliano and have use cases as well for model serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice <ma...@gmail.com>> wrote:

Hi!

I'm don't usually write a lot on this list but I keep up to date with the discussions and I'm a heavy user of Spark. This topic caught my attention, as we're currently facing this issue at work. I'm attending to the summit and was wondering if it would it be possible for me to join that meeting. I might be able to share some helpful usecases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <le...@gmail.com>> escribió:
I’m with you on json being more readable than parquet, but we’ve had success using pyarrow’s parquet reader and have been quite happy with it so far. If your target is python (and probably if not now, then soon, R), you should look in to it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley <jo...@databricks.com>> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  It's easier to parse JSON without Spark, and using the same format simplifies architecture.  Plus, some people want to check files into version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are around at the Spark Summit, that could be a good time to meet up & then post notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <fe...@hotmail.com>> wrote:
Specifically I’d like bring part of the discussion to Model and PipelineModel, and various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing  trained models outside of Spark for online serving.

What’s the next step? Would folks be interested in getting together to discuss/get some feedback?


_____________________________
From: Felix Cheung <fe...@hotmail.com>>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <ho...@pigscanfly.ca>>, Joseph Bradley <jo...@databricks.com>>
Cc: dev <de...@spark.apache.org>>



Huge +1 on this!

________________________________
From: holden.karau@gmail.com<ma...@gmail.com> <ho...@gmail.com>> on behalf of Holden Karau <ho...@pigscanfly.ca>>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jo...@databricks.com> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession.  There are related commercial offerings like this : ) but the overhead of maintaining those offerings is pretty high.  Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.

We've talked about this need at Databricks and have also been syncing with the creators of MLeap.  It'd be great to get this functionality into Spark itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models.  Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
* This architecture may require some awkward APIs currently to have model prediction logic in mllib-local, local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd party developers to extend MLlib APIs (especially in Java).
I agree this could be interesting, and feed into the other discussion around when (or if) we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
That could be interesting as well.
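The batching point can be pictured with a pure-Python sketch (a made-up dot-product model, not any Spark API):

```python
# Why a local "batch" entry point can beat per-element calls:
# one call amortizes per-call overhead (dispatch, validation,
# serialization boundary crossings, ...).
weights = [0.5, -1.2, 2.0]

def predict_one(row):
    # Per-element path: one call per input row.
    return sum(w * x for w, x in zip(weights, row))

def predict_batch(rows):
    # Batched path: a single call over a local collection.
    return [predict_one(r) for r in rows]

rows = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
assert predict_batch(rows) == [predict_one(r) for r in rows]
```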

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca> wrote:
Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a time as any to revisit the online serving situation in Spark ML. DB & others have done some excellent work moving a lot of the necessary tools into a local linear algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions around this, but currently our individual transform/predict methods are private, so they either need to copy or re-implement them (or put themselves in org.apache.spark) to access them. How would folks feel about adding a new trait for ML pipeline stages to expose transformation of single element inputs (or local collections), which could be optionally implemented by stages that support this? That way we can have less copy-and-paste code possibly getting out of sync with our model training.
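One way to picture the proposed optional trait, as a pure-Python sketch (the real thing would be a Scala trait mixed into pipeline stages; every name here is hypothetical):

```python
from abc import ABC, abstractmethod

class LocalTransformer(ABC):
    """Hypothetical opt-in mixin: stages that can transform a single
    element (or a local collection) without a SparkContext."""

    @abstractmethod
    def transform_local(self, row: dict) -> dict: ...

    def transform_local_batch(self, rows):
        # Default local-collection path, expressed via the
        # single-element path so stages only implement one method.
        return [self.transform_local(r) for r in rows]

class Squarer(LocalTransformer):
    # Toy stage opting in to local serving.
    def transform_local(self, row):
        return {**row, "out": row["x"] ** 2}

stage = Squarer()
assert stage.transform_local({"x": 3})["out"] == 9
```

Stages that can't serve locally simply wouldn't mix the trait in, leaving existing APIs untouched.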

I think continuing to have online serving grow in different projects is probably the right path forward (folks have different needs), but I'd love to see us make it simpler for other projects to build reliable serving tools.

I realize this may put some of the folks with their own commercial offerings in an awkward position, but hopefully if we make it easier for everyone, the commercial vendors can benefit as well.

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]<http://databricks.com/>



--
Twitter: https://twitter.com/holdenkarau





--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]<http://databricks.com/>

--
--
Cheers,
Leif

Re: Revisiting Online serving of Spark models?

Posted by Maximiliano Felice <ma...@gmail.com>.
Hi!

I don't usually write a lot on this list, but I keep up to date with the
discussions and I'm a heavy user of Spark. This topic caught my attention,
as we're currently facing this issue at work. I'm attending the summit
and was wondering if it would be possible for me to join that meeting. I
might be able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

On Tue., May 22, 2018 at 9:14 AM, Leif Walsh <le...@gmail.com>
wrote:

> I’m with you on json being more readable than parquet, but we’ve had
> success using pyarrow’s parquet reader and have been quite happy with it so
> far. If your target is python (and probably if not now, then soon, R), you
> should look in to it.
>
> On Mon, May 21, 2018 at 16:52 Joseph Bradley <jo...@databricks.com>
> wrote:
>
>> Regarding model reading and writing, I'll give quick thoughts here:
>> * Our approach was to use the same format but write JSON instead of
>> Parquet.  It's easier to parse JSON without Spark, and using the same
>> format simplifies architecture.  Plus, some people want to check files into
>> version control, and JSON is nice for that.
>> * The reader/writer APIs could be extended to take format parameters
>> (just like DataFrame reader/writers) to handle JSON (and maybe, eventually,
>> handle Parquet in the online serving setting).
>>
>> This would be a big project, so proposing a SPIP might be best.  If
>> people are around at the Spark Summit, that could be a good time to meet up
>> & then post notes back to the dev list.
>>
>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <fe...@hotmail.com>
>> wrote:
>>
>>> Specifically I’d like bring part of the discussion to Model and
>>> PipelineModel, and various ModelReader and SharedReadWrite implementations
>>> that rely on SparkContext. This is a big blocker on reusing  trained models
>>> outside of Spark for online serving.
>>>
>>> What’s the next step? Would folks be interested in getting together to
>>> discuss/get some feedback?
>>>
>>>
>>> _____________________________
>>> From: Felix Cheung <fe...@hotmail.com>
>>> Sent: Thursday, May 10, 2018 10:10 AM
>>> Subject: Re: Revisiting Online serving of Spark models?
>>> To: Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <
>>> joseph@databricks.com>
>>> Cc: dev <de...@spark.apache.org>
>>>
>>>
>>>
>>> Huge +1 on this!
>>>
>>> ------------------------------
>>> *From:* holden.karau@gmail.com <ho...@gmail.com> on behalf of
>>> Holden Karau <ho...@pigscanfly.ca>
>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>> *To:* Joseph Bradley
>>> *Cc:* dev
>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>
>>>
>>>
>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jo...@databricks.com>
>>> wrote:
>>>
>>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>>>
>>>> Awesome! I'm glad other folks think something like this belongs in
>>> Spark.
>>>
>>>> This was one of the original goals for mllib-local: to have local
>>>> versions of MLlib models which could be deployed without the big Spark JARs
>>>> and without a SparkContext or SparkSession.  There are related commercial
>>>> offerings like this : ) but the overhead of maintaining those offerings is
>>>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>>>> libraries will be well worth it.
>>>>
>>>> We've talked about this need at Databricks and have also been syncing
>>>> with the creators of MLeap.  It'd be great to get this functionality into
>>>> Spark itself.  Some thoughts:
>>>> * It'd be valuable to have this go beyond adding transform() methods
>>>> taking a Row to the current Models.  Instead, it would be ideal to have
>>>> local, lightweight versions of models in mllib-local, outside of the main
>>>> mllib package (for easier deployment with smaller & fewer dependencies).
>>>> * Supporting Pipelines is important.  For this, it would be ideal to
>>>> utilize elements of Spark SQL, particularly Rows and Types, which could be
>>>> moved into a local sql package.
>>>> * This architecture may require some awkward APIs currently to have
>>>> model prediction logic in mllib-local, local model classes in mllib-local,
>>>> and regular (DataFrame-friendly) model classes in mllib.  We might find it
>>>> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>>>> architecture while making it feasible for 3rd party developers to extend
>>>> MLlib APIs (especially in Java).
>>>>
>>> I agree this could be interesting, and feed into the other discussion
>>> around when (or if) we should be considering Spark 3.0
>>> I _think_ we could probably do it with optional traits people could mix
>>> in to avoid breaking the current APIs but I could be wrong on that point.
>>>
>>>> * It could also be worth discussing local DataFrames.  They might not
>>>> be as important as per-Row transformations, but they would be helpful for
>>>> batching for higher throughput.
>>>>
>>> That could be interesting as well.
>>>
>>>>
>>>> I'll be interested to hear others' thoughts too!
>>>>
>>>> Joseph
>>>>
>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> Hi y'all,
>>>>>
>>>>> With the renewed interest in ML in Apache Spark now seems like a good
>>>>> a time as any to revisit the online serving situation in Spark ML. DB &
>>>>> other's have done some excellent working moving a lot of the necessary
>>>>> tools into a local linear algebra package that doesn't depend on having a
>>>>> SparkContext.
>>>>>
>>>>> There are a few different commercial and non-commercial solutions
>>>>> round this, but currently our individual transform/predict methods are
>>>>> private so they either need to copy or re-implement (or put them selves in
>>>>> org.apache.spark) to access them. How would folks feel about adding a new
>>>>> trait for ML pipeline stages to expose to do transformation of single
>>>>> element inputs (or local collections) that could be optionally implemented
>>>>> by stages which support this? That way we can have less copy and paste code
>>>>> possibly getting out of sync with our model training.
>>>>>
>>>>> I think continuing to have on-line serving grow in different projects
>>>>> is probably the right path, forward (folks have different needs), but I'd
>>>>> love to see us make it simpler for other projects to build reliable serving
>>>>> tools.
>>>>>
>>>>> I realize this maybe puts some of the folks in an awkward position
>>>>> with their own commercial offerings, but hopefully if we make it easier for
>>>>> everyone the commercial vendors can benefit as well.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Holden :)
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Joseph Bradley
>>>>
>>>> Software Engineer - Machine Learning
>>>>
>>>> Databricks, Inc.
>>>>
>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>
>>>
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>>
>>>
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>> [image: http://databricks.com] <http://databricks.com/>
>>
> --
> --
> Cheers,
> Leif
>

Re: Revisiting Online serving of Spark models?

Posted by Leif Walsh <le...@gmail.com>.
I’m with you on JSON being more readable than Parquet, but we’ve had
success using pyarrow’s Parquet reader and have been quite happy with it so
far. If your target is Python (and probably, if not now then soon, R), you
should look into it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley <jo...@databricks.com> wrote:

> Regarding model reading and writing, I'll give quick thoughts here:
> * Our approach was to use the same format but write JSON instead of
> Parquet.  It's easier to parse JSON without Spark, and using the same
> format simplifies architecture.  Plus, some people want to check files into
> version control, and JSON is nice for that.
> * The reader/writer APIs could be extended to take format parameters (just
> like DataFrame reader/writers) to handle JSON (and maybe, eventually,
> handle Parquet in the online serving setting).
>
> This would be a big project, so proposing a SPIP might be best.  If people
> are around at the Spark Summit, that could be a good time to meet up & then
> post notes back to the dev list.
>
> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <fe...@hotmail.com>
> wrote:
>
>> Specifically I’d like bring part of the discussion to Model and
>> PipelineModel, and various ModelReader and SharedReadWrite implementations
>> that rely on SparkContext. This is a big blocker on reusing  trained models
>> outside of Spark for online serving.
>>
>> What’s the next step? Would folks be interested in getting together to
>> discuss/get some feedback?
>>
>>
>> _____________________________
>> From: Felix Cheung <fe...@hotmail.com>
>> Sent: Thursday, May 10, 2018 10:10 AM
>> Subject: Re: Revisiting Online serving of Spark models?
>> To: Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <
>> joseph@databricks.com>
>> Cc: dev <de...@spark.apache.org>
>>
>>
>>
>> Huge +1 on this!
>>
>> ------------------------------
>> *From:* holden.karau@gmail.com <ho...@gmail.com> on behalf of
>> Holden Karau <ho...@pigscanfly.ca>
>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>> *To:* Joseph Bradley
>> *Cc:* dev
>> *Subject:* Re: Revisiting Online serving of Spark models?
>>
>>
>>
>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jo...@databricks.com>
>> wrote:
>>
>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>>
>>> Awesome! I'm glad other folks think something like this belongs in Spark.
>>
>>> This was one of the original goals for mllib-local: to have local
>>> versions of MLlib models which could be deployed without the big Spark JARs
>>> and without a SparkContext or SparkSession.  There are related commercial
>>> offerings like this : ) but the overhead of maintaining those offerings is
>>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>>> libraries will be well worth it.
>>>
>>> We've talked about this need at Databricks and have also been syncing
>>> with the creators of MLeap.  It'd be great to get this functionality into
>>> Spark itself.  Some thoughts:
>>> * It'd be valuable to have this go beyond adding transform() methods
>>> taking a Row to the current Models.  Instead, it would be ideal to have
>>> local, lightweight versions of models in mllib-local, outside of the main
>>> mllib package (for easier deployment with smaller & fewer dependencies).
>>> * Supporting Pipelines is important.  For this, it would be ideal to
>>> utilize elements of Spark SQL, particularly Rows and Types, which could be
>>> moved into a local sql package.
>>> * This architecture may require some awkward APIs currently to have
>>> model prediction logic in mllib-local, local model classes in mllib-local,
>>> and regular (DataFrame-friendly) model classes in mllib.  We might find it
>>> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>>> architecture while making it feasible for 3rd party developers to extend
>>> MLlib APIs (especially in Java).
>>>
>> I agree this could be interesting, and feed into the other discussion
>> around when (or if) we should be considering Spark 3.0
>> I _think_ we could probably do it with optional traits people could mix
>> in to avoid breaking the current APIs but I could be wrong on that point.
>>
>>> * It could also be worth discussing local DataFrames.  They might not be
>>> as important as per-Row transformations, but they would be helpful for
>>> batching for higher throughput.
>>>
>> That could be interesting as well.
>>
>>>
>>> I'll be interested to hear others' thoughts too!
>>>
>>> Joseph
>>>
>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca>
>>> wrote:
>>>
>>>> Hi y'all,
>>>>
>>>> With the renewed interest in ML in Apache Spark now seems like a good a
>>>> time as any to revisit the online serving situation in Spark ML. DB &
>>>> other's have done some excellent working moving a lot of the necessary
>>>> tools into a local linear algebra package that doesn't depend on having a
>>>> SparkContext.
>>>>
>>>> There are a few different commercial and non-commercial solutions round
>>>> this, but currently our individual transform/predict methods are private so
>>>> they either need to copy or re-implement (or put them selves in
>>>> org.apache.spark) to access them. How would folks feel about adding a new
>>>> trait for ML pipeline stages to expose to do transformation of single
>>>> element inputs (or local collections) that could be optionally implemented
>>>> by stages which support this? That way we can have less copy and paste code
>>>> possibly getting out of sync with our model training.
>>>>
>>>> I think continuing to have on-line serving grow in different projects
>>>> is probably the right path, forward (folks have different needs), but I'd
>>>> love to see us make it simpler for other projects to build reliable serving
>>>> tools.
>>>>
>>>> I realize this maybe puts some of the folks in an awkward position with
>>>> their own commercial offerings, but hopefully if we make it easier for
>>>> everyone the commercial vendors can benefit as well.
>>>>
>>>> Cheers,
>>>>
>>>> Holden :)
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Joseph Bradley
>>>
>>> Software Engineer - Machine Learning
>>>
>>> Databricks, Inc.
>>>
>>> [image: http://databricks.com] <http://databricks.com/>
>>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] <http://databricks.com/>
>
-- 
-- 
Cheers,
Leif

Re: Revisiting Online serving of Spark models?

Posted by Felix Cheung <fe...@hotmail.com>.
+1 on meeting up!

________________________________
From: Holden Karau <ho...@pigscanfly.ca>
Sent: Monday, May 21, 2018 2:52:20 PM
To: Joseph Bradley
Cc: Felix Cheung; dev
Subject: Re: Revisiting Online serving of Spark models?

(Oh also the write API has already been extended to take formats).

On Mon, May 21, 2018 at 2:51 PM Holden Karau <ho...@pigscanfly.ca> wrote:
I like that idea. I’ll be around Spark Summit.

On Mon, May 21, 2018 at 1:52 PM Joseph Bradley <jo...@databricks.com> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  It's easier to parse JSON without Spark, and using the same format simplifies architecture.  Plus, some people want to check files into version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are around at the Spark Summit, that could be a good time to meet up & then post notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <fe...@hotmail.com> wrote:
Specifically I’d like to bring part of the discussion to Model and PipelineModel, and the various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing trained models outside of Spark for online serving.

What’s the next step? Would folks be interested in getting together to discuss/get some feedback?


_____________________________
From: Felix Cheung <fe...@hotmail.com>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <jo...@databricks.com>
Cc: dev <de...@spark.apache.org>



Huge +1 on this!

________________________________
From: holden.karau@gmail.com<ma...@gmail.com> <ho...@gmail.com> on behalf of Holden Karau <ho...@pigscanfly.ca>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jo...@databricks.com> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession.  There are related commercial offerings like this : ) but the overhead of maintaining those offerings is pretty high.  Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.

We've talked about this need at Databricks and have also been syncing with the creators of MLeap.  It'd be great to get this functionality into Spark itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models.  Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
* This architecture may require some awkward APIs currently to have model prediction logic in mllib-local, local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd party developers to extend MLlib APIs (especially in Java).
I agree this could be interesting, and feed into the other discussion around when (or if) we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
That could be interesting as well.

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca> wrote:
Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a time as any to revisit the online serving situation in Spark ML. DB & others have done some excellent work moving a lot of the necessary tools into a local linear algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions around this, but currently our individual transform/predict methods are private, so they either need to copy or re-implement them (or put themselves in org.apache.spark) to access them. How would folks feel about adding a new trait for ML pipeline stages to expose transformation of single element inputs (or local collections), which could be optionally implemented by stages that support this? That way we can have less copy-and-paste code possibly getting out of sync with our model training.

I think continuing to have online serving grow in different projects is probably the right path forward (folks have different needs), but I'd love to see us make it simpler for other projects to build reliable serving tools.

I realize this may put some of the folks with their own commercial offerings in an awkward position, but hopefully if we make it easier for everyone, the commercial vendors can benefit as well.

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]<http://databricks.com/>



--
Twitter: https://twitter.com/holdenkarau





--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]<http://databricks.com/>

--
Twitter: https://twitter.com/holdenkarau
--
Twitter: https://twitter.com/holdenkarau

Re: Revisiting Online serving of Spark models?

Posted by Holden Karau <ho...@pigscanfly.ca>.
(Oh also the write API has already been extended to take formats).

On Mon, May 21, 2018 at 2:51 PM Holden Karau <ho...@pigscanfly.ca> wrote:

> I like that idea. I’ll be around Spark Summit.
>
> On Mon, May 21, 2018 at 1:52 PM Joseph Bradley <jo...@databricks.com>
> wrote:
>
>> Regarding model reading and writing, I'll give quick thoughts here:
>> * Our approach was to use the same format but write JSON instead of
>> Parquet.  It's easier to parse JSON without Spark, and using the same
>> format simplifies architecture.  Plus, some people want to check files into
>> version control, and JSON is nice for that.
>> * The reader/writer APIs could be extended to take format parameters
>> (just like DataFrame reader/writers) to handle JSON (and maybe, eventually,
>> handle Parquet in the online serving setting).
>>
>> This would be a big project, so proposing a SPIP might be best.  If
>> people are around at the Spark Summit, that could be a good time to meet up
>> & then post notes back to the dev list.
>>
>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <fe...@hotmail.com>
>> wrote:
>>
>>> Specifically I’d like bring part of the discussion to Model and
>>> PipelineModel, and various ModelReader and SharedReadWrite implementations
>>> that rely on SparkContext. This is a big blocker on reusing  trained models
>>> outside of Spark for online serving.
>>>
>>> What’s the next step? Would folks be interested in getting together to
>>> discuss/get some feedback?
>>>
>>>
>>> _____________________________
>>> From: Felix Cheung <fe...@hotmail.com>
>>> Sent: Thursday, May 10, 2018 10:10 AM
>>> Subject: Re: Revisiting Online serving of Spark models?
>>> To: Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <
>>> joseph@databricks.com>
>>> Cc: dev <de...@spark.apache.org>
>>>
>>>
>>>
>>> Huge +1 on this!
>>>
>>> ------------------------------
>>> *From:* holden.karau@gmail.com <ho...@gmail.com> on behalf of
>>> Holden Karau <ho...@pigscanfly.ca>
>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>> *To:* Joseph Bradley
>>> *Cc:* dev
>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>
>>>
>>>
>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jo...@databricks.com>
>>> wrote:
>>>
>>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>>>
>>>> Awesome! I'm glad other folks think something like this belongs in
>>> Spark.
>>>
>>>> This was one of the original goals for mllib-local: to have local
>>>> versions of MLlib models which could be deployed without the big Spark JARs
>>>> and without a SparkContext or SparkSession.  There are related commercial
>>>> offerings like this : ) but the overhead of maintaining those offerings is
>>>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>>>> libraries will be well worth it.
>>>>
>>>> We've talked about this need at Databricks and have also been syncing
>>>> with the creators of MLeap.  It'd be great to get this functionality into
>>>> Spark itself.  Some thoughts:
>>>> * It'd be valuable to have this go beyond adding transform() methods
>>>> taking a Row to the current Models.  Instead, it would be ideal to have
>>>> local, lightweight versions of models in mllib-local, outside of the main
>>>> mllib package (for easier deployment with smaller & fewer dependencies).
>>>> * Supporting Pipelines is important.  For this, it would be ideal to
>>>> utilize elements of Spark SQL, particularly Rows and Types, which could be
>>>> moved into a local sql package.
>>>> * This architecture may require some awkward APIs currently to have
>>>> model prediction logic in mllib-local, local model classes in mllib-local,
>>>> and regular (DataFrame-friendly) model classes in mllib.  We might find it
>>>> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>>>> architecture while making it feasible for 3rd party developers to extend
>>>> MLlib APIs (especially in Java).
>>>>
>>> I agree this could be interesting, and feed into the other discussion
>>> around when (or if) we should be considering Spark 3.0
>>> I _think_ we could probably do it with optional traits people could mix
>>> in to avoid breaking the current APIs but I could be wrong on that point.
>>>
>>>> * It could also be worth discussing local DataFrames.  They might not
>>>> be as important as per-Row transformations, but they would be helpful for
>>>> batching for higher throughput.
>>>>
>>> That could be interesting as well.
>>>
>>>>
>>>> I'll be interested to hear others' thoughts too!
>>>>
>>>> Joseph
>>>>
>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> Hi y'all,
>>>>>
>>>>> With the renewed interest in ML in Apache Spark now seems like a good
>>>>> a time as any to revisit the online serving situation in Spark ML. DB &
>>>>> other's have done some excellent working moving a lot of the necessary
>>>>> tools into a local linear algebra package that doesn't depend on having a
>>>>> SparkContext.
>>>>>
>>>>> There are a few different commercial and non-commercial solutions
>>>>> round this, but currently our individual transform/predict methods are
>>>>> private so they either need to copy or re-implement (or put them selves in
>>>>> org.apache.spark) to access them. How would folks feel about adding a new
>>>>> trait for ML pipeline stages to expose to do transformation of single
>>>>> element inputs (or local collections) that could be optionally implemented
>>>>> by stages which support this? That way we can have less copy and paste code
>>>>> possibly getting out of sync with our model training.
>>>>>
>>>>> I think continuing to have on-line serving grow in different projects
>>>>> is probably the right path, forward (folks have different needs), but I'd
>>>>> love to see us make it simpler for other projects to build reliable serving
>>>>> tools.
>>>>>
>>>>> I realize this maybe puts some of the folks in an awkward position
>>>>> with their own commercial offerings, but hopefully if we make it easier for
>>>>> everyone the commercial vendors can benefit as well.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Holden :)
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Joseph Bradley
>>>>
>>>> Software Engineer - Machine Learning
>>>>
>>>> Databricks, Inc.
>>>>
>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>
>>>
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>>
>>>
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>> [image: http://databricks.com] <http://databricks.com/>
>>
> --
> Twitter: https://twitter.com/holdenkarau
>
-- 
Twitter: https://twitter.com/holdenkarau

Re: Revisiting Online serving of Spark models?

Posted by Holden Karau <ho...@pigscanfly.ca>.
I like that idea. I’ll be around Spark Summit.

On Mon, May 21, 2018 at 1:52 PM Joseph Bradley <jo...@databricks.com>
wrote:

> Regarding model reading and writing, I'll give quick thoughts here:
> * Our approach was to use the same format but write JSON instead of
> Parquet.  It's easier to parse JSON without Spark, and using the same
> format simplifies architecture.  Plus, some people want to check files into
> version control, and JSON is nice for that.
> * The reader/writer APIs could be extended to take format parameters (just
> like DataFrame reader/writers) to handle JSON (and maybe, eventually,
> handle Parquet in the online serving setting).
>
> This would be a big project, so proposing a SPIP might be best.  If people
> are around at the Spark Summit, that could be a good time to meet up & then
> post notes back to the dev list.
>
> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <fe...@hotmail.com>
> wrote:
>
>> Specifically, I’d like to bring part of the discussion to Model and
>> PipelineModel, and the various ModelReader and SharedReadWrite implementations
>> that rely on SparkContext. This is a big blocker on reusing trained models
>> outside of Spark for online serving.
>>
>> What’s the next step? Would folks be interested in getting together to
>> discuss/get some feedback?
>>
>>
>> _____________________________
>> From: Felix Cheung <fe...@hotmail.com>
>> Sent: Thursday, May 10, 2018 10:10 AM
>> Subject: Re: Revisiting Online serving of Spark models?
>> To: Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <
>> joseph@databricks.com>
>> Cc: dev <de...@spark.apache.org>
>>
>>
>>
>> Huge +1 on this!
>>
>> ------------------------------
>> *From:* holden.karau@gmail.com <ho...@gmail.com> on behalf of
>> Holden Karau <ho...@pigscanfly.ca>
>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>> *To:* Joseph Bradley
>> *Cc:* dev
>> *Subject:* Re: Revisiting Online serving of Spark models?
>>
>>
>>
>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jo...@databricks.com>
>> wrote:
>>
>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>>
>>> Awesome! I'm glad other folks think something like this belongs in Spark.
>>
>>> This was one of the original goals for mllib-local: to have local
>>> versions of MLlib models which could be deployed without the big Spark JARs
>>> and without a SparkContext or SparkSession.  There are related commercial
>>> offerings like this : ) but the overhead of maintaining those offerings is
>>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>>> libraries will be well worth it.
>>>
>>> We've talked about this need at Databricks and have also been syncing
>>> with the creators of MLeap.  It'd be great to get this functionality into
>>> Spark itself.  Some thoughts:
>>> * It'd be valuable to have this go beyond adding transform() methods
>>> taking a Row to the current Models.  Instead, it would be ideal to have
>>> local, lightweight versions of models in mllib-local, outside of the main
>>> mllib package (for easier deployment with smaller & fewer dependencies).
>>> * Supporting Pipelines is important.  For this, it would be ideal to
>>> utilize elements of Spark SQL, particularly Rows and Types, which could be
>>> moved into a local sql package.
>>> * This architecture may require some awkward APIs currently to have
>>> model prediction logic in mllib-local, local model classes in mllib-local,
>>> and regular (DataFrame-friendly) model classes in mllib.  We might find it
>>> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>>> architecture while making it feasible for 3rd party developers to extend
>>> MLlib APIs (especially in Java).
>>>
>> I agree this could be interesting, and feed into the other discussion
>> around when (or if) we should be considering Spark 3.0
>> I _think_ we could probably do it with optional traits people could mix
>> in to avoid breaking the current APIs but I could be wrong on that point.
>>
>>> * It could also be worth discussing local DataFrames.  They might not be
>>> as important as per-Row transformations, but they would be helpful for
>>> batching for higher throughput.
>>>
>> That could be interesting as well.
>>
>>>
>>> I'll be interested to hear others' thoughts too!
>>>
>>> Joseph
>>>
>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca>
>>> wrote:
>>>
>>>> Hi y'all,
>>>>
>>>> With the renewed interest in ML in Apache Spark, now seems like as good a
>>>> time as any to revisit the online serving situation in Spark ML. DB &
>>>> others have done some excellent work moving a lot of the necessary
>>>> tools into a local linear algebra package that doesn't depend on having a
>>>> SparkContext.
>>>>
>>>> There are a few different commercial and non-commercial solutions around
>>>> this, but currently our individual transform/predict methods are private,
>>>> so they either need to copy or re-implement them (or put themselves in
>>>> org.apache.spark) to access them. How would folks feel about adding a new
>>>> trait for ML pipeline stages to expose transformation of single-element
>>>> inputs (or local collections), optionally implemented
>>>> by stages which support this? That way we would have less copy-and-paste
>>>> code that could get out of sync with our model training.
>>>>
>>>> I think continuing to have online serving grow in different projects
>>>> is probably the right path forward (folks have different needs), but I'd
>>>> love to see us make it simpler for other projects to build reliable serving
>>>> tools.
>>>>
>>>> I realize this may put some folks in an awkward position with
>>>> their own commercial offerings, but hopefully if we make it easier for
>>>> everyone the commercial vendors can benefit as well.
>>>>
>>>> Cheers,
>>>>
>>>> Holden :)
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Joseph Bradley
>>>
>>> Software Engineer - Machine Learning
>>>
>>> Databricks, Inc.
>>>
>>> [image: http://databricks.com] <http://databricks.com/>
>>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] <http://databricks.com/>
>
-- 
Twitter: https://twitter.com/holdenkarau

Re: Revisiting Online serving of Spark models?

Posted by Joseph Bradley <jo...@databricks.com>.
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of
Parquet.  It's easier to parse JSON without Spark, and using the same
format simplifies architecture.  Plus, some people want to check files into
version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just
like DataFrame reader/writers) to handle JSON (and maybe, eventually,
handle Parquet in the online serving setting).
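
As a rough illustration of the JSON-instead-of-Parquet idea above, a sketch of
what a JSON-backed model writer/reader could look like. This is hypothetical
(the field names only loosely mirror the shape of Spark ML's metadata, and
`write_model_json`/`read_model_json` are made-up helpers, not a real Spark API):

```python
import json

# Hypothetical sketch: a JSON model format in place of Parquet.
# The field names are illustrative, not Spark's actual on-disk schema.
def write_model_json(path, model_class, params, coefficients):
    doc = {
        "class": model_class,
        "paramMap": params,
        # Plain lists keep the file human-readable and diff-friendly,
        # which is what makes JSON nice for version control.
        "coefficients": coefficients,
    }
    with open(path, "w") as f:
        # indent + sorted keys give stable, reviewable diffs
        json.dump(doc, f, indent=2, sort_keys=True)

def read_model_json(path):
    with open(path) as f:
        return json.load(f)
```

A format parameter on the existing reader/writer APIs (as suggested above)
could then simply dispatch between a Parquet path and something like this.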

This would be a big project, so proposing a SPIP might be best.  If people
are around at the Spark Summit, that could be a good time to meet up & then
post notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <fe...@hotmail.com>
wrote:

> Specifically, I’d like to bring part of the discussion to Model and
> PipelineModel, and the various ModelReader and SharedReadWrite implementations
> that rely on SparkContext. This is a big blocker on reusing trained models
> outside of Spark for online serving.
>
> What’s the next step? Would folks be interested in getting together to
> discuss/get some feedback?
>
>
> _____________________________
> From: Felix Cheung <fe...@hotmail.com>
> Sent: Thursday, May 10, 2018 10:10 AM
> Subject: Re: Revisiting Online serving of Spark models?
> To: Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <
> joseph@databricks.com>
> Cc: dev <de...@spark.apache.org>
>
>
>
> Huge +1 on this!
>
> ------------------------------
> *From:* holden.karau@gmail.com <ho...@gmail.com> on behalf of
> Holden Karau <ho...@pigscanfly.ca>
> *Sent:* Thursday, May 10, 2018 9:39:26 AM
> *To:* Joseph Bradley
> *Cc:* dev
> *Subject:* Re: Revisiting Online serving of Spark models?
>
>
>
> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jo...@databricks.com>
> wrote:
>
>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>
>> Awesome! I'm glad other folks think something like this belongs in Spark.
>
>> This was one of the original goals for mllib-local: to have local
>> versions of MLlib models which could be deployed without the big Spark JARs
>> and without a SparkContext or SparkSession.  There are related commercial
>> offerings like this : ) but the overhead of maintaining those offerings is
>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>> libraries will be well worth it.
>>
>> We've talked about this need at Databricks and have also been syncing
>> with the creators of MLeap.  It'd be great to get this functionality into
>> Spark itself.  Some thoughts:
>> * It'd be valuable to have this go beyond adding transform() methods
>> taking a Row to the current Models.  Instead, it would be ideal to have
>> local, lightweight versions of models in mllib-local, outside of the main
>> mllib package (for easier deployment with smaller & fewer dependencies).
>> * Supporting Pipelines is important.  For this, it would be ideal to
>> utilize elements of Spark SQL, particularly Rows and Types, which could be
>> moved into a local sql package.
>> * This architecture may require some awkward APIs currently to have model
>> prediction logic in mllib-local, local model classes in mllib-local, and
>> regular (DataFrame-friendly) model classes in mllib.  We might find it
>> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>> architecture while making it feasible for 3rd party developers to extend
>> MLlib APIs (especially in Java).
>>
> I agree this could be interesting, and feed into the other discussion
> around when (or if) we should be considering Spark 3.0
> I _think_ we could probably do it with optional traits people could mix in
> to avoid breaking the current APIs but I could be wrong on that point.
>
>> * It could also be worth discussing local DataFrames.  They might not be
>> as important as per-Row transformations, but they would be helpful for
>> batching for higher throughput.
>>
> That could be interesting as well.
>
>>
>> I'll be interested to hear others' thoughts too!
>>
>> Joseph
>>
>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca>
>> wrote:
>>
>>> Hi y'all,
>>>
>>> With the renewed interest in ML in Apache Spark, now seems like as good a
>>> time as any to revisit the online serving situation in Spark ML. DB &
>>> others have done some excellent work moving a lot of the necessary
>>> tools into a local linear algebra package that doesn't depend on having a
>>> SparkContext.
>>>
>>> There are a few different commercial and non-commercial solutions around
>>> this, but currently our individual transform/predict methods are private,
>>> so they either need to copy or re-implement them (or put themselves in
>>> org.apache.spark) to access them. How would folks feel about adding a new
>>> trait for ML pipeline stages to expose transformation of single-element
>>> inputs (or local collections), optionally implemented
>>> by stages which support this? That way we would have less copy-and-paste
>>> code that could get out of sync with our model training.
>>>
>>> I think continuing to have online serving grow in different projects is
>>> probably the right path forward (folks have different needs), but I'd love
>>> to see us make it simpler for other projects to build reliable serving
>>> tools.
>>>
>>> I realize this may put some folks in an awkward position with
>>> their own commercial offerings, but hopefully if we make it easier for
>>> everyone the commercial vendors can benefit as well.
>>>
>>> Cheers,
>>>
>>> Holden :)
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>> [image: http://databricks.com] <http://databricks.com/>
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>

Re: Revisiting Online serving of Spark models?

Posted by Felix Cheung <fe...@hotmail.com>.
Specifically, I’d like to bring part of the discussion to Model and PipelineModel, and the various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing trained models outside of Spark for online serving.
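
To make the blocker concrete, a sketch of reading just the metadata half of a
saved stage without any Spark dependency. It assumes the directory layout that
current Spark releases happen to write (a `metadata/` directory holding one
JSON line in a Hadoop-style part file, next to a Parquet `data/` directory);
that layout is an internal detail, not a stable API, and `read_stage_metadata`
is a made-up helper:

```python
import json
import os

# Assumed (internal, unstable) layout of Model.save(path):
#   path/metadata/part-00000   <- one JSON line of class + params
#   path/data/...              <- Parquet (still needs a Parquet reader)
def read_stage_metadata(model_path):
    meta_dir = os.path.join(model_path, "metadata")
    # Spark writes the JSON as a Hadoop-style part file
    part = next(p for p in sorted(os.listdir(meta_dir))
                if p.startswith("part-"))
    with open(os.path.join(meta_dir, part)) as f:
        return json.loads(f.readline())
```

The metadata comes out fine with stdlib JSON; it's the Parquet `data/` half
(and the SparkContext baked into the readers) that forces a Spark runtime today.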

What’s the next step? Would folks be interested in getting together to discuss/get some feedback?


_____________________________
From: Felix Cheung <fe...@hotmail.com>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <ho...@pigscanfly.ca>, Joseph Bradley <jo...@databricks.com>
Cc: dev <de...@spark.apache.org>


Huge +1 on this!

________________________________
From: holden.karau@gmail.com <ho...@gmail.com> on behalf of Holden Karau <ho...@pigscanfly.ca>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jo...@databricks.com>> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession.  There are related commercial offerings like this : ) but the overhead of maintaining those offerings is pretty high.  Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.

We've talked about this need at Databricks and have also been syncing with the creators of MLeap.  It'd be great to get this functionality into Spark itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models.  Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
* This architecture may require some awkward APIs currently to have model prediction logic in mllib-local, local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd party developers to extend MLlib APIs (especially in Java).
I agree this could be interesting, and feed into the other discussion around when (or if) we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
That could be interesting as well.

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca>> wrote:
Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a time as any to revisit the online serving situation in Spark ML. DB & others have done some excellent work moving a lot of the necessary tools into a local linear algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions around this, but currently our individual transform/predict methods are private, so they either need to copy or re-implement them (or put themselves in org.apache.spark) to access them. How would folks feel about adding a new trait for ML pipeline stages to expose transformation of single-element inputs (or local collections), optionally implemented by stages which support this? That way we would have less copy-and-paste code that could get out of sync with our model training.

I think continuing to have online serving grow in different projects is probably the right path forward (folks have different needs), but I'd love to see us make it simpler for other projects to build reliable serving tools.

I realize this may put some folks in an awkward position with their own commercial offerings, but hopefully if we make it easier for everyone the commercial vendors can benefit as well.

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]<http://databricks.com/>



--
Twitter: https://twitter.com/holdenkarau



Re: Revisiting Online serving of Spark models?

Posted by Felix Cheung <fe...@hotmail.com>.
Huge +1 on this!

________________________________
From: holden.karau@gmail.com <ho...@gmail.com> on behalf of Holden Karau <ho...@pigscanfly.ca>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jo...@databricks.com>> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession.  There are related commercial offerings like this : ) but the overhead of maintaining those offerings is pretty high.  Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.

We've talked about this need at Databricks and have also been syncing with the creators of MLeap.  It'd be great to get this functionality into Spark itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models.  Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
* This architecture may require some awkward APIs currently to have model prediction logic in mllib-local, local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd party developers to extend MLlib APIs (especially in Java).
I agree this could be interesting, and feed into the other discussion around when (or if) we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
That could be interesting as well.

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca>> wrote:
Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a time as any to revisit the online serving situation in Spark ML. DB & others have done some excellent work moving a lot of the necessary tools into a local linear algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions around this, but currently our individual transform/predict methods are private, so they either need to copy or re-implement them (or put themselves in org.apache.spark) to access them. How would folks feel about adding a new trait for ML pipeline stages to expose transformation of single-element inputs (or local collections), optionally implemented by stages which support this? That way we would have less copy-and-paste code that could get out of sync with our model training.

I think continuing to have online serving grow in different projects is probably the right path forward (folks have different needs), but I'd love to see us make it simpler for other projects to build reliable serving tools.

I realize this may put some folks in an awkward position with their own commercial offerings, but hopefully if we make it easier for everyone the commercial vendors can benefit as well.

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]<http://databricks.com/>



--
Twitter: https://twitter.com/holdenkarau

Re: Revisiting Online serving of Spark models?

Posted by Holden Karau <ho...@pigscanfly.ca>.
On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jo...@databricks.com>
wrote:

> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>
> Awesome! I'm glad other folks think something like this belongs in Spark.

> This was one of the original goals for mllib-local: to have local versions
> of MLlib models which could be deployed without the big Spark JARs and
> without a SparkContext or SparkSession.  There are related commercial
> offerings like this : ) but the overhead of maintaining those offerings is
> pretty high.  Building good APIs within MLlib to avoid copying logic across
> libraries will be well worth it.
>
> We've talked about this need at Databricks and have also been syncing with
> the creators of MLeap.  It'd be great to get this functionality into Spark
> itself.  Some thoughts:
> * It'd be valuable to have this go beyond adding transform() methods
> taking a Row to the current Models.  Instead, it would be ideal to have
> local, lightweight versions of models in mllib-local, outside of the main
> mllib package (for easier deployment with smaller & fewer dependencies).
> * Supporting Pipelines is important.  For this, it would be ideal to
> utilize elements of Spark SQL, particularly Rows and Types, which could be
> moved into a local sql package.
> * This architecture may require some awkward APIs currently to have model
> prediction logic in mllib-local, local model classes in mllib-local, and
> regular (DataFrame-friendly) model classes in mllib.  We might find it
> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
> architecture while making it feasible for 3rd party developers to extend
> MLlib APIs (especially in Java).
>
I agree this could be interesting, and it feeds into the other discussion
around when (or if) we should be considering Spark 3.0.
I _think_ we could probably do it with optional traits people could mix in
to avoid breaking the current APIs, but I could be wrong on that point.
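
In Python terms (in Spark itself this would be a Scala trait), the optional
mix-in idea might look roughly like the sketch below. All names here
(SingleItemTransform, transform_row, ScalerModel) are hypothetical, chosen
only to illustrate the shape of the proposal:

```python
from abc import ABC, abstractmethod

class SingleItemTransform(ABC):
    """Optionally implemented by pipeline stages that can transform one
    input (or a small local collection) without a SparkContext."""

    @abstractmethod
    def transform_row(self, row: dict) -> dict: ...

    def transform_local(self, rows):
        # Default local-collection path built on the per-row method
        return [self.transform_row(r) for r in rows]

class ScalerModel(SingleItemTransform):
    """Toy stage that opts in to single-item transformation."""

    def __init__(self, factor):
        self.factor = factor

    def transform_row(self, row):
        return {**row, "scaled": row["value"] * self.factor}
```

Existing stages would keep their DataFrame-based transform() untouched;
serving code would just check whether a stage also implements the mix-in,
which is what keeps this from breaking current APIs.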

> * It could also be worth discussing local DataFrames.  They might not be
> as important as per-Row transformations, but they would be helpful for
> batching for higher throughput.
>
That could be interesting as well.

>
> I'll be interested to hear others' thoughts too!
>
> Joseph
>
> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca> wrote:
>
>> Hi y'all,
>>
>> With the renewed interest in ML in Apache Spark, now seems like as good a
>> time as any to revisit the online serving situation in Spark ML. DB &
>> others have done some excellent work moving a lot of the necessary
>> tools into a local linear algebra package that doesn't depend on having a
>> SparkContext.
>>
>> There are a few different commercial and non-commercial solutions around
>> this, but currently our individual transform/predict methods are private,
>> so they either need to copy or re-implement them (or put themselves in
>> org.apache.spark) to access them. How would folks feel about adding a new
>> trait for ML pipeline stages to expose transformation of single-element
>> inputs (or local collections), optionally implemented
>> by stages which support this? That way we would have less copy-and-paste
>> code that could get out of sync with our model training.
>>
>> I think continuing to have online serving grow in different projects is
>> probably the right path forward (folks have different needs), but I'd love
>> to see us make it simpler for other projects to build reliable serving
>> tools.
>>
>> I realize this may put some folks in an awkward position with
>> their own commercial offerings, but hopefully if we make it easier for
>> everyone the commercial vendors can benefit as well.
>>
>> Cheers,
>>
>> Holden :)
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] <http://databricks.com/>
>



-- 
Twitter: https://twitter.com/holdenkarau

Re: Revisiting Online serving of Spark models?

Posted by Joseph Bradley <jo...@databricks.com>.
Thanks for bringing this up Holden!  I'm a strong supporter of this.

This was one of the original goals for mllib-local: to have local versions
of MLlib models which could be deployed without the big Spark JARs and
without a SparkContext or SparkSession.  There are related commercial
offerings like this : ) but the overhead of maintaining those offerings is
pretty high.  Building good APIs within MLlib to avoid copying logic across
libraries will be well worth it.

We've talked about this need at Databricks and have also been syncing with
the creators of MLeap.  It'd be great to get this functionality into Spark
itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking
a Row to the current Models.  Instead, it would be ideal to have local,
lightweight versions of models in mllib-local, outside of the main mllib
package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to
utilize elements of Spark SQL, particularly Rows and Types, which could be
moved into a local sql package.
* This architecture may require some awkward APIs currently to have model
prediction logic in mllib-local, local model classes in mllib-local, and
regular (DataFrame-friendly) model classes in mllib.  We might find it
helpful to break some DeveloperApis in Spark 3.0 to facilitate this
architecture while making it feasible for 3rd party developers to extend
MLlib APIs (especially in Java).
* It could also be worth discussing local DataFrames.  They might not be as
important as per-Row transformations, but they would be helpful for
batching for higher throughput.
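
To illustrate the batching point, a sketch of a per-row path next to a
"local DataFrame"-style batch path on the same lightweight model. The class
and method names are hypothetical; the point is only that a batch entry point
lets a serving layer validate once and amortize per-call overhead over many
rows:

```python
class LocalLinearModel:
    """Toy SparkContext-free model with single-row and batch prediction."""

    def __init__(self, coefficients, intercept):
        self.coefficients = coefficients
        self.intercept = intercept

    def predict(self, features):
        # Single-row path: one dot product
        s = self.intercept
        for c, x in zip(self.coefficients, features):
            s += c * x
        return s

    def predict_batch(self, rows):
        # Batch path: validate the schema once, then score every row
        n = len(self.coefficients)
        if any(len(r) != n for r in rows):
            raise ValueError("feature length mismatch")
        return [self.predict(r) for r in rows]
```

A real implementation would presumably vectorize the batch path (e.g. one
matrix-vector product) rather than loop, which is where the throughput win
over per-Row transformation comes from.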

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau <ho...@pigscanfly.ca> wrote:

> Hi y'all,
>
> With the renewed interest in ML in Apache Spark, now seems like as good a
> time as any to revisit the online serving situation in Spark ML. DB &
> others have done some excellent work moving a lot of the necessary
> tools into a local linear algebra package that doesn't depend on having a
> SparkContext.
>
> There are a few different commercial and non-commercial solutions around
> this, but currently our individual transform/predict methods are private,
> so they either need to copy or re-implement them (or put themselves in
> org.apache.spark) to access them. How would folks feel about adding a new
> trait for ML pipeline stages to expose transformation of single-element
> inputs (or local collections), optionally implemented
> by stages which support this? That way we would have less copy-and-paste
> code that could get out of sync with our model training.
>
> I think continuing to have online serving grow in different projects is
> probably the right path forward (folks have different needs), but I'd love
> to see us make it simpler for other projects to build reliable serving
> tools.
>
> I realize this may put some folks in an awkward position with
> their own commercial offerings, but hopefully if we make it easier for
> everyone the commercial vendors can benefit as well.
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>