Posted to user@spark.apache.org by Hollin Wilkins <ho...@combust.ml> on 2017/02/02 16:42:10 UTC

[ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

Hey everyone,


Some of you may have seen Mikhail and me talk at Spark/Hadoop Summits about
MLeap and how you can use it to build production services from your
Spark-trained ML pipelines. MLeap is an open-source technology that allows
Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
Models to a scoring engine instantly. The MLeap execution engine has no
dependencies on a Spark context, and the serialization format is entirely
based on Protobuf 3 and JSON.


The recent 0.5.0 release provides serialization and inference support for
close to 100% of Spark transformers (we don’t yet support ALS and LDA).


MLeap is open-source, take a look at our Github page:

https://github.com/combust/mleap


Or join the conversation on Gitter:

https://gitter.im/combust/mleap


We have a set of documentation to help get you started here:

http://mleap-docs.combust.ml/


We even have a set of demos for training ML Pipelines and linear, logistic,
and random forest models:

https://github.com/combust/mleap-demo


Check out our latest MLeap-serving Docker image, which allows you to expose
a REST interface to your Spark ML pipeline models:

http://mleap-docs.combust.ml/mleap-serving/


Several companies are using MLeap in production and even more are currently
evaluating it. Take a look and tell us what you think! We hope to talk with
you soon and welcome feedback/suggestions!


Sincerely,

Hollin and Mikhail

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

Posted by Hollin Wilkins <ho...@combust.ml>.
Hey Asher,

A phone call may be best to discuss all of this, but in short:
1. It is quite easy to add custom pipelines/models to MLeap. All of our
out-of-the-box transformers serve as good examples of how to do this, and
we are putting together documentation on it for our docs web site.
2. MLlib models are not supported, but it wouldn't be too difficult to add
support for them.
3. We have benchmarked this: MLeap was roughly 2200x faster than a
SparkContext with a LocalRelation-backed DataFrame. The pipeline we used
for benchmarking included string indexing, one-hot encoding, vector
assembly, scaling, and a linear regression model. The reason for the speed
difference is that MLeap is optimized for one-off requests, whereas Spark
is incredible for scoring large batches of data but takes time to optimize
your pipeline before execution. That optimization time becomes noticeable
when trying to build services around models.
4. TensorFlow support is early, but we have already built pipelines
combining a Spark pipeline and a TensorFlow neural network, all served from
one MLeap pipeline using the same data structures as a regular Spark
pipeline. Eventually we will offer TensorFlow support as a module that
*just works TM* from Maven Central, but we are not quite there yet.
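To make the comparison in point 3 concrete, here is a toy latency model in
Python. The numbers are illustrative assumptions chosen to mirror the
magnitudes discussed in this thread, not our actual benchmark code:

```python
# Toy model: why per-request planning overhead dominates one-off scoring.
# PLAN_OVERHEAD_S and ROW_SCORE_S are assumed, illustrative values.

PLAN_OVERHEAD_S = 0.100   # ~100 ms: per-request query planning/optimization
ROW_SCORE_S = 0.000006    # ~6 us: scoring one row with a pre-compiled pipeline

def batch_engine_latency(n_rows: int) -> float:
    """An engine that optimizes the whole job once, then scores n rows."""
    return PLAN_OVERHEAD_S + n_rows * ROW_SCORE_S

def row_engine_latency(n_rows: int) -> float:
    """An engine with no per-request planning step."""
    return n_rows * ROW_SCORE_S

# One-off request: the fixed overhead dominates completely.
one_off_speedup = batch_engine_latency(1) / row_engine_latency(1)

# Large batch: the one-time overhead amortizes away to almost nothing.
batch_ratio = batch_engine_latency(10_000_000) / row_engine_latency(10_000_000)

print(f"single-row speedup ~{one_off_speedup:,.0f}x, batch ratio ~{batch_ratio:.2f}x")
```

With these toy numbers, a single-row request is four orders of magnitude
faster without the planning step, while for ten million rows the two engines
are nearly identical, which is why batch scoring in Spark remains a great fit.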

Feel free to email me privately if you would like to discuss any of this
more, or join our gitter:
https://gitter.im/combust/mleap

Best,
Hollin

On Fri, Feb 3, 2017 at 10:48 AM, Asher Krim <ak...@hubspot.com> wrote:

> I have a bunch of questions for you Hollin:
>
> How easy is it to add support for custom pipelines/models?
> Are Spark mllib models supported?
> We currently run spark in local mode in an api service. It's not super
> terrible, but performance is a constant struggle. Have you benchmarked any
> performance differences between MLeap and vanilla Spark?
> What does Tensorflow support look like? I would love to serve models from
> a java stack while being agnostic to what framework was used to train them.
>
> Thanks,
> Asher Krim
> Senior Software Engineer
>
> On Fri, Feb 3, 2017 at 11:53 AM, Hollin Wilkins <ho...@combust.ml> wrote:
>
>> Hey Aseem,
>>
>> We have built pipelines that execute several string indexers, one-hot
>> encoders, scaling, and a random forest or linear regression at the end.
>> Execution time for the linear regression was on the order of 11
>> microseconds, a bit longer for random forest. If your pipeline is simple,
>> this can be further optimized to around 2-3 microseconds by using
>> row-based transformations. The pipeline operated on roughly 12 input
>> features, and by the time all the processing was done, we had somewhere
>> around 1000 features going into the linear regression after one-hot
>> encoding and everything else.
>>
>> Hope this helps,
>> Hollin
>>
>> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal <as...@gmail.com>
>> wrote:
>>
>>> Does this support Java 7?
>>>
>>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal <as...@gmail.com>
>>> wrote:
>>>
>>>> Is computational time for predictions on the order of few milliseconds
>>>> (< 10 ms) like the old mllib library?

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

Posted by Asher Krim <ak...@hubspot.com>.
I have a bunch of questions for you Hollin:

How easy is it to add support for custom pipelines/models?
Are Spark MLlib models supported?
We currently run Spark in local mode in an API service. It's not super
terrible, but performance is a constant struggle. Have you benchmarked any
performance differences between MLeap and vanilla Spark?
What does TensorFlow support look like? I would love to serve models from a
Java stack while being agnostic to what framework was used to train them.

Thanks,
Asher Krim
Senior Software Engineer


Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

Posted by Hollin Wilkins <ho...@combust.ml>.
Hi All -


We got a number of great questions and ended up adding responses to them on
the MLeap Documentation page, in the FAQ section
<http://mleap-docs.combust.ml/faq.html>. We're also including a "condensed"
version at the bottom of this email.


We appreciate the interest and the discussion around MLeap - going from
research to production has been a key focus for us for a while and we are
very passionate about this topic. We welcome community feedback and support
(code, ideas, use-cases) and aim to make taking ML Pipelines to production
a pleasant experience.


Best,

Hollin and Mikhail


--------------------------


FAQs:


Does MLeap Support Custom Transformers?

Absolutely - our goal is to make writing custom transformers easy. For
documentation on writing and contributing custom transformers, see the
Custom Transformers
<http://mleap-docs.combust.ml/mleap-runtime/custom-transformer.html> page.
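As a rough conceptual sketch of what a pipeline of row-based transformers
looks like (plain Python for illustration only; the real MLeap
custom-transformer API is Scala/Java and is covered on the page above, and
all class and field names here are invented):

```python
# Each stage exposes transform(row); a pipeline is just an ordered list of
# stages, applied one row at a time with no DataFrame or Spark context.

class StringIndexer:
    """Maps known category strings to integer indices."""
    def __init__(self, labels):
        self.index = {label: i for i, label in enumerate(labels)}
    def transform(self, row):
        row["city_idx"] = self.index[row["city"]]
        return row

class OneHotEncoder:
    """Expands an integer index into a one-hot vector, then appends numerics."""
    def __init__(self, size):
        self.size = size
    def transform(self, row):
        vec = [0.0] * self.size
        vec[row["city_idx"]] = 1.0
        row["features"] = vec + [row["income"]]
        return row

class LinearRegressionModel:
    """Dot product of coefficients and features, plus an intercept."""
    def __init__(self, coefficients, intercept):
        self.coefficients = coefficients
        self.intercept = intercept
    def transform(self, row):
        row["prediction"] = self.intercept + sum(
            c * x for c, x in zip(self.coefficients, row["features"]))
        return row

def score(pipeline, row):
    for stage in pipeline:
        row = stage.transform(row)
    return row["prediction"]

pipeline = [
    StringIndexer(labels=["SF", "NYC"]),
    OneHotEncoder(size=2),
    LinearRegressionModel(coefficients=[1.0, 2.0, 0.5], intercept=10.0),
]
print(score(pipeline, {"city": "NYC", "income": 4.0}))  # 10 + 2 + 2 = 14.0
```

A custom transformer in this style is just another stage with a `transform`
method, which is the same shape of contract the out-of-the-box transformers
follow.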

What is MLeap Runtime’s Inference Performance?

MLeap is optimized to execute entire ML pipelines in microseconds
(thousandths of a millisecond). We provide a benchmarking library
<https://github.com/combust/mleap/tree/master/mleap-benchmark> as part of
MLeap that reports the following response times on a pipeline comprised of
vector assemblers, standard scalers, string indexers, and one-hot encoders,
followed by:

   - Linear Regression: 6.2 microseconds, vs. 106 milliseconds with Spark
     using a LocalRelation
   - Random Forest: 6.8 microseconds, vs. 101 milliseconds with Spark using
     a LocalRelation
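For scale, the reported figures above work out to speedups of roughly four
orders of magnitude (simple arithmetic on the numbers in this thread):

```python
# Reported response times from the benchmark numbers above.
linreg_mleap_s = 6.2e-6    # 6.2 microseconds
linreg_spark_s = 106e-3    # 106 milliseconds
rf_mleap_s = 6.8e-6        # 6.8 microseconds
rf_spark_s = 101e-3        # 101 milliseconds

linreg_speedup = linreg_spark_s / linreg_mleap_s   # roughly 17,000x
rf_speedup = rf_spark_s / rf_mleap_s               # roughly 15,000x

print(f"linear regression ~{linreg_speedup:,.0f}x, random forest ~{rf_speedup:,.0f}x")
```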


What Should Be Considered When Making a Decision Between Using MLeap and
other Serialization/Execution Frameworks?

MLeap serialization is built with the following goals and requirements in
mind:

   - It should be easy for developers to add custom transformers in Scala
     and Java (we are adding Python and C support as well)
   - The serialization format should be flexible and meet state-of-the-art
     performance requirements. MLeap serializes to Protobuf 3, making
     scalable deployment and execution of large pipelines and models like
     Random Forests and Neural Nets possible
   - Serialization should be optimized for ML Transformers and Pipelines
   - Serialization should be accessible for all environments and platforms,
     including low-level languages like C, C++ and Rust
   - It should provide a common serialization framework for Spark, Scikit,
     and TensorFlow transformers


Is MLeap Ready For Production?

Yes - MLeap is used in a number of production environments today. The MLeap
0.5.0 release provides a stable serialization and execution format for ML
Pipelines, and version 1.0.0 will guarantee backwards compatibility.

Why Not Use a SparkContext With a LocalRelation DataFrame?

APIs relying on a SparkContext can be optimized to process queries in
~100 ms; if that meets your requirements, then LocalRelation is a possible
solution. However, MLeap's use cases require sub-20 ms, and in some cases
sub-millisecond, response times.

Is Spark MLlib Supported?

Spark ML Pipelines already support a lot of the same transformers and
models that are part of MLlib. In addition, we offer a wrapper around MLlib
SupportVectorMachine in our mleap-spark-extension module. If you find that
something is missing from Spark ML that is found in MLlib, please let us
know or contribute your own wrapper to MLeap.

Does MLeap Work With Spark Streaming?

Yes - we will add a tutorial on that in the next few weeks.

How Does TensorFlow Integration Work?

TensorFlow integration works by using the official TensorFlow SWIG
wrappers. We may eventually change this to use JavaCPP bindings, or even
take an Erlang-inspired approach and run a separate TensorFlow process for
executing TensorFlow graphs. However we end up implementing it, the
interface will stay the same, and you will always be able to transform your
leap frames with the TensorflowTransformer.

When Will Scikit-Learn Be Supported?

Scikit-learn support is currently in beta, and we are working to support
the following functionality in the initial release in early March:

   - Support for all scikit-learn transformers that have a corresponding
     Spark transformer
   - Both serialization and de-serialization of MLeap Bundles
   - Basic pandas support: group-by aggregations, joins


How Can I Contribute?

   - Contribute an Estimator/Transformer from Spark or your own custom
     transformer
   - Write documentation
   - Write a tutorial/walkthrough for an interesting ML problem
   - Use MLeap at your company and tell us what you think
   - Talk with us on Gitter <https://gitter.im/combust/mleap>


On Mon, Feb 6, 2017 at 12:01 AM, Aseem Bansal <as...@gmail.com> wrote:

> I agree with you that this is needed. There is a JIRA
> https://issues.apache.org/jira/browse/SPARK-10413
>
> On Sun, Feb 5, 2017 at 11:21 PM, Debasish Das <de...@gmail.com>
> wrote:
>
>> Hi Aseem,
>>
>> Due to production deploy, we did not upgrade to 2.0 but that's critical
>> item on our list.
>>
>> For exposing models out of PipelineModel, let me look into the ML
>> tasks...we should add it since dataframe should not be must for model
>> scoring...many times model are scored on api or streaming paths which don't
>> have micro batching involved...data directly lands from http or kafka/msg
>> queues...for such cases raw access to ML model is essential similar to
>> mllib model access...
>>
>> Thanks.
>> Deb
>> On Feb 4, 2017 9:58 PM, "Aseem Bansal" <as...@gmail.com> wrote:
>>
>>> @Debasish
>>>
>>> I see that the spark version being used in the project that you
>>> mentioned is 1.6.0. I would suggest that you take a look at some blogs
>>> related to Spark 2.0 Pipelines, Models in new ml package. The new ml
>>> package's API as of latest Spark 2.1.0 release has no way to call predict
>>> on single vector. There is no API exposed. It is WIP but not yet released.
>>>
>>> On Sat, Feb 4, 2017 at 11:07 PM, Debasish Das <de...@gmail.com>
>>> wrote:
>>>
>>>> If we expose an API to access the raw models out of PipelineModel can't
>>>> we call predict directly on it from an API ? Is there a task open to expose
>>>> the model out of PipelineModel so that predict can be called on it....there
>>>> is no dependency of spark context in ml model...
>>>> On Feb 4, 2017 9:11 AM, "Aseem Bansal" <as...@gmail.com> wrote:
>>>>
>>>>>
>>>>>    - In Spark 2.0 there is a class called PipelineModel. I know that
>>>>>    the title says pipeline but it is actually talking about PipelineModel
>>>>>    trained via using a Pipeline.
>>>>>    - Why PipelineModel instead of pipeline? Because usually there is
>>>>>    a series of stuff that needs to be done when doing ML which warrants an
>>>>>    ordered sequence of operations. Read the new spark ml docs or one of the
>>>>>    databricks blogs related to spark pipelines. If you have used python's
>>>>>    sklearn library the concept is inspired from there.
>>>>>    - "once model is deserialized as ml model from the store of choice
>>>>>    within ms" - The timing of loading the model was not what I was
>>>>>    referring to when I was talking about timing.
>>>>>    - "it can be used on incoming features to score through
>>>>>    spark.ml.Model predict API". The predict API is in the old mllib package
>>>>>    not the new ml package.
>>>>>    - "why r we using dataframe and not the ML model directly from
>>>>>    API" - Because as of now the new ml package does not have the direct API.
>>>>>
>>>>>
>>>>> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <
>>>>> debasish.das83@gmail.com> wrote:
>>>>>
>>>>>> I am not sure why I will use pipeline to do scoring...idea is to
>>>>>> build a model, use model ser/deser feature to put it in the row or column
>>>>>> store of choice and provide a api access to the model...we support these
>>>>>> primitives in github.com/Verizon/trapezium...the api has access to
>>>>>> spark context in local or distributed mode...once model is deserialized as
>>>>>> ml model from the store of choice within ms, it can be used on incoming
>>>>>> features to score through spark.ml.Model predict API...I am not clear on
>>>>>> 2200x speedup...why r we using dataframe and not the ML model directly from
>>>>>> API ?
>>>>>> On Feb 4, 2017 7:52 AM, "Aseem Bansal" <as...@gmail.com> wrote:
>>>>>>
>>>>>>> Does this support Java 7?
>>>>>>> What is your timezone in case someone wanted to talk?

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

Posted by Aseem Bansal <as...@gmail.com>.
I agree with you that this is needed. There is a JIRA
https://issues.apache.org/jira/browse/SPARK-10413


Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

Posted by Debasish Das <de...@gmail.com>.
Hi Aseem,

Due to our production deploy, we did not upgrade to 2.0, but that's a
critical item on our list.

For exposing models out of PipelineModel, let me look into the ML
tasks...we should add it since dataframe should not be must for model
scoring...many times model are scored on api or streaming paths which don't
have micro batching involved...data directly lands from http or kafka/msg
queues...for such cases raw access to ML model is essential similar to
mllib model access...
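As a minimal sketch of that kind of raw, DataFrame-free model access (the
JSON layout and function names here are invented for illustration; this is
not trapezium's or Spark's actual API):

```python
# A model's parameters serialized to a plain JSON document can be rehydrated
# and scored on a raw feature vector, with no DataFrame or SparkContext.
import json

# Stand-in for a document fetched from the row/column store of choice.
stored = json.dumps({"weights": [0.5, -1.0, 2.0], "intercept": 0.25})

def predict(model_doc: str, features: list) -> float:
    """Deserialize the model document and score one feature vector."""
    model = json.loads(model_doc)
    return model["intercept"] + sum(
        w * x for w, x in zip(model["weights"], features))

print(predict(stored, [2.0, 1.0, 0.5]))  # 0.25 + 1.0 - 1.0 + 1.0 = 1.25
```

This is the shape of API that works equally well behind an HTTP endpoint or
a Kafka consumer, since each incoming message carries its own feature vector.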

Thanks.
Deb
>>>>>>>>> Summits about MLeap and how you can use it to build production services
>>>>>>>>> from your Spark-trained ML pipelines. MLeap is an open-source technology
>>>>>>>>> that allows Data Scientists and Engineers to deploy Spark-trained ML
>>>>>>>>> Pipelines and Models to a scoring engine instantly. The MLeap execution
>>>>>>>>> engine has no dependencies on a Spark context and the serialization format
>>>>>>>>> is entirely based on Protobuf 3 and JSON.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The recent 0.5.0 release provides serialization and inference
>>>>>>>>> support for close to 100% of Spark transformers (we don’t yet support ALS
>>>>>>>>> and LDA).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> MLeap is open-source, take a look at our Github page:
>>>>>>>>>
>>>>>>>>> https://github.com/combust/mleap
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Or join the conversation on Gitter:
>>>>>>>>>
>>>>>>>>> https://gitter.im/combust/mleap
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We have a set of documentation to help get you started here:
>>>>>>>>>
>>>>>>>>> http://mleap-docs.combust.ml/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We even have a set of demos, for training ML Pipelines and linear,
>>>>>>>>> logistic and random forest models:
>>>>>>>>>
>>>>>>>>> https://github.com/combust/mleap-demo
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Check out our latest MLeap-serving Docker image, which allows you
>>>>>>>>> to expose a REST interface to your Spark ML pipeline models:
>>>>>>>>>
>>>>>>>>> http://mleap-docs.combust.ml/mleap-serving/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Several companies are using MLeap in production and even more are
>>>>>>>>> currently evaluating it. Take a look and tell us what you think! We hope to
>>>>>>>>> talk with you soon and welcome feedback/suggestions!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sincerely,
>>>>>>>>>
>>>>>>>>> Hollin and Mikhail
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

Posted by Aseem Bansal <as...@gmail.com>.
@Debasish

I see that the spark version being used in the project that you mentioned
is 1.6.0. I would suggest that you take a look at some blogs related to
Spark 2.0 Pipelines and Models in the new ml package. As of the latest
Spark 2.1.0 release, the new ml package's API has no way to call predict on
a single vector. There is no such API exposed; it is WIP but not yet released.
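To make the gap concrete: the only supported scoring path in the new ml
package is transform() on a DataFrame, even for one record. Here is a
library-free Python sketch of the two call shapes being discussed (a toy
LinearModel, not Spark's actual API):

```python
# Toy illustration (not Spark's API): why a single-vector predict()
# avoids the overhead of wrapping one record in a DataFrame-like batch.

class LinearModel:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def predict(self, vector):
        # Direct single-vector scoring: no batch wrapper, no schema.
        return sum(w * x for w, x in zip(self.weights, vector)) + self.bias

    def transform(self, rows):
        # DataFrame-style scoring: always operates on a batch of rows,
        # so scoring one vector means building a one-row batch first.
        return [row + [self.predict(row)] for row in rows]

model = LinearModel(weights=[0.5, -1.0], bias=2.0)

# What callers want for online serving:
score = model.predict([4.0, 1.0])      # -> 3.0

# What the ml package actually exposes (conceptually):
batch = model.transform([[4.0, 1.0]])  # a one-row "DataFrame"
```

The per-call difference looks trivial here, but at serving time the real
DataFrame path also pays for schema handling and query planning on every
request, which is what the latency discussion in this thread is about.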

On Sat, Feb 4, 2017 at 11:07 PM, Debasish Das <de...@gmail.com>
wrote:

> If we expose an API to access the raw models out of PipelineModel can't we
> call predict directly on it from an API ? Is there a task open to expose
> the model out of PipelineModel so that predict can be called on it....there
> is no dependency of spark context in ml model...
> On Feb 4, 2017 9:11 AM, "Aseem Bansal" <as...@gmail.com> wrote:
>
>>
>>    - In Spark 2.0 there is a class called PipelineModel. I know that the
>>    title says pipeline but it is actually talking about PipelineModel trained
>>    via using a Pipeline.
>>    - Why PipelineModel instead of pipeline? Because usually there is a
>>    series of stuff that needs to be done when doing ML which warrants an
>>    ordered sequence of operations. Read the new spark ml docs or one of the
>>    databricks blogs related to spark pipelines. If you have used python's
>>    sklearn library the concept is inspired from there.
>>    - "once model is deserialized as ml model from the store of choice
>>    within ms" - The timing of loading the model was not what I was
>>    referring to when I was talking about timing.
>>    - "it can be used on incoming features to score through
>>    spark.ml.Model predict API". The predict API is in the old mllib package
>>    not the new ml package.
>>    - "why r we using dataframe and not the ML model directly from API" -
>>    Because as of now the new ml package does not have the direct API.
>>
>>
>> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <de...@gmail.com>
>> wrote:
>>
>>> I am not sure why I will use pipeline to do scoring...idea is to build a
>>> model, use model ser/deser feature to put it in the row or column store of
>>> choice and provide a api access to the model...we support these primitives
>>> in github.com/Verizon/trapezium...the api has access to spark context
>>> in local or distributed mode...once model is deserialized as ml model from
>>> the store of choice within ms, it can be used on incoming features to score
>>> through spark.ml.Model predict API...I am not clear on 2200x speedup...why
>>> r we using dataframe and not the ML model directly from API ?
>>> On Feb 4, 2017 7:52 AM, "Aseem Bansal" <as...@gmail.com> wrote:
>>>
>>>> Does this support Java 7?
>>>> What is your timezone in case someone wanted to talk?
>>>>
>>>> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins <ho...@combust.ml>
>>>> wrote:
>>>>
>>>>> Hey Aseem,
>>>>>
>>>>> We have built pipelines that execute several string indexers, one hot
>>>>> encoders, scaling, and a random forest or linear regression at the end.
>>>>> Execution time for the linear regression was on the order of 11
>>>>> microseconds, a bit longer for random forest. This can be further optimized
>>>>> by using row-based transformations if your pipeline is simple to around 2-3
>>>>> microseconds. The pipeline operated on roughly 12 input features, and by
>>>>> the time all the processing was done, we had somewhere around 1000 features
>>>>> or so going into the linear regression after one hot encoding and
>>>>> everything else.
>>>>>
>>>>> Hope this helps,
>>>>> Hollin
>>>>>
>>>>> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal <as...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Does this support Java 7?
>>>>>>
>>>>>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal <as...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Is computational time for predictions on the order of few
>>>>>>> milliseconds (< 10 ms) like the old mllib library?
>>>>>>>
>>>>>>> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins <ho...@combust.ml>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey everyone,
>>>>>>>>
>>>>>>>>
>>>>>>>> Some of you may have seen Mikhail and I talk at Spark/Hadoop
>>>>>>>> Summits about MLeap and how you can use it to build production services
>>>>>>>> from your Spark-trained ML pipelines. MLeap is an open-source technology
>>>>>>>> that allows Data Scientists and Engineers to deploy Spark-trained ML
>>>>>>>> Pipelines and Models to a scoring engine instantly. The MLeap execution
>>>>>>>> engine has no dependencies on a Spark context and the serialization format
>>>>>>>> is entirely based on Protobuf 3 and JSON.
>>>>>>>>
>>>>>>>>
>>>>>>>> The recent 0.5.0 release provides serialization and inference
>>>>>>>> support for close to 100% of Spark transformers (we don’t yet support ALS
>>>>>>>> and LDA).
>>>>>>>>
>>>>>>>>
>>>>>>>> MLeap is open-source, take a look at our Github page:
>>>>>>>>
>>>>>>>> https://github.com/combust/mleap
>>>>>>>>
>>>>>>>>
>>>>>>>> Or join the conversation on Gitter:
>>>>>>>>
>>>>>>>> https://gitter.im/combust/mleap
>>>>>>>>
>>>>>>>>
>>>>>>>> We have a set of documentation to help get you started here:
>>>>>>>>
>>>>>>>> http://mleap-docs.combust.ml/
>>>>>>>>
>>>>>>>>
>>>>>>>> We even have a set of demos, for training ML Pipelines and linear,
>>>>>>>> logistic and random forest models:
>>>>>>>>
>>>>>>>> https://github.com/combust/mleap-demo
>>>>>>>>
>>>>>>>>
>>>>>>>> Check out our latest MLeap-serving Docker image, which allows you
>>>>>>>> to expose a REST interface to your Spark ML pipeline models:
>>>>>>>>
>>>>>>>> http://mleap-docs.combust.ml/mleap-serving/
>>>>>>>>
>>>>>>>>
>>>>>>>> Several companies are using MLeap in production and even more are
>>>>>>>> currently evaluating it. Take a look and tell us what you think! We hope to
>>>>>>>> talk with you soon and welcome feedback/suggestions!
>>>>>>>>
>>>>>>>>
>>>>>>>> Sincerely,
>>>>>>>>
>>>>>>>> Hollin and Mikhail
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

Posted by Chris Fregly <ch...@fregly.com>.
to date, i haven't seen very good performance coming from mleap. i believe ram from databricks keeps getting you guys on stage at the spark summits, but i've been unimpressed with the performance numbers - as well as your choice to reimplement your own non-standard "pmml-like" mechanism, which incurs heavy technical debt on the development side.

creating technical debt is a very databricks-like thing as seen in their own product - so it's no surprise that databricks supports and encourages this type of engineering effort.

@hollin: please correct me if i'm wrong, but the numbers you guys have quoted in the past are at very low scale. at one point you were quoting 40-50ms which is pretty bad. 11ms is better, but these are all at low scale which is not good.

i'm not sure where the 2-3ms numbers are coming from, but even that is not realistic in most real-world scenarios at scale.

check out our 100% open source solution to this exact problem starting at http://pipeline.io. you'll find links to the github repo, youtube demos, slideshare conference talks, online training, and lots more.

our entire focus at PipelineIO is optimizing, deploying, a/b + bandit testing, and scaling Scikit-Learn + Spark ML + Tensorflow AI models for high-performance predictions.

this focus on performance and scale is an extension of our team's long history of building highly scalable, highly available, and highly performant distributed ML and AI systems at netflix, twitter, mesosphere - and even databricks. :)

reminder that everything here is 100% open source. no product pitches here. we work for you guys/gals - aka the community!

please contact me directly if you're looking to solve this problem the best way possible.

we can get you up and running in your own cloud-based or on-premise environment in minutes. we support aws, google cloud, and azure - basically anywhere that runs docker.

any time zone works. we're completely global with free 24x7 support for everyone in the community.

thanks! hope this is useful.

Chris Fregly
Research Scientist @ PipelineIO
Founder @ Advanced Spark and TensorFlow Meetup
San Francisco - Chicago - Washington DC - London

On Feb 4, 2017, 12:06 PM -0600, Debasish Das <de...@gmail.com>, wrote:
>
> Except of course lda als and neural net model....for them the model need to be either prescored and cached on a kv store or the matrices / graph should be kept on kv store to access them using a REST API to serve the output..for neural net its more fun since its a distributed or local graph over which tensorflow compute needs to run...
>
>
> In trapezium we support writing these models to store like cassandra and lucene for example and then provide config driven akka-http based API to add the business logic to access these model from a store and expose the model serving as REST endpoint
>
>
> Matrix, graph and kernel models we use a lot and for them turned out that mllib style model predict were useful if we change the underlying store...
>
> On Feb 4, 2017 9:37 AM, "Debasish Das" <debasish.das83@gmail.com> wrote:
> >
> > If we expose an API to access the raw models out of PipelineModel can't we call predict directly on it from an API ? Is there a task open to expose the model out of PipelineModel so that predict can be called on it....there is no dependency of spark context in ml model...
> >
> > On Feb 4, 2017 9:11 AM, "Aseem Bansal" <asmbansal2@gmail.com> wrote:
> > > In Spark 2.0 there is a class called PipelineModel. I know that the title says pipeline but it is actually talking about PipelineModel trained via using a Pipeline.
> > > Why PipelineModel instead of pipeline? Because usually there is a series of stuff that needs to be done when doing ML which warrants an ordered sequence of operations. Read the new spark ml docs or one of the databricks blogs related to spark pipelines. If you have used python's sklearn library the concept is inspired from there.
> > > "once model is deserialized as ml model from the store of choice within ms" - The timing of loading the model was not what I was referring to when I was talking about timing.
> > > "it can be used on incoming features to score through spark.ml.Model predict API". The predict API is in the old mllib package not the new ml package.
> > > "why r we using dataframe and not the ML model directly from API" - Because as of now the new ml package does not have the direct API.
> > >
> > >
> > >
> > > On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <debasish.das83@gmail.com> wrote:
> > > >
> > > > I am not sure why I will use pipeline to do scoring...idea is to build a model, use model ser/deser feature to put it in the row or column store of choice and provide a api access to the model...we support these primitives in github.com/Verizon/trapezium...the api has access to spark context in local or distributed mode...once model is deserialized as ml model from the store of choice within ms, it can be used on incoming features to score through spark.ml.Model predict API...I am not clear on 2200x speedup...why r we using dataframe and not the ML model directly from API ?
> > > >
> > > > On Feb 4, 2017 7:52 AM, "Aseem Bansal" <asmbansal2@gmail.com> wrote:
> > > > > Does this support Java 7?
> > > > > What is your timezone in case someone wanted to talk?
> > > > >
> > > > >
> > > > > On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins <hollin@combust.ml> wrote:
> > > > > > Hey Aseem,
> > > > > >
> > > > > > We have built pipelines that execute several string indexers, one hot encoders, scaling, and a random forest or linear regression at the end. Execution time for the linear regression was on the order of 11 microseconds, a bit longer for random forest. This can be further optimized by using row-based transformations if your pipeline is simple to around 2-3 microseconds. The pipeline operated on roughly 12 input features, and by the time all the processing was done, we had somewhere around 1000 features or so going into the linear regression after one hot encoding and everything else.
> > > > > >
> > > > > > Hope this helps,
> > > > > > Hollin
> > > > > >
> > > > > >
> > > > > > On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal <asmbansal2@gmail.com> wrote:
> > > > > > > Does this support Java 7?
> > > > > > >
> > > > > > > On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal <asmbansal2@gmail.com> wrote:
> > > > > > > > Is computational time for predictions on the order of few milliseconds (< 10 ms) like the old mllib library?
> > > > > > > >
> > > > > > > > On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins <hollin@combust.ml> wrote:
> > > > > > > > >
> > > > > > > > > Hey everyone,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits about MLeap and how you can use it to build production services from your Spark-trained ML pipelines. MLeap is an open-source technology that allows Data Scientists and Engineers to deploy Spark-trained ML Pipelines and Models to a scoring engine instantly. The MLeap execution engine has no dependencies on a Spark context and the serialization format is entirely based on Protobuf 3 and JSON.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The recent 0.5.0 release provides serialization and inference support for close to 100% of Spark transformers (we don’t yet support ALS and LDA).
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > MLeap is open-source, take a look at our Github page:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > https://github.com/combust/mleap
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Or join the conversation on Gitter:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > https://gitter.im/combust/mleap
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > We have a set of documentation to help get you started here:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > http://mleap-docs.combust.ml/
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > We even have a set of demos, for training ML Pipelines and linear, logistic and random forest models:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > https://github.com/combust/mleap-demo
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Check out our latest MLeap-serving Docker image, which allows you to expose a REST interface to your Spark ML pipeline models:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > http://mleap-docs.combust.ml/mleap-serving/
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Several companies are using MLeap in production and even more are currently evaluating it. Take a look and tell us what you think! We hope to talk with you soon and welcome feedback/suggestions!
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Sincerely,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hollin and Mikhail
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

Posted by Debasish Das <de...@gmail.com>.
Except of course lda, als and neural net models....for them the model needs
to be either prescored and cached on a kv store, or the matrices / graph
should be kept on a kv store and accessed using a REST API to serve the
output..for neural nets it's more fun since it's a distributed or local graph
over which tensorflow compute needs to run...

In trapezium we support writing these models to a store like cassandra or
lucene, for example, and then provide a config-driven akka-http based API to
add the business logic to access these models from the store and expose the
model serving as a REST endpoint

We use matrix, graph and kernel models a lot, and for them it turned out that
mllib-style model predict was useful if we change the underlying store...
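The prescore-and-cache pattern described above can be sketched without any
particular store: an offline job computes top-k recommendations from
ALS-style factor matrices, writes them to a key-value store, and the serving
layer does only lookups. A minimal library-free Python sketch (the factor
values are illustrative, and a plain dict stands in for a store like
cassandra):

```python
# Sketch of prescoring ALS-style recommendations into a KV store.
# A plain dict stands in for the real store (e.g. cassandra/lucene).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy factor matrices from a trained ALS model (illustrative values).
user_factors = {"u1": [0.9, 0.1], "u2": [0.2, 0.8]}
item_factors = {"i1": [1.0, 0.0], "i2": [0.0, 1.0], "i3": [0.5, 0.5]}

def prescore(user_factors, item_factors, k=2):
    """Offline batch job: top-k items per user, ready for fast lookup."""
    kv_store = {}
    for user, uf in user_factors.items():
        scored = sorted(item_factors,
                        key=lambda item: dot(uf, item_factors[item]),
                        reverse=True)
        kv_store[user] = scored[:k]
    return kv_store

kv_store = prescore(user_factors, item_factors)

def recommend(user):
    """Online serving path: a pure KV lookup, no model evaluation."""
    return kv_store.get(user, [])

print(recommend("u1"))  # -> ['i1', 'i3']
```

The REST layer then only wraps recommend(), which is why model size and
scoring cost stop mattering at request time for these model families.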
On Feb 4, 2017 9:37 AM, "Debasish Das" <de...@gmail.com> wrote:

> If we expose an API to access the raw models out of PipelineModel can't we
> call predict directly on it from an API ? Is there a task open to expose
> the model out of PipelineModel so that predict can be called on it....there
> is no dependency of spark context in ml model...
> On Feb 4, 2017 9:11 AM, "Aseem Bansal" <as...@gmail.com> wrote:
>
>>
>>    - In Spark 2.0 there is a class called PipelineModel. I know that the
>>    title says pipeline but it is actually talking about PipelineModel trained
>>    via using a Pipeline.
>>    - Why PipelineModel instead of pipeline? Because usually there is a
>>    series of stuff that needs to be done when doing ML which warrants an
>>    ordered sequence of operations. Read the new spark ml docs or one of the
>>    databricks blogs related to spark pipelines. If you have used python's
>>    sklearn library the concept is inspired from there.
>>    - "once model is deserialized as ml model from the store of choice
>>    within ms" - The timing of loading the model was not what I was
>>    referring to when I was talking about timing.
>>    - "it can be used on incoming features to score through
>>    spark.ml.Model predict API". The predict API is in the old mllib package
>>    not the new ml package.
>>    - "why r we using dataframe and not the ML model directly from API" -
>>    Because as of now the new ml package does not have the direct API.
>>
>>
>> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <de...@gmail.com>
>> wrote:
>>
>>> I am not sure why I will use pipeline to do scoring...idea is to build a
>>> model, use model ser/deser feature to put it in the row or column store of
>>> choice and provide a api access to the model...we support these primitives
>>> in github.com/Verizon/trapezium...the api has access to spark context
>>> in local or distributed mode...once model is deserialized as ml model from
>>> the store of choice within ms, it can be used on incoming features to score
>>> through spark.ml.Model predict API...I am not clear on 2200x speedup...why
>>> r we using dataframe and not the ML model directly from API ?
>>> On Feb 4, 2017 7:52 AM, "Aseem Bansal" <as...@gmail.com> wrote:
>>>
>>>> Does this support Java 7?
>>>> What is your timezone in case someone wanted to talk?
>>>>
>>>> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins <ho...@combust.ml>
>>>> wrote:
>>>>
>>>>> Hey Aseem,
>>>>>
>>>>> We have built pipelines that execute several string indexers, one hot
>>>>> encoders, scaling, and a random forest or linear regression at the end.
>>>>> Execution time for the linear regression was on the order of 11
>>>>> microseconds, a bit longer for random forest. This can be further optimized
>>>>> by using row-based transformations if your pipeline is simple to around 2-3
>>>>> microseconds. The pipeline operated on roughly 12 input features, and by
>>>>> the time all the processing was done, we had somewhere around 1000 features
>>>>> or so going into the linear regression after one hot encoding and
>>>>> everything else.
>>>>>
>>>>> Hope this helps,
>>>>> Hollin
>>>>>
>>>>> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal <as...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Does this support Java 7?
>>>>>>
>>>>>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal <as...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Is computational time for predictions on the order of few
>>>>>>> milliseconds (< 10 ms) like the old mllib library?
>>>>>>>
>>>>>>> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins <ho...@combust.ml>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey everyone,
>>>>>>>>
>>>>>>>>
>>>>>>>> Some of you may have seen Mikhail and I talk at Spark/Hadoop
>>>>>>>> Summits about MLeap and how you can use it to build production services
>>>>>>>> from your Spark-trained ML pipelines. MLeap is an open-source technology
>>>>>>>> that allows Data Scientists and Engineers to deploy Spark-trained ML
>>>>>>>> Pipelines and Models to a scoring engine instantly. The MLeap execution
>>>>>>>> engine has no dependencies on a Spark context and the serialization format
>>>>>>>> is entirely based on Protobuf 3 and JSON.
>>>>>>>>
>>>>>>>>
>>>>>>>> The recent 0.5.0 release provides serialization and inference
>>>>>>>> support for close to 100% of Spark transformers (we don’t yet support ALS
>>>>>>>> and LDA).
>>>>>>>>
>>>>>>>>
>>>>>>>> MLeap is open-source, take a look at our Github page:
>>>>>>>>
>>>>>>>> https://github.com/combust/mleap
>>>>>>>>
>>>>>>>>
>>>>>>>> Or join the conversation on Gitter:
>>>>>>>>
>>>>>>>> https://gitter.im/combust/mleap
>>>>>>>>
>>>>>>>>
>>>>>>>> We have a set of documentation to help get you started here:
>>>>>>>>
>>>>>>>> http://mleap-docs.combust.ml/
>>>>>>>>
>>>>>>>>
>>>>>>>> We even have a set of demos, for training ML Pipelines and linear,
>>>>>>>> logistic and random forest models:
>>>>>>>>
>>>>>>>> https://github.com/combust/mleap-demo
>>>>>>>>
>>>>>>>>
>>>>>>>> Check out our latest MLeap-serving Docker image, which allows you
>>>>>>>> to expose a REST interface to your Spark ML pipeline models:
>>>>>>>>
>>>>>>>> http://mleap-docs.combust.ml/mleap-serving/
>>>>>>>>
>>>>>>>>
>>>>>>>> Several companies are using MLeap in production and even more are
>>>>>>>> currently evaluating it. Take a look and tell us what you think! We hope to
>>>>>>>> talk with you soon and welcome feedback/suggestions!
>>>>>>>>
>>>>>>>>
>>>>>>>> Sincerely,
>>>>>>>>
>>>>>>>> Hollin and Mikhail
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

Posted by Debasish Das <de...@gmail.com>.
If we expose an API to access the raw models out of PipelineModel, can't we
call predict directly on it from an API? Is there a task open to expose
the model out of PipelineModel so that predict can be called on it....there
is no dependency on the spark context in an ml model...
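What this is asking for is essentially reaching past the PipelineModel
wrapper to its final stage: Spark's PipelineModel does expose a stages
array, so the raw model can already be pulled out; what is missing (as of
2.1.0) is a public single-vector predict on that model. A library-free
Python sketch of the shape of this (toy classes mirroring the structure,
not Spark's actual API):

```python
# Toy mirror of Spark's Pipeline structure (not Spark's API):
# a PipelineModel is an ordered list of fitted stages, and the raw
# model is reachable as the last stage.

class ScalerModel:
    def transform_row(self, row):
        return [x * 2.0 for x in row]

class LinearRegressionModel:
    def __init__(self, weights):
        self.weights = weights

    def predict(self, vector):
        # The single-vector entry point the thread wishes were public.
        return sum(w * x for w, x in zip(self.weights, vector))

class PipelineModel:
    def __init__(self, stages):
        self.stages = stages  # Spark's PipelineModel exposes this too

pm = PipelineModel([ScalerModel(), LinearRegressionModel([1.0, 3.0])])

# Reach past the wrapper to the raw model and score one vector:
raw_model = pm.stages[-1]
print(raw_model.predict([2.0, 1.0]))  # -> 5.0
```

The catch, raised elsewhere in this thread, is that scoring only the final
stage skips the feature-engineering stages before it, so a real serving API
would still need to replay the whole stage sequence per vector.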
On Feb 4, 2017 9:11 AM, "Aseem Bansal" <as...@gmail.com> wrote:

>
>    - In Spark 2.0 there is a class called PipelineModel. I know that the
>    title says pipeline but it is actually talking about PipelineModel trained
>    via using a Pipeline.
>    - Why PipelineModel instead of pipeline? Because usually there is a
>    series of stuff that needs to be done when doing ML which warrants an
>    ordered sequence of operations. Read the new spark ml docs or one of the
>    databricks blogs related to spark pipelines. If you have used python's
>    sklearn library the concept is inspired from there.
>    - "once model is deserialized as ml model from the store of choice
>    within ms" - The timing of loading the model was not what I was
>    referring to when I was talking about timing.
>    - "it can be used on incoming features to score through spark.ml.Model
>    predict API". The predict API is in the old mllib package not the new ml
>    package.
>    - "why r we using dataframe and not the ML model directly from API" -
>    Because as of now the new ml package does not have the direct API.
>
>
> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <de...@gmail.com>
> wrote:
>
>> I am not sure why I will use pipeline to do scoring...idea is to build a
>> model, use model ser/deser feature to put it in the row or column store of
>> choice and provide a api access to the model...we support these primitives
>> in github.com/Verizon/trapezium...the api has access to spark context in
>> local or distributed mode...once model is deserialized as ml model from the
>> store of choice within ms, it can be used on incoming features to score
>> through spark.ml.Model predict API...I am not clear on 2200x speedup...why
>> r we using dataframe and not the ML model directly from API ?
>> On Feb 4, 2017 7:52 AM, "Aseem Bansal" <as...@gmail.com> wrote:
>>
>>> Does this support Java 7?
>>> What is your timezone in case someone wanted to talk?
>>>
>>> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins <ho...@combust.ml>
>>> wrote:
>>>
>>>> Hey Aseem,
>>>>
>>>> We have built pipelines that execute several string indexers, one hot
>>>> encoders, scaling, and a random forest or linear regression at the end.
>>>> Execution time for the linear regression was on the order of 11
>>>> microseconds, a bit longer for random forest. This can be further optimized
>>>> by using row-based transformations if your pipeline is simple to around 2-3
>>>> microseconds. The pipeline operated on roughly 12 input features, and by
>>>> the time all the processing was done, we had somewhere around 1000 features
>>>> or so going into the linear regression after one hot encoding and
>>>> everything else.
>>>>
>>>> Hope this helps,
>>>> Hollin
>>>>
>>>> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal <as...@gmail.com>
>>>> wrote:
>>>>
>>>>> Does this support Java 7?
>>>>>
>>>>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal <as...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Is computational time for predictions on the order of few
>>>>>> milliseconds (< 10 ms) like the old mllib library?
>>>>>>
>>>>>> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins <ho...@combust.ml>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>>
>>>>>>> Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits
>>>>>>> about MLeap and how you can use it to build production services from your
>>>>>>> Spark-trained ML pipelines. MLeap is an open-source technology that allows
>>>>>>> Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
>>>>>>> Models to a scoring engine instantly. The MLeap execution engine has no
>>>>>>> dependencies on a Spark context and the serialization format is entirely
>>>>>>> based on Protobuf 3 and JSON.
>>>>>>>
>>>>>>>
>>>>>>> The recent 0.5.0 release provides serialization and inference
>>>>>>> support for close to 100% of Spark transformers (we don’t yet support ALS
>>>>>>> and LDA).
>>>>>>>
>>>>>>>
>>>>>>> MLeap is open-source, take a look at our Github page:
>>>>>>>
>>>>>>> https://github.com/combust/mleap
>>>>>>>
>>>>>>>
>>>>>>> Or join the conversation on Gitter:
>>>>>>>
>>>>>>> https://gitter.im/combust/mleap
>>>>>>>
>>>>>>>
>>>>>>> We have a set of documentation to help get you started here:
>>>>>>>
>>>>>>> http://mleap-docs.combust.ml/
>>>>>>>
>>>>>>>
>>>>>>> We even have a set of demos, for training ML Pipelines and linear,
>>>>>>> logistic and random forest models:
>>>>>>>
>>>>>>> https://github.com/combust/mleap-demo
>>>>>>>
>>>>>>>
>>>>>>> Check out our latest MLeap-serving Docker image, which allows you to
>>>>>>> expose a REST interface to your Spark ML pipeline models:
>>>>>>>
>>>>>>> http://mleap-docs.combust.ml/mleap-serving/
>>>>>>>
>>>>>>>
>>>>>>> Several companies are using MLeap in production and even more are
>>>>>>> currently evaluating it. Take a look and tell us what you think! We hope to
>>>>>>> talk with you soon and welcome feedback/suggestions!
>>>>>>>
>>>>>>>
>>>>>>> Sincerely,
>>>>>>>
>>>>>>> Hollin and Mikhail
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

Posted by Aseem Bansal <as...@gmail.com>.
   - In Spark 2.0 there is a class called PipelineModel. I know that the
   title says Pipeline, but it is actually talking about a PipelineModel
   trained via a Pipeline.
   - Why PipelineModel instead of Pipeline? Because ML usually involves a
   series of steps that need to be done in order, which warrants an
   ordered sequence of operations. Read the new spark.ml docs or one of
   the Databricks blogs on Spark pipelines. If you have used Python's
   scikit-learn library, the concept is inspired from there.
   - "once model is deserialized as ml model from the store of choice
   within ms" - the time to load the model was not what I was referring
   to when I was talking about timing.
   - "it can be used on incoming features to score through spark.ml.Model
   predict API" - the predict API is in the old mllib package, not the
   new ml package.
   - "why r we using dataframe and not the ML model directly from API" -
   because as of now the new ml package does not have the direct API.
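To make the Pipeline vs. PipelineModel distinction concrete, here is a
minimal plain-Python sketch of the pattern (no Spark required; the class
and method names mirror the spark.ml API for illustration only and are not
the real library):

```python
# Conceptual sketch of the Pipeline/PipelineModel pattern: a Pipeline is
# an ordered sequence of stages, fit() trains each stage in turn, and the
# result is a PipelineModel -- the trained artifact you score with.

class Scaler:
    """Toy estimator: learns a max value, then scales inputs to [0, 1]."""
    def fit(self, data):
        model = Scaler()
        model.peak = max(data)
        return model

    def transform(self, data):
        return [x / self.peak for x in data]

class PipelineModel:
    """Trained pipeline: transform() runs every fitted stage in order."""
    def __init__(self, models):
        self.models = models

    def transform(self, data):
        for m in self.models:
            data = m.transform(data)
        return data

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        models, current = [], data
        for stage in self.stages:
            model = stage.fit(current)            # train this stage
            models.append(model)
            current = model.transform(current)    # feed output onward
        return PipelineModel(models)

pipeline = Pipeline(stages=[Scaler()])
model = pipeline.fit([1.0, 2.0, 4.0])    # returns a PipelineModel
print(model.transform([2.0, 4.0]))       # scores new data: [0.5, 1.0]
```

The point is that you never score with the Pipeline itself; fit() hands you
back a separate trained object, and that object is what gets serialized
and deployed.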


On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <de...@gmail.com>
wrote:

> I am not sure why I will use pipeline to do scoring...idea is to build a
> model, use model ser/deser feature to put it in the row or column store of
> choice and provide a api access to the model...we support these primitives
> in github.com/Verizon/trapezium...the api has access to spark context in
> local or distributed mode...once model is deserialized as ml model from the
> store of choice within ms, it can be used on incoming features to score
> through spark.ml.Model predict API...I am not clear on 2200x speedup...why
> r we using dataframe and not the ML model directly from API ?

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

Posted by Debasish Das <de...@gmail.com>.
I am not sure why I would use a pipeline to do scoring. The idea is to
build a model, use the model ser/deser feature to put it in the row or
column store of choice, and provide API access to the model. We support
these primitives in github.com/Verizon/trapezium; the API has access to a
Spark context in local or distributed mode. Once the model is deserialized
as an ml model from the store of choice within ms, it can be used on
incoming features to score through the spark.ml.Model predict API. I am
not clear on the 2200x speedup: why are we using a DataFrame and not the
ML model directly from the API?
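For what it's worth, the ser/deser-and-score pattern described above can be
sketched in a few lines of plain Python, using pickle and an in-memory dict
as stand-ins for the real serialization format and the row/column store
(illustrative only, not trapezium or spark.ml):

```python
import pickle

# Stand-in "model": weights for a linear scorer. In the real setup this
# would be a trained spark.ml Model; here we only need something to
# serialize, store, fetch, and score with.
class LinearModel:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias

store = {}  # stand-in for the row or column store of choice

# Training side: serialize the trained model and put it in the store.
store["model:v1"] = pickle.dumps(LinearModel([0.5, -1.0], 2.0))

# Serving side: deserialize once (the "within ms" step), then score
# incoming feature vectors through the model's predict API.
model = pickle.loads(store["model:v1"])
print(model.predict([4.0, 1.0]))  # 0.5*4.0 - 1.0*1.0 + 2.0 = 3.0
```

The debate in this thread is essentially whether that serving-side predict
call should go through a DataFrame-based pipeline or hit the model object
directly, as above.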

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

Posted by Aseem Bansal <as...@gmail.com>.
Does this support Java 7?
What is your timezone in case someone wanted to talk?


Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

Posted by Hollin Wilkins <ho...@combust.ml>.
Hey Aseem,

We have built pipelines that execute several string indexers, one-hot
encoders, scaling, and a random forest or linear regression at the end.
Execution time for the linear regression was on the order of 11
microseconds, a bit longer for random forest. If your pipeline is simple,
this can be further optimized to around 2-3 microseconds by using
row-based transformations. The pipeline operated on roughly 12 input
features, and by the time all the processing was done, we had somewhere
around 1000 features or so going into the linear regression after one-hot
encoding and everything else.
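As a rough illustration of why row-based scoring lands in the microsecond
range, here is a toy benchmark in plain Python (this is not MLeap, and the
weights and timings are made up; numbers will vary by machine):

```python
import time

# Toy row-based pipeline: one-hot encode a categorical feature, then
# score with a linear model -- roughly the shape of pipeline described
# above, minus Spark and MLeap. No DataFrame, no per-call overhead
# beyond the arithmetic itself.
CATEGORIES = {"red": 0, "green": 1, "blue": 2}
WEIGHTS = [0.3, -0.2, 0.7, 1.5]  # 3 one-hot slots + 1 numeric feature
BIAS = 0.1

def score_row(color, value):
    one_hot = [0.0] * len(CATEGORIES)
    one_hot[CATEGORIES[color]] = 1.0
    features = one_hot + [value]
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS

n = 100_000
start = time.perf_counter()
for _ in range(n):
    score_row("green", 2.0)
elapsed = time.perf_counter() - start
print(f"{elapsed / n * 1e6:.2f} microseconds per row")
```

Even in interpreted Python the per-row cost is tiny; a compiled row-based
transformer doing the same work is where the 2-3 microsecond figures come
from.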

Hope this helps,
Hollin

On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal <as...@gmail.com> wrote:

> Does this support Java 7?
>
> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal <as...@gmail.com> wrote:
>
>> Is computational time for predictions on the order of few milliseconds (<
>> 10 ms) like the old mllib library?
>>
>> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins <ho...@combust.ml>
>> wrote:
>>
>>> Hey everyone,
>>>
>>>
>>> Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits
>>> about MLeap and how you can use it to build production services from your
>>> Spark-trained ML pipelines. MLeap is an open-source technology that allows
>>> Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
>>> Models to a scoring engine instantly. The MLeap execution engine has no
>>> dependencies on a Spark context and the serialization format is entirely
>>> based on Protobuf 3 and JSON.
>>>
>>>
>>> The recent 0.5.0 release provides serialization and inference support
>>> for close to 100% of Spark transformers (we don’t yet support ALS and LDA).
>>>
>>>
>>> MLeap is open-source, take a look at our Github page:
>>>
>>> https://github.com/combust/mleap
>>>
>>>
>>> Or join the conversation on Gitter:
>>>
>>> https://gitter.im/combust/mleap
>>>
>>>
>>> We have a set of documentation to help get you started here:
>>>
>>> http://mleap-docs.combust.ml/
>>>
>>>
>>> We even have a set of demos, for training ML Pipelines and linear,
>>> logistic and random forest models:
>>>
>>> https://github.com/combust/mleap-demo
>>>
>>>
>>> Check out our latest MLeap-serving Docker image, which allows you to
>>> expose a REST interface to your Spark ML pipeline models:
>>>
>>> http://mleap-docs.combust.ml/mleap-serving/
>>>
>>>
>>> Several companies are using MLeap in production and even more are
>>> currently evaluating it. Take a look and tell us what you think! We hope to
>>> talk with you soon and welcome feedback/suggestions!
>>>
>>>
>>> Sincerely,
>>>
>>> Hollin and Mikhail
>>>
>>
>>
>

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

Posted by Aseem Bansal <as...@gmail.com>.
Does this support Java 7?


Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

Posted by Aseem Bansal <as...@gmail.com>.
Is computational time for predictions on the order of few milliseconds (<
10 ms) like the old mllib library?
