Posted to user@spark.apache.org by Aseem Bansal <as...@gmail.com> on 2016/09/01 12:37:28 UTC

Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

Hi

Currently I am trying to use NaiveBayes to make predictions, but the
predictions take on the order of a few seconds. I also tried the other model
examples shipped with Spark, and they took a minimum of 500 ms when I used
the Scala API.

Has anyone used Spark ML to do predictions for a single row in under 20 ms?

I am not doing premature optimization. The use case is that we are doing
real-time predictions and we need results within 20 ms, 30 ms at most. This
is a hard limit for our use case.

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

Posted by Nick Pentreath <ni...@gmail.com>.
I should also point out that right now your only option is to code up your
own export functionality (or to read Spark's format in your serving system),
translate the model into the correct format for some other linear algebra or
ML library, and use that for serving.
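As a rough illustration of that export-and-serve idea, here is a minimal, self-contained Scala sketch. The `ExportedNaiveBayes` name and field layout are illustrative assumptions, not a Spark API; the scoring loop clones the log-space multinomial naive Bayes logic (log prior plus a dot product against log likelihoods) rather than calling into Spark, so verify the parameter meanings against the Spark version you export from.

```scala
// Hypothetical minimal export of a multinomial naive Bayes model's parameters.
// logPi(k) is the log prior of class k; logTheta(k)(j) is log P(feature j | class k).
// This mirrors what Spark's NaiveBayesModel stores internally (an assumption to
// verify against your Spark version), but depends on no Spark classes.
case class ExportedNaiveBayes(
    labels: Array[Double],
    logPi: Array[Double],
    logTheta: Array[Array[Double]]) {

  // Score one dense feature vector locally: argmax_k (logPi(k) + x . logTheta(k)).
  def predict(x: Array[Double]): Double = {
    var best = 0
    var bestScore = Double.NegativeInfinity
    var k = 0
    while (k < labels.length) {
      var s = logPi(k)
      var j = 0
      while (j < x.length) { s += x(j) * logTheta(k)(j); j += 1 }
      if (s > bestScore) { bestScore = s; best = k }
      k += 1
    }
    labels(best)
  }
}
```

A serving system would deserialize these arrays once at startup and call `predict` per request, with no Spark job in the hot path.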


Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

Posted by Aseem Bansal <as...@gmail.com>.
Hi

Thanks for all the details. I was able to convert from ml.NaiveBayesModel
to mllib.NaiveBayesModel and get it done. It is fast enough for our use case.

Just one question: before mllib is removed, can the ml package be expected
to reach feature parity with mllib?


Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

Posted by Sean Owen <so...@cloudera.com>.
Yeah, there's a method to predict one Vector in the .mllib API, but not in
the newer one. You could possibly hack your way into calling it anyway, or
just clone the logic.


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

Posted by Nick Pentreath <ni...@gmail.com>.
Right now you are correct that Spark ML APIs do not support predicting on a
single instance (whether Vector for the models or a Row for a pipeline).

See https://issues.apache.org/jira/browse/SPARK-10413 and
https://issues.apache.org/jira/browse/SPARK-16431 (duplicate) for some
discussion.

There may be movement in the short term to support the single Vector case.
But anything for pipelines is not immediately on the horizon I'd say.

N


Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

Posted by Aseem Bansal <as...@gmail.com>.
I understand from a theoretical perspective that the model itself is not
distributed, and thus can be used for making predictions on either a vector
or an RDD. But speaking in terms of the APIs provided by Spark 2.0.0, when I
create a model from large data the recommended way is to fit with the ml
library. I have the option of getting a
http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/classification/NaiveBayesModel.html
 or wrapping it as a
http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/PipelineModel.html

Neither of these has a method that accepts a Vector. How do I bridge this
gap in the API from my side? Is there anything in Spark's API which I have
missed? Or do I need to extract the parameters and use another library for
predictions on a single row?


Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

Posted by Sean Owen <so...@cloudera.com>.
How the model is built isn't really related to how it scores things; here
we're just talking about scoring. NaiveBayesModel can score a Vector, which
is not a distributed entity. That's what you want to use. You do not want to
use a whole distributed operation to score one record. This isn't related to
.ml vs .mllib APIs.




Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

Posted by Aseem Bansal <as...@gmail.com>.
I understand your point.

Is there something like a bridge? Is it possible to convert the model
trained using Dataset<Row> (i.e. the distributed one) to the one which uses
vectors? In Spark 1.6 the mllib package did everything in terms of vectors,
and as per my understanding that should be faster. But many Spark blogs say
that Spark is moving towards the ml package and that the mllib package will
be phased out. So how can someone train on huge data and then use the model
on a row-by-row basis?

Thanks for your inputs.


Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

Posted by Sean Owen <so...@cloudera.com>.
If you're trying to score a single example by way of an RDD or Dataset,
then no, it will never be that fast. It's a whole distributed operation, and
while you might manage low latency for one job at a time, consider what will
happen when hundreds of them are running at once. It's just huge overkill
for scoring a single example (though it's fine for higher-latency,
high-throughput batch operations).

However, if you're scoring a Vector locally, I can't imagine it's that slow.
It does some linear algebra, but it's not that complicated; even something
unoptimized should be fast.
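To make the "it's just some linear algebra" point concrete, here is a self-contained Scala timing sketch. It scores one record as naive Bayes does per class (a log prior plus a dot product), with no Spark dependency; the dimensions and the measured numbers are illustrative and machine-dependent, but on any modern JVM a single scoring call lands in the microsecond range, far below a 20 ms budget.

```scala
object ScoringLatency {
  // One multinomial naive Bayes class score: logPi + x . logTheta.
  def score(logPi: Double, logTheta: Array[Double], x: Array[Double]): Double = {
    var s = logPi
    var j = 0
    while (j < x.length) { s += x(j) * logTheta(j); j += 1 }
    s
  }

  def main(args: Array[String]): Unit = {
    val n = 1000                              // illustrative feature dimension
    val x = Array.fill(n)(1.0)
    val theta = Array.fill(n)(math.log(1.0 / n))
    (1 to 10000).foreach(_ => score(0.0, theta, x))  // warm up the JIT
    val iters = 100000
    val t0 = System.nanoTime()
    var i = 0; var sink = 0.0
    while (i < iters) { sink += score(0.0, theta, x); i += 1 }
    val avgMicros = (System.nanoTime() - t0) / 1e3 / iters
    // sink is printed so the JIT cannot eliminate the loop entirely.
    println(f"avg score time: $avgMicros%.3f microseconds (sink=$sink%.1f)")
  }
}
```

The distributed path, by contrast, pays job scheduling and task dispatch overhead on every call, which is where the hundreds of milliseconds go.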

