Posted to user@predictionio.apache.org by Mars Hall <ma...@salesforce.com> on 2017/09/23 01:35:13 UTC

Re: Eventserver API in an Engine?

I'm bringing this thread back to life!

There is another thread here this week:
*How to training and deploy on different machine?*

In it, Pat replies:

You will have to spread the pio “workflow” out over a permanent
> deploy+eventserver machine. I usually call this a combo PredictionServer
> and EventServer. These are 2 JVM processes that take events and respond to
> queries and so must be available all the time. You will run `pio
> eventserver` and `pio deploy` on this machine.
>

This is exactly what I'm talking about. Two processes on a single machine
to run a complete deployment. Doesn't it make sense to allow these APIs to
coexist in a single JVM?

Sure, in some cases you may want to scale out and tune two different JVMs
for these two different use-cases, but for most of us, making the main
runtime require only a single process/JVM would make PredictionIO much
friendlier to operate.
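
To make that concrete, here is a minimal sketch (not PredictionIO's
actual code; the object name and handler bodies are hypothetical) of
both HTTP APIs bound to a single port in one JVM, using akka-http:

    import akka.actor.ActorSystem
    import akka.http.scaladsl.Http
    import akka.http.scaladsl.model.StatusCodes
    import akka.http.scaladsl.server.Directives._
    import akka.stream.ActorMaterializer

    object SinglePioProcess extends App {
      implicit val system = ActorSystem("pio-combined")
      implicit val materializer = ActorMaterializer()

      // Events API: today this is the separate `pio eventserver` JVM.
      val events = path("events.json") {
        post {
          entity(as[String]) { eventJson =>
            // hypothetical: hand the event to the shared storage layer
            complete(StatusCodes.Created -> """{"eventId":"..."}""")
          }
        }
      }

      // Queries API: today this is the separate `pio deploy` JVM.
      val queries = path("queries.json") {
        post {
          entity(as[String]) { queryJson =>
            // hypothetical: run the engine's Serving code
            complete("""{"result":[]}""")
          }
        }
      }

      // One process, one listener: both APIs on a single port.
      Http().bindAndHandle(events ~ queries, "0.0.0.0", 8000)
    }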

A few more comments inline below…


On Wed, Jul 12, 2017 at 7:43 PM, Kenneth Chan <ke...@apache.org> wrote:

> Mars, I totally understand and agree we should make developers successful,
> but I would like to understand your problem more before jumping to
> conclusions.
>
> First, a complete PIO setup has the following:
> 1. PIO framework layer
> 2. PIO administration (e.g. PIO app)
> 3. PIO event server
> 4. one or more PIO engines
>
> The storage and setup config apply to 1 globally, and the rest (2, 3, 4)
> run on top of 1.
>
> My understanding is that the Buildpack takes engine code and then builds,
> releases, and deploys it, which can then serve queries.
>
> When a Heroku user uses the buildpack,
> - Where is the event server in the picture?
>

The eventserver is considered optional. If a Heroku user wants to use the
events API, then they must provision a second Heroku app for the
eventserver:

https://github.com/heroku/predictionio-buildpack/blob/master/CUSTOM.md#user-content-eventserver


> - How does the user set up the storage config for 1?
>

With the Heroku buildpack, PostgreSQL is the default for all storage
sources, and it is automatically configured.
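
For reference, the generated storage config looks roughly like this
(these are PIO's standard storage variables; the names and database URL
below are illustrative placeholders):

    PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
    PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=PGSQL
    PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
    PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=PGSQL
    PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
    PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=PGSQL
    PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc
    PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://<host>/<dbname>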


> - If I use the buildpack to deploy another engine, does it share 1 and 2
> above?
>

No. Every engine is another Heroku app. Every eventserver is another Heroku
app. These can be configured to intentionally share databases/storage, such
as for a specific engine+eventserver pair.
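
For example, to intentionally pair an engine app with an eventserver
app, the same Heroku Postgres add-on can be attached to both apps so
they resolve the same DATABASE_URL (the app and add-on names here are
hypothetical):

    heroku addons:attach my-shared-postgres --app my-engine
    heroku addons:attach my-shared-postgres --app my-eventserver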



> On Wed, Jul 12, 2017 at 3:21 PM, Mars Hall <ma...@heroku.com> wrote:
>
>> The key motivation behind this idea/request is to:
>>
>>     Simplify baseline PredictionIO deployment, both conceptually &
>> technically.
>>
>> My vision with this thread is to:
>>
>>     Enable single-process, single network-listener PredictionIO app
>> deployment
>>     (i.e. Queries & Events APIs in the same process.)
>>
>>
>> Attempting to address some previous questions & statements…
>>
>>
>> From Pat Ferrel on Tue, 11 Jul 2017 10:53:48 -0700 (PDT):
>> > how much of your problem is workflow vs installation vs bundling of
>> APIs? Can you explain it more?
>>
>> I am focused on deploying PredictionIO on Heroku via this buildpack:
>>   https://github.com/heroku/predictionio-buildpack
>>
>> Heroku is an app-centric platform, where each app gets a single routable
>> network port. By default apps get a URL like:
>>   https://tdx-classi.herokuapp.com (an example PIO Classification engine)
>>
>> Deploying a separate Eventserver app that must be configured to share
>> storage config & backends leads to all kinds of complexity, especially
>> when an unsuspecting developer wants to deploy a new engine with a
>> different storage config and doesn't realize that the Eventserver is not
>> simply shareable. Despite a lot of docs & discussion suggesting its
>> share-ability, there is precious little documentation that presents how
>> the multi-backend Storage really works in PIO. (I didn't understand it
>> until I read a bunch of Storage source code.)
>>
>>
>> From Kenneth Chan on Tue, 11 Jul 2017 12:49:58 -0700 (PDT):
>> > For example, one can modify the classification to train a classifier on
>> the same set of data used by recommendation.
>> …and later on Wed, 12 Jul 2017 13:44:01 -0700:
>> > My concern of embedding event server in engine is
>> > - what problem are we solving by providing an illusion that events are
>> only limited for one engine?
>>
>> This is a great ideal target, but the reality is that it takes some
>> significant design & engineering to reach that level of data share-ability.
>> I'm not suggesting that we do anything to undercut the possibilities of
>> such a distributed architecture. I suggest that we streamline PIO for
>> everyone that is not at that level of distributed architecture. Make PIO
>> not *require* it.
>>
>> The best example I have is that you can run Spark in local mode, without
>> worrying about any aspect of its ideal distributed purpose. (In fact
>> PredictionIO is built on this feature of Spark!) I don't know the history
>> there, but would imagine Spark was not always so friendly for small or
>> embedded tasks like this.
>>
>>
>> A huge part of my reality is seeing how many newcomers fumble around and
>> get frustrated. I'm looking at PredictionIO from a very Heroku-style
>> perspective of "how do we help [new] developers be successful", which is
>> probably going to seem like I want to take away capabilities. I just want
>> to make the onramp more graceful!
>>
>> *Mars
>>
>> ( <> .. <> )
>
>
>


-- 
*Mars Hall
415-818-7039
Customer Facing Architect
Salesforce Platform / Heroku
San Francisco, California

Re: Eventserver API in an Engine?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
And glad you did.

The needs of Heroku are just as important as those of any other user of an Apache project *but no more so*, since one extremely important measure of TLP eligibility is to demonstrate freedom from corporate dominance.

So let me chime in with my own reasons to look at a major refactoring of PIO:
- Simplify deployment: one server with integrated engine(s), all incorporated into a single REST API and a single JVM process (perhaps identical to what Mars is asking for).
- No need to “train” or “deploy” on different machines, but full access to clustered compute and storage services (also something Mars mentions).
- Kappa and non-Spark-based Engines, a pure, clean REST API that allows GUIs to be plugged in, and optional true security (SSL+Auth).
- The ML/AI community is moving on from Hadoop MapReduce, to Spark, to TensorFlow and streaming online learners (Kappa), and this requires independence from any specific compute backend.
- Multi-tenant, with multiple instances and types of Engines.
- Secure: TLS + Authentication + Authorization, done optionally so there is no overhead when it isn’t needed.
- The CLI is just another client communicating with the server’s REST API and can be replaced with custom admin GUIs, for example.

We now have an MVP that delivers the above requirements, but as a replacement for PIO. We at first saw this as PIO-Kappa, and early code was named that. But things have changed, since it requires some major re-thinking, and so it now has its own name—Harness. To get these features, the same re-thinking of the PIO codebase would be required, along with a *lot* of work to implement. We chose to start from scratch as an easier route. The server has one JVM process with REST for all input and query endpoints, and even methods to trigger training for Lambda Engines. We have benchmarked performance on our scaffold Template (a minimal operational Engine) at 6ms/request for one user (connection) in one thread on a 2013 MacBook Pro in localhost mode—add 1ms for SSL+Auth. Since it uses akka-http, it will also handle a self-tuning number of parallel requests (no benchmarks yet). Suffice to say, it is fast.

Templates for this server are quite a bit different, because they now include their own robust validation mechanism for input, query, and engine.json, but also because Templates must now do some of what pio does. With this responsibility comes great freedom. Freedom to use any compute backend. Freedom to use any storage mechanism for model or input. Freedom to be Kappa, Lambda, or any hybrid in between. And Engines get new functionality from the server, as listed in the requirements.
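
To give a rough sense of the shape, here is a simplified sketch with hypothetical names and signatures; the real contract is in the repo linked below:

    // Sketch only: the Engine itself validates engine.json, input, and
    // queries, instead of the framework doing it for every Template.
    trait EngineSketch {
      /** Parse and validate this engine's engine.json config. */
      def init(engineJson: String): Either[String, Unit]
      /** Kappa-style: absorb one input event, updating the model online. */
      def input(eventJson: String): Either[String, Unit]
      /** Serve a prediction from the current model. */
      def query(queryJson: String): Either[String, String]
      /** Lambda-style hybrid: an optional batch (re)training trigger. */
      def train(): Either[String, Unit] = Right(())
    }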

Even though there are structural Template differences, they remain JSON-input compatible with PIO. We took a PIO Template we had created in 2016 that uses Vowpal Wabbit as a compute backend and re-implemented it in this new ML server as a clean Kappa Template. Therefore we can talk about the differences with some evidence to back up statements. There was zero change to input, so backups of the PIO engine were moved to the new server quite easily with the CLI and no change to data.

There are long, tedious discussions that could be had about how to get what Mars and I are asking for from PIO, but Apache is a do-ocracy. All of our asks can be done incrementally with incremental disruption—or they can be done at once (and have been). There are so many trade-offs that the discussion will, in all likelihood, never end.

I therefore suggest that Mars *do* what he thinks is needed, or alternatively, I am willing to donate what we have running. I’m planning to make the UR a Kappa algorithm soon, requiring no `pio train` (and no Spark). This must, of necessity, be done on the new server framework, so whether the new framework becomes part of PIO 2 or not is a choice for the team. I suppose I could just push it to an “experimental” branch, but this is something I’m not willing to *do* without some indication it is welcome.

https://github.com/actionml/harness
https://github.com/actionml/harness/blob/develop/commands.md
https://github.com/actionml/harness/blob/develop/rest_spec.md
Template contract: https://github.com/actionml/harness/tree/develop/rest-server/core/src/main/scala/com/actionml/core/template

The major downside I will volunteer is that Templates will require a fair bit of work to port, and we have no Spark-based ones to use as examples yet. Also, we have not integrated PIO-Stores, as the lead-in diagram implies. Remember, it is an MVP running a Template in a production environment, but it makes no effort to replicate all PIO features.

 