Posted to user@predictionio.apache.org by Donald Szeto <do...@apache.org> on 2016/09/16 17:42:56 UTC

Remove engine registration

Hi all,

I want to start a discussion about removing engine registration. How many
people actually take advantage of being able to run pio commands anywhere
outside of an engine template directory? This would be a nontrivial change
on the operational side, so I want to gauge the potential impact on existing
users.

Pros:
- Stateless builds. This would work well with many PaaS offerings.
- Eliminates the "pio build" command once and for all.
- Ability to use your own build system, e.g. Maven, Ant, or Gradle (see the
build.sbt sketch below).
- Potentially better IDE experience, since engine templates would no longer
depend on an SBT plugin.

Cons:
- Inability to run pio engine training and deployment commands outside of
the engine template directory.
- No automatic version matching between the PIO binary distribution and the
artifact versions used in the engine template.
- A less unified user experience: from pio build-train-deploy to a separate
build step followed by pio train-deploy.
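
To make the build-system point concrete, here is a minimal sketch of what a
standalone engine template build might look like without the pio-build SBT
plugin. The artifact coordinates and versions are illustrative assumptions,
not a tested configuration:

    // build.sbt -- hypothetical standalone engine template build
    // (coordinates and versions are assumptions, for illustration only)
    name := "my-engine-template"

    scalaVersion := "2.10.6"

    libraryDependencies ++= Seq(
      // PredictionIO core is "provided": the pio runtime supplies it at
      // train/deploy time
      "org.apache.predictionio" %% "apache-predictionio-core" % "0.10.0-incubating" % "provided",
      // Spark is likewise provided by the cluster
      "org.apache.spark" %% "spark-core" % "1.6.2" % "provided"
    )

Packaging would then be a plain "sbt package" (or any other build that
produces a jar), and the pio train and deploy commands would only need to
find the resulting artifact.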

Regards,
Donald

Re: Remove engine registration

Posted by Suneel Marthi <su...@gmail.com>.
This is how the Flink guys are doing it -
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65870673

They call them FLIPs (Flink Improvement Proposals), similar to KIPs in Kafka.

You could use a shared Google Doc as well as the wiki and mailing lists.
Make sure that the wiki is the final repository of everything.



On Thu, Sep 22, 2016 at 7:52 PM, Donald Szeto <do...@apache.org> wrote:

> (Dropping user list for dev activities.)
>
> Sounds good. Let's start this collaboration. We should establish a common
> place for collaborative design documents. Do you guys feel like using a
> shared Google Drive, or the Apache wiki?
>
> On Wed, Sep 21, 2016 at 3:09 PM, Marcin Ziemiński <zi...@gmail.com>
> wrote:
>
> > General purpose registry for service discovery is a much bigger thing. We
> > should first think about how we could make PIO more modular and divide it
> > into some logical parts, which could be abstracted and then turned to
> > services, before deciding to create some kind of general registry. There
> > was an issue brought up of creating an admin server as an alternative to
> > Console. The same for the eventserver, which could be treated as a very
> > special case of service responsible for providing eventdata. The serving
> > part of PIO is also another example.
> >
> > As Donald mentioned, it would be sensible create some shared doc, where
> we
> > could try to come up with new design decisions and outline the steps to
> get
> > there. I suppose that discussing one thing such as refactoring a manifest
> > might lead to other changes and propositions in different areas. I'd be
> > willing to help with that.
> >
> > śr., 21.09.2016 o 22:44 użytkownik Pat Ferrel <pa...@occamsmachete.com>
> > napisał:
> >
> >> What do you think about using a general purpose registry, that can also
> >> be used to discover cluster machines, or microservices?
> >> Something like consul.io or docker swarm with and ASF compatible
> >> license? This would be a real step into the future and since some work
> is
> >> needed anyway…
> >>
> >> I think Donald is right that much of this can be made optional—with a
> >> mind towards making a single machine install easy and a cluster install
> >> almost as easy
> >>
> >>
> >> On Sep 21, 2016, at 1:18 PM, Donald Szeto <do...@apache.org> wrote:
> >>
> >> I second with removing engine manifests and add a separate registry for
> >> other meta data (such as where to push engine code, models, and misc.
> >> discovery).
> >>
> >> The current design is a result of realizing the need that producing
> >> predictions from the model requires custom code (scoring function) as
> >> well.
> >> We have bundled training code, predicting (scoring) code together as an
> >> engine, different input parameters as different engine variants, and
> >> engine
> >> instances as an immutable list of metadata that points to an engine,
> >> engine
> >> variant, and trained models. We can definitely draw clearer boundaries
> and
> >> names. We should start a design doc somewhere. Any suggestions?
> >>
> >> I propose to start by making registration optional, then start to
> refactor
> >> manifest and build a proper engine registry.
> >>
> >> Regards,
> >> Donald
> >>
> >> On Wed, Sep 21, 2016 at 12:29 PM, Marcin Ziemiński <zi...@gmail.com>
> >> wrote:
> >>
> >> > I think that getting rid of the manifest.json and introducing a new
> kind
> >> > of resourse-id for an engine to be registered is a good idea.
> >> >
> >> > Currently in the repository there are three important keys:
> >> > * engine id
> >> > * engine version - depends only on the path the engine was built at to
> >> > distinguish copies
> >> > * engine instance id - because of the name may be associated with the
> >> > engine itself, but in fact is the identificator of trained models for
> an
> >> > engine.
> >> > When running deploy you either get the latest trained model for the
> >> > engine-id and engine-version, what strictly ties it to the location it
> >> was
> >> > compiled at or you specify engine instance id. I am not sure, but I
> >> think
> >> > that in the latter case you could get a model for a completely
> different
> >> > engine, what could potentially fail because of initialization with
> >> improper
> >> > parameters.
> >> > What is more, the engine object creation relies only on the full name
> of
> >> > the EngineFactory, so the actual engine, which gets loaded is
> >> determined by
> >> > the current CLASSPATH. I guess that it is probably the place, which
> >> should
> >> > be modified if we want a multi-tenant architecture.
> >> > I have to admit that these things hadn't been completely clear to me,
> >> > until I went through the code.
> >> >
> >> > We could introduce a new type of service for engine and model
> >> management.
> >> > I like the idea of the repository to push built engines under chosen
> >> ids.
> >> > We could also add some versioning of them if necessary.
> >> > I treat this approach purely as some kind of package management
> system.
> >> >
> >> > As Pat said, a similar approach would let us rely only on the
> repository
> >> > and thanks to that run pio commands regardless of the machine and
> >> location.
> >> >
> >> > Separating the engine part from the rest of PIO could potentially
> enable
> >> > us to come up with different architectures in the future and push us
> >> > towards micro-services ecosystem.
> >> >
> >> > What do you think of separating models from engines in more visible
> >> way? I
> >> > mean that engine variants in terms of algorithm parameters are more
> like
> >> > model variants. I just see an engine only as code being a dependency
> for
> >> > application related models/algorithms. So you would register an engine
> >> - as
> >> > a code once and run training for some domain specific data (app) and
> >> > algorithm parameters, what would result in a different identifier,
> that
> >> > would be later used for deployment.
> >> >
> >> > Regards,
> >> > Marcin
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > niedz., 18.09.2016 o 20:02 użytkownik Pat Ferrel <
> pat@occamsmachete.com
> >> >
> >> > napisał:
> >> >
> >> >> This sounds like a good case for Donald’s suggestion.
> >> >>
> >> >> What I was trying to add to the discussion is a way to make all
> >> commands
> >> >> rely on state in the megastore, rather than any file on any machine
> in
> >> a
> >> >> cluster or on ordering of execution or execution from a location in a
> >> >> directory structure. All commands would then be stateless.
> >> >>
> >> >> This enables real use cases like provisioning PIO machines and
> running
> >> >> `pio deploy <resource-id>` to get a new PredictionServer.
> Provisioning
> >> can
> >> >> be container and discovery based rather cleanly.
> >> >>
> >> >>
> >> >> On Sep 17, 2016, at 5:26 PM, Mars Hall <ma...@heroku.com> wrote:
> >> >>
> >> >> Hello folks,
> >> >>
> >> >> Great to hear about this possibility. I've been working on running
> >> >> PredictionIO on Heroku https://www.heroku.com
> >> >>
> >> >> Heroku's 12-factor architecture https://12factor.net prefers
> >> "stateless
> >> >> builds" to ensure that compiled artifacts result in processes which
> >> may be
> >> >> cheaply restarted, replaced, and scaled via process count & size. I
> >> imagine
> >> >> this stateless property would be valuable for others as well.
> >> >>
> >> >> The fact that `pio build` inserts stateful metadata into a database
> >> >> causes ripples throughout the lifecycle of PIO engines on Heroku:
> >> >>
> >> >> * An engine cannot be built for production without the production
> >> >> database available. When a production database contains PII
> (personally
> >> >> identifiable information) which has security compliance requirements,
> >> the
> >> >> build system may not be privileged to access that PII data. This also
> >> >> affects CI (continuous integration/testing), where engines would need
> >> to be
> >> >> rebuilt in production, defeating assurances CI is supposed to
> provide.
> >> >>
> >> >> * The build artifacts cannot be reliably reused. "Slugs" at Heroku
> are
> >> >> intended to be stateless, so that you can rollback to a previous
> >> version
> >> >> during the lifetime of an app. With `pio build` causing database
> >> >> side-effects, there's a greater-than-zero probability of
> >> slug-to-metadata
> >> >> inconsistencies eventually surfacing in a long-running system.
> >> >>
> >> >>
> >> >> From my user-perspective, a few changes to the CLI would fix it:
> >> >>
> >> >> 1. add a "skip registration" option, `pio build
> >> >> --without-engine-registration`
> >> >> 2. a new command `pio app register` that could be run separately in
> the
> >> >> built engine (before training)
> >> >>
> >> >> Alas, I do not know PredictionIO internals, so I can only offer a
> >> >> suggestion for how this might be solved.
> >> >>
> >> >>
> >> >> Donald, one specific note,
> >> >>
> >> >> Regarding "No automatic version matching of PIO binary distribution
> and
> >> >> artifacts version used in the engine template":
> >> >>
> >> >> The Heroku slug contains the PredictionIO binary distribution used to
> >> >> build the engine, so there's never a version matching issue. I guess
> >> some
> >> >> systems might deploy only the engine artifacts to production where a
> >> >> pre-existing PIO binary is available, but that seems like a risky
> >> practice
> >> >> for long-running systems.
> >> >>
> >> >>
> >> >> Thanks for listening,
> >> >>
> >> >> *Mars Hall
> >> >> Customer Facing Architect
> >> >> Salesforce App Cloud / Heroku
> >> >> San Francisco, California
> >> >>
> >> >>> On Sep 16, 2016, at 10:42, Donald Szeto <do...@apache.org> wrote:
> >> >>>
> >> >>> Hi all,
> >> >>>
> >> >>> I want to start the discussion of removing engine registration. How
> >> >> many people actually take advantage of being able to run pio commands
> >> >> everywhere outside of an engine template directory? This will be a
> >> >> nontrivial change on the operational side so I want to gauge the
> >> potential
> >> >> impact to existing users.
> >> >>>
> >> >>> Pros:
> >> >>> - Stateless build. This would work well with many PaaS.
> >> >>> - Eliminate the "pio build" command once and for all.
> >> >>> - Ability to use your own build system, i.e. Maven, Ant, Gradle,
> etc.
> >> >>> - Potentially better experience with IDE since engine templates no
> >> >> longer depends on an SBT plugin.
> >> >>>
> >> >>> Cons:
> >> >>> - Inability to run pio engine training and deployment commands
> outside
> >> >> of engine template directory.
> >> >>> - No automatic version matching of PIO binary distribution and
> >> >> artifacts version used in the engine template.
> >> >>> - A less unified user experience: from pio-build-train-deploy to
> >> build,
> >> >> then pio-train-deploy.
> >> >>>
> >> >>> Regards,
> >> >>> Donald
> >> >>
> >> >>
> >> >>
> >>
> >>
>

Re: Remove engine registration

Posted by Donald Szeto <do...@apache.org>.
(Dropping user list for dev activities.)

Sounds good. Let's start this collaboration. We should establish a common
place for collaborative design documents. Do you guys feel like using a
shared Google Drive, or the Apache wiki?

On Wed, Sep 21, 2016 at 3:09 PM, Marcin Ziemiński <zi...@gmail.com> wrote:

> General purpose registry for service discovery is a much bigger thing. We
> should first think about how we could make PIO more modular and divide it
> into some logical parts, which could be abstracted and then turned to
> services, before deciding to create some kind of general registry. There
> was an issue brought up of creating an admin server as an alternative to
> Console. The same for the eventserver, which could be treated as a very
> special case of service responsible for providing eventdata. The serving
> part of PIO is also another example.
>
> As Donald mentioned, it would be sensible create some shared doc, where we
> could try to come up with new design decisions and outline the steps to get
> there. I suppose that discussing one thing such as refactoring a manifest
> might lead to other changes and propositions in different areas. I'd be
> willing to help with that.
>
> śr., 21.09.2016 o 22:44 użytkownik Pat Ferrel <pa...@occamsmachete.com>
> napisał:
>
>> What do you think about using a general purpose registry, that can also
>> be used to discover cluster machines, or microservices?
>> Something like consul.io or docker swarm with and ASF compatible
>> license? This would be a real step into the future and since some work is
>> needed anyway…
>>
>> I think Donald is right that much of this can be made optional—with a
>> mind towards making a single machine install easy and a cluster install
>> almost as easy
>>
>>
>> On Sep 21, 2016, at 1:18 PM, Donald Szeto <do...@apache.org> wrote:
>>
>> I second with removing engine manifests and add a separate registry for
>> other meta data (such as where to push engine code, models, and misc.
>> discovery).
>>
>> The current design is a result of realizing the need that producing
>> predictions from the model requires custom code (scoring function) as
>> well.
>> We have bundled training code, predicting (scoring) code together as an
>> engine, different input parameters as different engine variants, and
>> engine
>> instances as an immutable list of metadata that points to an engine,
>> engine
>> variant, and trained models. We can definitely draw clearer boundaries and
>> names. We should start a design doc somewhere. Any suggestions?
>>
>> I propose to start by making registration optional, then start to refactor
>> manifest and build a proper engine registry.
>>
>> Regards,
>> Donald
>>
>> On Wed, Sep 21, 2016 at 12:29 PM, Marcin Ziemiński <zi...@gmail.com>
>> wrote:
>>
>> > I think that getting rid of the manifest.json and introducing a new kind
>> > of resourse-id for an engine to be registered is a good idea.
>> >
>> > Currently in the repository there are three important keys:
>> > * engine id
>> > * engine version - depends only on the path the engine was built at to
>> > distinguish copies
>> > * engine instance id - because of the name may be associated with the
>> > engine itself, but in fact is the identificator of trained models for an
>> > engine.
>> > When running deploy you either get the latest trained model for the
>> > engine-id and engine-version, what strictly ties it to the location it
>> was
>> > compiled at or you specify engine instance id. I am not sure, but I
>> think
>> > that in the latter case you could get a model for a completely different
>> > engine, what could potentially fail because of initialization with
>> improper
>> > parameters.
>> > What is more, the engine object creation relies only on the full name of
>> > the EngineFactory, so the actual engine, which gets loaded is
>> determined by
>> > the current CLASSPATH. I guess that it is probably the place, which
>> should
>> > be modified if we want a multi-tenant architecture.
>> > I have to admit that these things hadn't been completely clear to me,
>> > until I went through the code.
>> >
>> > We could introduce a new type of service for engine and model
>> management.
>> > I like the idea of the repository to push built engines under chosen
>> ids.
>> > We could also add some versioning of them if necessary.
>> > I treat this approach purely as some kind of package management system.
>> >
>> > As Pat said, a similar approach would let us rely only on the repository
>> > and thanks to that run pio commands regardless of the machine and
>> location.
>> >
>> > Separating the engine part from the rest of PIO could potentially enable
>> > us to come up with different architectures in the future and push us
>> > towards micro-services ecosystem.
>> >
>> > What do you think of separating models from engines in more visible
>> way? I
>> > mean that engine variants in terms of algorithm parameters are more like
>> > model variants. I just see an engine only as code being a dependency for
>> > application related models/algorithms. So you would register an engine
>> - as
>> > a code once and run training for some domain specific data (app) and
>> > algorithm parameters, what would result in a different identifier, that
>> > would be later used for deployment.
>> >
>> > Regards,
>> > Marcin
>> >
>> >
>> >
>> >
>> >
>> > niedz., 18.09.2016 o 20:02 użytkownik Pat Ferrel <pat@occamsmachete.com
>> >
>> > napisał:
>> >
>> >> This sounds like a good case for Donald’s suggestion.
>> >>
>> >> What I was trying to add to the discussion is a way to make all
>> commands
>> >> rely on state in the megastore, rather than any file on any machine in
>> a
>> >> cluster or on ordering of execution or execution from a location in a
>> >> directory structure. All commands would then be stateless.
>> >>
>> >> This enables real use cases like provisioning PIO machines and running
>> >> `pio deploy <resource-id>` to get a new PredictionServer. Provisioning
>> can
>> >> be container and discovery based rather cleanly.
>> >>
>> >>
>> >> On Sep 17, 2016, at 5:26 PM, Mars Hall <ma...@heroku.com> wrote:
>> >>
>> >> Hello folks,
>> >>
>> >> Great to hear about this possibility. I've been working on running
>> >> PredictionIO on Heroku https://www.heroku.com
>> >>
>> >> Heroku's 12-factor architecture https://12factor.net prefers
>> "stateless
>> >> builds" to ensure that compiled artifacts result in processes which
>> may be
>> >> cheaply restarted, replaced, and scaled via process count & size. I
>> imagine
>> >> this stateless property would be valuable for others as well.
>> >>
>> >> The fact that `pio build` inserts stateful metadata into a database
>> >> causes ripples throughout the lifecycle of PIO engines on Heroku:
>> >>
>> >> * An engine cannot be built for production without the production
>> >> database available. When a production database contains PII (personally
>> >> identifiable information) which has security compliance requirements,
>> the
>> >> build system may not be privileged to access that PII data. This also
>> >> affects CI (continuous integration/testing), where engines would need
>> to be
>> >> rebuilt in production, defeating assurances CI is supposed to provide.
>> >>
>> >> * The build artifacts cannot be reliably reused. "Slugs" at Heroku are
>> >> intended to be stateless, so that you can rollback to a previous
>> version
>> >> during the lifetime of an app. With `pio build` causing database
>> >> side-effects, there's a greater-than-zero probability of
>> slug-to-metadata
>> >> inconsistencies eventually surfacing in a long-running system.
>> >>
>> >>
>> >> From my user-perspective, a few changes to the CLI would fix it:
>> >>
>> >> 1. add a "skip registration" option, `pio build
>> >> --without-engine-registration`
>> >> 2. a new command `pio app register` that could be run separately in the
>> >> built engine (before training)
>> >>
>> >> Alas, I do not know PredictionIO internals, so I can only offer a
>> >> suggestion for how this might be solved.
>> >>
>> >>
>> >> Donald, one specific note,
>> >>
>> >> Regarding "No automatic version matching of PIO binary distribution and
>> >> artifacts version used in the engine template":
>> >>
>> >> The Heroku slug contains the PredictionIO binary distribution used to
>> >> build the engine, so there's never a version matching issue. I guess
>> some
>> >> systems might deploy only the engine artifacts to production where a
>> >> pre-existing PIO binary is available, but that seems like a risky
>> practice
>> >> for long-running systems.
>> >>
>> >>
>> >> Thanks for listening,
>> >>
>> >> *Mars Hall
>> >> Customer Facing Architect
>> >> Salesforce App Cloud / Heroku
>> >> San Francisco, California
>> >>
>> >>> On Sep 16, 2016, at 10:42, Donald Szeto <do...@apache.org> wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> I want to start the discussion of removing engine registration. How
>> >> many people actually take advantage of being able to run pio commands
>> >> everywhere outside of an engine template directory? This will be a
>> >> nontrivial change on the operational side so I want to gauge the
>> potential
>> >> impact to existing users.
>> >>>
>> >>> Pros:
>> >>> - Stateless build. This would work well with many PaaS.
>> >>> - Eliminate the "pio build" command once and for all.
>> >>> - Ability to use your own build system, i.e. Maven, Ant, Gradle, etc.
>> >>> - Potentially better experience with IDE since engine templates no
>> >> longer depends on an SBT plugin.
>> >>>
>> >>> Cons:
>> >>> - Inability to run pio engine training and deployment commands outside
>> >> of engine template directory.
>> >>> - No automatic version matching of PIO binary distribution and
>> >> artifacts version used in the engine template.
>> >>> - A less unified user experience: from pio-build-train-deploy to
>> build,
>> >> then pio-train-deploy.
>> >>>
>> >>> Regards,
>> >>> Donald
>> >>
>> >>
>> >>
>>
>>

Re: Remove engine registration

Posted by Marcin Ziemiński <zi...@gmail.com>.
A general purpose registry for service discovery is a much bigger thing. We
should first think about how we could make PIO more modular and divide it
into logical parts, which could be abstracted and then turned into services,
before deciding to create some kind of general registry. There was an issue
brought up about creating an admin server as an alternative to the Console.
The same goes for the event server, which could be treated as a special case
of a service responsible for providing event data. The serving part of PIO
is another example.

As Donald mentioned, it would be sensible to create a shared doc where we
could try to come up with new design decisions and outline the steps to get
there. I suppose that discussing one thing, such as refactoring the manifest,
might lead to other changes and proposals in different areas. I'd be
willing to help with that.

On Wed, Sep 21, 2016 at 22:44, Pat Ferrel <pa...@occamsmachete.com>
wrote:

> What do you think about using a general purpose registry, that can also be
> used to discover cluster machines, or microservices?
> Something like consul.io or docker swarm with and ASF compatible license?
> This would be a real step into the future and since some work is needed
> anyway…
>
> I think Donald is right that much of this can be made optional—with a mind
> towards making a single machine install easy and a cluster install almost
> as easy
>
>
> On Sep 21, 2016, at 1:18 PM, Donald Szeto <do...@apache.org> wrote:
>
> I second with removing engine manifests and add a separate registry for
> other meta data (such as where to push engine code, models, and misc.
> discovery).
>
> The current design is a result of realizing the need that producing
> predictions from the model requires custom code (scoring function) as well.
> We have bundled training code, predicting (scoring) code together as an
> engine, different input parameters as different engine variants, and engine
> instances as an immutable list of metadata that points to an engine, engine
> variant, and trained models. We can definitely draw clearer boundaries and
> names. We should start a design doc somewhere. Any suggestions?
>
> I propose to start by making registration optional, then start to refactor
> manifest and build a proper engine registry.
>
> Regards,
> Donald
>
> On Wed, Sep 21, 2016 at 12:29 PM, Marcin Ziemiński <zi...@gmail.com>
> wrote:
>
> > I think that getting rid of the manifest.json and introducing a new kind
> > of resourse-id for an engine to be registered is a good idea.
> >
> > Currently in the repository there are three important keys:
> > * engine id
> > * engine version - depends only on the path the engine was built at to
> > distinguish copies
> > * engine instance id - because of the name may be associated with the
> > engine itself, but in fact is the identificator of trained models for an
> > engine.
> > When running deploy you either get the latest trained model for the
> > engine-id and engine-version, what strictly ties it to the location it
> was
> > compiled at or you specify engine instance id. I am not sure, but I think
> > that in the latter case you could get a model for a completely different
> > engine, what could potentially fail because of initialization with
> improper
> > parameters.
> > What is more, the engine object creation relies only on the full name of
> > the EngineFactory, so the actual engine, which gets loaded is determined
> by
> > the current CLASSPATH. I guess that it is probably the place, which
> should
> > be modified if we want a multi-tenant architecture.
> > I have to admit that these things hadn't been completely clear to me,
> > until I went through the code.
> >
> > We could introduce a new type of service for engine and model management.
> > I like the idea of the repository to push built engines under chosen ids.
> > We could also add some versioning of them if necessary.
> > I treat this approach purely as some kind of package management system.
> >
> > As Pat said, a similar approach would let us rely only on the repository
> > and thanks to that run pio commands regardless of the machine and
> location.
> >
> > Separating the engine part from the rest of PIO could potentially enable
> > us to come up with different architectures in the future and push us
> > towards micro-services ecosystem.
> >
> > What do you think of separating models from engines in more visible way?
> I
> > mean that engine variants in terms of algorithm parameters are more like
> > model variants. I just see an engine only as code being a dependency for
> > application related models/algorithms. So you would register an engine -
> as
> > a code once and run training for some domain specific data (app) and
> > algorithm parameters, what would result in a different identifier, that
> > would be later used for deployment.
> >
> > Regards,
> > Marcin
> >
> >
> >
> >
> >
> > niedz., 18.09.2016 o 20:02 użytkownik Pat Ferrel <pa...@occamsmachete.com>
> > napisał:
> >
> >> This sounds like a good case for Donald’s suggestion.
> >>
> >> What I was trying to add to the discussion is a way to make all commands
> >> rely on state in the megastore, rather than any file on any machine in a
> >> cluster or on ordering of execution or execution from a location in a
> >> directory structure. All commands would then be stateless.
> >>
> >> This enables real use cases like provisioning PIO machines and running
> >> `pio deploy <resource-id>` to get a new PredictionServer. Provisioning
> can
> >> be container and discovery based rather cleanly.
> >>
> >>
> >> On Sep 17, 2016, at 5:26 PM, Mars Hall <ma...@heroku.com> wrote:
> >>
> >> Hello folks,
> >>
> >> Great to hear about this possibility. I've been working on running
> >> PredictionIO on Heroku https://www.heroku.com
> >>
> >> Heroku's 12-factor architecture https://12factor.net prefers "stateless
> >> builds" to ensure that compiled artifacts result in processes which may
> be
> >> cheaply restarted, replaced, and scaled via process count & size. I
> imagine
> >> this stateless property would be valuable for others as well.
> >>
> >> The fact that `pio build` inserts stateful metadata into a database
> >> causes ripples throughout the lifecycle of PIO engines on Heroku:
> >>
> >> * An engine cannot be built for production without the production
> >> database available. When a production database contains PII (personally
> >> identifiable information) which has security compliance requirements,
> the
> >> build system may not be privileged to access that PII data. This also
> >> affects CI (continuous integration/testing), where engines would need
> to be
> >> rebuilt in production, defeating assurances CI is supposed to provide.
> >>
> >> * The build artifacts cannot be reliably reused. "Slugs" at Heroku are
> >> intended to be stateless, so that you can rollback to a previous version
> >> during the lifetime of an app. With `pio build` causing database
> >> side-effects, there's a greater-than-zero probability of
> slug-to-metadata
> >> inconsistencies eventually surfacing in a long-running system.
> >>
> >>
> >> From my user-perspective, a few changes to the CLI would fix it:
> >>
> >> 1. add a "skip registration" option, `pio build
> >> --without-engine-registration`
> >> 2. a new command `pio app register` that could be run separately in the
> >> built engine (before training)
> >>
> >> Alas, I do not know PredictionIO internals, so I can only offer a
> >> suggestion for how this might be solved.
> >>
> >>
> >> Donald, one specific note,
> >>
> >> Regarding "No automatic version matching of PIO binary distribution and
> >> artifacts version used in the engine template":
> >>
> >> The Heroku slug contains the PredictionIO binary distribution used to
> >> build the engine, so there's never a version matching issue. I guess
> some
> >> systems might deploy only the engine artifacts to production where a
> >> pre-existing PIO binary is available, but that seems like a risky
> practice
> >> for long-running systems.
> >>
> >>
> >> Thanks for listening,
> >>
> >> *Mars Hall
> >> Customer Facing Architect
> >> Salesforce App Cloud / Heroku
> >> San Francisco, California
> >>
> >>> On Sep 16, 2016, at 10:42, Donald Szeto <do...@apache.org> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I want to start the discussion of removing engine registration. How
> >> many people actually take advantage of being able to run pio commands
> >> everywhere outside of an engine template directory? This will be a
> >> nontrivial change on the operational side so I want to gauge the
> potential
> >> impact to existing users.
> >>>
> >>> Pros:
> >>> - Stateless build. This would work well with many PaaS.
> >>> - Eliminate the "pio build" command once and for all.
> >>> - Ability to use your own build system, i.e. Maven, Ant, Gradle, etc.
> >>> - Potentially better experience with IDE since engine templates no
> >> longer depends on an SBT plugin.
> >>>
> >>> Cons:
> >>> - Inability to run pio engine training and deployment commands outside
> >> of engine template directory.
> >>> - No automatic version matching of PIO binary distribution and
> >> artifacts version used in the engine template.
> >>> - A less unified user experience: from pio-build-train-deploy to build,
> >> then pio-train-deploy.
> >>>
> >>> Regards,
> >>> Donald
> >>
> >>
> >>
>
>

Re: Remove engine registration

Posted by Pat Ferrel <pa...@occamsmachete.com>.
What do you think about using a general purpose registry that can also be
used to discover cluster machines or microservices? Something like
consul.io or Docker Swarm with an ASF-compatible license? This would be a
real step into the future, and since some work is needed anyway…

I think Donald is right that much of this can be made optional, with a mind
towards making a single-machine install easy and a cluster install almost
as easy.
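
As a purely illustrative sketch (it assumes Consul's standard HTTP KV API on
a local agent at the default port, plus hypothetical key names), pushing
engine metadata into such a registry could be a single HTTP PUT:

    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets

    // Hypothetical helper: store engine metadata in Consul's KV store.
    // Assumes a local Consul agent listening on the default port 8500.
    def registerEngine(engineId: String, jarUri: String): Boolean = {
      val url = new URL(s"http://localhost:8500/v1/kv/pio/engines/$engineId")
      val conn = url.openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("PUT")
      conn.setDoOutput(true)
      val out = conn.getOutputStream
      out.write(s"""{"jar":"$jarUri"}""".getBytes(StandardCharsets.UTF_8))
      out.close()
      conn.getResponseCode == 200 // Consul answers 200 with "true" on success
    }

A deploy could then look an engine up by id from any machine, instead of
reading a local manifest.json.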


On Sep 21, 2016, at 1:18 PM, Donald Szeto <do...@apache.org> wrote:

I second with removing engine manifests and add a separate registry for
other meta data (such as where to push engine code, models, and misc.
discovery).

The current design is a result of realizing the need that producing
predictions from the model requires custom code (scoring function) as well.
We have bundled training code, predicting (scoring) code together as an
engine, different input parameters as different engine variants, and engine
instances as an immutable list of metadata that points to an engine, engine
variant, and trained models. We can definitely draw clearer boundaries and
names. We should start a design doc somewhere. Any suggestions?

I propose to start by making registration optional, then start to refactor
manifest and build a proper engine registry.

Regards,
Donald

On Wed, Sep 21, 2016 at 12:29 PM, Marcin Ziemiński <zi...@gmail.com>
wrote:

> I think that getting rid of the manifest.json and introducing a new kind
> of resourse-id for an engine to be registered is a good idea.
> 
> Currently in the repository there are three important keys:
> * engine id
> * engine version - depends only on the path the engine was built at to
> distinguish copies
> * engine instance id - because of the name may be associated with the
> engine itself, but in fact is the identificator of trained models for an
> engine.
> When running deploy you either get the latest trained model for the
> engine-id and engine-version, what strictly ties it to the location it was
> compiled at or you specify engine instance id. I am not sure, but I think
> that in the latter case you could get a model for a completely different
> engine, what could potentially fail because of initialization with improper
> parameters.
> What is more, the engine object creation relies only on the full name of
> the EngineFactory, so the actual engine, which gets loaded is determined by
> the current CLASSPATH. I guess that it is probably the place, which should
> be modified if we want a multi-tenant architecture.
> I have to admit that these things hadn't been completely clear to me,
> until I went through the code.
> 
> We could introduce a new type of service for engine and model management.
> I like the idea of the repository to push built engines under chosen ids.
> We could also add some versioning of them if necessary.
> I treat this approach purely as some kind of package management system.
> 
> As Pat said, a similar approach would let us rely only on the repository
> and thanks to that run pio commands regardless of the machine and location.
> 
> Separating the engine part from the rest of PIO could potentially enable
> us to come up with different architectures in the future and push us
> towards micro-services ecosystem.
> 
> What do you think of separating models from engines in more visible way? I
> mean that engine variants in terms of algorithm parameters are more like
> model variants. I just see an engine only as code being a dependency for
> application related models/algorithms. So you would register an engine - as
> a code once and run training for some domain specific data (app) and
> algorithm parameters, what would result in a different identifier, that
> would be later used for deployment.
> 
> Regards,
> Marcin
> 
> 
> 
> 
> 
> niedz., 18.09.2016 o 20:02 użytkownik Pat Ferrel <pa...@occamsmachete.com>
> napisał:
> 
>> This sounds like a good case for Donald’s suggestion.
>> 
>> What I was trying to add to the discussion is a way to make all commands
>> rely on state in the megastore, rather than any file on any machine in a
>> cluster or on ordering of execution or execution from a location in a
>> directory structure. All commands would then be stateless.
>> 
>> This enables real use cases like provisioning PIO machines and running
>> `pio deploy <resource-id>` to get a new PredictionServer. Provisioning can
>> be container and discovery based rather cleanly.
>> 
>> 
>> On Sep 17, 2016, at 5:26 PM, Mars Hall <ma...@heroku.com> wrote:
>> 
>> Hello folks,
>> 
>> Great to hear about this possibility. I've been working on running
>> PredictionIO on Heroku https://www.heroku.com
>> 
>> Heroku's 12-factor architecture https://12factor.net prefers "stateless
>> builds" to ensure that compiled artifacts result in processes which may be
>> cheaply restarted, replaced, and scaled via process count & size. I imagine
>> this stateless property would be valuable for others as well.
>> 
>> The fact that `pio build` inserts stateful metadata into a database
>> causes ripples throughout the lifecycle of PIO engines on Heroku:
>> 
>> * An engine cannot be built for production without the production
>> database available. When a production database contains PII (personally
>> identifiable information) which has security compliance requirements, the
>> build system may not be privileged to access that PII data. This also
>> affects CI (continuous integration/testing), where engines would need to be
>> rebuilt in production, defeating assurances CI is supposed to provide.
>> 
>> * The build artifacts cannot be reliably reused. "Slugs" at Heroku are
>> intended to be stateless, so that you can rollback to a previous version
>> during the lifetime of an app. With `pio build` causing database
>> side-effects, there's a greater-than-zero probability of slug-to-metadata
>> inconsistencies eventually surfacing in a long-running system.
>> 
>> 
>> From my user-perspective, a few changes to the CLI would fix it:
>> 
>> 1. add a "skip registration" option, `pio build
>> --without-engine-registration`
>> 2. a new command `pio app register` that could be run separately in the
>> built engine (before training)
>> 
>> Alas, I do not know PredictionIO internals, so I can only offer a
>> suggestion for how this might be solved.
>> 
>> 
>> Donald, one specific note,
>> 
>> Regarding "No automatic version matching of PIO binary distribution and
>> artifacts version used in the engine template":
>> 
>> The Heroku slug contains the PredictionIO binary distribution used to
>> build the engine, so there's never a version matching issue. I guess some
>> systems might deploy only the engine artifacts to production where a
>> pre-existing PIO binary is available, but that seems like a risky practice
>> for long-running systems.
>> 
>> 
>> Thanks for listening,
>> 
>> *Mars Hall
>> Customer Facing Architect
>> Salesforce App Cloud / Heroku
>> San Francisco, California
>> 
>>> On Sep 16, 2016, at 10:42, Donald Szeto <do...@apache.org> wrote:
>>> 
>>> Hi all,
>>> 
>>> I want to start the discussion of removing engine registration. How
>> many people actually take advantage of being able to run pio commands
>> everywhere outside of an engine template directory? This will be a
>> nontrivial change on the operational side so I want to gauge the potential
>> impact to existing users.
>>> 
>>> Pros:
>>> - Stateless build. This would work well with many PaaS.
>>> - Eliminate the "pio build" command once and for all.
>>> - Ability to use your own build system, i.e. Maven, Ant, Gradle, etc.
>>> - Potentially better experience with IDE since engine templates no
>> longer depends on an SBT plugin.
>>> 
>>> Cons:
>>> - Inability to run pio engine training and deployment commands outside
>> of engine template directory.
>>> - No automatic version matching of PIO binary distribution and
>> artifacts version used in the engine template.
>>> - A less unified user experience: from pio-build-train-deploy to build,
>> then pio-train-deploy.
>>> 
>>> Regards,
>>> Donald
>> 
>> 
>> 


Re: Remove engine registration

Posted by Donald Szeto <do...@apache.org>.
I second removing engine manifests and adding a separate registry for
other metadata (such as where to push engine code, models, and
miscellaneous discovery information).

The current design is a result of realizing that producing predictions
from the model requires custom code (a scoring function) as well. We have
bundled training code and predicting (scoring) code together as an engine,
treated different input parameters as different engine variants, and
modeled engine instances as immutable lists of metadata that point to an
engine, an engine variant, and trained models. We can definitely draw
clearer boundaries and names. We should start a design doc somewhere. Any
suggestions?

I propose to start by making registration optional, then refactor the
manifest and build a proper engine registry.
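
To make those boundaries concrete, a rough Scala sketch of the metadata
involved (hypothetical names, not the current PIO storage classes):

    // Rough sketch only; names are illustrative, not PIO's actual storage API.
    final case class EngineRecord(engineId: String, artifactUri: String)   // built engine code
    final case class EngineVariant(variantId: String, engineId: String,
                                   paramsJson: String)                     // one engine.json variant
    final case class EngineInstance(instanceId: String, variantId: String,
                                    modelUri: String)                      // result of one training run

    // A registry would hold these under externally visible ids so that
    // train and deploy can resolve everything from metadata alone.
    final case class EngineRegistry(engines: Map[String, EngineRecord],
                                    variants: Map[String, EngineVariant],
                                    instances: Map[String, EngineInstance])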

Regards,
Donald

On Wed, Sep 21, 2016 at 12:29 PM, Marcin Ziemiński <zi...@gmail.com>
wrote:

> I think that getting rid of the manifest.json and introducing a new kind
> of resourse-id for an engine to be registered is a good idea.
>
> Currently in the repository there are three important keys:
> * engine id
> * engine version - depends only on the path the engine was built at to
> distinguish copies
> * engine instance id - because of the name may be associated with the
> engine itself, but in fact is the identificator of trained models for an
> engine.
> When running deploy you either get the latest trained model for the
> engine-id and engine-version, what strictly ties it to the location it was
> compiled at or you specify engine instance id. I am not sure, but I think
> that in the latter case you could get a model for a completely different
> engine, what could potentially fail because of initialization with improper
> parameters.
> What is more, the engine object creation relies only on the full name of
> the EngineFactory, so the actual engine, which gets loaded is determined by
> the current CLASSPATH. I guess that it is probably the place, which should
> be modified if we want a multi-tenant architecture.
> I have to admit that these things hadn't been completely clear to me,
> until I went through the code.
>
> We could introduce a new type of service for engine and model management.
> I like the idea of the repository to push built engines under chosen ids.
> We could also add some versioning of them if necessary.
> I treat this approach purely as some kind of package management system.
>
> As Pat said, a similar approach would let us rely only on the repository
> and thanks to that run pio commands regardless of the machine and location.
>
> Separating the engine part from the rest of PIO could potentially enable
> us to come up with different architectures in the future and push us
> towards micro-services ecosystem.
>
> What do you think of separating models from engines in more visible way? I
> mean that engine variants in terms of algorithm parameters are more like
> model variants. I just see an engine only as code being a dependency for
> application related models/algorithms. So you would register an engine - as
> a code once and run training for some domain specific data (app) and
> algorithm parameters, what would result in a different identifier, that
> would be later used for deployment.
>
> Regards,
> Marcin
>
>
>
>
>
> niedz., 18.09.2016 o 20:02 użytkownik Pat Ferrel <pa...@occamsmachete.com>
> napisał:
>
>> This sounds like a good case for Donald’s suggestion.
>>
>> What I was trying to add to the discussion is a way to make all commands
>> rely on state in the megastore, rather than any file on any machine in a
>> cluster or on ordering of execution or execution from a location in a
>> directory structure. All commands would then be stateless.
>>
>> This enables real use cases like provisioning PIO machines and running
>> `pio deploy <resource-id>` to get a new PredictionServer. Provisioning can
>> be container and discovery based rather cleanly.
>>
>>
>> On Sep 17, 2016, at 5:26 PM, Mars Hall <ma...@heroku.com> wrote:
>>
>> Hello folks,
>>
>> Great to hear about this possibility. I've been working on running
>> PredictionIO on Heroku https://www.heroku.com
>>
>> Heroku's 12-factor architecture https://12factor.net prefers "stateless
>> builds" to ensure that compiled artifacts result in processes which may be
>> cheaply restarted, replaced, and scaled via process count & size. I imagine
>> this stateless property would be valuable for others as well.
>>
>> The fact that `pio build` inserts stateful metadata into a database
>> causes ripples throughout the lifecycle of PIO engines on Heroku:
>>
>> * An engine cannot be built for production without the production
>> database available. When a production database contains PII (personally
>> identifiable information) which has security compliance requirements, the
>> build system may not be privileged to access that PII data. This also
>> affects CI (continuous integration/testing), where engines would need to be
>> rebuilt in production, defeating assurances CI is supposed to provide.
>>
>> * The build artifacts cannot be reliably reused. "Slugs" at Heroku are
>> intended to be stateless, so that you can rollback to a previous version
>> during the lifetime of an app. With `pio build` causing database
>> side-effects, there's a greater-than-zero probability of slug-to-metadata
>> inconsistencies eventually surfacing in a long-running system.
>>
>>
>> From my user-perspective, a few changes to the CLI would fix it:
>>
>> 1. add a "skip registration" option, `pio build
>> --without-engine-registration`
>> 2. a new command `pio app register` that could be run separately in the
>> built engine (before training)
>>
>> Alas, I do not know PredictionIO internals, so I can only offer a
>> suggestion for how this might be solved.
>>
>>
>> Donald, one specific note,
>>
>> Regarding "No automatic version matching of PIO binary distribution and
>> artifacts version used in the engine template":
>>
>> The Heroku slug contains the PredictionIO binary distribution used to
>> build the engine, so there's never a version matching issue. I guess some
>> systems might deploy only the engine artifacts to production where a
>> pre-existing PIO binary is available, but that seems like a risky practice
>> for long-running systems.
>>
>>
>> Thanks for listening,
>>
>> *Mars Hall
>> Customer Facing Architect
>> Salesforce App Cloud / Heroku
>> San Francisco, California
>>
>> > On Sep 16, 2016, at 10:42, Donald Szeto <do...@apache.org> wrote:
>> >
>> > Hi all,
>> >
>> > I want to start the discussion of removing engine registration. How
>> many people actually take advantage of being able to run pio commands
>> everywhere outside of an engine template directory? This will be a
>> nontrivial change on the operational side so I want to gauge the potential
>> impact to existing users.
>> >
>> > Pros:
>> > - Stateless build. This would work well with many PaaS.
>> > - Eliminate the "pio build" command once and for all.
>> > - Ability to use your own build system, i.e. Maven, Ant, Gradle, etc.
>> > - Potentially better experience with IDE since engine templates no
>> longer depends on an SBT plugin.
>> >
>> > Cons:
>> > - Inability to run pio engine training and deployment commands outside
>> of engine template directory.
>> > - No automatic version matching of PIO binary distribution and
>> artifacts version used in the engine template.
>> > - A less unified user experience: from pio-build-train-deploy to build,
>> then pio-train-deploy.
>> >
>> > Regards,
>> > Donald
>>
>>
>>

Re: Remove engine registration

Posted by Marcin Ziemiński <zi...@gmail.com>.
I think that getting rid of manifest.json and introducing a new kind of
resource-id for an engine to be registered is a good idea.

Currently in the repository there are three important keys:
* engine id
* engine version - depends only on the path the engine was built at, to
distinguish copies
* engine instance id - the name suggests it belongs to the engine itself,
but it is in fact the identifier of the trained models for an engine.
When running deploy, you either get the latest trained model for the
engine-id and engine-version, which strictly ties it to the location it
was compiled at, or you specify an engine instance id. I am not sure, but
I think that in the latter case you could get a model for a completely
different engine, which could potentially fail because of initialization
with improper parameters.
What is more, engine object creation relies only on the fully qualified
name of the EngineFactory, so the actual engine that gets loaded is
determined by the current CLASSPATH. I guess that is probably the place
that should be modified if we want a multi-tenant architecture.
I have to admit that these things had not been completely clear to me
until I went through the code.
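
As a rough illustration of that coupling (made-up class names, not PIO's
actual code), the factory that gets used is simply whatever class of that
name happens to be on the CLASSPATH:

    // Illustration only: resolving an "engine factory" purely by class name.
    trait EngineFactory {
      def createEngine(): String // stand-in for constructing the real Engine object
    }

    class ExampleEngineFactory extends EngineFactory {
      def createEngine(): String = "engine built by ExampleEngineFactory"
    }

    object FactoryLoader {
      // Whichever class with this fully qualified name is on the CLASSPATH wins.
      def load(fqcn: String): EngineFactory =
        Class.forName(fqcn)
          .getDeclaredConstructor()
          .newInstance()
          .asInstanceOf[EngineFactory]

      def main(args: Array[String]): Unit =
        println(load("ExampleEngineFactory").createEngine())
    }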

We could introduce a new type of service for engine and model management.
I like the idea of a repository that built engines are pushed to under
chosen ids. We could also add some versioning of them if necessary.
I treat this approach purely as a kind of package management system.
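
A minimal sketch of what such a package-style registry could look like
(hypothetical interface, with an in-memory map standing in for a real
artifact store):

    import scala.collection.mutable

    // Hypothetical "engine package" registry keyed by engine id and version.
    final case class EnginePackage(id: String, version: String, artifactUri: String)

    class InMemoryEngineRegistry {
      private val packages = mutable.Map.empty[(String, String), EnginePackage]

      // What a build step (sbt/maven) would call after producing an artifact.
      def push(pkg: EnginePackage): Unit =
        packages((pkg.id, pkg.version)) = pkg

      // What train/deploy would call, from any machine.
      def resolve(id: String, version: String): Option[EnginePackage] =
        packages.get((id, version))
    }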

As Pat said, a similar approach would let us rely only on the repository
and thereby run pio commands regardless of the machine and location.

Separating the engine part from the rest of PIO could potentially enable
us to come up with different architectures in the future and push us
towards a micro-services ecosystem.

What do you think of separating models from engines in a more visible
way? I mean that engine variants, in terms of algorithm parameters, are
really more like model variants. I see an engine only as code that is a
dependency for application-related models/algorithms. So you would
register an engine, as code, once, and then run training for some
domain-specific data (app) and algorithm parameters, which would result in
a different identifier that would later be used for deployment.

Regards,
Marcin




niedz., 18.09.2016 o 20:02 użytkownik Pat Ferrel <pa...@occamsmachete.com>
napisał:

> This sounds like a good case for Donald’s suggestion.
>
> What I was trying to add to the discussion is a way to make all commands
> rely on state in the megastore, rather than any file on any machine in a
> cluster or on ordering of execution or execution from a location in a
> directory structure. All commands would then be stateless.
>
> This enables real use cases like provisioning PIO machines and running
> `pio deploy <resource-id>` to get a new PredictionServer. Provisioning can
> be container and discovery based rather cleanly.
>
>
> On Sep 17, 2016, at 5:26 PM, Mars Hall <ma...@heroku.com> wrote:
>
> Hello folks,
>
> Great to hear about this possibility. I've been working on running
> PredictionIO on Heroku https://www.heroku.com
>
> Heroku's 12-factor architecture https://12factor.net prefers "stateless
> builds" to ensure that compiled artifacts result in processes which may be
> cheaply restarted, replaced, and scaled via process count & size. I imagine
> this stateless property would be valuable for others as well.
>
> The fact that `pio build` inserts stateful metadata into a database causes
> ripples throughout the lifecycle of PIO engines on Heroku:
>
> * An engine cannot be built for production without the production database
> available. When a production database contains PII (personally identifiable
> information) which has security compliance requirements, the build system
> may not be privileged to access that PII data. This also affects CI
> (continuous integration/testing), where engines would need to be rebuilt in
> production, defeating assurances CI is supposed to provide.
>
> * The build artifacts cannot be reliably reused. "Slugs" at Heroku are
> intended to be stateless, so that you can rollback to a previous version
> during the lifetime of an app. With `pio build` causing database
> side-effects, there's a greater-than-zero probability of slug-to-metadata
> inconsistencies eventually surfacing in a long-running system.
>
>
> From my user-perspective, a few changes to the CLI would fix it:
>
> 1. add a "skip registration" option, `pio build
> --without-engine-registration`
> 2. a new command `pio app register` that could be run separately in the
> built engine (before training)
>
> Alas, I do not know PredictionIO internals, so I can only offer a
> suggestion for how this might be solved.
>
>
> Donald, one specific note,
>
> Regarding "No automatic version matching of PIO binary distribution and
> artifacts version used in the engine template":
>
> The Heroku slug contains the PredictionIO binary distribution used to
> build the engine, so there's never a version matching issue. I guess some
> systems might deploy only the engine artifacts to production where a
> pre-existing PIO binary is available, but that seems like a risky practice
> for long-running systems.
>
>
> Thanks for listening,
>
> *Mars Hall
> Customer Facing Architect
> Salesforce App Cloud / Heroku
> San Francisco, California
>
> > On Sep 16, 2016, at 10:42, Donald Szeto <do...@apache.org> wrote:
> >
> > Hi all,
> >
> > I want to start the discussion of removing engine registration. How many
> people actually take advantage of being able to run pio commands everywhere
> outside of an engine template directory? This will be a nontrivial change
> on the operational side so I want to gauge the potential impact to existing
> users.
> >
> > Pros:
> > - Stateless build. This would work well with many PaaS.
> > - Eliminate the "pio build" command once and for all.
> > - Ability to use your own build system, i.e. Maven, Ant, Gradle, etc.
> > - Potentially better experience with IDE since engine templates no
> longer depends on an SBT plugin.
> >
> > Cons:
> > - Inability to run pio engine training and deployment commands outside
> of engine template directory.
> > - No automatic version matching of PIO binary distribution and artifacts
> version used in the engine template.
> > - A less unified user experience: from pio-build-train-deploy to build,
> then pio-train-deploy.
> >
> > Regards,
> > Donald
>
>
>

Re: Remove engine registration

Posted by Pat Ferrel <pa...@occamsmachete.com>.
This sounds like a good case for Donald’s suggestion. 

What I was trying to add to the discussion is a way to make all commands rely on state in the metastore, rather than on any file on any machine in a cluster, on ordering of execution, or on execution from a particular location in a directory structure. All commands would then be stateless.

This enables real use cases like provisioning PIO machines and running `pio deploy <resource-id>` to get a new PredictionServer. Provisioning can be container- and discovery-based rather cleanly.
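
A tiny sketch of that deploy path (hypothetical names; a plain Map standing in for the metadata store):

    // Illustration: deploy needs nothing but the metadata store and a resource id.
    final case class DeployableInstance(resourceId: String, variantJson: String,
                                        binaryUri: String, modelUri: String)

    object StatelessDeploy {
      // Stand-in for the shared metadata store.
      private val metastore: Map[String, DeployableInstance] = Map(
        "rec-1" -> DeployableInstance("rec-1", "{}", "hdfs:///engines/rec.jar",
                                      "hdfs:///models/rec-latest")
      )

      // `pio deploy <resource-id>` would reduce to a lookup like this,
      // runnable from any provisioned machine.
      def deploy(resourceId: String): Either[String, DeployableInstance] =
        metastore.get(resourceId).toRight(s"unknown resource id: $resourceId")
    }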


On Sep 17, 2016, at 5:26 PM, Mars Hall <ma...@heroku.com> wrote:

Hello folks,

Great to hear about this possibility. I've been working on running PredictionIO on Heroku https://www.heroku.com

Heroku's 12-factor architecture https://12factor.net prefers "stateless builds" to ensure that compiled artifacts result in processes which may be cheaply restarted, replaced, and scaled via process count & size. I imagine this stateless property would be valuable for others as well.

The fact that `pio build` inserts stateful metadata into a database causes ripples throughout the lifecycle of PIO engines on Heroku:

* An engine cannot be built for production without the production database available. When a production database contains PII (personally identifiable information) which has security compliance requirements, the build system may not be privileged to access that PII data. This also affects CI (continuous integration/testing), where engines would need to be rebuilt in production, defeating assurances CI is supposed to provide.

* The build artifacts cannot be reliably reused. "Slugs" at Heroku are intended to be stateless, so that you can rollback to a previous version during the lifetime of an app. With `pio build` causing database side-effects, there's a greater-than-zero probability of slug-to-metadata inconsistencies eventually surfacing in a long-running system.


From my user-perspective, a few changes to the CLI would fix it:

1. add a "skip registration" option, `pio build --without-engine-registration`
2. a new command `pio app register` that could be run separately in the built engine (before training)

Alas, I do not know PredictionIO internals, so I can only offer a suggestion for how this might be solved.


Donald, one specific note,

Regarding "No automatic version matching of PIO binary distribution and artifacts version used in the engine template":

The Heroku slug contains the PredictionIO binary distribution used to build the engine, so there's never a version matching issue. I guess some systems might deploy only the engine artifacts to production where a pre-existing PIO binary is available, but that seems like a risky practice for long-running systems.


Thanks for listening,

*Mars Hall
Customer Facing Architect
Salesforce App Cloud / Heroku
San Francisco, California

> On Sep 16, 2016, at 10:42, Donald Szeto <do...@apache.org> wrote:
> 
> Hi all,
> 
> I want to start the discussion of removing engine registration. How many people actually take advantage of being able to run pio commands everywhere outside of an engine template directory? This will be a nontrivial change on the operational side so I want to gauge the potential impact to existing users.
> 
> Pros:
> - Stateless build. This would work well with many PaaS.
> - Eliminate the "pio build" command once and for all.
> - Ability to use your own build system, i.e. Maven, Ant, Gradle, etc.
> - Potentially better experience with IDE since engine templates no longer depends on an SBT plugin.
> 
> Cons:
> - Inability to run pio engine training and deployment commands outside of engine template directory.
> - No automatic version matching of PIO binary distribution and artifacts version used in the engine template.
> - A less unified user experience: from pio-build-train-deploy to build, then pio-train-deploy.
> 
> Regards,
> Donald



Re: Remove engine registration

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Yes, a new thing, though it might serve some of the same purposes. The idea is to use only engine instance information from the metadata store, so the template commands will work from anywhere and mostly in any order. On a cluster machine, if the engine instance data is in the metastore and the binary exists at the path that was registered for that machine, then `pio deploy <new-kind-of-instance-id>` would work without any other part of the workflow. Also, `pio train <new-kind-of-instance-id>` would work from any cluster machine with no need for a special folder layout or manifest.json.

Sorry to overload the term, but though this new type of engine instance would have much of the same info, it would also have to contain the path to the binary and maybe other things.


On Sep 16, 2016, at 7:51 PM, Kenneth Chan <ke...@apache.org> wrote:

Pat, would you explain more about the 'instanceId' as in
`pio register --variant path/to/some-engine.json --instanceId some-REST-compatible-resource-id`  ?

Currently PIO also has a concept of engineInstanceId, which is output of train. I think you are referring to different thing, right?

Kenneth


On Fri, Sep 16, 2016 at 12:58 PM, Pat Ferrel <pat@occamsmachete.com <ma...@occamsmachete.com>> wrote:
This is a great discussion topic and a great idea.

However the cons must also be addressed, we will need to do this before multi-tenant deploys can happen and the benefits are just as large as removing `pio build`

It would be great to get rid of manifest.json and put all metadata in the store with an externally visible id so all parts of the workflow on all machines will get the right metadata and any template specific commands will run from anywhere on any cluster machine and in any order. All we need is a global engine-instance id. This will make engine-instances behave more like datasets, which are given permanent ids with `pio app new …` This might be a new form of `pio register` and it implies a new optional param to pio template specific commands (the instance id) but removes a lot of misunderstandings people have and easy mistakes in workflow.

So workflow would be:
1) build with SBT/mvn
2) register any time engine.json changes so make the json file an optional param to `pio register --variant path/to/some-engine.json --instanceId some-REST-compatible-resource-id` the instance could also be auto-generated and output or optionally in the engine.json. `pio engine list` lists registered instances with instanceId. The path to the binary would be put in the instanceId and would be expected to be the same on all cluster machines that need it.
3) `pio train --instanceId` optional if it’s in engine.json
4) `pio deploy --instanceId` optional if it’s in engine.json
5) with easily recognized exceptions all the above can happen in any order on any cluster machine and from any directory.

This takes one big step to multi-tenancy since the instance data has an externally visible id—call it a REST resource id…

I bring this up not to confuse the issue but because if we change the workflow commands we should avoid doing it often because of the disruption it brings.


On Sep 16, 2016, at 10:42 AM, Donald Szeto <donald@apache.org <ma...@apache.org>> wrote:

Hi all,

I want to start the discussion of removing engine registration. How many people actually take advantage of being able to run pio commands everywhere outside of an engine template directory? This will be a nontrivial change on the operational side so I want to gauge the potential impact to existing users.

Pros:
- Stateless build. This would work well with many PaaS.
- Eliminate the "pio build" command once and for all.
- Ability to use your own build system, i.e. Maven, Ant, Gradle, etc.
- Potentially better experience with IDE since engine templates no longer depends on an SBT plugin.

Cons:
- Inability to run pio engine training and deployment commands outside of engine template directory.
- No automatic version matching of PIO binary distribution and artifacts version used in the engine template.
- A less unified user experience: from pio-build-train-deploy to build, then pio-train-deploy.

Regards,
Donald




Re: Remove engine registration

Posted by Kenneth Chan <ke...@apache.org>.
Pat, would you explain more about the 'instanceId' as in
`pio register --variant path/to/some-engine.json --instanceId
some-REST-compatible-resource-id`  ?

Currently PIO also has a concept of engineInstanceId, which is the output
of train. I think you are referring to a different thing, right?

Kenneth


On Fri, Sep 16, 2016 at 12:58 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> This is a great discussion topic and a great idea.
>
> However the cons must also be addressed, we will need to do this before
> multi-tenant deploys can happen and the benefits are just as large as
> removing `pio build`
>
> It would be great to get rid of manifest.json and put all metadata in the
> store with an externally visible id so all parts of the workflow on all
> machines will get the right metadata and any template specific commands
> will run from anywhere on any cluster machine and in any order. All we need
> is a global engine-instance id. This will make engine-instances behave more
> like datasets, which are given permanent ids with `pio app new …` This
> might be a new form of `pio register` and it implies a new optional param
> to pio template specific commands (the instance id) but removes a lot of
> misunderstandings people have and easy mistakes in workflow.
>
> So workflow would be:
> 1) build with SBT/mvn
> 2) register any time engine.json changes so make the json file an optional
> param to `pio register --variant path/to/some-engine.json --instanceId
> some-REST-compatible-resource-id` the instance could also be
> auto-generated and output or optionally in the engine.json. `pio engine
> list` lists registered instances with instanceId. The path to the binary
> would be put in the instanceId and would be expected to be the same on all
> cluster machines that need it.
> 3) `pio train --instanceId` optional if it’s in engine.json
> 4) `pio deploy --instanceId` optional if it’s in engine.json
> 5) with easily recognized exceptions all the above can happen in any order
> on any cluster machine and from any directory.
>
> This takes one big step to multi-tenancy since the instance data has an
> externally visible id—call it a REST resource id…
>
> I bring this up not to confuse the issue but because if we change the
> workflow commands we should avoid doing it often because of the disruption
> it brings.
>
>
> On Sep 16, 2016, at 10:42 AM, Donald Szeto <do...@apache.org> wrote:
>
> Hi all,
>
> I want to start the discussion of removing engine registration. How many
> people actually take advantage of being able to run pio commands everywhere
> outside of an engine template directory? This will be a nontrivial change
> on the operational side so I want to gauge the potential impact to existing
> users.
>
> Pros:
> - Stateless build. This would work well with many PaaS.
> - Eliminate the "pio build" command once and for all.
> - Ability to use your own build system, i.e. Maven, Ant, Gradle, etc.
> - Potentially better experience with IDE since engine templates no longer
> depends on an SBT plugin.
>
> Cons:
> - Inability to run pio engine training and deployment commands outside of
> engine template directory.
> - No automatic version matching of PIO binary distribution and artifacts
> version used in the engine template.
> - A less unified user experience: from pio-build-train-deploy to build,
> then pio-train-deploy.
>
> Regards,
> Donald
>
>

Re: Remove engine registration

Posted by Pat Ferrel <pa...@occamsmachete.com>.
This is a great discussion topic and a great idea.

However, the cons must also be addressed. We will need to do this before multi-tenant deploys can happen, and the benefits are just as large as removing `pio build`.

It would be great to get rid of manifest.json and put all metadata in the store with an externally visible id, so that all parts of the workflow on all machines get the right metadata and any template-specific command can run from anywhere on any cluster machine, in any order. All we need is a global engine-instance id. This would make engine instances behave more like datasets, which are given permanent ids with `pio app new …`. This might be a new form of `pio register`, and it implies a new optional param to template-specific pio commands (the instance id), but it removes a lot of misunderstandings people have and easy mistakes in the workflow.

So workflow would be:
1) build with SBT/mvn
2) register any time engine.json changes, so make the json file an optional param to `pio register --variant path/to/some-engine.json --instanceId some-REST-compatible-resource-id`; the instance id could also be auto-generated and output, or optionally specified in the engine.json. `pio engine list` lists registered instances with their instanceId. The path to the binary would be stored with the instanceId and would be expected to be the same on all cluster machines that need it.
3) `pio train --instanceId` optional if it’s in engine.json
4) `pio deploy --instanceId` optional if it’s in engine.json
5) with easily recognized exceptions all the above can happen in any order on any cluster machine and from any directory.
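
A rough command-line sketch of steps 1)–4) follows. Everything in it is hypothetical: the --instanceId flag, the `pio engine list` subcommand, and the id value come from the proposal above, not from the current CLI.

  # 1) build with your own tool; no `pio build` step
  sbt package

  # 2) register the engine variant under an explicit, REST-compatible id
  #    (proposed; the id could also be auto-generated or read from engine.json)
  pio register --variant path/to/some-engine.json --instanceId my-recommender-001

  #    list registered instances and their ids (proposed subcommand)
  pio engine list

  # 3) and 4) train and deploy by id, from any directory on any cluster machine
  pio train --instanceId my-recommender-001
  pio deploy --instanceId my-recommender-001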

This takes a big step toward multi-tenancy, since the instance data has an externally visible id (call it a REST resource id).

I bring this up not to confuse the issue, but because if we change the workflow commands we should avoid doing so often, given the disruption it brings.


On Sep 16, 2016, at 10:42 AM, Donald Szeto <do...@apache.org> wrote:

Hi all,

I want to start the discussion of removing engine registration. How many people actually take advantage of being able to run pio commands everywhere outside of an engine template directory? This will be a nontrivial change on the operational side so I want to gauge the potential impact to existing users.

Pros:
- Stateless build. This would work well with many PaaS.
- Eliminate the "pio build" command once and for all.
- Ability to use your own build system, i.e. Maven, Ant, Gradle, etc.
- Potentially better experience with IDE since engine templates no longer depends on an SBT plugin.

Cons:
- Inability to run pio engine training and deployment commands outside of engine template directory.
- No automatic version matching of PIO binary distribution and artifacts version used in the engine template.
- A less unified user experience: from pio-build-train-deploy to build, then pio-train-deploy.

Regards,
Donald


Re: Remove engine registration

Posted by Mars Hall <ma...@heroku.com>.
Hello folks,

Great to hear about this possibility. I've been working on running PredictionIO on Heroku (https://www.heroku.com).

Heroku's 12-factor architecture (https://12factor.net) prefers "stateless builds" to ensure that compiled artifacts result in processes which may be cheaply restarted, replaced, and scaled via process count & size. I imagine this stateless property would be valuable for others as well.

The fact that `pio build` inserts stateful metadata into a database causes ripples throughout the lifecycle of PIO engines on Heroku:

* An engine cannot be built for production without the production database available. When a production database contains PII (personally identifiable information) subject to security compliance requirements, the build system may not be privileged to access that data. This also affects CI (continuous integration/testing), where engines would need to be rebuilt in production, defeating the assurances CI is supposed to provide.

* The build artifacts cannot be reliably reused. "Slugs" at Heroku are intended to be stateless, so that you can roll back to a previous version during the lifetime of an app. With `pio build` causing database side-effects, there's a greater-than-zero probability of slug-to-metadata inconsistencies eventually surfacing in a long-running system.


From my user perspective, a few changes to the CLI would fix it (a rough sketch of the resulting flow follows below):

1. add a "skip registration" option, `pio build --without-engine-registration`
2. a new command `pio app register` that could be run separately in the built engine (before training)

Alas, I do not know PredictionIO internals, so I can only offer a suggestion for how this might be solved.
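
To sketch the suggestion anyway, the split could look roughly like this. Both the --without-engine-registration flag and the `pio app register` command are hypothetical, taken from the two points above rather than from the current CLI:

  # build stage: compile only; no database access, no registration side effects
  pio build --without-engine-registration   # hypothetical flag

  # release/run stage: register the already-built engine against the production
  # metadata store, then train and deploy as usual
  pio app register                          # hypothetical command
  pio train
  pio deploy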


Donald, one specific note,

Regarding "No automatic version matching of PIO binary distribution and artifacts version used in the engine template":

The Heroku slug contains the PredictionIO binary distribution used to build the engine, so there's never a version matching issue. I guess some systems might deploy only the engine artifacts to production where a pre-existing PIO binary is available, but that seems like a risky practice for long-running systems.
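
For setups that do ship only the engine artifacts onto a host with a pre-existing PIO install, a simple guard can make the missing automatic check explicit. The sketch below is purely illustrative and assumes both version strings are recorded by your own build and install scripts; it does not rely on any particular `pio` output:

  # illustrative pre-flight check, run before `pio train` / `pio deploy`
  EXPECTED_PIO_VERSION="0.10.0-incubating"   # example: version the engine was built against
  INSTALLED_PIO_VERSION="${INSTALLED_PIO_VERSION:?record this when installing the PIO distribution}"

  if [ "$EXPECTED_PIO_VERSION" != "$INSTALLED_PIO_VERSION" ]; then
    echo "Version mismatch: engine built against $EXPECTED_PIO_VERSION, host has $INSTALLED_PIO_VERSION" >&2
    exit 1
  fi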


Thanks for listening,

Mars Hall
Customer Facing Architect
Salesforce App Cloud / Heroku
San Francisco, California

> On Sep 16, 2016, at 10:42, Donald Szeto <do...@apache.org> wrote:
> 
> Hi all,
> 
> I want to start the discussion of removing engine registration. How many people actually take advantage of being able to run pio commands everywhere outside of an engine template directory? This will be a nontrivial change on the operational side so I want to gauge the potential impact to existing users.
> 
> Pros:
> - Stateless build. This would work well with many PaaS.
> - Eliminate the "pio build" command once and for all.
> - Ability to use your own build system, i.e. Maven, Ant, Gradle, etc.
> - Potentially better experience with IDE since engine templates no longer depends on an SBT plugin.
> 
> Cons:
> - Inability to run pio engine training and deployment commands outside of engine template directory.
> - No automatic version matching of PIO binary distribution and artifacts version used in the engine template.
> - A less unified user experience: from pio-build-train-deploy to build, then pio-train-deploy.
> 
> Regards,
> Donald

