Posted to dev@predictionio.apache.org by Donald Szeto <do...@apache.org> on 2016/12/05 18:00:56 UTC

Stateless Builds

Hi all,

I am moving the discussion of stateless build (
https://github.com/apache/incubator-predictionio/pull/328) here. Replying
to Pat:

> BTW @chanlee514 @dszeto Are we thinking of a new command, something like
pio register that would add metadata to the metastore? This would need to
be run every time the engine.json changed, for instance? Would it also not
compile? Is there an alternative? What state does this leave us in?

I imagine we would need pio register after this, something like what docker
push does for you today. Changes to engine.json will not require
registration because it is consumed at runtime by pio train and pio deploy.
We are phasing out pio build so that engine templates will be friendlier to
different IDEs.

> After the push, what action creates the binary (I assume pio build) and
what action adds metadata to the metastore (I assume pio train)? So does
this require that they run on the same machine? They often do not.
pio build will still create the binary at this point (and will hopefully be
phased out as mentioned). Right now the only metadata that is disappearing
is the engine manifest. Engine instances will still be written after pio
train, and used by pio deploy.

> One more question. After push how do we run the PredictionServer or train
on multiple machines? In the past this required copying the manifest.json
and making sure binaries are in the same location on all machines.
"In the same location" is actually a downside IMO of the manifest.json
design. Without manifest.json now, you would need to run pio commands from
a location with a built engine, because instead of looking at engine
manifests, it will now look locally for engine JARs. So deployment would
still involve copying engine JARs to a remote deployment machine, running
pio commands at the engine template location with engine-id and
engine-version arguments.
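
(For illustration, a rough sketch of that deployment flow; the engine name,
paths, and the exact spellings of the engine-id and engine-version options
are assumptions, not confirmed CLI flags:)

  # on the build machine, inside the engine template directory
  pio build                    # still produces the engine JAR under target/ for now
  scp -r target/ engine.json deploy-host:/opt/my-engine/

  # on the deployment machine, from the copied engine template location
  cd /opt/my-engine
  pio train  --engine-id com.example.MyEngine --engine-version 0.1.0
  pio deploy --engine-id com.example.MyEngine --engine-version 0.1.0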

Regards,
Donald

Re: Stateless Builds

Posted by Pat Ferrel <pa...@occamsmachete.com>.
So, to be clear, no objections to the PR. I’m using it to talk about something that may be off topic for the PR, and apologies if that is confusing the issue.


On Jan 10, 2017, at 12:52 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

Great, I’m really glad you are picking this up again. This is an important step!

TL;DR: To sum up the issue I bring up: if we set the engine-instance id explicitly and consume it in every other workflow/Template command, we step towards an ideal. If the id is created without users knowing, it is like today, and that is not bad; it’s just that to get to the ideal we’ll have to change it again. See my answers below in your email.

At minimum, to support existing users while simplifying things we need to:
preserve the ability to run multiple PredictionServers on multiple machines (your proposed solution would do that, I believe).
use `sbt build` for compiling Templates, and not use that step to also modify metadata. BTW, if `pio build` doesn’t change metadata, what does?

My long term goal is to support the following workflow, which I have working in a branch, though it is not ideal either (a rough sketch follows the list):
Deploy both PredictionServer and EventServer before any data is input or any model is created.
Then datasets and engine instances are created empty, just to attach an ID to them. We can do this today for datasets/apps but not for engine-instances. My branch does this for engine-instances too, via a mode added to `pio deploy` that allows setting an ID on an empty engine-instance.
Input goes to the dataset (app in today’s nomenclature).
Then models are created by processing the dataset with an engine instance (pio train in today’s workflow).
Then the model is added to the already running PredictionServer (not possible today, but working in my branch for a subset of Templates as another mod to `pio deploy`).
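
(The sketch, assuming a user-settable engine-instance id; the --engine-instance-id mode on an empty instance exists only in my branch, and the flag spellings here are assumptions:)

  pio eventserver &                        # EventServer up before any data exists
  pio deploy --engine-instance-id acme &   # empty engine-instance, PredictionServer already running
  # input flows into the dataset (app) through the EventServer REST API
  curl -X POST "http://localhost:7070/events.json?accessKey=..." \
       -H "Content-Type: application/json" \
       -d '{"event":"buy","entityType":"user","entityId":"u1","targetEntityType":"item","targetEntityId":"i1"}'
  pio train --engine-instance-id acme      # build a model from the dataset
  # the running PredictionServer picks up the new model (the other mod to pio deploy)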
We need to be practical and keep talk of ideals separate from next-release talk, and I think the engine-instance ID setting is the key point here that should allow us to make a practical decision for the next release.

To sum up the ideal: I’d like to see PIO become more like deploying Hadoop, Spark, or a database. All servers are up and running all the time; only the data inside changes. When updates happen, queries or jobs operate on the new data. This hugely simplifies the process of machine learning into a tried-and-true pattern that people already understand and use every day. Today PIO is a long way toward this goal, but there is further to go.


> On Jan 10, 2017, at 10:40 AM, Chan Lee <ch...@gmail.com> wrote:
> 
> I'm making additional changes to the stateless build(
> https://issues.apache.org/jira/browse/PIO-47), particularly to ensure that
> the change does not affect running production servers using PIO.
> 
> First, with regards to @Pat's concerns: You'd still be able to run training
> on a separate (possibly ephemeral) machine and use the trained ModelData by
> providing engine-instance-id as CLI argument. PIO would automatically look
> for the latest trained model with the same filepath (engineVersion). I will
> add some tests to ensure that this process works.
> 
> Also @Pat, I'd appreciate it if you could briefly explain the setup and
> common tasks for your production servers to make sure I don't miss anything.
> Some immediate questions:
> 1. When you copy the engine directory on multiple machines, do you include
> the .jar files (autogenerated during `pio build` in target/) for all the
> machines?

yes

> 2. Are there any immediate needs/tasks running pio commands outside of the
> engine directory?

No, not for the engine-specific commands; obviously some commands already work outside the engine directory.

> 
> Thanks,
> Chan
> 
> On Sun, Dec 11, 2016 at 1:35 PM, Mars Hall <ma...@heroku.com> wrote:
> 
>> Pat,
>> 
>> Since I represent one of the containerized platforms (Heroku) that needs
>> stateless builds (specifically the ability to run a build without a
>> database attached) to deploy PIO at its full potential, I would love to be
>> able to contribute more to this discussion. Unfortunately I do not
>> understand most of the technicalities described here.
>> 
>> How does someone like me learn about this aspect of PIO? Would you wise
>> folks step back and describe what the metadata does today?
>> 
>> Are the three PIO_STORAGE types (data, model, & metadata) documented
>> clearly anywhere? What are the metadata options, what does it store, & how
>> does it affect the engine lifecycle today?
>> 
>> One of the main sources of confusion I've seen from people trying to work
>> with PIO is the multiple storage aspects (data, model, & metadata) combined
>> with multiple service types (eventserver & engine) and how these all
>> interconnect. Plus some engines have a requirement like Elasticsearch, but
>> it's not clear where that's required in the grand scheme.
>> 
>> Thanks for all the efforts to move forward with these changes,
>> 
>> *Mars
>> 
>> On Mon, Dec 12, 2016 at 03:13 Pat Ferrel <pa...@occamsmachete.com> wrote:
>> 
>>> OK, so that was too much detail. My immediate question is how to train on
>>> one machine and deploy on several others—all referencing the same instance
>>> data (model)? Before it was by copying the manifest, now there is no
>>> manifest.
>>> 
>>> 
>>> 
>>> On Dec 7, 2016, at 5:43 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>> 
>>> My first question is how to train on an ephemeral machine to swap models
>>> into an already deployed prediction server, because this is what i do all
>>> the time. The only way to do this now is train first on dummy data then
>>> deploy and re-train as data comes in, but there are other issues and
>>> questions below. Some may be slightly off topic of this specific PR.
>>> 
>>> 
>>>> On Dec 5, 2016, at 10:00 AM, Donald Szeto <do...@apache.org> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> I am moving the discussion of stateless build (
>>>> https://github.com/apache/incubator-predictionio/pull/328) here.
>>> Replying
>>>> Pat:
>>>> 
>>>>> BTW @chanlee514 @dszeto Are we thinking of a new command, something
>>> like
>>>> pio register that would add metadata to the metastore? This would need
>>> to
>>>> be run every time the engine.json changed for instance? It would also do
>>>> not compile? Is there an alternative? What state does this leave us in?
>>>> 
>>>> I imagine we would need pio register after this. Something like what
>>> docker
>>>> push would do for you today. Changes of engine.json will not require
>>>> registration because it is consumed during runtime by pio train and pio
>>>> deploy. We are phasing out pio build so that engine templates will be
>>> more
>>>> friendly with different IDEs.
>>> 
>>> I’m all for removing the manifest and stateless build but I’m not sure we
>>> mean the same thing by stateless. My issue is more with stateless commands,
>>> or put differently as a fully flexible workflow. Which means all commands
>>> read metadata from the metastore, and only one, very explicitly sets
>>> metadata into the metastore. Doing the write in train doesn't consider the
>>> deploy-before-train and multi-tenancy use cases.
>>> 
>>> Deploy then train:
>>> 1) pio eventserver to start ES on any machine
>>> 2) pio deploy to get the query server (prediction server) on any machine
>>> 3) pio train at any time on any machine and have a mechanism for deployed
>>> engines to discover the metadata they need when they need it or have it
>>> automatically updated when changed (pick a method push for deployed engines
>>> and pull for train)
>>> 4) send input at any time
>>> 
>>> Multi-tenancy:
>>> This seems to imply a user visible id for an engine instance id in
>>> today’s nomenclature. For multi-tenancy, the user is going to want to set
>>> this instance id somewhere and should have stateless commands, not only
>>> stateless build.
>>> 
>>>> 
>>>>> After the push, what action create binary (I assume pio build) what
>>>> action adds metadata to the metastore (I assume pio train) So does this
>>>> require they run on the same machine? They often do not.
>>>> pio build will still create the binary at this point (and hopefully
>>> phased
>>>> out as mentioned). Right now the only metadata that is disappearing are
>>>> engine manifests. Engine instances will still be written after pio
>>> train,
>>>> and used by pio deploy.
>>>> 
>>>>> One more question. After push how do we run the PredictionServer or
>>> train
>>>> on multiple machines? In the past this required copying the
>>> manifest.json
>>>> and making sure binaries are in the same location on all machines.
>>>> "In the same location" is actually a downside IMO of the manifest.json
>>>> design. Without manifest.json now, you would need to run pio commands
>>> from
>>>> a location with a built engine, because instead of looking at engine
>>>> manifests, it will now look locally for engine JARs. So deployment would
>>>> still involve copying engine JARs to a remote deployment machine,
>>> running
>>>> pio commands at the engine template location with engine-id and
>>>> engine-version arguments.
>>> 
>>> I guess I also don't understand the need for engine-id and
>>> engine-version. Let’s do away with them. There is one metadata object that
>>> points to input data id, params, model id, and binary. This id can be
>>> assigned by the user.
>>> 
>>> With the above in place we are ready to imagine an EventServer where you
>>> POST to pio-ip/dataset/resource-id (no keys) and GET from
>>> pio-ip/model/resource-id to do queries. This would allow multi-tenancy and
>>> merge the EventServer and PredictionServer under the well understood banner
>>> of REST. Extending this a little further we have all the commands
>>> implemented as REST APIs. The CLI becomes some simple scripts or binaries
>>> that hit the REST interface and an admin server that hits the same
>>> interface.
>>> 
>>> This is compatible with the simple stateless build as a first step as
>>> long as we don’t perpetuate hidden instance ids and stateful commands like
>>> a train that creates the hidden id. But maybe I misunderstand the code or
>>> plans for next steps?
>>> 
>>> 
>>>> 
>>>> Regards,
>>>> Donald


Re: Stateless Builds

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Great, I’m really glad you are picking this up again. This is an important step!

TL;DR: To sum up the issue I bring up: if we set the engine-instance id explicitly and consume it in every other workflow/Template command, we step towards an ideal. If the id is created without users knowing, it is like today, and that is not bad; it’s just that to get to the ideal we’ll have to change it again. See my answers below in your email.

At minimum, to support existing users while simplifying things we need to:
preserve the ability to run multiple PredictionServers on multiple machines (your proposed solution would do that, I believe).
use `sbt build` for compiling Templates, and not use that step to also modify metadata. BTW, if `pio build` doesn’t change metadata, what does?

My long term goal is to support the following workflow, which I have working in a branch, though it is not ideal either:
Deploy both PredictionServer and EventServer before any data is input or any model is created.
Then datasets and engine instances are created empty, just to attach an ID to them. We can do this today for datasets/apps but not for engine-instances. My branch does this for engine-instances too, via a mode added to `pio deploy` that allows setting an ID on an empty engine-instance.
Input goes to the dataset (app in today’s nomenclature).
Then models are created by processing the dataset with an engine instance (pio train in today’s workflow).
Then the model is added to the already running PredictionServer (not possible today, but working in my branch for a subset of Templates as another mod to `pio deploy`).
We need to be practical and keep talk of ideals separate from next-release talk, and I think the engine-instance ID setting is the key point here that should allow us to make a practical decision for the next release.

To sum up the ideal: I’d like to see PIO become more like deploying Hadoop, Spark, or a database. All servers are up and running all the time; only the data inside changes. When updates happen, queries or jobs operate on the new data. This hugely simplifies the process of machine learning into a tried-and-true pattern that people already understand and use every day. Today PIO is a long way toward this goal, but there is further to go.


> On Jan 10, 2017, at 10:40 AM, Chan Lee <ch...@gmail.com> wrote:
> 
> I'm making additional changes to the stateless build(
> https://issues.apache.org/jira/browse/PIO-47), particularly to ensure that
> the change does not affect running production servers using PIO.
> 
> First, with regards to @Pat's concerns: You'd still be able to run training
> on a separate (possibly ephemeral) machine and use the trained ModelData by
> providing engine-instance-id as CLI argument. PIO would automatically look
> for the latest trained model with the same filepath (engineVersion). I will
> add some tests to ensure that this process works.
> 
> Also @Pat, I'd appreciate it if you could briefly explain the setup and
> common tasks for your production servers to make sure I don't miss anything.
> Some immediate questions:
> 1. When you copy the engine directory on multiple machines, do you include
> the .jar files (autogenerated during `pio build` in target/) for all the
> machines?

yes

> 2. Are there any immediate needs/tasks running pio commands outside of the
> engine directory?

No, not for the engine-specific commands; obviously some commands already work outside the engine directory.

> 
> Thanks,
> Chan
> 
> On Sun, Dec 11, 2016 at 1:35 PM, Mars Hall <ma...@heroku.com> wrote:
> 
>> Pat,
>> 
>> Since I represent one of the containerized platforms (Heroku) that needs
>> stateless builds (specifically the ability to run a build without a
>> database attached) to deploy PIO at its full potential, I would love to be
>> able to contribute more to this discussion. Unfortunately I do not
>> understand most of the technicalities described here.
>> 
>> How does someone like me learn about this aspect of PIO? Would you wise
>> folks step back and describe what the metadata does today?
>> 
>> Are the three PIO_STORAGE types (data, model, & metadata) documented
>> clearly anywhere? What are the metadata options, what does it store, & how
>> does it affect the engine lifecycle today?
>> 
>> One of the main sources of confusion I've seen from people trying to work
>> with PIO is the multiple storage aspects (data, model, & metadata) combined
>> with multiple service types (eventserver & engine) and how these all
>> interconnect. Plus some engines have a requirement like Elasticsearch, but
>> it's not clear where that's required in the grand scheme.
>> 
>> Thanks for all the efforts to move forward with these changes,
>> 
>> *Mars
>> 
>> On Mon, Dec 12, 2016 at 03:13 Pat Ferrel <pa...@occamsmachete.com> wrote:
>> 
>>> OK, so that was too much detail. My immediate question is how to train on
>>> one machine and deploy on several others—all referencing the same instance
>>> data (model)? Before it was by copying the manifest, now there is no
>>> manifest.
>>> 
>>> 
>>> 
>>> On Dec 7, 2016, at 5:43 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>> 
>>> My first question is how to train on an ephemeral machine to swap models
>>> into an already deployed prediction server, because this is what i do all
>>> the time. The only way to do this now is train first on dummy data then
>>> deploy and re-train as data comes in, but there are other issues and
>>> questions below. Some may be slightly off topic of this specific PR.
>>> 
>>> 
>>>> On Dec 5, 2016, at 10:00 AM, Donald Szeto <do...@apache.org> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> I am moving the discussion of stateless build (
>>>> https://github.com/apache/incubator-predictionio/pull/328) here.
>>> Replying
>>>> Pat:
>>>> 
>>>>> BTW @chanlee514 @dszeto Are we thinking of a new command, something
>>> like
>>>> pio register that would add metadata to the metastore? This would need
>>> to
>>>> be run every time the engine.json changed for instance? It would also do
>>>> not compile? Is there an alternative? What state does this leave us in?
>>>> 
>>>> I imagine we would need pio register after this. Something like what
>>> docker
>>>> push would do for you today. Changes of engine.json will not require
>>>> registration because it is consumed during runtime by pio train and pio
>>>> deploy. We are phasing out pio build so that engine templates will be
>>> more
>>>> friendly with different IDEs.
>>> 
>>> I’m all for removing the manifest and stateless build but I’m not sure we
>>> mean the same thing by stateless. My issue is more with stateless commands,
>>> or put differently as a fully flexible workflow. Which means all commands
>>> read metadata from the metastore, and only one, very explicitly sets
>>> metadata into the metastore. Doing the write in train doesn't consider the
>>> deploy-before-train and multi-tenancy use cases.
>>> 
>>> Deploy then train:
>>> 1) pio eventserver to start ES on any machine
>>> 2) pio deploy to get the query server (prediction server) on any machine
>>> 3) pio train at any time on any machine and have a mechanism for deployed
>>> engines to discover the metadata they need when they need it or have it
>>> automatically updated when changed (pick a method push for deployed engines
>>> and pull for train)
>>> 4) send input at any time
>>> 
>>> Multi-tenancy:
>>> This seems to imply a user visible id for an engine instance id in
>>> today’s nomenclature. For multi-tenancy, the user is going to want to set
>>> this instance id somewhere and should have stateless commands, not only
>>> stateless build.
>>> 
>>>> 
>>>>> After the push, what action create binary (I assume pio build) what
>>>> action adds metadata to the metastore (I assume pio train) So does this
>>>> require they run on the same machine? They often do not.
>>>> pio build will still create the binary at this point (and hopefully
>>> phased
>>>> out as mentioned). Right now the only metadata that is disappearing are
>>>> engine manifests. Engine instances will still be written after pio
>>> train,
>>>> and used by pio deploy.
>>>> 
>>>>> One more question. After push how do we run the PredictionServer or
>>> train
>>>> on multiple machines? In the past this required copying the
>>> manifest.json
>>>> and making sure binaries are in the same location on all machines.
>>>> "In the same location" is actually a downside IMO of the manifest.json
>>>> design. Without manifest.json now, you would need to run pio commands
>>> from
>>>> a location with a built engine, because instead of looking at engine
>>>> manifests, it will now look locally for engine JARs. So deployment would
>>>> still involve copying engine JARs to a remote deployment machine,
>>> running
>>>> pio commands at the engine template location with engine-id and
>>>> engine-version arguments.
>>> 
>>> I guess I also don't understand the need for engine-id and
>>> engine-version. Let’s do away with them. There is one metadata object that
>>> points to input data id, params, model id, and binary. This id can be
>>> assigned by the user.
>>> 
>>> With the above in place we are ready to imagine an EventServer where you
>>> POST to pio-ip/dataset/resource-id (no keys) and GET from
>>> pio-ip/model/resource-id to do queries. This would allow multi-tenancy and
>>> merge the EventServer and PredictionServer under the well understood banner
>>> of REST. Extending this a little further we have all the commands
>>> implemented as REST APIs. The CLI becomes some simple scripts or binaries
>>> that hit the REST interface and an admin server that hits the same
>>> interface.
>>> 
>>> This is compatible with the simple stateless build as a first step as
>>> long as we don’t perpetuate hidden instance ids and stateful commands like
>>> a train that creates the hidden id. But maybe I misunderstand the code or
>>> plans for next steps?
>>> 
>>> 
>>>> 
>>>> Regards,
>>>> Donald
>>> 
>>> 
> 

Re: Stateless Builds

Posted by Chan Lee <ch...@gmail.com>.
I'm making additional changes to the stateless build (
https://issues.apache.org/jira/browse/PIO-47), particularly to ensure that
the change does not affect running production servers using PIO.

First, with regard to @Pat's concerns: you'd still be able to run training
on a separate (possibly ephemeral) machine and use the trained ModelData by
providing engine-instance-id as a CLI argument. PIO would automatically
look for the latest trained model with the same filepath (engineVersion). I
will add some tests to ensure that this process works.
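
(To illustrate the intent, a hypothetical train-here/deploy-there sequence; the flag spelling, paths, and the wording of the train output are assumptions:)

  # machine A: ephemeral trainer, engine directory copied over including target/
  cd /opt/my-engine && pio train
  # pio train reports the id of the engine instance it saved (wording approximate)

  # machine B: long-running PredictionServer, same engine directory contents
  cd /opt/my-engine && pio deploy --engine-instance-id <id-from-machine-A> --port 8000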

Also @Pat, I'd appreciate it if you could briefly explain the setup and
common tasks for your production servers, to make sure I don't miss
anything. Some immediate questions:
1. When you copy the engine directory to multiple machines, do you include
the .jar files (generated during `pio build` into target/) on all the
machines?
2. Are there any immediate needs/tasks that require running pio commands
outside of the engine directory?

Thanks,
Chan

On Sun, Dec 11, 2016 at 1:35 PM, Mars Hall <ma...@heroku.com> wrote:

> Pat,
>
> Since I represent one of the containerized platforms (Heroku) that needs
> stateless builds (specifically the ability to run a build without a
> database attached) to deploy PIO at its full potential, I would love to be
> able to contribute more to this discussion. Unfortunately I do not
> understand most of the technicalities described here.
>
> How does someone like me learn about this aspect of PIO? Would you wise
> folks step back and describe what the metadata does today?
>
> Are the three PIO_STORAGE types (data, model, & metadata) documented
> clearly anywhere? What are the metadata options, what does it store, & how
> does it affect the engine lifecycle today?
>
> One of the main sources of confusion I've seen from people trying to work
> with PIO is the multiple storage aspects (data, model, & metadata) combined
> with multiple service types (eventserver & engine) and how these all
> interconnect. Plus some engines have a requirement like Elasticsearch, but
> it's not clear where that's required in the grand scheme.
>
> Thanks for all the efforts to move forward with these changes,
>
> *Mars
>
> On Mon, Dec 12, 2016 at 03:13 Pat Ferrel <pa...@occamsmachete.com> wrote:
>
>> OK, so that was too much detail. My immediate question is how to train on
>> one machine and deploy on several others—all referencing the same instance
>> data (model)? Before it was by copying the manifest, now there is no
>> manifest.
>>
>>
>>
>> On Dec 7, 2016, at 5:43 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>
>> My first question is how to train on an ephemeral machine to swap models
>> into an already deployed prediction server, because this is what i do all
>> the time. The only way to do this now is train first on dummy data then
>> deploy and re-train as data comes in, but there are other issues and
>> questions below. Some may be slightly off topic of this specific PR.
>>
>>
>> > On Dec 5, 2016, at 10:00 AM, Donald Szeto <do...@apache.org> wrote:
>> >
>> > Hi all,
>> >
>> > I am moving the discussion of stateless build (
>> > https://github.com/apache/incubator-predictionio/pull/328) here.
>> Replying
>> > Pat:
>> >
>> >> BTW @chanlee514 @dszeto Are we thinking of a new command, something
>> like
>> > pio register that would add metadata to the metastore? This would need
>> to
>> > be run every time the engine.json changed for instance? It would also do
>> > not compile? Is there an alternative? What state does this leave us in?
>> >
>> > I imagine we would need pio register after this. Something like what
>> docker
>> > push would do for you today. Changes of engine.json will not require
>> > registration because it is consumed during runtime by pio train and pio
>> > deploy. We are phasing out pio build so that engine templates will be
>> more
>> > friendly with different IDEs.
>>
>> I’m all for removing the manifest and stateless build but I’m not sure we
>> mean the same thing by stateless. My issue is more with stateless commands,
>> or put differently as a fully flexible workflow. Which means all commands
>> read metadata from the metastore, and only one, very explicitly sets
>> metadata into the metastore. Doing the write in train doesn't consider the
>> deploy-before-train and multi-tenancy use cases.
>>
>> Deploy then train:
>> 1) pio eventserver to start ES on any machine
>> 2) pio deploy to get the query server (prediction server) on any machine
>> 3) pio train at any time on any machine and have a mechanism for deployed
>> engines to discover the metadata they need when they need it or have it
>> automatically updated when changed (pick a method push for deployed engines
>> and pull for train)
>> 4) send input at any time
>>
>> Multi-tenancy:
>> This seems to imply a user visible id for an engine instance id in
>> today’s nomenclature. For multi-tenancy, the user is going to want to set
>> this instance id somewhere and should have stateless commands, not only
>> stateless build.
>>
>> >
>> >> After the push, what action create binary (I assume pio build) what
>> > action adds metadata to the metastore (I assume pio train) So does this
>> > require they run on the same machine? They often do not.
>> > pio build will still create the binary at this point (and hopefully
>> phased
>> > out as mentioned). Right now the only metadata that is disappearing are
>> > engine manifests. Engine instances will still be written after pio
>> train,
>> > and used by pio deploy.
>> >
>> >> One more question. After push how do we run the PredictionServer or
>> train
>> > on multiple machines? In the past this required copying the
>> manifest.json
>> > and making sure binaries are in the same location on all machines.
>> > "In the same location" is actually a downside IMO of the manifest.json
>> > design. Without manifest.json now, you would need to run pio commands
>> from
>> > a location with a built engine, because instead of looking at engine
>> > manifests, it will now look locally for engine JARs. So deployment would
>> > still involve copying engine JARs to a remote deployment machine,
>> running
>> > pio commands at the engine template location with engine-id and
>> > engine-version arguments.
>>
>> I guess I also don't understand the need for engine-id and
>> engine-version. Let’s do away with them. There is one metadata object that
>> points to input data id, params, model id, and binary. This id can be
>> assigned by the user.
>>
>> With the above in place we are ready to imagine an EventServer where you
>> POST to pio-ip/dataset/resource-id (no keys) and GET from
>> pio-ip/model/resource-id to do queries. This would allow multi-tenancy and
>> merge the EventServer and PredictionServer under the well understood banner
>> of REST. Extending this a little further we have all the commands
>> implemented as REST APIs. The CLI becomes some simple scripts or binaries
>> that hit the REST interface and an admin server that hits the same
>> interface.
>>
>> This is compatible with the simple stateless build as a first step as
>> long as we don’t perpetuate hidden instance ids and stateful commands like
>> a train that creates the hidden id. But maybe I misunderstand the code or
>> plans for next steps?
>>
>>
>> >
>> > Regards,
>> > Donald
>>
>>

Re: Stateless Builds

Posted by Mars Hall <ma...@heroku.com>.
Pat,

Since I represent one of the containerized platforms (Heroku) that needs
stateless builds (specifically the ability to run a build without a
database attached) to deploy PIO at its full potential, I would love to be
able to contribute more to this discussion. Unfortunately I do not
understand most of the technicalities described here.

How does someone like me learn about this aspect of PIO? Would you wise
folks step back and describe what the metadata does today?

Are the three PIO_STORAGE types (data, model, & metadata) documented
clearly anywhere? What are the metadata options, what does it store, & how
does it affect the engine lifecycle today?
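
(For orientation, the three repositories map to entries in conf/pio-env.sh roughly like this; the values below are illustrative, not a recommendation:)

  PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta      # apps, engine instances
  PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=PGSQL
  PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event    # EventServer data
  PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=PGSQL
  PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model    # trained models
  PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS
  PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc
  PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio
  PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
  PIO_STORAGE_SOURCES_LOCALFS_PATH=$PIO_FS_BASEDIR/models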

One of the main sources of confusion I've seen from people trying to work
with PIO is the multiple storage aspects (data, model, & metadata) combined
with multiple service types (eventserver & engine) and how these all
interconnect. Plus some engines have a requirement like Elasticsearch, but
it's not clear where that's required in the grand scheme.

Thanks for all the efforts to move forward with these changes,

*Mars
On Mon, Dec 12, 2016 at 03:13 Pat Ferrel <pa...@occamsmachete.com> wrote:

> OK, so that was too much detail. My immediate question is how to train on
> one machine and deploy on several others—all referencing the same instance
> data (model)? Before it was by copying the manifest, now there is no
> manifest.
>
>
>
> On Dec 7, 2016, at 5:43 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> My first question is how to train on an ephemeral machine to swap models
> into an already deployed prediction server, because this is what i do all
> the time. The only way to do this now is train first on dummy data then
> deploy and re-train as data comes in, but there are other issues and
> questions below. Some may be slightly off topic of this specific PR.
>
>
> > On Dec 5, 2016, at 10:00 AM, Donald Szeto <do...@apache.org> wrote:
> >
> > Hi all,
> >
> > I am moving the discussion of stateless build (
> > https://github.com/apache/incubator-predictionio/pull/328) here.
> Replying
> > Pat:
> >
> >> BTW @chanlee514 @dszeto Are we thinking of a new command, something like
> > pio register that would add metadata to the metastore? This would need to
> > be run every time the engine.json changed for instance? It would also do
> > not compile? Is there an alternative? What state does this leave us in?
> >
> > I imagine we would need pio register after this. Something like what
> docker
> > push would do for you today. Changes of engine.json will not require
> > registration because it is consumed during runtime by pio train and pio
> > deploy. We are phasing out pio build so that engine templates will be
> more
> > friendly with different IDEs.
>
> I’m all for removing the manifest and stateless build but I’m not sure we
> mean the same thing by stateless. My issue is more with stateless commands,
> or put differently as a fully flexible workflow. Which means all commands
> read metadata from the metastore, and only one, very explicitly sets
> metadata into the metastore. Doing the write in train doesn't consider the
> deploy-before-train and multi-tenancy use cases.
>
> Deploy then train:
> 1) pio eventserver to start ES on any machine
> 2) pio deploy to get the query server (prediction server) on any machine
> 3) pio train at any time on any machine and have a mechanism for deployed
> engines to discover the metadata they need when they need it or have it
> automatically updated when changed (pick a method push for deployed engines
> and pull for train)
> 4) send input at any time
>
> Multi-tenancy:
> This seems to imply a user visible id for an engine instance id in today’s
> nomenclature. For multi-tenancy, the user is going to want to set this
> instance id somewhere and should have stateless commands, not only
> stateless build.
>
> >
> >> After the push, what action create binary (I assume pio build) what
> > action adds metadata to the metastore (I assume pio train) So does this
> > require they run on the same machine? They often do not.
> > pio build will still create the binary at this point (and hopefully
> phased
> > out as mentioned). Right now the only metadata that is disappearing are
> > engine manifests. Engine instances will still be written after pio train,
> > and used by pio deploy.
> >
> >> One more question. After push how do we run the PredictionServer or
> train
> > on multiple machines? In the past this required copying the manifest.json
> > and making sure binaries are in the same location on all machines.
> > "In the same location" is actually a downside IMO of the manifest.json
> > design. Without manifest.json now, you would need to run pio commands
> from
> > a location with a built engine, because instead of looking at engine
> > manifests, it will now look locally for engine JARs. So deployment would
> > still involve copying engine JARs to a remote deployment machine, running
> > pio commands at the engine template location with engine-id and
> > engine-version arguments.
>
> I guess I also don't understand the need for engine-id and engine-version.
> Let’s do away with them. There is one metadata object that points to input
> data id, params, model id, and binary. This id can be assigned by the user.
>
> With the above in place we are ready to imagine an EventServer where you
> POST to pio-ip/dataset/resource-id (no keys) and GET from
> pio-ip/model/resource-id to do queries. This would allow multi-tenancy and
> merge the EventServer and PredictionServer under the well understood banner
> of REST. Extending this a little further we have all the commands
> implemented as REST APIs. The CLI becomes some simple scripts or binaries
> that hit the REST interface and an admin server that hits the same
> interface.
>
> This is compatible with the simple stateless build as a first step as long
> as we don’t perpetuate hidden instance ids and stateful commands like a
> train that creates the hidden id. But maybe I misunderstand the code or
> plans for next steps?
>
>
> >
> > Regards,
> > Donald
>
>

Re: Stateless Builds

Posted by Pat Ferrel <pa...@occamsmachete.com>.
OK, so that was too much detail. My immediate question is: how do we train on one machine and deploy on several others, all referencing the same instance data (model)? Before, it was by copying the manifest; now there is no manifest.



On Dec 7, 2016, at 5:43 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

My first question is how to train on an ephemeral machine to swap models into an already deployed prediction server, because this is what I do all the time. The only way to do this now is to train first on dummy data, then deploy and re-train as data comes in, but there are other issues and questions below. Some may be slightly off topic for this specific PR.


> On Dec 5, 2016, at 10:00 AM, Donald Szeto <do...@apache.org> wrote:
> 
> Hi all,
> 
> I am moving the discussion of stateless build (
> https://github.com/apache/incubator-predictionio/pull/328) here. Replying
> Pat:
> 
>> BTW @chanlee514 @dszeto Are we thinking of a new command, something like
> pio register that would add metadata to the metastore? This would need to
> be run every time the engine.json changed for instance? It would also do
> not compile? Is there an alternative? What state does this leave us in?
> 
> I imagine we would need pio register after this. Something like what docker
> push would do for you today. Changes of engine.json will not require
> registration because it is consumed during runtime by pio train and pio
> deploy. We are phasing out pio build so that engine templates will be more
> friendly with different IDEs.

I’m all for removing the manifest and for stateless build, but I’m not sure we mean the same thing by stateless. My issue is more with stateless commands or, put differently, a fully flexible workflow, which means all commands read metadata from the metastore and only one, very explicitly, sets metadata into the metastore. Doing the write in train doesn't consider the deploy-before-train and multi-tenancy use cases.

Deploy then train:
1) pio eventserver to start the EventServer on any machine
2) pio deploy to get the query server (prediction server) on any machine
3) pio train at any time on any machine, with a mechanism for deployed engines to discover the metadata they need when they need it, or to have it automatically updated when it changes (pick a method: push for deployed engines, pull for train)
4) send input at any time

Multi-tenancy:
This seems to imply a user-visible id for an engine instance, in today’s nomenclature. For multi-tenancy, the user is going to want to set this instance id somewhere and should have stateless commands, not only a stateless build.

> 
>> After the push, what action create binary (I assume pio build) what
> action adds metadata to the metastore (I assume pio train) So does this
> require they run on the same machine? They often do not.
> pio build will still create the binary at this point (and hopefully phased
> out as mentioned). Right now the only metadata that is disappearing are
> engine manifests. Engine instances will still be written after pio train,
> and used by pio deploy.
> 
>> One more question. After push how do we run the PredictionServer or train
> on multiple machines? In the past this required copying the manifest.json
> and making sure binaries are in the same location on all machines.
> "In the same location" is actually a downside IMO of the manifest.json
> design. Without manifest.json now, you would need to run pio commands from
> a location with a built engine, because instead of looking at engine
> manifests, it will now look locally for engine JARs. So deployment would
> still involve copying engine JARs to a remote deployment machine, running
> pio commands at the engine template location with engine-id and
> engine-version arguments.

I guess I also don't understand the need for engine-id and engine-version. Let’s do away with them. There is one metadata object that points to the input data id, params, model id, and binary; this id can be assigned by the user.

With the above in place we are ready to imagine an EventServer where you POST to pio-ip/dataset/resource-id (no keys) and GET from pio-ip/model/resource-id to do queries. This would allow multi-tenancy and merge the EventServer and PredictionServer under the well-understood banner of REST. Extending this a little further, we have all the commands implemented as REST APIs. The CLI becomes some simple scripts or binaries that hit the REST interface, and an admin server that hits the same interface.

This is compatible with the simple stateless build as a first step as long as we don’t perpetuate hidden instance ids and stateful commands like a train that creates the hidden id. But maybe I misunderstand the code or plans for next steps?


> 
> Regards,
> Donald


Re: Stateless Builds

Posted by Pat Ferrel <pa...@occamsmachete.com>.
My first question is how to train on an ephemeral machine to swap models into an already deployed prediction server, because this is what I do all the time. The only way to do this now is to train first on dummy data, then deploy and re-train as data comes in, but there are other issues and questions below. Some may be slightly off topic for this specific PR.


> On Dec 5, 2016, at 10:00 AM, Donald Szeto <do...@apache.org> wrote:
> 
> Hi all,
> 
> I am moving the discussion of stateless build (
> https://github.com/apache/incubator-predictionio/pull/328) here. Replying
> Pat:
> 
>> BTW @chanlee514 @dszeto Are we thinking of a new command, something like
> pio register that would add metadata to the metastore? This would need to
> be run every time the engine.json changed for instance? It would also do
> not compile? Is there an alternative? What state does this leave us in?
> 
> I imagine we would need pio register after this. Something like what docker
> push would do for you today. Changes of engine.json will not require
> registration because it is consumed during runtime by pio train and pio
> deploy. We are phasing out pio build so that engine templates will be more
> friendly with different IDEs.

I’m all for removing the manifest and for stateless build, but I’m not sure we mean the same thing by stateless. My issue is more with stateless commands or, put differently, a fully flexible workflow, which means all commands read metadata from the metastore and only one, very explicitly, sets metadata into the metastore. Doing the write in train doesn't consider the deploy-before-train and multi-tenancy use cases.

Deploy then train:
1) pio eventserver to start the EventServer on any machine
2) pio deploy to get the query server (prediction server) on any machine
3) pio train at any time on any machine, with a mechanism for deployed engines to discover the metadata they need when they need it, or to have it automatically updated when it changes (pick a method: push for deployed engines, pull for train)
4) send input at any time

Multi-tenancy:
This seems to imply a user-visible id for an engine instance, in today’s nomenclature. For multi-tenancy, the user is going to want to set this instance id somewhere and should have stateless commands, not only a stateless build.
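
(A hypothetical sketch of what user-set instance ids could enable for multi-tenancy; the --engine-instance-id flag, especially on train, is an assumption and not something that exists today:)

  pio train  --engine-instance-id tenant-a
  pio train  --engine-instance-id tenant-b
  pio deploy --engine-instance-id tenant-a --port 8000
  pio deploy --engine-instance-id tenant-b --port 8001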

> 
>> After the push, what action create binary (I assume pio build) what
> action adds metadata to the metastore (I assume pio train) So does this
> require they run on the same machine? They often do not.
> pio build will still create the binary at this point (and hopefully phased
> out as mentioned). Right now the only metadata that is disappearing are
> engine manifests. Engine instances will still be written after pio train,
> and used by pio deploy.
> 
>> One more question. After push how do we run the PredictionServer or train
> on multiple machines? In the past this required copying the manifest.json
> and making sure binaries are in the same location on all machines.
> "In the same location" is actually a downside IMO of the manifest.json
> design. Without manifest.json now, you would need to run pio commands from
> a location with a built engine, because instead of looking at engine
> manifests, it will now look locally for engine JARs. So deployment would
> still involve copying engine JARs to a remote deployment machine, running
> pio commands at the engine template location with engine-id and
> engine-version arguments.

I guess I also don't understand the need for engine-id and engine-version. Let’s do away with them. There is one metadata object that points to the input data id, params, model id, and binary; this id can be assigned by the user.

With the above in place we are ready to imagine an EventServer where you POST to pio-ip/dataset/resource-id (no keys) and GET from pio-ip/model/resource-id to do queries. This would allow multi-tenancy and merge the EventServer and PredictionServer under the well-understood banner of REST. Extending this a little further, we have all the commands implemented as REST APIs. The CLI becomes some simple scripts or binaries that hit the REST interface, and an admin server that hits the same interface.
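
(Sketching the shape of that REST surface; these routes are purely hypothetical and nothing like them exists in PIO today:)

  curl -X POST http://pio-ip/dataset/<resource-id> -d '{...}'   # send input, no keys
  curl "http://pio-ip/model/<resource-id>?user=u1"              # query the deployed model
  curl -X POST http://pio-ip/engine/<resource-id>/train         # commands exposed over the same interface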

This is compatible with the simple stateless build as a first step as long as we don’t perpetuate hidden instance ids and stateful commands like a train that creates the hidden id. But maybe I misunderstand the code or plans for next steps?


> 
> Regards,
> Donald
>