Posted to user@predictionio.apache.org by Brian Chiu <br...@snaptee.co> on 2017/09/20 10:23:21 UTC

How to train and deploy on different machines?

Hi,

I would like to be able to train and run the model on different machines.
The reason is that, on my dataset, training takes around 16GB of memory
while deploying only needs 8GB.  To save money, it would be better to use
an 8GB machine in production and only start a 16GB one, perhaps weekly,
for training.  Is this possible with PredictionIO + the Universal
Recommender?

I have done some searching and found a related guide here:
https://github.com/actionml/docs.actionml.com/blob/master/pio_load_balancing.md
It copies the whole template directory and then runs `pio deploy`.  But
in their case an HBase and Elasticsearch cluster is used, whereas in my
case only a single machine is used, running Elasticsearch and PostgreSQL.
Will this work?  (I am flexible about using PostgreSQL, localfs, or
HBase, but I cannot afford a cluster.)
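
For reference, my single-machine storage setup in conf/pio-env.sh looks
roughly like the sketch below (values are illustrative placeholders, not
my real hosts or credentials):

    # conf/pio-env.sh (sketch only)
    PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
    PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=PGSQL
    PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
    PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=PGSQL
    PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
    PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=PGSQL

    PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc
    PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio
    PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio
    PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio

    PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
    PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
    PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200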

Perhaps another solution is to make the 16GB machine a Spark slave,
start it before training starts, and have the 8GB machine connect to
it.  Then call `pio train; pio deploy` on the 8GB machine and finally
shut down the 16GB machine.  But I have no idea whether this can work,
and if it can, is there any documentation I can look into?
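
To make that idea concrete, what I have in mind is roughly the following
(only a sketch; the hostnames, Spark master URL, and memory settings are
made up):

    # on the 8GB production machine: run a small standalone Spark master
    $SPARK_HOME/sbin/start-master.sh

    # on the 16GB machine, started only for training: join as a worker
    $SPARK_HOME/sbin/start-slave.sh spark://prod-8gb:7077

    # on the 8GB machine: train against that master, then deploy
    pio train -- --master spark://prod-8gb:7077 --executor-memory 14g
    pio deploy

    # finally shut the 16GB machine down until the next training run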

Any other method is welcome!  Zero downtime is preferred but not necessary.

Thanks in advance.


Best Regards,
Brian

Re: How to train and deploy on different machines?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
We do deployments and customize things for users. When we deploy PredictionIO we typically have one machine that is only for the permanent PIO servers. It runs the PredictionServer (started with `pio deploy`) and the EventServer (started with `pio eventserver`). These services communicate with Elasticsearch and HBase. We usually have the DB (HBase) and Elasticsearch on separate machines. They are under heavy load in production and during training, so keeping them separate allows you to scale as needed.

Spark is the oddball because it can be temporary. Here the minimum is 2 machines, one for the Spark driver (launched with `pio train`) and at least one Spark executor machine with Spark installed but nothing else.

This means PIO is installed on the EventServer + PredictionServer machine and the Spark driver machine, so in 2 places. The other services can be put wherever you want.

The temporary machines are the Spark driver and Spark executor(s). Since PIO is installed on the driver machine you will want to save its config by “stopping” rather than “deleting” the instance.
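
Concretely, the split looks something like this (machine names are just examples, not a required layout):

    # permanent PredictionServer + EventServer machine (PIO installed)
    pio eventserver &     # takes events, always on
    pio deploy            # answers queries, always on

    # permanent data-store machines: Elasticsearch and HBase, no PIO needed

    # temporary Spark driver machine (PIO installed)
    pio train             # run whenever you retrain

    # temporary Spark executor machine(s): Spark installed, nothing else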




Re: How to train and deploy on different machines?

Posted by Brian Chiu <br...@snaptee.co>.
Dear Pat,

Thanks for the detailed guide.  It is nice to know it is possible.
But I am not sure I understand it correctly, so could you please
point out any misunderstandings in the following (if there are any)?

====
Let's say I have 3 machines.

There is one machine [EventServer and data store] for ES and HBase+HDFS
(or Postgres, but that is not recommended).
The other 2 machines will both connect to this machine.
It is permanent.

Machine [TrainingServer] will run `pio build` and `pio train`.
This step pulls training data from [EventServer] and then stores the
model and metadata back.
It is not permanent.

Machine [PredictionServer] gets a copy of the template from machine
[TrainingServer] (this only needs to be done once).
Then it runs `pio deploy`.
It is not a Spark driver or executor for training.
Write a cron job for `pio deploy`.
It is permanent.
====

Thanks

Brian


Re: How to train and deploy on different machines?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Yes, this is the recommended config (Postgres is not recommended, but more on that later). Spark is only needed during training, but the `pio train` process creates a driver and executors in Spark. The driver will be the `pio train` machine, so you must install pio on it. You should have at least 2 Spark machines because the driver and executors need roughly the same memory; more executors will train faster.

You will have to spread the pio “workflow” out over a permanent deploy+eventserver machine and the temporary Spark machines. I usually call the former a combo PredictionServer and EventServer. These are 2 JVM processes that take events and respond to queries, and so must be available all the time. You will run `pio eventserver` and `pio deploy` on this machine; the Spark driver machine will run `pio train`. Since no state is stored in PIO itself this will work, because the machines get state from the DBs (HBase is recommended, and Elasticsearch). Install pio and the UR in the same location on all machines because the path to the UR is used by PIO to give an id to the engine (not ideal, but oh well).

Once set up:
1. Run `pio eventserver` on the permanent PS/ES machine and input your data into the EventServer.
2. Run `pio build` on the “driver” machine and `pio train` on the same machine. This builds the UR, puts metadata about the instance in PIO, and creates the Spark driver, which can use a separate machine or 3 as Spark executors.
3. Then copy the UR directory to the PS/ES machine and do `pio deploy` from the copied directory.
4. Shut down the driver machine and Spark executors. For AWS, “stopping” them means the config is saved so you only pay for EBS storage. You will start them again before the next train.
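
In shell terms, the first full cycle looks roughly like this (hostnames
and paths are only examples):

    # on the permanent PS/ES machine
    pio eventserver &                       # then import your events

    # on the temporary “driver” machine, inside the UR template directory
    pio build
    pio train                               # Spark executors do the heavy lifting

    # copy the built UR directory to the PS/ES machine, e.g.
    rsync -a ~/universal-recommender/ ps-es-machine:~/universal-recommender/

    # on the PS/ES machine, from the copied directory
    pio deploy

    # then stop (don't terminate) the driver and executor instances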

From then on there is no need to copy the UR directory; just spin up the driver and any other Spark machines, run `pio train`, and you are done. The new model is automatically hot-swapped for the old one with no downtime and no need to re-deploy.
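
So the recurring part can be as small as a cron entry on the driver machine, something like this (only a sketch; starting and stopping the Spark instances beforehand depends on your cloud tooling and is not shown):

    # crontab on the Spark driver machine: retrain every Sunday at 03:00
    0 3 * * 0  cd /home/pio/universal-recommender && pio train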

This only works in this order if you want to take advantage of a temporary Spark cluster. PIO is installed on the PS/ES machine and the “driver” machine in exactly the same way, connecting to the same stores.

Hmm, I should write a How to for this...


