Posted to user@predictionio.apache.org by Dave Novelli <da...@ultravioletanalytics.com> on 2018/03/23 01:17:32 UTC

Unclear problem with using S3 as a storage data source

Hi all,

I'm using the Universal Recommender template and I'm trying to switch
storage data sources from local file to S3 for the model repository. I've
read the page at https://predictionio.apache.org/system/anotherdatastore/
to try to understand the configuration requirements, but when I run pio
train it reports an error and nothing shows up in the S3 bucket:

[ERROR] [S3Models] Failed to insert a model to
s3://pio-model/pio_modelAWJPjTYM0wNJe2iKBl0d

I created a new bucket named "pio-model" and granted full public
permissions.

Seemingly relevant settings from pio-env.sh:

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=S3
...

PIO_STORAGE_SOURCES_S3_TYPE=s3
PIO_STORAGE_SOURCES_S3_REGION=us-west-2
PIO_STORAGE_SOURCES_S3_BUCKET_NAME=pio-model

# I've tried with and without this
#PIO_STORAGE_SOURCES_S3_ENDPOINT=http://s3.us-west-2.amazonaws.com

# I've tried with and without this
#PIO_STORAGE_SOURCES_S3_BASE_PATH=pio-model
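
For what it's worth, here's a minimal sanity check of the bucket and
credentials from outside PIO, assuming the AWS CLI is installed and
configured with the same access keys PIO uses:

# Confirm which IAM identity the keys resolve to
aws sts get-caller-identity

# List the bucket (fails fast on a bad region or missing permissions)
aws s3 ls s3://pio-model/ --region us-west-2

# Prove PutObject works by writing a throwaway object from stdin
echo "write test" | aws s3 cp - s3://pio-model/pio-write-test.txt --region us-west-2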


Any suggestions where I can start troubleshooting my configuration?

Thanks,
Dave

Re: Unclear problem with using S3 as a storage data source

Posted by Dave Novelli <da...@ultravioletanalytics.com>.
I don't *think* I need more Spark nodes - I'm just using the one for
training on an r4.large instance I spin up and down as needed.

I was hoping to avoid adding any additional computational load to my
Event/Prediction/HBase/ES server (all running on a t2.medium) so I am
looking for a way to *not* install HDFS on there as well. S3 seemed like it
would be a super convenient way to pass the model files back and forth, but
it sounds like it wasn't implemented as a data source for the model
repository for UR.

Perhaps that's something I could implement and contribute? I can *kinda*
read Scala haha, maybe this would be a fun learning project. Do you think
it would be fairly straightforward?


Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | dave@ultravioletanalytics.com


Re: Unclear problem with using S3 as a storage data source

Posted by Dave Novelli <da...@ultravioletanalytics.com>.
Sorry Pat, I think I took some shortcuts in my initial explanation that are
causing some confusion :) I'll try laying everything out again in detail...

I have configured 2 servers in AWS:

*Event/Prediction Server - t2.medium*
- Runs permanently
- Using swap to deal with 4GB mem limit (I know, I know)
- ElasticSearch
- HBase (pseudo-distributed mode, using normal files instead of HDFS)
- Web server for events and 6 prediction models

*Training Server - r4.large*
- Only spun up to execute "pio train" for the 6 UR models I've configured,
then spun back down
- Spark

My specific problem is that when "LOCALFS" is set as the model data store,
running "pio train" on the training server deposits all the stub files in
.pio_store/models/ on that machine.

When I run "pio deploy" on the Event/Prediction Server, it's looking for
those files in the .pio_store/models/ directory on the Event/Prediction
server, and they're obviously not there. If I manually copy the files from
the Training server to the Event/Prediction server then "pio deploy" works
as expected.

My thought is that if the Training server saves those model stub files to
S3, then the Event/Prediction server can read those files from S3 and I
won't have to manually copy them.
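
Even without native PIO support, I could presumably script that transfer
with the AWS CLI in the same scripts that spin the training server up and
down - a sketch, assuming both instances have credentials for the bucket:

# On the training server, after "pio train" finishes:
aws s3 sync ~/.pio_store/models s3://pio-model/models

# On the Event/Prediction server, before "pio deploy":
aws s3 sync s3://pio-model/models ~/.pio_store/models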


Hopefully this clears my situation up!


As a note - I realize t2.medium is not a feasible instance type for any
significant production system, but I'm bootstrapping a demo system on a
very tight budget for a site that will almost certainly have extremely low
traffic. In my initial tests I've managed to get UR working on this
configuration, and I will be doing some simple load testing soon to see how
far I can push it before it crashes. Speed is obviously not an issue at the
moment, but once it is (and once there's some funding), that t2 will be
replaced with an r4 or an m5.

Cheers,
Dave


Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | dave@ultravioletanalytics.com


Re: Unclear problem with using S3 as a storage data source

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Ok, the problem, as I thought at first, is that Spark creates the model and the PredictionServer must read it.

My methods below still work. There is very little extra performance cost to creating a pseudo-cluster for HDFS if it is all still running on one machine.

You can also write it to localfs on the Spark/training machine and copy it to the PredictionServer before deploy. A simple scp in a script would do that.
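
An untested sketch of that script - PREDICTION_HOST is a placeholder, and it
assumes key-based SSH and the default .pio_store location on both machines:

#!/bin/bash
# Run on the training server after "pio train" completes
PREDICTION_HOST=prediction.example.com
scp -r ~/.pio_store/models "${PREDICTION_HOST}:~/.pio_store/"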

Again I have no knowledge of using S3 for such things. If that works, someone else will have to help.





Re: Unclear problem with using S3 as a storage data source

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Sorry, then I don't understand: what part has no access to the file system on the single machine?

Also, a t2 is not going to work with PIO. Spark 2 alone requires something like 2g for a do-nothing empty executor and driver, so a real app will require 16g or so minimum (my laptop has 16g). Running the OS, HBase, ES, and Spark will get you to over 8g, then add data. Spark keeps all data needed at a given phase of the calculation in memory across the cluster; that's where it gets its speed. Welcome to big data :-)


Re: Unclear problem with using S3 as a storage data source

Posted by Pat Ferrel <pa...@occamsmachete.com>.
So you need to have more Spark nodes and this is the problem?

If so, set up HBase on pseudo-clustered HDFS so you have a master node
address even though all storage is on one machine. Then you use that
version of HDFS to tell Spark where to look for the model. It gives the
model a URI.
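
Roughly, that means pointing HDFS at localhost and telling PIO to keep
models there. From memory, so treat the exact property and variable names
as a sketch rather than gospel:

core-site.xml (pseudo-distributed, everything on one machine):

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>

pio-env.sh:

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS

PIO_STORAGE_SOURCES_HDFS_TYPE=hdfs
PIO_STORAGE_SOURCES_HDFS_PATH=/models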

I have never used the raw S3 support. HDFS can also be backed by S3, but
you use the HDFS APIs; it is an HDFS config setting to use S3.
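
From the Hadoop docs, that S3 backing goes through the s3a filesystem -
something like this in core-site.xml (untested by me, and it needs the
hadoop-aws jar on the classpath):

<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>

After that, a path like s3a://pio-model/models should work anywhere an
HDFS-style URI is accepted.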

It is a rather unfortunate side effect of PIO but there are 2 ways to solve
this with no extra servers.

Maybe someone else knows how to use S3 natively for the model stub?


Re: Unclear problem with using S3 as a storage data source

Posted by Dave Novelli <da...@ultravioletanalytics.com>.
Well, it looks like the local file system isn't an option in a multi-server
configuration without manually setting up a process to transfer those stub
model files.

I trained models on one heavy-weight temporary instance, and then when I
went to deploy from the prediction server instance it failed due to missing
files. I copied the .pio_store/models directory from the training server
over to the prediction server and then was able to deploy.

So, in a dual-instance configuration what's the best way to store the
files? I'm using pseudo-distributed HBase with standard file system storage
instead of HDFS (my current aim is keeping down cost and complexity for a
pilot project).

Is S3 back on the table as an option?


-- 
Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | dave@ultravioletanalytics.com

Re: Unclear problem with using S3 as a storage data source

Posted by Dave Novelli <da...@ultravioletanalytics.com>.
Ahhh ok, thanks Pat!


Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | dave@ultravioletanalytics.com

Re: Unclear problem with using S3 as a storage data source

Posted by Pat Ferrel <pa...@occamsmachete.com>.
There is no need to put Universal Recommender models in S3; they are not
used and only exist (in stub form) because PIO requires them. The actual
model lives in Elasticsearch and uses special features of ES to perform the
last phase of the algorithm, and so cannot be replaced.

The stub PIO models have no data and will be tiny. Putting them in HDFS or
the local file system is recommended.

