Posted to user@spark.apache.org by Girish Vasmatkar <gi...@hotwaxsystems.com> on 2018/10/01 06:48:48 UTC

Use SparkContext in Web Application

Hi All

We are very early into our Spark days so the following may sound like a
novice question :) I will try to keep this as short as possible.

We are trying to use Spark to introduce a recommendation engine that can be
used to provide product recommendations and need help on some design
decisions before moving forward. Ours is a web application running on
Tomcat. So far, I have created a simple POC (a standalone Java program) that
reads in a CSV file, feeds it to FPGrowth, fits the data, and runs
transformations. I would like to be able to do the following -


   - Scheduler runs nightly in Tomcat (which it does currently) and reads
   everything from the DB to train/fit the system. This can grow into some
   really large data, and every day we will have new data. Should I just use
   SparkContext here, within my scheduler, to FIT the system? Is this the
   correct way to go about it? I am also planning to save the model on S3,
   which should be okay; we also thought about using HDFS. The scheduler's
   job will be just to create the model, save it, and be done with it (a
   rough sketch of this job follows this list).
   - On the product page, we can then use the saved model to display the
   product recommendations for a particular product.
   - My understanding is that I should be able to use SparkContext here in
   my web application to just load the saved model and use it to derive the
   recommendations. Is this a good design? The problem I see with this
   approach is that the SparkContext takes time to initialize, and that may
   cost us dearly. Or should we keep one SparkContext per web application so
   that a single instance is shared? We could initialize a SparkContext
   during the application context initialization phase; I have sketched what
   I mean further below.
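
To make the first point concrete, here is roughly what the nightly scheduler
job looks like in my head (only a sketch; the S3 path, the input columns, and
the FPGrowth thresholds are placeholders, not real values from our system):

    import org.apache.spark.ml.fpm.FPGrowth;
    import org.apache.spark.ml.fpm.FPGrowthModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class NightlyTrainingJob {
        public static void main(String[] args) {
            // Short-lived SparkSession owned entirely by the scheduled job.
            SparkSession spark = SparkSession.builder()
                    .appName("nightly-recommendation-training")
                    .getOrCreate();

            // The POC reads a CSV export; in production this would come from the DB.
            // FPGrowth expects an array column of item ids per transaction.
            Dataset<Row> transactions = spark.read()
                    .option("header", "true")
                    .csv("s3a://my-bucket/exports/transactions.csv")   // placeholder path
                    .selectExpr("split(product_ids, ' ') as items");   // placeholder column

            FPGrowthModel model = new FPGrowth()
                    .setItemsCol("items")
                    .setMinSupport(0.01)        // placeholder thresholds
                    .setMinConfidence(0.1)
                    .fit(transactions);

            // Persist the fitted model; the web app only ever reads this path.
            model.write().overwrite().save("s3a://my-bucket/models/fpgrowth/latest");

            spark.stop();
        }
    }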


Since I am fairly new to using Spark properly, please help me decide whether
the way I plan to use Spark is the recommended way. I have also seen use cases
involving Kafka that communicate with Spark, but can we not do it directly
using the SparkContext? I am sure a lot of my understanding is wrong, so
please feel free to correct me.
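
For reference, this is roughly how I picture the web application side (again
only a sketch, with a placeholder master URL and model path; whether keeping a
long-lived SparkSession inside Tomcat like this is a good idea is exactly what
I am asking): one SparkSession created when the application context starts,
and the saved model loaded once so that requests never pay the initialization
cost.

    import javax.servlet.ServletContextEvent;
    import javax.servlet.ServletContextListener;
    import javax.servlet.annotation.WebListener;

    import org.apache.spark.ml.fpm.FPGrowthModel;
    import org.apache.spark.sql.SparkSession;

    @WebListener
    public class RecommendationModelLoader implements ServletContextListener {

        @Override
        public void contextInitialized(ServletContextEvent sce) {
            // One SparkSession for the whole web application, created once at
            // startup so the slow context initialization is not paid per request.
            SparkSession spark = SparkSession.builder()
                    .appName("webapp-recommendations")
                    .master("local[*]")    // placeholder; a real master URL in production
                    .getOrCreate();

            // Load the model that the nightly scheduler job saved.
            FPGrowthModel model =
                    FPGrowthModel.load("s3a://my-bucket/models/fpgrowth/latest");

            // Expose both to servlets/controllers via the servlet context.
            sce.getServletContext().setAttribute("sparkSession", spark);
            sce.getServletContext().setAttribute("recommendationModel", model);
        }

        @Override
        public void contextDestroyed(ServletContextEvent sce) {
            SparkSession spark =
                    (SparkSession) sce.getServletContext().getAttribute("sparkSession");
            if (spark != null) {
                spark.stop();
            }
        }
    }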

Thanks and Regards,
Girish Vasmatkar
HotWax Systems

Re: Use SparkContext in Web Application

Posted by Girish Vasmatkar <gi...@hotwaxsystems.com>.
All

Can someone please shed some light on the above query? Any help is greatly
appreciated.

Thanks,
Girish Vasmatkar
HotWax Systems


Re: Use SparkContext in Web Application

Posted by Girish Vasmatkar <gi...@hotwaxsystems.com>.
Thank you Vincent and Jorn for your inputs, much appreciated.

Our web app already has a scheduler mechanism and other jobs are already
running in the system. Would you still prefer to decouple model training into
a separate scheduling tool outside of our web-app JVM?
We are using test data for now as we are very new to Spark. It's 256 MB of
data, the model size is coming out to be around 240 KB, and it is being saved
in Parquet format by default.

Thanks again for your help!


Re: Use SparkContext in Web Application

Posted by vincent gromakowski <vi...@gmail.com>.
Decoupling the web app from the Spark backend is recommended. Training the
model can be launched in the background via a scheduling tool. Inferring with
the model through Spark in interactive mode is not a good option, as it would
be done on unitary data, and Spark is better at working with large datasets.
The original purpose of inferring with Spark was to do it offline for large
datasets and store the results in a KV store, for instance; then any consumer,
like your web app, would just read the KV store. I would personally store the
trained model in PFA or PMML and serve it via another tool. There are lots of
tools to serve models via an API, from managed solutions like Amazon SageMaker
to open source solutions like Prediction.io.
If you still want to call the Spark backend from your web app, which I don't
recommend, I would do it using Spark Jobserver or Livy to interact via a REST
API.
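
To illustrate the offline pattern, here is a rough sketch (the Redis client,
the host, and the column handling are assumptions for illustration; any KV
store would do): score the catalogue in the nightly batch from the model's
association rules and write one key per product, so the web app only ever
does a key lookup and never touches Spark.

    import java.util.List;

    import org.apache.spark.ml.fpm.FPGrowthModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.functions;

    import redis.clients.jedis.Jedis;

    public class OfflineRecommendationExport {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("offline-recommendation-export")
                    .getOrCreate();

            FPGrowthModel model =
                    FPGrowthModel.load("s3a://my-bucket/models/fpgrowth/latest");  // placeholder path

            // Association rules: antecedent (items bought) -> consequent (items to suggest).
            // Keep single-item antecedents for "people who bought X also bought Y",
            // strongest rules first.
            Dataset<Row> rules = model.associationRules();
            List<Row> singleItemRules = rules
                    .filter("size(antecedent) = 1")
                    .orderBy(functions.desc("confidence"))
                    .selectExpr("antecedent[0] as product", "consequent[0] as recommended")
                    .collectAsList();   // fine on the driver for a small rule set

            // One Redis list per product, read directly by the web app.
            try (Jedis jedis = new Jedis("localhost", 6379)) {   // assumed KV store
                for (Row row : singleItemRules) {
                    jedis.rpush("reco:" + row.getString(0), row.getString(1));
                }
            }

            spark.stop();
        }
    }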


Re: Use SparkContext in Web Application

Posted by Jörn Franke <jo...@gmail.com>.
Depending on your model size, you can store it as PFA or PMML and run the prediction in Java. For larger models you will need a custom solution, potentially using a Spark Thrift Server / Spark Jobserver / Livy and a cache to store predictions that have already been calculated (e.g. based on previous requests to predict). Then you also run into questions about caching prediction results against the model version that was used, evicting non-relevant predictions, and so on.
Making the model available as a service is currently a topic where a lot of custom "plumbing" is required, especially if models are a little bit larger.
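
As a rough illustration of the caching idea (only a sketch; Guava's cache and
all sizes/expiry values here are my own assumptions, not a prescription): key
the cached predictions on the model version plus the request, so entries
computed against an old model version simply stop being hit and eventually
age out.

    import java.util.List;
    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;

    public class PredictionCache {

        // Cached recommendations, keyed on "<modelVersion>:<productId>" so a new
        // model version naturally misses while stale entries are evicted over time.
        private final Cache<String, List<String>> cache = CacheBuilder.newBuilder()
                .maximumSize(100_000)                    // made-up capacity
                .expireAfterWrite(24, TimeUnit.HOURS)    // made-up eviction policy
                .build();

        public List<String> get(String modelVersion, String productId) {
            return cache.getIfPresent(modelVersion + ":" + productId);
        }

        public void put(String modelVersion, String productId, List<String> recommendations) {
            cache.put(modelVersion + ":" + productId, recommendations);
        }
    }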
