Posted to user@spark.apache.org by ll <du...@gmail.com> on 2014/11/07 08:26:34 UTC

word2vec: how to save an mllib model and reload it?

what is the best way to save an mllib model that you just trained and reload
it in the future?  specifically, i'm using the mllib word2vec model...
thanks.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

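The plain-Java-serialization approach suggested in the replies can be sketched as follows. This is a minimal sketch, not a definitive recipe: it assumes a spark-shell session where `sc` exists, and `"data.txt"` / `"word2vec.model"` are placeholder paths.

```scala
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

// Train on an RDD[Seq[String]] of tokenized sentences.
val input = sc.textFile("data.txt").map(_.split(" ").toSeq)
val model: Word2VecModel = new Word2Vec().fit(input)

// Save: the model object is Serializable, so plain Java serialization works.
val oos = new ObjectOutputStream(new FileOutputStream("word2vec.model"))
try oos.writeObject(model) finally oos.close()

// Reload in a later driver program.
val ois = new ObjectInputStream(new FileInputStream("word2vec.model"))
val reloaded = try ois.readObject().asInstanceOf[Word2VecModel] finally ois.close()
reloaded.findSynonyms("spark", 5).foreach(println)
```

Note this writes to the local filesystem of the driver, which is exactly the limitation Carsten raises further down the thread.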

Re: word2vec: how to save an mllib model and reload it?

Posted by Simon Chan <si...@gmail.com>.
Just want to elaborate more on Duy's suggestion on using PredictionIO.

PredictionIO will store the model automatically if you return it in the
training function.
An example using CF:

 def train(data: PreparedData): PersistentMatrixFactorizationModel = {
    val m = ALS.train(data.ratings, ap.rank, ap.numIterations, ap.lambda)
    new PersistentMatrixFactorizationModel(
      rank = m.rank,
      userFeatures = m.userFeatures,
      productFeatures = m.productFeatures)
  }


And the persisted model will be passed to the predict function when you
query for prediction:

def predict(
    model: PersistentMatrixFactorizationModel,
    query: Query): PredictedResult = {
  val productScores = model.recommendProducts(query.user, query.num)
    .map(r => ProductScore(r.product, r.rating))
  new PredictedResult(productScores)
}



Some templates and tutorials for MLlib are here:
http://docs.prediction.io/0.8.1/templates/

Simon


On Fri, Nov 7, 2014 at 10:11 AM, Nick Pentreath <ni...@gmail.com>
wrote:

> Sure - in theory this sounds great. But in practice it's much faster and a
> whole lot simpler to just serve the model from single instance in memory.
> Optionally you can multithread within that (as Oryx 1 does).
>
> There are very few real world use cases where the model is so large that
> it HAS to be distributed.
>
> Having said this, it's certainly possible to distribute model serving for
> factor-like models (like ALS). One idea I'm working on now is using
> Elasticsearch for exactly this purpose - but that's more because I'm using it
> for filtering of recommendation results and combining with search, so
> overall it's faster to do it this way.
>
> For the pure matrix algebra part, single instance in memory is way faster.
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox>
>
>
> On Fri, Nov 7, 2014 at 8:00 PM, Duy Huynh <du...@gmail.com> wrote:
>
>> hi nick.. sorry about the confusion.  originally i had a question
>> specifically about word2vec, but my follow up question on distributed model
>> is a more general question about saving different types of models.
>>
>> on distributed model, i was hoping to implement a model parallelism, so
>> that different workers can work on different parts of the models, and then
>> merge the results at the end at the single master model.
>>
>>
>>
>> On Fri, Nov 7, 2014 at 12:20 PM, Nick Pentreath <nick.pentreath@gmail.com
>> > wrote:
>>
>>> Currently I see the word2vec model is collected onto the master, so the
>>> model itself is not distributed.
>>>
>>> I guess the question is why do you need  a distributed model? Is the
>>> vocab size so large that it's necessary? For model serving in general,
>>> unless the model is truly massive (ie cannot fit into memory on a modern
>>> high end box with 64, or 128GB ram) then single instance is way faster and
>>> simpler (using a cluster of machines is more for load balancing / fault
>>> tolerance).
>>>
>>> What is your use case for model serving?
>>>
>>> —
>>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>>
>>>
>>> On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh <du...@gmail.com>
>>> wrote:
>>>
>>>> you're right, serialization works.
>>>>
>>>> what is your suggestion on saving a "distributed" model?  so part of
>>>> the model is in one cluster, and some other parts of the model are in other
>>>> clusters.  during runtime, these sub-models run independently in their own
>>>> clusters (load, train, save).  and at some point during run time these
>>>> sub-models merge into the master model, which also loads, trains, and saves
>>>> at the master level.
>>>>
>>>> much appreciated.
>>>>
>>>>
>>>>
>>>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks <ev...@gmail.com>
>>>> wrote:
>>>>
>>>>> There's some work going on to support PMML -
>>>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
>>>>> been merged into master.
>>>>>
>>>>> What are you used to doing in other environments? In R I'm used to
>>>>> running save(), same with MATLAB. In Python, either pickling things or
>>>>> dumping to JSON seems pretty common (even the scikit-learn docs recommend
>>>>> pickling -
>>>>> http://scikit-learn.org/stable/modules/model_persistence.html). These
>>>>> all seem basically equivalent to Java serialization to me.
>>>>>
>>>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>>>> something) make sense to add?
>>>>>
>>>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh <du...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> that works.  is there a better way in spark?  this seems like the
>>>>>> most common feature for any machine learning work - to be able to save your
>>>>>> model after training it and load it later.
>>>>>>
>>>>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks <evan.sparks@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Plain old java serialization is one straightforward approach if
>>>>>>> you're in java/scala.
>>>>>>>
>>>>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll <du...@gmail.com> wrote:
>>>>>>>
>>>>>>>> what is the best way to save an mllib model that you just trained
>>>>>>>> and reload
>>>>>>>> it in the future?  specifically, i'm using the mllib word2vec
>>>>>>>> model...
>>>>>>>> thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: word2vec: how to save an mllib model and reload it?

Posted by Nick Pentreath <ni...@gmail.com>.
Sure - in theory this sounds great. But in practice it's much faster and a whole lot simpler to just serve the model from single instance in memory. Optionally you can multithread within that (as Oryx 1 does).


There are very few real world use cases where the model is so large that it HAS to be distributed.




Having said this, it's certainly possible to distribute model serving for factor-like models (like ALS). One idea I'm working on now is using Elasticsearch for exactly this purpose - but that's more because I'm using it for filtering of recommendation results and combining with search, so overall it's faster to do it this way.




For the pure matrix algebra part, single instance in memory is way faster. 


—
Sent from Mailbox

On Fri, Nov 7, 2014 at 8:00 PM, Duy Huynh <du...@gmail.com> wrote:

> hi nick.. sorry about the confusion.  originally i had a question
> specifically about word2vec, but my follow up question on distributed model
> is a more general question about saving different types of models.
> on distributed model, i was hoping to implement a model parallelism, so
> that different workers can work on different parts of the models, and then
> merge the results at the end at the single master model.
> On Fri, Nov 7, 2014 at 12:20 PM, Nick Pentreath <ni...@gmail.com>
> wrote:
>> Currently I see the word2vec model is collected onto the master, so the
>> model itself is not distributed.
>>
>> I guess the question is why do you need  a distributed model? Is the vocab
>> size so large that it's necessary? For model serving in general, unless the
>> model is truly massive (ie cannot fit into memory on a modern high end box
>> with 64, or 128GB ram) then single instance is way faster and simpler
>> (using a cluster of machines is more for load balancing / fault tolerance).
>>
>> What is your use case for model serving?
>>
>> —
>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>
>>
>> On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh <du...@gmail.com> wrote:
>>
>>> you're right, serialization works.
>>>
>>> what is your suggestion on saving a "distributed" model?  so part of the
>>> model is in one cluster, and some other parts of the model are in other
>>> clusters.  during runtime, these sub-models run independently in their own
>>> clusters (load, train, save).  and at some point during run time these
>>> sub-models merge into the master model, which also loads, trains, and saves
>>> at the master level.
>>>
>>> much appreciated.
>>>
>>>
>>>
>>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks <ev...@gmail.com>
>>> wrote:
>>>
>>>> There's some work going on to support PMML -
>>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
>>>> been merged into master.
>>>>
>>>> What are you used to doing in other environments? In R I'm used to
>>>> running save(), same with MATLAB. In Python, either pickling things or
>>>> dumping to JSON seems pretty common (even the scikit-learn docs recommend
>>>> pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
>>>> These all seem basically equivalent to Java serialization to me.
>>>>
>>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>>> something) make sense to add?
>>>>
>>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh <du...@gmail.com>
>>>> wrote:
>>>>
>>>>> that works.  is there a better way in spark?  this seems like the most
>>>>> common feature for any machine learning work - to be able to save your
>>>>> model after training it and load it later.
>>>>>
>>>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks <ev...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Plain old java serialization is one straightforward approach if you're
>>>>>> in java/scala.
>>>>>>
>>>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll <du...@gmail.com> wrote:
>>>>>>
>>>>>>> what is the best way to save an mllib model that you just trained and
>>>>>>> reload
>>>>>>> it in the future?  specifically, i'm using the mllib word2vec model...
>>>>>>> thanks.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>

Re: word2vec: how to save an mllib model and reload it?

Posted by Duy Huynh <du...@gmail.com>.
hi nick.. sorry about the confusion.  originally i had a question
specifically about word2vec, but my follow up question on distributed model
is a more general question about saving different types of models.

on distributed model, i was hoping to implement a model parallelism, so
that different workers can work on different parts of the models, and then
merge the results at the end at the single master model.



On Fri, Nov 7, 2014 at 12:20 PM, Nick Pentreath <ni...@gmail.com>
wrote:

> Currently I see the word2vec model is collected onto the master, so the
> model itself is not distributed.
>
> I guess the question is why do you need  a distributed model? Is the vocab
> size so large that it's necessary? For model serving in general, unless the
> model is truly massive (ie cannot fit into memory on a modern high end box
> with 64, or 128GB ram) then single instance is way faster and simpler
> (using a cluster of machines is more for load balancing / fault tolerance).
>
> What is your use case for model serving?
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox>
>
>
> On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh <du...@gmail.com> wrote:
>
>> you're right, serialization works.
>>
>> what is your suggestion on saving a "distributed" model?  so part of the
>> model is in one cluster, and some other parts of the model are in other
>> clusters.  during runtime, these sub-models run independently in their own
>> clusters (load, train, save).  and at some point during run time these
>> sub-models merge into the master model, which also loads, trains, and saves
>> at the master level.
>>
>> much appreciated.
>>
>>
>>
>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks <ev...@gmail.com>
>> wrote:
>>
>>> There's some work going on to support PMML -
>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
>>> been merged into master.
>>>
>>> What are you used to doing in other environments? In R I'm used to
>>> running save(), same with MATLAB. In Python, either pickling things or
>>> dumping to JSON seems pretty common (even the scikit-learn docs recommend
>>> pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
>>> These all seem basically equivalent to Java serialization to me.
>>>
>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>> something) make sense to add?
>>>
>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh <du...@gmail.com>
>>> wrote:
>>>
>>>> that works.  is there a better way in spark?  this seems like the most
>>>> common feature for any machine learning work - to be able to save your
>>>> model after training it and load it later.
>>>>
>>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks <ev...@gmail.com>
>>>> wrote:
>>>>
>>>>> Plain old java serialization is one straightforward approach if you're
>>>>> in java/scala.
>>>>>
>>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll <du...@gmail.com> wrote:
>>>>>
>>>>>> what is the best way to save an mllib model that you just trained and
>>>>>> reload
>>>>>> it in the future?  specifically, i'm using the mllib word2vec model...
>>>>>> thanks.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: word2vec: how to save an mllib model and reload it?

Posted by Xiangrui Meng <me...@gmail.com>.
We are working on import/export for MLlib models. The umbrella JIRA is
https://issues.apache.org/jira/browse/SPARK-4587. In 1.3, we are going
to have save/load for linear models, naive Bayes, ALS, and tree
models. I created a JIRA for Word2Vec and set the target version to
1.4. If anyone is interested in working on it, please ping me on the
JIRA. -Xiangrui
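The interface that came out of SPARK-4587 looks roughly like this in Spark 1.3 (a sketch: the path and the `ratings` RDD are placeholders):

```scala
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel}

// Spark 1.3+: supported models get a save(sc, path) method with a matching
// companion-object load(sc, path); the path is any Hadoop-supported URI.
val model = ALS.train(ratings, 10 /* rank */, 10 /* iterations */, 0.01 /* lambda */)
model.save(sc, "hdfs:///models/als")

// Later, possibly in a different application:
val reloaded = MatrixFactorizationModel.load(sc, "hdfs:///models/als")
val recs = reloaded.recommendProducts(1 /* user id */, 10)
```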

On Thu, Feb 5, 2015 at 9:11 AM, Carsten Schnober
<sc...@ukp.informatik.tu-darmstadt.de> wrote:
> As a Spark newbie, I've come across this thread. I'm playing with Word2Vec in
> our Hadoop cluster and here's my issue with classic Java serialization of
> the model: I don't have SSH access to the cluster master node.
> Here's my code for computing the model:
>
>     val input = sc.textFile("README.md").map(line => line.split(" ").toSeq)
>     val word2vec = new Word2Vec();
>     val model = word2vec.fit(input);
>     val oos = new ObjectOutputStream(new FileOutputStream(modelFile));
>     oos.writeObject(model);
>     oos.close();
>
> I can do that locally and get the file as desired. But that is of little use
> for me if the file is stored on the master.
>
> I've alternatively serialized the vectors to HDFS using this code:
>
>     val vectors = model.getVectors;
>     val output = sc.parallelize(vectors.toSeq);
>     output.saveAsObjectFile(modelFile);
>
> Indeed, this results in a serialization on HDFS so I can access it as a
> user. However, I have not figured out how to create a new Word2VecModel
> object from those files.
>
> Any clues?
> Thanks!
> Carsten
>
>
>
> MLnick wrote
>> Currently I see the word2vec model is collected onto the master, so the
>> model itself is not distributed.
>>
>>
>> I guess the question is why do you need  a distributed model? Is the vocab
>> size so large that it's necessary? For model serving in general, unless
>> the model is truly massive (ie cannot fit into memory on a modern high end
>> box with 64, or 128GB ram) then single instance is way faster and simpler
>> (using a cluster of machines is more for load balancing / fault
>> tolerance).
>>
>>
>>
>>
>> What is your use case for model serving?
>>
>>
>> —
>> Sent from Mailbox
>>
>> On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh <duy.huynh.uiv@...> wrote:
>>
>>> you're right, serialization works.
>>> what is your suggestion on saving a "distributed" model?  so part of the
>>> model is in one cluster, and some other parts of the model are in other
>>> clusters.  during runtime, these sub-models run independently in their
>>> own
>>> clusters (load, train, save).  and at some point during run time these
>>> sub-models merge into the master model, which also loads, trains, and
>>> saves
>>> at the master level.
>>> much appreciated.
>>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks <evan.sparks@...> wrote:
>>>> There's some work going on to support PMML -
>>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
>>>> merged into master.
>>>>
>>>> What are you used to doing in other environments? In R I'm used to
>>>> running
>>>> save(), same with matlab. In python either pickling things or dumping to
>>>> json seems pretty common. (even the scikit-learn docs recommend pickling
>>>> -
>>>> http://scikit-learn.org/stable/modules/model_persistence.html). These
>>>> all
>>>> seem basically equivalent java serialization to me..
>>>>
>>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>>> something) make sense to add?
>>>>
>>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh <duy.huynh.uiv@...> wrote:
>>>>
>>>>> that works.  is there a better way in spark?  this seems like the most
>>>>> common feature for any machine learning work - to be able to save your
>>>>> model after training it and load it later.
>>>>>
>>>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks <evan.sparks@...> wrote:
>>>>>
>>>>>> Plain old java serialization is one straightforward approach if you're
>>>>>> in java/scala.
>>>>>>
>>>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll <duy.huynh.uiv@...> wrote:
>>>>>>
>>>>>>> what is the best way to save an mllib model that you just trained and
>>>>>>> reload
>>>>>>> it in the future?  specifically, i'm using the mllib word2vec
>>>>>>> model...
>>>>>>> thanks.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329p21517.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>



Re: word2vec: how to save an mllib model and reload it?

Posted by Carsten Schnober <sc...@ukp.informatik.tu-darmstadt.de>.
As a Spark newbie, I've come across this thread. I'm playing with Word2Vec in
our Hadoop cluster and here's my issue with classic Java serialization of
the model: I don't have SSH access to the cluster master node.  
Here's my code for computing the model:

    val input = sc.textFile("README.md").map(line => line.split(" ").toSeq)
    val word2vec = new Word2Vec();
    val model = word2vec.fit(input);
    val oos = new ObjectOutputStream(new FileOutputStream(modelFile));
    oos.writeObject(model);
    oos.close();

I can do that locally and get the file as desired. But that is of little use
for me if the file is stored on the master.

I've alternatively serialized the vectors to HDFS using this code:

    val vectors = model.getVectors;   
    val output = sc.parallelize(vectors.toSeq);
    output.saveAsObjectFile(modelFile);

Indeed, this results in a serialization on HDFS so I can access it as a
user. However, I have not figured out how to create a new Word2VecModel
object from those files.

Any clues?
Thanks!
Carsten
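One hedged sketch of an answer: read the object file back into a map and re-wrap it. The catch is that the `Word2VecModel` constructor is package-private in the Spark versions discussed here, so a helper like this (`Word2VecModelLoader` is a made-up name) has to be compiled inside the `org.apache.spark.mllib.feature` package, or the source has to be patched as Duy describes below.

```scala
package org.apache.spark.mllib.feature

import org.apache.spark.SparkContext

object Word2VecModelLoader {
  // Rebuilds a model saved via:
  //   sc.parallelize(model.getVectors.toSeq).saveAsObjectFile(path)
  def load(sc: SparkContext, path: String): Word2VecModel = {
    val vectors: Map[String, Array[Float]] =
      sc.objectFile[(String, Array[Float])](path).collect().toMap
    new Word2VecModel(vectors)
  }
}
```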



MLnick wrote
> Currently I see the word2vec model is collected onto the master, so the
> model itself is not distributed. 
> 
> 
> I guess the question is why do you need  a distributed model? Is the vocab
> size so large that it's necessary? For model serving in general, unless
> the model is truly massive (ie cannot fit into memory on a modern high end
> box with 64, or 128GB ram) then single instance is way faster and simpler
> (using a cluster of machines is more for load balancing / fault
> tolerance).
> 
> 
> 
> 
> What is your use case for model serving?
> 
> 
> —
> Sent from Mailbox
> 
> On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh <duy.huynh.uiv@...> wrote:
> 
>> you're right, serialization works.
>> what is your suggestion on saving a "distributed" model?  so part of the
>> model is in one cluster, and some other parts of the model are in other
>> clusters.  during runtime, these sub-models run independently in their
>> own
>> clusters (load, train, save).  and at some point during run time these
>> sub-models merge into the master model, which also loads, trains, and
>> saves
>> at the master level.
>> much appreciated.
>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks <evan.sparks@...> wrote:
>>> There's some work going on to support PMML -
>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
>>> merged into master.
>>>
>>> What are you used to doing in other environments? In R I'm used to
>>> running
>>> save(), same with matlab. In python either pickling things or dumping to
>>> json seems pretty common. (even the scikit-learn docs recommend pickling
>>> -
>>> http://scikit-learn.org/stable/modules/model_persistence.html). These
>>> all
>>> seem basically equivalent java serialization to me..
>>>
>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>> something) make sense to add?
>>>
>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh <duy.huynh.uiv@...> wrote:
>>>
>>>> that works.  is there a better way in spark?  this seems like the most
>>>> common feature for any machine learning work - to be able to save your
>>>> model after training it and load it later.
>>>>
>>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks <evan.sparks@...> wrote:
>>>>
>>>>> Plain old java serialization is one straightforward approach if you're
>>>>> in java/scala.
>>>>>
>>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll <duy.huynh.uiv@...> wrote:
>>>>>
>>>>>> what is the best way to save an mllib model that you just trained and
>>>>>> reload
>>>>>> it in the future?  specifically, i'm using the mllib word2vec
>>>>>> model...
>>>>>> thanks.
>>>>>>
>>>>>>
>>>>>>
>>>>>>

>>>>>>
>>>>>>
>>>>>
>>>>
>>>








Re: word2vec: how to save an mllib model and reload it?

Posted by Nick Pentreath <ni...@gmail.com>.
Currently I see the word2vec model is collected onto the master, so the model itself is not distributed. 


I guess the question is why you need a distributed model. Is the vocab size so large that it's necessary? For model serving in general, unless the model is truly massive (i.e. it cannot fit into memory on a modern high-end box with 64 or 128 GB of RAM), a single instance is way faster and simpler (a cluster of machines is more for load balancing / fault tolerance).




What is your use case for model serving?


—
Sent from Mailbox

On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh <du...@gmail.com> wrote:

> you're right, serialization works.
> what is your suggestion on saving a "distributed" model?  so part of the
> model is in one cluster, and some other parts of the model are in other
> clusters.  during runtime, these sub-models run independently in their own
> clusters (load, train, save).  and at some point during run time these
> sub-models merge into the master model, which also loads, trains, and saves
> at the master level.
> much appreciated.
> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks <ev...@gmail.com>
> wrote:
>> There's some work going on to support PMML -
>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
>> merged into master.
>>
>> What are you used to doing in other environments? In R I'm used to running
>> save(), same with MATLAB. In Python, either pickling things or dumping to
>> JSON seems pretty common (even the scikit-learn docs recommend pickling -
>> http://scikit-learn.org/stable/modules/model_persistence.html). These all
>> seem basically equivalent to Java serialization to me.
>>
>> Would some helper functions (in, say, mllib.util.modelpersistence or
>> something) make sense to add?
>>
>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh <du...@gmail.com>
>> wrote:
>>
>>> that works.  is there a better way in spark?  this seems like the most
>>> common feature for any machine learning work - to be able to save your
>>> model after training it and load it later.
>>>
>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks <ev...@gmail.com>
>>> wrote:
>>>
>>>> Plain old java serialization is one straightforward approach if you're
>>>> in java/scala.
>>>>
>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll <du...@gmail.com> wrote:
>>>>
>>>>> what is the best way to save an mllib model that you just trained and
>>>>> reload
>>>>> it in the future?  specifically, i'm using the mllib word2vec model...
>>>>> thanks.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>

Re: word2vec: how to save an mllib model and reload it?

Posted by Duy Huynh <du...@gmail.com>.
thanks nick.  i'll take a look at oryx and prediction.io.

re: private val model in word2vec ;) yes, i couldn't wait so i just changed
it in the word2vec source code.  but i'm running into some compilation
issues now.  hopefully i can fix them soon, to get things going.

On Fri, Nov 7, 2014 at 12:52 PM, Nick Pentreath <ni...@gmail.com>
wrote:

> For ALS if you want real time recs (and usually this is order 10s to a few
> 100s ms response), then Spark is not the way to go - a serving layer like
> Oryx, or prediction.io is what you want.
>
> (At graphflow we've built our own).
>
> You hold the factor matrices in memory and do the dot product in real time
> (with optional caching). Again, even for huge models (10s of millions
> users/items) this can be handled on a single, powerful instance. The issue
> at this scale is winnowing down the search space using LSH or similar
> approach to get to real time speeds.
>
> For word2vec it's pretty much the same thing as what you have is very
> similar to one of the ALS factor matrices.
>
> One problem is you can't access the word2vec vectors, as they are a private
> val. I think this should be changed actually, so that just the word vectors
> could be saved and used in a serving layer.
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox>
>
>
> On Fri, Nov 7, 2014 at 7:37 PM, Evan R. Sparks <ev...@gmail.com>
> wrote:
>
>> There are a few examples where this is the case. Let's take ALS, where

Re: word2vec: how to save an mllib model and reload it?

Posted by Nick Pentreath <ni...@gmail.com>.
For ALS, if you want real-time recs (and usually this means on the order of 10s to a few 100s of ms response), then Spark is not the way to go - a serving layer like Oryx, or prediction.io, is what you want.

(At graphflow we've built our own.)

You hold the factor matrices in memory and do the dot product in real time (with optional caching). Again, even for huge models (10s of millions of users/items) this can be handled on a single, powerful instance. The issue at this scale is winnowing down the search space using LSH or a similar approach to get to real-time speeds.

For word2vec it's pretty much the same thing, as what you have is very similar to one of the ALS factor matrices.

One problem is that you can't access the word2vec vectors, as they are a private val. I think this should actually be changed, so that just the word vectors could be saved and used in a serving layer.
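A minimal sketch of that in-memory serving pattern, outside Spark entirely (all names and shapes here are illustrative, not an actual Oryx or prediction.io API):

```scala
// Sketch: serve ALS-style recommendations from factor matrices held in
// memory on a single instance. Assumes the factors were exported somehow
// (e.g. saveAsObjectFile, then collected to flat files) and loaded into
// plain maps; everything here is an illustrative name, not a real API.
object InMemoryRecommender {
  // userId -> factor vector, itemId -> factor vector (both length k)
  type Factors = Map[Int, Array[Double]]

  def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  // Brute-force top-N scoring; at 10s of millions of items you would first
  // winnow the candidate set (LSH or similar), as noted above.
  def recommend(user: Int, n: Int, users: Factors, items: Factors): Seq[(Int, Double)] =
    users.get(user) match {
      case None => Seq.empty
      case Some(uf) =>
        items.toSeq
          .map { case (item, f) => (item, dot(uf, f)) }
          .sortBy(-_._2)
          .take(n)
    }
}
```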


—
Sent from Mailbox


Re: word2vec: how to save an mllib model and reload it?

Posted by Duy Huynh <du...@gmail.com>.
yep, but that's only if they are already represented as RDDs, which makes
saving and loading much more convenient.

my question is about the use case where they are not represented as RDDs yet.

in that case, do you think it makes sense to convert them into RDDs, just for
the convenience of saving and loading them in a distributed way?
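One possible shape for that conversion, if the model currently lives in driver memory as a map of weights, is to parallelize it purely for persistence. A sketch, where `sc` is an existing SparkContext and the model layout is made up for illustration:

```scala
// Hypothetical sketch: persist an in-memory model by converting it to an
// RDD, then rebuild the in-memory form on load. The Map[String, Array[Double]]
// shape is illustrative, not any particular MLlib model.
import org.apache.spark.SparkContext

def saveWeights(sc: SparkContext,
                weights: Map[String, Array[Double]],
                path: String): Unit = {
  // Convert to an RDD only so the data can be written in a distributed,
  // fault-tolerant way; use few partitions if the model is small.
  sc.parallelize(weights.toSeq, numSlices = 4).saveAsObjectFile(path)
}

def loadWeights(sc: SparkContext, path: String): Map[String, Array[Double]] =
  sc.objectFile[(String, Array[Double])](path).collect().toMap
```

Whether this is worth it depends on model size: for a model that fits comfortably on the driver, a single serialized file is simpler than the round-trip through an RDD.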


Re: word2vec: how to save an mllib model and reload it?

Posted by "Evan R. Sparks" <ev...@gmail.com>.
There are a few examples where this is the case. Let's take ALS, where the
result is a MatrixFactorizationModel, which is assumed to be big - the
model consists of two matrices, one (users x k) and one (k x products).
These are represented as RDDs.

You can save these RDDs out to disk by doing something like

model.userFeatures.saveAsObjectFile(...) and
model.productFeatures.saveAsObjectFile(...)

to save out to HDFS or Tachyon or S3.

Then, when you want to reload you'd have to instantiate them into a class
of MatrixFactorizationModel. That class is package private to MLlib right
now, so you'd need to copy the logic over to a new class, but that's the
basic idea.

That said - using spark to serve these recommendations on a point-by-point
basis might not be optimal. There's some work going on in the AMPLab to
address this issue.
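A rough sketch of that save/reload cycle (the `ReloadedModel` wrapper below is a made-up name; as noted above, the real class is package-private, so you'd copy its prediction logic into a class of your own):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Save: write the two factor RDDs out as object files
// (HDFS, Tachyon, or S3 paths all work here).
def saveModel(model: MatrixFactorizationModel, path: String): Unit = {
  model.userFeatures.saveAsObjectFile(path + "/userFeatures")
  model.productFeatures.saveAsObjectFile(path + "/productFeatures")
}

// Reload into a wrapper of your own, since MatrixFactorizationModel's
// constructor is package-private at this point. "ReloadedModel" is a
// hypothetical name; predict() copies the essential dot-product logic.
class ReloadedModel(val rank: Int,
                    val userFeatures: RDD[(Int, Array[Double])],
                    val productFeatures: RDD[(Int, Array[Double])]) extends Serializable {
  def predict(user: Int, product: Int): Double = {
    val uf = userFeatures.lookup(user).head
    val pf = productFeatures.lookup(product).head
    uf.zip(pf).map { case (u, p) => u * p }.sum
  }
}

def loadModel(sc: SparkContext, path: String, rank: Int): ReloadedModel =
  new ReloadedModel(rank,
    sc.objectFile[(Int, Array[Double])](path + "/userFeatures"),
    sc.objectFile[(Int, Array[Double])](path + "/productFeatures"))
```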


Re: word2vec: how to save an mllib model and reload it?

Posted by Duy Huynh <du...@gmail.com>.
you're right, serialization works.

what is your suggestion on saving a "distributed" model?  so part of the
model is in one cluster, and some other parts of the model are in other
clusters.  during runtime, these sub-models run independently in their own
clusters (load, train, save).  and at some point during run time these
sub-models merge into the master model, which also loads, trains, and saves
at the master level.

much appreciated.




Re: word2vec: how to save an mllib model and reload it?

Posted by "Evan R. Sparks" <ev...@gmail.com>.
There's some work going on to support PMML -
https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
merged into master.

What are you used to doing in other environments? In R I'm used to running
save(), same with matlab. In python either pickling things or dumping to
json seems pretty common. (even the scikit-learn docs recommend pickling -
http://scikit-learn.org/stable/modules/model_persistence.html). These all
seem basically equivalent to java serialization to me..

Would some helper functions (in, say, mllib.util.modelpersistence or
something) make sense to add?


Re: word2vec: how to save an mllib model and reload it?

Posted by Duy Huynh <du...@gmail.com>.
that works.  is there a better way in spark?  this seems like the most
common feature for any machine learning work - to be able to save your
model after training it and load it later.


Re: word2vec: how to save an mllib model and reload it?

Posted by "Evan R. Sparks" <ev...@gmail.com>.
Plain old java serialization is one straightforward approach if you're in
java/scala.
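For the word2vec case specifically, that could look like the following sketch (note this assumes Word2VecModel is actually java-serializable in your Spark version; treat it as illustrative rather than a guaranteed recipe):

```scala
// Sketch: plain java serialization of a trained MLlib word2vec model to a
// local file, and reading it back. Assumes Word2VecModel implements
// Serializable; error handling is omitted for brevity.
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.mllib.feature.Word2VecModel

def saveModel(model: Word2VecModel, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(model) finally out.close()
}

def loadModel(path: String): Word2VecModel = {
  val in = new ObjectInputStream(new FileInputStream(path))
  try in.readObject().asInstanceOf[Word2VecModel] finally in.close()
}
```

The loaded model can then be queried (e.g. `findSynonyms`) without retraining, as long as the class definition hasn't changed between save and load — java serialization is brittle across Spark version upgrades.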


Re: word2vec: how to save an mllib model and reload it?

Posted by sharad82 <kh...@gmail.com>.
I am having a problem serializing an ML word2vec model.

Am I doing something wrong?


http://stackoverflow.com/questions/37723308/spark-ml-word2vec-serialization-issues




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329p27137.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org