
Posted to user@spark.apache.org by Asim Jalis <as...@gmail.com> on 2016/01/16 01:02:21 UTC

How To Save TF-IDF Model In PySpark

Hi,

I am trying to save a TF-IDF model in PySpark. Looks like this is not
supported.

Using `model.save()` causes:

AttributeError: 'IDFModel' object has no attribute 'save'

Using `pickle` causes:

TypeError: can't pickle lock objects

Does anyone have suggestions?

Thanks!

Asim

Here is the full repro. Start the pyspark shell and then run this code
in it.

```
# Imports (sc is provided by the pyspark shell)
from pyspark.mllib.feature import IDF
from pyspark.mllib.linalg import Vectors

# Create some data
n = 4
freqs = [
    Vectors.sparse(n, (1, 3), (1.0, 2.0)),
    Vectors.dense([0.0, 1.0, 2.0, 3.0]),
    Vectors.sparse(n, [1], [1.0])]
data = sc.parallelize(freqs)
idf = IDF()
model = idf.fit(data)
tfidf = model.transform(data)

# View
for r in tfidf.collect(): print(r)

# Try to save it
model.save("foo.model")

# Try to save it with Pickle
import pickle
pickle.dump(model, open("model.p", "wb"))
pickle.dumps(model)
```
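
A note on the pickle error: it is probably not specific to IDF. The
Python-side model appears to be a thin wrapper around a JVM object, and
the things it references (the Py4J gateway, the SparkContext) hold
thread locks, which pickle refuses to serialize. Here is a minimal
sketch of that failure mode without Spark. `FakeModel` and `try_pickle`
are hypothetical names for illustration, not Spark classes.

```python
import pickle
import threading


class FakeModel(object):
    """Hypothetical stand-in for a wrapper object that holds a lock."""

    def __init__(self, weights):
        self.weights = weights          # plain data: picklable
        self._lock = threading.Lock()   # runtime handle: not picklable


def try_pickle(obj):
    """Return the pickled bytes, or None if pickle raises TypeError."""
    try:
        return pickle.dumps(obj)
    except TypeError:
        return None


model = FakeModel([1.0, 0.5, 0.25])

# Pickling the whole object fails because of the lock it carries.
whole = try_pickle(model)

# Pickling only the plain data it holds works.
data_only = try_pickle(model.weights)
```

So the practical route is to pull the plain numbers out of the model
and persist those, rather than the wrapper object itself.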

Re: How To Save TF-IDF Model In PySpark

Posted by Jerry Lam <ch...@gmail.com>.

Can you save it to parquet with the vector in one field?
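
One reading of this suggestion: the fitted model boils down to a single
vector of IDF weights, so saving that vector is enough to rebuild the
transform later. In PySpark the vector should be available from
`model.idf()` (assuming Spark 1.4 or later). The sketch below runs
without Spark. It recomputes the weights with the formula from the MLlib
docs, log((m + 1) / (df + 1)), on the same data as the repro, and it
uses a JSON file as a stand-in for the Parquet field. `fit_idf`,
`save_idf`, `load_idf`, and `transform` are hypothetical helper names.

```python
import json
import math
import os
import tempfile


def fit_idf(docs):
    """IDF weight per term: log((m + 1) / (df + 1)), m = number of docs."""
    m = len(docs)
    n = len(docs[0])
    df = [sum(1 for d in docs if d[j] > 0) for j in range(n)]
    return [math.log((m + 1.0) / (c + 1.0)) for c in df]


def save_idf(weights, path):
    """Persist the weight vector as one field of a JSON document."""
    with open(path, "w") as f:
        json.dump({"idf": weights}, f)


def load_idf(path):
    """Read the weight vector back."""
    with open(path) as f:
        return json.load(f)["idf"]


def transform(weights, tf):
    """Scale a term-frequency vector elementwise by the IDF weights."""
    return [w * x for w, x in zip(weights, tf)]


# Same three vectors as the repro, written out densely.
docs = [
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 1.0, 0.0, 0.0],
]

weights = fit_idf(docs)
path = os.path.join(tempfile.mkdtemp(), "idf.json")
save_idf(weights, path)

# Later, or in another process: reload and apply.
restored = load_idf(path)
tfidf = [transform(restored, d) for d in docs]
```

With Spark itself, the same idea would be to put the `model.idf()`
values into a one-row DataFrame and write that to Parquet, then multiply
the term-frequency vectors by the reloaded weights.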


Re: How To Save TF-IDF Model In PySpark

Posted by Andy Davidson <An...@SantaCruzIntegration.com>.

Are you using 1.6.0 or an older version?

I think I remember something in 1.5.1 saying save was not implemented in
Python.


The current doc does not say anything about save()
http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf

http://spark.apache.org/docs/latest/ml-guide.html#saving-and-loading-pipelines
"Often times it is worth it to save a model or a pipeline to disk for later
use. In Spark 1.6, a model import/export functionality was added to the
Pipeline API. Most basic transformers are supported as well as some of the
more basic ML models. Please refer to the algorithm's API documentation to
see if saving and loading is supported."

andy



