Posted to user@spark.apache.org by yh18190 <yh...@gmail.com> on 2014/09/12 10:39:53 UTC

Unable to ship external Python libraries in PYSPARK

Hi all,

I am currently working with PySpark for NLP processing and am using the TextBlob
Python library. In standalone mode it is easy to install external Python
libraries, but in cluster mode I am having trouble installing these libraries
on the worker nodes remotely. I cannot access each worker machine to install
them on its Python path. I tried the SparkContext pyFiles option to ship .zip
files, but the problem is that these Python packages need to be installed on
the worker machines. Could anyone let me know the different ways of doing this,
so that this library (TextBlob) is available on the Python path?





Re: Unable to ship external Python libraries in PYSPARK

Posted by Davies Liu <da...@databricks.com>.
Yes, sc.addFile() is what you want:

 |  addFile(self, path)
 |      Add a file to be downloaded with this Spark job on every node.
 |      The C{path} passed can be either a local file, a file in HDFS
 |      (or other Hadoop-supported filesystems), or an HTTP, HTTPS or
 |      FTP URI.
 |
 |      To access the file in Spark jobs, use
 |      L{SparkFiles.get(fileName)<pyspark.files.SparkFiles.get>} with the
 |      filename to find its download location.
 |
 |      >>> from pyspark import SparkFiles
 |      >>> path = os.path.join(tempdir, "test.txt")
 |      >>> with open(path, "w") as testFile:
 |      ...    testFile.write("100")
 |      >>> sc.addFile(path)
 |      >>> def func(iterator):
 |      ...    with open(SparkFiles.get("test.txt")) as testFile:
 |      ...        fileVal = int(testFile.readline())
 |      ...        return [x * fileVal for x in iterator]
 |      >>> sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()
 |      [100, 200, 300, 400]
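
For example, to ship a text file such as a stopword list and read it in your
tasks, a minimal sketch (the path, app name, and remove_stopwords() helper
below are illustrative):

from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="ship-textfile")
# The file is downloaded once per node into a job-scoped temp directory.
sc.addFile("/path/to/stopwords.txt")

def remove_stopwords(words):
    # Resolve the per-node download location by file name.
    with open(SparkFiles.get("stopwords.txt")) as f:
        stopwords = set(f.read().split())
    return [w for w in words if w not in stopwords]

rdd = sc.parallelize([["the", "quick", "brown", "fox"]])
print(rdd.map(remove_stopwords).collect())

For larger files, reading once per partition with mapPartitions (as in the
doctest above) avoids re-opening the file for every record.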

On Tue, Sep 16, 2014 at 7:02 PM, daijia <ji...@intsig.com> wrote:
> Is there some way to ship a text file, just like shipping Python libraries?
>
> Thanks in advance
> Daijia


Re: Unable to ship external Python libraries in PYSPARK

Posted by daijia <ji...@intsig.com>.
Is there some way to ship a text file, just like shipping Python libraries?

Thanks in advance
Daijia





Re: Unable to ship external Python libraries in PYSPARK

Posted by yh18190 <yh...@gmail.com>.
Hi Davies,

Thanks for the reply and for the effort you put into explaining the concepts,
and thanks for the example. It worked.





Re: Unable to ship external Python libraries in PYSPARK

Posted by Davies Liu <da...@databricks.com>.
With SparkContext.addPyFile("xx.zip"), xx.zip is copied to all the workers
and stored in a temporary directory, and the path to xx.zip is added to
sys.path on the worker machines, so you can "import xx" in your jobs; it does
not need to be installed on the worker machines.

PS: the package or module must be at the top level of xx.zip, or it cannot
be imported. For example:

daviesliu@dm:~/work/tmp$ zipinfo textblob.zip
Archive:  textblob.zip   3245946 bytes   517 files
drwxr-xr-x  3.0 unx        0 bx stor 12-Sep-14 10:10 textblob/
-rw-r--r--  3.0 unx      203 tx defN 12-Sep-14 10:10 textblob/__init__.py
-rw-r--r--  3.0 unx      563 bx defN 12-Sep-14 10:10 textblob/__init__.pyc
-rw-r--r--  3.0 unx    61510 tx defN 12-Sep-14 10:10 textblob/_text.py
-rw-r--r--  3.0 unx    68316 bx defN 12-Sep-14 10:10 textblob/_text.pyc
-rw-r--r--  3.0 unx     2962 tx defN 12-Sep-14 10:10 textblob/base.py
-rw-r--r--  3.0 unx     5501 bx defN 12-Sep-14 10:10 textblob/base.pyc
-rw-r--r--  3.0 unx    27621 tx defN 12-Sep-14 10:10 textblob/blob.py

You can get this textblob.zip by running:

pip install textblob
cd /xxx/xx/site-packages/
zip -r path_to_store/textblob.zip textblob
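
Once the zip has been shipped with addPyFile, you can import textblob inside
your tasks. A minimal sketch (the path, app name, and sentiment() helper below
are illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="ship-textblob")
# Copied to every worker and added to sys.path there; no install needed.
sc.addPyFile("/path/to/textblob.zip")

def sentiment(text):
    # Import inside the function so the import happens on the worker,
    # after the zip is on sys.path.
    from textblob import TextBlob
    return TextBlob(text).sentiment.polarity

texts = sc.parallelize(["Spark is great", "this is terrible"])
print(texts.map(sentiment).collect())

The same zip can also be passed via the pyFiles argument of the SparkContext
constructor or with spark-submit --py-files.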

Davies


On Fri, Sep 12, 2014 at 1:39 AM, yh18190 <yh...@gmail.com> wrote:
> Hi all,
>
> I am currently working with PySpark for NLP processing and am using the TextBlob
> Python library. In standalone mode it is easy to install external Python
> libraries, but in cluster mode I am having trouble installing these libraries
> on the worker nodes remotely. I cannot access each worker machine to install
> them on its Python path. I tried the SparkContext pyFiles option to ship .zip
> files, but the problem is that these Python packages need to be installed on
> the worker machines. Could anyone let me know the different ways of doing this,
> so that this library (TextBlob) is available on the Python path?