Posted to issues@spark.apache.org by "Perry Chu (JIRA)" <ji...@apache.org> on 2018/06/27 00:30:00 UTC

[jira] [Updated] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context

     [ https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Perry Chu updated SPARK-24447:
------------------------------
    Priority: Minor  (was: Major)

> Pyspark RowMatrix.columnSimilarities() loses spark context
> ----------------------------------------------------------
>
>                 Key: SPARK-24447
>                 URL: https://issues.apache.org/jira/browse/SPARK-24447
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 2.3.0
>            Reporter: Perry Chu
>            Priority: Minor
>
> The RDD behind the CoordinateMatrix returned by RowMatrix.columnSimilarities() appears to lose track of the SparkContext.
> I'm fairly new to Spark, and I'm not sure whether the problem is on the Python side or the Scala side; I'd appreciate someone more experienced taking a look.
> This snippet should reproduce the error:
> {code:java}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
> matrix = RowMatrix(rows)
> sims = matrix.columnSimilarities()
> ## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
> print(sims.numRows(),sims.numCols())
> ## This throws an error (stack trace below)
> print(sims.entries.first())
> ## Later I tried this
> print(rows.context) #<SparkContext master=yarn appName=Spark ML Pipeline>
> print(sims.entries.context) #<SparkContext master=yarn appName=PySparkShell>, then throws an error{code}
> Error stack trace
> {code:java}
> ---------------------------------------------------------------------------
> AttributeError Traceback (most recent call last)
> <ipython-input-47-50f83a6cf449> in <module>()
> ----> 1 sims.entries.first()
> /usr/lib/spark/python/pyspark/rdd.py in first(self)
>    1374             ValueError: RDD is empty
>    1375         """
> -> 1376         rs = self.take(1)
>    1377         if rs:
>    1378             return rs[0]
> /usr/lib/spark/python/pyspark/rdd.py in take(self, num)
>    1356
>    1357             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
> -> 1358             res = self.context.runJob(self, takeUpToNumLeft, p)
>    1359
>    1360             items += res
> /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
>     999         # SparkContext#runJob.
>    1000         mappedRDD = rdd.mapPartitions(partitionFunc)
> -> 1001         port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
>    1002         return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
>    1003
> AttributeError: 'NoneType' object has no attribute 'sc'
> {code}
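> The final AttributeError is Python's generic failure mode when the SparkContext an RDD points at has `_jsc` set to `None` (as happens for a stopped or mismatched context). A minimal Spark-free sketch of that mechanism, using a hypothetical stand-in class (not the real pyspark API):

```python
class FakeSparkContext:
    """Hypothetical stand-in for pyspark.SparkContext (illustration only).

    In PySpark, a SparkContext's _jsc can be None (e.g. after stop());
    an RDD wired to such a context fails exactly like the trace above
    when runJob dereferences self._jsc.sc().
    """

    def __init__(self, jsc=None):
        self._jsc = jsc  # None mimics a stopped or mismatched context

    def run_job(self):
        # Same attribute-access shape as the failing line in context.py
        return self._jsc.sc()


ctx = FakeSparkContext()
try:
    ctx.run_job()
except AttributeError as e:
    print(e)  # -> 'NoneType' object has no attribute 'sc'
```

> This reproduces the same message as the trace, which is consistent with the entries RDD holding a reference to a context other than the live one (note the differing appName values printed above).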
> PySpark columnSimilarities documentation:
> http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org