Posted to issues@spark.apache.org by "Samantha Zeitlin (Jira)" <ji...@apache.org> on 2021/04/09 17:56:00 UTC

[jira] [Commented] (SPARK-27335) cannot collect() from Correlation.corr

    [ https://issues.apache.org/jira/browse/SPARK-27335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318171#comment-17318171 ] 

Samantha Zeitlin commented on SPARK-27335:
------------------------------------------

I'm seeing this on Spark 3.0.1, and I'm not using Correlation.corr at all. Here's the full traceback:

```
Traceback (most recent call last):
  File "/Users/szeitlin/Radically_Different_Data_Science/Tribe/Bison/code/dbx-notebooks/gopher_changes/test_gopher_changes.py", line 37, in test_get_batch_meters
    row = output.head()
  File "/Users/szeitlin/anaconda3/envs/dbconnect/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 1369, in head
    rs = self.head(1)
  File "/Users/szeitlin/anaconda3/envs/dbconnect/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 1371, in head
    return self.take(n)
  File "/Users/szeitlin/anaconda3/envs/dbconnect/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 657, in take
    return self.limit(num).collect()
  File "/Users/szeitlin/anaconda3/envs/dbconnect/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 610, in collect
    with SCCallSiteSync(self._sc) as css:
  File "/Users/szeitlin/anaconda3/envs/dbconnect/lib/python3.7/site-packages/pyspark/traceback_utils.py", line 72, in __enter__
    self._context._jsc.setCallSite(self._call_site)
AttributeError: 'NoneType' object has no attribute 'setCallSite'
```

I think this may be related to whether other Spark contexts exist at the time: I've only seen it when I had a notebook running while also trying to run tests. It would be nice if Spark were a little smarter about knowing (or asking) which Spark context to use, or about shutting down extras, when more than one is available.
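
A minimal sketch of how a test could detect this condition before calling `head()` or `collect()`, assuming the failure really is a DataFrame bound to a context whose JVM handle is gone. It relies only on the private attributes `_sc` and `_jsc` that appear in the traceback above; as far as I know `SparkContext.stop()` sets `_jsc` to `None`, but that is an assumption and may differ under dbconnect. The helper names `has_live_spark_context` and `head_or_explain` are hypothetical, not part of PySpark.

```python
def has_live_spark_context(df):
    """Return True if the DataFrame's SparkContext still holds a JVM handle.

    Uses the private attributes `_sc` and `_jsc` seen in the traceback;
    these are implementation details and may change between versions.
    """
    sc = getattr(df, "_sc", None)
    return getattr(sc, "_jsc", None) is not None


def head_or_explain(df):
    """Call df.head(), raising a clearer error if the context looks dead."""
    if not has_live_spark_context(df):
        raise RuntimeError(
            "DataFrame is bound to a SparkContext with no JVM handle "
            "(_jsc is None); recreate it from the currently active session."
        )
    return df.head()
```

In a test, `row = head_or_explain(output)` in place of `row = output.head()` would at least turn the `AttributeError` into a message that points at the stale context.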

> cannot collect() from Correlation.corr
> --------------------------------------
>
>                 Key: SPARK-27335
>                 URL: https://issues.apache.org/jira/browse/SPARK-27335
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.0
>            Reporter: Natalino Busa
>            Priority: Major
>
> reproducing the bug from the example in the documentation:
>  
>  
> {code:java}
> import pyspark
> from pyspark.ml.linalg import Vectors
> from pyspark.ml.stat import Correlation
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> dataset = [[Vectors.dense([1, 0, 0, -2])],
>  [Vectors.dense([4, 5, 0, 3])],
>  [Vectors.dense([6, 7, 0, 8])],
>  [Vectors.dense([9, 0, 0, 1])]]
> dataset = spark.createDataFrame(dataset, ['features'])
> df = Correlation.corr(dataset, 'features', 'pearson')
> df.collect()
>  
> {code}
> This produces the following stack trace:
>  
> {code:java}
> ---------------------------------------------------------------------------
> AttributeError                            Traceback (most recent call last)
> <ipython-input-92-e7889fa5d198> in <module>()
>      11 dataset = spark.createDataFrame(dataset, ['features'])
>      12 df = Correlation.corr(dataset, 'features', 'pearson')
> ---> 13 df.collect()
> /opt/spark/python/pyspark/sql/dataframe.py in collect(self)
>     530         [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
>     531         """
> --> 532         with SCCallSiteSync(self._sc) as css:
>     533             sock_info = self._jdf.collectToPython()
>     534         return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))
> /opt/spark/python/pyspark/traceback_utils.py in __enter__(self)
>      70     def __enter__(self):
>      71         if SCCallSiteSync._spark_stack_depth == 0:
> ---> 72             self._context._jsc.setCallSite(self._call_site)
>      73         SCCallSiteSync._spark_stack_depth += 1
>      74 
> AttributeError: 'NoneType' object has no attribute 'setCallSite'{code}
>  
>  
> Analysis:
> Somehow the DataFrame's `df.sql_ctx.sparkSession._jsparkSession` and `df._sc` do not match the `spark._jsparkSession` and `spark._sc` of the current Spark session.
> The following code fixes the problem (I hope this helps you narrow down the root cause):
>  
> {code:java}
> df.sql_ctx.sparkSession._jsparkSession = spark._jsparkSession
> df._sc = spark._sc
> df.collect()
> >>> [Row(pearson(features)=DenseMatrix(4, 4, [1.0, 0.0556, nan, 0.4005, 0.0556, 1.0, nan, 0.9136, nan, nan, 1.0, nan, 0.4005, 0.9136, nan, 1.0], False))]{code}
>  


