Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/04/02 00:25:00 UTC
[jira] [Commented] (SPARK-27335) cannot collect() from Correlation.corr
[ https://issues.apache.org/jira/browse/SPARK-27335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16807312#comment-16807312 ]
Hyukjin Kwon commented on SPARK-27335:
--------------------------------------
Can you show how to reproduce this from scratch? It seems I can't reproduce it against either the current master or Spark 2.4.
{code}
>>> import pyspark
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.stat import Correlation
>>> spark = pyspark.sql.SparkSession.builder.getOrCreate()
>>> dataset = [[Vectors.dense([1, 0, 0, -2])],
... [Vectors.dense([4, 5, 0, 3])],
... [Vectors.dense([6, 7, 0, 8])],
... [Vectors.dense([9, 0, 0, 1])]]
>>> dataset = spark.createDataFrame(dataset, ['features'])
>>> df = Correlation.corr(dataset, 'features', 'pearson')
19/04/02 09:22:27 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
19/04/02 09:22:27 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
19/04/02 09:22:27 WARN PearsonCorrelation: Pearson correlation matrix contains NaN values.
>>> df.collect()
[Row(pearson(features)=DenseMatrix(4, 4, [1.0, 0.0556, nan, 0.4005, 0.0556, 1.0, nan, 0.9136, nan, nan, 1.0, nan, 0.4005, 0.9136, nan, 1.0], False))]
{code}
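For what it's worth, the traceback in the report fails inside {{SCCallSiteSync.__enter__}} because {{self._context._jsc}} is {{None}}, which is what happens when the SparkContext a DataFrame was created with has since been stopped (e.g. {{spark.stop()}} followed by a fresh {{getOrCreate()}}) while the DataFrame still holds the old context. A minimal plain-Python sketch of that failure mode, using hypothetical stand-in classes ({{FakeContext}}, {{FakeDataFrame}}, {{JscHandle}} are illustrations, not the real pyspark internals):

```python
# Hypothetical stand-ins illustrating the reported failure mode: a
# DataFrame keeps a reference to the context it was created with, and
# stopping that context nulls out its JVM handle (_jsc), so any later
# access through it raises AttributeError on NoneType.

class JscHandle:
    """Stands in for the JVM-side SparkContext handle (assumption)."""
    def setCallSite(self, site):
        return site

class FakeContext:
    """Stands in for pyspark's SparkContext (assumption, not real API)."""
    def __init__(self):
        self._jsc = JscHandle()
    def stop(self):
        self._jsc = None  # pyspark similarly nulls _jsc on stop()

class FakeDataFrame:
    """Holds on to whatever context existed at creation time."""
    def __init__(self, sc):
        self._sc = sc
    def collect(self):
        # Mirrors SCCallSiteSync.__enter__: self._context._jsc.setCallSite(...)
        self._sc._jsc.setCallSite("collect")
        return []

sc = FakeContext()
df = FakeDataFrame(sc)
df.collect()   # works while the original context is alive
sc.stop()      # tear down the context the DataFrame still references
try:
    df.collect()
except AttributeError as e:
    print(e)   # 'NoneType' object has no attribute 'setCallSite'
```

If the reporter's notebook restarted or stopped the session between creating {{dataset}} and calling {{collect()}}, that would produce exactly this AttributeError, which would also explain why re-pointing the DataFrame at the live session (as in the workaround below) makes {{collect()}} succeed.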
> cannot collect() from Correlation.corr
> --------------------------------------
>
> Key: SPARK-27335
> URL: https://issues.apache.org/jira/browse/SPARK-27335
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.4.0
> Reporter: Natalino Busa
> Priority: Major
>
> Reproducing the bug from the example in the documentation:
>
>
> {code:java}
> import pyspark
> from pyspark.ml.linalg import Vectors
> from pyspark.ml.stat import Correlation
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> dataset = [[Vectors.dense([1, 0, 0, -2])],
>            [Vectors.dense([4, 5, 0, 3])],
>            [Vectors.dense([6, 7, 0, 8])],
>            [Vectors.dense([9, 0, 0, 1])]]
> dataset = spark.createDataFrame(dataset, ['features'])
> df = Correlation.corr(dataset, 'features', 'pearson')
> df.collect()
>
> {code}
> This produces the following stack trace:
>
> {code:java}
> ---------------------------------------------------------------------------
> AttributeError Traceback (most recent call last)
> <ipython-input-92-e7889fa5d198> in <module>()
> 11 dataset = spark.createDataFrame(dataset, ['features'])
> 12 df = Correlation.corr(dataset, 'features', 'pearson')
> ---> 13 df.collect()
> /opt/spark/python/pyspark/sql/dataframe.py in collect(self)
> 530 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
> 531 """
> --> 532 with SCCallSiteSync(self._sc) as css:
> 533 sock_info = self._jdf.collectToPython()
> 534 return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))
> /opt/spark/python/pyspark/traceback_utils.py in __enter__(self)
> 70 def __enter__(self):
> 71 if SCCallSiteSync._spark_stack_depth == 0:
> ---> 72 self._context._jsc.setCallSite(self._call_site)
> 73 SCCallSiteSync._spark_stack_depth += 1
> 74
> AttributeError: 'NoneType' object has no attribute 'setCallSite'{code}
>
>
> Analysis:
> Somehow the DataFrame properties `df.sql_ctx.sparkSession._jsparkSession` and `spark._jsparkSession` do not match the ones held by the active Spark session.
> The following code works around the problem (I hope this helps you narrow down the root cause):
>
> {code:java}
> df.sql_ctx.sparkSession._jsparkSession = spark._jsparkSession
> df._sc = spark._sc
> df.collect()
> >>> [Row(pearson(features)=DenseMatrix(4, 4, [1.0, 0.0556, nan, 0.4005, 0.0556, 1.0, nan, 0.9136, nan, nan, 1.0, nan, 0.4005, 0.9136, nan, 1.0], False))]{code}
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org