Posted to issues@spark.apache.org by "zhengruifeng (Jira)" <ji...@apache.org> on 2022/02/09 12:43:00 UTC
[jira] [Commented] (SPARK-34160) pyspark.ml.stat.Summarizer should allow sparse vector results
[ https://issues.apache.org/jira/browse/SPARK-34160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489518#comment-17489518 ]
zhengruifeng commented on SPARK-34160:
--------------------------------------
You can get a sparse vector by calling vector.compressed
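As a rough illustration of what that conversion amounts to, here is a minimal sketch in plain Python. It assumes nothing about whether the PySpark vector classes expose `compressed` themselves; the helper name `dense_to_sparse_parts` is hypothetical and only shows how the nonzero entries of a dense result could be collected into the (size, indices, values) form that the `SparseVector` constructor accepts.

```python
from typing import List, Sequence, Tuple


def dense_to_sparse_parts(values: Sequence[float]) -> Tuple[int, List[int], List[float]]:
    """Return (size, indices, nonzero_values) for a dense sequence of floats.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    # Keep only the positions whose entries are nonzero.
    indices = [i for i, x in enumerate(values) if x != 0.0]
    nonzeros = [float(values[i]) for i in indices]
    return len(values), indices, nonzeros


# Made-up dense values shaped like the mean(v) result in the report below.
dense_mean = [0.0, 1.0] + [0.0] * 98
size, indices, nonzeros = dense_to_sparse_parts(dense_mean)
print(size, indices, nonzeros)

# With PySpark installed, the parts map onto the SparseVector constructor:
#   from pyspark.ml.linalg import SparseVector
#   sparse_mean = SparseVector(size, indices, nonzeros)
# A DenseVector can be passed in via its toArray() method.
```

This converts on the driver after the summary has been collected, so it only changes the representation of the result and does not affect how the Summarizer computes it.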
> pyspark.ml.stat.Summarizer should allow sparse vector results
> -------------------------------------------------------------
>
> Key: SPARK-34160
> URL: https://issues.apache.org/jira/browse/SPARK-34160
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Affects Versions: 3.0.1
> Reporter: Ophir Yoktan
> Priority: Major
>
> Currently, pyspark.ml.stat.Summarizer returns a dense vector even if the input is sparse.
> The Summarizer should either deduce the output type from the input or accept a parameter that forces sparse output.
> Code to reproduce:
> import pyspark
> from pyspark.sql.functions import col
> from pyspark.ml.stat import Summarizer
> from pyspark.ml.linalg import SparseVector, DenseVector
> sc = pyspark.SparkContext.getOrCreate()
> sql_context = pyspark.SQLContext(sc)
> df = sc.parallelize([(SparseVector(100, {1: 1.0}),)]).toDF(['v'])
> print(df.head())
> print(df.select(Summarizer.mean(col('v'))).head())
> ouput:
> {{Row(v=SparseVector(100, \{1: 1.0})) }}
> {{Row(mean(v)=DenseVector([0.0, 1.0,}}
> {{0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))}}