Posted to issues@spark.apache.org by "zhengruifeng (Jira)" <ji...@apache.org> on 2022/02/09 12:43:00 UTC

[jira] [Commented] (SPARK-34160) pyspark.ml.stat.Summarizer should allow sparse vector results

    [ https://issues.apache.org/jira/browse/SPARK-34160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489518#comment-17489518 ] 

zhengruifeng commented on SPARK-34160:
--------------------------------------

You can get a sparse vector by calling vector.compressed
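
In case `compressed` is not exposed on the Python vector classes in your PySpark version (it is a method on the Scala `Vector` trait), a rough workaround is to rebuild the sparse form by hand from the dense values. The sketch below is only an illustration: `dense_to_sparse_parts` is a hypothetical helper, not part of Spark, and a plain Python list stands in for the dense result so that the snippet runs without a Spark installation.

```python
from typing import Dict, Sequence, Tuple


def dense_to_sparse_parts(values: Sequence[float]) -> Tuple[int, Dict[int, float]]:
    """Hypothetical helper: return (size, {index: value}) for the non-zero entries."""
    nonzero = {i: float(v) for i, v in enumerate(values) if v != 0.0}
    return len(values), nonzero


# Stand-in for the dense mean from the report: size 100, only index 1 is set.
# With a real DenseVector, dense.toArray() would supply these values.
dense_mean = [0.0] * 100
dense_mean[1] = 1.0

size, nonzero = dense_to_sparse_parts(dense_mean)
print(size, nonzero)

# With PySpark installed, the parts map onto the constructor used in the report:
#   from pyspark.ml.linalg import SparseVector
#   sparse_mean = SparseVector(size, nonzero)
```

This conversion runs on the driver after the aggregate has been collected, so it does not change what Summarizer itself returns.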

> pyspark.ml.stat.Summarizer should allow sparse vector results
> -------------------------------------------------------------
>
>                 Key: SPARK-34160
>                 URL: https://issues.apache.org/jira/browse/SPARK-34160
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 3.0.1
>            Reporter: Ophir Yoktan
>            Priority: Major
>
> Currently, pyspark.ml.stat.Summarizer returns a dense vector even when the input is sparse.
> The Summarizer should either infer the result type from the input or accept a parameter that forces sparse output.
> Code to reproduce:
> {{import pyspark}}
> {{from pyspark.sql.functions import col}}
> {{from pyspark.ml.stat import Summarizer}}
> {{from pyspark.ml.linalg import SparseVector, DenseVector}}
> {{sc = pyspark.SparkContext.getOrCreate()}}
> {{sql_context = pyspark.SQLContext(sc)}}
> {{df = sc.parallelize([ ( SparseVector(100, \{1: 1.0}),)]).toDF(['v'])}}
> {{print(df.head())}}
> {{print(df.select(Summarizer.mean(col('v'))).head())}}
> Output:
> {{Row(v=SparseVector(100, \{1: 1.0})) }}
> {{Row(mean(v)=DenseVector([0.0, 1.0,}}
> {{0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))}}


