Posted to user@spark.apache.org by Nick Pentreath <ni...@gmail.com> on 2016/09/08 07:12:20 UTC

Re: How to convert an ArrayType to DenseVector within DataFrame?

You can use a udf like this:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.12 (default, Jul  2 2016 17:43:17)
SparkSession available as 'spark'.

In [1]: from pyspark.ml.regression import LinearRegression

In [2]: from pyspark.sql.functions import udf

In [3]: from pyspark.ml.linalg import Vectors, VectorUDT

In [4]: df = spark.createDataFrame([(2.3, [1.0, 2.0, 3.0]),
   ...:     (6.5, [2.0, 5.0, 1.0]), (4.3, [7.0, 4.0, 2.0])],
   ...:     ["label", "array"])

In [5]: df.printSchema()
root
 |-- label: double (nullable = true)
 |-- array: array (nullable = true)
 |    |-- element: double (containsNull = true)


In [6]: to_vector = udf(lambda a: Vectors.dense(a), VectorUDT())

In [7]: data = df.select("label", to_vector("array").alias("features"))

In [8]: data.printSchema()
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)


In [9]: lr = LinearRegression()

In [10]: lr.fit(data).transform(data).show()

+-----+-------------+------------------+
|label|     features|        prediction|
+-----+-------------+------------------+
|  2.3|[1.0,2.0,3.0]|2.3000000000000003|
|  6.5|[2.0,5.0,1.0]|6.5000000000000036|
|  4.3|[7.0,4.0,2.0]| 4.299999999999997|
+-----+-------------+------------------+


On Tue, 30 Aug 2016 at 20:45 evanzamir <za...@gmail.com> wrote:

> I have a DataFrame with a column containing a list of numeric features to
> be
> used for a regression. When I run the regression, I get the following
> error:
>
> *pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column
> features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but
> was actually ArrayType(DoubleType,true).'*
> It would be nice if Spark could automatically convert the type, but
> assuming
> that isn't possible, what's the easiest way for me to do the conversion?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-convert-an-ArrayType-to-DenseVector-within-DataFrame-tp27625.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>