Posted to issues@spark.apache.org by "Stefan Panayotov (JIRA)" <ji...@apache.org> on 2016/06/16 15:13:05 UTC

[jira] [Commented] (SPARK-13610) Create a Transformer to disassemble vectors in DataFrames

    [ https://issues.apache.org/jira/browse/SPARK-13610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333968#comment-15333968 ] 

Stefan Panayotov commented on SPARK-13610:
------------------------------------------

We need a more general version of a Transformer to disassemble vectors. The need arises from the fact that ML models (e.g. Decision Tree) can't accept labels that are vectors; currently they require a column of Double. So, whenever we need to predict multidimensional labels, we need to decompose a vector column (e.g. a vector of size J) into J columns of Double and then apply a Decision Tree pipeline to each of these columns. We have code working around this for our specific case. Here is an example:
import java.sql.Timestamp
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType, TimestampType}

// Expand the vector in column 24 into one Double entry per coefficient,
// appended after the four fixed columns.
val rddC = featPlus_pastH.map(x => {
  val cArray = x(24).asInstanceOf[org.apache.spark.mllib.linalg.Vector].toArray
  val rowArray: ArrayBuffer[Any] = ArrayBuffer()
  rowArray += x(0).toString
  rowArray += x(1).toString
  rowArray += x(7).asInstanceOf[Timestamp]
  rowArray += x(2).toString
  // all columns

  for (coef <- cArray) {
    rowArray += coef
  }

  Row.fromSeq(rowArray.toSeq)
})

// Derive the schema from the width of the first row (rather than parsing
// Row.toString, which breaks if a value contains a comma or bracket):
// four fixed columns, then one Double column per vector coefficient.
val row = rddC.take(1)(0)
val structFields = (0 until row.size).map {
  case 0 => StructField("norm_net_code_c", StringType, true)
  case 1 => StructField("audience_c", StringType, true)
  case 2 => StructField("StartTimestampDay_c", TimestampType, true)
  case 3 => StructField("rating_category_cd_c", StringType, true)
  case i => StructField("Coef" + (i - 3), DoubleType, true)
}.toArray

val schema = StructType(structFields)

// incorp. partitioning rules
val dfC = sqlContext.createDataFrame(rddC, schema)
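A more general disassembler could derive one column per vector index automatically rather than hard-coding the fixed columns. Setting the Spark wiring aside, the core mapping is simple; here is a minimal sketch in plain Scala (the names and the commented `withColumn` usage are illustrative, not a proposed API):

```scala
// Map a vector's values to (columnName, value) pairs, one per index.
// In Spark this would drive a series of withColumn calls, e.g. (hypothetical):
//   (0 until size).foldLeft(df)((d, i) => d.withColumn(s"$prefix$i", elemUdf(i)(d(col))))
def disassemble(prefix: String, values: Array[Double]): Seq[(String, Double)] =
  values.zipWithIndex.map { case (v, i) => (s"$prefix$i", v) }.toSeq
```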



> Create a Transformer to disassemble vectors in DataFrames
> ---------------------------------------------------------
>
>                 Key: SPARK-13610
>                 URL: https://issues.apache.org/jira/browse/SPARK-13610
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, SQL
>    Affects Versions: 1.6.0
>            Reporter: Andrew MacKinlay
>            Priority: Minor
>
> It is possible to convert a standalone numeric field into a single-item Vector using VectorAssembler. However, the inverse operation of retrieving a single item from a vector and turning it into a field doesn't appear to be possible. The workaround I've found is to leave the raw field value in the DF, but I have found no other way to get a field out of a vector (e.g. to perform arithmetic on it). Happy to be proved wrong though. Creating a user-defined function doesn't work (in Python at least; it raises a PickleException). This seems like a simple operation that should be supported for various use cases.
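
For the single-element case described above, the extraction itself is just positional access; a minimal Scala sketch (the UDF wrapping shown in the comment is a hypothetical illustration, not a confirmed workaround for the Python pickling issue):

```scala
// Pure element accessor. In Spark Scala this could be wrapped in a UDF, e.g. (hypothetical):
//   val vectorElement = udf((v: org.apache.spark.mllib.linalg.Vector) => v(0))
//   df.withColumn("x0", vectorElement(df("features")))
def element(values: Array[Double], i: Int): Double = values(i)
```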



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org