Posted to user@spark.apache.org by apu <ap...@gmail.com> on 2017/02/15 00:11:32 UTC
PySpark: use one column to index another (udf of two columns?)
Let's say I have a Spark (PySpark) DataFrame with the following schema:
root
 |-- myarray: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- myindices: array (nullable = true)
 |    |-- element: integer (containsNull = true)
It looks like:
+------------+---------+
|     myarray|myindices|
+------------+---------+
|         [A]|      [0]|
|      [B, C]|      [1]|
|[D, E, F, G]|   [0, 2]|
+------------+---------+
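(For anyone who wants to reproduce this, here is a minimal sketch that builds the DataFrame above; the explicit schema is only there so the index column comes out as integer rather than an inferred long:)

from pyspark.sql import SparkSession
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.getOrCreate()

# Schema matching the printSchema() output above
schema = StructType([
    StructField("myarray", ArrayType(StringType())),
    StructField("myindices", ArrayType(IntegerType()))])

df = spark.createDataFrame(
    [(["A"], [0]),
     (["B", "C"], [1]),
     (["D", "E", "F", "G"], [0, 2])],
    schema)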
How can I use the second array to index the first?
My goal is to create a new DataFrame which would look like:
+------------+---------+------+
|     myarray|myindices|result|
+------------+---------+------+
|         [A]|      [0]|   [A]|
|      [B, C]|      [1]|   [C]|
|[D, E, F, G]|   [0, 2]|[D, F]|
+------------+---------+------+
(It is safe to assume that the entries of myindices are always valid
indices into myarray for the row in question, so there are no
out-of-bounds problems.)
It appears that the .getItem() method only accepts a single argument, so
I might need a UDF here, but I know of no way to create a UDF that takes
more than one column as input. Any solutions, with or without UDFs?
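(For concreteness, here is a minimal sketch of the kind of two-column UDF I have in mind, assuming such a thing is even possible; the name pick is just illustrative:)

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Sketch: a plain Python function over both array columns; each row's
# values would arrive as ordinary Python lists, so list indexing applies.
pick = udf(lambda arr, idx: [arr[i] for i in idx],
           ArrayType(StringType()))

result_df = df.withColumn("result", pick("myarray", "myindices"))
result_df.show()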