Posted to user@spark.apache.org by apu <ap...@gmail.com> on 2017/02/15 00:11:32 UTC

PySpark: use one column to index another (udf of two columns?)

Let's say I have a Spark (PySpark) dataframe with the following schema:

root
 |-- myarray: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- myindices: array (nullable = true)
 |    |-- element: integer (containsNull = true)

It looks like:

+--------------------+----------+
|          myarray   | myindices|
+--------------------+----------+
|                 [A]|    [0]   |
|              [B, C]|    [1]   |
|        [D, E, F, G]|   [0,2]  |
+--------------------+----------+
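For reproducibility, here is roughly how such a dataframe could be
constructed (a sketch; createDataFrame may infer long rather than integer
for the index elements):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(["A"], [0]),
     (["B", "C"], [1]),
     (["D", "E", "F", "G"], [0, 2])],
    ["myarray", "myindices"])

df.printSchema()
df.show()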

How can I use the second array to index the first?

My goal is to create a new dataframe which would look like:

+--------------------+----------+------+
|          myarray   | myindices|result|
+--------------------+----------+------+
|                 [A]|    [0]   |  [A] |
|              [B, C]|    [1]   |  [C] |
|        [D, E, F, G]|   [0,2]  | [D,F]|
+--------------------+----------+------+

(It is safe to assume that the values in myindices are always valid
positions within myarray for the row in question, so there are no
out-of-bounds problems.)
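In plain Python terms, the per-row operation I want is simply:

myarray = ["D", "E", "F", "G"]
myindices = [0, 2]
result = [myarray[i] for i in myindices]   # ['D', 'F']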

It appears that the .getItem() method only accepts a single, literal
index, so I might need a UDF here, but I don't know how to create a UDF
that takes more than one column as input. Any solutions, with or without
UDFs?
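For example, selecting a single literal position works fine (assuming df
is the dataframe above), but I see no way to feed the per-row contents of
myindices into it:

from pyspark.sql.functions import col

# A single literal index works:
df.withColumn("first_elem", col("myarray").getItem(0)).show()

# ...but there is no obvious way to pass the per-row values of
# "myindices" as the argument to getItem().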