Posted to user@spark.apache.org by Patrick McCarthy <pm...@dstillery.com.INVALID> on 2021/05/28 13:51:04 UTC
Profiling options for PandasUDF (2.4.7 on yarn)
I'm trying to do a very large aggregation of sparse matrices in which my
source data looks like
root
|-- device_id: string (nullable = true)
|-- row_id: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- column_id: array (nullable = true)
| |-- element: integer (containsNull = true)
Each row represents a sparse matrix in which every (row_id, column_id)
pair has a value of 1. I have a GROUPED_MAP PandasUDF that converts each
row into a scipy.sparse.csr_matrix and, within the group, sums the
matrices before returning columns of (count, row_id, column_id).
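The per-group logic described above can be sketched without Spark as a plain
pandas function, which is also the easiest form to test and profile locally.
This is a minimal sketch, not the original UDF: the fixed `shape` and the
function name `sum_sparse` are assumptions, and a real GROUPED_MAP version
would wrap this function with `pandas_udf(..., PandasUDFType.GROUPED_MAP)`
and a matching output schema.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Assumed matrix dimensions; the real job would derive these from the data.
SHAPE = (1000, 1000)

def sum_sparse(pdf):
    # Each input row carries parallel coordinate arrays; every
    # (row_id, column_id) pair has an implicit value of 1.
    total = csr_matrix(SHAPE, dtype=np.int64)
    for rows, cols in zip(pdf["row_id"], pdf["column_id"]):
        data = np.ones(len(rows), dtype=np.int64)
        total += csr_matrix((data, (rows, cols)), shape=SHAPE)
    # Return the summed matrix in (count, row_id, column_id) form.
    coo = total.tocoo()
    return pd.DataFrame(
        {"count": coo.data, "row_id": coo.row, "column_id": coo.col}
    )
```

Repeated `csr += csr` additions reallocate on every step; for large groups it
may be cheaper to concatenate all coordinates once and build a single COO
matrix, letting the constructor sum duplicate entries.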
It works at small scale but becomes unstable as I scale up. Is there a way
to profile this function inside a Spark session, or am I limited to
profiling it on pandas DataFrames without Spark?
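For the "without Spark" option mentioned above, one workable harness is to run
the group function on a representative pandas DataFrame under cProfile. This
is a hedged sketch: `process_group` is a hypothetical stand-in for the real
UDF body, and Spark's `spark.python.profile` setting, which profiles
worker-side Python for RDD operations, may not cover pandas UDFs on 2.4.

```python
import cProfile
import io
import pstats

import pandas as pd

def process_group(pdf):
    # Stand-in for the real GROUPED_MAP body; replace with the actual
    # csr_matrix aggregation to get meaningful timings.
    return pdf.assign(count=1)

# A small sample group, shaped like one partition the UDF would receive.
pdf = pd.DataFrame({"row_id": [[0, 1]], "column_id": [[2, 3]]})

profiler = cProfile.Profile()
profiler.enable()
result = process_group(pdf)
profiler.disable()

# Print the ten most expensive calls by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
print(buf.getvalue())
```

Feeding the harness a group sized like the ones that destabilize the job
should show whether time is going into matrix construction, the repeated
additions, or serialization back to a DataFrame.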
--
*Patrick McCarthy *
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016