Posted to user@spark.apache.org by Patrick McCarthy <pm...@dstillery.com.INVALID> on 2021/05/28 13:51:04 UTC

Profiling options for PandasUDF (2.4.7 on yarn)

I'm trying to do a very large aggregation of sparse matrices in which my
source data looks like

root
 |-- device_id: string (nullable = true)
 |-- row_id: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- column_id: array (nullable = true)
 |    |-- element: integer (containsNull = true)
I treat each row as a sparse matrix in which every combination of
(row_id, column_id) has a value of 1. I have a GROUPED_MAP PandasUDF
that transforms each group's rows into scipy.sparse.csr_matrix objects,
sums the matrices within the group, and returns columns of (count,
row_id, column_id).
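
For reference, a minimal sketch of the per-group aggregation I'm describing (column names match my schema; the fixed matrix dimension N_DIM and the function name are placeholders, not my actual code):

```python
import numpy as np
import pandas as pd
from scipy import sparse

N_DIM = 4  # assumed matrix dimension; in practice the known max id + 1


def sum_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Build a CSR matrix of ones from each row's (row_id, column_id)
    arrays, sum the matrices for the group, and return the nonzero
    cells as (count, row_id, column_id) columns."""
    total = sparse.csr_matrix((N_DIM, N_DIM), dtype=np.int64)
    for rows, cols in zip(pdf["row_id"], pdf["column_id"]):
        ones = np.ones(len(rows), dtype=np.int64)
        total += sparse.csr_matrix((ones, (rows, cols)), shape=(N_DIM, N_DIM))
    coo = total.tocoo()
    return pd.DataFrame(
        {"count": coo.data, "row_id": coo.row, "column_id": coo.col}
    )
```

In Spark 2.4 this would be wrapped with @pandas_udf(schema, PandasUDFType.GROUPED_MAP) and applied via df.groupby("device_id").apply(...).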

It works at small scale but becomes unstable as I scale up. Is there a
way to profile this function within a Spark session, or am I limited to
profiling it on pandas DataFrames outside of Spark?

-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016