Posted to issues@spark.apache.org by "Jeremy Freeman (JIRA)" <ji...@apache.org> on 2014/06/04 03:31:02 UTC
[jira] [Created] (SPARK-2012) PySpark StatCounter with numpy arrays
Jeremy Freeman created SPARK-2012:
-------------------------------------
Summary: PySpark StatCounter with numpy arrays
Key: SPARK-2012
URL: https://issues.apache.org/jira/browse/SPARK-2012
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 1.0.0
Reporter: Jeremy Freeman
Priority: Minor
In Spark 0.9, the PySpark version of StatCounter worked with an RDD of numpy arrays just as it did with an RDD of scalars, which was very useful (e.g. for computing stats on a set of vectors in ML analyses). In 1.0.0 this broke, because the newly added minimum/maximum computation, as implemented, doesn't work on arrays.
I have a PR ready that re-enables this functionality by having StatCounter use the numpy element-wise functions "maximum" and "minimum", which work on both numpy arrays and scalars; I've also added new tests for this capability.
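The element-wise behavior the fix relies on can be sketched with plain NumPy, no Spark required (the variable names here are illustrative, not from the actual PR):

```python
import numpy as np

# np.maximum / np.minimum operate element-wise and accept both arrays and
# scalars, unlike the built-in max()/min(), which raise on multi-element arrays.
running_max = np.maximum(np.array([1.0, 5.0]), np.array([3.0, 2.0]))  # array([3., 5.])
running_min = np.minimum(np.array([1.0, 5.0]), np.array([3.0, 2.0]))  # array([1., 2.])

# The same calls also work on plain scalars, so a StatCounter-style running
# min/max can use them uniformly for RDDs of scalars or of vectors.
scalar_max = np.maximum(4.0, 7.0)  # 7.0
```

Because the same two functions cover both cases, StatCounter's merge logic doesn't need to branch on the element type.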
However, I realize this adds a dependency on NumPy outside of MLlib. If that's not ok, maybe it'd be worth adding this functionality as a util within PySpark MLlib instead?
--
This message was sent by Atlassian JIRA
(v6.2#6252)