Posted to commits@spark.apache.org by me...@apache.org on 2015/01/05 22:11:40 UTC
spark git commit: [SPARK-5089][PYSPARK][MLLIB] Fix vector convert
Repository: spark
Updated Branches:
refs/heads/master 1c0e7ce05 -> 6c6f32574
[SPARK-5089][PYSPARK][MLLIB] Fix vector convert
This is a small change addressing a potentially significant bug in how PySpark + MLlib handles non-float64 numpy arrays. The automatic conversion to `DenseVector` that occurs when passing RDDs to MLlib algorithms in PySpark should upcast arrays to float64, but this was not actually happening: the result of `astype` was computed and then discarded. As a result, non-float64 arrays were silently parsed incorrectly during SerDe, yielding erroneous results when running, for example, KMeans.
The PR includes the fix, as well as a new test for the correct conversion behavior.
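For illustration only, here is a minimal standalone sketch of the behavior described above. It is not the PySpark code: the helper names `to_float64_buggy` and `to_float64_fixed` are hypothetical, and the final step, which reads float32 bytes back as float64, is an assumed stand-in for what a float64-expecting deserializer might do, not the actual SerDe path.

```python
import numpy as np


def to_float64_buggy(ar):
    # Pre-fix pattern: astype returns a NEW array; the result is discarded,
    # so the caller still gets the original dtype.
    if ar.dtype != np.float64:
        ar.astype(np.float64)
    return ar


def to_float64_fixed(ar):
    # Post-fix pattern: rebind the name to the converted array.
    if ar.dtype != np.float64:
        ar = ar.astype(np.float64)
    return ar


v = np.array([1, 2, 3, 4], dtype=np.float32)
buggy = to_float64_buggy(v)   # still float32
fixed = to_float64_fixed(v)   # float64

# Assumption for illustration: a consumer that expects float64 bytes.
# Four float32 values occupy 16 bytes, which read back as two float64
# values that do not match the original data.
misread = np.frombuffer(buggy.tobytes(), dtype=np.float64)

print(buggy.dtype, fixed.dtype, misread.shape)
```

The point of the sketch is that `ndarray.astype` does not convert in place, so the one-line assignment in the fix is what makes the upcast take effect.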
davies
Author: freeman <th...@gmail.com>
Closes #3902 from freeman-lab/fix-vector-convert and squashes the following commits:
764db47 [freeman] Add a test for proper conversion behavior
704f97e [freeman] Return array after changing type
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6c6f3257
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6c6f3257
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6c6f3257
Branch: refs/heads/master
Commit: 6c6f32574023b8e43a24f2081ff17e6e446de2f3
Parents: 1c0e7ce
Author: freeman <th...@gmail.com>
Authored: Mon Jan 5 13:10:59 2015 -0800
Committer: Xiangrui Meng <me...@databricks.com>
Committed: Mon Jan 5 13:10:59 2015 -0800
----------------------------------------------------------------------
python/pyspark/mllib/linalg.py | 2 +-
python/pyspark/mllib/tests.py | 10 ++++++++++
2 files changed, 11 insertions(+), 1 deletion(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/spark/blob/6c6f3257/python/pyspark/mllib/linalg.py
----------------------------------------------------------------------
diff --git a/python/pyspark/mllib/linalg.py b/python/pyspark/mllib/linalg.py
index f7aa2b0..4f8491f 100644
--- a/python/pyspark/mllib/linalg.py
+++ b/python/pyspark/mllib/linalg.py
@@ -178,7 +178,7 @@ class DenseVector(Vector):
         elif not isinstance(ar, np.ndarray):
             ar = np.array(ar, dtype=np.float64)
         if ar.dtype != np.float64:
-            ar.astype(np.float64)
+            ar = ar.astype(np.float64)
         self.array = ar
 
     def __reduce__(self):
http://git-wip-us.apache.org/repos/asf/spark/blob/6c6f3257/python/pyspark/mllib/tests.py
----------------------------------------------------------------------
diff --git a/python/pyspark/mllib/tests.py b/python/pyspark/mllib/tests.py
index 5034f22..1f48bc1 100644
--- a/python/pyspark/mllib/tests.py
+++ b/python/pyspark/mllib/tests.py
@@ -110,6 +110,16 @@ class VectorTests(PySparkTestCase):
         self.assertEquals(0.0, _squared_distance(dv, dv))
         self.assertEquals(0.0, _squared_distance(lst, lst))
 
+    def test_conversion(self):
+        # numpy arrays should be automatically upcast to float64
+        # tests for fix of [SPARK-5089]
+        v = array([1, 2, 3, 4], dtype='float64')
+        dv = DenseVector(v)
+        self.assertTrue(dv.array.dtype == 'float64')
+        v = array([1, 2, 3, 4], dtype='float32')
+        dv = DenseVector(v)
+        self.assertTrue(dv.array.dtype == 'float64')
+
 
 
 class ListTests(PySparkTestCase):