Posted to commits@spark.apache.org by sr...@apache.org on 2018/12/15 14:42:05 UTC
[spark] branch branch-2.4 updated: [SPARK-26315][PYSPARK] auto cast
threshold from Integer to Float in approxSimilarityJoin of
BucketedRandomProjectionLSHModel
This is an automated email from the ASF dual-hosted git repository.
srowen pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-2.4 by this push:
new 869bfc9 [SPARK-26315][PYSPARK] auto cast threshold from Integer to Float in approxSimilarityJoin of BucketedRandomProjectionLSHModel
869bfc9 is described below
commit 869bfc906abc89ec6f6370c97e5b107212204af4
Author: Jing Chen He <ji...@us.ibm.com>
AuthorDate: Sat Dec 15 08:41:16 2018 -0600
[SPARK-26315][PYSPARK] auto cast threshold from Integer to Float in approxSimilarityJoin of BucketedRandomProjectionLSHModel
## What changes were proposed in this pull request?
If the 'threshold' argument passed to approxSimilarityJoin is not a float (for example, the integer 3), the call raises an exception. The fix converts 'threshold' to a float before calling the Java implementation method.
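The coercion pattern can be sketched without a running Spark session. In the minimal example below, `to_float`, `approx_similarity_join`, and `recording_call_java` are hypothetical stand-ins for `TypeConverters.toFloat`, the patched method, and `_call_java`. The exact checks performed by the real converter are an assumption here. The likely cause of the original exception (an integer argument reaching a JVM method that expects a double) is inferred from the commit title and is not stated in the message.

```python
import numbers


def to_float(value):
    """Hypothetical stand-in for TypeConverters.toFloat: accept real numbers only."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, numbers.Real):
        raise TypeError("Could not convert %r to float" % (value,))
    return float(value)


def approx_similarity_join(call_java, dataset_a, dataset_b, threshold, dist_col="distCol"):
    """Coerce threshold on the Python side, then forward the call unchanged."""
    threshold = to_float(threshold)
    return call_java("approxSimilarityJoin", dataset_a, dataset_b, threshold, dist_col)


def recording_call_java(name, *args):
    """Hypothetical stub: returns what would have been sent to the JVM."""
    return name, args


name, args = approx_similarity_join(recording_call_java, "dfA", "dfB", 3)
print(name, type(args[2]).__name__, args[2])  # approxSimilarityJoin float 3.0
```

Converting at the Python boundary means callers can pass either 3 or 3.0, while non-numeric input still fails early with a TypeError.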
## How was this patch tested?
Added a new test case. Without this fix, the test will throw an exception as reported in the JIRA. With the fix, the test passes.
Closes #23313 from jerryjch/SPARK-26315.
Authored-by: Jing Chen He <ji...@us.ibm.com>
Signed-off-by: Sean Owen <se...@databricks.com>
(cherry picked from commit 860f4497f2a59b21d455ec8bfad9ae15d2fd4d2e)
Signed-off-by: Sean Owen <se...@databricks.com>
---
python/pyspark/ml/feature.py | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index eccb7ac..bc4f4c9 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -193,6 +193,7 @@ class LSHModel(JavaModel):
         "datasetA" and "datasetB", and a column "distCol" is added to show the distance
         between each pair.
         """
+        threshold = TypeConverters.toFloat(threshold)
         return self._call_java("approxSimilarityJoin", datasetA, datasetB, threshold, distCol)
@@ -240,6 +241,16 @@ class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, HasOutp
     |  3|  6| 2.23606797749979|
     +---+---+-----------------+
     ...
+    >>> model.approxSimilarityJoin(df, df2, 3, distCol="EuclideanDistance").select(
+    ...     col("datasetA.id").alias("idA"),
+    ...     col("datasetB.id").alias("idB"),
+    ...     col("EuclideanDistance")).show()
+    +---+---+-----------------+
+    |idA|idB|EuclideanDistance|
+    +---+---+-----------------+
+    |  3|  6| 2.23606797749979|
+    +---+---+-----------------+
+    ...
     >>> brpPath = temp_path + "/brp"
     >>> brp.save(brpPath)
     >>> brp2 = BucketedRandomProjectionLSH.load(brpPath)