Posted to issues@spark.apache.org by "Thai Thien (Jira)" <ji...@apache.org> on 2021/08/09 11:03:00 UTC
[jira] [Created] (SPARK-36458) MinHashLSH.approxSimilarityJoin should not require inputCol if outputCol exists
Thai Thien created SPARK-36458:
----------------------------------
Summary: MinHashLSH.approxSimilarityJoin should not require inputCol if outputCol exists
Key: SPARK-36458
URL: https://issues.apache.org/jira/browse/SPARK-36458
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.1.1
Reporter: Thai Thien
See the documentation and example code for MinHashLSH:
https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance
The example contains this comment:
```
# We could avoid computing hashes by passing in the already-transformed dataset, e.g.
# `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
```
However, inputCol is still required in transformedA and transformedB even if they already contain outputCol.
Here is code that should work but does not:
```
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Reuse the active session (e.g. in the pyspark shell) or start one.
spark = SparkSession.builder.getOrCreate()

dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
         (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
         (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
dfA = spark.createDataFrame(dataA, ["id", "features"])

dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
         (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
         (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
dfB = spark.createDataFrame(dataB, ["id", "features"])

key = Vectors.sparse(6, [1, 3], [1.0, 1.0])

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(dfA)

# Hash both datasets up front, then drop the inputCol ("features").
transformedA = model.transform(dfA).select("id", "hashes")
transformedB = model.transform(dfB).select("id", "hashes")

# This call fails: the join still requires the "features" column.
model.approxSimilarityJoin(transformedA, transformedB, 0.6, distCol="JaccardDistance")\
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("JaccardDistance")).show()
```
In the code above, I drop the `features` column but keep the `hashes` column, which holds the output data.
approxSimilarityJoin should work on `hashes` (the outputCol) alone when it exists, and ignore the missing `features` (the inputCol).
Being able to transform the data beforehand and drop inputCol would make the input data much smaller and would prevent confusion about "We could avoid computing hashes by passing in the already-transformed dataset".
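For reference, the pairs that an exact join would return for this data can be worked out without Spark. The sketch below is plain Python, not Spark code: it treats each sparse vector as the set of its non-zero indices and computes the exact Jaccard distance for every pair. The helper names `jaccard_distance` and `exact_similarity_join` are made up for illustration, and the strict `<` comparison against the threshold is an assumption. Since approxSimilarityJoin is approximate, it may return only a subset of these pairs.

```python
from itertools import product

# Non-zero indices of the sparse vectors from the report, keyed by id.
sets_a = {0: {0, 1, 2}, 1: {2, 3, 4}, 2: {0, 2, 4}}
sets_b = {3: {1, 3, 5}, 4: {2, 3, 5}, 5: {1, 2, 4}}


def jaccard_distance(x, y):
    """Exact Jaccard distance: 1 - |x & y| / |x | y|."""
    return 1.0 - len(x & y) / len(x | y)


def exact_similarity_join(a, b, threshold):
    """Brute-force reference join (hypothetical helper, not a Spark API).

    Returns every (idA, idB, distance) whose distance is below the
    threshold. The strict "<" is an assumption made for this sketch.
    """
    result = []
    for id_a, id_b in product(sorted(a), sorted(b)):
        dist = jaccard_distance(a[id_a], b[id_b])
        if dist < threshold:
            result.append((id_a, id_b, dist))
    return result


pairs = exact_similarity_join(sets_a, sets_b, 0.6)
for id_a, id_b, dist in pairs:
    print(id_a, id_b, dist)
```

With a threshold of 0.6, the qualifying pairs are (0, 5), (1, 4), (1, 5) and (2, 5), each at distance 0.5; every other pair is at distance 0.8 or 1.0.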