Posted to issues@spark.apache.org by "Thai Thien (Jira)" <ji...@apache.org> on 2021/08/09 11:03:00 UTC

[jira] [Created] (SPARK-36458) MinHashLSH.approxSimilarityJoin should not require inputCol if outputCol exists

Thai Thien created SPARK-36458:
----------------------------------

             Summary: MinHashLSH.approxSimilarityJoin should not require inputCol if outputCol exists
                 Key: SPARK-36458
                 URL: https://issues.apache.org/jira/browse/SPARK-36458
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.1.1
            Reporter: Thai Thien


Refer to the documentation and example code for MinHashLSH:
https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance

The example states:

```
# We could avoid computing hashes by passing in the already-transformed dataset, e.g.
# `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
```

However, the inputCol is still required in transformedA and transformedB even though they already contain the outputCol.

Code that should work, but doesn't:

```
from pyspark.sql import SparkSession
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col

# Create the session explicitly so the snippet also runs outside the pyspark shell.
spark = SparkSession.builder.getOrCreate()

dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
         (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
         (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
dfA = spark.createDataFrame(dataA, ["id", "features"])

dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
         (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
         (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
dfB = spark.createDataFrame(dataB, ["id", "features"])

key = Vectors.sparse(6, [1, 3], [1.0, 1.0])

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(dfA)

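# Pre-transform both datasets and keep only the id and the hashes (outputCol);
# the original "features" column (inputCol) is deliberately dropped.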
transformedA = model.transform(dfA).select("id", "hashes")
transformedB = model.transform(dfB).select("id", "hashes")

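# Expected to work per the docs quote above, but it fails because
# approxSimilarityJoin still requires the inputCol ("features") on both sides.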
model.approxSimilarityJoin(transformedA, transformedB, 0.6, distCol="JaccardDistance")\
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("JaccardDistance")).show()
```

In the code above, I discard the `features` column but keep the `hashes` column, which holds the output data.
approxSimilarityJoin should be able to work on `hashes` (the outputCol), which does exist, and ignore the absence of `features` (the inputCol).
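
For comparison, the join does run today if the inputCol is kept alongside the outputCol in the pre-transformed datasets, as in the documented example. A minimal sketch continuing from the variables above (`transformedA_full`/`transformedB_full` are names introduced here for illustration):

```
# Assumed workaround, based on the documented example: keep both "features"
# (inputCol) and "hashes" (outputCol) in the pre-transformed datasets.
transformedA_full = model.transform(dfA)   # id, features, hashes
transformedB_full = model.transform(dfB)

model.approxSimilarityJoin(transformedA_full, transformedB_full, 0.6, distCol="JaccardDistance")\
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("JaccardDistance")).show()
```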

Being able to transform the data beforehand and drop the inputCol would make the input data much smaller, and would remove the confusion around "We could avoid computing hashes by passing in the already-transformed dataset".
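
One possible shape for the requested behaviour, sketched as illustrative Python only (this is not Spark's current LSHModel code): skip the transform step, and with it the inputCol requirement, whenever the outputCol is already present in the incoming dataset.

```
# Illustrative sketch of the request only -- hypothetical helper, not the
# actual pyspark.ml implementation.
def ensure_hashed(model, dataset):
    # If the outputCol (e.g. "hashes") is already there, reuse it and do not
    # require the inputCol; otherwise compute the hashes from the inputCol.
    if model.getOutputCol() in dataset.columns:
        return dataset
    return model.transform(dataset)
```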



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
