Posted to issues@spark.apache.org by "shengzhang (Jira)" <ji...@apache.org> on 2021/09/10 08:38:00 UTC
[jira] [Created] (SPARK-36714) bugs in MinHashLSH
shengzhang created SPARK-36714:
----------------------------------
Summary: bugs in MinHashLSH
Key: SPARK-36714
URL: https://issues.apache.org/jira/browse/SPARK-36714
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 2.1.1
Reporter: shengzhang
Fix For: 2.1.1
This is about the MinHashLSH algorithm.
To get the similarity join of dataframes DFA and DFB, I used the MinHashLSH approxSimilarityJoin function, but some pairs are missing from the result.
The example in the documentation works fine: https://spark.apache.org/docs/latest/ml-features.html#minhash-for-jaccard-distance
But when the data lives on a distributed system (Hive, more than one node), some pairs go missing. For example, vectorA = vectorB, yet the pair does not appear in the result of approxSimilarityJoin, even with a threshold greater than 1.
I think the problem may be in this code:
{code:java}
// part1
override protected[ml] def createRawLSHModel(inputDim: Int): MinHashLSHModel = {
  require(inputDim <= MinHashLSH.HASH_PRIME,
    s"The input vector dimension $inputDim exceeds the threshold ${MinHashLSH.HASH_PRIME}.")
  val rand = new Random($(seed))
  val randCoefs: Array[(Int, Int)] = Array.fill($(numHashTables)) {
    (1 + rand.nextInt(MinHashLSH.HASH_PRIME - 1), rand.nextInt(MinHashLSH.HASH_PRIME - 1))
  }
  new MinHashLSHModel(uid, randCoefs)
}

// part2
@Since("2.1.0")
override protected[ml] val hashFunction: Vector => Array[Vector] = {
  elems: Vector => {
    require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.")
    val elemsList = elems.toSparse.indices.toList
    val hashValues = randCoefficients.map { case (a, b) =>
      elemsList.map { elem: Int =>
        ((1 + elem) * a + b) % MinHashLSH.HASH_PRIME
      }.min.toDouble
    }
    // TODO: Output vectors of dimension numHashFunctions in SPARK-18450
    hashValues.map(Vectors.dense(_))
  }
}
{code}
{code:java}
val r1 = new scala.util.Random(1)
r1.nextInt(1000) // -> 985
val r2 = new scala.util.Random(2)
r2.nextInt(1000) // -> 108
val r3 = new scala.util.Random(1)
r3.nextInt(1000) // -> 985, because it was seeded just like r1
r3.nextInt(1000) // -> 588
{code}
I think the reason may be the above: the Random instance is initialized only once, and each call to rand.nextInt() returns a different value, like r3 (985, then 588).
So it would be better to move
val rand = new Random($(seed))
from createRawLSHModel into hashFunction: every worker would then initialize its own Random from the seed, and every worker would get the same data.
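To sketch what that re-seeding would guarantee: two workers that each construct their own Random from the same seed draw identical coefficient pairs. The helper and object below are hypothetical illustrations, not Spark's API (the prime value matches MinHashLSH.HASH_PRIME in current Spark sources):

```scala
import scala.util.Random

object SeededCoefs {
  val HashPrime = 2038074743 // value of MinHashLSH.HASH_PRIME in Spark

  // Hypothetical helper: rebuild the coefficient pairs from the seed,
  // as the proposal would do inside hashFunction on each worker.
  def coefficients(seed: Long, numHashTables: Int): Array[(Int, Int)] = {
    val rand = new Random(seed)
    Array.fill(numHashTables) {
      (1 + rand.nextInt(HashPrime - 1), rand.nextInt(HashPrime - 1))
    }
  }

  def main(args: Array[String]): Unit = {
    // Two "workers" seeding their own Random identically
    val workerA = coefficients(42L, 3)
    val workerB = coefficients(42L, 3)
    assert(workerA.sameElements(workerB))
    println("same seed -> same coefficients on every worker")
  }
}
```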
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org