Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2021/09/12 02:00:00 UTC
[jira] [Commented] (SPARK-36714) bugs in MinHashLSH
[ https://issues.apache.org/jira/browse/SPARK-36714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413640#comment-17413640 ]
Hyukjin Kwon commented on SPARK-36714:
--------------------------------------
Apache Spark 2.x is EOL. Mind double checking whether the same issue persists in Spark 3.x?
> bugs in MinHashLSH
> ------------------
>
> Key: SPARK-36714
> URL: https://issues.apache.org/jira/browse/SPARK-36714
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.1.1
> Reporter: shengzhang
> Priority: Minor
>
> This is about MinHashLSH algorithm.
> To get the similarity join of dataframes DFA and DFB, I used the MinHashLSH approxSimilarityJoin function. But some pairs are missing from the result.
> The example in the documentation works fine: https://spark.apache.org/docs/latest/ml-features.html#minhash-for-jaccard-distance
> But when the data lives on a distributed system (Hive, more than one node), some pairs go missing. For example, vectorA = vectorB, yet the pair does not appear in the result of approxSimilarityJoin, even when the threshold is greater than 1.
> I think the problem may be in this code:
> {code:java}
> // part1
> override protected[ml] def createRawLSHModel(inputDim: Int): MinHashLSHModel = {
>   require(inputDim <= MinHashLSH.HASH_PRIME,
>     s"The input vector dimension $inputDim exceeds the threshold ${MinHashLSH.HASH_PRIME}.")
>   val rand = new Random($(seed))
>   val randCoefs: Array[(Int, Int)] = Array.fill($(numHashTables)) {
>     (1 + rand.nextInt(MinHashLSH.HASH_PRIME - 1), rand.nextInt(MinHashLSH.HASH_PRIME - 1))
>   }
>   new MinHashLSHModel(uid, randCoefs)
> }
>
> // part2
> @Since("2.1.0")
> override protected[ml] val hashFunction: Vector => Array[Vector] = {
>   elems: Vector => {
>     require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.")
>     val elemsList = elems.toSparse.indices.toList
>     val hashValues = randCoefficients.map { case (a, b) =>
>       elemsList.map { elem: Int =>
>         ((1 + elem) * a + b) % MinHashLSH.HASH_PRIME
>       }.min.toDouble
>     }
>     // TODO: Output vectors of dimension numHashFunctions in SPARK-18450
>     hashValues.map(Vectors.dense(_))
>   }
> }
> {code}
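The hashing scheme in part2 can be sketched in plain Scala, outside Spark, to make the min-over-indices computation concrete. This is a simplified illustration, not Spark's actual implementation: `HashPrime` mirrors what I understand Spark's `MinHashLSH.HASH_PRIME` constant to be (2038074743), the indices and coefficient pairs are made-up example values, and the `toLong` cast is only there to avoid Int overflow in this sketch.

```scala
// Minimal sketch of the MinHash computation performed by the quoted
// hashFunction: for each (a, b) coefficient pair, hash every non-zero
// index of the sparse vector and keep the minimum hash value.
object MinHashSketch {
  // Assumed to match Spark's MinHashLSH.HASH_PRIME; treat as illustrative.
  val HashPrime = 2038074743

  def minHash(indices: Seq[Int], coefs: Seq[(Int, Int)]): Seq[Double] =
    coefs.map { case (a, b) =>
      // ((1 + i) * a + b) mod prime, minimized over all non-zero indices
      indices.map(i => ((1 + i).toLong * a + b) % HashPrime).min.toDouble
    }

  def main(args: Array[String]): Unit = {
    val indices = Seq(0, 3, 7)       // non-zero positions of a sparse vector
    val coefs = Seq((3, 5), (11, 2)) // two hypothetical hash tables
    println(minHash(indices, coefs)) // -> List(8.0, 13.0)
  }
}
```

For the first pair (a=3, b=5) the candidate hashes are 8, 17, 29, so the MinHash value is 8.0; the model would emit one such minimum per hash table.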
> {code:java}
> val r1 = new scala.util.Random(1)
> r1.nextInt(1000) // -> 985
> val r2 = new scala.util.Random(2)
> r2.nextInt(1000) // -> 108
> val r3 = new scala.util.Random(1)
> r3.nextInt(1000) // -> 985, because seeded just like r1
> r3.nextInt(1000) // -> 588
> {code}
> The reason may be as shown above: the Random instance is initialized only once, so successive random.nextInt() calls return a different result each time, as r3 shows (985, then 588).
> So it would be better to move val rand = new Random($(seed)) from def createRawLSHModel into hashFunction: every worker would then initialize the Random instance itself, and every worker would get the same data.
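The seeding behaviour underlying the report can be demonstrated in a few lines of plain Scala, independent of Spark. `SeededRandomDemo` and `firstDraw` are hypothetical names introduced for this sketch only: two Random instances created with the same seed produce identical sequences, while repeated calls on a single instance keep advancing its state.

```scala
import scala.util.Random

// Re-seeding reproduces the same draw; a single instance does not.
object SeededRandomDemo {
  // First value drawn from a Random freshly seeded with `seed`.
  def firstDraw(seed: Long, bound: Int): Int = new Random(seed).nextInt(bound)

  def main(args: Array[String]): Unit = {
    val r = new Random(1)
    println(r.nextInt(1000))    // first draw from seed 1
    println(r.nextInt(1000))    // second draw: state has advanced, value differs
    println(firstDraw(1, 1000)) // fresh instance with seed 1: first draw again
  }
}
```

This is why the reporter argues the Random should be constructed from the seed wherever the coefficients are consumed: re-seeding guarantees every worker derives the identical sequence.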
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org