Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2021/09/12 02:00:00 UTC
[jira] [Commented] (SPARK-36714) bugs in MinHashLSH
[ https://issues.apache.org/jira/browse/SPARK-36714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413640#comment-17413640 ]
Hyukjin Kwon commented on SPARK-36714:
--------------------------------------
Apache Spark 2.x is EOL. Mind double checking whether the same issue persists in Spark 3.x?
> bugs in MinHashLSH
> ------------------
>
> Key: SPARK-36714
> URL: https://issues.apache.org/jira/browse/SPARK-36714
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.1.1
> Reporter: shengzhang
> Priority: Minor
>
> This is about MinHashLSH algorithm.
> To get the similarity join of dataframes DFA and DFB, I used the MinHashLSH approxSimilarityJoin function. But some pairs are missing from the result.
> The example in the documentation works fine: https://spark.apache.org/docs/latest/ml-features.html#minhash-for-jaccard-distance
> But when the data lives on a distributed system (Hive, more than one node), some pairs go missing. For example, vectorA = vectorB, yet the pair does not appear in the result of approxSimilarityJoin, even when the threshold is greater than 1.
> I think the problem may be in this code:
> {code:java}
> // part1
> override protected[ml] def createRawLSHModel(inputDim: Int): MinHashLSHModel = {
>   require(inputDim <= MinHashLSH.HASH_PRIME,
>     s"The input vector dimension $inputDim exceeds the threshold ${MinHashLSH.HASH_PRIME}.")
>   val rand = new Random($(seed))
>   val randCoefs: Array[(Int, Int)] = Array.fill($(numHashTables)) {
>     (1 + rand.nextInt(MinHashLSH.HASH_PRIME - 1), rand.nextInt(MinHashLSH.HASH_PRIME - 1))
>   }
>   new MinHashLSHModel(uid, randCoefs)
> }
>
> // part2
> @Since("2.1.0")
> override protected[ml] val hashFunction: Vector => Array[Vector] = {
>   elems: Vector => {
>     require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.")
>     val elemsList = elems.toSparse.indices.toList
>     val hashValues = randCoefficients.map { case (a, b) =>
>       elemsList.map { elem: Int =>
>         ((1 + elem) * a + b) % MinHashLSH.HASH_PRIME
>       }.min.toDouble
>     }
>     // TODO: Output vectors of dimension numHashFunctions in SPARK-18450
>     hashValues.map(Vectors.dense(_))
>   }
> }
> {code}
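The hashing scheme in part2 can be sketched in plain Scala, outside Spark, to make the min-over-indices computation concrete. This is a simplified illustration, not Spark's actual implementation: `HashPrime` mirrors what I understand Spark's `MinHashLSH.HASH_PRIME` constant to be (2038074743), the indices and coefficient pairs are made-up example values, and the `toLong` cast is only there to avoid Int overflow in this sketch.

```scala
// Minimal sketch of the MinHash computation performed by the quoted
// hashFunction: for each (a, b) coefficient pair, hash every non-zero
// index of the sparse vector and keep the minimum hash value.
object MinHashSketch {
  // Assumed to match Spark's MinHashLSH.HASH_PRIME; treat as illustrative.
  val HashPrime = 2038074743

  def minHash(indices: Seq[Int], coefs: Seq[(Int, Int)]): Seq[Double] =
    coefs.map { case (a, b) =>
      // ((1 + i) * a + b) mod prime, minimized over all non-zero indices
      indices.map(i => ((1 + i).toLong * a + b) % HashPrime).min.toDouble
    }

  def main(args: Array[String]): Unit = {
    val indices = Seq(0, 3, 7)       // non-zero positions of a sparse vector
    val coefs = Seq((3, 5), (11, 2)) // two hypothetical hash tables
    println(minHash(indices, coefs)) // -> List(8.0, 13.0)
  }
}
```

For the first pair (a=3, b=5) the candidate hashes are 8, 17, 29, so the MinHash value is 8.0; the model would emit one such minimum per hash table.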
> {code:java}
> val r1 = new scala.util.Random(1)
> r1.nextInt(1000) // -> 985
> val r2 = new scala.util.Random(2)
> r2.nextInt(1000) // -> 108
> val r3 = new scala.util.Random(1)
> r3.nextInt(1000) // -> 985, because seeded just like r1
> r3.nextInt(1000) // -> 588
> {code}
> The reason may be as shown above: the Random instance is initialized only once, so successive random.nextInt() calls return a different result each time, as r3 shows (985, then 588).
> So it would be better to move val rand = new Random($(seed)) from def createRawLSHModel into hashFunction: every worker would then initialize the Random instance itself, and every worker would get the same data.
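The seeding behaviour underlying the report can be demonstrated in a few lines of plain Scala, independent of Spark. `SeededRandomDemo` and `firstDraw` are hypothetical names introduced for this sketch only: two Random instances created with the same seed produce identical sequences, while repeated calls on a single instance keep advancing its state.

```scala
import scala.util.Random

// Re-seeding reproduces the same draw; a single instance does not.
object SeededRandomDemo {
  // First value drawn from a Random freshly seeded with `seed`.
  def firstDraw(seed: Long, bound: Int): Int = new Random(seed).nextInt(bound)

  def main(args: Array[String]): Unit = {
    val r = new Random(1)
    println(r.nextInt(1000))    // first draw from seed 1
    println(r.nextInt(1000))    // second draw: state has advanced, value differs
    println(firstDraw(1, 1000)) // fresh instance with seed 1: first draw again
  }
}
```

This is why the reporter argues the Random should be constructed from the seed wherever the coefficients are consumed: re-seeding guarantees every worker derives the identical sequence.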
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org