Posted to dev@spark.apache.org by jiayuanm <ji...@gmail.com> on 2018/07/07 01:36:01 UTC

[SPARK ML] Minhash integer overflow

Hi everyone,

I was playing around with the LSH/MinHash module in Spark ML. I noticed
that the hash computation is done with Int (see
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala#L69).
Since "a" and "b" are drawn from a uniform distribution over [1,
MinHashLSH.HASH_PRIME] and MinHashLSH.HASH_PRIME is close to Int.MaxValue,
the multiplication is likely to overflow Int for a large sparse input
vector.

I wonder if this is a bug or intended. If it's a bug, one way to fix it is
to compute the hashes with Long and insert a couple of mod
MinHashLSH.HASH_PRIME reductions. Because MinHashLSH.HASH_PRIME is chosen
to be smaller than sqrt(2^63 - 1), this won't overflow a 64-bit integer.
Another option is to use BigInteger.
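For illustration, here is a minimal sketch of the issue (not Spark's actual code; the constant 2038074743 is Spark's MinHashLSH.HASH_PRIME, while the sample values for elem, a, and b are made up). The Int version can wrap around and return a negative "hash", while the Long version stays in range:

```java
public class MinhashOverflowDemo {
    // Spark's MinHashLSH.HASH_PRIME: a prime slightly below Int.MaxValue.
    static final long HASH_PRIME = 2038074743L;

    // Int arithmetic, as in the current implementation:
    // (1 + elem) * a can exceed Int.MaxValue and wrap around,
    // so the subsequent mod can produce a negative result.
    static int hashInt(int elem, int a, int b) {
        return ((1 + elem) * a + b) % (int) HASH_PRIME;
    }

    // Long arithmetic: after the first mod, both factors are below
    // HASH_PRIME, and HASH_PRIME < sqrt(2^63 - 1), so the product fits
    // in 64 bits and the result is the true residue in [0, HASH_PRIME).
    static long hashLong(long elem, long a, long b) {
        return ((1L + elem) % HASH_PRIME * a % HASH_PRIME + b) % HASH_PRIME;
    }

    public static void main(String[] args) {
        int elem = 100_000, a = 2_000_000_000, b = 12_345; // made-up sample values
        System.out.println(hashInt(elem, a, b));  // negative: the Int product wrapped
        System.out.println(hashLong(elem, a, b)); // the intended residue
    }
}
```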

Let me know what you think.

Thanks,
Jiayuan





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: [SPARK ML] Minhash integer overflow

Posted by Kazuaki Ishizaki <IS...@jp.ibm.com>.
Of course, the hash value can just be negative. I had thought it would be
the same as the result computed without overflow.

When I checked another implementation, it performs computations with int.
https://github.com/ALShum/MinHashLSH/blob/master/LSH.java#L89

By the way, jiayuan, did you compare the hash values generated by Spark
with those generated by other implementations?

Regards,
Kazuaki Ishizaki







Re: [SPARK ML] Minhash integer overflow

Posted by Sean Owen <sr...@gmail.com>.
I think it probably still does its job; the hash value can just be
negative. It is likely to be very slightly biased, though. Because the
intent doesn't seem to be to allow the overflow, it's worth changing to use
longs for the calculation.
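As a sketch of what the long-based calculation could look like over a sparse binary vector's active indices (a hypothetical helper, not the actual patch; HASH_PRIME is Spark's 2038074743, and the indices and coefficients in the test values are invented):

```java
import java.util.stream.IntStream;

public class MinhashLongSketch {
    static final long HASH_PRIME = 2038074743L; // Spark's MinHashLSH.HASH_PRIME

    // One MinHash value for one random (a, b) pair over the active indices of
    // a sparse binary vector: min over i of ((1 + i) * a + b) mod HASH_PRIME.
    // With a, b in [1, HASH_PRIME] and HASH_PRIME < sqrt(2^63 - 1), the
    // widened product (1 + i) % HASH_PRIME * a stays within 64 bits.
    static long minhash(int[] activeIndices, long a, long b) {
        return IntStream.of(activeIndices)
                .mapToLong(i -> ((1L + i) % HASH_PRIME * a + b) % HASH_PRIME)
                .min()
                .orElseThrow(() ->
                    new IllegalArgumentException("vector has no active elements"));
    }
}
```

The per-element reduction mod HASH_PRIME before the multiplication is what keeps both factors below sqrt(2^63 - 1), which is the property the original post relies on.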


Re: [SPARK ML] Minhash integer overflow

Posted by jiayuanm <ji...@gmail.com>.
Sure. JIRA ticket is here: https://issues.apache.org/jira/browse/SPARK-24754.
I'll create the PR.





Re: [SPARK ML] Minhash integer overflow

Posted by Kazuaki Ishizaki <IS...@jp.ibm.com>.
Thank you for reporting this issue. I think this is a bug caused by
integer overflow. IMHO, it would be good to compute the hashes with Long.

Would it be possible to create a JIRA entry? Do you want to submit a pull
request, too?

Regards,
Kazuaki Ishizaki


