Posted to dev@spark.apache.org by Sean Owen <sr...@gmail.com> on 2022/09/26 23:58:37 UTC

Why are hash functions seeded with 42?

OK, it came to my attention today that hash functions in Spark, like
xxhash64, actually always seed with 42:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L655

This is an issue if you want the hash of some value in Spark to match the
hash you compute with xxhash64 somewhere else, since, as far as I can tell,
most other implementations start with seed=0.
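
To make the mismatch concrete, a quick sketch (the standalone XXH64 below
comes from the zero-allocation-hashing library, net.openhft.hashing, which is
just my pick for comparison, and I've only checked that this lines up for a
single non-null string column):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, xxhash64}
    import net.openhft.hashing.LongHashFunction

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Spark's xxhash64 is hard-coded to seed with 42.
    val sparkHash = Seq("Spark").toDF("s")
      .select(xxhash64(col("s")).as("h"))
      .head()
      .getLong(0)

    // Hash the same UTF-8 bytes with a standalone XXH64 implementation.
    val bytes = "Spark".getBytes("UTF-8")
    val withSeed42 = LongHashFunction.xx(42L).hashBytes(bytes) // should match Spark here
    val withSeed0  = LongHashFunction.xx().hashBytes(bytes)    // the usual default elsewhere; won't match

    println(s"spark=$sparkHash  seed42=$withSeed42  seed0=$withSeed0")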

I'm guessing there wasn't a *great* reason for this; 42 just seemed like a
nice default seed. We can't change it now without possibly subtly changing
program behaviors, and I'm guessing it's messy to have the function take a
seed argument now, especially in SQL.

So I'm left with: I guess we should document that? I can do it if so.
And it's a cautionary tale, I guess, for hash function users.

Re: Why are hash functions seeded with 42?

Posted by Herman van Hovell <he...@databricks.com.INVALID>.
Sorry about that; it made me laugh 6 years ago, and I didn't expect it to
come back and haunt me :)...

There are ways out of this, none of them particularly appealing:
- Add a SQL conf to make the seed configurable.
- Add a seed parameter to the function. I am not sure we can make this work
well with star expansion (e.g. xxhash64(*) is allowed).
- Add a new function that lets you set the seed, e.g.
xxhash64_with_seed(<seed>, <value 1>, ..., <value n>); a rough sketch of a
user-side workaround along those lines is below.
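
For what it's worth, the third option is roughly something you can already
hack together on the Scala side by building the Catalyst expression yourself.
A rough sketch (the helper name is made up, and XxHash64's constructor is an
internal API, so its exact shape may differ across versions):

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.catalyst.expressions.XxHash64

    // Hypothetical helper: xxhash64 over the given columns with an explicit seed.
    // XxHash64(children, seed) is Catalyst-internal, not a stable public API.
    def xxhash64WithSeed(seed: Long, cols: Column*): Column =
      new Column(XxHash64(cols.map(_.expr), seed))

    // e.g. df.select(xxhash64WithSeed(0L, col("value")))

That sidesteps the star expansion question, since it is just Scala varargs,
but it does nothing for SQL users.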

Re: Why are hash functions seeded with 42?

Posted by Felix Cheung <fe...@hotmail.com>.
+1 to documenting it; a seed argument would be great if possible.

Re: Why are hash functions seeded with 42?

Posted by Sean Owen <sr...@gmail.com>.
Oh yeah, I get why we love to pick 42 for random things. I'm guessing it was
a bit of an oversight here, as the 'seed' is directly the initial state and 0
makes much more sense.


Re: Why are hash functions seeded with 42?

Posted by Nicholas Gustafson <nj...@gmail.com>.
I don’t know the reason, but I’d offer a hunch that perhaps it’s a nod to Douglas Adams (author of The Hitchhiker’s Guide to the Galaxy).

https://news.mit.edu/2019/answer-life-universe-and-everything-sum-three-cubes-mathematics-0910
