Posted to issues@beam.apache.org by "Valentyn Tymofieiev (Jira)" <ji...@apache.org> on 2020/08/28 17:42:00 UTC

[jira] [Commented] (BEAM-10824) Hash in stats.ApproximateUniqueCombineFn NON-deterministic

    [ https://issues.apache.org/jira/browse/BEAM-10824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186731#comment-17186731 ] 

Valentyn Tymofieiev commented on BEAM-10824:
--------------------------------------------

Related: https://issues.apache.org/jira/browse/BEAM-7525

We originally used mmh3, but reverted to the default hash function in https://github.com/apache/beam/pull/8799/ without realizing the consequences for distributed execution.

AFAIK the mmh3 dependency did not install cleanly on some Windows machines. We can check whether this is still the case now that precommit tests run on Windows for every PR.

We can also pick a different hash function that is deterministic. 
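
As a rough illustration of that last option, here is a minimal sketch using only the standard library. It is not the Beam implementation: the helper name stable_hash64 is hypothetical, and it assumes each element can be turned into bytes deterministically (here simply via str(), which is only safe for simple values such as strings and integers).

```python
import hashlib


def stable_hash64(value):
    """Hypothetical helper: 64-bit hash that does not depend on the process.

    Assumption: str(value) is identical on every worker for equal values.
    That holds for str and int, but not for objects whose str() includes
    a memory address.
    """
    if isinstance(value, bytes):
        data = value
    else:
        data = str(value).encode("utf-8")
    # blake2b is keyed only by its input here, so equal inputs give equal
    # digests on every machine, unlike the built-in hash() for str/bytes.
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big")
```

Because the result depends only on the input bytes, two workers that see the same element would agree on its hash, which is the property the sketch-based unique count relies on.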

> Hash in stats.ApproximateUniqueCombineFn NON-deterministic
> ----------------------------------------------------------
>
>                 Key: BEAM-10824
>                 URL: https://issues.apache.org/jira/browse/BEAM-10824
>             Project: Beam
>          Issue Type: Bug
>          Components: beam-model
>            Reporter: Monica Song
>            Priority: P1
>              Labels: hash
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Python's built-in hash() function is non-deterministic across processes. As a result, different workers map identical values to different hashes. In a distributed processing model this leads to overestimating the number of unique values by several orders of magnitude (about 1000x in my experience).
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L218]
>  
>  
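
For anyone who wants to reproduce the behaviour described in the issue outside of Beam, the following sketch may help. It assumes CPython's string hash randomization, which is controlled by the PYTHONHASHSEED environment variable; the helper name hash_in_child is hypothetical. It starts a fresh interpreter per call, roughly mimicking separate workers.

```python
import os
import subprocess
import sys


def hash_in_child(text, seed):
    """Return hash(text) as computed by a new interpreter started with
    PYTHONHASHSEED=seed (hypothetical helper, for illustration only)."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
    )
    return int(out)


if __name__ == "__main__":
    # Two "workers" with different seeds generally disagree on the same value.
    print(hash_in_child("beam", 1), hash_in_child("beam", 2))
```

With the same seed the two interpreters agree; with different seeds (or with the default random seed) they generally do not, so identical values end up looking like distinct ones.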


