Posted to issues@beam.apache.org by "Kenneth Knowles (Jira)" <ji...@apache.org> on 2021/05/15 18:01:02 UTC

[jira] [Updated] (BEAM-10824) Hash in stats.ApproximateUniqueCombineFn NON-deterministic

     [ https://issues.apache.org/jira/browse/BEAM-10824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kenneth Knowles updated BEAM-10824:
-----------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Resolved)

Hello! Due to a bug in our Jira configuration, this issue had status:Resolved but resolution:Unresolved.

I am bulk editing these issues to have resolution:Fixed

If a different resolution is appropriate, please change it. To do this, click the "Resolve" button (you can do this even for closed issues) and set the Resolution field to the right value.

> Hash in stats.ApproximateUniqueCombineFn NON-deterministic
> ----------------------------------------------------------
>
>                 Key: BEAM-10824
>                 URL: https://issues.apache.org/jira/browse/BEAM-10824
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: Monica Song
>            Assignee: Monica Song
>            Priority: P1
>              Labels: hash
>   Original Estimate: 24h
>          Time Spent: 21h
>  Remaining Estimate: 3h
>
> Python's built-in hash() function is not deterministic across processes: for str and bytes values it is salted with a per-process random seed unless PYTHONHASHSEED is set. As a result, different workers map identical values to different hashes. In a distributed processing model this leads to overestimating the number of unique values by several orders of magnitude (about 1000x in my experience).
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L218]
>  
>  
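To illustrate the problem described above, here is a minimal sketch of a process-independent hash. It is not necessarily the fix that was merged into Beam. The names stable_hash and builtin_hash_in_subprocess are hypothetical, and falling back to repr() for values that are not str or bytes is an assumption made only for this example (a real pipeline would more likely hash the coder-encoded bytes of each element).

```python
import hashlib
import os
import subprocess
import sys


def stable_hash(value, digest_size=8):
    """Return an integer hash that is the same in every Python process.

    Hypothetical helper: str is encoded as UTF-8, bytes are used as-is,
    and anything else falls back to repr(), which is an assumption that
    only holds for values with a stable repr (ints, tuples of ints, ...).
    """
    if isinstance(value, bytes):
        data = value
    elif isinstance(value, str):
        data = value.encode("utf-8")
    else:
        data = repr(value).encode("utf-8")
    # BLAKE2b is unkeyed here, so there is no per-process salt.
    digest = hashlib.blake2b(data, digest_size=digest_size).digest()
    return int.from_bytes(digest, "big")


def builtin_hash_in_subprocess(text, seed):
    """Run hash(text) in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    # Two "workers" with different seeds usually disagree on hash("beam") ...
    print(builtin_hash_in_subprocess("beam", 1))
    print(builtin_hash_in_subprocess("beam", 2))
    # ... while the digest-based hash does not depend on the seed at all.
    print(stable_hash("beam"))
```

Because every worker computes the same value for the same element, hashes produced on different workers can be merged without identical elements being counted as distinct.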



--
This message was sent by Atlassian Jira
(v8.3.4#803005)