Posted to issues@flink.apache.org by tammymendt <gi...@git.apache.org> on 2015/08/04 10:19:17 UTC

[GitHub] flink pull request: [FLINK-1297] Added OperatorStatsAccumulator fo...

Github user tammymendt commented on the pull request:

    https://github.com/apache/flink/pull/605#issuecomment-127519735
  
    Hey! So I've been using and testing this code throughout my master's thesis. Collecting count distinct makes jobs about 10% slower, whereas collecting heavy hitters can make a job 20 to 50% slower (depending on the algorithm and the distribution of the data). However, this overhead is lower than that of using a histogram accumulator (not to mention that the histogram might not fit in memory). I think it can be a nice addition to the code, especially since it does not affect any core components.
    
    The version that I pushed now uses a bunch of conditionals to check which statistic is being collected. I know @fhueske did not really like this. I implemented another version which avoids the conditionals by using a different class for every type of statistic. I preferred to push the conditional version though, since it has been more thoroughly tested.
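
For illustration only, the two designs mentioned above might look roughly like the sketch below. Every class and method name here is hypothetical and not taken from the pull request, and the exact `HashSet`/`HashMap` bookkeeping is just a stand-in for whatever structures the real accumulator uses.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical names throughout; this is NOT the code from the pull request.

// Design A: a single class that checks on every add() which statistic is collected.
class ConditionalStats {
    enum Kind { COUNT_DISTINCT, HEAVY_HITTERS }

    private final Kind kind;
    private final Set<Object> distinct = new HashSet<>();
    private final Map<Object, Long> counts = new HashMap<>();

    ConditionalStats(Kind kind) {
        this.kind = kind;
    }

    void add(Object value) {
        // The conditional runs once per record.
        if (kind == Kind.COUNT_DISTINCT) {
            distinct.add(value);
        } else {
            counts.merge(value, 1L, Long::sum);
        }
    }

    long distinctCount() { return distinct.size(); }

    long countOf(Object value) { return counts.getOrDefault(value, 0L); }
}

// Design B: one class per statistic behind a shared interface.
// The choice is made once, when the collector is constructed.
interface StatCollector {
    void add(Object value);
}

class DistinctCollector implements StatCollector {
    private final Set<Object> seen = new HashSet<>();

    @Override
    public void add(Object value) { seen.add(value); }

    long distinctCount() { return seen.size(); }
}

class HeavyHitterCollector implements StatCollector {
    private final Map<Object, Long> counts = new HashMap<>();

    @Override
    public void add(Object value) { counts.merge(value, 1L, Long::sum); }

    long countOf(Object value) { return counts.getOrDefault(value, 0L); }
}
```

In this sketch both designs produce the same results; the difference is only whether the statistic type is checked per record or fixed at construction time.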


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---