You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@flink.apache.org by haohui <gi...@git.apache.org> on 2017/04/24 21:18:06 UTC

[GitHub] flink pull request #3765: [FLINK-6373] Add runtime support for distinct aggr...

GitHub user haohui opened a pull request:

    https://github.com/apache/flink/pull/3765

    [FLINK-6373] Add runtime support for distinct aggregation over grouped windows

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/haohui/flink FLINK-6373

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/3765.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3765
    
----

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3765: [FLINK-6373] Add runtime support for distinct aggregation...

Posted by fhueske <gi...@git.apache.org>.

Github user fhueske commented on the issue:

    https://github.com/apache/flink/pull/3765
  
    Hi @haohui, thanks for the PR! I like the approach of the wrapping distinct aggregator. 
    
    Unfortunately, this approach won't work with the upcoming changes for the the UDAGG interface. The `AggregateFunction` interface won't define methods to accumulate, etc. Instead, these methods can be implemented for different types, will be identified by reflection and called from generated code. We handle scalar and table functions the same way. See PR #3762 for how the `AggregateFunction` interface will evolve.
    
    Best, Fabian
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #3765: [FLINK-6373] Add runtime support for distinct aggr...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/3765


---

[GitHub] flink issue #3765: [FLINK-6373] Add runtime support for distinct aggregation...

Posted by sunjincheng121 <gi...@git.apache.org>.

Github user sunjincheng121 commented on the issue:

    https://github.com/apache/flink/pull/3765
  
    Hi @haohui @fhueske I am very interested in `DISTINCT`,  Let me share some ideas about this:
    First up, in standard database there are two situations can using `DISTINCT` keyword. 
    *  in `SELECT Clause`, e.g.: `SELECT DISTINCT name FROM table` 
    *  in `AGG Clause`, e.g.: `COUNT([ALL|DISTINCT] expression)`,`SUM([ALL|DISTINCT] expression)`, etc. 
    
    In this post,we talk about  `AGG Clause`. The `DISTINCT` keyword tells the database system to aggregate only the distinct, or unique, values within the scope of the aggregate function. i.e. database system will deal with the `DISTINCT` keyword, and put the unique value into `AGG`. Based on this understanding, I think FLINK FRAMEWORK(not the AGG) should deal with the `DISTINCT` keyword. we do not need `DistinctAccumulator.java`. About GROUP WINDOW, I think I like analyze whether the data is duplicated in `XXXWindowFunction` and `DataSetXXXAggFunction`, And add boolean variable `isFirstTimeProcess` identifies whether the data is duplicated as a parameter of `GeneratedAggregationsFunction`. `GeneratedAggregationsFunction` process data according to `aggCall.isDistinct` and `isFirstTimeProcess`. What do you think? @haohui @fhueske 
    
    Best,
    SunJincheng


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3765: [FLINK-6373] Add runtime support for distinct aggregation...

Posted by fhueske <gi...@git.apache.org>.

Github user fhueske commented on the issue:

https://github.com/apache/flink/pull/3765

Hi @haohui,

I suggested before that PR #3771 might be used for DISTINCT group window functions. However, this does not work because we cannot register state for an AggregateFunction. The benefit of the approach of #3771 would have been that it does not need to deserialize the Map every time a record is accumulated (or retracted). Instead the distinct values are kept in a MapState that can be accessed (and deserialized) per look up key. But this approach does not work with the AggregateFunction that we use for early aggregation.

To be honest, I'm a bit concerned about the performance of the approach of this PR because the state of the DistinctAccumulator accumulator (i.e., the complete map) will be de/serialized every time we access it.

I think we can use this approach for now, but should look out, whether we can use an approach similar to the batch side where distinct aggregations (on different keys) are translated into multiple aggregations which are later joined together (the join would be rather cheap because its a 1-to-1 join).

I'll have a look at this PR later today.
Thanks, Fabian

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3765: [FLINK-6373] Add runtime support for distinct aggregation...

Posted by fhueske <gi...@git.apache.org>.

Github user fhueske commented on the issue:

    https://github.com/apache/flink/pull/3765
  
    The features that this PR was going to implement has been resolved by PR #5555. 
    I will close it.


---

[GitHub] flink issue #3765: [FLINK-6373] Add runtime support for distinct aggregation...

Posted by haohui <gi...@git.apache.org>.

Github user haohui commented on the issue:

    https://github.com/apache/flink/pull/3765
  
    Updated the PR to codegen the parts used by distinct accumulator. Each column is calculated independently.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3765: [FLINK-6373] Add runtime support for distinct aggregation...

Posted by haohui <gi...@git.apache.org>.

Github user haohui commented on the issue:

    https://github.com/apache/flink/pull/3765
  
    Note that this PR contains minimal amount of tests. Would love the feedbacks on what kinds of tests are required here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---