You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@flink.apache.org by ChengXiangLi <gi...@git.apache.org> on 2015/11/27 08:03:07 UTC

[GitHub] flink pull request: [FLINK-3087] [Table-API] support multi count i...

GitHub user ChengXiangLi opened a pull request:

    https://github.com/apache/flink/pull/1414

    [FLINK-3087] [Table-API] support multi count in aggregation

    The `Literal(1)` is used as the IntermediateField of `Count` aggregation, so multi `Count` looks the same in `ExpandAggregations`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ChengXiangLi/flink countAggregation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1414.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1414
    
----
commit d83690c521b22e7af339353ed1b06b3d199e9c42
Author: chengxiang li <ch...@intel.com>
Date:   2015-11-27T06:46:29Z

    [FLINK-3087] [Table-API] support multi count in aggregation

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3087] [Table-API] support multi count i...

Posted by aljoscha <gi...@git.apache.org>.

Github user aljoscha commented on the pull request:

    https://github.com/apache/flink/pull/1414#issuecomment-160879522
  
    Oh, you are way ahead of me. :sweat_smile: You are right, keep it that way since we will probably have support for Null values in the future.
    
    I'll merge it later if there are no objections. @twalthr ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3087] [Table-API] support multi count i...

Posted by ChengXiangLi <gi...@git.apache.org>.

Github user ChengXiangLi commented on the pull request:

    https://github.com/apache/flink/pull/1414#issuecomment-161153580
  
    Let's reuse the count aggregation now, as it's better for performance now. I would revisit this after NULL value handling is enabled.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3087] [Table-API] support multi count i...

Posted by ChengXiangLi <gi...@git.apache.org>.

Github user ChengXiangLi closed the pull request at:

    https://github.com/apache/flink/pull/1414


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3087] [Table-API] support multi count i...

Posted by ChengXiangLi <gi...@git.apache.org>.

Github user ChengXiangLi commented on the pull request:

    https://github.com/apache/flink/pull/1414#issuecomment-160822216
  
    `a.count` and `b.count` may not the same all the times(for example is `a` contains NULL values), that's why i try to aggregate it twice. But NULL value is not allowed until now, so i can update the PR in your way and revisit this issue after NULL value handling in Table API is enabled.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3087] [Table-API] support multi count i...

Posted by aljoscha <gi...@git.apache.org>.

Github user aljoscha commented on the pull request:

    https://github.com/apache/flink/pull/1414#issuecomment-160956025
  
    Ah, I see you changed it to my (wrong) suggestion. @ChengXiangLi Do you want to change it back to your initial version (plus maybe the fix for the naming and cleanExpr). Or should it be changed when adding support for null values?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3087] [Table-API] support multi count i...

Posted by twalthr <gi...@git.apache.org>.

Github user twalthr commented on the pull request:

    https://github.com/apache/flink/pull/1414#issuecomment-160949963
  
    Looks good. No objecttions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3087] [Table-API] support multi count i...

Posted by aljoscha <gi...@git.apache.org>.

Github user aljoscha commented on the pull request:

    https://github.com/apache/flink/pull/1414#issuecomment-160153085
  
    It's good that you found the problem and added a test. :smile: 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3087] [Table-API] support multi count i...

Posted by aljoscha <gi...@git.apache.org>.

Github user aljoscha commented on the pull request:

    https://github.com/apache/flink/pull/1414#issuecomment-161232136
  
    Alright, I merged it. Could you please close it if github doesn't do so automatically.
    
    Thanks for the work!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3087] [Table-API] support multi count i...

Posted by aljoscha <gi...@git.apache.org>.

Github user aljoscha commented on the pull request:

    https://github.com/apache/flink/pull/1414#issuecomment-160153032
  
    Hi,
    the reason it was implemented as it was is to eliminate common subexpressions. For example, if you wrote:
    ```
    select(a.count, b.count)
    ```
    
    the resulting expansion would be
    ```
    intermediate = select(1.count as intermediate.1)
    agg = intermediate.aggregate(intermediate.1, sum)
    result = agg.select(intermediate.1, intermediate.1)
    ```
    
    You see, the `1` would only be aggregated once here.
    
    The problem was that selecting `intermediate.1` twice is not allowed since the field names have to be unique.
    
    Changing `ExpandAggregations` to this should do the trick:
    ```
    var intermediateCount = 0
    var resultCount = 0
    selection foreach {  f =>
      f.transformPre {
        case agg: Aggregation =>
          val intermediateReferences = agg.getIntermediateFields.zip(agg.getAggregations) map {
            case (expr, basicAgg) =>
              aggregations.get((expr, basicAgg)) match {
                case Some(intermediateName) =>
                  resultCount = resultCount + 1
                  val resultName = s"result.$resultCount"
                  Naming(ResolvedFieldReference(intermediateName, expr.typeInfo), resultName)
                case None =>
                  intermediateCount = intermediateCount + 1
                  val intermediateName = s"intermediate.$intermediateCount"
                  intermediateFields += Naming(expr, intermediateName)
                  aggregations((expr, basicAgg)) = intermediateName
                  resultCount = resultCount + 1
                  val resultName = s"result.$resultCount"
                  Naming(ResolvedFieldReference(intermediateName, expr.typeInfo), resultName)
              }
          }
    
          aggregationIntermediates(agg) = intermediateReferences
          // Return a NOP so that we don't add the children of the aggregation
          // to intermediate fields. We already added the necessary fields to the list
          // of intermediate fields.
          NopExpression()
    
        case fa: ResolvedFieldReference =>
          if (!fa.name.startsWith("intermediate")) {
            intermediateFields += Naming(fa, fa.name)
          }
          fa
      }
    }
    ```
    
    Now the aggregations will be expanded to:
    ```
    intermediate = select(1.count as intermediate.1)
    agg = intermediate.aggregate(intermediate.1, sum)
    result = agg.select(intermediate.1 as result.1, intermediate.1 as result.2)
    ```
    
    This also required changing `ExpressionCodeGenerator.scala`, line 111 to:
    ```
    def cleanExpr(e: Expression): Expression = {
      e match {
        case expressions.Naming(namedExpr, _) => cleanExpr(namedExpr)
        case _ => e
      }
    }
    ```
    
    because there is another bug that doesn't allow having nested Namings à la `Naming(Naming(expr, intermediate.1, result.1))`.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---