You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by "Namit Jain (JIRA)" <ji...@apache.org> on 2009/02/12 20:33:02 UTC

[jira] Created: (HIVE-287) [hive] distinct on multiple columns does not work

[hive] distinct on multiple columns does not work
-------------------------------------------------

                 Key: HIVE-287
                 URL: https://issues.apache.org/jira/browse/HIVE-287
             Project: Hadoop Hive
          Issue Type: Bug
          Components: Query Processor
            Reporter: Namit Jain




The following query does not work:


select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886183#action_12886183 ] 

Zheng Shao commented on HIVE-287:
---------------------------------

Hi Arvind, sorry for coming late for the party. I have 2 questions on the new UDAF2 interface:

1. Why do we put the DISTINCT in the information? DISTINCT is currently done by the framework, instead of individual UDAF.
This is good because the logic of removing duplicates are common for all UDAFs.  We do support SUM(DISTINCT val).

2. Why do we special-case "*"? It seems to me that "*" is just a short-cut.  Hive already supports regex-based multi-column specification, so that we can say `abc.*` for all columns with name starting with abc. The compiler should just expand * and give all the columns to the UDAF.

Since COUNT(*) is a special-case in the SQL standard (COUNT(*) is different from COUNT(col) even if the table has a single column col), I think we should just special-case that and replace that with count(1) at some place.

What do you think?

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884125#action_12884125 ] 

Arvind Prabhakar commented on HIVE-287:
---------------------------------------

@John - are you suggesting that the grammar be updated to restrict single star argument with the specific function {{COUNT}}? If not in the grammar where else do you think these restrictions should be coded.

In either case, what other subsystems you think will be impacted by this change and what do you suggest should be the downstream changes to accomodate this?

p.s. I ask these questions to best utilize both our time and reduce the number of back/froths to the extent possible. 

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871494#action_12871494 ] 

Arvind Prabhakar commented on HIVE-287:
---------------------------------------

Modified the implementation as per review feedback - Introduced an abstract base class and reverted the resolvers to extend that instead of directly implementing the new functionality.




> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0
>
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720917#action_12720917 ] 

Namit Jain commented on HIVE-287:
---------------------------------


You need to use an alias for the sub-query

select count(1) from (select product_url, name, count(1) from raw_product group by product_url, name)x

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884134#action_12884134 ] 

John Sichi commented on HIVE-287:
---------------------------------

Yes, if you can do get it to work in the grammar itself (via a choice in the existing "function" rule), that would be best.  If it it's not possible to do it there for some reason, then semantic analyzer.

Could you explain what you mean regarding impact?  Since Hive doesn't support COUNT(*) before your patch, the only impact is on the rest of your patch, right?

BTW, here's the relevant BNF from my copy of the SQL:2003 standard (ISO/IEC 9075-2:2003 part 2 section 10.9), omitting some SQL/OLAP stuff such as <filter clause>:

<aggregate function > ::=
COUNT <left paren> <asterisk> <right paren>
| <general set function>

<general set function> ::=
<set function type> <left paren> [ <set quantifier> ] <value expression> <right paren>

<set function type> ::=
AVG |
MAX |
MIN |
SUM |
...

<set quantifier> ::= DISTINCT | ALL

As you can see, they make a special case for COUNT to allow for star there alone, and they don't allow COUNT(DISTINCT *).



> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvind Prabhakar reassigned HIVE-287:
-------------------------------------

    Assignee: Arvind Prabhakar

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886882#action_12886882 ] 

Zheng Shao commented on HIVE-287:
---------------------------------

Talked with John offline also.

I agree that we can use the new interface going forward. Can you do these also in this patch:
1. Change the comments for the 2 new fields.  It's easy for UDAF writers to assume that the UDAF itself needs to handle whether it's distinct or whether it's all columns.
2. Deprecate the old interface, and move all existing GenericUDAF to inherit from the new one.

{code}
+  /**
+   * @return true if the UDAF invocation was qualified with <tt>DISTINCT</tt>
+   * keyword, false otherwise.
+   */
+  boolean isDistinct();
+
+  /**
+   * @return true if the UDAF invocation was done with a wildcard instead of
+   * explicit parameter list.
+   */
+  boolean isAllColumns();
{code}

After this patch is in, here is a list of follow-ups. Can you open JIRA for these:

1. Let UDAF and UDF support * and regex-based column specification
2. Special-case COUNT(*) because that does not require reading any columns, while MY_UDAF(*) needs all columns.


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Patrick Salami (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720900#action_12720900 ] 

Patrick Salami commented on HIVE-287:
-------------------------------------

Thanks for the reply! However, I get a parse error when I try this query:

hive> select count(1) from (select product_url, name, count(1) from
raw_product group by product_url, name);
FAILED: Parse Error: line 0:-1 mismatched input '<EOF>' expecting
Identifier in group by expression

here is my schema:
hive> describe raw_product;
OK
id      int
merchant_product_id     int
network_id      int
network_merchant_id     int
name    string
brand   string
category_name   string
sku     string
product_url     string
[...]

Thanks!

--Patrick





-- 
___________________________________
Patrick Salami | Senior Software Engineer
Extrabux, Inc | http://www.extrabux.com
patrick@extrabux.com | (858) 449-2241


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvind Prabhakar updated HIVE-287:
----------------------------------

    Attachment: HIVE-287-2.patch

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0
>
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879147#action_12879147 ] 

Namit Jain commented on HIVE-287:
---------------------------------

Arvind, I am sorry about the delay in getting back to this.
Can you regenerate the patch - it is not applying cleanly. 
I will look at it right away

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0
>
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886950#action_12886950 ] 

Arvind Prabhakar commented on HIVE-287:
---------------------------------------

bq. 1. Change the comments for the 2 new fields. It's easy for UDAF writers to assume that the UDAF itself needs to handle whether it's distinct or whether it's all columns.

I updated the javadocs in various places to make this clear.

bq. 2. Deprecate the old interface, and move all existing GenericUDAF to inherit from the new one.

No changes necessary for this - the previously submitted patch also did it. All existing generic UDAFs now extend from the abstract class that implements the new interface. If you see any problems with that let me know.



> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch, HIVE-287-6-branch-0.6.patch, HIVE-287-6-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879994#action_12879994 ] 

John Sichi commented on HIVE-287:
---------------------------------

For DISTINCT:   we can check the function invocation itself (during semantic analysis) by calling supportsDistinct() immediately after instantiating the GenericUDAFEvaluator in SemanticAnalyzer.  This allows strict validation to be performed.  Or make the method name checkDistinct and allow the UDAF to throw the exception itself.  But I agree that in this case it would be cleaner to extend the interface, so I'm fine if we go ahead with that in a non-breaking fashion.

For COUNT(*):  if you think about it, COUNT(*) really means "ignore all columns" not "count all columns".  So I think an empty array actually makes a lot of sense here. Can you think of a case where UDAF(*) even makes sense, where UDAF != COUNT?  If you don't have access to any per-row data, what can you do other than count it?  I'd say we should actually disallow * for anything but COUNT, per the SQL standard.

I like your approach to keeping compatibility via instanceof, so if the decision ends up being to add the extra parameters, then we should definitely use that approach.  However, extension points should always be interfaces (not abstract classes) to allow for stuff like dynamic proxies.  So we would need to add a new interface GenericUDAFResolver2 (extends GenericUDAFResolver) with the new method, and make AbstractGenericUDAFResolver implement both.

Interface evolution is never pretty, but there is an interface design pattern which avoids this particular problem.  Imagine if originally we had defined a GenericUDAFResolverInput class inside of Hive itself, with a method getParameters() returning TypeInfo [].  HIve would instantiate this and pass an input object into getEvaluator, and the evaluator would call input.getParameters().  This would have allowed us to add a boolean isDistinct() method to GenericUDAFResolverInput without breaking anything (source or binary) and without needing to add a new interface; old plugins would not know about isDistinct() so they wouldn't call it, and new ones could.

I would argue that if we're going to go to the trouble of adding GenericUDAFResolver2, then we should build the pattern above into it as well in case we need further evolution later on.

p.s. I'm really glad you're working on this one...every few days I try a count(*) against Hive accidentally and then kick myself.


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886804#action_12886804 ] 

Arvind Prabhakar commented on HIVE-287:
---------------------------------------

I think keeping two different interfaces for UDAFs will lead to confusion in the long run. Thats why the current patch deprecates the old interface in favor of the new one. But if all agree that it is a good idea, then I will go with that.

Also - can you suggest an alternate name for the new interface?

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720511#action_12720511 ] 

Namit Jain commented on HIVE-287:
---------------------------------

can you try:

select count(1) from (select col1, col2, count(1) from tbl group by col1, col2);

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) support count(*) and count distinct on multiple columns

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888106#action_12888106 ] 

Arvind Prabhakar commented on HIVE-287:
---------------------------------------

@John: I updated the UDAF documentation on the wiki at http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF and also added a short blurb regarding the various interface changes on the UDAF tutorial page http://wiki.apache.org/hadoop/Hive/GenericUDAFCaseStudy#Writing_the_source. Please let me know if there are other places that need to be updated as well.

> support count(*) and count distinct on multiple columns
> -------------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch, HIVE-287-6-branch-0.6.patch, HIVE-287-6-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886794#action_12886794 ] 

Zheng Shao commented on HIVE-287:
---------------------------------

The plan looks good to me.

Just one comment: I think we should change the comment/class name for GenericUDAFResolver2.  Let's explicitly say GenericUDAFResolver2 is for UDAFs that want to have control over whether DISTINCT or * should be treated differently. For normal UDAFs, they should still inherite from GenericUDAFResolver.

Does that sound OK?



> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvind Prabhakar updated HIVE-287:
----------------------------------

    Attachment: HIVE-287-6-trunk.patch
                HIVE-287-6-branch-0.6.patch

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch, HIVE-287-6-branch-0.6.patch, HIVE-287-6-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) count distinct on multiple columns does not work

Posted by "Raghotham Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghotham Murthy updated HIVE-287:
----------------------------------

    Description: 

The following query does not work:


select count(distinct col1, col2) from Tbl

  was:


The following query does not work:


select count(distinct col1, col2) from Tbl

        Summary: count distinct on multiple columns does not work  (was: [hive] distinct on multiple columns does not work)

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) count distinct on multiple columns does not work

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-287:
----------------------------

    Status: Open  (was: Patch Available)

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0
>
>         Attachments: HIVE-287-1.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880000#action_12880000 ] 

Arvind Prabhakar commented on HIVE-287:
---------------------------------------

@John: I agree with your assessment above. Regarding the count(*), my earlier comment was not to imply that there exists a UDAF today, but that it might exist in the future. More importantly though, using an empty parameter list as an indicator for * would blur the distinction between UDAF(*) vs UDAF() invocation. This is one way of many perhaps where parameter overloading could lead to confusion and hard to understand code. 

I think introducing {{GenericUDAFResolver2}} interface is a great idea. I also like the idea of using a call back for decoupling the invocation from parameter list but am concerned that this could lead to perhaps redundant method call and object creation. I am not sure if that would add to any significant performance penalty in the long run or not. 

I would love to know what the opinion of others interested in this issue is regarding this route. If all agree that adding a new interface with callback for parameter discovery is acceptable, I can start working on that patch.

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879929#action_12879929 ] 

John Sichi commented on HIVE-287:
---------------------------------

Sorry to chime in late on this one, but I have one big question on this one:  can we instead do it in a way which does not break the UDAF interface?

The existing patch adds a new method to the GenericUDAFResolver interface, meaning all existing plugin implementations outside of the Hive codebase will fail to compile (due to the fact that we did not already have the insulating abstract base class available).  We already have some of these within Facebook.

Let's analyze the two new parameters one by one.

isDistinct:  this doesn't actually modify the choice of evaluator implementation at all, since the actual duplicate elimination takes place upstream of the UDAF invocation.  So instead of adding this parameter, can we instead add a new method supportsDistinct() on GenericUDAFEvaluator?  Then call this after instantiating the new evaluator in order to carry out the additional validation.

isAllColumns:  COUNT(*) is probably the only function which is ever even going to care about this one.  Couldn't we just use an empty array of TypeInfo to indicate all columns?

Independent of the above, I think adding the insulating abstract base should still be done now to make future transitions smoother when interface-breaking is absolutely required.  So keep that part of the patch.


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870860#action_12870860 ] 

Arvind Prabhakar commented on HIVE-287:
---------------------------------------

Thanks for taking a look at this patch Namit. I have some questions and clarifcations regarding your feedback:

bq. 1. This should be independent of COUNT - so, all basically all aggregation functions should be supported with DISTINCT.
For eg: select avg(distinct c1,c2) from T

Not sure how this relates to the change I made. Even before making this change, the DISTINCT qualifier was allowed for any function invocation. Can you elaborate what you mean by this? Specifically, which part of the patch needs to be changed in order to accomodate this request.

bq. 2. It would be a good idea to maintain some compatibility for the existing interface - so, can we add another method to UDAFResolver, which has the new API - and a common class which invokes the default implementation, that would be better.

Here is what I understand your suggestion as: Add a new method to GenericUDAFResolver interface maintaining the old method. Create an abstract base class that implements the new interface method and invokes the old method by dropping isDistinct/isAllColumn arguments. Extend the current resolvers to override this method. Will this address your concern? If not, can you provide a concrete example.

bq. 3. Follows from 1 - more tests are needed

Are you suggesting more tests for array_contains UDF or to add more tests for other UDFs? Please clarify with examples if possible.

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0
>
>         Attachments: HIVE-287-1.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvind Prabhakar updated HIVE-287:
----------------------------------

    Status: Patch Available  (was: Open)

Uploaded patch for trunk and branch-0.6. Ran all the tests on trunk and did spot testing on branch-0.6.

*Changes from Previous patch:*
* Modified the implementation of {{AbstractGenericUDAFResolver}} to raise an exception when invoked with the {{UDAF(STAR)}} syntax.
* Added negative test cases to assert that the current UDAFs present in the code other than {{COUNT}} do not accept the {{UDAF(STAR)}} syntax.
* Added {{EXPLAIN}} directives for the queries run in {{udf_count.q}} test file.

Will attempt to post the patch on review board as well.


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885766#action_12885766 ] 

Arvind Prabhakar commented on HIVE-287:
---------------------------------------

Review board review posted:

http://review.hbase.org/r/275/


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885788#action_12885788 ] 

HBase Review Board commented on HIVE-287:
-----------------------------------------

Message from: "Carl Steinbach" <ca...@cloudera.com>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/275/
-----------------------------------------------------------

(Updated 2010-07-06 19:24:53.267440)


Review request for Hive Developers and John Sichi.


Changes
-------

Updated 'Bugs' field in review request.


Summary
-------

This is the patch revision 5 for HIVE-287. Please see the Jira comments for full description.


This addresses bug HIVE-287.
    http://issues.apache.org/jira/browse/HIVE-287


Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 4109103 
  ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 7e6e63e 
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 252b89d 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/AbstractGenericUDAFResolver.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFAverage.java 0ef4734 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFBridge.java 916eb33 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCount.java 0054664 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java 8d22ef1 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMax.java 6785687 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMin.java 051f3a1 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFParameterInfo.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFResolver.java 9888b52 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFResolver2.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFSum.java ce97afd 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFVariance.java 26dc84c 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/SimpleGenericUDAFParameterInfo.java PRE-CREATION 
  ql/src/test/queries/clientnegative/invalid_avg_syntax.q PRE-CREATION 
  ql/src/test/queries/clientnegative/invalid_max_syntax.q PRE-CREATION 
  ql/src/test/queries/clientnegative/invalid_min_syntax.q PRE-CREATION 
  ql/src/test/queries/clientnegative/invalid_std_syntax.q PRE-CREATION 
  ql/src/test/queries/clientnegative/invalid_stddev_samp_syntax.q PRE-CREATION 
  ql/src/test/queries/clientnegative/invalid_sum_syntax.q PRE-CREATION 
  ql/src/test/queries/clientnegative/invalid_var_samp_syntax.q PRE-CREATION 
  ql/src/test/queries/clientnegative/invalid_variance_syntax.q PRE-CREATION 
  ql/src/test/queries/clientpositive/udf_count.q 2d1510f 
  ql/src/test/results/clientnegative/invalid_avg_syntax.q.out PRE-CREATION 
  ql/src/test/results/clientnegative/invalid_max_syntax.q.out PRE-CREATION 
  ql/src/test/results/clientnegative/invalid_min_syntax.q.out PRE-CREATION 
  ql/src/test/results/clientnegative/invalid_std_syntax.q.out PRE-CREATION 
  ql/src/test/results/clientnegative/invalid_stddev_samp_syntax.q.out PRE-CREATION 
  ql/src/test/results/clientnegative/invalid_sum_syntax.q.out PRE-CREATION 
  ql/src/test/results/clientnegative/invalid_var_samp_syntax.q.out PRE-CREATION 
  ql/src/test/results/clientnegative/invalid_variance_syntax.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/udf_count.q.out 2adffd8 

Diff: http://review.hbase.org/r/275/diff


Testing
-------

Ran all tests on trunk. 


Thanks,

Arvind




> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvind Prabhakar updated HIVE-287:
----------------------------------

    Attachment: HIVE-287-4.patch

applies cleanly on trunk and branch-0.6

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) support count(*) and count distinct on multiple columns

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sichi updated HIVE-287:
----------------------------

        Summary: support count(*) and count distinct on multiple columns  (was: count distinct on multiple columns does not work)
    Description: 
The following query does not work:


select count(distinct col1, col2) from Tbl

  was:

The following query does not work:


select count(distinct col1, col2) from Tbl


> support count(*) and count distinct on multiple columns
> -------------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch, HIVE-287-6-branch-0.6.patch, HIVE-287-6-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) support count(*) and count distinct on multiple columns

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891889#action_12891889 ] 

Arvind Prabhakar commented on HIVE-287:
---------------------------------------

Updated the wiki in the all the above places.

> support count(*) and count distinct on multiple columns
> -------------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch, HIVE-287-6-branch-0.6.patch, HIVE-287-6-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886780#action_12886780 ] 

John Sichi commented on HIVE-287:
---------------------------------

I'll see if I can clear it up with Zheng today, and if so, we can just commit as is; otherwise let's meet.

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Patrick Salami (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720500#action_12720500 ] 

Patrick Salami commented on HIVE-287:
-------------------------------------

confirmed: happened to me as well; anyone have any workarounds?

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884770#action_12884770 ] 

John Sichi commented on HIVE-287:
---------------------------------

A couple of DBMS's I just tried actually do use something equivalent to that grammar snippet I quoted.

{noformat}
mysql> select sum(*) from TBLS;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '*) from TBLS' at line 1
{noformat}

{noformat}
0: jdbc:luciddb:> select sum(*) from sys_root.dba_tables;  
Error: At line 1, column 12: Unknown identifier '*' (state=,code=0)
{noformat}

But I'm fine with leaving the grammar as B and doing the check inside of SemanticAnalyzer instead as long as your negative tests demonstrate that the diagnostics are good if a user tries something like sum(*).

However, for this part:
"Since you feel strongly about not supporting this syntax by default for any function, I can perhaps modify the AbstractGenericUDAFResolver class to raise an exception if invoked with this syntax. That way, only the functions that choose to overwrite that behavior will be able to support it."

It's not just a matter of not supporting it by default.  I don't think we should support it at all, so let's prevent inside of SemanticAnalyzer and then not expose it at the resolver level at all.  If in early days, the SQL people had just done the smart thing and made the "count all" expression COUNT() instead of COUNT(*), we wouldn't even be having this discussion.


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) count distinct on multiple columns does not work

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sichi updated HIVE-287:
----------------------------

    Status: Open  (was: Patch Available)

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886428#action_12886428 ] 

John Sichi commented on HIVE-287:
---------------------------------

Regarding DISTINCT:  I agree with Arvind; this information should be provided to the UDAF so that it can reject invocations that don't make sense.  Once this validation is passed, the distinct elimination is still implemented generically inside of Hive (upstream of the UDAF).

Regarding F(*):  let's discriminate three cases.

COUNT(*):  this really means COUNT(), not COUNT(x,y,z).  This is a very important distinction to make from an optimizer perspective, because we want to be able to push down projection to avoid I/O and other processing for columns whose values we will never look at.

SUM(*) and similar ones:  these we should disallow.

MY_UDAF(*), or MY_UDAF(t.*):  this is similar to Pradeep's case that came up recently on the mailing list, and it needs to expand to MY_UDAF(x,y,z), not MY_UDAF().  I think the patch is currently doing MY_UDAF(), which isn't what he wants.

My recommendation is that we commit Arvind's patch as is, then create a followup JIRA issue to do what Pradeep is looking for (the expansion of * in the semantic analyzer) for both UDF and UDAF, but with a special case for COUNT. UDAF authors will be able to decide whether or not to reject the star syntax, since in the common case of a UDAF expecting a limited number of parameters, the star won't make sense.


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) support count(*) and count distinct on multiple columns

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sichi updated HIVE-287:
----------------------------

          Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]
    Release Note: This change deprecates the GenericUDAFResolver interface in favor of the new GenericUDAFResolver2.
      Resolution: Fixed

Committed to trunk and 0.6.  Thanks Arvind!


> support count(*) and count distinct on multiple columns
> -------------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch, HIVE-287-6-branch-0.6.patch, HIVE-287-6-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) support count(*) and count distinct on multiple columns

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-287:
--------------------------------

    Fix Version/s:     (was: 0.7.0)

> support count(*) and count distinct on multiple columns
> -------------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0
>
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch, HIVE-287-6-branch-0.6.patch, HIVE-287-6-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Raghotham Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680731#action_12680731 ] 

Raghotham Murthy commented on HIVE-287:
---------------------------------------

results in a compiler error:

FAILED: Error in semantic analysis: line 2:7 Function Argument Type Mismatch count: Looking for UDAF Evaluator"count" with parameters [class java.lang.String, class java.lang.String]


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886339#action_12886339 ] 

Arvind Prabhakar commented on HIVE-287:
---------------------------------------

@Zheng: Welcome to the party.

bq. Why do we put the DISTINCT in the information? DISTINCT is currently done by the framework, instead of individual UDAF. This is good because the logic of removing duplicates are common for all UDAFs. We do support SUM(DISTINCT val).

Providing the information in the parameter specification is not the same as enforcing its interpretation. This is provided primarily to ensure that UDAFs that rely on this information can make appropriate decisions. For example, we wanted to disallow the invocation {{COUNT( EXPR1, EXPR2 ...)}} in favor of {{COUNT(*DISTINCT* EXPR1, EXPR2 ...)}}. Without this information, the count UDAF will not be able to enforce the later syntax.

bq. Why do we special-case ""? It seems to me that "" is just a short-cut. Hive already supports regex-based multi-column specification, so that we can say `abc.*` for all columns with name starting with abc. The compiler should just expand * and give all the columns to the UDAF.

If you wish to use \* as a regular expression, you would have to quote it as a string - {{COUNT('\*')}}. This is different from the invocation as specified in SQL which treats \* as a terminal symbol. So if it is OK to deviate from the standard representation, the user can easily use the quoted string representation to achieve the effect similar to {{COUNT(col1, col2 ..)}}. The semantics of this should be more like {{COUNT(DISTINCT EXPR1, EXPR2 ...)}} as opposed to {{COUNT(\*)}}.

bq. Since COUNT(\*) is a special-case in the SQL standard (COUNT(\*) is different from COUNT(col) even if the table has a single column col), I think we should just special-case that and replace that with count(1) at some place.

Are you suggesting that we allow the grammar to express {{COUNT(\*)}} syntax, but in the lexical analysis stage turn it into a {{COUNT(1)}}? I can see how that may work - but personally I am not a fan of such an approach. 

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882792#action_12882792 ] 

John Sichi commented on HIVE-287:
---------------------------------

Sorry, my commit for HIVE-1387 just introduced a new conflict, so one more rebase is required.

I still don't think that we should allow anything other than COUNT(*) in the parser.  SUM(*) should be disallowed, as should COUNT(DISTINCT *), as should MY_UDF(*).  They are non-standard, and I think they would just cause confusion since COUNT(*) ignores the column values rather than doing anything with them.  Together with preventing these, add negative tests to verify that they are rejected.

Given that, we can get rid of the isAllColumns info.

Also, can you add an EXPLAIN for the COUNT(*) query in the test?

After that, I think we'll be good to go.


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvind Prabhakar updated HIVE-287:
----------------------------------

    Attachment: HIVE-287-1.patch

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) support count(*) and count distinct on multiple columns

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887258#action_12887258 ] 

John Sichi commented on HIVE-287:
---------------------------------

Arvind, could you update relevant documentation in wiki?

Besides the SQL reference, Mayank Lahiri is working on some documentation for writing new UDAF's, so the resolver interface can probably go there: 

http://wiki.apache.org/hadoop/Hive/GenericUDAFCaseStudy


> support count(*) and count distinct on multiple columns
> -------------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch, HIVE-287-6-branch-0.6.patch, HIVE-287-6-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) support count(*) and count distinct on multiple columns

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887261#action_12887261 ] 

John Sichi commented on HIVE-287:
---------------------------------

Followup created in HIVE-1459.  Arvind, this is a good one to work on next.


> support count(*) and count distinct on multiple columns
> -------------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch, HIVE-287-6-branch-0.6.patch, HIVE-287-6-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) support count(*) and count distinct on multiple columns

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893045#action_12893045 ] 

Jeff Hammerbacher commented on HIVE-287:
----------------------------------------

Okay I spoke too soon. Turns out the Apache wiki just takes about 5 minutes to save an edit; the disclaimers are there now.

> support count(*) and count distinct on multiple columns
> -------------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch, HIVE-287-6-branch-0.6.patch, HIVE-287-6-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvind Prabhakar updated HIVE-287:
----------------------------------

    Attachment: HIVE-287-5-trunk.patch
                HIVE-287-5-branch-0.6.patch

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884699#action_12884699 ] 

Ashish Thusoo commented on HIVE-287:
------------------------------------

@John

Another disadvantage for doing C that I can think of is the fact that count would become a keyword and then any column names would have to be quoted. Not a big deal but just something that would be a side effect of going with C.


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881020#action_12881020 ] 

Arvind Prabhakar commented on HIVE-287:
---------------------------------------

@John: Can you please take a look at the updated patch? Let me know if you have any feedback for further tweaking this change as necessary.

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) support count(*) and count distinct on multiple columns

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886980#action_12886980 ] 

John Sichi commented on HIVE-287:
---------------------------------

+1.  Will commit if tests pass.


> support count(*) and count distinct on multiple columns
> -------------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch, HIVE-287-6-branch-0.6.patch, HIVE-287-6-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvind Prabhakar updated HIVE-287:
----------------------------------

    Status: Patch Available  (was: Open)

Submitting the regenerated patch with lastest trunk image. Patch file is HIVE-287-3.patch.

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvind Prabhakar updated HIVE-287:
----------------------------------

           Status: Patch Available  (was: Open)
    Fix Version/s: 0.6.0

*Summary*
This patch fixes the {{count()}} aggregate function to be consistent with SQL. Specifically:
* Provides support for {{SELECT count(*) FROM table}} queries, where it returns the total number of rows of the table.
* Also extended the support for {{count()}} to include multiple expression list. {{count(DISTINCT expr1, exp2,...)}} returns the number of non-NULL and different valued rows from the evaluated expressions.

*Details*
* Modified HiveQL grammar to allow function invocation with a single * in place of parameter list.
* Propagated the presence of * as parameter or specification of {{DISTINCT}} keyword in the UDF resolver framework so that it can be used by UDFs that behave differently when these are applicable.
* Modified the {{count()}} UDAF to support the same semantics of handling NULL values as regular SQL.
* Added test case to specifically exercise the newly introduced semantics of the count UDAF.

*Testing*
Ran all tests. Noted only two failures (input20.q, input33.q) which were found to be failing on the local trunk image as well.

If and when this patch is committed to the trunk, I will go ahead and update the Hive Wiki with details and examples regarding the use of {{count()}} UDAF in various forms.


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0
>
>         Attachments: HIVE-287-1.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884580#action_12884580 ] 

Arvind Prabhakar commented on HIVE-287:
---------------------------------------

Thanks for the explanation John. The SQL BNF that you pointed out is the normative SQL specification. I do not think any SQL implementations use this grammar though. The parallel is that of an interface and its implementation. While the interface can be short and precise, the implementations may choose to delegate interface methods to other implementation specific methods. Similarly, most databases deal with their own SQL grammar that is compliant with the SQL standard at specific levels.

More to the point in Hive - my key concern is that by modifying the grammar to make an exception for {{COUNT}}, we will be introducing a brittle coupling between the the parser and semantic analyzer. Right now the count aggregate function is treated like any other function and is thus part of the general framework. By making this change, we will be modifying it to be specifically associated from with the grammar directives. 

This is the current function definition in Hive QL grammar (*A*):

{noformat}
-->[ functionName ]-->[ LPAREN ]--+-->[ KW_DISTINCT ]--+--+--+-->[ expression ]--+--+-->[ RPAREN ]-->
                                  |                    |  |  |                   |  |
                                  +----------->--------+  |  +------[ COMMA ]<---+  |
                                                          |                         |
                                                          +------------->-----------+
{noformat}

The patch that I have supplied already on this Jira modifies this definition as follows (*B*):

{noformat}

-->[ functionName ]-->[ LPAREN ]--+------------------------>[ STAR ]----------------------+-->[ RPAREN ]-->
                                  |                                                       |
                                  +--+-->[ KW_DISTINCT ]--+--+--+-->[ expression ]--+--+--+
                                     |                    |  |  |                   |  |
                                     +----------->--------+  |  +------[ COMMA ]<---+  |
                                                             |                         |
                                                             +------------->-----------+
{noformat}

If I were to modify the grammar to make an exception for {{COUNT}} it will likely be changed to something like this (*C*):

{noformat}

--+---------------------------------->[ KW_COUNT ]-->[ LPAREN ]-->[ STAR ]-->[ RPAREN ]----------------+-->
  |                                                                                                    |
  +-->[ functionName ]-->[ LPAREN ]--+--+-->[ KW_DISTINCT ]--+--+-->[ expression ]--+--+-->[ RPAREN ]--+
                                     |  |                    |  |                   |  |
                                     |  +----------->--------+  +----[ COMMA ]<-----+  |
                                     |                                                 |
                                     +------------------------->-----------------------+

{noformat}

Consider the *C* approach closely: The production that matches a {{COUNT}} invocation can be directly matched via the top branch using {{KW_COUNT}} token, or it could follow the branch below where {{functionName}} could match {{COUNT}}. On the semantic analyzer side, it makes the matching logic more complex and less intuitive since now the {{COUNT}} can be invoked via two branch conditions. For example - there would be one invocation that would directly delegate to the {{COUNT}} aggregate function, whereas another that will use the current resolver mechanism to invoke it.

Instead, the approach *B* keeps the grammar consistent with the regular function invocation. It does not favor any one function over the other and simply establishes matching rules for function production. That way, the call is then delegated to the semantic analyzer which in turn matches the appropriate handling function based on the name and parameter type using the generic resolver mechanism without regard to what function is being invoked. 

The changes supplied in the current patch also allow individual function handlers to decide if they would like to support the {{functionName(STAR)}} syntax. Since you feel strongly about not supporting this syntax by default for any function, I can perhaps modify the {{AbstractGenericUDAFResolver}} class to raise an exception if invoked with this syntax. That way, only the functions that choose to overwrite that behavior will be able to support it. Also as you can see in the syntax diagram for (B), there is no production that will match things like {{functionName(DISTINCT STAR)}} or {{functionName(STAR, EXPR)}} etc.

That said, I am open to going either way. I just wanted to put my point across so that you could consider it and make an informed decision. Let me know what you think.



> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870789#action_12870789 ] 

Namit Jain commented on HIVE-287:
---------------------------------

Overall looks good, some minor comments:

1. This should be independent of COUNT - so, all basically all aggregation functions should be supported with DISTINCT.
    For eg: select avg(distinct c1,c2) from T

 and so on.

2. It would be a good idea to maintain some compatibility for the existing interface - so, can we add another method to UDAFResolver, which
    has the new API - and a common class which invokes the default implementation, that would be better.

3. Follows from 1 - more tests are needed

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0
>
>         Attachments: HIVE-287-1.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871310#action_12871310 ] 

Namit Jain commented on HIVE-287:
---------------------------------


I am sorry about 1 and 3 - you can ignore those.



2. It would be a good idea to maintain some compatibility for the existing interface - so, can we add another method to UDAFResolver, which has the new API - and a common class which invokes the default implementation, that would be better.

Here is what I understand your suggestion as: Add a new method to GenericUDAFResolver interface maintaining the old method. Create an abstract base class that implements the new interface method and invokes the old method by dropping isDistinct/isAllColumn arguments. Extend the current resolvers to override this method. Will this address your concern? If not, can you provide a concrete example.


-- yes





> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0
>
>         Attachments: HIVE-287-1.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) support count(*) and count distinct on multiple columns

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sichi updated HIVE-287:
----------------------------

        Fix Version/s: 0.6.0
                       0.7.0
    Affects Version/s: 0.6.0

> support count(*) and count distinct on multiple columns
> -------------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch, HIVE-287-6-branch-0.6.patch, HIVE-287-6-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879983#action_12879983 ] 

Arvind Prabhakar commented on HIVE-287:
---------------------------------------

@John: Thanks for reviewing this change. I have some follow-up comments and suggestions:

bq. isDistinct: this doesn't actually modify the choice of evaluator implementation at all, since the actual duplicate elimination takes place upstream of the UDAF invocation. So instead of adding this parameter, can we instead add a new method supportsDistinct() on GenericUDAFEvaluator? 

While the evaluation may be happening upstream, I was concerned that it does not exclude the cases where this information is relevant to the function invocation itself. For example, the implementation of {{count}} requires that if there is a valid argument list, it must be qualified with {{DISTINCT}}.

bq. isAllColumns: COUNT is probably the only function which is ever even going to care about this one. Couldn't we just use an empty array of TypeInfo to indicate all columns?

I had a similar idea, but after some consideration opted for a simpler design. I felt that overloading arguments to indicate special cases might lead to confusion and eventual problem when a use-case emerges that invalidates this assumption. 

I do agree with your point that it will be good to stay compatible if possible. One way to do it would be as follows:

# Revert the {{GenericUDAFResolver}} to its previous state but make the interface deprecated in favor of the abstract base class.
# Push the newly introduced method into {{AbstractGenericUDAFResolver}} implementation.
# Modify {{FunctionRegistry.getGenericUDAFEvaluator()}} method to test the resolver instance to be type compatible with {{AbstractGenericUDAFResolver}} and if so, invoke the new method. Otherwise revert to the old mechanism.

What do you think about this approach?


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880325#action_12880325 ] 

John Sichi commented on HIVE-287:
---------------------------------

The object creation and method call here are negligible since this is in the semantic analyzer context (not an execution codepath).


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) support count(*) and count distinct on multiple columns

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886977#action_12886977 ] 

HBase Review Board commented on HIVE-287:
-----------------------------------------

Message from: "John Sichi" <js...@facebook.com>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/290/
-----------------------------------------------------------

Review request for Hive Developers.


Summary
-------

(hopefully) final review by JVS


This addresses bug HIVE-287.
    http://issues.apache.org/jira/browse/HIVE-287


Diffs
-----

  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 961987 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 961987 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 961987 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/AbstractGenericUDAFResolver.java PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFAverage.java 961987 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFBridge.java 961987 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCount.java 961987 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java 961987 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMax.java 961987 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMin.java 961987 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFParameterInfo.java PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFResolver.java 961987 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFResolver2.java PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFSum.java 961987 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFVariance.java 961987 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/SimpleGenericUDAFParameterInfo.java PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_avg_syntax.q PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_max_syntax.q PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_min_syntax.q PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_std_syntax.q PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_stddev_samp_syntax.q PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_sum_syntax.q PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_var_samp_syntax.q PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_variance_syntax.q PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientpositive/udf_count.q 961987 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_avg_syntax.q.out PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_max_syntax.q.out PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_min_syntax.q.out PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_std_syntax.q.out PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_stddev_samp_syntax.q.out PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_sum_syntax.q.out PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_var_samp_syntax.q.out PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_variance_syntax.q.out PRE-CREATION 
  http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientpositive/udf_count.q.out 961987 

Diff: http://review.hbase.org/r/290/diff


Testing
-------


Thanks,

John




> support count(*) and count distinct on multiple columns
> -------------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch, HIVE-287-6-branch-0.6.patch, HIVE-287-6-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) support count(*) and count distinct on multiple columns

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888737#action_12888737 ] 

John Sichi commented on HIVE-287:
---------------------------------

Searching for "count" in wiki, I think examples in these pages should change to count(*):

http://wiki.apache.org/hadoop/Hive/LanguageManual/GroupBy
http://wiki.apache.org/hadoop/Hive/GettingStarted
http://wiki.apache.org/hadoop/Hive/Tutorial



> support count(*) and count distinct on multiple columns
> -------------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch, HIVE-287-6-branch-0.6.patch, HIVE-287-6-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884776#action_12884776 ] 

John Sichi commented on HIVE-287:
---------------------------------

@Ashish:  given where we are currently, the keyword concern is valid.

But, I think we might be doing our users a disservice in the long run letting them use standard SQL reserved keywords as unquoted identifiers.  This just causes interoperability confusion since most other systems don't support this, and it makes data portability harder.

(I'm not proposing any immediate action on this, just pointing it out.)


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886769#action_12886769 ] 

Arvind Prabhakar commented on HIVE-287:
---------------------------------------

I vote for a meeting to hash this out face-to-face. I am willing to modify the patch provided we all are in agreement as to how it should be changed. It will be much better use of everyone's time to avoid the numerous deltas to the patch before settling in on the final solution. Please let me know what you think.

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) support count(*) and count distinct on multiple columns

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893044#action_12893044 ] 

Jeff Hammerbacher commented on HIVE-287:
----------------------------------------

We ran into some users who were using older versions of Hive (including the version of Hive distributed with CDH3b2) and were confused by the COUNT(*) in the documentation. I've tried to change the wiki but it does not appear to be saving my changes. If someone gets the chance, I'd love to see a disclaimer like "Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]], you'll need to use COUNT(1) in place of COUNT(*)." added to the wiki pages that reference COUNT(*).

> support count(*) and count distinct on multiple columns
> -------------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch, HIVE-287-6-branch-0.6.patch, HIVE-287-6-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (HIVE-287) count distinct on multiple columns does not work

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884134#action_12884134 ] 

John Sichi edited comment on HIVE-287 at 6/30/10 10:07 PM:
-----------------------------------------------------------

Yes, if you can do get it to work in the grammar itself (via a choice in the existing "function" rule), that would be best.  If it it's not possible to do it there for some reason, then semantic analyzer.

Could you explain what you mean regarding impact?  Since Hive doesn't support COUNT(*) before your patch, the only impact is on the rest of your patch, right?

BTW, here's the relevant BNF from my copy of the SQL:2003 standard (ISO/IEC 9075-2:2003 part 2 section 10.9), omitting some SQL/OLAP stuff such as <filter clause>:

{noformat}
<aggregate function > ::=
COUNT <left paren> <asterisk> <right paren>
| <general set function>

<general set function> ::=
<set function type> <left paren> [ <set quantifier> ] <value expression> <right paren>

<set function type> ::=
AVG |
MAX |
MIN |
SUM |
...

<set quantifier> ::= DISTINCT | ALL
{noformat}

As you can see, they make a special case for COUNT to allow for star there alone, and they don't allow COUNT(DISTINCT *).



      was (Author: jvs):
    Yes, if you can do get it to work in the grammar itself (via a choice in the existing "function" rule), that would be best.  If it it's not possible to do it there for some reason, then semantic analyzer.

Could you explain what you mean regarding impact?  Since Hive doesn't support COUNT(*) before your patch, the only impact is on the rest of your patch, right?

BTW, here's the relevant BNF from my copy of the SQL:2003 standard (ISO/IEC 9075-2:2003 part 2 section 10.9), omitting some SQL/OLAP stuff such as <filter clause>:

<aggregate function > ::=
COUNT <left paren> <asterisk> <right paren>
| <general set function>

<general set function> ::=
<set function type> <left paren> [ <set quantifier> ] <value expression> <right paren>

<set function type> ::=
AVG |
MAX |
MIN |
SUM |
...

<set quantifier> ::= DISTINCT | ALL

As you can see, they make a special case for COUNT to allow for star there alone, and they don't allow COUNT(DISTINCT *).


  
> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvind Prabhakar updated HIVE-287:
----------------------------------

    Status: Patch Available  (was: Open)

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0
>
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) count distinct on multiple columns does not work

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvind Prabhakar updated HIVE-287:
----------------------------------

    Attachment: HIVE-287-3.patch

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-287) count distinct on multiple columns does not work

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-287:
----------------------------

    Status: Open  (was: Patch Available)

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>             Fix For: 0.6.0
>
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.