Posted to dev@hive.apache.org by "Alexis Rondeau (JIRA)" <ji...@apache.org> on 2009/05/06 19:26:30 UTC

[jira] Created: (HIVE-474) Support for distinct selection on two or more columns

Support for distinct selection on two or more columns
-----------------------------------------------------

                 Key: HIVE-474
                 URL: https://issues.apache.org/jira/browse/HIVE-474
             Project: Hadoop Hive
          Issue Type: Improvement
          Components: Query Processor
            Reporter: Alexis Rondeau


The ability to select distinct values of several individual columns, for example: 

select count(distinct user), count(distinct session) from actions;   

Currently returns the following failure: 

FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user
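
For reference, the result the query above is expected to produce can be sketched outside Hive. This is only a small Python illustration of the semantics, not Hive code; the sample rows and the helper name count_distinct_per_column are made up for the example.

```python
def count_distinct_per_column(rows, columns):
    """Return {column: number of distinct non-NULL values} in one pass over rows."""
    seen = {col: set() for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is not None:  # COUNT(DISTINCT x) does not count NULLs
                seen[col].add(value)
    return {col: len(values) for col, values in seen.items()}


# Hypothetical contents of the "actions" table.
actions = [
    {"user": "u1", "session": "s1"},
    {"user": "u1", "session": "s2"},
    {"user": "u2", "session": "s3"},
    {"user": "u2", "session": None},
]

result = count_distinct_per_column(actions, ["user", "session"])
```

Each column gets its own set of seen values, which is the independence between the two distinct counts that the single-column restriction does not allow.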

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by jian yi <ey...@gmail.com>.
Hi Liu,

How did you implement support for distinct selection on two or more columns?

Regards
Jian

2010/2/26 Liu (JIRA) <ji...@apache.org>

>
>    [
> https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838683#action_12838683]
>
> Liu commented on HIVE-474:
> --------------------------
>
> We have implemented this feature using a union type, the approach Zheng
> referred to as "A2".
>
> > Support for distinct selection on two or more columns
> > -----------------------------------------------------
> >
> >                 Key: HIVE-474
> >                 URL: https://issues.apache.org/jira/browse/HIVE-474
> >             Project: Hadoop Hive
> >          Issue Type: Improvement
> >          Components: Query Processor
> >            Reporter: Alexis Rondeau
> >
> > The ability to select distinct several, individual columns as by example:
> > select count(distinct user), count(distinct session) from actions;
> > Currently returns the following failure:
> > FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user


-- 
Hadoop Forum: http://bbs.hadoopor.com

[jira] Updated: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-474:
-----------------------------------------

    Attachment: patch-474-1.txt

Patch incorporates Namit's comments.

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474-1.txt, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Updated: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-474:
-----------------------------------------

    Attachment: patch-474-2.txt

> Not a good idea to ignore skew for multiple distincts.
I agree.

The updated patch throws an error when there are multiple distincts with skew in the data. It also adds negative test cases.


> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474-1.txt, patch-474-2.txt, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923602#action_12923602 ] 

Namit Jain commented on HIVE-474:
---------------------------------

1. add initEvaluators() in Operator.java instead of ReduceSinkOperator.java
2. ReduceSinkDesc: use numKeys and getNumKeys() or change numKeys to numDistributionKeys -
   You may run into problems with serialization/deserialization
3. Add some comments in initEvaluatorsAndReturnStruct in ReduceSinkOperator
   -- explain that it is same as parent in case of no union for groupby
4. Can you add more comments in GroupByOperator and SemanticAnalyzer as well?
   It looks OK, but more comments would help.

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Updated: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Mafish (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mafish updated HIVE-474:
------------------------

    Attachment: hive-474.0.4.2rc.patch

for 0.4.2rc

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>         Attachments: hive-474.0.4.2rc.patch
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Updated: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-474:
----------------------------

    Status: Open  (was: Patch Available)

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474-1.txt, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Updated: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-474:
-----------------------------------------

    Status: Patch Available  (was: Open)

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474-1.txt, patch-474-2.txt, patch-474-3.txt, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Mafish (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851261#action_12851261 ] 

Mafish commented on HIVE-474:
-----------------------------

I have uploaded a patch based on 0.4.2rc; please review it.

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>         Attachments: hive-474.1.patch
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Updated: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Mafish (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mafish updated HIVE-474:
------------------------

    Attachment: hive-474.1.patch

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>         Attachments: hive-474.1.patch
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by "John Sichi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921147#action_12921147 ] 

John Sichi commented on HIVE-474:
---------------------------------

I think you'll see the problem if you make sure the key/value entries in each row of your src table are uncorrelated.  If you're using data/files/kv1.txt, the value is just the key prefixed by "val_".
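
One way to read this advice: when the second column is a one-to-one function of the first, both columns have the same number of distinct values, so an implementation that counts the wrong column still returns the right numbers. The Python sketch below illustrates that with a deliberately broken counter. The bug shown is hypothetical and is not a claim about what the patch actually got wrong; the sample rows are made up.

```python
def count_distinct(rows, index):
    """Number of distinct values in the given column of a list of tuples."""
    return len({row[index] for row in rows})


def buggy_second_count(rows):
    # Hypothetical bug: asked for column 1, but counts column 0 instead.
    return count_distinct(rows, 0)


# Correlated data in the style of kv1.txt: value is the key prefixed by "val_".
correlated = [(k, "val_" + k) for k in ["1", "2", "3", "3"]]

# Uncorrelated data: the value no longer follows from the key.
uncorrelated = [("1", "x"), ("2", "x"), ("3", "y"), ("3", "y")]

# On correlated data the bug is invisible, on uncorrelated data it shows up.
hidden = buggy_second_count(correlated) == count_distinct(correlated, 1)
exposed = buggy_second_count(uncorrelated) != count_distinct(uncorrelated, 1)
```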


> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Min Zhou (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719368#action_12719368 ] 

Min Zhou commented on HIVE-474:
-------------------------------

I think there is another special case here.  If the query has multiple distinct operations on the same column, we can push the evaluation of those expressions down into the reducers.

Query:
  select a, count(distinct if(condition, b, null)) as col1, count(distinct if(!condition, null, b)) as col2, count(distinct b) as col3

Plan:
  Job :
    Map side:
      Emit: distribution_key: a, sort_key: a, b, value: nothing
    Reduce side:
      Group By
        a,  count col1, col2, col3 by evaluating their expressions
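
The plan above can be sketched in Python as a rough illustration, not as Hive code. It assumes that each distinct expression depends only on b and returns either b itself or NULL, as the if(...) expressions in the query do; under that assumption a plain counter per expression is enough once equal b values are adjacent. The function name, the sample rows and the condition b > 1 are made up for the example.

```python
from itertools import groupby


def reduce_same_column_distincts(rows, expressions):
    """rows: (a, b) pairs; expressions: functions mapping b to b or None.

    Returns {a: [distinct count for each expression]}.
    """
    # Stand-in for the shuffle: sort by (a, b) so equal b values are adjacent.
    ordered = sorted(rows)
    result = {}
    for a, group in groupby(ordered, key=lambda row: row[0]):
        counts = [0] * len(expressions)
        # Visit each distinct b of this group exactly once.
        for b, _ in groupby(group, key=lambda row: row[1]):
            for i, expr in enumerate(expressions):
                if expr(b) is not None:  # NULL results are not counted
                    counts[i] += 1
        result[a] = counts
    return result


def condition(b):
    return b > 1  # hypothetical predicate


expressions = [
    lambda b: b if condition(b) else None,      # if(condition, b, null)
    lambda b: None if not condition(b) else b,  # if(!condition, null, b)
    lambda b: b,                                # plain distinct b
]

rows = [("a1", 1), ("a1", 2), ("a1", 2), ("a1", 5), ("a2", 4)]
result = reduce_same_column_distincts(rows, expressions)
```

Because the sort key already includes b, no extra job or per-expression shuffle is needed; every expression is evaluated on the same stream of distinct b values.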

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925081#action_12925081 ] 

Namit Jain commented on HIVE-474:
---------------------------------

+1

Otherwise, the changes look good

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474-1.txt, patch-474-2.txt, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Updated: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-474:
----------------------------

    Status: Open  (was: Patch Available)

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Updated: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-474:
-----------------------------------------

    Attachment: patch-474-3.txt

Patch is updated to trunk.

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474-1.txt, patch-474-2.txt, patch-474-3.txt, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Assigned: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao reassigned HIVE-474:
-------------------------------

    Assignee: Mafish

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Mafish
>         Attachments: hive-474.0.4.2rc.patch
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Updated: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-474:
-----------------------------------------

    Status: Patch Available  (was: Open)

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474-1.txt, patch-474-2.txt, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Updated: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-474:
----------------------------

    Status: Open  (was: Patch Available)

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474-1.txt, patch-474-2.txt, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Issue Comment Edited: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Min Zhou (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719368#action_12719368 ] 

Min Zhou edited comment on HIVE-474 at 6/14/09 7:02 PM:
--------------------------------------------------------

I think there is another special case here.  If the query has multiple distinct operations on the same column, we can push the evaluation of those expressions down into the reducers.
{code}
Query:
  select a, count(distinct if(condition, b, null)) as col1, count(distinct if(!condition, null, b)) as col2, count(distinct b) as col3

Plan:
  Job :
    Map side:
      Emit: distribution_key: a, sort_key: a, b, value: nothing
    Reduce side:
      Group By
        a,  count col1, col2, col3 by evaluating their expressions
{code}

      was (Author: coderplay):
    I think there is another special case here.  If the query has multiple distinct operations on the same column, we can push the evaluation of those expressions down into the reducers.

Query:
  select a, count(distinct if(condition, b, null)) as col1, count(distinct if(!condition, null, b)) as col2, count(distinct b) as col3

Plan:
  Job :
    Map side:
      Emit: distribution_key: a, sort_key: a, b, value: nothing
    Reduce side:
      Group By
        a,  count col1, col2, col3 by evaluating their expressions
  
> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by "John Sichi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910404#action_12910404 ] 

John Sichi commented on HIVE-474:
---------------------------------

We'll take a look at this one.

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Mafish
>         Attachments: hive-474.0.4.2rc.patch
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924719#action_12924719 ] 

Namit Jain commented on HIVE-474:
---------------------------------

   Not a good idea to ignore skew for multiple distincts.
   It might be safer to throw an error right now in such a scenario - we can add a better
   technique for handling this scenario later.

Can you add a negative testcase for the scenario above ?

Otherwise, it looks good.


> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474-1.txt, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Updated: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-474:
-----------------------------------------

    Attachment: patch-474.txt

I have reworked the patch from Mafish so that it works on trunk. The patch now also handles multiple columns in a distinct (HIVE-287).

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919023#action_12919023 ] 

Namit Jain commented on HIVE-474:
---------------------------------

Once HIVE-537 is committed, the general idea is as listed in the example in HIVE-537.


Say, the query is:

select a, count(distinct b), count(distinct c) from T group by a

and the data is:

a1   b1   c1
a1   b1   c2
a1   b2   c2
a1   b2   c1
a2   ...


Mapper will emit a union type:

a1  0:b1
a1  1:c1
a1  0:b1
a1  1:c2
a1  0:b2
a1  1:c2
a1  0:b2
a1  1:c1


Since the sort key is (a, union_tag, (b|c)), the data will reach the reducer in the following order: 

a1  0:b1
a1  0:b1
a1  0:b2
a1  0:b2
a1  1:c1
a1  1:c1
a1  1:c2
a1  1:c2

The reducer can then stream through the sorted records and count the distincts.
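
The example above can be simulated in a few lines of Python. This is a sketch of the idea only, not the Hive implementation: a plain sorted list stands in for the shuffle, the union type is modelled as a (tag, value) pair, NULL handling is left out, and the function names map_emit and reduce_stream are made up.

```python
from itertools import groupby


def map_emit(rows):
    """For each (a, b, c) row, emit one tagged record per distinct column."""
    for a, b, c in rows:
        yield (a, 0, b)  # tag 0 carries the value for count(distinct b)
        yield (a, 1, c)  # tag 1 carries the value for count(distinct c)


def reduce_stream(sorted_records, num_distincts=2):
    """Consume records sorted by (a, tag, value) and count distinct values.

    Because equal (tag, value) pairs are adjacent, only the previous pair
    has to be remembered; no set of seen values is kept in memory.
    """
    result = {}
    for a, group in groupby(sorted_records, key=lambda rec: rec[0]):
        counts = [0] * num_distincts
        previous = None  # last (tag, value) seen within this group
        for _, tag, value in group:
            if (tag, value) != previous:
                counts[tag] += 1
                previous = (tag, value)
        result[a] = counts
    return result


rows = [
    ("a1", "b1", "c1"),
    ("a1", "b1", "c2"),
    ("a1", "b2", "c2"),
    ("a1", "b2", "c1"),
]

shuffled = sorted(map_emit(rows))  # sort key: (a, union_tag, value)
counts = reduce_stream(shuffled)   # {a: [count(distinct b), count(distinct c)]}
```

The sorted list reproduces the reducer input order listed above, and the reducer needs constant memory per group regardless of how many distinct values there are.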

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Updated: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-474:
-----------------------------------------

    Assignee: Amareshwari Sriramadasu  (was: Mafish)

Will upload a patch once HIVE-537 is committed.

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Updated: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-474:
----------------------------

       Resolution: Fixed
    Fix Version/s: 0.7.0
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

Committed. Thanks Amareshwari

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.7.0
>
>         Attachments: hive-474.0.4.2rc.patch, patch-474-1.txt, patch-474-2.txt, patch-474-3.txt, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Updated: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Mafish (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mafish updated HIVE-474:
------------------------

    Attachment:     (was: hive-474.1.patch)

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>         Attachments: hive-474.0.4.2rc.patch
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920924#action_12920924 ] 

Amareshwari Sriramadasu commented on HIVE-474:
----------------------------------------------

I'm trying to understand the real problem we are solving here. As a first step, I changed SemanticAnalyzer to allow distinct selection on multiple columns and executed the query: 
{code}
select count(key), count(value), count(distinct key), count(distinct value), count(distinct key, value) from src;
{code}
It returns correct results with the current implementation itself. 
Can you explain the problem with the current implementation?


> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Updated: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-474:
-----------------------------------------

    Status: Patch Available  (was: Open)

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474-1.txt, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716057#action_12716057 ] 

Zheng Shao commented on HIVE-474:
---------------------------------

There are several approaches to solve this problem:

A1: separate group-by and join the results.

{code}
SELECT COALESCE(t1.key, t2.key), COALESCE(d_a, 0), COALESCE(d_b, 0)
FROM
(SELECT key, count(distinct a) as d_a ...) t1
FULL OUTER JOIN
(SELECT key, count(distinct b) as d_b ...) t2
ON t1.key = t2.key
{code}
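
To make the semantics of A1 concrete, here is a small Python sketch of the same idea: one pass per distinct column, followed by a join that fills a missing side with 0 (the COALESCE step). The helper names and the sample rows are made up for the illustration; nothing here is Hive API.

```python
from collections import defaultdict

def distinct_count_by_key(rows, key_idx, col_idx):
    """One "group by key, count(distinct col)" pass; NULLs (None) are not counted."""
    seen = defaultdict(set)
    for row in rows:
        values = seen[row[key_idx]]  # creates the key even if the column is None
        if row[col_idx] is not None:
            values.add(row[col_idx])
    return {k: len(v) for k, v in seen.items()}

def full_outer_join_counts(left, right):
    """Join the two per-key results, defaulting a missing side to 0."""
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k, 0), right.get(k, 0)) for k in keys}

# Columns: (key, a, b)
rows = [("k1", "x", "p"), ("k1", "y", "p"), ("k2", "x", None)]
d_a = distinct_count_by_key(rows, 0, 1)
d_b = distinct_count_by_key(rows, 0, 2)
joined = full_outer_join_counts(d_a, d_b)
print(joined)  # {'k1': (2, 1), 'k2': (1, 0)}
```

The cost of this approach is visible in the sketch: the input is scanned once per distinct column, and the per-key results then have to be joined.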

A2: Take advantage of union type (HIVE-537).
See HIVE-537 for details.

A3: Take advantage of partitioned merge join:
Here is a different plan. It depends on partitioned merge join.
Also the 2 jobs have to have the same number of reducers.

{code}
Query:
  select a, count(distinct b), count(distinct c), sum(d)

Plan:
  Job 1:
    Map side:
      Emit: distribution_key: a, sort_key: a, b, value: d
      Save a, c to temp_file1
    Reduce side:
      Group By:
        a, count(distinct b), sum(d)
    Output: temp_file2
  Job 2:
    Input: temp_file1
    Map side:
      Emit: distribution_key: a, sort_key: a, c, value: nothing
    Reduce side:
      Group By:
        a, count(distinct c)
      Partitioned Merge Join with temp_file2 on a
        a, count(distinct b), sum(d), count(distinct c)
{code}
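
The two-job plan can be simulated in Python as a rough sketch. Everything below is an assumption made for illustration: the toy partition function, the in-memory lists standing in for temp_file1 and temp_file2, and the function names. It shows why both jobs need the same partitioning: each reducer of job 2 can then merge its own sorted output with the matching partition of temp_file2.

```python
from itertools import groupby

NUM_REDUCERS = 2  # both jobs must use the same value so partitions line up

def partition(a):
    # Toy deterministic partitioner, assumed for this sketch only.
    return sum(map(ord, a)) % NUM_REDUCERS

def count_runs(sorted_values):
    """Number of distinct values in an already sorted sequence."""
    count, prev = 0, object()  # sentinel never equals a real value
    for v in sorted_values:
        if v != prev:
            count, prev = count + 1, v
    return count

def job1(rows):
    """Save (a, c) pairs; per partition, emit (a, count(distinct b), sum(d)) sorted by a."""
    temp_file1 = [(a, c) for a, _, c, _ in rows]
    temp_file2 = [[] for _ in range(NUM_REDUCERS)]
    by_a_b = sorted((a, b, d) for a, b, _, d in rows)  # sort_key: a, b
    for a, grp in groupby(by_a_b, key=lambda r: r[0]):
        grp = list(grp)
        distinct_b = count_runs(b for _, b, _ in grp)
        temp_file2[partition(a)].append((a, distinct_b, sum(d for _, _, d in grp)))
    return temp_file1, temp_file2

def job2(temp_file1, temp_file2):
    """Count distinct c per a, then merge-join with temp_file2 partition by partition."""
    parts = [[] for _ in range(NUM_REDUCERS)]
    for a, c in temp_file1:
        parts[partition(a)].append((a, c))
    out = []
    for p in range(NUM_REDUCERS):
        left = iter(temp_file2[p])  # already sorted by a within the partition
        for a, grp in groupby(sorted(parts[p]), key=lambda r: r[0]):
            a2, distinct_b, sum_d = next(left)  # same keys in the same order
            assert a2 == a
            out.append((a, distinct_b, sum_d, count_runs(c for _, c in grp)))
    return sorted(out)

# Columns: (a, b, c, d)
rows = [("a1", "b1", "c1", 1), ("a1", "b1", "c2", 2),
        ("a1", "b2", "c2", 3), ("a2", "b1", "c1", 4)]
result = job2(*job1(rows))
print(result)  # [('a1', 2, 6, 2), ('a2', 1, 4, 1)]
```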


> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Liu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838683#action_12838683 ] 

Liu commented on HIVE-474:
--------------------------

We have implemented this feature using the union type, mentioned as "A2" by Zheng.

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Updated: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-474:
-----------------------------------------

    Status: Patch Available  (was: Open)

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925263#action_12925263 ] 

Namit Jain commented on HIVE-474:
---------------------------------

running tests

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474-1.txt, patch-474-2.txt, patch-474-3.txt, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923035#action_12923035 ] 

Namit Jain commented on HIVE-474:
---------------------------------

Diff at https://review.cloudera.org/r/1052/ for review


> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925080#action_12925080 ] 

Namit Jain commented on HIVE-474:
---------------------------------

Can you refresh and regenerate the patch? I am getting some compile errors after applying it to trunk:


    [javac] /data/users/njain/hive-commit1/ql/build.xml:159: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 622 source files to /data/users/njain/hive-commit1/build/ql/classes
    [javac] /data/users/njain/hive-commit1/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:204: cannot find symbol
    [javac] symbol  : class StructField
    [javac] location: class org.apache.hadoop.hive.ql.exec.GroupByOperator
    [javac]     List<? extends StructField> sfs =
    [javac]                    ^
    [javac] /data/users/njain/hive-commit1/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:205: cannot find symbol
    [javac] symbol  : class StandardStructObjectInspector
    [javac] location: class org.apache.hadoop.hive.ql.exec.GroupByOperator
    [javac]       ((StandardStructObjectInspector) rowInspector).getAllStructFieldRefs();
    [javac]         ^
    [javac] /data/users/njain/hive-commit1/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:207: cannot find symbol
    [javac] symbol  : class StructField
    [javac] location: class org.apache.hadoop.hive.ql.exec.GroupByOperator
    [javac]       StructField keyField = sfs.get(0);
    [javac]       ^
    [javac] /data/users/njain/hive-commit1/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:211: cannot find symbol
    [javac] symbol  : class StandardStructObjectInspector
    [javac] location: class org.apache.hadoop.hive.ql.exec.GroupByOperator
    [javac]         if (keyObjInspector instanceof StandardStructObjectInspector) {
    [javac]                                        ^



Most probably this is a merge issue.

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474-1.txt, patch-474-2.txt, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by "John Sichi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913865#action_12913865 ] 

John Sichi commented on HIVE-474:
---------------------------------

Some comments after a brief look:

* The patch is going to need to be rebased against trunk (I guess after HIVE-537 is committed)?

* We should make sure that in the case of a single distinct agg, we leave the plan as it is today, and only use the new plan generation when multiple distincts are present.  This may already be the case; I couldn't quite tell from the example plans in the test cases (it would be nice to have some simpler queries for that).

* Regarding moving expression evaluation to the reduce side:  in general, this is something that needs cost-based optimization, due to factors like (a) data size before and after expression evaluation and (b) parallelization benefit of spreading out the computation over lots of mappers (assuming many more mappers than reducers).


> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Mafish
>         Attachments: hive-474.0.4.2rc.patch
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923033#action_12923033 ] 

Namit Jain commented on HIVE-474:
---------------------------------

I will take a look 

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Amareshwari Sriramadasu
>         Attachments: hive-474.0.4.2rc.patch, patch-474.txt
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Commented: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Venkatesh S (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902421#action_12902421 ] 

Venkatesh S commented on HIVE-474:
----------------------------------

Any update on this?

> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>            Assignee: Mafish
>         Attachments: hive-474.0.4.2rc.patch
>
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user



[jira] Issue Comment Edited: (HIVE-474) Support for distinct selection on two or more columns

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716057#action_12716057 ] 

Zheng Shao edited comment on HIVE-474 at 6/3/09 1:42 PM:
---------------------------------------------------------

There are several approaches to solve this problem:

A1: separate group-by and join the results.

{code}
SELECT COALESCE(t1.key, t2.key), COALESCE(d_a, 0), COALESCE(d_b, 0)
FROM
(SELECT key, count(distinct a) as d_a ...) t1
FULL OUTER JOIN
(SELECT key, count(distinct b) as d_b ...) t2
ON t1.key = t2.key
{code}

A2: Take advantage of union type (HIVE-537).
See HIVE-537 for details.

A3: Take advantage of partitioned merge join:
Here is a different plan. It depends on partitioned merge join.
Also the 2 jobs have to have the same number of reducers.

{code}
Query:
  select a, count(distinct b), count(distinct c), sum(d)

Plan:
  Job 1:
    Map side:
      Emit: distribution_key: a, sort_key: a, b, value: d
      Save a, c to temp_file1
    Reduce side:
      Group By:
        a, count(distinct b), sum(d)
    Output: temp_file2
  Job 2:
    Input: temp_file1
    Map side:
      Emit: distribution_key: a, sort_key: a, c, value: nothing
    Reduce side:
      Group By:
        a, count(distinct c)
      Partitioned Merge Join with temp_file2 on a
        a, count(distinct b), sum(d), count(distinct c)
{code}



  
> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported user
