Posted to dev@hive.apache.org by "Venky Iyer (JIRA)" <ji...@apache.org> on 2009/01/29 23:17:59 UTC

[jira] Created: (HIVE-259) Add PERCENTILE aggregate function

Add PERCENTILE aggregate function
---------------------------------

                 Key: HIVE-259
                 URL: https://issues.apache.org/jira/browse/HIVE-259
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Query Processor
            Reporter: Venky Iyer


Compute at least the 25th, 50th, and 75th percentiles

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jerome Boulon updated HIVE-259:
-------------------------------

    Attachment: Percentile.xlsx

Percentiles that match included test case


[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837522#action_12837522 ] 

Jerome Boulon commented on HIVE-259:
------------------------------------

@Carl: How did you get this list?

Also, I'm not sure I understand this:

Why are HashMap and ArrayList not allowed if they are supported?

43:7: Declaring variables, return values or parameters of type 'HashMap' is not allowed.
44:7: Declaring variables, return values or parameters of type 'ArrayList' is not allowed.
164:12: Declaring variables, return values or parameters of type 'ArrayList' is not allowed.
184:7: Declaring variables, return values or parameters of type 'ArrayList' is not allowed.



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832139#action_12832139 ] 

Todd Lipcon commented on HIVE-259:
----------------------------------

Agreed re HashMap. Also, there should be some kind of setting that limits how much RAM gets used up. In a later iteration we could do adaptive histogramming once we hit the limit. In this version we should just throw up our hands and fail with a message that says the user needs to discretize harder.
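
A minimal sketch of the fail-fast limit described above, in plain Java. The class name BoundedValueCounts, the maxDistinctValues knob, and the error message are hypothetical and not taken from any patch on this issue; a real limit might be expressed in bytes rather than in distinct values.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a value histogram that refuses to grow past a fixed number of distinct keys. */
final class BoundedValueCounts {
    private final Map<Long, Long> counts = new HashMap<Long, Long>();
    private final int maxDistinctValues; // assumed knob, not an actual Hive setting

    BoundedValueCounts(int maxDistinctValues) {
        this.maxDistinctValues = maxDistinctValues;
    }

    /** Counts one occurrence of value; fails once a new key would exceed the limit. */
    void add(long value) {
        Long current = counts.get(value);
        if (current == null) {
            if (counts.size() >= maxDistinctValues) {
                throw new IllegalStateException(
                    "More than " + maxDistinctValues
                    + " distinct values; discretize the column (e.g. round it) and retry");
            }
            counts.put(value, 1L);
        } else {
            counts.put(value, current + 1L);
        }
    }

    /** Number of distinct values seen so far. */
    int distinctValues() {
        return counts.size();
    }
}
```

Repeated values never trip the limit; only a new distinct value can, which is why rounding the column beforehand helps.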


[jira] Updated: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jerome Boulon updated HIVE-259:
-------------------------------

    Attachment:     (was: Percentile.xlsx)


[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839391#action_12839391 ] 

He Yongqiang commented on HIVE-259:
-----------------------------------

The code looks very good. Thanks for the work, Jerome and Zheng!
Just some minor comments:
(1) I am not familiar with the exact definition of the percentile function. Must the result of percentile() be a member of the input data?
(2) HashMap and ArrayList are used to copy and sort. Can we use a TreeMap here? This is a small point and can be ignored.
At the beginning of the new test case,
DESCRIBE FUNCTION percentile;
DESCRIBE FUNCTION EXTENDED percentile;
appear twice.

And this is a very good function to have. It would be great if we could document its usage on the wiki page or somewhere.
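
To make comment (2) concrete, here is a rough sketch of the two options in plain Java. It is not the code from the patch; the class and method names are hypothetical and only illustrate the trade-off.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical comparison of "HashMap, then copy and sort" against "TreeMap". */
final class SortedCountsSketch {
    /** HashMap route: count first, then copy the keys into a list and sort once. */
    static List<Long> sortedKeysViaCopy(long[] values) {
        Map<Long, Long> counts = new HashMap<Long, Long>();
        for (long v : values) {
            Long c = counts.get(v);
            counts.put(v, c == null ? 1L : c + 1L);
        }
        List<Long> keys = new ArrayList<Long>(counts.keySet());
        Collections.sort(keys);
        return keys;
    }

    /** TreeMap route: keys stay ordered on every insert, so no separate sort is needed. */
    static List<Long> sortedKeysViaTreeMap(long[] values) {
        Map<Long, Long> counts = new TreeMap<Long, Long>();
        for (long v : values) {
            Long c = counts.get(v);
            counts.put(v, c == null ? 1L : c + 1L);
        }
        return new ArrayList<Long>(counts.keySet());
    }
}
```

The TreeMap pays O(log n) on every insert, while the HashMap pays for a single sort of the distinct keys at the end, so which is cheaper depends on how many rows share each distinct value.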


[jira] Updated: (HIVE-259) Add PERCENTILE aggregate function

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-259:
------------------------------

       Resolution: Fixed
    Fix Version/s: 0.6.0
     Release Note: Add PERCENTILE aggregate function
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

Committed. Thanks for the hard work, Jerome Boulon and Zheng.

Btw, I manually fixed a show_function.q diff. Please document the usage of the percentile function on the wiki or somewhere.

> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>             Fix For: 0.6.0
>
>         Attachments: HIVE-259-2.patch, HIVE-259-3.patch, HIVE-259.1.patch, HIVE-259.4.patch, HIVE-259.5.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx
>
>
> Compute at least the 25th, 50th, and 75th percentiles


[jira] Updated: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jerome Boulon updated HIVE-259:
-------------------------------

    Attachment: Percentile.xlsx
                jb2.txt

Percentile test file + validation using the Excel PERCENTILE function:
CREATE TABLE JB2
(
duration bigint,
code string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/jb2.txt' INTO TABLE JB2;



Result:
hive> select percentile(duration,"25,50,99") from JB2;
Ended Job = job_201002201654_0006
OK
[14.0,33.0,416.4000000000001]
Time taken: 36.261 seconds

hive> select code,percentile(duration,"25,50,99") from JB2 group by code;
Ended Job = job_201002201654_0007
OK
a	[2.0,17.5,427.2299999999999]
b	[22.75,44.5,345.84999999999997]
c	[18.0,29.0,58.760000000000005]
Time taken: 23.419 seconds
hive> quit;



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838119#action_12838119 ] 

Zheng Shao commented on HIVE-259:
---------------------------------

The test cases look a bit too trivial, or do the results have problems? They always return the same number for the 3 different percentile values.



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837527#action_12837527 ] 

Carl Steinbach commented on HIVE-259:
-------------------------------------

bq. How did you get this list? 

Run 'ant checkstyle'. The list of violations gets dumped to build/checkstyle/checkstyle-errors.txt.

bq. Why HashMap and ArrayList are not allowed if supported?

You're allowed to use ArrayList and HashMap, but you're supposed to refer
to instances of these classes using the interface (List or Map) instead of the
concrete type, e.g.

{code:java}
Map<String, String> myMap = new HashMap<String, String>();

public List<String> getStringList() {
   return new ArrayList<String>();
}
{code}




[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858600#action_12858600 ] 

Ning Zhang commented on HIVE-259:
---------------------------------

Hi Jerome and Zheng, 

Could either of you document the syntax and semantics of the percentile function on the wiki page (http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF or http://wiki.apache.org/hadoop/Hive/HiveUDFGuide)?

Thanks,


[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806183#action_12806183 ] 

Carl Steinbach commented on HIVE-259:
-------------------------------------


@Jerome: Agreed. Allowing sort results to be shared by multiple functions (like in the following example) is key to supporting analytic functions efficiently.

{code:sql}
SELECT department_id,
   PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary DESC) 
      "Median cont",
   PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY salary DESC) 
      "Median disc"
   FROM employees GROUP BY department_id;
{code}


[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781824#action_12781824 ] 

Carl Steinbach commented on HIVE-259:
-------------------------------------

This would be a very useful function to have.

For the sake of completeness (and without much additional effort) it would be nice to provide both PERCENTILE_DISC and PERCENTILE_CONT.

PERCENTILE_CONT: http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/functions110.htm
PERCENTILE_DISC:  http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/functions111.htm
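
For readers who do not want to follow the links, here is a small sketch of how the two functions differ, written in plain Java over an in-memory array sorted in ascending order. The class and method names are hypothetical, and the code assumes a non-empty input and 0 <= p <= 1. PERCENTILE_DISC always returns a value that is present in the input, while PERCENTILE_CONT interpolates between neighbouring values.

```java
import java.util.Arrays;

/** Hypothetical sketch of discrete vs. continuous percentiles (ascending order assumed). */
final class OraclePercentileSketch {
    /** PERCENTILE_DISC style: first sorted value whose cumulative distribution is >= p. */
    static double percentileDisc(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = Math.max(1, (int) Math.ceil(p * sorted.length)); // 1-based rank
        return sorted[rank - 1];
    }

    /** PERCENTILE_CONT style: linear interpolation at row number 1 + p * (n - 1). */
    static double percentileCont(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double rowNumber = 1 + p * (sorted.length - 1);
        int floorRow = (int) Math.floor(rowNumber);
        int ceilRow = (int) Math.ceil(rowNumber);
        if (floorRow == ceilRow) {
            return sorted[floorRow - 1]; // lands exactly on a row
        }
        return (ceilRow - rowNumber) * sorted[floorRow - 1]
             + (rowNumber - floorRow) * sorted[ceilRow - 1];
    }
}
```

With the values 10, 20, 30, 40 and p = 0.5, the discrete variant returns 20 and the continuous variant returns 25.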



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832146#action_12832146 ] 

Jerome Boulon commented on HIVE-259:
------------------------------------

I didn't know that we could use a hash on the state object.
Is there any limitation on what can be used on the state object or can we use any java Object? 
Also how is the state serialized between Map and Reduce?


[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834474#action_12834474 ] 

Zheng Shao commented on HIVE-259:
---------------------------------

> Is there any limitation on what can be used on the state object or can we use any java Object? 
We support primitive classes, HashMap (translated into map<> type in Hive), ArrayList (array type in Hive), and any simple struct-like classes (struct type in Hive).
We support arbitrary levels of nesting, but no recursive types.

> Also how is the state serialized between Map and Reduce?
We use SerDe (see SerDe.serialize(...) ) to serialize/deserialize the objects, as well as translations between objects that have the same "type" (see ObjectInspector and ObjectInspectorConverters).



[jira] Updated: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-259:
----------------------------

    Attachment: HIVE-259.5.patch

We take the method recommended by NIST.

See http://en.wikipedia.org/wiki/Percentile#Alternative_methods


[jira] Updated: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jerome Boulon updated HIVE-259:
-------------------------------

    Status: Open  (was: Patch Available)


[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Edward Capriolo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668699#action_12668699 ] 

Edward Capriolo commented on HIVE-259:
--------------------------------------

The 95th percentile is very often used in Internet Service Provider billing, so supporting it might be useful.

The percentile calculation is a sort followed by picking an element. The syntax could be like:

* PERCENTILE(column, .99) 
* PERCENTILE(column, .50)

This way you could compute any percentile.


[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832134#action_12832134 ] 

Zheng Shao commented on HIVE-259:
---------------------------------

Jerome, it seems to me that the best data structure for counting is a HashMap, which allows near-constant-time insertion and lookup. When we "terminate" we can get the entries and sort them, but that cost should be small (it's a one-time cost and the number of unique items won't be too big; users should have used "round" to shrink the number of unique numbers).

It seems that currently we are paying O(log n) cost for each find and O(n) cost for each insertion.

Does that make sense?

For sharing the state object, we can just declare the state class as public static.
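
As a rough illustration of the suggestion above, here is a plain-Java sketch of a HashMap-backed state held in a public static nested class. The names iterate, merge, and terminate are stand-ins that only echo the aggregation phases mentioned in this thread; they are not Hive's actual UDAF interface, and the nearest-rank rule in terminate is an assumption chosen for brevity rather than the rule used by the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of counting with a HashMap and sorting only once at the end. */
final class PercentileStateSketch {
    /** State declared public static so that other classes could reuse it. */
    public static class State {
        public Map<Long, Long> counts = new HashMap<Long, Long>();
    }

    /** Per-row step: one hash lookup and one hash insert. */
    static void iterate(State state, long value) {
        increment(state, value, 1L);
    }

    /** Folds a partial state (e.g. produced by another task) into this one. */
    static void merge(State state, State partial) {
        for (Map.Entry<Long, Long> e : partial.counts.entrySet()) {
            increment(state, e.getKey(), e.getValue());
        }
    }

    /** One-time finish: sort the distinct values, then walk cumulative counts to the rank. */
    static long terminate(State state, double p) {
        List<Long> keys = new ArrayList<Long>(state.counts.keySet());
        Collections.sort(keys);
        long total = 0;
        for (long c : state.counts.values()) {
            total += c;
        }
        long rank = Math.max(1L, (long) Math.ceil(p * total)); // nearest rank, 1-based
        long seen = 0;
        for (Long key : keys) {
            seen += state.counts.get(key);
            if (seen >= rank) {
                return key;
            }
        }
        throw new IllegalStateException("no values were aggregated");
    }

    private static void increment(State state, long value, long by) {
        Long current = state.counts.get(value);
        state.counts.put(value, current == null ? by : current + by);
    }
}
```

Only the distinct values are sorted, so the cost of terminate depends on the number of unique values rather than on the number of rows.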



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832572#action_12832572 ] 

Jerome Boulon commented on HIVE-259:
------------------------------------

Sure, with Map support it's much simpler ;-)



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806134#action_12806134 ] 

Jerome Boulon commented on HIVE-259:
------------------------------------

It would also be good to be able to request more than one percentile while keeping only a single structure in memory, e.g.:
select PERCENTILE(column, .99), PERCENTILE(column, .50) from myTable;
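
A sketch of the idea in plain Java: sort once, then answer every requested fraction from the same sorted array. The class name, the varargs signature, and the interpolation rule (zero-based position p * (n - 1)) are assumptions made for illustration, not the behaviour of any attached patch; the input is assumed to be non-empty with each fraction between 0 and 1.

```java
import java.util.Arrays;

/** Hypothetical sketch: several percentiles answered from one shared sorted structure. */
final class MultiPercentileSketch {
    /** Sorts a copy of the column once and evaluates each fraction against it. */
    static double[] percentiles(double[] column, double... fractions) {
        double[] sorted = column.clone();
        Arrays.sort(sorted); // the only sort, shared by all requests
        double[] result = new double[fractions.length];
        for (int i = 0; i < fractions.length; i++) {
            // Assumed rule: linear interpolation at zero-based position p * (n - 1).
            double position = fractions[i] * (sorted.length - 1);
            int lower = (int) Math.floor(position);
            int upper = (int) Math.ceil(position);
            result[i] = sorted[lower] + (position - lower) * (sorted[upper] - sorted[lower]);
        }
        return result;
    }
}
```

A call such as percentiles(column, .99, .50) then costs one sort plus a constant amount of work per requested fraction.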



[jira] Updated: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jerome Boulon updated HIVE-259:
-------------------------------

    Attachment: HIVE-259.patch

First iteration for percentile (tested using Hive trunk and Hadoop 0.18.3):
usage:
CREATE TEMPORARY FUNCTION percentile AS 'org.apache.hadoop.hive.ql.udf.Percentile';
select percentile(myColumn,"25,50,99") from MyTable;

- How can I share the state object across functions?



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839394#action_12839394 ] 

He Yongqiang commented on HIVE-259:
-----------------------------------

Looks good; will test and commit.


[jira] Work started: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HIVE-259 started by Jerome Boulon.


[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Alex Loddengaard (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837526#action_12837526 ] 

Alex Loddengaard commented on HIVE-259:
---------------------------------------

Hey Jerome,

I assume it's because you're supposed to use the interface type (e.g. Map or List) for return types, parameter types, and declaring variables.

Correct me if I'm wrong, those of you more knowledgeable about Hive's checkstyle :).

Alex


[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838512#action_12838512 ] 

Jerome Boulon commented on HIVE-259:
------------------------------------

Can someone explain how I can create and populate a new table for use by the ant test target?



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802616#action_12802616 ] 

Zheng Shao commented on HIVE-259:
---------------------------------

This is a good first step. We can provide some UDFs to "bucketize" the values first in case the user needs it.


> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>
> Compute atleast 25, 50, 75th percentiles



[jira] Updated: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jerome Boulon updated HIVE-259:
-------------------------------

    Status: Patch Available  (was: Open)

HIVE-259-3.patch

> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>         Attachments: HIVE-259-2.patch, HIVE-259-3.patch, HIVE-259.1.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx
>
>
> Compute atleast 25, 50, 75th percentiles



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782120#action_12782120 ] 

Todd Lipcon commented on HIVE-259:
----------------------------------

An easy way to do this that would work for a ton of data sets would be to essentially do a counting sort. If you have only a few thousand distinct values in the column to be analyzed, just make a hashtable, count up how many of each you see, and then in the single reducer use the histogram to figure out the percentile. This should work great for datasets like age, and even for sets like "number of days since user signed up". For sets that are truly continuous, this would be useful when combined with a binning UDF to discretize them.

Sadly it doesn't cover the general case, but it would be an easy first step.
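
A rough sketch of the histogram idea above, in plain Java rather than as a Hive UDAF. The class and method names are made up for illustration, and the nearest-rank rule used to pick the result is only one possible percentile definition, not necessarily the one the patch uses:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of any HIVE-259 patch.
public class HistogramPercentile {

  // Count how many times each distinct value occurs (the "counting sort" histogram).
  static Map<Integer, Long> histogram(int[] values) {
    Map<Integer, Long> counts = new HashMap<>();
    for (int v : values) {
      counts.merge(v, 1L, Long::sum);
    }
    return counts;
  }

  // Nearest-rank percentile (an assumed definition): the smallest value whose
  // cumulative count reaches ceil(p / 100 * total). p is given in percent.
  static int percentile(Map<Integer, Long> counts, double p) {
    long total = 0;
    for (long c : counts.values()) {
      total += c;
    }
    if (total == 0) {
      throw new IllegalArgumentException("empty histogram");
    }
    long rank = Math.max(1L, (long) Math.ceil(p / 100.0 * total));

    // Only the distinct keys are sorted, not the full data set.
    List<Integer> keys = new ArrayList<>(counts.keySet());
    Collections.sort(keys);

    long seen = 0;
    for (int k : keys) {
      seen += counts.get(k);
      if (seen >= rank) {
        return k;
      }
    }
    return keys.get(keys.size() - 1);
  }

  public static void main(String[] args) {
    int[] ages = {21, 35, 35, 40, 21, 35, 60, 40};
    Map<Integer, Long> counts = histogram(ages);
    System.out.println("p25 = " + percentile(counts, 25));
    System.out.println("p50 = " + percentile(counts, 50));
    System.out.println("p75 = " + percentile(counts, 75));
  }
}
```

The memory used is proportional to the number of distinct values, which is why this suits columns like age and needs a binning step for truly continuous data.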

> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>
> Compute atleast 25, 50, 75th percentiles



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "John Sichi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860061#action_12860061 ] 

John Sichi commented on HIVE-259:
---------------------------------

I couldn't see the point of having two competing UDF guide pages, so I renamed the XPath-specific one as such and linked it from the main one.  Just housekeeping to reduce confusion; I did not actually add the percentile info.


> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>             Fix For: 0.6.0
>
>         Attachments: HIVE-259-2.patch, HIVE-259-3.patch, HIVE-259.1.patch, HIVE-259.4.patch, HIVE-259.5.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx
>
>
> Compute atleast 25, 50, 75th percentiles



[jira] Updated: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-259:
----------------------------

    Attachment: HIVE-259.4.patch

This one fixes all checkstyle errors, and uses *Writable classes to avoid creating new objects as much as possible.


> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>         Attachments: HIVE-259-2.patch, HIVE-259-3.patch, HIVE-259.1.patch, HIVE-259.4.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx
>
>
> Compute atleast 25, 50, 75th percentiles



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "John Sichi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863415#action_12863415 ] 

John Sichi commented on HIVE-259:
---------------------------------

PERCENTILE docs are still missing on the consolidated page:

http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF


> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>             Fix For: 0.6.0
>
>         Attachments: HIVE-259-2.patch, HIVE-259-3.patch, HIVE-259.1.patch, HIVE-259.4.patch, HIVE-259.5.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx
>
>
> Compute atleast 25, 50, 75th percentiles



[jira] Updated: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-259:
----------------------------

    Attachment: HIVE-259.1.patch

Jerome, I did a skeleton of the code to use HashMap. Do you want to start from there and add what is missing?


> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>         Attachments: HIVE-259.1.patch, HIVE-259.patch
>
>
> Compute atleast 25, 50, 75th percentiles



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838735#action_12838735 ] 

Todd Lipcon commented on HIVE-259:
----------------------------------

Doesn't the autoboxing of Integer types actually allocate objects? I think the JVM only flyweights small integers (iirc only from -128 to 127).
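
For reference, a small standalone check of the boxing behaviour in question. The class and method names are made up for illustration. The Java Language Specification guarantees shared boxed objects only for int values from -128 to 127; outside that range the result depends on the JVM and its settings, so the sketch makes no claim about what the last line prints:

```java
// Hypothetical illustration class; not part of any HIVE-259 patch.
public class IntegerCacheDemo {

  // Returns true when boxing the same int twice yields the same object.
  static boolean sameBoxedObject(int value) {
    Integer a = value; // autoboxing; in practice javac emits Integer.valueOf(value)
    Integer b = value;
    return a == b;     // deliberate reference comparison, not equals()
  }

  public static void main(String[] args) {
    // Inside the guaranteed cache range: both boxes are the same object.
    System.out.println("127  shared: " + sameBoxedObject(127));
    System.out.println("-128 shared: " + sameBoxedObject(-128));
    // Outside the guaranteed range: usually a fresh allocation per boxing
    // with default JVM settings, but this is not guaranteed either way.
    System.out.println("1000 shared: " + sameBoxedObject(1000));
  }
}
```

So a per-row Integer for an arbitrary column value will usually mean a per-row allocation, which is the concern raised here.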

> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>         Attachments: HIVE-259-2.patch, HIVE-259.1.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx
>
>
> Compute atleast 25, 50, 75th percentiles



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838173#action_12838173 ] 

Jerome Boulon commented on HIVE-259:
------------------------------------

- From my point of view, changing variable access to private in the state object will not make the code more readable ...
- I'll change all variable names to lowercase to match Java style; the current names are based on the Oracle definition.

@Zheng - I'm not using an ArrayList<Integer> but a String to avoid unnecessary object creation (for every single row) ... it would be even better if the constructor could have been used, but I haven't found how to do that. If we care about 1 extra empty ArrayList per mapper/spill in memory then we should care about creating (1 ArrayList + 1 Integer object per percentile) per row.

@Zheng - Regarding the test case, that is what I had in mind when I asked you how to create my own table, and that is exactly why I posted the jb2.* files.
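
For readers unfamiliar with the Oracle definition mentioned above: the names RN, CRN and FRN flagged by Checkstyle match the terms in Oracle's PERCENTILE_CONT documentation. The sketch below assumes that is the definition meant and shows only the interpolation step, in plain Java with hypothetical names. It takes the percentile as a fraction between 0 and 1, as Oracle does, which may differ from the argument format in the patch:

```java
import java.util.Arrays;

// Hypothetical illustration class; not the code from any HIVE-259 patch.
public class PercentileCont {

  // sortedValues must be non-empty and sorted ascending; p is a fraction in [0, 1].
  static double percentileCont(double[] sortedValues, double p) {
    int n = sortedValues.length;
    if (n == 0) {
      throw new IllegalArgumentException("no values");
    }
    if (p < 0.0 || p > 1.0) {
      throw new IllegalArgumentException("p must be in [0, 1]");
    }
    double rn = 1 + p * (n - 1);     // RN: 1-based fractional row number
    int frn = (int) Math.floor(rn);  // FRN: row at or below RN
    int crn = (int) Math.ceil(rn);   // CRN: row at or above RN
    if (frn == crn) {
      return sortedValues[frn - 1];  // RN falls exactly on a row
    }
    // Linear interpolation between the two neighbouring rows.
    return (crn - rn) * sortedValues[frn - 1] + (rn - frn) * sortedValues[crn - 1];
  }

  public static void main(String[] args) {
    double[] values = {40, 10, 30, 20};
    Arrays.sort(values);
    System.out.println("p25 = " + percentileCont(values, 0.25));
    System.out.println("p50 = " + percentileCont(values, 0.5));
  }
}
```

With this definition the result does not have to be a member of the input data, since it can fall between two neighbouring values.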


> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>         Attachments: HIVE-259-2.patch, HIVE-259.1.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx
>
>
> Compute atleast 25, 50, 75th percentiles



[jira] Updated: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-259:
--------------------------------

    Release Note:   (was: Add PERCENTILE aggregate function)

> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>             Fix For: 0.6.0
>
>         Attachments: HIVE-259-2.patch, HIVE-259-3.patch, HIVE-259.1.patch, HIVE-259.4.patch, HIVE-259.5.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx
>
>
> Compute atleast 25, 50, 75th percentiles



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838118#action_12838118 ] 

Zheng Shao commented on HIVE-259:
---------------------------------

Also see http://wiki.apache.org/hadoop/Hive/HowToContribute#Coding_Convention

> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>         Attachments: HIVE-259-2.patch, HIVE-259.1.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx
>
>
> Compute atleast 25, 50, 75th percentiles



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838516#action_12838516 ] 

Carl Steinbach commented on HIVE-259:
-------------------------------------

@Jerome: take a look at ql/src/test/org/apache/hadoop/hive/ql/QTestUtil.java


> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>         Attachments: HIVE-259-2.patch, HIVE-259.1.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx
>
>
> Compute atleast 25, 50, 75th percentiles



[jira] Updated: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jerome Boulon updated HIVE-259:
-------------------------------

    Attachment: HIVE-259-2.patch

Percentile function

> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>         Attachments: HIVE-259-2.patch, HIVE-259.1.patch, HIVE-259.patch
>
>
> Compute atleast 25, 50, 75th percentiles



[jira] Assigned: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao reassigned HIVE-259:
-------------------------------

    Assignee: Jerome Boulon

> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>
> Compute atleast 25, 50, 75th percentiles



[jira] Updated: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jerome Boulon updated HIVE-259:
-------------------------------

    Status: Patch Available  (was: In Progress)

Percentile function.
Usage: select code,percentile(MyColumnB,"<P1,P2,P3,Px>") from <MyTable> group by <myColumn>;

> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>         Attachments: HIVE-259-2.patch, HIVE-259.1.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx
>
>
> Compute atleast 25, 50, 75th percentiles



[jira] Updated: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jerome Boulon updated HIVE-259:
-------------------------------

    Attachment: HIVE-259-3.patch

- use Double instead of Integer for percentile so we can ask for 99.999 percentile 
- checkstyle fix except State object
- new test case


> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>         Attachments: HIVE-259-2.patch, HIVE-259-3.patch, HIVE-259.1.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx
>
>
> Compute atleast 25, 50, 75th percentiles



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839393#action_12839393 ] 

Zheng Shao commented on HIVE-259:
---------------------------------

> (1) I am not familiar with the exact definition of percentile function. Is the percentile()'s result must be a member of input data?
See the link above.

> (2) HashMap and ArrayList is used to copy and sort. Can we use tree map here? this is a small and can be ignored.
I think HashMap is better here. The reason is that the number of "iterate" calls is usually much higher than the number of unique numbers (the size of the HashMap). By using HashMap we reduce the cost of "iterate".

> In the beginning of new test case, .. appears two times
Fixed in HIVE-259.5.patch


> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>         Attachments: HIVE-259-2.patch, HIVE-259-3.patch, HIVE-259.1.patch, HIVE-259.4.patch, HIVE-259.5.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx
>
>
> Compute atleast 25, 50, 75th percentiles



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838135#action_12838135 ] 

Zheng Shao commented on HIVE-259:
---------------------------------

1. We are converting "25,50,99" to ArrayList<Integer>. Why don't we directly accept an int array (or a double array to allow 99.9)?

In the query, the user can say:

SELECT percentile(mycol, array(25, 50, 99)) FROM mytable;

2. Get rid of State.initDone.  We can set "ArrayList<Integer> percentiles" to null first. That saves some space in memory as well as network when we transfer the state from mapper to reducer.

3. In Java, variable names should be lowercased.

4. We should change the test case to be non-trivial.


> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>         Attachments: HIVE-259-2.patch, HIVE-259.1.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx
>
>
> Compute atleast 25, 50, 75th percentiles



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837500#action_12837500 ] 

Carl Steinbach commented on HIVE-259:
-------------------------------------

Please fix the new Checkstyle errors in UDAFPercentile.java:

35: Missing a Javadoc comment.
39: Missing a Javadoc comment.
39:10: 'public' modifier out of order with the JLS suggestions.
41: Missing a Javadoc comment.
41:12: 'public' modifier out of order with the JLS suggestions.
42:15: Variable 'initDone' must be private and have accessor methods.
43:7: Declaring variables, return values or parameters of type 'HashMap' is not allowed.
43:35: Variable 'counts' must be private and have accessor methods.
44:7: Declaring variables, return values or parameters of type 'ArrayList' is not allowed.
44:26: Variable 'percentiles' must be private and have accessor methods.
47: Missing a Javadoc comment.
47:12: 'public' modifier out of order with the JLS suggestions.
56:11: Variable 'state' must be private and have accessor methods.
82:43: Name '_percentiles' must match pattern '^[a-z][a-zA-Z0-9]*$'.
85:28: Expression can be simplified.
105:39: ')' is preceded with whitespace.
117:26: Expression can be simplified.
125:65: Name 'RN' must match pattern '^[a-z][a-zA-Z0-9]*$'.
129:12: Name 'CRN' must match pattern '^[a-z][a-zA-Z0-9]*$'.
130:12: Name 'FRN' must match pattern '^[a-z][a-zA-Z0-9]*$'.
164:12: Declaring variables, return values or parameters of type 'ArrayList' is not allowed.
173: Line is longer than 100 characters.
184:7: Declaring variables, return values or parameters of type 'ArrayList' is not allowed.
188:12: Name 'N' must match pattern '^[a-z][a-zA-Z0-9]*$'.
189:14: Name 'RN' must match pattern '^[a-z][a-zA-Z0-9]*$'.
191:16: Name 'P' must match pattern '^[a-z][a-zA-Z0-9]*$'.


> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>         Attachments: HIVE-259-2.patch, HIVE-259.1.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx
>
>
> Compute atleast 25, 50, 75th percentiles



[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838718#action_12838718 ] 

Zheng Shao commented on HIVE-259:
---------------------------------

Hi Jerome, using ArrayList<Integer> won't cause unnecessary Object creation. We will just create a single ArrayList<Integer> and use it forever.
Does that make sense?


> Add PERCENTILE aggregate function
> ---------------------------------
>
>                 Key: HIVE-259
>                 URL: https://issues.apache.org/jira/browse/HIVE-259
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Venky Iyer
>            Assignee: Jerome Boulon
>         Attachments: HIVE-259-2.patch, HIVE-259.1.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx
>
>
> Compute atleast 25, 50, 75th percentiles
