Posted to dev@hive.apache.org by "Carter Shanklin (JIRA)" <ji...@apache.org> on 2014/07/02 18:21:25 UTC

[jira] [Commented] (HIVE-7296) big data approximate processing at a very low cost based on hive sql

    [ https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050265#comment-14050265 ] 

Carter Shanklin commented on HIVE-7296:
---------------------------------------

[~sjtufighter]

I have spoken to some Hive users who implemented their own UDF to compute approximate counts and ranks using lossy counting: http://www.vldb.org/conf/2002/S10P03.pdf

They had tried some other approaches but settled on this because it allows tunable error and deals with skew fairly well.
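
For readers unfamiliar with the technique, here is a minimal single-process sketch of lossy counting as I understand it from the paper linked above. It is not the UDF those users wrote and not Hive code; the class name LossyCounter and its methods are hypothetical, chosen only for illustration. The epsilon parameter is the tunable error: reported counts never exceed the true count and undercount it by at most epsilon * N, where N is the number of items seen so far.

```python
import math


class LossyCounter:
    """Illustrative lossy counting summary (hypothetical name, not a Hive API)."""

    def __init__(self, epsilon):
        if not 0 < epsilon < 1:
            raise ValueError("epsilon must be in (0, 1)")
        self.epsilon = epsilon
        # The stream is processed in buckets of width ceil(1 / epsilon).
        self.width = math.ceil(1 / epsilon)
        self.n = 0  # number of items seen so far
        self.entries = {}  # item -> [count, delta], delta = max possible undercount

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)  # id of the current bucket
        entry = self.entries.get(item)
        if entry is not None:
            entry[0] += 1
        else:
            # A new item may have been seen and pruned in earlier buckets,
            # at most once per completed bucket.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop items that cannot be frequent.
            self.entries = {
                key: value
                for key, value in self.entries.items()
                if value[0] + value[1] > bucket
            }

    def frequent(self, support):
        """Return items whose true frequency may be at least support * n."""
        threshold = (support - self.epsilon) * self.n
        return {
            key: value[0]
            for key, value in self.entries.items()
            if value[0] >= threshold
        }
```

To use it, call add() once per row value and then frequent(s) for a support threshold s larger than epsilon. A distributed version would also need a way to combine per-task summaries, which this sketch does not attempt.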

This could be implemented in Hive using partitioned table functions, and I think there are some users who would like this functionality. This sounds similar to your number (3). I've spoken to a few people on the Hive team and they think it sounds like a good idea. Is there any interest in building this?

> big data approximate processing  at a very  low cost  based on hive sql 
> ------------------------------------------------------------------------
>
>                 Key: HIVE-7296
>                 URL: https://issues.apache.org/jira/browse/HIVE-7296
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: wangmeng
>
> For big data analysis, we often need to run the following kinds of queries and statistics:
> 1. Cardinality estimation: count the number of distinct elements in a collection, such as unique visitors (UV).
> Currently we can use this Hive query:
> select count(distinct id) from TestTable;
> 2. Frequency estimation: estimate how many times an element is repeated, such as the number of site visits by a user.
> Hive query: select count(1) from TestTable where name = 'wangmeng';
> 3. Heavy hitters (top-k elements): such as the top 100 shops.
> Hive query: select count(1), name from TestTable group by name; (needs a UDF ...)
> 4. Range query: for example, find the number of users aged between 20 and 30.
> Hive query: select count(1) from TestTable where age > 20 and age < 30;
> 5. Membership query: for example, is a given user name already registered?
> With Hive's current implementation, these queries consume too much memory and take a long time.
> However, in many cases we do not need exact results and a small error can be tolerated. In such cases, we can use approximate processing to greatly improve time and space efficiency.
> Based on some theoretical analysis material, I would very much like to work on these new features if possible.
> So, is there anything I can do? Many thanks.



--
This message was sent by Atlassian JIRA
(v6.2#6252)