Posted to issues-all@impala.apache.org by "Qifan Chen (Jira)" <ji...@apache.org> on 2021/02/23 17:15:00 UTC
[jira] [Commented] (IMPALA-10538) Document the newly added scale argument of ndv function

    [ https://issues.apache.org/jira/browse/IMPALA-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289199#comment-17289199 ] 

Qifan Chen commented on IMPALA-10538:
-------------------------------------

Hi

The commit message for the NDV extension (eef61d22d89b97eb589936701a41d05d84b0da8a) has relevant content. I have copied the relevant part below.

    This work addresses the current limitation of the NDV function by
    extending it to take an optional second argument called scale.
                                                                                                 
       NDV([DISTINCT | ALL] expression [, scale])                                                
                                                                                                 
    Without the second argument, the existing syntax and semantics are
    preserved. The precision, which determines the total number
    of different estimators in the HLL algorithm, remains 10.
                                                                                                 
    When supplied, the scale argument must be an integer literal
    in the range from 1 to 10. Its value is internally mapped
    to a precision used by the HLL algorithm, with the following
    mapping formula:
                                                                                                 
      precision = scale + 8.                                                                     
                                                                                                 
    Thus, a scale of 1 is mapped to a precision of 9 and a scale of                              
    10 is mapped to a precision of 18.                                                           
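
The mapping above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Impala's implementation; the helper name scale_to_precision and the constant names are hypothetical.

```python
# Illustrative sketch of the scale -> precision mapping described above.
# All names here are hypothetical and not part of Impala.
MIN_SCALE = 1
MAX_SCALE = 10
PRECISION_OFFSET = 8    # from the formula: precision = scale + 8
DEFAULT_PRECISION = 10  # precision used when no scale argument is supplied


def scale_to_precision(scale=None):
    """Return the HLL precision that corresponds to an NDV scale argument."""
    if scale is None:
        return DEFAULT_PRECISION
    if isinstance(scale, bool) or not isinstance(scale, int):
        raise TypeError("scale must be an integer")
    if not MIN_SCALE <= scale <= MAX_SCALE:
        raise ValueError(f"scale must be between {MIN_SCALE} and {MAX_SCALE}")
    return scale + PRECISION_OFFSET


if __name__ == "__main__":
    for scale in range(MIN_SCALE, MAX_SCALE + 1):
        print(f"scale {scale:2d} -> precision {scale_to_precision(scale)}")
```

Under this mapping, the default precision of 10 is the same precision that a scale of 2 would select.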
                                                                                                 
    A larger precision value generally produces a better estimate
    (i.e. with less error) than a smaller one, because more
    estimators are involved. The cost is the extra memory needed.
    For a given precision p, the amount of memory used by the HLL
    algorithm is on the order of 2^p bytes.
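
To make the trade-off concrete, here is a rough Python sketch that tabulates memory and expected error per precision. It assumes one byte per estimator (consistent with the "order of 2^p bytes" statement) and uses the textbook HyperLogLog relative standard error of about 1.04 / sqrt(2^p). Both function names are hypothetical, and errors measured in Impala may differ from the textbook figure.

```python
import math


def hll_memory_bytes(precision):
    """Approximate HLL state size, assuming one byte per estimator."""
    return 2 ** precision


def hll_standard_error(precision):
    """Textbook HyperLogLog relative standard error: 1.04 / sqrt(2^precision)."""
    return 1.04 / math.sqrt(2 ** precision)


if __name__ == "__main__":
    # Precisions 9..18 correspond to scales 1..10 (precision = scale + 8).
    for precision in range(9, 19):
        kib = hll_memory_bytes(precision) / 1024
        err_pct = hll_standard_error(precision) * 100
        print(f"precision {precision:2d}: ~{kib:6.1f} KiB, ~{err_pct:.2f}% standard error")
```

Each step up in precision doubles the memory while reducing the theoretical error by a factor of about 1.41 (the square root of 2).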

   Performance:                                                                           
    1. Ran estimation error tests against a total of 22 distinct data sets                 
       loaded into external Impala tables.                                                 
                                                                                           
       The error was computed as                                                           
       abs(<true_unique_value> - <estimated_unique_value>) / <true_unique_value>.          
                                                                                           
       Overall, a precision of 18 (scale value of 10) gave the best
       result, with a worst-case estimation error of 0.42% (for one set
       of 10 million integers) and an average error of no more than 0.17%,
       at the cost of 256 KB of memory for the internal data structure per
       evaluation of the HLL algorithm. Other precisions (such as 16 and
       17) also gave reasonable results, with slightly larger estimation
       errors.
                                                                                           
    2. Ran execution time tests against a total of 6 distinct data files
       on a single-node EC2 VM in debug mode. These data files were loaded
       in turn into a single column in an external Impala table. The total
       execution time was roughly the same across different scales for a
       given table configuration. The execution time for tables involving
       multiple data files across multiple nodes remains to be measured.
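
The error metric used in the estimation tests above can be written as a small Python function. This is a minimal sketch; the function name is hypothetical and the input values in the usage example are made up for illustration, not taken from the actual test runs.

```python
def estimation_error(true_unique, estimated_unique):
    """Relative error: abs(true - estimated) / true."""
    if true_unique <= 0:
        raise ValueError("true_unique must be positive")
    return abs(true_unique - estimated_unique) / true_unique


if __name__ == "__main__":
    # Hypothetical values: an exact distinct count (e.g. from COUNT(DISTINCT ...))
    # and an NDV estimate for the same column.
    true_count = 10_000_000
    ndv_estimate = 10_042_000
    print(f"error = {estimation_error(true_count, ndv_estimate):.2%}")
```

The metric is symmetric in direction: an overestimate and an underestimate of the same size yield the same error.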
                                                      

> Document the newly added scale argument of ndv function
> -------------------------------------------------------
>
>                 Key: IMPALA-10538
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10538
>             Project: IMPALA
>          Issue Type: Documentation
>          Components: Docs
>            Reporter: Quanlong Huang
>            Assignee: shajini thayasingh
>            Priority: Critical
>
> We added a new argument, scale, to the ndv() function in IMPALA-2658 to control the precision. We need to update the related documents and give more examples in
> [https://github.com/apache/impala/blob/d271baa33da1a02aa6ffc47b0380dc62239107b4/docs/topics/impala_ndv.xml]
> Web link is [https://impala.apache.org/docs/build/html/topics/impala_ndv.html]
> cc [~sql_forever] who is the author of this great feature. He could provide more details.


