Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/07/26 17:38:52 UTC

[GitHub] [incubator-druid] nishantmonu51 opened a new issue #4349: [Proposal] Improve Handling of Nulls in Druid

URL: https://github.com/apache/incubator-druid/issues/4349
 
 
   ## Motivation  
   This proposal is to improve handling of null values in Druid by treating nulls as missing values that do not actually exist. This will make Druid more compatible with the SQL standard and help with integration with existing BI systems that support ODBC/JDBC.
   
   ## Current Behavior 
   Null handling in Druid currently works as follows: 
   1. String dimensions - nulls and empty strings are considered equivalent. 
   2. Long/Float columns - nulls are considered equivalent to 0.
   
   ## Proposed Behavior Changes for Null Handling
   ### Null semantics after the changes described in this proposal
   1. Null and empty strings will be handled differently.
   2. For every column type, a null value will be considered equivalent to an absent/missing value. 
   3. Aggregate functions will ignore null values. If the data set contains no rows, or contains only rows with nulls as arguments to the aggregate function, the function returns null, except count, which returns 0 instead of null. 
   e.g.
       1. A min aggregation over (null, 5) will give a result of 5.
       2. longSum/doubleSum over values (null, 1, null) will return 1. 
       3. Count over values (null, 1, null) will return 1.
       4. Count over values (null, null) will return 0.
   4. Rollup of metrics - if a null and a non-null value are rolled up by an aggregator at ingestion time, the result is defined by the aggregator behavior specified above.
   5. Math expressions - the result value will be null if any of the components is null. 
   6. A missing column will be considered equivalent to a column having null values for all rows.
   7. For sorting, nulls will sort as the lowest value by default. Users can specify as part of the query that nulls sort as the highest value. 
   8. Users will be able to specify default values to replace null values at ingestion time. 
   9. Comparisons between null and non-null values will follow [three-valued logic](https://en.wikipedia.org/wiki/Three-valued_logic): the result of a comparison involving null is not a boolean value but a null value.
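   The aggregation and comparison semantics above can be sketched in a few lines. This is an illustrative model only (plain Python, not Druid code): aggregators skip nulls, count counts only non-null values, and a comparison involving null yields null rather than a boolean.

```python
# Sketch of the proposed null semantics; None stands in for SQL NULL.

def null_aware_sum(values):
    # Ignore nulls; return null when no non-null value exists.
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None

def null_aware_min(values):
    non_null = [v for v in values if v is not None]
    return min(non_null) if non_null else None

def null_aware_count(values):
    # Count only the non-null values; empty/all-null input gives 0.
    return sum(1 for v in values if v is not None)

def null_aware_gt(a, b):
    # Three-valued logic: null compared with anything is null.
    if a is None or b is None:
        return None
    return a > b

print(null_aware_min([None, 5]))         # 5
print(null_aware_sum([None, 1, None]))   # 1
print(null_aware_count([None, 1, None])) # 1
print(null_aware_count([None, None]))    # 0
print(null_aware_gt(None, 3))            # None, not False
```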
   
   ## Implementation Changes 
   ### Ability to specify default values for dimensions/metrics at ingestion 
   The idea is to allow users to specify a default value for each column that can be used to replace nulls during ingestion, e.g.: 
   
   ```json
   [
     {
       "type" : "String",
       "name" : "column1",
       "replaceMissingValuesWith" : "abc"
     },
     {
       "type" : "Long",
       "name" : "column2",
       "replaceMissingValuesWith" : 5
     }
   ]
   ```
   
   In this case, any null value in column1 will be replaced with the string "abc", and any null value in the long column column2 will be replaced with 5. The final segment created will store these default values instead of nulls. 
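   A minimal sketch of this ingestion-time replacement, assuming column specs shaped like the JSON above (the `apply_defaults` helper is hypothetical, not a Druid API):

```python
# Column specs mirroring the proposal's JSON example.
specs = [
    {"type": "String", "name": "column1", "replaceMissingValuesWith": "abc"},
    {"type": "Long",   "name": "column2", "replaceMissingValuesWith": 5},
]

def apply_defaults(row, specs):
    # Replace each null column value with the configured default.
    out = dict(row)
    for spec in specs:
        if out.get(spec["name"]) is None:
            out[spec["name"]] = spec["replaceMissingValuesWith"]
    return out

print(apply_defaults({"column1": None, "column2": None}, specs))
# {'column1': 'abc', 'column2': 5}
```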
   
   ## Storage Format Changes 
   The current storage format cannot differentiate between a null value and a present value. We need to change the storage format to also store information about the presence of a value. We can choose from the following options: 
   1. For String columns we can choose one of the following: 
       1. In the dictionary encoding, reserve 0 as the id mapped to the null value; any other value gets a dictionary id > 0. This also works for multi-value columns, where an array of values can store 0 as an id. This requires incrementing the segment version for backwards compatibility. 
       2. Add a presence bitmap that tells whether a value is present or absent in each row. For a multi-value row, the presence bit may be unset even though the row has some values, indicating a multi-value entry that includes null. 
   2. For Long/Float columns - add a presence bitmap to the column that tells whether the value is present or not.
   
   For String columns, approach (1) seems better, as it also simplifies the handling of multi-value columns.
   We can also skip creating a presence bitmap when all the values in a column are known to be non-null. 
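   To make option (1) concrete, here is a toy dictionary encoder that reserves id 0 for null, assigning ids from 1 for real values; a multi-value row can then include 0 among its ids. This is a sketch of the idea, not Druid's actual dictionary implementation.

```python
class NullAwareDictionary:
    """Dictionary encoding where id 0 is permanently reserved for null."""

    def __init__(self):
        self.value_to_id = {}      # null is implicit: it is always id 0
        self.id_to_value = [None]  # slot 0 decodes back to null

    def encode(self, value):
        if value is None:
            return 0
        if value not in self.value_to_id:
            self.value_to_id[value] = len(self.id_to_value)
            self.id_to_value.append(value)
        return self.value_to_id[value]

    def decode(self, i):
        return self.id_to_value[i]

d = NullAwareDictionary()
# A multi-value row containing a null: the 0 id marks the null entry.
row = [d.encode(v) for v in ["a", None, "b"]]
print(row)          # [1, 0, 2]
print(d.decode(0))  # None (null)
```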
   
   ### After the changes, the schema for the columns will be as follows 
   1. String Column - 
      1. Dictionary encoding starting with 1; 0 will be reserved for mapping the null value. 
      2. Column Data
      3. Bitmap Indexes
   2. Numeric Column - 
       1. Column Data 
       2. Presence Bitmap 
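   The numeric-column layout can be pictured as two parallel structures: the raw column data plus a presence bitmap with one bit per row. A cleared bit means the stored value is just a placeholder and the row is null. A minimal sketch, assuming a simple list-based bitmap:

```python
# Column data: 0 is only a filler where the presence bit is cleared.
values   = [5, 0, 7, 0]
presence = [1, 0, 1, 0]   # 1 = value present, 0 = row is null

def read(row):
    # Consult the presence bitmap before trusting the stored value.
    return values[row] if presence[row] else None

print([read(i) for i in range(4)])  # [5, None, 7, None]
```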
   
   ## Nullability for Aggregators/Metrics 
   As per the current implementation, most aggregators replace null values with a default, e.g. sum treats them as zeroes, min treats them as positive infinity, etc. 
   To match the new semantics we need to make the following changes: 
   1. Make aggregators nullable: if an aggregator encounters only null values, the result will be null. 
   2. The behavior of max, min and sum will otherwise be unchanged, except that they will return null when all encountered values are null. 
   3. The count aggregator will ignore null values and only count non-null values; if all encountered values are null, the result will be 0. This differs from the current behavior, where null values are also counted.
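   A nullable aggregator can be modeled as ordinary aggregation state plus a "seen a non-null value" flag that finalization checks. The class below is a hypothetical sketch of that pattern, not an existing Druid aggregator:

```python
class NullableSumAggregator:
    """Sum that yields null (None) when it never saw a non-null value."""

    def __init__(self):
        self.seen = False   # becomes True on the first non-null input
        self.total = 0

    def aggregate(self, value):
        if value is not None:
            self.seen = True
            self.total += value

    def get(self):
        # Finalize: null if only nulls (or nothing) were aggregated.
        return self.total if self.seen else None

agg = NullableSumAggregator()
for v in [None, None]:
    agg.aggregate(v)
print(agg.get())  # None, not 0
```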
   
   ### Math Expressions Null Handling 
   Null handling for math expressions will be similar to the aggregators. For general calculations such as sum, the full expression will be considered null if any of its components is null. Specifying a default value for null will be supported via an NVL or IF clause that assigns default values at query time.
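   The propagation rule and the NVL escape hatch can be sketched as follows (function names are illustrative, not the actual expression grammar):

```python
def expr_add(a, b):
    # Any null component makes the whole expression null.
    if a is None or b is None:
        return None
    return a + b

def nvl(value, default):
    # Query-time default: substitute for null before evaluation.
    return default if value is None else value

print(expr_add(None, 3))          # None: null poisons the expression
print(expr_add(nvl(None, 0), 3))  # 3: NVL supplies a default first
```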
   
   ### Filtering on Null values 
   SelectorDimFilter currently supports filtering on null values, but the implementation assumes null and empty strings are equivalent. The implementation will be changed to treat null and the empty string differently. 
   Generation of the cache key for SelectorDimFilter also needs to be modified to support null. 
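   The behavioral change to selector-style matching is small but visible: under the new semantics, a filter on value null matches only missing values, not empty strings. A sketch of the predicate (illustrative, not the SelectorDimFilter implementation):

```python
def selector_matches(row_value, filter_value):
    # Filtering on null matches only actual nulls under the new semantics.
    if filter_value is None:
        return row_value is None   # "" no longer matches null
    return row_value == filter_value

print(selector_matches(None, None))  # True
print(selector_matches("", None))    # False (changed behavior)
print(selector_matches("", ""))      # True
```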
   
   ### Changes to the Druid built-in SQL layer 
   1. IS NULL and IS NOT NULL will be mapped to SelectorDimFilter with value null. 
   2. NVL, IFNULL and COALESCE will be mapped to the DimensionSpec replaceMissingValuesWith property. 
   
   ### Misc Changes
   The above are the major changes to user-facing APIs and behavior. Beyond these, there are multiple places in the code where we convert empty strings to null and vice versa. They will be changed accordingly to treat null and empty strings separately. 
   
   ### Backwards Compatibility 
   1. For backwards compatibility we can add a new runtime property to preserve the old behavior, e.g. 'druid.null.behavior'. If this property is set to 'sql', we will follow the SQL behavior; if it is not set, we will preserve backwards compatibility. 
   2. Query execution backwards compatibility - there are many places in the code that rely on current Druid null handling, e.g. the count aggregator. They will need to be made backwards compatible in order to preserve the current behavior where null and "" are treated the same. This means having two branches of code in every place that currently assumes null and empty strings are equivalent during query execution, e.g. filters, extraction functions, lookups, the count aggregator and more.
   3. Ingestion time - at ingestion we currently replace null with empty strings; the same behavior will be preserved by having multiple code paths based on the 'druid.null.behavior' property. 
   4. Increment the versions of columns to support backwards compatibility with the storage changes, while remaining able to read columns with the older version. 
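   The two-branch structure driven by the proposed 'druid.null.behavior' property can be sketched like this (the property name comes from the proposal; the `normalize_string` helper is hypothetical):

```python
# 'sql' follows the new semantics; anything else keeps legacy behavior.
NULL_BEHAVIOR = "sql"

def normalize_string(value):
    if NULL_BEHAVIOR == "sql":
        return value                          # null stays null, "" stays ""
    return "" if value is None else value     # legacy: coerce null to ""

print(normalize_string(None))  # None under 'sql' mode, '' under legacy
```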
