You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@superset.apache.org by GitBox <gi...@apache.org> on 2018/07/25 21:09:05 UTC

[GitHub] villebro opened a new pull request #5487: Match viz dataframe column case to form_data fields

villebro opened a new pull request #5487: Match viz dataframe column case to form_data fields
URL: https://github.com/apache/incubator-superset/pull/5487
 
 
   This might be slightly hacky, but I feel SQL Alchemy gives so much latitude to engines that Superset might need to be slightly more forgiving if a query result has different case compared to the datasource/form metadata. A brief summary of what this PR does:
   
   Adjust dataframe column case, by performing _case-sensitive_ comparison of all colums in central fields in form_data (metrics, groupby) with column names in dataframe:
   - Leave all matches untouched.
   - Replace all dataframe columns with whichever _case-insensitive_ variant is present in form_data.
   
   Examples:
   - form_data: `__timestamp`, dataframe: `__timestamp` -> Do nothing.
   - form_data: `__timestamp`, dataframe: `__TIMESTAMP` -> Rename dataframe column name to `__timestamp`.
   - form_data: `__timeSTAMP`, dataframe: `__TIMESTAMP` -> Rename dataframe column name to `__timeSTAMP`.
   
   The dataframe is adjusted prior to caching in `get_df_payload`. To minimize risk of collisions, dedup has been changed to be able to perform deduping both in a case-sensitive and case-insensitive manner. Default handling would now be case-insensitive, while maintaining original case. Example from test case (note the last Bar which is seen as a duplicate despite different case):
   
   ```
   >>> print(','.join(dedup(['foo', 'bar', 'bar', 'bar', 'Bar'], case_sensitive=False)))
   foo,bar,bar__1,bar__2,Bar__3
   ```
   
   While there is a theoretical risk of the new deduping logic causing dismay for some users, I've yet to see anyone intentionally use identical column names with different case (e.g. `id`, `Id` and/or `ID`) in the same dataset.
   
   Oh, I also fixed what seemed to be a small bug in how `handle_nulls()` was implemented.
   
   Feel free to try it out, and don't feel shy to give feedback/critique. Like stated above, this might be borderline hacky, but seems to work fairly well, and might make it easier to add new engines later on. This fix should also be backwards compatible (except for the handle_nulls()-part, which was recently added), at least `0.26` and `0.25`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org