Posted to notifications@superset.apache.org by GitBox <gi...@apache.org> on 2021/01/25 07:06:49 UTC

[GitHub] [superset] wernerdaehn opened a new issue #12724: [SIP] Propose visualizations based on data

wernerdaehn opened a new issue #12724:
URL: https://github.com/apache/superset/issues/12724

## [SIP] Propose visualizations based on data

### Motivation

I have worked for Business Objects and SAP and have been in the business intelligence market for more than 20 years. One thing that is still unsatisfying is how charting options are chosen.
Over time, the number of available charts and their variants will keep growing, and selecting from a long list is cumbersome. Also, not everybody knows every visualization option for every case.
But given that Superset has a semantic layer, the visualizations can be preselected.

Example: 2 attributes and 2 measures? A pie chart is very likely not the proper visualization.

There is an entire academic theory about different axis types (nominal, ordinal, interval, and ratio scales), for example. If you are interested, we can work on the details.

### Proposed Change

1. Collect more metadata about attributes: Number of distinct values, what axis type it can be used for,...
2. Define the aggregation type of a measure and if it is semi-additive
3. For each charting option and variant, specify a rank for how useful it is based on the number of attributes, the number of measures, the axis type of each attribute, and the measure type.
4. Order the charting options by an overall rank.
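
Steps 3 and 4 could be sketched roughly as follows. This is only an illustration: `ChartRule`, `RULES`, and `rank_charts` are hypothetical names, the scores are made up, and none of it is existing Superset API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ChartRule:
    """Hypothetical rule: how useful a chart is for a given query shape."""
    chart: str
    # Returns a usefulness score from 0.0 (unsuitable) to 1.0 (ideal).
    score: Callable[[int, int], float]


# Illustrative scores only; real values would need tuning and discussion.
RULES: List[ChartRule] = [
    ChartRule("pie", lambda attrs, measures: 1.0 if (attrs, measures) == (1, 1) else 0.0),
    ChartRule("bar", lambda attrs, measures: 0.8 if attrs >= 1 and measures >= 1 else 0.0),
    ChartRule("scatter", lambda attrs, measures: 0.9 if measures >= 2 else 0.0),
    ChartRule("table", lambda attrs, measures: 0.3),
]


def rank_charts(n_attributes: int, n_measures: int) -> List[Tuple[str, float]]:
    """Score every chart and return the useful ones, best first."""
    scored = [(r.chart, r.score(n_attributes, n_measures)) for r in RULES]
    useful = [(chart, s) for chart, s in scored if s > 0.0]
    return sorted(useful, key=lambda item: item[1], reverse=True)


# 2 attributes and 2 measures: the pie chart drops out of the list.
print(rank_charts(2, 2))
```

A real implementation would also take the axis type and measure type into account; the two counts are used here only to keep the sketch short.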

Please let me know if you are interested, and I will spend some time working out the details.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org

[GitHub] [superset] ktmud commented on issue #12724: [SIP] Propose visualizations based on data

Posted by GitBox <gi...@apache.org>.

ktmud commented on issue #12724:
URL: https://github.com/apache/superset/issues/12724#issuecomment-766729401

Thanks for bringing up this topic! This definitely is an interesting area of work and has a lot of potential for Superset.

What you described is automated chart specification, or automated Exploratory Data Analysis (EDA), which is also quite big among DataViZ & ML researchers: https://github.com/mstaniak/autoEDA-resources

It would be tremendously valuable if we could somehow integrate the latest research findings into open source/commercial BI software.

This SIP is a good starting point, but I’d recommend continuing to research this topic and starting to dig into the Superset codebase/architecture to form a better idea of:

1. What is possible and what is not?
2. What is the MVP?
3. Which APIs do we need to change or add?
4. What other areas of work do we need to tackle first before working on this? E.g. SIP-34 column stats looks like a must.

Some other useful links:
- https://github.com/antvis/AVA
- https://exploratory.io/
- https://www.usedive.com

[GitHub] [superset] wernerdaehn commented on issue #12724: [SIP] Propose visualizations based on data

Posted by GitBox <gi...@apache.org>.

wernerdaehn commented on issue #12724:
URL: https://github.com/apache/superset/issues/12724#issuecomment-766704359


   @junlincc Thanks for the feedback. Just for the record, what Tableau does is just the very beginning!
   See here for how wide the topic can get: https://datavizproject.com/


[GitHub] [superset] wernerdaehn edited a comment on issue #12724: [SIP] Propose visualizations based on data

Posted by GitBox <gi...@apache.org>.

wernerdaehn edited a comment on issue #12724:
URL: https://github.com/apache/superset/issues/12724#issuecomment-824606806

@rusackas By all means, Evan! More than happy to contribute.

As a preliminary start, here is my thinking:

According to statistics, there are four types of scales, ordered by capability:

- Nominal: The only useful calculation is counting. Example: color.
- Ordinal: In addition, it has an order. Example: user satisfaction 1-10. It is clear that 1 is better than 2, but the difference between 1 and 3 does not have the same meaning as the difference between 8 and 10.
- Interval: In addition, the distance between two values is meaningful. Example: Today it is 5°C warmer than yesterday.
- Ratio: In addition, it has a meaningful zero, and hence comparisons such as ratios and percentages make sense. Example: Revenue was 10% higher.
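
Because each scale type adds a capability to the previous one, the ordering can be modelled directly. The following is a minimal sketch; `Scale`, `MIN_SCALE`, `supports`, and the operation names are assumptions made for illustration.

```python
from enum import IntEnum


class Scale(IntEnum):
    """The four scale types, ordered so that a higher value can do more."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# The weakest scale on which each operation is meaningful (illustrative names).
MIN_SCALE = {
    "count": Scale.NOMINAL,
    "sort": Scale.ORDINAL,
    "difference": Scale.INTERVAL,
    "percent_change": Scale.RATIO,
}


def supports(scale: Scale, operation: str) -> bool:
    """True if `operation` is meaningful for values on `scale`."""
    return scale >= MIN_SCALE[operation]


print(supports(Scale.ORDINAL, "sort"))
print(supports(Scale.INTERVAL, "percent_change"))
```

An `IntEnum` is used so that "at least ordinal" becomes a plain `>=` comparison.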

If somebody wants to visualize a nominal value and a ratio value, e.g. revenue per color, a bar chart is one of the few chart types that make sense. For two ratio values, e.g. revenue per customer age, a scatter plot is well suited.

The next type of decision is the number of axes.
- If there is a single nominal axis, e.g. gender, a pie chart might be interesting to show the number of customers per gender.
- If I want to visualize revenue compared to the previous year's revenue per country and time, I need a chart type that can show a ratio scale, a list of regions, and the development over time. A geomap colored as a heatmap with a time animation would do the trick.

The type of axis can further be refined:

- time: year, month, day, timestamp, week, weekday
- geo
- hierarchy

One side effect of these types is how to render missing values. A country without revenue should still be present (geomap) or not (bar chart). A month without revenue should still be shown; you do not want to see just 11 months.
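
For the time axis, the missing-month case can be sketched as below. The function name `fill_missing_months` and the `YYYY-MM` key format are assumptions, and filling with `0.0` rather than leaving a gap is itself a design decision.

```python
from typing import Dict, List, Tuple


def fill_missing_months(revenue: Dict[str, float], year: int) -> List[Tuple[str, float]]:
    """Return all 12 months of `year`, using 0.0 where no revenue was recorded."""
    months = [f"{year}-{m:02d}" for m in range(1, 13)]
    return [(month, revenue.get(month, 0.0)) for month in months]


# November is absent from the query result but must still appear on the axis.
sparse = {f"2020-{m:02d}": 100.0 for m in range(1, 13) if m != 11}
series = fill_missing_months(sparse, 2020)
print(len(series), series[10])
```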

The number of distinct values of nominal and ordinal scales is an important decision point as well. A pie chart with 5000 categories might not be the best-suited chart type. The revenue per country over time mentioned above could be shown as a line chart with one line per country. That is excellent for comparisons between countries, unless you have 100 countries and hence 100 lines.
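
A cardinality check of this kind might look like the sketch below. The limits in `MAX_CATEGORIES` are guesses for illustration, not researched thresholds, and the names are hypothetical.

```python
from typing import Dict, List

# Hypothetical upper bounds on distinct categories; the numbers are guesses.
MAX_CATEGORIES: Dict[str, int] = {"pie": 8, "line": 12, "bar": 50}


def charts_for_cardinality(distinct_values: int) -> List[str]:
    """Keep only the charts that stay readable with this many categories."""
    return [chart for chart, limit in MAX_CATEGORIES.items() if distinct_values <= limit]


print(charts_for_cardinality(5))
print(charts_for_cardinality(100))
```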

The final decision type is the purpose of the visualization:
- Comparison
- Relationship
- Proportion
- Percent of the whole
- Location
- Distribution...

The nice thing is that we can start small and grow the solution. Initially we just categorize each column of the result set into a scale type, and each chart carries the information about which scale types it allows for which axis. That by itself would reduce the list of charts to offer by a lot. From there we can keep growing the available metadata on the data and on the charts.
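
That first small step could be sketched like this. `CHART_AXES` and `matching_charts` are hypothetical, and the per-axis scale sets are illustrative choices rather than agreed chart metadata.

```python
from typing import Dict, FrozenSet, List, Tuple

# Scale types as plain strings to keep the sketch self-contained.
NOMINAL, ORDINAL, INTERVAL, RATIO = "nominal", "ordinal", "interval", "ratio"

# Hypothetical chart metadata: for each axis, the scale types it accepts.
CHART_AXES: Dict[str, Tuple[FrozenSet[str], ...]] = {
    "bar": (frozenset({NOMINAL, ORDINAL}), frozenset({RATIO})),
    "scatter": (frozenset({INTERVAL, RATIO}), frozenset({INTERVAL, RATIO})),
    "pie": (frozenset({NOMINAL}), frozenset({RATIO})),
}


def matching_charts(column_scales: List[str]) -> List[str]:
    """Return charts whose axes accept the result-set columns, in order."""
    return [
        chart
        for chart, axes in CHART_AXES.items()
        if len(axes) == len(column_scales)
        and all(scale in allowed for scale, allowed in zip(column_scales, axes))
    ]


print(matching_charts([NOMINAL, RATIO]))  # e.g. revenue per color
print(matching_charts([RATIO, RATIO]))    # e.g. revenue per customer age
```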

[GitHub] [superset] junlincc edited a comment on issue #12724: [SIP] Propose visualizations based on data

Posted by GitBox <gi...@apache.org>.

junlincc edited a comment on issue #12724:
URL: https://github.com/apache/superset/issues/12724#issuecomment-766699982

Thanks for suggesting! @wernerdaehn

> 1. Collect more metadata about attributes: Number of distinct values, what axis type it can be used for,...

It is aligned with our long-term product roadmap. In fact, when we implemented the new time picker in Superset, we thought about allowing users to query the earliest (min) and latest (max) time available in the timestamp dimension. We couldn't get to it by v1.0 because of potential performance issues and our time constraints. Collecting more dataset metadata is something we want to do once we get to refactoring the major control fields like metrics, filters, etc.

> 2. Define the aggregation type of a measure and if it is semi-additive

Something we will consider. It will probably require us to 'thicken' our semantic layer in Superset, which would steepen Superset's learning curve.

> 3 & 4.

Both are features available in Tableau. I agree they provide a nice user experience and enable non-technical users to create visualizations intuitively. We would love to get to both someday.

https://user-images.githubusercontent.com/67837651/105690448-dd00bc00-5eb0-11eb-8145-8aea2a934399.mov

[GitHub] [superset] junlincc commented on issue #12724: [SIP] Propose visualizations based on data

Posted by GitBox <gi...@apache.org>.

junlincc commented on issue #12724:
URL: https://github.com/apache/superset/issues/12724#issuecomment-766703663


   @wernerdaehn if you would like to contribute any of the above items to Superset in any way, we would love to work with you!


[GitHub] [superset] wernerdaehn commented on issue #12724: [SIP] Propose visualizations based on data

Posted by GitBox <gi...@apache.org>.

wernerdaehn commented on issue #12724:
URL: https://github.com/apache/superset/issues/12724#issuecomment-766705212


   Any suggestions on what I can do for you in that regard? Otherwise I will try to come up with something to discuss, but I would love to get your guidance.


[GitHub] [superset] stale[bot] commented on issue #12724: [SIP] Propose visualizations based on data

Posted by GitBox <gi...@apache.org>.

stale[bot] commented on issue #12724:
URL: https://github.com/apache/superset/issues/12724#issuecomment-868936156


   This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue `.pinned` to prevent stale bot from closing the issue.
   


[GitHub] [superset] ktmud edited a comment on issue #12724: [SIP] Propose visualizations based on data

Posted by GitBox <gi...@apache.org>.

ktmud edited a comment on issue #12724:
URL: https://github.com/apache/superset/issues/12724#issuecomment-766729401

Thanks for bringing up this topic! This definitely is an interesting area of work and has a lot of potential for Superset.

What you described is often called automated chart specification, or automated Exploratory Data Analysis (EDA), which is also quite big among DataViZ academics: https://github.com/mstaniak/autoEDA-resources

It would be tremendously valuable if we could somehow integrate the latest research findings into open source/commercial BI software.

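This SIP is a good starting point, which seems to have identified a couple of items we can already do. I’d recommend continuing to research this topic and starting to dig into the Superset codebase/architecture to form a more concrete action plan. We should at least be able to answer:

1. What is possible and what is not?
2. What is the MVP?
3. Which APIs do we need to change or add?
4. What other areas of work do we need to tackle first before working on this? E.g. SIP-34 column stats looks like a must.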

Some other useful links:
- https://github.com/antvis/AVA
- https://exploratory.io/
- https://www.usedive.com

[GitHub] [superset] junlincc commented on issue #12724: [SIP] Propose visualizations based on data

Posted by GitBox <gi...@apache.org>.

junlincc commented on issue #12724:
URL: https://github.com/apache/superset/issues/12724#issuecomment-766699982


   Thanks for suggesting! @wernerdaehn 
   
   > 1. Collect more metadata about attributes: Number of distinct values, what axis type it can be used for,...
   
   It is aligned with our long-term product roadmap. In fact, when we implemented the new time picker in Superset, we thought about allowing users to query the earliest (min) and latest (max) time available in the timestamp dimension. We couldn't get to it by v1.0 because of potential performance issues and our time constraints. Collecting more dataset metadata is something we want to do once we get to refactoring the major control fields like metrics, filters, etc.
   
   > 2. Something we will consider. It will probably require us to 'thicken' our semantic layer in Superset, which would steepen Superset's learning curve.
   
   > 3 & 4. Both are features available in Tableau. I agree they provide a nice user experience and enable non-technical users to create visualizations intuitively. We would love to get to both someday.
   
   https://user-images.githubusercontent.com/67837651/105690448-dd00bc00-5eb0-11eb-8145-8aea2a934399.mov
   
   
   


[GitHub] [superset] rusackas commented on issue #12724: [SIP] Propose visualizations based on data

Posted by GitBox <gi...@apache.org>.

rusackas commented on issue #12724:
URL: https://github.com/apache/superset/issues/12724#issuecomment-824584958


   I just wanted to chime in and say that I love this idea, and it's something that my team is _starting_ to investigate more seriously. @wernerdaehn would you be interested in joining discussions (synchronously or otherwise) around this and being a part of implementing the solution? If not, I think we may need more clarification on the approaches to implementation and any risks/dependencies involved, as @ktmud was suggesting. In other words, I think this is a great idea for a SIP, but we need more details to be able to put it to a vote and carry it out effectively.

