You are viewing a plain text version of this content. The canonical link for it is here.

Posted to notifications@superset.apache.org by GitBox <gi...@apache.org> on 2021/01/25 08:32:32 UTC

[GitHub] [superset] wernerdaehn opened a new issue #12729: [SIP] Support star schemas

wernerdaehn opened a new issue #12729:
URL: https://github.com/apache/superset/issues/12729

## [SIP] Support star schemas

### Motivation

While I am absolutely fine with the decision to base a diagram on a single table/view only, most databases cannot cope with that. A typical view would join 100 tables and have 5000 columns.
Take sales orders as example: For a proper self service BI you would join order header, order line item, three times the customer mater table for soldto-shipto-billto, the material master table, tons of tiny lookup tables with e.g. statusid, statustext etc.
If the joined table has a foreign key defined and the column is not-null, then the database optimizer can ignore that join in case none of the columns are used. Only a few databases are that advanced.
Further more, for many visual clues like slice/dice and filters a distinct list of attributes is needed. What is faster? Finding the three order states in the state text table or a select on the text table itself?

Hence I would suggest to support star schema data models. A chart is based on a fact table. The list of columns shown are all columns of the fact table plus all columns of all tables this fact table has a FK relationship with.

### Proposed Change

1. In addition to tables there are virtual objects FACT_TABLE and DIMENSION_TABLE. These simply point to physical tables. Note: A fact table in one data model can be a dimension table for another. Example: ORDER is a fact table with the measure ORDER_AMOUNT but is a referenced dimension table for the BILLINGS fact.
2. There is a relationship between facts and dimensions. Only trivial PK based equal conditions are allowed.
3. Charts can use tables or FACT_TABLEs.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org

[GitHub] [superset] zhaoyongjie commented on issue #12729: [SIP] Support star schemas

Posted by GitBox <gi...@apache.org>.

zhaoyongjie commented on issue #12729:
URL: https://github.com/apache/superset/issues/12729#issuecomment-766668628


   @wernerdaehn 
   Thank you for your comments. I think stars or snowflake model is OLAP Dataware house concept, for more modern BI prefer to use single-table model, like PBI/Tableau.
   
   What do you think?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org

[GitHub] [superset] wernerdaehn commented on issue #12729: [SIP] Support star schemas

Posted by GitBox <gi...@apache.org>.

wernerdaehn commented on issue #12729:
URL: https://github.com/apache/superset/issues/12729#issuecomment-766693338


   @villebro A generic join feature I would advise against. Things like a theta join, snowflake-like joins with intermediate tables etc. These can all be built in the database using views. So why implementing these things a second time? I would really support only the most trivial case with a single fact in the middle and directly linked dimension tables with PKs. 
   This gives you the biggest bang for your bucks.
   
   As with the other proposals, let me know if I should create a comprehensive write-up. I would be willing to invest the time, but only if it is worth it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org

[GitHub] [superset] villebro commented on issue #12729: [SIP] Support star schemas

Posted by GitBox <gi...@apache.org>.

villebro commented on issue #12729:
URL: https://github.com/apache/superset/issues/12729#issuecomment-766704025


   @wernerdaehn I was speaking in general terms, i.e. supporting joining tables beyond the current table/virtual table functionality.
   
   One thing that sticks out: if we have 100 FKs and thousands of columns, this can quickly become very burdensome, both for the application but even for the end user. So this needs to be properly scoped to make sure it improves the user experience rather than overwhelm it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org

[GitHub] [superset] wernerdaehn commented on issue #12729: [SIP] Support star schemas

Posted by GitBox <gi...@apache.org>.

wernerdaehn commented on issue #12729:
URL: https://github.com/apache/superset/issues/12729#issuecomment-766685154






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org

[GitHub] [superset] wernerdaehn commented on issue #12729: [SIP] Support star schemas

Posted by GitBox <gi...@apache.org>.

wernerdaehn commented on issue #12729:
URL: https://github.com/apache/superset/issues/12729#issuecomment-766709704


   Understood. My fear is just that when you implement that, you start with the SQL join clause in mind. I have seen multiple products fail because the join clause is extremely powerful and frankly, not suited for graphical tools. Use the SQL Editor of Superset for those. Anyway, you got my point, that is the important part.
   
   Regarding the second paragraph, while your statement is certainly correct, it is not related to the star schema feature as such. If all needs to be in a single table, then this single table has equally many columns and must not be overwhelming as well.
   
   Please let me know if I can do any work in that regards, in case it makes sense.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org

[GitHub] [superset] ducchetrongminh edited a comment on issue #12729: [SIP] Support star schemas

Posted by GitBox <gi...@apache.org>.

ducchetrongminh edited a comment on issue #12729:
URL: https://github.com/apache/superset/issues/12729#issuecomment-924186508

I would love to see this feature go live, not sure if there is any progress. The description above is good enough, I just want to share my case.
Currently, I am using dbt for transformation, and I am following the star schema practice. Then I will visualize on Metabase. As metabase supports star schema model, this approach works well.
If I go with Superset, I will need to create 1 flat table. Some issues with the flat table are:
- storage: the dimension data will be duplicated and redundancy
- no benefit from incremental processing: when there is change in dimension table, I will need to refresh the whole flat table. If we support star schema, the fact table can remain unchanged
- slow on filter values: it is faster to fetch 30k customer names compared to 100mil customer names (but only 30k distinct name)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [superset] villebro commented on issue #12729: [SIP] Support star schemas

Posted by GitBox <gi...@apache.org>.

villebro commented on issue #12729:
URL: https://github.com/apache/superset/issues/12729#issuecomment-766688379


   Thanks for the comprehensive explanation @wernerdaehn ! Ping @amitmiran137 , here's a good motivation for adding support for join semantics.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org

[GitHub] [superset] ducchetrongminh commented on issue #12729: [SIP] Support star schemas

Posted by GitBox <gi...@apache.org>.

ducchetrongminh commented on issue #12729:
URL: https://github.com/apache/superset/issues/12729#issuecomment-924186508

I would love to see this feature go live, not sure if there is any progress. The description above is good enough, I just want to share my case.
Currently, I am using dbt for transformation, and I am following the star schema practice. Then I will visualize on Metabase. This approach works well because we have 1 central dimension table, as well as storage optimizing.
If I go with Superset, I will need to create 1 flat table. Some issues with the flat table are:
- storage: the dimension data will be duplicated and redundancy
- no benefit from incremental processing: when there is change in dimension table, I will need to refresh the whole flat table. If we support star schema, the fact table can remain unchanged
- slow on filter values: it is faster to fetch 30k customer names compared to 100mil customer names (but only 30k distinct name)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [superset] djouallah commented on issue #12729: [SIP] Support star schemas

Posted by GitBox <gi...@apache.org>.

djouallah commented on issue #12729:
URL: https://github.com/apache/superset/issues/12729#issuecomment-927030582


   @ducchetrongminh you can build a virtual dateset based on a view which join the fact to the dimension tables ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org

[GitHub] [superset] wernerdaehn commented on issue #12729: [SIP] Support star schemas

Posted by GitBox <gi...@apache.org>.

wernerdaehn commented on issue #12729:
URL: https://github.com/apache/superset/issues/12729#issuecomment-766685154

@zhaoyongjie From a technical side, if all databases would support views with a virtual unlimited number of columns and do a perfect job from the SQL optimizer point of view, then yes. But the opposite is the case. Only Oracle And SAP Hana (Calculation Views) are even close to that and far from perfect.
From a logical point of view, Superset wants to sum up measures and group by attributes. That is a dimensional concept.
Further more, if the tool knows what a fact table is and what a dimension, we have much more options later. For example you might have an ORDER and DELIVERY table, both have a shared dimension CUSTOMER. In a single-table model the customer table with its 500 columns must be defined multiple times whereas with a shared dimension just once. And in the dashboard, when you want to view the order amount, shipped amount and order backlog, a shared dimension allows to filter all measures at once. You know what is shared and how.
Also for the users the selection of columns is much easier when they are grouped. The customer master dimension obviously provides all columns that are customer related, the material all material related fields. In a single table model you would have 300 columns from the one and 300 columns from the other. And if customer can be filtered on sold-to, ship-to and bill-to basis, you would have the 300 customer fields three times.
And finally, while today the semantic layer of Superset is very thin, it will grow. And then specifying the same information multiple times get even more of a burden.

So no, this has nothing to do with OLAP, it is related to being a business intelligence tool.

I see that as something fairly straight forward to implement and with lots of potential initially and later.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [superset] zhaoyongjie commented on issue #12729: [SIP] Support star schemas

Posted by GitBox <gi...@apache.org>.

zhaoyongjie commented on issue #12729:
URL: https://github.com/apache/superset/issues/12729#issuecomment-766668628


   @wernerdaehn 
   Thank you for your comments. I think stars or snowflake model is OLAP Dataware house concept, for more modern BI prefer to use single-table model, like PBI/Tableau.
   
   What do you think?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org

[GitHub] [superset] villebro commented on issue #12729: [SIP] Support star schemas

Posted by GitBox <gi...@apache.org>.

villebro commented on issue #12729:
URL: https://github.com/apache/superset/issues/12729#issuecomment-766688379






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org