You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@kylin.apache.org by BELLIER Jean-luc <je...@rte-france.com> on 2018/02/13 17:49:54 UTC

usage of Web inteface Kylin an performances

Hello,

I have several questions on Kylin, especially about performances and how to manage them. I would like to understand precisely how it works to see if I can use it in my business context.

I come from the relational database world, so as far as I understand on OLAP, the searches are performed on the values of primary keys in dimensions. These subsets are then joined to get the corresponding lines on the facts table. As the dimensions tables are much smaller than the facts table, the queries run faster


1.       Questions on performances

*         the raw data are stored in Hive, and the models and structures (cubes) are stored in HBase; I presume that the whole .json files are stored, is it right ?

*         Where are the cube results stores (I mean after a build, a refresh or an append action). Is it also in HBase ? I can see in HBase tables like "KYLIN_FF46WDAAGH". Do these kinds of tables contain the cube data ?

*         I noticed that when I build the 'sample_cube', the volume of data was very important compared to the size of the original files. Is there a way to reduce it (I saw a attribute in the $KYLIN_HOME/tomcat/conf/server.xml file, called 'compression' for the connector on port 7070, but I do not know if it is related to the cube size). I tried to change this parameter to 'yes', but I noticed a huge increase of the duration of cube generation. So I am wondering if it is the good method.

*         How is it possible to optimize cube size to keep good performance ?

*         In Hive, putting indexes is not recommended. So how Kylin is ensuring good performance when querying high volumes of data  ? Is it through the 'rowkeys' in the advanced settings when you build the cube ?
Or is the answer elsewhere ?


2.       Questions on cube building

*         By the way, the 'Advanced settings' step is still unclear for me. I tried to build a cube from scratch using the tables provided in the sample project. But I do not know very much what to put in this section.

*         My goal is to define groups of data on YEAR_BEG_DT, QTR_BEG_DT,MONTH_BEG_DT.

*         I do not understand very well why the aggregation group contains so many columns. I tried to remove as many as possible, but when I tried to set up the joins, but some fields were missing so the saving of the cube failed.

*         What shall we put exactly in the 'Rowkeys' section ? I understand that this is used to define data encoding (for speed access ? ).Am I right ?

*         Are the aggregation groups used for speed of the queries. I assume it is the case, because it represents the most commonly used associations of columns for the cube.

Thank you in advance for your help.

Best regards,
Jean-Luc.






"Ce message est destin? exclusivement aux personnes ou entit?s auxquelles il est adress? et peut contenir des informations privil?gi?es ou confidentielles. Si vous avez re?u ce document par erreur, merci de nous l'indiquer par retour, de ne pas le transmettre et de proc?der ? sa destruction.

This message is solely intended for the use of the individual or entity to which it is addressed and may contain information that is privileged or confidential. If you have received this communication by error, please notify us immediately by electronic mail, do not disclose it and delete the original message."

Re: usage of Web inteface Kylin an performances

Posted by Luke Han <lu...@gmail.com>.
There's no build in XMLA/MDX services today, you could leverage 3rd party
solution, please search Kylin Mondrian for example.

Thanks.


Best Regards!
---------------------

Luke Han

On Tue, Feb 20, 2018 at 10:01 PM, Ge Silas <go...@live.cn> wrote:

> Hi Jean-luc,
>
> You should use Insight interface just like what you do in Hive. If the
> query can be answered by pre-calculated cube, cube will answer. Otherwise
> you can configure query pushdown http://kylin.apache.
> org/docs21/tutorial/query_pushdown.html for the queries which cannot be
> answered by cube.
>
> I personally have not tried XMLA and Mondrian setup yet but you may want
> to check more in the Tomcat logs or else.
>
> Thanks,
> Silas
>
> On 20 Feb 2018, at 9:47 PM, BELLIER Jean-luc <jean-luc.bellier@rte-france.
> com> wrote:
>
> Hello Silas,
>
> I did a query using the ‘SQL’ tab defining the cube, and customized it to
> add filters. What I noticed is that the interface displays the name of the
> cube, so I presume the tool uses this cube to get the result, but I do not
> understand very much how.
> Here is an example of query. I used also this query directly in Hive, so
> my question is : does the web interface query the cube (i.e. the structure
> stored in HBase) or the tables of the model (stored in Hive) ?
>
> *SELECT*
> *KYLIN_SALES.TRANS_ID as KYLIN_SALES_TRANS_ID*
> *,KYLIN_SALES.PART_DT AS KYLIN_SALES_PART_DT*
> *,KYLIN_SALES.LEAF_CATEG_ID as KYLIN_SALES_LEAF_CATEG_ID*
> *,KYLIN_SALES.LSTG_SITE_ID as KYLIN_SALES_LSTG_SITE_ID*
> *,KYLIN_CATEGORY_GROUPINGS.META_CATEG_NAME as
> KYLIN_CATEGORY_GROUPINGS_META_CATEG_NAME*
> *,KYLIN_CATEGORY_GROUPINGS.CATEG_LVL2_NAME as
> KYLIN_CATEGORY_GROUPINGS_CATEG_LVL2_NAME*
> *,KYLIN_CATEGORY_GROUPINGS.CATEG_LVL3_NAME as
> KYLIN_CATEGORY_GROUPINGS_CATEG_LVL3_NAME*
> *,KYLIN_SALES.LSTG_FORMAT_NAME as KYLIN_SALES_LSTG_FORMAT_NAME*
> *,KYLIN_SALES.SELLER_ID as KYLIN_SALES_SELLER_ID*
> *,KYLIN_SALES.BUYER_ID as KYLIN_SALES_BUYER_ID*
> *,BUYER_ACCOUNT.ACCOUNT_BUYER_LEVEL as BUYER_ACCOUNT_ACCOUNT_BUYER_LEVEL*
> *,SELLER_ACCOUNT.ACCOUNT_SELLER_LEVEL as
> SELLER_ACCOUNT_ACCOUNT_SELLER_LEVEL*
> *,BUYER_ACCOUNT.ACCOUNT_COUNTRY as BUYER_ACCOUNT_ACCOUNT_COUNTRY*
> *,SELLER_ACCOUNT.ACCOUNT_COUNTRY as SELLER_ACCOUNT_ACCOUNT_COUNTRY*
> *,BUYER_COUNTRY.NAME <http://BUYER_COUNTRY.NAME> as BUYER_COUNTRY_NAME*
> *,SELLER_COUNTRY.NAME <http://SELLER_COUNTRY.NAME> as SELLER_COUNTRY_NAME*
> *,KYLIN_SALES.OPS_USER_ID as KYLIN_SALES_OPS_USER_ID*
> *,KYLIN_SALES.OPS_REGION as KYLIN_SALES_OPS_REGION*
> *,KYLIN_SALES.PRICE as KYLIN_SALES_PRICE*
> *FROM KYLIN_SALES as KYLIN_SALES*
> *INNER JOIN KYLIN_CATEGORY_GROUPINGS as KYLIN_CATEGORY_GROUPINGS*
> *ON KYLIN_SALES.LEAF_CATEG_ID = KYLIN_CATEGORY_GROUPINGS.LEAF_CATEG_ID AND
> KYLIN_SALES.LSTG_SITE_ID = KYLIN_CATEGORY_GROUPINGS.SITE_ID*
> *INNER JOIN KYLIN_ACCOUNT as BUYER_ACCOUNT*
> *ON KYLIN_SALES.BUYER_ID = BUYER_ACCOUNT.ACCOUNT_ID*
> *INNER JOIN KYLIN_ACCOUNT as SELLER_ACCOUNT*
> *ON KYLIN_SALES.SELLER_ID = SELLER_ACCOUNT.ACCOUNT_ID*
> *INNER JOIN KYLIN_COUNTRY as BUYER_COUNTRY*
> *ON BUYER_ACCOUNT.ACCOUNT_COUNTRY = BUYER_COUNTRY.COUNTRY*
> *INNER JOIN KYLIN_COUNTRY as SELLER_COUNTRY*
> *ON SELLER_ACCOUNT.ACCOUNT_COUNTRY = SELLER_COUNTRY.COUNTRY*
> *WHERE BUYER_COUNTRY.COUNTRY in ('FR','BE','DE')*
> *AND KYLIN_CATEGORY_GROUPINGS.META_CATEG_NAME = 'Computers/Tablets &
> Networking'*
>
>
> I wondered also how to query the dimensions and the measures defined in
> the cube at step ‘Advanced settings’ : can it be done on the Kylin
> interface or through another driver. You mail suggested to used XMondrian.
> I tried it on my personal PC, on a VM.  This WM was not installed by me but
> by a teacher for a course about Big Data Programming. So I have not
> modified this part. I have configuration problems at step 2 :
> ·         At step 1, I got a Mondrian.war file, which produced a
> ‘xmondrian’ folder under ‘tomcat’ folder, which is located under
> $KYLIN_HOME folder
> ·         I started tomcat from my tomcat folder (using the ‘tomcat’)
> folder of my VM. The system tells me that tomcat service has started
> ·         I try then to reach http://localhost:8080/xmondrian/xmla. But I
> receive a message : “unable to connect”. How should I parametrize to make
> it work ?
> ·         Same question about the parametrization if I want to use the
> XMLA driver in Excel on my computer, and use my Kylin cubes.
>
> Thank you in advance for your help. Have a good day.
>
> Best regards,
> Jean-Luc
>
>
> *De :* Ge Silas [mailto:gosoy@live.cn <go...@live.cn>]
> *Envoyé :* mardi 20 février 2018 14:04
> *À :* user@kylin.apache.org
> *Objet :* Re: usage of Web inteface Kylin an performances
>
> Hi Jean-Luc,
>
> There should not be any need to specify the cube name when sending the
> query through Insight interface. Can you provide more information on how
> you did your query?
>
> For MDX, you may want to check http://dekarlab.de/wp/?p=363
>
> Thanks,
> Silas
>
> On 19 Feb 2018, at 7:22 PM, BELLIER Jean-luc <jean-luc.bellier@rte-france.
> com> wrote:
>
> Hello Shaofeng,
>
> Thank you for this response.
>
> I would like to clarify some things about the ‘Insight’ part of the Kylin
> Web interface.
> It indicates the name of the cube when a query is launched, but my
> assumption is that the Hive tables are directly requested. So I do not know
> where the cube enters into account.
> Does this query tool support MDX ? I am not sure, so how can we query the
> cube elements ?
> If we can, can you send me a few examples of MDX queries ?
> My assumption is that we need an external tool such as XMLA or other BI
> tools.
>
> Thank you in advance for your help.
>
> Have a good day.
> Best regards,
> Jean-Luc.
>
>
> *De :* ShaoFeng Shi [mailto:shaofengshi@apache.org
> <sh...@apache.org>]
> *Envoyé :* dimanche 18 février 2018 03:23
> *À :* user <us...@kylin.apache.org>
> *Objet :* Re: usage of Web inteface Kylin an performances
>
> Hi Jean-luc,
>
> Most of the Kylin developers are in the new year holiday, so there might
> be some delay. Here are some comments from my side:
>
> 1.  I presume that the whole .json files are stored, is it right ?
> yes
> 2. Do these kinds of tables contain the cube data ?
> yes; cube are stored in HBase with "KYLINL_" as prefix
> 3. So I am wondering if it is the good method
> the "compression" in Tomcat/conf/server.xml has nothing with cube build.
> To enable compression for cube, you need to configure that in your Hadoop
> configurations like mapred-site.xml, hbase-site.xml or
> kylin/conf/kylin_job_conf.xml.
> 4. How is it possible to optimize cube size to keep good performance ?
> https://kylin.apache.org/docs21/howto/howto_optimize_cubes.html
> 5.  Is it through the ‘rowkeys’ in the advanced settings when you build
> the cube ?
> yes, exactly; putting the most used filtering column to the heading
> position on the rowkey can get better performance.
> 6. What shall we put exactly in the ‘Rowkeys’ section ?
> All dimensions (excluding 'derived' dimensions) need be on rowkey; If you
> see too many columns in the agg. group, remove some dimensions from your
> cube.
> 7.  Are the aggregation groups used for speed of the queries.
> The agg. group is used to optimize the dimension combinations. For a N
> dimension cube, by default it will have 2^N combinations (we called
> cuboid). If you can divide N dimensions to several groups, the combination
> numbers can be greatly reduced, so the cube build will be much easier and
> taking much less space. How to define the agg. group? You can do that with
> your business query patterns.
>
>
>
> 2018-02-14 1:49 GMT+08:00 BELLIER Jean-luc <jean-luc.bellier@rte-france.
> com>:
>
> Hello,
>
> I have several questions on Kylin, especially about performances and how
> to manage them. I would like to understand precisely how it works to see if
> I can use it in my business context.
>
> I come from the relational database world, so as far as I understand on
> OLAP, the searches are performed on the values of primary keys in
> dimensions. These subsets are then joined to get the corresponding lines on
> the facts table. As the dimensions tables are much smaller than the facts
> table, the queries run faster
>
>
> *1.*       *Questions on performances*
>
> ·         the raw data are stored in Hive, and the models and structures
> (cubes) are stored in HBase; I presume that the whole .json files are
> stored, is it right ?
>
> ·         Where are the cube results stores (I mean after a build, a
> refresh or an append action). Is it also in HBase ? I can see in HBase
> tables like "KYLIN_FF46WDAAGH". Do these kinds of tables contain the cube
> data ?
>
> ·         I noticed that when I build the ‘sample_cube’, the volume of
> data was very important compared to the size of the original files. Is
> there a way to reduce it (I saw a attribute in the $KYLIN_HOME/tomcat/conf/server.xml
> file, called ‘compression’ for the connector on port 7070, but I do not
> know if it is related to the cube size). I tried to change this parameter
> to ‘yes’, but I noticed a huge increase of the duration of cube generation.
> So I am wondering if it is the good method.
>
> ·         How is it possible to optimize cube size to keep good
> performance ?
>
> ·         In Hive, putting indexes is not recommended. So how Kylin is
> ensuring good performance when querying high volumes of data  ? Is it
> through the ‘rowkeys’ in the advanced settings when you build the cube ?
> Or is the answer elsewhere ?
>
>
> *2.*       *Questions on cube building*
>
> ·         By the way, the ‘Advanced settings’ step is still unclear for
> me. I tried to build a cube from scratch using the tables provided in the
> sample project. But I do not know very much what to put in this section.
>
> ·         My goal is to define groups of data on YEAR_BEG_DT,
> QTR_BEG_DT,MONTH_BEG_DT.
>
> ·         I do not understand very well why the aggregation group
> contains so many columns. I tried to remove as many as possible, but when I
> tried to set up the joins, but some fields were missing so the saving of
> the cube failed.
>
> ·         What shall we put exactly in the ‘Rowkeys’ section ? I
> understand that this is used to define data encoding (for speed access ?
> ).Am I right ?
>
> ·         Are the aggregation groups used for speed of the queries. I
> assume it is the case, because it represents the most commonly used
> associations of columns for the cube.
>
> Thank you in advance for your help.
>
> Best regards,
> Jean-Luc.
>
>
>
>
>
>
> "Ce message est destiné exclusivement aux personnes ou entités auxquelles
> il est adressé et peut contenir des informations privilégiées ou
> confidentielles. Si vous avez reçu ce document par erreur, merci de nous
> l'indiquer par retour, de ne pas le transmettre et de procéder à sa
> destruction.
>
> This message is solely intended for the use of the individual or entity to
> which it is addressed and may contain information that is privileged or
> confidential. If you have received this communication by error, please
> notify us immediately by electronic mail, do not disclose it and delete the
> original message."
>
>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>
>
> "Ce message est destiné exclusivement aux personnes ou entités auxquelles
> il est adressé et peut contenir des informations privilégiées ou
> confidentielles. Si vous avez reçu ce document par erreur, merci de nous
> l'indiquer par retour, de ne pas le transmettre et de procéder à sa
> destruction.
>
> This message is solely intended for the use of the individual or entity to
> which it is addressed and may contain information that is privileged or
> confidential. If you have received this communication by error, please
> notify us immediately by electronic mail, do not disclose it and delete the
> original message."
>
>
>
> "Ce message est destiné exclusivement aux personnes ou entités auxquelles
> il est adressé et peut contenir des informations privilégiées ou
> confidentielles. Si vous avez reçu ce document par erreur, merci de nous
> l'indiquer par retour, de ne pas le transmettre et de procéder à sa
> destruction.
>
> This message is solely intended for the use of the individual or entity to
> which it is addressed and may contain information that is privileged or
> confidential. If you have received this communication by error, please
> notify us immediately by electronic mail, do not disclose it and delete the
> original message."
>
>

Re: usage of Web inteface Kylin an performances

Posted by Ge Silas <go...@live.cn>.
Hi Jean-luc,

You should use Insight interface just like what you do in Hive. If the query can be answered by pre-calculated cube, cube will answer. Otherwise you can configure query pushdown http://kylin.apache.org/docs21/tutorial/query_pushdown.html for the queries which cannot be answered by cube.

I personally have not tried XMLA and Mondrian setup yet but you may want to check more in the Tomcat logs or else.

Thanks,
Silas

On 20 Feb 2018, at 9:47 PM, BELLIER Jean-luc <je...@rte-france.com>> wrote:

Hello Silas,

I did a query using the ‘SQL’ tab defining the cube, and customized it to add filters. What I noticed is that the interface displays the name of the cube, so I presume the tool uses this cube to get the result, but I do not understand very much how.
Here is an example of query. I used also this query directly in Hive, so my question is : does the web interface query the cube (i.e. the structure stored in HBase) or the tables of the model (stored in Hive) ?

SELECT
KYLIN_SALES.TRANS_ID as KYLIN_SALES_TRANS_ID
,KYLIN_SALES.PART_DT AS KYLIN_SALES_PART_DT
,KYLIN_SALES.LEAF_CATEG_ID as KYLIN_SALES_LEAF_CATEG_ID
,KYLIN_SALES.LSTG_SITE_ID as KYLIN_SALES_LSTG_SITE_ID
,KYLIN_CATEGORY_GROUPINGS.META_CATEG_NAME as KYLIN_CATEGORY_GROUPINGS_META_CATEG_NAME
,KYLIN_CATEGORY_GROUPINGS.CATEG_LVL2_NAME as KYLIN_CATEGORY_GROUPINGS_CATEG_LVL2_NAME
,KYLIN_CATEGORY_GROUPINGS.CATEG_LVL3_NAME as KYLIN_CATEGORY_GROUPINGS_CATEG_LVL3_NAME
,KYLIN_SALES.LSTG_FORMAT_NAME as KYLIN_SALES_LSTG_FORMAT_NAME
,KYLIN_SALES.SELLER_ID as KYLIN_SALES_SELLER_ID
,KYLIN_SALES.BUYER_ID as KYLIN_SALES_BUYER_ID
,BUYER_ACCOUNT.ACCOUNT_BUYER_LEVEL as BUYER_ACCOUNT_ACCOUNT_BUYER_LEVEL
,SELLER_ACCOUNT.ACCOUNT_SELLER_LEVEL as SELLER_ACCOUNT_ACCOUNT_SELLER_LEVEL
,BUYER_ACCOUNT.ACCOUNT_COUNTRY as BUYER_ACCOUNT_ACCOUNT_COUNTRY
,SELLER_ACCOUNT.ACCOUNT_COUNTRY as SELLER_ACCOUNT_ACCOUNT_COUNTRY
,BUYER_COUNTRY.NAME as BUYER_COUNTRY_NAME
,SELLER_COUNTRY.NAME as SELLER_COUNTRY_NAME
,KYLIN_SALES.OPS_USER_ID as KYLIN_SALES_OPS_USER_ID
,KYLIN_SALES.OPS_REGION as KYLIN_SALES_OPS_REGION
,KYLIN_SALES.PRICE as KYLIN_SALES_PRICE
FROM KYLIN_SALES as KYLIN_SALES
INNER JOIN KYLIN_CATEGORY_GROUPINGS as KYLIN_CATEGORY_GROUPINGS
ON KYLIN_SALES.LEAF_CATEG_ID = KYLIN_CATEGORY_GROUPINGS.LEAF_CATEG_ID AND KYLIN_SALES.LSTG_SITE_ID = KYLIN_CATEGORY_GROUPINGS.SITE_ID
INNER JOIN KYLIN_ACCOUNT as BUYER_ACCOUNT
ON KYLIN_SALES.BUYER_ID = BUYER_ACCOUNT.ACCOUNT_ID
INNER JOIN KYLIN_ACCOUNT as SELLER_ACCOUNT
ON KYLIN_SALES.SELLER_ID = SELLER_ACCOUNT.ACCOUNT_ID
INNER JOIN KYLIN_COUNTRY as BUYER_COUNTRY
ON BUYER_ACCOUNT.ACCOUNT_COUNTRY = BUYER_COUNTRY.COUNTRY
INNER JOIN KYLIN_COUNTRY as SELLER_COUNTRY
ON SELLER_ACCOUNT.ACCOUNT_COUNTRY = SELLER_COUNTRY.COUNTRY
WHERE BUYER_COUNTRY.COUNTRY in ('FR','BE','DE')
AND KYLIN_CATEGORY_GROUPINGS.META_CATEG_NAME = 'Computers/Tablets & Networking'


I wondered also how to query the dimensions and the measures defined in the cube at step ‘Advanced settings’ : can it be done on the Kylin interface or through another driver. You mail suggested to used XMondrian. I tried it on my personal PC, on a VM.  This WM was not installed by me but by a teacher for a course about Big Data Programming. So I have not modified this part. I have configuration problems at step 2 :
•         At step 1, I got a Mondrian.war file, which produced a ‘xmondrian’ folder under ‘tomcat’ folder, which is located under $KYLIN_HOME folder
•         I started tomcat from my tomcat folder (using the ‘tomcat’) folder of my VM. The system tells me that tomcat service has started
•         I try then to reach http://localhost:8080/xmondrian/xmla. But I receive a message : “unable to connect”. How should I parametrize to make it work ?
•         Same question about the parametrization if I want to use the XMLA driver in Excel on my computer, and use my Kylin cubes.

Thank you in advance for your help. Have a good day.

Best regards,
Jean-Luc


De : Ge Silas [mailto:gosoy@live.cn]
Envoyé : mardi 20 février 2018 14:04
À : user@kylin.apache.org<ma...@kylin.apache.org>
Objet : Re: usage of Web inteface Kylin an performances

Hi Jean-Luc,

There should not be any need to specify the cube name when sending the query through Insight interface. Can you provide more information on how you did your query?

For MDX, you may want to check http://dekarlab.de/wp/?p=363

Thanks,
Silas

On 19 Feb 2018, at 7:22 PM, BELLIER Jean-luc <je...@rte-france.com>> wrote:

Hello Shaofeng,

Thank you for this response.

I would like to clarify some things about the ‘Insight’ part of the Kylin Web interface.
It indicates the name of the cube when a query is launched, but my assumption is that the Hive tables are directly requested. So I do not know where the cube enters into account.
Does this query tool support MDX ? I am not sure, so how can we query the cube elements ?
If we can, can you send me a few examples of MDX queries ?
My assumption is that we need an external tool such as XMLA or other BI tools.

Thank you in advance for your help.

Have a good day.
Best regards,
Jean-Luc.


De : ShaoFeng Shi [mailto:shaofengshi@apache.org]
Envoyé : dimanche 18 février 2018 03:23
À : user <us...@kylin.apache.org>>
Objet : Re: usage of Web inteface Kylin an performances

Hi Jean-luc,

Most of the Kylin developers are in the new year holiday, so there might be some delay. Here are some comments from my side:

1.  I presume that the whole .json files are stored, is it right ?
yes
2. Do these kinds of tables contain the cube data ?
yes; cube are stored in HBase with "KYLINL_" as prefix
3. So I am wondering if it is the good method
the "compression" in Tomcat/conf/server.xml has nothing with cube build. To enable compression for cube, you need to configure that in your Hadoop configurations like mapred-site.xml, hbase-site.xml or kylin/conf/kylin_job_conf.xml.
4. How is it possible to optimize cube size to keep good performance ?
https://kylin.apache.org/docs21/howto/howto_optimize_cubes.html
5.  Is it through the ‘rowkeys’ in the advanced settings when you build the cube ?
yes, exactly; putting the most used filtering column to the heading position on the rowkey can get better performance.
6. What shall we put exactly in the ‘Rowkeys’ section ?
All dimensions (excluding 'derived' dimensions) need be on rowkey; If you see too many columns in the agg. group, remove some dimensions from your cube.
7.  Are the aggregation groups used for speed of the queries.
The agg. group is used to optimize the dimension combinations. For a N dimension cube, by default it will have 2^N combinations (we called cuboid). If you can divide N dimensions to several groups, the combination numbers can be greatly reduced, so the cube build will be much easier and taking much less space. How to define the agg. group? You can do that with your business query patterns.



2018-02-14 1:49 GMT+08:00 BELLIER Jean-luc <je...@rte-france.com>>:
Hello,

I have several questions on Kylin, especially about performances and how to manage them. I would like to understand precisely how it works to see if I can use it in my business context.

I come from the relational database world, so as far as I understand on OLAP, the searches are performed on the values of primary keys in dimensions. These subsets are then joined to get the corresponding lines on the facts table. As the dimensions tables are much smaller than the facts table, the queries run faster


1.       Questions on performances

•         the raw data are stored in Hive, and the models and structures (cubes) are stored in HBase; I presume that the whole .json files are stored, is it right ?

•         Where are the cube results stores (I mean after a build, a refresh or an append action). Is it also in HBase ? I can see in HBase tables like "KYLIN_FF46WDAAGH". Do these kinds of tables contain the cube data ?

•         I noticed that when I build the ‘sample_cube’, the volume of data was very important compared to the size of the original files. Is there a way to reduce it (I saw a attribute in the $KYLIN_HOME/tomcat/conf/server.xml file, called ‘compression’ for the connector on port 7070, but I do not know if it is related to the cube size). I tried to change this parameter to ‘yes’, but I noticed a huge increase of the duration of cube generation. So I am wondering if it is the good method.

•         How is it possible to optimize cube size to keep good performance ?

•         In Hive, putting indexes is not recommended. So how Kylin is ensuring good performance when querying high volumes of data  ? Is it through the ‘rowkeys’ in the advanced settings when you build the cube ?

Or is the answer elsewhere ?


2.       Questions on cube building

•         By the way, the ‘Advanced settings’ step is still unclear for me. I tried to build a cube from scratch using the tables provided in the sample project. But I do not know very much what to put in this section.

•         My goal is to define groups of data on YEAR_BEG_DT, QTR_BEG_DT,MONTH_BEG_DT.

•         I do not understand very well why the aggregation group contains so many columns. I tried to remove as many as possible, but when I tried to set up the joins, but some fields were missing so the saving of the cube failed.

•         What shall we put exactly in the ‘Rowkeys’ section ? I understand that this is used to define data encoding (for speed access ? ).Am I right ?

•         Are the aggregation groups used for speed of the queries. I assume it is the case, because it represents the most commonly used associations of columns for the cube.


Thank you in advance for your help.

Best regards,
Jean-Luc.







"Ce message est destiné exclusivement aux personnes ou entités auxquelles il est adressé et peut contenir des informations privilégiées ou confidentielles. Si vous avez reçu ce document par erreur, merci de nous l'indiquer par retour, de ne pas le transmettre et de procéder à sa destruction.

This message is solely intended for the use of the individual or entity to which it is addressed and may contain information that is privileged or confidential. If you have received this communication by error, please notify us immediately by electronic mail, do not disclose it and delete the original message."



--
Best regards,

Shaofeng Shi 史少锋



"Ce message est destiné exclusivement aux personnes ou entités auxquelles il est adressé et peut contenir des informations privilégiées ou confidentielles. Si vous avez reçu ce document par erreur, merci de nous l'indiquer par retour, de ne pas le transmettre et de procéder à sa destruction.

This message is solely intended for the use of the individual or entity to which it is addressed and may contain information that is privileged or confidential. If you have received this communication by error, please notify us immediately by electronic mail, do not disclose it and delete the original message."



"Ce message est destiné exclusivement aux personnes ou entités auxquelles il est adressé et peut contenir des informations privilégiées ou confidentielles. Si vous avez reçu ce document par erreur, merci de nous l'indiquer par retour, de ne pas le transmettre et de procéder à sa destruction.

This message is solely intended for the use of the individual or entity to which it is addressed and may contain information that is privileged or confidential. If you have received this communication by error, please notify us immediately by electronic mail, do not disclose it and delete the original message."


RE: usage of Web inteface Kylin an performances

Posted by BELLIER Jean-luc <je...@rte-france.com>.
Hello Silas,

I did a query using the ‘SQL’ tab defining the cube, and customized it to add filters. What I noticed is that the interface displays the name of the cube, so I presume the tool uses this cube to get the result, but I do not understand very much how.
Here is an example of query. I used also this query directly in Hive, so my question is : does the web interface query the cube (i.e. the structure stored in HBase) or the tables of the model (stored in Hive) ?

SELECT
KYLIN_SALES.TRANS_ID as KYLIN_SALES_TRANS_ID
,KYLIN_SALES.PART_DT AS KYLIN_SALES_PART_DT
,KYLIN_SALES.LEAF_CATEG_ID as KYLIN_SALES_LEAF_CATEG_ID
,KYLIN_SALES.LSTG_SITE_ID as KYLIN_SALES_LSTG_SITE_ID
,KYLIN_CATEGORY_GROUPINGS.META_CATEG_NAME as KYLIN_CATEGORY_GROUPINGS_META_CATEG_NAME
,KYLIN_CATEGORY_GROUPINGS.CATEG_LVL2_NAME as KYLIN_CATEGORY_GROUPINGS_CATEG_LVL2_NAME
,KYLIN_CATEGORY_GROUPINGS.CATEG_LVL3_NAME as KYLIN_CATEGORY_GROUPINGS_CATEG_LVL3_NAME
,KYLIN_SALES.LSTG_FORMAT_NAME as KYLIN_SALES_LSTG_FORMAT_NAME
,KYLIN_SALES.SELLER_ID as KYLIN_SALES_SELLER_ID
,KYLIN_SALES.BUYER_ID as KYLIN_SALES_BUYER_ID
,BUYER_ACCOUNT.ACCOUNT_BUYER_LEVEL as BUYER_ACCOUNT_ACCOUNT_BUYER_LEVEL
,SELLER_ACCOUNT.ACCOUNT_SELLER_LEVEL as SELLER_ACCOUNT_ACCOUNT_SELLER_LEVEL
,BUYER_ACCOUNT.ACCOUNT_COUNTRY as BUYER_ACCOUNT_ACCOUNT_COUNTRY
,SELLER_ACCOUNT.ACCOUNT_COUNTRY as SELLER_ACCOUNT_ACCOUNT_COUNTRY
,BUYER_COUNTRY.NAME as BUYER_COUNTRY_NAME
,SELLER_COUNTRY.NAME as SELLER_COUNTRY_NAME
,KYLIN_SALES.OPS_USER_ID as KYLIN_SALES_OPS_USER_ID
,KYLIN_SALES.OPS_REGION as KYLIN_SALES_OPS_REGION
,KYLIN_SALES.PRICE as KYLIN_SALES_PRICE
FROM KYLIN_SALES as KYLIN_SALES
INNER JOIN KYLIN_CATEGORY_GROUPINGS as KYLIN_CATEGORY_GROUPINGS
ON KYLIN_SALES.LEAF_CATEG_ID = KYLIN_CATEGORY_GROUPINGS.LEAF_CATEG_ID AND KYLIN_SALES.LSTG_SITE_ID = KYLIN_CATEGORY_GROUPINGS.SITE_ID
INNER JOIN KYLIN_ACCOUNT as BUYER_ACCOUNT
ON KYLIN_SALES.BUYER_ID = BUYER_ACCOUNT.ACCOUNT_ID
INNER JOIN KYLIN_ACCOUNT as SELLER_ACCOUNT
ON KYLIN_SALES.SELLER_ID = SELLER_ACCOUNT.ACCOUNT_ID
INNER JOIN KYLIN_COUNTRY as BUYER_COUNTRY
ON BUYER_ACCOUNT.ACCOUNT_COUNTRY = BUYER_COUNTRY.COUNTRY
INNER JOIN KYLIN_COUNTRY as SELLER_COUNTRY
ON SELLER_ACCOUNT.ACCOUNT_COUNTRY = SELLER_COUNTRY.COUNTRY
WHERE BUYER_COUNTRY.COUNTRY in ('FR','BE','DE')
AND KYLIN_CATEGORY_GROUPINGS.META_CATEG_NAME = 'Computers/Tablets & Networking'


I wondered also how to query the dimensions and the measures defined in the cube at step ‘Advanced settings’ : can it be done on the Kylin interface or through another driver. You mail suggested to used XMondrian. I tried it on my personal PC, on a VM.  This WM was not installed by me but by a teacher for a course about Big Data Programming. So I have not modified this part. I have configuration problems at step 2 :

·         At step 1, I got a Mondrian.war file, which produced a ‘xmondrian’ folder under ‘tomcat’ folder, which is located under $KYLIN_HOME folder

·         I started tomcat from my tomcat folder (using the ‘tomcat’) folder of my VM. The system tells me that tomcat service has started

·         I try then to reach http://localhost:8080/xmondrian/xmla. But I receive a message : “unable to connect”. How should I parametrize to make it work ?

·         Same question about the parametrization if I want to use the XMLA driver in Excel on my computer, and use my Kylin cubes.

Thank you in advance for your help. Have a good day.

Best regards,
Jean-Luc


De : Ge Silas [mailto:gosoy@live.cn]
Envoyé : mardi 20 février 2018 14:04
À : user@kylin.apache.org
Objet : Re: usage of Web inteface Kylin an performances

Hi Jean-Luc,

There should not be any need to specify the cube name when sending the query through Insight interface. Can you provide more information on how you did your query?

For MDX, you may want to check http://dekarlab.de/wp/?p=363

Thanks,
Silas

On 19 Feb 2018, at 7:22 PM, BELLIER Jean-luc <je...@rte-france.com>> wrote:

Hello Shaofeng,

Thank you for this response.

I would like to clarify some things about the ‘Insight’ part of the Kylin Web interface.
It indicates the name of the cube when a query is launched, but my assumption is that the Hive tables are directly requested. So I do not know where the cube enters into account.
Does this query tool support MDX ? I am not sure, so how can we query the cube elements ?
If we can, can you send me a few examples of MDX queries ?
My assumption is that we need an external tool such as XMLA or other BI tools.

Thank you in advance for your help.

Have a good day.
Best regards,
Jean-Luc.


De : ShaoFeng Shi [mailto:shaofengshi@apache.org]
Envoyé : dimanche 18 février 2018 03:23
À : user <us...@kylin.apache.org>>
Objet : Re: usage of Web inteface Kylin an performances

Hi Jean-luc,

Most of the Kylin developers are in the new year holiday, so there might be some delay. Here are some comments from my side:

1.  I presume that the whole .json files are stored, is it right ?
yes
2. Do these kinds of tables contain the cube data ?
yes; cube are stored in HBase with "KYLINL_" as prefix
3. So I am wondering if it is the good method
the "compression" in Tomcat/conf/server.xml has nothing with cube build. To enable compression for cube, you need to configure that in your Hadoop configurations like mapred-site.xml, hbase-site.xml or kylin/conf/kylin_job_conf.xml.
4. How is it possible to optimize cube size to keep good performance ?
https://kylin.apache.org/docs21/howto/howto_optimize_cubes.html
5.  Is it through the ‘rowkeys’ in the advanced settings when you build the cube ?
yes, exactly; putting the most used filtering column to the heading position on the rowkey can get better performance.
6. What shall we put exactly in the ‘Rowkeys’ section ?
All dimensions (excluding 'derived' dimensions) need be on rowkey; If you see too many columns in the agg. group, remove some dimensions from your cube.
7.  Are the aggregation groups used for speed of the queries.
The agg. group is used to optimize the dimension combinations. For a N dimension cube, by default it will have 2^N combinations (we called cuboid). If you can divide N dimensions to several groups, the combination numbers can be greatly reduced, so the cube build will be much easier and taking much less space. How to define the agg. group? You can do that with your business query patterns.



2018-02-14 1:49 GMT+08:00 BELLIER Jean-luc <je...@rte-france.com>>:
Hello,

I have several questions on Kylin, especially about performances and how to manage them. I would like to understand precisely how it works to see if I can use it in my business context.

I come from the relational database world, so as far as I understand on OLAP, the searches are performed on the values of primary keys in dimensions. These subsets are then joined to get the corresponding lines on the facts table. As the dimensions tables are much smaller than the facts table, the queries run faster


1.       Questions on performances

•         the raw data are stored in Hive, and the models and structures (cubes) are stored in HBase; I presume that the whole .json files are stored, is it right ?

•         Where are the cube results stores (I mean after a build, a refresh or an append action). Is it also in HBase ? I can see in HBase tables like "KYLIN_FF46WDAAGH". Do these kinds of tables contain the cube data ?

•         I noticed that when I build the ‘sample_cube’, the volume of data was very important compared to the size of the original files. Is there a way to reduce it (I saw a attribute in the $KYLIN_HOME/tomcat/conf/server.xml file, called ‘compression’ for the connector on port 7070, but I do not know if it is related to the cube size). I tried to change this parameter to ‘yes’, but I noticed a huge increase of the duration of cube generation. So I am wondering if it is the good method.

•         How is it possible to optimize cube size to keep good performance ?

•         In Hive, putting indexes is not recommended. So how Kylin is ensuring good performance when querying high volumes of data  ? Is it through the ‘rowkeys’ in the advanced settings when you build the cube ?
Or is the answer elsewhere ?


2.       Questions on cube building

•         By the way, the ‘Advanced settings’ step is still unclear for me. I tried to build a cube from scratch using the tables provided in the sample project. But I do not know very much what to put in this section.

•         My goal is to define groups of data on YEAR_BEG_DT, QTR_BEG_DT,MONTH_BEG_DT.

•         I do not understand very well why the aggregation group contains so many columns. I tried to remove as many as possible, but when I tried to set up the joins, but some fields were missing so the saving of the cube failed.

•         What shall we put exactly in the ‘Rowkeys’ section ? I understand that this is used to define data encoding (for speed access ? ).Am I right ?

•         Are the aggregation groups used for speed of the queries. I assume it is the case, because it represents the most commonly used associations of columns for the cube.

Thank you in advance for your help.

Best regards,
Jean-Luc.






"Ce message est destiné exclusivement aux personnes ou entités auxquelles il est adressé et peut contenir des informations privilégiées ou confidentielles. Si vous avez reçu ce document par erreur, merci de nous l'indiquer par retour, de ne pas le transmettre et de procéder à sa destruction.

This message is solely intended for the use of the individual or entity to which it is addressed and may contain information that is privileged or confidential. If you have received this communication by error, please notify us immediately by electronic mail, do not disclose it and delete the original message."



--
Best regards,

Shaofeng Shi 史少锋



"Ce message est destiné exclusivement aux personnes ou entités auxquelles il est adressé et peut contenir des informations privilégiées ou confidentielles. Si vous avez reçu ce document par erreur, merci de nous l'indiquer par retour, de ne pas le transmettre et de procéder à sa destruction.

This message is solely intended for the use of the individual or entity to which it is addressed and may contain information that is privileged or confidential. If you have received this communication by error, please notify us immediately by electronic mail, do not disclose it and delete the original message."



"Ce message est destiné exclusivement aux personnes ou entités auxquelles il est adressé et peut contenir des informations privilégiées ou confidentielles. Si vous avez reçu ce document par erreur, merci de nous l'indiquer par retour, de ne pas le transmettre et de procéder à sa destruction.

This message is solely intended for the use of the individual or entity to which it is addressed and may contain information that is privileged or confidential. If you have received this communication by error, please notify us immediately by electronic mail, do not disclose it and delete the original message."

Re: usage of Web inteface Kylin an performances

Posted by Ge Silas <go...@live.cn>.
Hi Jean-Luc,

There should not be any need to specify the cube name when sending the query through Insight interface. Can you provide more information on how you did your query?

For MDX, you may want to check http://dekarlab.de/wp/?p=363

Thanks,
Silas

On 19 Feb 2018, at 7:22 PM, BELLIER Jean-luc <je...@rte-france.com>> wrote:

Hello Shaofeng,

Thank you for this response.

I would like to clarify some things about the ‘Insight’ part of the Kylin Web interface.
It indicates the name of the cube when a query is launched, but my assumption is that the Hive tables are directly requested. So I do not know where the cube enters into account.
Does this query tool support MDX ? I am not sure, so how can we query the cube elements ?
If we can, can you send me a few examples of MDX queries ?
My assumption is that we need an external tool such as XMLA or other BI tools.

Thank you in advance for your help.

Have a good day.
Best regards,
Jean-Luc.


De : ShaoFeng Shi [mailto:shaofengshi@apache.org]
Envoyé : dimanche 18 février 2018 03:23
À : user <us...@kylin.apache.org>>
Objet : Re: usage of Web inteface Kylin an performances

Hi Jean-luc,

Most of the Kylin developers are in the new year holiday, so there might be some delay. Here are some comments from my side:

1.  I presume that the whole .json files are stored, is it right ?
yes
2. Do these kinds of tables contain the cube data ?
yes; cube are stored in HBase with "KYLINL_" as prefix
3. So I am wondering if it is the good method
the "compression" in Tomcat/conf/server.xml has nothing with cube build. To enable compression for cube, you need to configure that in your Hadoop configurations like mapred-site.xml, hbase-site.xml or kylin/conf/kylin_job_conf.xml.
4. How is it possible to optimize cube size to keep good performance ?
https://kylin.apache.org/docs21/howto/howto_optimize_cubes.html
5.  Is it through the ‘rowkeys’ in the advanced settings when you build the cube ?
yes, exactly; putting the most used filtering column to the heading position on the rowkey can get better performance.
6. What shall we put exactly in the ‘Rowkeys’ section ?
All dimensions (excluding 'derived' dimensions) need be on rowkey; If you see too many columns in the agg. group, remove some dimensions from your cube.
7.  Are the aggregation groups used for speed of the queries.
The agg. group is used to optimize the dimension combinations. For a N dimension cube, by default it will have 2^N combinations (we called cuboid). If you can divide N dimensions to several groups, the combination numbers can be greatly reduced, so the cube build will be much easier and taking much less space. How to define the agg. group? You can do that with your business query patterns.



2018-02-14 1:49 GMT+08:00 BELLIER Jean-luc <je...@rte-france.com>>:
Hello,

I have several questions on Kylin, especially about performances and how to manage them. I would like to understand precisely how it works to see if I can use it in my business context.

I come from the relational database world, so as far as I understand on OLAP, the searches are performed on the values of primary keys in dimensions. These subsets are then joined to get the corresponding lines on the facts table. As the dimensions tables are much smaller than the facts table, the queries run faster


1.       Questions on performances

•         the raw data are stored in Hive, and the models and structures (cubes) are stored in HBase; I presume that the whole .json files are stored, is it right ?

•         Where are the cube results stores (I mean after a build, a refresh or an append action). Is it also in HBase ? I can see in HBase tables like "KYLIN_FF46WDAAGH". Do these kinds of tables contain the cube data ?

•         I noticed that when I build the ‘sample_cube’, the volume of data was very important compared to the size of the original files. Is there a way to reduce it (I saw a attribute in the $KYLIN_HOME/tomcat/conf/server.xml file, called ‘compression’ for the connector on port 7070, but I do not know if it is related to the cube size). I tried to change this parameter to ‘yes’, but I noticed a huge increase of the duration of cube generation. So I am wondering if it is the good method.

•         How is it possible to optimize cube size to keep good performance ?

•         In Hive, putting indexes is not recommended. So how Kylin is ensuring good performance when querying high volumes of data  ? Is it through the ‘rowkeys’ in the advanced settings when you build the cube ?

Or is the answer elsewhere ?


2.       Questions on cube building

•         By the way, the ‘Advanced settings’ step is still unclear for me. I tried to build a cube from scratch using the tables provided in the sample project. But I do not know very much what to put in this section.

•         My goal is to define groups of data on YEAR_BEG_DT, QTR_BEG_DT,MONTH_BEG_DT.

•         I do not understand very well why the aggregation group contains so many columns. I tried to remove as many as possible, but when I tried to set up the joins, but some fields were missing so the saving of the cube failed.

•         What shall we put exactly in the ‘Rowkeys’ section ? I understand that this is used to define data encoding (for speed access ? ).Am I right ?

•         Are the aggregation groups used for speed of the queries. I assume it is the case, because it represents the most commonly used associations of columns for the cube.


Thank you in advance for your help.

Best regards,
Jean-Luc.







"Ce message est destiné exclusivement aux personnes ou entités auxquelles il est adressé et peut contenir des informations privilégiées ou confidentielles. Si vous avez reçu ce document par erreur, merci de nous l'indiquer par retour, de ne pas le transmettre et de procéder à sa destruction.

This message is solely intended for the use of the individual or entity to which it is addressed and may contain information that is privileged or confidential. If you have received this communication by error, please notify us immediately by electronic mail, do not disclose it and delete the original message."



--
Best regards,

Shaofeng Shi 史少锋



"Ce message est destiné exclusivement aux personnes ou entités auxquelles il est adressé et peut contenir des informations privilégiées ou confidentielles. Si vous avez reçu ce document par erreur, merci de nous l'indiquer par retour, de ne pas le transmettre et de procéder à sa destruction.

This message is solely intended for the use of the individual or entity to which it is addressed and may contain information that is privileged or confidential. If you have received this communication by error, please notify us immediately by electronic mail, do not disclose it and delete the original message."


RE: usage of Web inteface Kylin an performances

Posted by BELLIER Jean-luc <je...@rte-france.com>.
Hello Shaofeng,

Thank you for this response.

I would like to clarify some things about the ‘Insight’ part of the Kylin Web interface.
It indicates the name of the cube when a query is launched, but my assumption is that the Hive tables are directly requested. So I do not know where the cube enters into account.
Does this query tool support MDX ? I am not sure, so how can we query the cube elements ?
If we can, can you send me a few examples of MDX queries ?
My assumption is that we need an external tool such as XMLA or other BI tools.

Thank you in advance for your help.

Have a good day.
Best regards,
Jean-Luc.


De : ShaoFeng Shi [mailto:shaofengshi@apache.org]
Envoyé : dimanche 18 février 2018 03:23
À : user <us...@kylin.apache.org>
Objet : Re: usage of Web inteface Kylin an performances

Hi Jean-luc,

Most of the Kylin developers are in the new year holiday, so there might be some delay. Here are some comments from my side:

1.  I presume that the whole .json files are stored, is it right ?
yes
2. Do these kinds of tables contain the cube data ?
yes; cube are stored in HBase with "KYLINL_" as prefix
3. So I am wondering if it is the good method
the "compression" in Tomcat/conf/server.xml has nothing with cube build. To enable compression for cube, you need to configure that in your Hadoop configurations like mapred-site.xml, hbase-site.xml or kylin/conf/kylin_job_conf.xml.
4. How is it possible to optimize cube size to keep good performance ?
https://kylin.apache.org/docs21/howto/howto_optimize_cubes.html
5.  Is it through the ‘rowkeys’ in the advanced settings when you build the cube ?
yes, exactly; putting the most used filtering column to the heading position on the rowkey can get better performance.
6. What shall we put exactly in the ‘Rowkeys’ section ?
All dimensions (excluding 'derived' dimensions) need be on rowkey; If you see too many columns in the agg. group, remove some dimensions from your cube.
7.  Are the aggregation groups used for speed of the queries.
The agg. group is used to optimize the dimension combinations. For a N dimension cube, by default it will have 2^N combinations (we called cuboid). If you can divide N dimensions to several groups, the combination numbers can be greatly reduced, so the cube build will be much easier and taking much less space. How to define the agg. group? You can do that with your business query patterns.



2018-02-14 1:49 GMT+08:00 BELLIER Jean-luc <je...@rte-france.com>>:
Hello,

I have several questions on Kylin, especially about performances and how to manage them. I would like to understand precisely how it works to see if I can use it in my business context.

I come from the relational database world, so as far as I understand on OLAP, the searches are performed on the values of primary keys in dimensions. These subsets are then joined to get the corresponding lines on the facts table. As the dimensions tables are much smaller than the facts table, the queries run faster


1.       Questions on performances

•         the raw data are stored in Hive, and the models and structures (cubes) are stored in HBase; I presume that the whole .json files are stored, is it right ?

•         Where are the cube results stores (I mean after a build, a refresh or an append action). Is it also in HBase ? I can see in HBase tables like "KYLIN_FF46WDAAGH". Do these kinds of tables contain the cube data ?

•         I noticed that when I build the ‘sample_cube’, the volume of data was very important compared to the size of the original files. Is there a way to reduce it (I saw a attribute in the $KYLIN_HOME/tomcat/conf/server.xml file, called ‘compression’ for the connector on port 7070, but I do not know if it is related to the cube size). I tried to change this parameter to ‘yes’, but I noticed a huge increase of the duration of cube generation. So I am wondering if it is the good method.

•         How is it possible to optimize cube size to keep good performance ?

•         In Hive, putting indexes is not recommended. So how Kylin is ensuring good performance when querying high volumes of data  ? Is it through the ‘rowkeys’ in the advanced settings when you build the cube ?
Or is the answer elsewhere ?


2.       Questions on cube building

•         By the way, the ‘Advanced settings’ step is still unclear for me. I tried to build a cube from scratch using the tables provided in the sample project. But I do not know very much what to put in this section.

•         My goal is to define groups of data on YEAR_BEG_DT, QTR_BEG_DT,MONTH_BEG_DT.

•         I do not understand very well why the aggregation group contains so many columns. I tried to remove as many as possible, but when I tried to set up the joins, but some fields were missing so the saving of the cube failed.

•         What shall we put exactly in the ‘Rowkeys’ section ? I understand that this is used to define data encoding (for speed access ? ).Am I right ?

•         Are the aggregation groups used for speed of the queries. I assume it is the case, because it represents the most commonly used associations of columns for the cube.

Thank you in advance for your help.

Best regards,
Jean-Luc.






"Ce message est destiné exclusivement aux personnes ou entités auxquelles il est adressé et peut contenir des informations privilégiées ou confidentielles. Si vous avez reçu ce document par erreur, merci de nous l'indiquer par retour, de ne pas le transmettre et de procéder à sa destruction.

This message is solely intended for the use of the individual or entity to which it is addressed and may contain information that is privileged or confidential. If you have received this communication by error, please notify us immediately by electronic mail, do not disclose it and delete the original message."



--
Best regards,

Shaofeng Shi 史少锋



"Ce message est destiné exclusivement aux personnes ou entités auxquelles il est adressé et peut contenir des informations privilégiées ou confidentielles. Si vous avez reçu ce document par erreur, merci de nous l'indiquer par retour, de ne pas le transmettre et de procéder à sa destruction.

This message is solely intended for the use of the individual or entity to which it is addressed and may contain information that is privileged or confidential. If you have received this communication by error, please notify us immediately by electronic mail, do not disclose it and delete the original message."

Re: usage of Web inteface Kylin an performances

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Jean-luc,

Most of the Kylin developers are in the new year holiday, so there might be
some delay. Here are some comments from my side:

1.  I presume that the whole .json files are stored, is it right ?
yes
2. Do these kinds of tables contain the cube data ?
yes; cube are stored in HBase with "KYLINL_" as prefix
3. So I am wondering if it is the good method
the "compression" in Tomcat/conf/server.xml has nothing with cube build. To
enable compression for cube, you need to configure that in your Hadoop
configurations like mapred-site.xml, hbase-site.xml or
kylin/conf/kylin_job_conf.xml.
4. How is it possible to optimize cube size to keep good performance ?
https://kylin.apache.org/docs21/howto/howto_optimize_cubes.html
5.  Is it through the ‘rowkeys’ in the advanced settings when you build the
cube ?
yes, exactly; putting the most used filtering column to the heading
position on the rowkey can get better performance.
6. What shall we put exactly in the ‘Rowkeys’ section ?
All dimensions (excluding 'derived' dimensions) need be on rowkey; If you
see too many columns in the agg. group, remove some dimensions from your
cube.
7.  Are the aggregation groups used for speed of the queries.
The agg. group is used to optimize the dimension combinations. For a N
dimension cube, by default it will have 2^N combinations (we called
cuboid). If you can divide N dimensions to several groups, the combination
numbers can be greatly reduced, so the cube build will be much easier and
taking much less space. How to define the agg. group? You can do that with
your business query patterns.



2018-02-14 1:49 GMT+08:00 BELLIER Jean-luc <je...@rte-france.com>
:

> Hello,
>
>
>
> I have several questions on Kylin, especially about performances and how
> to manage them. I would like to understand precisely how it works to see if
> I can use it in my business context.
>
>
>
> I come from the relational database world, so as far as I understand on
> OLAP, the searches are performed on the values of primary keys in
> dimensions. These subsets are then joined to get the corresponding lines on
> the facts table. As the dimensions tables are much smaller than the facts
> table, the queries run faster
>
>
>
> *1.       **Questions on performances*
>
> ·         the raw data are stored in Hive, and the models and structures
> (cubes) are stored in HBase; I presume that the whole .json files are
> stored, is it right ?
>
> ·         Where are the cube results stores (I mean after a build, a
> refresh or an append action). Is it also in HBase ? I can see in HBase
> tables like "KYLIN_FF46WDAAGH". Do these kinds of tables contain the cube
> data ?
>
> ·         I noticed that when I build the ‘sample_cube’, the volume of
> data was very important compared to the size of the original files. Is
> there a way to reduce it (I saw a attribute in the $KYLIN_HOME/tomcat/conf/server.xml
> file, called ‘compression’ for the connector on port 7070, but I do not
> know if it is related to the cube size). I tried to change this parameter
> to ‘yes’, but I noticed a huge increase of the duration of cube generation.
> So I am wondering if it is the good method.
>
> ·         How is it possible to optimize cube size to keep good
> performance ?
>
> ·         In Hive, putting indexes is not recommended. So how Kylin is
> ensuring good performance when querying high volumes of data  ? Is it
> through the ‘rowkeys’ in the advanced settings when you build the cube ?
>
> Or is the answer elsewhere ?
>
>
>
> *2.       **Questions on cube building*
>
> ·         By the way, the ‘Advanced settings’ step is still unclear for
> me. I tried to build a cube from scratch using the tables provided in the
> sample project. But I do not know very much what to put in this section.
>
> ·         My goal is to define groups of data on YEAR_BEG_DT,
> QTR_BEG_DT,MONTH_BEG_DT.
>
> ·         I do not understand very well why the aggregation group
> contains so many columns. I tried to remove as many as possible, but when I
> tried to set up the joins, but some fields were missing so the saving of
> the cube failed.
>
> ·         What shall we put exactly in the ‘Rowkeys’ section ? I
> understand that this is used to define data encoding (for speed access ?
> ).Am I right ?
>
> ·         Are the aggregation groups used for speed of the queries. I
> assume it is the case, because it represents the most commonly used
> associations of columns for the cube.
>
>
>
> Thank you in advance for your help.
>
>
>
> Best regards,
>
> Jean-Luc.
>
>
>
>
>
>
>
>
> "Ce message est destiné exclusivement aux personnes ou entités auxquelles
> il est adressé et peut contenir des informations privilégiées ou
> confidentielles. Si vous avez reçu ce document par erreur, merci de nous
> l'indiquer par retour, de ne pas le transmettre et de procéder à sa
> destruction.
>
> This message is solely intended for the use of the individual or entity to
> which it is addressed and may contain information that is privileged or
> confidential. If you have received this communication by error, please
> notify us immediately by electronic mail, do not disclose it and delete the
> original message."
>



-- 
Best regards,

Shaofeng Shi 史少锋