Posted to dev@kylin.apache.org by Luke Han <lu...@gmail.com> on 2015/08/03 14:27:41 UTC
Re: On improving WHEN statements performance on other columns

Hi Luca, the auto generator is a helper that lets users create a cube faster.
The data model, however, is essential for good performance: it requires the
modeler to pay attention to and carefully design each piece. A tool cannot
help much there, especially with row key order, aggregation groups and so on.
In our production cases it took many tuning cycles to reach the best balance;
the trade-off between storage and performance is not easy to know in advance.
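
To illustrate why row key order matters, here is a minimal toy sketch in
Python. It is not Kylin or HBase code: the dimension names, values and helper
functions are invented for illustration, and it only models the general idea
that keys are stored sorted, so an EQ filter on the leading column maps to one
contiguous range while a filter on a later column does not.

```python
from bisect import bisect_left, bisect_right
from itertools import product

# Hypothetical dimensions and values, chosen only for this example.
DIMS = {
    "country": ["DE", "FR", "IT", "US"],
    "year": [2013, 2014, 2015],
    "category": ["a", "b", "c", "d", "e"],
}

def build_rowkeys(order):
    """Return all key tuples, sorted, with dimensions in the given order."""
    return sorted(product(*(DIMS[name] for name in order)))

def rows_scanned_for_eq(rowkeys, position, value):
    """Count the rows a naive scan touches to answer `column = value`.

    When the column leads the key, matching rows are contiguous and a
    range scan is enough. Otherwise this toy model inspects every row.
    """
    if position == 0:
        firsts = [key[0] for key in rowkeys]
        return bisect_right(firsts, value) - bisect_left(firsts, value)
    return len(rowkeys)

country_first = build_rowkeys(["country", "year", "category"])
country_last = build_rowkeys(["category", "year", "country"])

print(rows_scanned_for_eq(country_first, 0, "IT"))  # range scan on the prefix
print(rows_scanned_for_eq(country_last, 2, "IT"))   # falls back to a full scan
```

With these 60 toy rows, the filter on the leading column touches 15 rows and
the same filter on the trailing column touches all 60. A real storage engine
may apply further optimizations, so treat the numbers as an intuition only.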

But you are also right, we should put better information in the UI. We are
trying to build a guide in house now; hopefully it will lead to a better
tutorial for people starting with Kylin.

Meanwhile, please refer to this page for more details about hierarchies:
http://www.kimballgroup.com/2008/10/maintaining-dimension-hierarchies/

Thank you very much.



Best Regards!
---------------------

Luke Han

On Fri, Jul 31, 2015 at 5:49 PM, Luca Costabello <lu...@gmail.com>
wrote:

> Hello Li,
>
> Thanks a lot for the heads up.
>
> Indeed, I was trying to apply EQ and IN statements on columns belonging to
> a derived dimension.
> I did not get that such columns are not included in the rowkey generation,
> hence my need for a secondary index on HBase.
>
> I have now added the columns involved in filters as normal dimensions, and
> I get sub-second queries with EQ and IN statements as expected.
>
> As a side note, I was a little misled by the "Auto Generator" wizard in the
> cube creation UI (step 3):  the wizard adds all the selected columns from a
> lookup table as a derived dimension by default. Nevertheless, as you
> mentioned above, if a column must be used in EQ and IN statements later on,
> it should not be included in the derived dimension, and put in a normal
> dimension instead (to include it in the rowkey). Maybe an additional info
> panel that explains such behaviour could be useful.
>
> Also, I think the UI should better inform that the order of columns in the
> rowkey is important performance-wise (although you wrote it in the slide
> deck).
>
> I have also noticed that someone else has asked for clarification about
> the definition of hierarchies.
> https://issues.apache.org/jira/browse/KYLIN-887
>
> Thanks,
>
> luca
>
>
> On Sat, Jul 11, 2015 at 2:12 AM, Li Yang <li...@apache.org> wrote:
>
> > Hi Luca, could you give an example of your cube definition and query? I'm
> > not 100% sure I understand the problem.
> >
> > > Such statements include EQ or IN operators and are not defined on
> > rowkeys.
> > If a column is not on rowkey, then you defined it as derived? From a cube
> > design point of view, such columns should be on rowkey for best
> > performance. And better to be the first column of rowkey, because then
> the
> > EQ / IN condition will cut down the scan range significantly.
> >
> > Cheers
> > Yang
> >
> > On Tue, Jul 7, 2015 at 4:28 AM, Julian Hyde <jh...@apache.org> wrote:
> >
> > > Does your use case look like
> > >
> > >    …
> > >    WHERE (CASE
> > >                    WHEN condition1 THEN constant1
> > >                    WHEN condition2 THEN constant2 …
> > >                    END ) = constant1
> > >
> > > If so, https://issues.apache.org/jira/browse/CALCITE-727 may help.
> (The
> > > fix is not in current Kylin, but maybe it could be in within a month or
> > so.)
> > >
> > > Julian
> > >
> > > On Jul 6, 2015, at 2:49 AM, Luca Costabello <luca.costabello@gmail.com
> >
> > > wrote:
> > >
> > > > Hello all,
> > > >
> > > > In my adoption scenario (~50 M records) I must execute queries with
> > WHEN
> > > > statements. Such statements include EQ or IN operators and are not
> > > defined
> > > > on rowkeys.
> > > >
> > > > Unfortunately, the lack of secondary indexes in HBase determines
> > response
> > > > times that go well above 1 minute. While this can be acceptable under
> > > many
> > > > circumstances, it severely degrades the performance of the system I
> > have
> > > > built over Kylin (it is my understanding that each EQ condition or IN
> > > > element determines a HBase full scan).
> > > >
> > > > I would like to know if someone has come up with a solution or
> > > workaround.
> > > > I think you guys already apply some client request filters [1] to
> some
> > > > extent.
> > > > Have any of you tried to integrate Kylin HBase client code with
> hindex
> > > [2]?
> > > > I wonder if the coprocessor-based approach adopted by hindex might be
> > > > effective - even though hindex does not come as a standalone jar, so
> > > > deploying the hindex HBase fork is necessary (I do not know how reliable
> > > > hindex is, and the latest commit is 6 months old). Besides, some
> change
> > > to
> > > > Kylin HBase client code would be required (when creating cube
> HTables).
> > > > I have also had a quick look at Phoenix [3], which comes with
> secondary
> > > > indexes support, but I wonder if it makes sense to integrate that
> with
> > > > Kylin (in this case I think Kylin HBase client code should be heavily
> > > > modified to switch to Phoenix APIs.)
> > > >
> > > > Long story short, I wonder if someone could give me a heads up and
> > point
> > > me
> > > > in the right direction.
> > > >
> > > >
> > > > Cheers,
> > > > luca
> > > >
> > > > [1] http://hbase.apache.org/book.html#client.filter
> > > > [2] https://github.com/Huawei-Hadoop/hindex/tree/hbase-0.98
> > > > [3] https://phoenix.apache.org/secondary_indexing.html
> > >
> > >
> >
>