You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kylin.apache.org by Tero Paananen <te...@gmail.com> on 2015/02/13 17:38:55 UTC

Using one-to-many/many-to-many type relationships in Kylin

We're looking at using Kylin for an analytics solution. We have a dev
environment set up and it's working fine for simple use cases.

However, we have a data model that has a main entity with properties
that have multiple values and we're not quite sure how to make that
available to Kylin the best way. We're using HBase to store the data.

I'll try and explain the data model using JSON. Hopefully it makes sense.

{
  "simpleProperty" : "foo",
  "anotherSimpleProperty" : "bar",
  "multiValueProperty" : "a,b,c,d,e,f",
  "anotherMultiValueProperty" : "1,2,3,4,5"
}

We'll need to do analysis like this:

select multiValueProperty, anotherMultiValueProperty, simpleProperty, count(*)
from entity
where multiValueProperty = 'a'
  and anotherMultiValueProperty = '1'
group by multiValueProperty, anotherMultiValueProperty, simpleProperty

Output example:

a, 1, foo, 10
a, 1, foo2, 5

or

select multiValueProperty, anotherMultiValueProperty, simpleProperty, count(*)
from entity
where multiValueProperty = 'a'
  and simpleProperty = 'foo'
group by multiValueProperty, anotherMultiValueProperty, simpleProperty

Output example:

a, 1, foo, 10
a, 2, foo, 5
a, 3, foo, 2
a, 4, foo, 25
a, 5, foo, 43
a, 1, foo2, 11
a, 2, foo2, 6
a, 3, foo2, 3
a, 4, foo2, 26
a, 5, foo2, 44

We're looking at some other OLAP on Hadoop type tools that can parse
column values that are delimited in some way that would allow us to
keep the data representation shown in my example JSON, but I don't see
that sort of capability in Kylin.

What's the best way we should model our data to make the data useful
for us? Our data model is not locked by any means, and we can adjust
it to meet the needs of our analytics platform.

-TPP

Re: Using one-to-many/many-to-many type relationships in Kylin

Posted by Li Yang <li...@apache.org>.
The "Second solution" is more generic and practical. It deals well with
different sizes of multi-values. True the fact table (or view) is exploded,
but the final cube will be compact as column values are dictionary
compressed.

On Fri, Feb 20, 2015 at 1:06 PM, 周千昊 <z....@gmail.com> wrote:

> Hi, Paananen,
>      The first step is to create an intermediate hive table.
>      Basically it execute some hql, you can see the exact hqls by click
> the little icon (names parameter look like a key) in the job detail. And
> then you can execute the hql through hive through hive command line to
> see if anything goes wrong.
>
> On Thu Feb 19 2015 at 2:08:10 AM Tero Paananen <te...@gmail.com>
> wrote:
>
>> On Sun, Feb 15, 2015 at 1:22 AM, Zhou, Qianhao <qi...@ebay.com> wrote:
>> > Hi, Paananen,
>> >     Can you tell us about
>> >     1. how many distinct value count in your multi value property
>>
>> It varies a lot. We have some entities that have none, and some that
>> may have thousands.
>>
>> Normally, though, it's probably 3 - 6 values.
>>
>> For our evaluation purposes we could ignore the thousands of values cases.
>>
>> >     2. Which exactly step of the job is stopping you from the cube build
>>
>> It fails on the very first step.
>>
>> The Kylin log lists the job name as:
>>
>> Starting to launch local task to process map join;
>>
>> -TPP
>>
>

Re: Using one-to-many/many-to-many type relationships in Kylin

Posted by 周千昊 <z....@gmail.com>.
Hi, Paananen,
     The first step is to create an intermediate hive table.
     Basically it execute some hql, you can see the exact hqls by click the
little icon (names parameter look like a key) in the job detail. And then
you can execute the hql through hive through hive command line to see if
anything goes wrong.

On Thu Feb 19 2015 at 2:08:10 AM Tero Paananen <te...@gmail.com>
wrote:

> On Sun, Feb 15, 2015 at 1:22 AM, Zhou, Qianhao <qi...@ebay.com> wrote:
> > Hi, Paananen,
> >     Can you tell us about
> >     1. how many distinct value count in your multi value property
>
> It varies a lot. We have some entities that have none, and some that
> may have thousands.
>
> Normally, though, it's probably 3 - 6 values.
>
> For our evaluation purposes we could ignore the thousands of values cases.
>
> >     2. Which exactly step of the job is stopping you from the cube build
>
> It fails on the very first step.
>
> The Kylin log lists the job name as:
>
> Starting to launch local task to process map join;
>
> -TPP
>

Re: Using one-to-many/many-to-many type relationships in Kylin

Posted by Tero Paananen <te...@gmail.com>.
On Sun, Feb 15, 2015 at 1:22 AM, Zhou, Qianhao <qi...@ebay.com> wrote:
> Hi, Paananen,
>     Can you tell us about
>     1. how many distinct value count in your multi value property

It varies a lot. We have some entities that have none, and some that
may have thousands.

Normally, though, it's probably 3 - 6 values.

For our evaluation purposes we could ignore the thousands of values cases.

>     2. Which exactly step of the job is stopping you from the cube build

It fails on the very first step.

The Kylin log lists the job name as:

Starting to launch local task to process map join;

-TPP

Re: Using one-to-many/many-to-many type relationships in Kylin

Posted by "Zhou, Qianhao" <qi...@ebay.com>.
Hi, Paananen,
    Can you tell us about
    1. how many distinct value count in your multi value property
    2. Which exactly step of the job is stopping you from the cube build


Best Regard
Zhou QianHao





On 2/14/15, 5:01 AM, "Tero Paananen" <te...@gmail.com> wrote:

>> I'll try and explain the data model using JSON. Hopefully it makes
>>sense.
>>
>> {
>>   "simpleProperty" : "foo",
>>   "anotherSimpleProperty" : "bar",
>>   "multiValueProperty" : "a,b,c,d,e,f",
>>   "anotherMultiValueProperty" : "1,2,3,4,5"
>> }
>
>An update on this.
>
>We spent some time today to prototype different solutions.
>
>First solution:
>
>Create a Hive table with an array type column, and sync that to Kylin.
>
>create external table MyTable(key STRING, document_id STRING,
>document_date DATE, mvp ARRAY<STRING>)
>ROW FORMAT DELIMITED
>COLLECTION ITEMS TERMINATED BY ','
>STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>WITH SERDEPROPERTIES ('hbase.columns.mapping' =
>':key,c:document_id,c:document_date,c:multiValueProperty)
>TBLPROPERTIES ('hbase.table.name' = 'HBASE_TABLE');
>
>This works well in Hive. The data is represented as an array rather
>than a comma delimited string.
>
>However, Kylin doesn't support the ARRAY data type in Hive tables.
>When you try and sync the table into Kylin it will fail.
>
>
>Second solution:
>
>Use the table created above as a base table for a view in Hive that
>explodes the multiValueProperty array:
>
>create view mvp_view as select document_id, mvp_value FROM MyTable
>LATERAL VIEW explode(mvp) mvpTbl as mvp_value;
>
>The view contains the data as:
>
>"doc_id 1", "a"
>"doc_id 1", "b"
>"doc_id 2", "a"
>etc.
>
>This view will sync to Kylin. We can then create a cube that joins
>into this Hive view linking with the document_id in the Join Condition
>between the fact table and this view table.
>
>The cube build process, however, seems to be using a LOT more memory
>and horsepower than the jobs for cubes using simpler data models and
>we're currently unable to complete that process as our dev servers
>aren't quite powerful enough.
>
>
>Using the LATERAL VIEW and explode() UDF in this way is also probably
>quite inefficient.
>
>-TPP


Re: Using one-to-many/many-to-many type relationships in Kylin

Posted by Tero Paananen <te...@gmail.com>.
> I'll try and explain the data model using JSON. Hopefully it makes sense.
>
> {
>   "simpleProperty" : "foo",
>   "anotherSimpleProperty" : "bar",
>   "multiValueProperty" : "a,b,c,d,e,f",
>   "anotherMultiValueProperty" : "1,2,3,4,5"
> }

An update on this.

We spent some time today to prototype different solutions.

First solution:

Create a Hive table with an array type column, and sync that to Kylin.

create external table MyTable(key STRING, document_id STRING,
document_date DATE, mvp ARRAY<STRING>)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY ','
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' =
':key,c:document_id,c:document_date,c:multiValueProperty)
TBLPROPERTIES ('hbase.table.name' = 'HBASE_TABLE');

This works well in Hive. The data is represented as an array rather
than a comma delimited string.

However, Kylin doesn't support the ARRAY data type in Hive tables.
When you try and sync the table into Kylin it will fail.


Second solution:

Use the table created above as a base table for a view in Hive that
explodes the multiValueProperty array:

create view mvp_view as select document_id, mvp_value FROM MyTable
LATERAL VIEW explode(mvp) mvpTbl as mvp_value;

The view contains the data as:

"doc_id 1", "a"
"doc_id 1", "b"
"doc_id 2", "a"
etc.

This view will sync to Kylin. We can then create a cube that joins
into this Hive view linking with the document_id in the Join Condition
between the fact table and this view table.

The cube build process, however, seems to be using a LOT more memory
and horsepower than the jobs for cubes using simpler data models and
we're currently unable to complete that process as our dev servers
aren't quite powerful enough.


Using the LATERAL VIEW and explode() UDF in this way is also probably
quite inefficient.

-TPP