Posted to dev@kylin.apache.org by zhong zhang <zz...@gmail.com> on 2016/01/08 23:17:58 UTC

build a cube with two ultra high cardinality columns

Hi All,

There are two ultra-high-cardinality columns in our cube; both have a cardinality of over 50 million. When building the cube, the reduce jobs at the step "Extract Fact Table Distinct Columns" keep failing with the error: GC overhead limit exceeded.

We've just updated to version 1.2.

Can anyone give us some ideas on how to solve this issue?

Best regards,
Zhong

Re: build a cube with two ultra high cardinality columns

Posted by ShaoFeng Shi <sh...@apache.org>.
I agree with yu feng: you need to think about whether you really need to build such a high-cardinality dimension into the cube.

For example, if the column is something like a free-text description or a timestamp, it doesn't make sense to have it in the cube, as Kylin is an OLAP engine, not a general-purpose database; you'd better redesign the cube.

If it is something like a "seller_id" (assuming you have a large number of sellers, like eBay) and you need to aggregate the data by each seller_id, then this is a valid case for an ultra-high-cardinality (UHC) dimension.

Think it over and then decide how to move on.


-- 
Best regards,

Shaofeng Shi

Re: build a cube with two ultra high cardinality columns

Posted by yu feng <ol...@gmail.com>.
Assume the average size of such a column value is 32 bytes; a cardinality of 50 million then means roughly 1.5 GB of distinct values. In the 'Extract Fact Table Distinct Columns' step, the mappers read from the intermediate table and remove duplicate values (this is done in the combiner). However, the job starts more than one mapper but only one reducer, so the input to that single reducer is more than 1.5 GB, and in the reduce function Kylin creates a new Set to hold all the unique values, which is another 1.5 GB on the heap.
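
To illustrate the pattern (this is only a simplified sketch, not Kylin's real reducer class), the reducer behaves roughly like this: every distinct value is kept in one in-memory Set until cleanup, so the heap has to hold all 50 million values plus the Java per-object overhead on top of the raw bytes:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Simplified sketch of the "distinct column values" reducer pattern,
    // NOT Kylin's actual class. All unique values end up in one in-memory
    // Set, which is why 50 million 32-byte values can exhaust the reducer heap.
    public class DistinctColumnValuesReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {

        private final Set<String> uniqueValues = new HashSet<String>();

        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context) {
            // The shuffle groups identical values from different mappers, so each
            // distinct value arrives here once; it is kept in memory until cleanup().
            uniqueValues.add(key.toString());
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Only after all values have been collected can the output
            // (e.g. the input for building the dictionary) be written out.
            for (String value : uniqueValues) {
                context.write(new Text(value), NullWritable.get());
            }
        }
    }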

I have encountered this problem, and I had to change the MR config properties for every job. I modified these properties:
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx6000M</value>
        <description>Larger heap-size for child jvms of reduces.</description>
    </property>

    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>8000</value>
        <description>Larger resource limit for reduces.</description>
    </property>
You can check the values currently in effect for these properties and increase them.
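
If you want to see what the cluster currently resolves these to without digging through the XML files, here is a minimal sketch (it assumes the same mapred-site.xml that your jobs use is on the client classpath) that reads the two keys through the Hadoop JobConf API:

    import org.apache.hadoop.mapred.JobConf;

    // Minimal sketch: print the reducer memory settings as resolved by the
    // client-side Hadoop configuration (mapred-default.xml + mapred-site.xml).
    public class ShowReduceMemorySettings {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            System.out.println("mapreduce.reduce.java.opts = "
                    + conf.get("mapreduce.reduce.java.opts", "(not set)"));
            System.out.println("mapreduce.reduce.memory.mb = "
                    + conf.get("mapreduce.reduce.memory.mb", "(not set)"));
        }
    }

Whatever values you pick, keep the -Xmx in mapreduce.reduce.java.opts comfortably below mapreduce.reduce.memory.mb (the YARN container size); the 6000M heap inside an 8000 MB container above follows that rule. In Kylin, these per-job MR overrides can usually be placed in conf/kylin_job_conf.xml, if your installation has that file, so they apply to the cubing jobs.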

Lastly, ask yourself whether you really need all the detailed values of those two columns. If not, you can create a view to change the source data, or simply not use a dictionary when creating the cube and instead set a fixed length for those columns in the 'Advanced Setting' step.

Hope this is helpful to you.
