Posted to dev@kylin.apache.org by 仇同心 <qi...@jd.com> on 2016/11/10 06:54:10 UTC

Cube build optimization inquiry

Hi all:

     We are currently running into a problem when building a cube: the cardinality of the cube's dimensions is not very high, but the cardinality of the measure columns is very high, and the Build Dimension Dictionary step consumes a great deal of local memory. The measure columns we selected have cardinalities in the tens of millions, the hundreds of millions, even around a billion; most of the measures are SUM and exact count_distinct. The data covers 10 months, and we plan to build the 10 months of history in one run and then switch to daily incremental jobs.

    The server has 125 GB of memory; step "#4 Step Name: Build Dimension Dictionary" keeps running for a very long time and eventually runs out of memory.

     Is there a good optimization approach for this kind of high-cardinality measure problem?



Thanks~




Re: Reply: Cube build optimization inquiry

Posted by Li Yang <li...@apache.org>.
> The cardinality of the cube's dimensions is not very high, but the cardinality of the measure columns is very high

You must have confused the concepts of dimension and measure. Because
measures don't need a dictionary, a high-cardinality measure won't cause
any problem for dictionary building at all.

Don't add measure columns as dimensions, and your problem should be gone.

If you're not sure, please share the model and cube JSON with us; we can
help more.
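
For illustration, a rough sketch (table and column names are made up): a
high-cardinality column belongs in the measures array of the cube desc
JSON, not in the dimensions list. Here "returntype": "bitmap" requests
exact count distinct, while the hllc(N) return types give approximate
counts:

    {
      "measures": [
        {
          "name": "GMV_SUM",
          "function": {
            "expression": "SUM",
            "parameter": {"type": "column", "value": "GMV"},
            "returntype": "decimal(19,4)"
          }
        },
        {
          "name": "UV_EXACT",
          "function": {
            "expression": "COUNT_DISTINCT",
            "parameter": {"type": "column", "value": "USER_ID"},
            "returntype": "bitmap"
          }
        }
      ]
    }

Columns listed as dimensions get a dictionary built for them by default,
which is where the Build Dimension Dictionary step spends its memory.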

Cheers
Yang

2016-11-14 15:03 GMT+08:00 仇同心 <qi...@jd.com>:

> Building in batches of 15 days each, I tried to build the cube. The cube
> build succeeded, but when auto-merging the cube an error appeared:
>
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2147)
>         at org.apache.commons.io.IOUtils.copy(IOUtils.java:2102)
>         at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2123)
>         at org.apache.commons.io.IOUtils.copy(IOUtils.java:2078)
>         at org.apache.kylin.storage.hbase.HBaseResourceStore.putResourceImpl(HBaseResourceStore.java:239)
>         at org.apache.kylin.common.persistence.ResourceStore.putResource(ResourceStore.java:208)
>         at org.apache.kylin.dict.DictionaryManager.save(DictionaryManager.java:413)
>         at org.apache.kylin.dict.DictionaryManager.saveNewDict(DictionaryManager.java:209)
>         at org.apache.kylin.dict.DictionaryManager.trySaveNewDict(DictionaryManager.java:176)
>         at org.apache.kylin.dict.DictionaryManager.mergeDictionary(DictionaryManager.java:269)
>         at org.apache.kylin.engine.mr.steps.MergeDictionaryStep.mergeDictionaries(MergeDictionaryStep.java:145)
>         at org.apache.kylin.engine.mr.steps.MergeDictionaryStep.makeDictForNewSegment(MergeDictionaryStep.java:135)
>         at org.apache.kylin.engine.mr.steps.MergeDictionaryStep.doWork(MergeDictionaryStep.java:67)
>         at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:113)
>         at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:57)
>         at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:113)
>         at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:136)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
>
>
>
> -----Original Message-----
> From: Luke Han [mailto:luke.hq@gmail.com]
> Sent: 2016-11-12 0:01
> To: user@kylin.apache.org
> Cc: dev
> Subject: Re: Cube build optimization inquiry
>
> Don't try to run such a huge job in one go; please run the batches one by
> one. For example, build one month of data, then the next...
>
>
>
>
> Best Regards!
> ---------------------
>
> Luke Han
>
> 2016-11-10 14:54 GMT+08:00 仇同心 <qi...@jd.com>:
>
> > Hi all:
> >
> >      We are currently running into a problem when building a cube: the
> > cardinality of the cube's dimensions is not very high, but the
> > cardinality of the measure columns is very high, and the Build Dimension
> > Dictionary step consumes a great deal of local memory. The measure
> > columns we selected have cardinalities in the tens of millions, the
> > hundreds of millions, even around a billion; most of the measures are
> > SUM and exact count_distinct. The data covers 10 months, and we plan to
> > build the 10 months of history in one run and then switch to daily
> > incremental jobs.
> >
> >     The server has 125 GB of memory; step "#4 Step Name: Build Dimension
> > Dictionary" keeps running for a very long time and eventually runs out
> > of memory.
> >
> >      Is there a good optimization approach for this kind of
> > high-cardinality measure problem?
> >
> > Thanks~
> >
>

Reply: Cube build optimization inquiry

Posted by 仇同心 <qi...@jd.com>.
Building in batches of 15 days each, I tried to build the cube. The cube build succeeded, but when auto-merging the cube an error appeared:

java.lang.OutOfMemoryError: Requested array size exceeds VM limit
	at java.util.Arrays.copyOf(Arrays.java:2271)
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2147)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:2102)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2123)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:2078)
	at org.apache.kylin.storage.hbase.HBaseResourceStore.putResourceImpl(HBaseResourceStore.java:239)
	at org.apache.kylin.common.persistence.ResourceStore.putResource(ResourceStore.java:208)
	at org.apache.kylin.dict.DictionaryManager.save(DictionaryManager.java:413)
	at org.apache.kylin.dict.DictionaryManager.saveNewDict(DictionaryManager.java:209)
	at org.apache.kylin.dict.DictionaryManager.trySaveNewDict(DictionaryManager.java:176)
	at org.apache.kylin.dict.DictionaryManager.mergeDictionary(DictionaryManager.java:269)
	at org.apache.kylin.engine.mr.steps.MergeDictionaryStep.mergeDictionaries(MergeDictionaryStep.java:145)
	at org.apache.kylin.engine.mr.steps.MergeDictionaryStep.makeDictForNewSegment(MergeDictionaryStep.java:135)
	at org.apache.kylin.engine.mr.steps.MergeDictionaryStep.doWork(MergeDictionaryStep.java:67)
	at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:113)
	at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:57)
	at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:113)
	at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:136)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
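
(For context on the merge itself: the failure above is inside
MergeDictionaryStep, which loads the per-segment dictionaries being merged
and writes the merged result back through the resource store, so memory use
grows with the size and number of segment dictionaries. The merge schedule
is governed by auto_merge_time_ranges in the cube desc, thresholds in
milliseconds; the common 7-day/28-day defaults are sketched below, and an
empty list turns auto merge off so merges can be run by hand. This is an
illustrative fragment, not the cube in question.)

    {
      "auto_merge_time_ranges": [
        604800000,
        2419200000
      ]
    }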




-----Original Message-----
From: Luke Han [mailto:luke.hq@gmail.com]
Sent: 2016-11-12 0:01
To: user@kylin.apache.org
Cc: dev
Subject: Re: Cube build optimization inquiry

Don't try to run such a huge job in one go; please run the batches one by one. For example, build one month of data, then the next...




Best Regards!
---------------------

Luke Han

2016-11-10 14:54 GMT+08:00 仇同心 <qi...@jd.com>:

> Hi all:
>
>      We are currently running into a problem when building a cube: the
> cardinality of the cube's dimensions is not very high, but the cardinality
> of the measure columns is very high, and the Build Dimension Dictionary
> step consumes a great deal of local memory. The measure columns we
> selected have cardinalities in the tens of millions, the hundreds of
> millions, even around a billion; most of the measures are SUM and exact
> count_distinct. The data covers 10 months, and we plan to build the 10
> months of history in one run and then switch to daily incremental jobs.
>
>     The server has 125 GB of memory; step "#4 Step Name: Build Dimension
> Dictionary" keeps running for a very long time and eventually runs out of
> memory.
>
>      Is there a good optimization approach for this kind of
> high-cardinality measure problem?
>
> Thanks~
>

Re: Cube build optimization inquiry

Posted by Luke Han <lu...@gmail.com>.
Don't try to run such a huge job in one go; please run the batches one by
one. For example, build one month of data, then the next...
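
Each batch can be kicked off with one REST call: PUT
/kylin/api/cubes/{cubeName}/rebuild, where startTime and endTime are epoch
milliseconds and buildType is BUILD, so a 10-month backfill becomes ten
such calls with consecutive windows. A sketch of the request body for one
month (2016-01-01 to 2016-02-01 UTC; the cube name and exact dates here
are only placeholders):

    {
      "startTime": 1451606400000,
      "endTime": 1454284800000,
      "buildType": "BUILD"
    }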




Best Regards!
---------------------

Luke Han

2016-11-10 14:54 GMT+08:00 仇同心 <qi...@jd.com>:

> Hi all:
>
>      We are currently running into a problem when building a cube: the
> cardinality of the cube's dimensions is not very high, but the cardinality
> of the measure columns is very high, and the Build Dimension Dictionary
> step consumes a great deal of local memory. The measure columns we
> selected have cardinalities in the tens of millions, the hundreds of
> millions, even around a billion; most of the measures are SUM and exact
> count_distinct. The data covers 10 months, and we plan to build the 10
> months of history in one run and then switch to daily incremental jobs.
>
>     The server has 125 GB of memory; step "#4 Step Name: Build Dimension
> Dictionary" keeps running for a very long time and eventually runs out of
> memory.
>
>      Is there a good optimization approach for this kind of
> high-cardinality measure problem?
>
> Thanks~
>
