Posted to dev@carbondata.apache.org by xuchuanyin <xu...@hust.edu.cn> on 2018/10/23 10:55:07 UTC

[Discussion] How to configure the unsafe working memory for data loading

Hi all,
I went through the code and derived another formula to estimate the unsafe working
memory. It is still inaccurate, but we can use this thread to refine it.

# Memory Required For Data Loading per Table

## version from Community
(carbon.number.of.cores.while.loading) * (offheap.sort.chunk.size.inmb +
carbon.blockletgroup.size.in.mb + carbon.blockletgroup.size.in.mb/3.5 )
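
To make the arithmetic concrete, here is a minimal sketch of this formula in
Java (the class and method names are mine, not from the CarbonData code base;
the parameters simply mirror the carbon.properties keys and are assumed to be
in MB):

public class CommunityMemoryEstimate {
  // Estimate (in MB) following the community formula above:
  // cores * (sort chunk + blocklet group + blocklet group / 3.5)
  static double estimateMb(int coresWhileLoading,
                           double offheapSortChunkSizeInMb,
                           double blockletGroupSizeInMb) {
    return coresWhileLoading
        * (offheapSortChunkSizeInMb
            + blockletGroupSizeInMb
            + blockletGroupSizeInMb / 3.5);
  }

  public static void main(String[] args) {
    // 15 cores, 64 MB sort chunk, 64 MB blocklet group -> about 2194 MB
    System.out.println(estimateMb(15, 64, 64) + " MB");
  }
}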

## version from proposal
memory_size_required
 = max(sort_temp_memory_consumption, data_encoding_consumption)
 = max{(number.of.cores + 1) * offheap.sort.chunk.size.inmb, number.of.cores * TABLE_PAGE_SIZE}
 = max{(number.of.cores + 1) * offheap.sort.chunk.size.inmb, number.of.cores * (number.of.fields * per.column.page.size + compress.temp.size)}
 = max{(number.of.cores + 1) * offheap.sort.chunk.size.inmb, number.of.cores * (number.of.fields * per.column.page.size + per.column.page.size/3.5)}
 = max{(number.of.cores + 1) * offheap.sort.chunk.size.inmb, number.of.cores * (number.of.fields * (32000 * 8 * 1.25) + (32000 * 8 * 1.25)/3.5)}

Note: 
1.  offheap.sort.chunk.size.inmb is the size of one UnsafeCarbonRowPage
2.  per.column.page.size is the size of one ColumnPage
3.  compress.temp.size is the temporary buffer size used when compressing with
snappy (in UnsafeFixLengthColumnPage.compress)
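
A similar sketch for the proposed formula, using the terms defined in the note
above (again, the names and units are my own, and 32000 * 8 * 1.25 is just the
rough per-column-page size from the derivation, not a measured value):

public class ProposedMemoryEstimate {
  // max(sort-temp consumption, data-encoding consumption), in bytes
  static long estimateBytes(int cores, long offheapSortChunkSizeInMb, int numberOfFields) {
    long mb = 1024L * 1024L;
    // (number.of.cores + 1) * offheap.sort.chunk.size.inmb
    long sortTemp = (cores + 1) * offheapSortChunkSizeInMb * mb;
    // per.column.page.size ~ 32000 rows * 8 bytes * 1.25 grow factor
    double perColumnPage = 32000 * 8 * 1.25;
    // TABLE_PAGE_SIZE ~ fields * per-column page + snappy temp buffer (~page/3.5)
    double tablePage = numberOfFields * perColumnPage + perColumnPage / 3.5;
    long encoding = (long) (cores * tablePage);
    return Math.max(sortTemp, encoding);
  }

  public static void main(String[] args) {
    // 15 cores, 64 MB sort chunk, 330 fields -> about 1512 MB with these
    // rough constants, in the same ballpark as the example worked out below
    System.out.println(estimateBytes(15, 64, 330) / (1024.0 * 1024.0) + " MB");
  }
}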

## problems with each version
1.  Neither considers the local dictionary, which is disabled by default;
2.  Neither considers the in-memory intermediate merge, which is disabled
by default;

### for Community version
1. For each load, the sort-temp procedure finishes before the
producer-consumer procedure, so we do not need to add the two parts together.
2. During the producer-consumer procedure, #number.of.cores TablePages are
generated at the same time, and their total size may exceed
#carbon.blockletgroup.size.in.mb, so using only
#carbon.blockletgroup.size.in.mb can still cause a memory shortage,
especially when #number.of.cores is high (see the rough numbers below).
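
As a rough illustration with the numbers from the example below: one TablePage
for 300 fields is roughly 300 * 32000 * 8 * 1.25 bytes (about 92 MB), already
larger than carbon.blockletgroup.size.in.mb = 64 MB, and with 15 loading cores
up to 15 such pages can be alive at the same time.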

### for proposed version
1. It roughly uses 8 bytes * 1.25 (a grow factor in our code) as the size of
one value, which is inaccurate. Besides, 32000 is only the maximum record
count in one page, especially after adaptive page size for longstring and
complex columns is implemented.
2. We can further decompose #per.column.page.size by identifying the data
type and the data length of string columns, but this may be too tedious for
users to calculate by hand (see the sketch below). We can also run the data
loading once and measure the actual #TABLE_PAGE_SIZE or
#per.column.page.size, which should be accurate.
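
As a rough sketch of what such a per-column decomposition could look like (the
data-type widths and the helper below are my own assumptions for illustration,
not CarbonData internals):

public class ColumnPageSizeEstimate {
  // Hypothetical per-column page size: rows * bytesPerValue * grow factor.
  // String columns need an average length measured from the actual data,
  // which is why doing this by hand is tedious for wide tables.
  static long perColumnPageBytes(String dataType, int rowsPerPage, int avgStringLength) {
    double growFactor = 1.25;                // same factor as in the formula above
    int bytesPerValue;
    switch (dataType) {
      case "int":    bytesPerValue = 4; break;
      case "long":
      case "double": bytesPerValue = 8; break;
      case "string": bytesPerValue = avgStringLength; break;
      default:       bytesPerValue = 8;      // fall back to the 8-byte assumption
    }
    return (long) (rowsPerPage * (long) bytesPerValue * growFactor);
  }

  public static void main(String[] args) {
    // e.g. a string column with an average length of 40 bytes, 32000 rows per page
    System.out.println(perColumnPageBytes("string", 32000, 40) + " bytes");
  }
}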

## for example
number.of.cores = 15
offheap.sort.chunk.size.inmb = 64
number.of.fields = 300

### Community version
memory_size_required
 = 15 * (64MB + 64MB + 64MB/3.5)
 = 2194MB

### proposed version
memory_size_required
 = max{(15 + 1) * 64MB, 15 * (330 * (32000 * 8 * 1.25) + 32000 * 8 * 1.25 /
3.5)}
 = max{1073741824, 15 * 108228023}
 = max{1073741824, 1623420343}
 = 1548MB





--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: Re: [Discussion] How to configure the unsafe working memory for data loading

Posted by xuchuanyin <xu...@hust.edu.cn>.
Hi, what is the number of cores in your executor?

And is there only one load running when you encounter this failure?

Besides, can you check whether the local dictionary is enabled for your table
using 'desc formatted table_name'? If it is enabled, more memory will be
needed, and the formula above does not account for that. You can try to
set 'carbon.local.dictionary.decoder.fallback' to false and try again.
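
For example, in carbon.properties (this is just the property mentioned above;
whether it helps depends on how your table uses the local dictionary):

carbon.local.dictionary.decoder.fallback=false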






--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [Discussion] How to configure the unsafe working memory for data loading

Posted by 喜之郎 <25...@qq.com>.
Hi chuanyin, can you help answer this question?
I think the formula may not be accurate.
Also, how many factors affect the unsafe working memory?
Looking forward to your reply, thanks!


------------------ Original Message ------------------
From: "251922566"<25...@qq.com>;
Date: Friday, November 30, 2018, 3:22 PM
To: "dev"<de...@carbondata.apache.org>;

Subject: Re: [Discussion] How to configure the unsafe working memory for data loading



Hi chuanyin, I found that this formula may not be correct.
When I do the loading, I set spark.yarn.executor.memoryOverhead = 5120 and set these carbon properties as below:


carbon.number.of.cores.while.loading=5
carbon.lock.type=HDFSLOCK
enable.unsafe.sort=true
offheap.sort.chunk.size.inmb=64
sort.inmemory.size.inmb=4096
carbon.enable.vector.reader=true
enable.unsafe.in.query.processing=true
carbon.blockletgroup.size.in.mb=64
enable.unsafe.columnpage=true
carbon.unsafe.working.memory.in.mb=4096



But it still reports that the unsafe memory is not enough. According to the formula from the community, it only needs
5 * (64MB + 64MB + 64MB/3.5) = 732MB.
According to the formula you proposed, it only needs
max{(5 + 1) * 64MB, 5 * (330 * (32000 * 8 * 1.25) + 32000 * 8 * 1.25 / 3.5)} = 530MB.
PS: I have about 300 fields.


Spark version: 2.2.1
Carbon version: apache-carbondata-1.4.1-bin-spark2.2.1-hadoop2.7.2


Looking forward to your reply.







