Posted to issues@carbondata.apache.org by "xuchuanyin (JIRA)" <ji...@apache.org> on 2017/09/20 12:39:02 UTC

[jira] [Comment Edited] (CARBONDATA-1281) Disk hotspot found during data loading

    [ https://issues.apache.org/jira/browse/CARBONDATA-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104607#comment-16104607 ] 

xuchuanyin edited comment on CARBONDATA-1281 at 9/20/17 12:38 PM:
------------------------------------------------------------------

Here is the configuration I used in my test, for others to reference.

# ENV

3 HUAWEI RH2288 nodes, each with 24 cores (E5-2667 @ 2.90GHz), 256GB memory, and 11 SAS disks

We use JDBCServer for the loading test.
There are 4 executors in total (3 executors, one on each node, plus 1 driver executor).
executor: 20 cores, 128GB per executor
driver executor: 1 core, 20GB

# USE CASE

88 million records in CSV format

340+ columns per record

No dictionary columns

TABLE_BLOCKSIZE 64

INVERTED_INDEX on about 9 columns

# CONF

parameter                             value  original value
carbon.number.of.cores                20
carbon.number.of.cores.while.loading  14
sort.inmemory.size.inmb               2048   1024
offheap.sort.chunk.size.inmb          128    64
carbon.sort.intermediate.files.limit  20     20
carbon.sort.file.buffer.size          50     20
carbon.use.local.dir                  true   false
carbon.use.multiple.dir               true   false
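
For anyone who wants to reproduce this, the settings above can be written out as key=value lines. The following is only a sketch: the names and values are copied from this comment as-is and are not checked against any particular CarbonData release, and where the lines should go (for example a carbon.properties file) is an assumption. `LOAD_TUNING` and `render_properties` are hypothetical names used for illustration.

```python
# Settings copied from this comment; names and values are as reported here
# and are not verified against any CarbonData release.
LOAD_TUNING = {
    "carbon.number.of.cores": "20",
    "carbon.number.of.cores.while.loading": "14",
    "sort.inmemory.size.inmb": "2048",
    "offheap.sort.chunk.size.inmb": "128",
    "carbon.sort.intermediate.files.limit": "20",
    "carbon.sort.file.buffer.size": "50",
    "carbon.use.local.dir": "true",
    "carbon.use.multiple.dir": "true",
}


def render_properties(settings):
    """Render a dict as key=value lines in Java .properties style."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


if __name__ == "__main__":
    print(render_properties(LOAD_TUNING), end="")
```

Keeping the values in one dict makes it easy to diff against the original values when trying a different combination.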

# RESULT

Using `LOAD DATA INPATH`, the loading took about 6 minutes.

In NMON, the IO usage is spread quite evenly across the disks.
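
For a rough sense of scale, the reported load times can be turned into average input rates. This sketch assumes the 71GB input size from the issue description applies to this test as well, and takes 1 GB = 1000 MB. It measures how fast the CSV input was consumed, not the write rate to the temp disks.

```python
# Figures taken from this thread: about 71 GB of CSV input (issue description),
# about 10 minutes per load before the change and about 6 minutes after it.
INPUT_GB = 71
NODES = 3


def input_rate_mb_per_s(input_gb, minutes):
    """Average input consumption rate in MB/s, taking 1 GB = 1000 MB."""
    return input_gb * 1000 / (minutes * 60)


before = input_rate_mb_per_s(INPUT_GB, 10)
after = input_rate_mb_per_s(INPUT_GB, 6)

if __name__ == "__main__":
    print(f"before: {before:.0f} MB/s total, {before / NODES:.0f} MB/s per node")
    print(f"after:  {after:.0f} MB/s total, {after / NODES:.0f} MB/s per node")
    print(f"speedup: {after / before:.2f}x")
```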



> Disk hotspot found during data loading
> --------------------------------------
>
>                 Key: CARBONDATA-1281
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1281
>             Project: CarbonData
>          Issue Type: Improvement
>          Components: core, data-load
>    Affects Versions: 1.1.0
>            Reporter: xuchuanyin
>            Assignee: xuchuanyin
>             Fix For: 1.2.0
>
>          Time Spent: 17h 40m
>  Remaining Estimate: 0h
>
> # Scenario
> We recently ran a large data load. The input data is about 71GB in CSV format and has about 88 million records. We do not use any dictionary encoding in carbondata. Our test environment has three nodes, each with 11 disks used as yarn executor directories. We submit the loading command through JDBCServer. The JDBCServer instance has three executors in total, one on each node. The loading takes about 10 minutes (varying by about ±3 minutes between runs).
> We observed the nmon output during the loading and found:
> 1. lots of CPU wait in the first half of the loading;
> 2. only one disk gets many writes and almost reaches its bottleneck (avg. 80MB/s, max. 150MB/s on a SAS disk);
> 3. the other disks are quite idle.
> # Analyze
> When loading data, carbondata reads and sorts the data locally (the default scope) and writes the temp files to local disk. In my case there is only one executor per node, so carbondata writes all the temp files to one disk (the container directory or yarn local directory), resulting in a single-disk hotspot.
> # Modification
> We should support multiple directories for writing temp files to avoid the disk hotspot.
> Ps: I have implemented this in my environment and the result is promising: the loading takes about 6 minutes (10 minutes before the improvement).
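
The multi-directory idea in the issue description can be sketched roughly as follows. This is not the CarbonData implementation: the thread does not say how a directory is picked for each temp file, so round-robin selection is an assumption here, and `TempDirRotator` and the `disk0..disk2` directories are hypothetical names used only for illustration.

```python
import itertools
import os
import tempfile


class TempDirRotator:
    """Hand out temp-file paths across several directories in round-robin order."""

    def __init__(self, directories):
        if not directories:
            raise ValueError("at least one directory is required")
        self._cycle = itertools.cycle(list(directories))
        self._counter = itertools.count()

    def next_path(self, prefix="sorttemp"):
        # Each call moves on to the next directory, so consecutive temp files
        # land on different disks when each directory sits on its own disk.
        directory = next(self._cycle)
        os.makedirs(directory, exist_ok=True)
        return os.path.join(directory, f"{prefix}_{next(self._counter)}.tmp")


if __name__ == "__main__":
    # Hypothetical stand-ins for per-disk yarn local directories.
    root = tempfile.mkdtemp()
    disks = [os.path.join(root, f"disk{i}") for i in range(3)]
    rotator = TempDirRotator(disks)
    for _ in range(6):
        print(rotator.next_path())
```

Round-robin spreads the number of files evenly but not necessarily the bytes written, and this sketch is not thread-safe. Random selection would be another simple option.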



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)