Posted to issues@carbondata.apache.org by "xuchuanyin (JIRA)" <ji...@apache.org> on 2018/04/13 07:14:00 UTC

[jira] [Resolved] (CARBONDATA-2023) Optimization in data loading for skewed data

     [ https://issues.apache.org/jira/browse/CARBONDATA-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xuchuanyin resolved CARBONDATA-2023.
------------------------------------
    Resolution: Fixed

> Optimization in data loading for skewed data
> --------------------------------------------
>
>                 Key: CARBONDATA-2023
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-2023
>             Project: CarbonData
>          Issue Type: Improvement
>          Components: data-load
>    Affects Versions: 1.3.0
>            Reporter: xuchuanyin
>            Assignee: xuchuanyin
>            Priority: Major
>          Time Spent: 16h 40m
>  Remaining Estimate: 0h
>
> In one of my cases, carbondata has to load skewed data files. The sizes of the data files range from 1KB to about 5GB.
> In the current implementation, carbondata distributes the file blocks (splits) among the nodes to maximize data locality and to spread the data evenly; we call this `block-node-assignment` for short.
> However, the current implementation has some problems.
> The assignment is based on block count: the goal is for every node to handle the same number of blocks. In the skewed-data scenario described above, the block of a small file and the block of a big file differ greatly in size (1KB vs. 64MB). As a result, the total data size assigned to each node can differ widely.
> In order to solve this problem, block size should be taken into account during block-node-assignment: one node can handle more blocks than another as long as the total sizes of the assigned blocks are almost the same.
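
The size-aware assignment described above can be sketched as a simple greedy strategy: process blocks from largest to smallest and always give the next block to the node whose assigned total is currently smallest. This is only an illustrative sketch, not CarbonData's actual implementation; the class and method names are hypothetical, and real block-node assignment would also weigh data locality.

```java
import java.util.*;

public class SizeBasedAssignment {
    // Hypothetical sketch: balance the total bytes assigned to each node,
    // rather than the number of blocks, so one 5GB block does not end up
    // counting the same as one 1KB block.
    static Map<String, List<Long>> assign(long[] blockSizes, List<String> nodes) {
        Map<String, List<Long>> assignment = new HashMap<>();
        Map<String, Long> totals = new HashMap<>();
        for (String n : nodes) {
            assignment.put(n, new ArrayList<>());
            totals.put(n, 0L);
        }
        // Sort ascending, then walk backwards: placing the largest blocks
        // first lets the many small blocks fill in the remaining gaps.
        long[] sorted = blockSizes.clone();
        Arrays.sort(sorted);
        for (int i = sorted.length - 1; i >= 0; i--) {
            long size = sorted[i];
            // Pick the node with the smallest assigned total so far.
            String target = Collections.min(totals.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            assignment.get(target).add(size);
            totals.put(target, totals.get(target) + size);
        }
        return assignment;
    }
}
```

With block sizes {5000, 64, 64, 64, 64, 1} and two nodes, a count-based split would give each node three blocks, leaving one node with roughly 5000 units and the other with about 130; the greedy size-based split instead isolates the 5000-unit block on one node and groups the small blocks on the other.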



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)