
Posted to issues@carbondata.apache.org by "Jacky Li (JIRA)" <ji...@apache.org> on 2018/02/25 15:14:00 UTC

[jira] [Resolved] (CARBONDATA-2091) Enhance data loading performance by specifying range bounds for sort columns

     [ https://issues.apache.org/jira/browse/CARBONDATA-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacky Li resolved CARBONDATA-2091.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.3.2

> Enhance data loading performance by specifying range bounds for sort columns
> ----------------------------------------------------------------------------
>
>                 Key: CARBONDATA-2091
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-2091
>             Project: CarbonData
>          Issue Type: Improvement
>            Reporter: xuchuanyin
>            Assignee: xuchuanyin
>            Priority: Major
>             Fix For: 1.3.2
>
>          Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Currently in CarbonData, data loading with node_sort (also known as local_sort) goes through the following steps:
>  # Convert the input data in batches. (*Convert*)
>  # Sort each batch and write it to sort temp files. (*TempSort*)
>  # Combine sort temp files and merge-sort them into a bigger ordered sort temp file. (*MergeSort*)
>  # Combine all the sort temp files and do a final sort; its results feed the next step. (*FinalSort*)
>  # Get rows in order and convert them to CarbonData columnar format pages. (*produce*)
>  # Write bundles of pages to files and write the corresponding index file. (*consume*)
> Steps 1 to 3 run concurrently on multiple threads. Step 4 runs on a single thread. Step 5 runs on multiple threads. Step 4 is therefore the bottleneck of the whole procedure: when observing data loading performance, we can see that CPU usage is low after Step 3.
>  
> We can enhance the data loading performance by parallelizing Step4.
>  
> The user can specify range bounds for the sort columns; CarbonData then internally distributes the records to the different ranges and processes the ranges concurrently.
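
The range-based distribution described above can be illustrated with a minimal sketch. It is not CarbonData code: the names `range_id` and `load_by_range`, the `workers` parameter, and the rule that a key equal to a bound goes to the range on its right are all assumptions made for illustration. The point is only that, once records are split by range bounds, each range can be sorted independently, and reading the ranges in order yields globally ordered rows.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor


def range_id(key, bounds):
    """Return the index of the range that `key` falls into.

    `bounds` must be sorted ascending; N bounds define N + 1 ranges.
    Assumption: a key equal to a bound goes to the range on its right.
    """
    return bisect.bisect_right(bounds, key)


def load_by_range(records, bounds, sort_key, workers=4):
    """Distribute records to ranges, then sort every range concurrently.

    Hypothetical helper: it stands in for the per-range sort that would
    replace the single-threaded final sort.
    """
    ranges = [[] for _ in range(len(bounds) + 1)]
    for record in records:
        ranges[range_id(sort_key(record), bounds)].append(record)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input ranges.
        return list(pool.map(lambda part: sorted(part, key=sort_key), ranges))


# Two bounds (20 and 60) on the sort column give three ranges.
records = [(42, "d"), (7, "a"), (99, "f"), (15, "b"), (63, "e"), (30, "c")]
sorted_ranges = load_by_range(records, bounds=[20, 60], sort_key=lambda r: r[0])
```

Here `sorted_ranges` holds one ordered list per range, and concatenating them gives the same order as sorting all records at once. In CPython the thread pool only shows the structure, since CPU-bound threads do not run in parallel there. How evenly the work spreads also depends on how well the chosen bounds match the data distribution.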


