Posted to issues@carbondata.apache.org by xuchuanyin <gi...@git.apache.org> on 2018/01/16 03:37:42 UTC

[GitHub] carbondata pull request #1808: [CARBONDATA-2023][DataLoad] Add size base blo...

GitHub user xuchuanyin opened a pull request:

    https://github.com/apache/carbondata/pull/1808

    [CARBONDATA-2023][DataLoad] Add size base block allocation in data loading

    Carbondata assigns blocks to nodes at the beginning of data loading.
    The previous block allocation strategy is block-number based, and it
    suffers from data skew problems if the sizes of the input files differ a lot.
    
    We introduce a size-based block allocation strategy to optimize data
    loading performance in skewed-data scenarios.
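    
    To illustrate the idea, below is a minimal, hypothetical sketch of size-based
    assignment (not the actual CarbonLoaderUtil code, which also handles data
    locality): each block is given to the node with the smallest accumulated size.
    ```java
    import java.util.*;
    
    class SizeBasedAssignSketch {
      // Greedy size-based assignment: each block goes to the currently "lightest" node.
      static Map<String, List<Long>> assign(List<Long> blockSizes, List<String> nodes) {
        Map<String, List<Long>> node2Blocks = new HashMap<>();
        Map<String, Long> node2Size = new HashMap<>();
        for (String node : nodes) {
          node2Blocks.put(node, new ArrayList<>());
          node2Size.put(node, 0L);
        }
        // process big blocks first so smaller ones can even things out at the end
        List<Long> sorted = new ArrayList<>(blockSizes);
        sorted.sort(Comparator.reverseOrder());
        for (long size : sorted) {
          // pick the node with the smallest accumulated size so far
          String target = Collections.min(node2Size.entrySet(),
              Map.Entry.comparingByValue()).getKey();
          node2Blocks.get(target).add(size);
          node2Size.put(target, node2Size.get(target) + size);
        }
        return node2Blocks;
      }
    }
    ```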
    
    Be sure to do all of the following checklist to help us incorporate 
    your contribution quickly and easily:
    
     - [x] Any interfaces changed?
     `Only changed the internal interfaces`
     - [x] Any backward compatibility impacted?
     `No`
     - [x] Document update required?
    `Updated the document`
     - [x] Testing done
            Please provide details on 
            - Whether new unit test cases have been added or why no new tests are required?
    `Added tests to verify the block-allocation correctness`
            - How it is tested? Please attach test report.
    `Tested in local 3-node cluster`
            - Is it a performance related change? Please attach the performance test report.
    ```
    In my scenario, the size of the input data files varies from 1KB to about 5GB.
    Before enabling this feature, each executor processed the same number of blocks,
     but the processed data size had a 5X gap.  --(block number based allocation)
    After enabling this feature, each executor processed almost the same size of data,
     while the processed number of data blocks had a 6X gap. -- (block size based allocation)
    
    The data loading throughput improved from 41MB/s/Node to 61MB/s/Node,
    a performance gain of about 50%.
    ```
    
            - Any additional information to help reviewers in testing this change.
           `I refactored the code to make it more readable. The core code mainly lies in CarbonLoaderUtil`
     - [x] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. 
    `Not related`


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuchuanyin/carbondata opt_size_base_block_allocation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1808.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1808
    
----
commit f9ee9eaa1d0289c958a0dcbc665a383ea190a812
Author: xuchuanyin <xu...@...>
Date:   2018-01-16T02:59:37Z

    Add size base block allocation in data loading
    
    Carbondata assigns blocks to nodes at the beginning of data loading.
    The previous block allocation strategy is block-number based, and it
    suffers from data skew problems if the sizes of the input files differ a lot.
    
    We introduce a size-based block allocation strategy to optimize data
    loading performance in skewed-data scenarios.

----


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/2959/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    retest this please


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    retest this please


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    retest this please


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Merged to carbonstore branch


---

[GitHub] carbondata pull request #1808: [CARBONDATA-2023][DataLoad] Add size base blo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin closed the pull request at:

    https://github.com/apache/carbondata/pull/1808


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/3598/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2339/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/2909/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/1660/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    retest this please


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/3609/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/3456/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/1713/



---

[GitHub] carbondata pull request #1808: [CARBONDATA-2023][DataLoad] Add size base blo...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1808#discussion_r167115897
  
    --- Diff: docs/useful-tips-on-carbondata.md ---
    @@ -169,5 +169,6 @@
       | carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading | Whether use YARN local directories for multi-table load disk load balance | If this is set it to true CarbonData will use YARN local directories for multi-table load disk load balance, that will improve the data load performance. |
       | carbon.use.multiple.temp.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use multiple YARN local directories during table data loading for disk load balance | After enabling 'carbon.use.local.dir', if this is set to true, CarbonData will use all YARN local directories during data load for disk load balance, that will improve the data load performance. Please enable this property when you encounter disk hotspot problem during data loading. |
       | carbon.sort.temp.compressor | spark/carbonlib/carbon.properties | Data loading | Specify the name of compressor to compress the intermediate sort temporary files during sort procedure in data loading. | The optional values are 'SNAPPY','GZIP','BZIP2','LZ4' and empty. By default, empty means that Carbondata will not compress the sort temp files. This parameter will be useful if you encounter disk bottleneck. |
    +  | carbon.load.skewed.data.optimization | spark/carbonlib/carbon.properties | Data loading | Whether to enable size based block allocation strategy for data loading. | Carbondata will use number based block allocation strategy by default and it will make sure that all the executors process the same number of blocks. If this value is set to true, Carbondata will make sure that all the executors process the same size of data -- It's useful if the size of your input data files varies widely, say 1MB~1GB. |
    --- End diff --
    
    `Carbondata will use number based block allocation strategy by default` change to 
    `When loading, carbondata will use file size based block allocation strategy for task distribution`
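    
    For reference, a minimal sketch of enabling the property programmatically
    (assuming the usual CarbonProperties API; the property key comes from the doc
    row above):
    ```java
    import org.apache.carbondata.core.util.CarbonProperties;
    
    public class EnableSkewOptimization {
      public static void main(String[] args) {
        // equivalent to setting it in spark/carbonlib/carbon.properties
        CarbonProperties.getInstance()
            .addProperty("carbon.load.skewed.data.optimization", "true");
      }
    }
    ```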


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/2970/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/2984/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    retest this please


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    retest this please


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/1629/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/3608/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/3577/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/3336/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2481/



---

[GitHub] carbondata pull request #1808: [CARBONDATA-2023][DataLoad] Add size base blo...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1808#discussion_r167120778
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -606,122 +609,296 @@ private static void createTaskListForNode(Map<String, List<List<Distributable>>>
       }
     
       /**
    -   * If any left over data blocks are present then assign those to nodes in round robin way.
    +   * If any left over data blocks are present then assign those to nodes in round robin way. This
    +   * will not obey the data locality.
        *
        * @param outputMap
    -   * @param uniqueBlocks
    +   * @param leftOverBlocks
        */
    -  private static void assignLeftOverBlocks(Map<String, List<Distributable>> outputMap,
    -      Set<Distributable> uniqueBlocks, int noOfBlocksPerNode, List<String> activeNodes) {
    +  private static void assignLeftOverBlocks(ArrayList<NodeMultiBlockRelation> outputMap,
    +      Set<Distributable> leftOverBlocks, long expectedSizePerNode, List<String> activeNodes,
    +      BlockAssignmentStrategy blockAssignmentStrategy) {
    +    Map<String, Integer> node2Idx = new HashMap<>(outputMap.size());
    +    for (int idx = 0; idx < outputMap.size(); idx++) {
    +      node2Idx.put(outputMap.get(idx).getNode(), idx);
    +    }
     
    +    // iterate all the nodes and try to allocate blocks to the nodes
         if (activeNodes != null) {
           for (String activeNode : activeNodes) {
    -        List<Distributable> blockLst = outputMap.get(activeNode);
    -        if (null == blockLst) {
    +        if (LOGGER.isDebugEnabled()) {
    +          LOGGER.debug("Second assignment iteration: assign for executor: " + activeNode);
    +        }
    +
    +        Integer idx;
    +        List<Distributable> blockLst;
    +        if (node2Idx.containsKey(activeNode)) {
    +          idx = node2Idx.get(activeNode);
    +          blockLst = outputMap.get(idx).getBlocks();
    +        } else {
    +          idx = node2Idx.size();
               blockLst = new ArrayList<Distributable>(CarbonCommonConstants.DEFAULT_COLLECTION_SIZE);
             }
    -        populateBlocks(uniqueBlocks, noOfBlocksPerNode, blockLst);
    -        if (blockLst.size() > 0) {
    -          outputMap.put(activeNode, blockLst);
    +        populateBlocks(leftOverBlocks, expectedSizePerNode, blockLst, blockAssignmentStrategy);
    +
    +        if (!node2Idx.containsKey(activeNode) && blockLst.size() > 0) {
    +          outputMap.add(idx, new NodeMultiBlockRelation(activeNode, blockLst));
    +          node2Idx.put(activeNode, idx);
             }
           }
         } else {
    -      for (Map.Entry<String, List<Distributable>> entry : outputMap.entrySet()) {
    -        List<Distributable> blockLst = entry.getValue();
    -        populateBlocks(uniqueBlocks, noOfBlocksPerNode, blockLst);
    +      for (NodeMultiBlockRelation entry : outputMap) {
    +        List<Distributable> blockLst = entry.getBlocks();
    +        populateBlocks(leftOverBlocks, expectedSizePerNode, blockLst, blockAssignmentStrategy);
           }
    -
         }
     
    -    for (Map.Entry<String, List<Distributable>> entry : outputMap.entrySet()) {
    -      Iterator<Distributable> blocks = uniqueBlocks.iterator();
    -      if (blocks.hasNext()) {
    -        Distributable block = blocks.next();
    -        List<Distributable> blockLst = entry.getValue();
    -        blockLst.add(block);
    -        blocks.remove();
    -      }
    -    }
    +    // if there is still blocks left, allocate them in round robin manner to each nodes
    +    assignBlocksUseRoundRobin(outputMap, leftOverBlocks, blockAssignmentStrategy);
       }
     
       /**
    -   * The method populate the blockLst to be allocate to a specific node.
    -   * @param uniqueBlocks
    -   * @param noOfBlocksPerNode
    +   * assign blocks to nodes
    +   * @param remainingBlocks
    +   * @param expectedSizePerNode
    +   * @param blockLst
    +   * @param blockAssignmentStrategy
    +   */
    +  private static void populateBlocks(Set<Distributable> remainingBlocks,
    +      long expectedSizePerNode, List<Distributable> blockLst,
    +      BlockAssignmentStrategy blockAssignmentStrategy) {
    +    switch (blockAssignmentStrategy) {
    +      case BLOCK_NUM_FIRST:
    +        populateBlocksByNum(remainingBlocks, expectedSizePerNode, blockLst);
    +        break;
    +      case BLOCK_SIZE_FIRST:
    +        populateBlocksBySize(remainingBlocks, expectedSizePerNode, blockLst);
    +        break;
    +      default:
    +        throw new IllegalArgumentException(
    +            "Unsupported block assignment strategy: " + blockAssignmentStrategy);
    +    }
    +  }
    +  /**
    +   * allocate blocks by block num
    +   * @param remainingBlocks
    +   * @param expectedSizePerNode
        * @param blockLst
        */
    -  private static void populateBlocks(Set<Distributable> uniqueBlocks, int noOfBlocksPerNode,
    -      List<Distributable> blockLst) {
    -    Iterator<Distributable> blocks = uniqueBlocks.iterator();
    -    //if the node is already having the per block nodes then avoid assign the extra blocks
    -    if (blockLst.size() == noOfBlocksPerNode) {
    +  private static void populateBlocksByNum(Set<Distributable> remainingBlocks,
    +      long expectedSizePerNode, List<Distributable> blockLst) {
    +    Iterator<Distributable> blocks = remainingBlocks.iterator();
    +    // if the node is already having the per block nodes then avoid assign the extra blocks
    +    if (blockLst.size() == expectedSizePerNode) {
           return;
         }
         while (blocks.hasNext()) {
           Distributable block = blocks.next();
           blockLst.add(block);
           blocks.remove();
    -      if (blockLst.size() >= noOfBlocksPerNode) {
    +      if (blockLst.size() >= expectedSizePerNode) {
    +        break;
    +      }
    +    }
    +  }
    +
    +  /**
    +   * allocate blocks by block size
    --- End diff --
    
    Change to:
    Take N distributable blocks from remainingBlocks and add them to the output parameter blockList. The number of blocks added is determined so that the accumulated block size stays below expectedSizePerNode.
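    
    A simplified sketch of that behavior (paraphrasing the patch; block sizes stand
    in for the Distributable blocks, so this is not the exact code):
    ```java
    import java.util.Iterator;
    import java.util.List;
    
    class PopulateBySizeSketch {
      // Keep taking blocks from the remaining pool until the accumulated
      // size assigned to this node reaches expectedSizePerNode.
      static void takeUpToSize(Iterator<Long> remainingBlockSizes,
          List<Long> assignedSizes, long expectedSizePerNode) {
        long accumulated = 0;
        for (long size : assignedSizes) {
          accumulated += size;
        }
        while (remainingBlockSizes.hasNext() && accumulated < expectedSizePerNode) {
          long blockSize = remainingBlockSizes.next();
          assignedSizes.add(blockSize);
          accumulated += blockSize;
          remainingBlockSizes.remove();  // consumed from the remaining pool
        }
      }
    }
    ```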



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/1588/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/3656/



---

[GitHub] carbondata pull request #1808: [CARBONDATA-2023][DataLoad] Add size base blo...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1808#discussion_r167121062
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -606,122 +609,296 @@ private static void createTaskListForNode(Map<String, List<List<Distributable>>>
       }
     
       /**
    -   * If any left over data blocks are present then assign those to nodes in round robin way.
    +   * If any left over data blocks are present then assign those to nodes in round robin way. This
    +   * will not obey the data locality.
        *
        * @param outputMap
    -   * @param uniqueBlocks
    +   * @param leftOverBlocks
        */
    -  private static void assignLeftOverBlocks(Map<String, List<Distributable>> outputMap,
    -      Set<Distributable> uniqueBlocks, int noOfBlocksPerNode, List<String> activeNodes) {
    +  private static void assignLeftOverBlocks(ArrayList<NodeMultiBlockRelation> outputMap,
    +      Set<Distributable> leftOverBlocks, long expectedSizePerNode, List<String> activeNodes,
    +      BlockAssignmentStrategy blockAssignmentStrategy) {
    +    Map<String, Integer> node2Idx = new HashMap<>(outputMap.size());
    +    for (int idx = 0; idx < outputMap.size(); idx++) {
    +      node2Idx.put(outputMap.get(idx).getNode(), idx);
    +    }
     
    +    // iterate all the nodes and try to allocate blocks to the nodes
         if (activeNodes != null) {
           for (String activeNode : activeNodes) {
    -        List<Distributable> blockLst = outputMap.get(activeNode);
    -        if (null == blockLst) {
    +        if (LOGGER.isDebugEnabled()) {
    +          LOGGER.debug("Second assignment iteration: assign for executor: " + activeNode);
    +        }
    +
    +        Integer idx;
    +        List<Distributable> blockLst;
    +        if (node2Idx.containsKey(activeNode)) {
    +          idx = node2Idx.get(activeNode);
    +          blockLst = outputMap.get(idx).getBlocks();
    +        } else {
    +          idx = node2Idx.size();
               blockLst = new ArrayList<Distributable>(CarbonCommonConstants.DEFAULT_COLLECTION_SIZE);
             }
    -        populateBlocks(uniqueBlocks, noOfBlocksPerNode, blockLst);
    -        if (blockLst.size() > 0) {
    -          outputMap.put(activeNode, blockLst);
    +        populateBlocks(leftOverBlocks, expectedSizePerNode, blockLst, blockAssignmentStrategy);
    +
    +        if (!node2Idx.containsKey(activeNode) && blockLst.size() > 0) {
    +          outputMap.add(idx, new NodeMultiBlockRelation(activeNode, blockLst));
    +          node2Idx.put(activeNode, idx);
             }
           }
         } else {
    -      for (Map.Entry<String, List<Distributable>> entry : outputMap.entrySet()) {
    -        List<Distributable> blockLst = entry.getValue();
    -        populateBlocks(uniqueBlocks, noOfBlocksPerNode, blockLst);
    +      for (NodeMultiBlockRelation entry : outputMap) {
    +        List<Distributable> blockLst = entry.getBlocks();
    +        populateBlocks(leftOverBlocks, expectedSizePerNode, blockLst, blockAssignmentStrategy);
           }
    -
         }
     
    -    for (Map.Entry<String, List<Distributable>> entry : outputMap.entrySet()) {
    -      Iterator<Distributable> blocks = uniqueBlocks.iterator();
    -      if (blocks.hasNext()) {
    -        Distributable block = blocks.next();
    -        List<Distributable> blockLst = entry.getValue();
    -        blockLst.add(block);
    -        blocks.remove();
    -      }
    -    }
    +    // if there is still blocks left, allocate them in round robin manner to each nodes
    +    assignBlocksUseRoundRobin(outputMap, leftOverBlocks, blockAssignmentStrategy);
       }
     
       /**
    -   * The method populate the blockLst to be allocate to a specific node.
    -   * @param uniqueBlocks
    -   * @param noOfBlocksPerNode
    +   * assign blocks to nodes
    +   * @param remainingBlocks
    +   * @param expectedSizePerNode
    +   * @param blockLst
    +   * @param blockAssignmentStrategy
    +   */
    +  private static void populateBlocks(Set<Distributable> remainingBlocks,
    +      long expectedSizePerNode, List<Distributable> blockLst,
    +      BlockAssignmentStrategy blockAssignmentStrategy) {
    +    switch (blockAssignmentStrategy) {
    +      case BLOCK_NUM_FIRST:
    +        populateBlocksByNum(remainingBlocks, expectedSizePerNode, blockLst);
    +        break;
    +      case BLOCK_SIZE_FIRST:
    +        populateBlocksBySize(remainingBlocks, expectedSizePerNode, blockLst);
    +        break;
    +      default:
    +        throw new IllegalArgumentException(
    +            "Unsupported block assignment strategy: " + blockAssignmentStrategy);
    +    }
    +  }
    +  /**
    +   * allocate blocks by block num
    +   * @param remainingBlocks
    +   * @param expectedSizePerNode
        * @param blockLst
        */
    -  private static void populateBlocks(Set<Distributable> uniqueBlocks, int noOfBlocksPerNode,
    -      List<Distributable> blockLst) {
    -    Iterator<Distributable> blocks = uniqueBlocks.iterator();
    -    //if the node is already having the per block nodes then avoid assign the extra blocks
    -    if (blockLst.size() == noOfBlocksPerNode) {
    +  private static void populateBlocksByNum(Set<Distributable> remainingBlocks,
    +      long expectedSizePerNode, List<Distributable> blockLst) {
    +    Iterator<Distributable> blocks = remainingBlocks.iterator();
    +    // if the node is already having the per block nodes then avoid assign the extra blocks
    +    if (blockLst.size() == expectedSizePerNode) {
           return;
         }
         while (blocks.hasNext()) {
           Distributable block = blocks.next();
           blockLst.add(block);
           blocks.remove();
    -      if (blockLst.size() >= noOfBlocksPerNode) {
    +      if (blockLst.size() >= expectedSizePerNode) {
    +        break;
    +      }
    +    }
    +  }
    +
    +  /**
    +   * allocate blocks by block size
    +   * @param remainingBlocks
    +   * @param expectedSizePerNode
    +   * @param blockLst
    +   */
    +  private static void populateBlocksBySize(Set<Distributable> remainingBlocks,
    +      long expectedSizePerNode, List<Distributable> blockLst) {
    +    Iterator<Distributable> blocks = remainingBlocks.iterator();
    +    //if the node is already having the avg node size then avoid assign the extra blocks
    +    long fileSize = 0;
    +    for (Distributable block : blockLst) {
    +      fileSize += ((TableBlockInfo) block).getBlockLength();
    +    }
    +    if (fileSize >= expectedSizePerNode) {
    +      LOGGER.debug("Capacity is full, skip allocate blocks on this node");
    +      return;
    +    }
    +
    +    while (blocks.hasNext()) {
    +      Distributable block = blocks.next();
    +      long thisBlockSize = ((TableBlockInfo) block).getBlockLength();
    +      if (fileSize < expectedSizePerNode) {
    +        // `fileSize==0` means there are no blocks assigned to this node before
    +        if (fileSize == 0 || fileSize + thisBlockSize <= expectedSizePerNode * 1.1D) {
    +          blockLst.add(block);
    +          if (LOGGER.isDebugEnabled()) {
    +            LOGGER.debug("Second Assignment iteration: "
    +                + ((TableBlockInfo) block).getFilePath() + "-"
    +                + ((TableBlockInfo) block).getBlockLength() + "-->currentNode");
    +          }
    +          fileSize += thisBlockSize;
    +          blocks.remove();
    +        }
    +      } else {
             break;
           }
         }
       }
     
    +  /**
    +   * allocate the blocks in round robin manner
    +   * @param node2Blocks
    +   * @param remainingBlocks
    +   * @param blockAssignmentStrategy
    +   */
    +  private static void assignBlocksUseRoundRobin(ArrayList<NodeMultiBlockRelation> node2Blocks,
    +      Set<Distributable> remainingBlocks, BlockAssignmentStrategy blockAssignmentStrategy) {
    +    switch (blockAssignmentStrategy) {
    +      case BLOCK_NUM_FIRST:
    +        roundRobinAssignBlocksByNum(node2Blocks, remainingBlocks);
    +        break;
    +      case BLOCK_SIZE_FIRST:
    +        roundRobinAssignBlocksBySize(node2Blocks, remainingBlocks);
    +        break;
    +      default:
    +        throw new IllegalArgumentException("Unsupported block assignment strategy: "
    +            + blockAssignmentStrategy);
    +    }
    +  }
    +
    +  private static void roundRobinAssignBlocksByNum(ArrayList<NodeMultiBlockRelation> outputMap,
    +      Set<Distributable> remainingBlocks) {
    +    for (NodeMultiBlockRelation relation: outputMap) {
    +      Iterator<Distributable> blocks = remainingBlocks.iterator();
    +      if (blocks.hasNext()) {
    +        Distributable block = blocks.next();
    +        List<Distributable> blockLst = relation.getBlocks();
    +        blockLst.add(block);
    +        blocks.remove();
    +      }
    +    }
    +  }
    +
    +  private static void roundRobinAssignBlocksBySize(ArrayList<NodeMultiBlockRelation> outputMap,
    +      Set<Distributable> remainingBlocks) {
    +    Iterator<Distributable> blocks = remainingBlocks.iterator();
    +    while (blocks.hasNext()) {
    +      // sort the allocated node-2-blocks in ascending order, the total data size of first one is
    +      // the smallest, so we assign this block to it.
    +      Collections.sort(outputMap, NodeMultiBlockRelation.DATA_SIZE_ASC_COMPARATOR);
    +      Distributable block = blocks.next();
    +      List<Distributable> blockLst = outputMap.get(0).getBlocks();
    +      blockLst.add(block);
    +      if (LOGGER.isDebugEnabled()) {
    +        LOGGER.debug("RoundRobin assignment iteration: "
    +            + ((TableBlockInfo) block).getFilePath() + "-"
    +            + ((TableBlockInfo) block).getBlockLength() + "-->" + outputMap.get(0).getNode());
    +      }
    +      blocks.remove();
    +    }
    +  }
       /**
        * To create the final output of the Node and Data blocks
        *
    -   * @param outputMap
    -   * @param blocksPerNode
    -   * @param uniqueBlocks
    -   * @param nodeAndBlockMapping
    +   * @param outputNode2Blocks
    +   * @param expectedSizePerNode
    +   * @param remainingBlocks
    +   * @param inputNode2Blocks
    --- End diff --
    
    remove @param if no description


---

[GitHub] carbondata pull request #1808: [CARBONDATA-2023][DataLoad] Add size base blo...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1808#discussion_r167119550
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -606,122 +609,296 @@ private static void createTaskListForNode(Map<String, List<List<Distributable>>>
       }
     
       /**
    -   * If any left over data blocks are present then assign those to nodes in round robin way.
    +   * If any left over data blocks are present then assign those to nodes in round robin way. This
    +   * will not obey the data locality.
        *
        * @param outputMap
    -   * @param uniqueBlocks
    +   * @param leftOverBlocks
    --- End diff --
    
    Please remove @param if no description


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/2977/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    merged into carbonstore branch


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/1677/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/2891/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    retest this please


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    retest this please


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    retest this please


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/3721/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2360/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/2952/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/3686/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    retest sdv please


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/2862/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/2947/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/2911/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    retest this please


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/1696/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/1575/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/1581/



---

[GitHub] carbondata pull request #1808: [CARBONDATA-2023][DataLoad] Add size base blo...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1808#discussion_r167120719
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -606,122 +609,296 @@ private static void createTaskListForNode(Map<String, List<List<Distributable>>>
       }
     
       /**
    -   * If any left over data blocks are present then assign those to nodes in round robin way.
    +   * If any left over data blocks are present then assign those to nodes in round robin way. This
    +   * will not obey the data locality.
        *
        * @param outputMap
    -   * @param uniqueBlocks
    +   * @param leftOverBlocks
        */
    -  private static void assignLeftOverBlocks(Map<String, List<Distributable>> outputMap,
    -      Set<Distributable> uniqueBlocks, int noOfBlocksPerNode, List<String> activeNodes) {
    +  private static void assignLeftOverBlocks(ArrayList<NodeMultiBlockRelation> outputMap,
    +      Set<Distributable> leftOverBlocks, long expectedSizePerNode, List<String> activeNodes,
    +      BlockAssignmentStrategy blockAssignmentStrategy) {
    +    Map<String, Integer> node2Idx = new HashMap<>(outputMap.size());
    +    for (int idx = 0; idx < outputMap.size(); idx++) {
    +      node2Idx.put(outputMap.get(idx).getNode(), idx);
    +    }
     
    +    // iterate all the nodes and try to allocate blocks to the nodes
         if (activeNodes != null) {
           for (String activeNode : activeNodes) {
    -        List<Distributable> blockLst = outputMap.get(activeNode);
    -        if (null == blockLst) {
    +        if (LOGGER.isDebugEnabled()) {
    +          LOGGER.debug("Second assignment iteration: assign for executor: " + activeNode);
    +        }
    +
    +        Integer idx;
    +        List<Distributable> blockLst;
    +        if (node2Idx.containsKey(activeNode)) {
    +          idx = node2Idx.get(activeNode);
    +          blockLst = outputMap.get(idx).getBlocks();
    +        } else {
    +          idx = node2Idx.size();
               blockLst = new ArrayList<Distributable>(CarbonCommonConstants.DEFAULT_COLLECTION_SIZE);
             }
    -        populateBlocks(uniqueBlocks, noOfBlocksPerNode, blockLst);
    -        if (blockLst.size() > 0) {
    -          outputMap.put(activeNode, blockLst);
    +        populateBlocks(leftOverBlocks, expectedSizePerNode, blockLst, blockAssignmentStrategy);
    +
    +        if (!node2Idx.containsKey(activeNode) && blockLst.size() > 0) {
    +          outputMap.add(idx, new NodeMultiBlockRelation(activeNode, blockLst));
    +          node2Idx.put(activeNode, idx);
             }
           }
         } else {
    -      for (Map.Entry<String, List<Distributable>> entry : outputMap.entrySet()) {
    -        List<Distributable> blockLst = entry.getValue();
    -        populateBlocks(uniqueBlocks, noOfBlocksPerNode, blockLst);
    +      for (NodeMultiBlockRelation entry : outputMap) {
    +        List<Distributable> blockLst = entry.getBlocks();
    +        populateBlocks(leftOverBlocks, expectedSizePerNode, blockLst, blockAssignmentStrategy);
           }
    -
         }
     
    -    for (Map.Entry<String, List<Distributable>> entry : outputMap.entrySet()) {
    -      Iterator<Distributable> blocks = uniqueBlocks.iterator();
    -      if (blocks.hasNext()) {
    -        Distributable block = blocks.next();
    -        List<Distributable> blockLst = entry.getValue();
    -        blockLst.add(block);
    -        blocks.remove();
    -      }
    -    }
    +    // if there is still blocks left, allocate them in round robin manner to each nodes
    +    assignBlocksUseRoundRobin(outputMap, leftOverBlocks, blockAssignmentStrategy);
       }
     
       /**
    -   * The method populate the blockLst to be allocate to a specific node.
    -   * @param uniqueBlocks
    -   * @param noOfBlocksPerNode
    +   * assign blocks to nodes
    +   * @param remainingBlocks
    +   * @param expectedSizePerNode
    +   * @param blockLst
    +   * @param blockAssignmentStrategy
    +   */
    +  private static void populateBlocks(Set<Distributable> remainingBlocks,
    +      long expectedSizePerNode, List<Distributable> blockLst,
    +      BlockAssignmentStrategy blockAssignmentStrategy) {
    +    switch (blockAssignmentStrategy) {
    +      case BLOCK_NUM_FIRST:
    +        populateBlocksByNum(remainingBlocks, expectedSizePerNode, blockLst);
    +        break;
    +      case BLOCK_SIZE_FIRST:
    +        populateBlocksBySize(remainingBlocks, expectedSizePerNode, blockLst);
    +        break;
    +      default:
    +        throw new IllegalArgumentException(
    +            "Unsupported block assignment strategy: " + blockAssignmentStrategy);
    +    }
    +  }
    +  /**
    +   * allocate blocks by block num
    --- End diff --
    
    Take N distributable blocks from `remainingBlocks` and add them to the output parameter `blockList`. The number of blocks added is determined so that the accumulated size stays below `expectedSizePerNode`.


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/3427/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/2811/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/2894/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/1702/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/2929/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    retest this please


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    retest this please


---

[GitHub] carbondata pull request #1808: [CARBONDATA-2023][DataLoad] Add size base blo...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1808#discussion_r167119617
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -606,122 +609,296 @@ private static void createTaskListForNode(Map<String, List<List<Distributable>>>
       }
     
       /**
    -   * If any left over data blocks are present then assign those to nodes in round robin way.
    +   * If any left over data blocks are present then assign those to nodes in round robin way. This
    +   * will not obey the data locality.
        *
        * @param outputMap
    -   * @param uniqueBlocks
    +   * @param leftOverBlocks
        */
    -  private static void assignLeftOverBlocks(Map<String, List<Distributable>> outputMap,
    -      Set<Distributable> uniqueBlocks, int noOfBlocksPerNode, List<String> activeNodes) {
    +  private static void assignLeftOverBlocks(ArrayList<NodeMultiBlockRelation> outputMap,
    +      Set<Distributable> leftOverBlocks, long expectedSizePerNode, List<String> activeNodes,
    +      BlockAssignmentStrategy blockAssignmentStrategy) {
    +    Map<String, Integer> node2Idx = new HashMap<>(outputMap.size());
    +    for (int idx = 0; idx < outputMap.size(); idx++) {
    +      node2Idx.put(outputMap.get(idx).getNode(), idx);
    +    }
     
    +    // iterate all the nodes and try to allocate blocks to the nodes
         if (activeNodes != null) {
           for (String activeNode : activeNodes) {
    -        List<Distributable> blockLst = outputMap.get(activeNode);
    -        if (null == blockLst) {
    +        if (LOGGER.isDebugEnabled()) {
    +          LOGGER.debug("Second assignment iteration: assign for executor: " + activeNode);
    +        }
    +
    +        Integer idx;
    +        List<Distributable> blockLst;
    +        if (node2Idx.containsKey(activeNode)) {
    +          idx = node2Idx.get(activeNode);
    +          blockLst = outputMap.get(idx).getBlocks();
    +        } else {
    +          idx = node2Idx.size();
               blockLst = new ArrayList<Distributable>(CarbonCommonConstants.DEFAULT_COLLECTION_SIZE);
             }
    -        populateBlocks(uniqueBlocks, noOfBlocksPerNode, blockLst);
    -        if (blockLst.size() > 0) {
    -          outputMap.put(activeNode, blockLst);
    +        populateBlocks(leftOverBlocks, expectedSizePerNode, blockLst, blockAssignmentStrategy);
    +
    +        if (!node2Idx.containsKey(activeNode) && blockLst.size() > 0) {
    +          outputMap.add(idx, new NodeMultiBlockRelation(activeNode, blockLst));
    +          node2Idx.put(activeNode, idx);
             }
           }
         } else {
    -      for (Map.Entry<String, List<Distributable>> entry : outputMap.entrySet()) {
    -        List<Distributable> blockLst = entry.getValue();
    -        populateBlocks(uniqueBlocks, noOfBlocksPerNode, blockLst);
    +      for (NodeMultiBlockRelation entry : outputMap) {
    +        List<Distributable> blockLst = entry.getBlocks();
    +        populateBlocks(leftOverBlocks, expectedSizePerNode, blockLst, blockAssignmentStrategy);
           }
    -
         }
     
    -    for (Map.Entry<String, List<Distributable>> entry : outputMap.entrySet()) {
    -      Iterator<Distributable> blocks = uniqueBlocks.iterator();
    -      if (blocks.hasNext()) {
    -        Distributable block = blocks.next();
    -        List<Distributable> blockLst = entry.getValue();
    -        blockLst.add(block);
    -        blocks.remove();
    -      }
    -    }
    +    // if there is still blocks left, allocate them in round robin manner to each nodes
    +    assignBlocksUseRoundRobin(outputMap, leftOverBlocks, blockAssignmentStrategy);
       }
     
       /**
    -   * The method populate the blockLst to be allocate to a specific node.
    -   * @param uniqueBlocks
    -   * @param noOfBlocksPerNode
    +   * assign blocks to nodes
    +   * @param remainingBlocks
    +   * @param expectedSizePerNode
    +   * @param blockLst
    +   * @param blockAssignmentStrategy
    --- End diff --
    
    Please remove @param if no description


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2370/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/2945/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    retest this please


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/1620/



---

[GitHub] carbondata pull request #1808: [CARBONDATA-2023][DataLoad] Add size base blo...

Posted by xuchuanyin <gi...@git.apache.org>.
GitHub user xuchuanyin reopened a pull request:

    https://github.com/apache/carbondata/pull/1808

    [CARBONDATA-2023][DataLoad] Add size base block allocation in data loading

    Carbondata assigns blocks to nodes at the beginning of data loading.
    The previous block allocation strategy is block-number based, and it
    suffers from data skew problems if the sizes of the input files differ a lot.
    
    We introduce a size-based block allocation strategy to optimize data
    loading performance in skewed-data scenarios.
    
    Be sure to do all of the following checklist to help us incorporate 
    your contribution quickly and easily:
    
     - [x] Any interfaces changed?
     `Only changed the internal interfaces`
     - [x] Any backward compatibility impacted?
     `No`
     - [x] Document update required?
    `Updated the document`
     - [x] Testing done
            Please provide details on 
            - Whether new unit test cases have been added or why no new tests are required?
    `Added tests to verify the block-allocation correctness`
            - How it is tested? Please attach test report.
    `Tested in local 3-node cluster`
            - Is it a performance related change? Please attach the performance test report.
    ```
    In my scenario, the size of the input data files varies from 1KB to about 5GB.
    Before enabling this feature, each executor processed the same number of blocks,
     but the processed data size had a 5X gap.  --(block number based allocation)
    After enabling this feature, each executor processed almost the same size of data,
     while the processed number of data blocks had a 6X gap. -- (block size based allocation)
    
    The data loading throughput improved from 41MB/s/Node to 61MB/s/Node,
    a performance gain of about 50%.
    ```
    
            - Any additional information to help reviewers in testing this change.
           `I refactored the code to make it more readable. The core code mainly lies in CarbonLoaderUtil`
     - [x] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. 
    `Not related`


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuchuanyin/carbondata opt_size_base_block_allocation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1808.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1808
    
----
commit eec35eae52affdfdbd05915ece2f19bedec5e310
Author: xuchuanyin <xu...@...>
Date:   2018-02-08T06:42:39Z

    Add size based block allocation strategy in data loading
    
    Carbondata assigns blocks to nodes at the beginning of data loading.
    The previous block allocation strategy is block-number based, and it
    suffers from data skew problems if the sizes of the input files differ a lot.
    
    We introduce a size-based block allocation strategy to optimize data
    loading performance in skewed-data scenarios.

commit da4f93dd7bbe9faa045c751cba7ae1dd22ce12e4
Author: xuchuanyin <xu...@...>
Date:   2018-02-09T02:25:42Z

    Fix review comments

----


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2133/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/2944/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/1723/



---

[GitHub] carbondata pull request #1808: [CARBONDATA-2023][DataLoad] Add size base blo...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1808#discussion_r167119304
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -70,6 +79,22 @@
       private CarbonLoaderUtil() {
       }
     
    +  /**
    +   * strategy for assign blocks to nodes/executors
    +   */
    +  public enum BlockAssignmentStrategy {
    +    BLOCK_NUM_FIRST("Assign blocks to node base on number of blocks"),
    +    BLOCK_SIZE_FIRST("Assign blocks to node base on data size of blocks");
    +    private String name;
    +    BlockAssignmentStrategy(String name) {
    +      this.name = name;
    +    }
    +
    +    @Override public String toString() {
    --- End diff --
    
    move @Override to previous line, please follow this in future
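    
    For example, the suggested placement would look like the sketch below (the
    toString body is not visible in the diff, so its return value is assumed here):
    ```java
    public enum BlockAssignmentStrategy {
      BLOCK_NUM_FIRST("Assign blocks to node base on number of blocks"),
      BLOCK_SIZE_FIRST("Assign blocks to node base on data size of blocks");
    
      private String name;
    
      BlockAssignmentStrategy(String name) {
        this.name = name;
      }
    
      @Override
      public String toString() {
        // illustrative body; the actual implementation is elided in the diff
        return name;
      }
    }
    ```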


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/3619/



---

[GitHub] carbondata pull request #1808: [CARBONDATA-2023][DataLoad] Add size base blo...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1808#discussion_r167115544
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/block/TableBlockInfo.java ---
    @@ -420,4 +436,16 @@ public String getDataMapWriterPath() {
       public void setDataMapWriterPath(String dataMapWriterPath) {
         this.dataMapWriterPath = dataMapWriterPath;
       }
    +
    +  @Override public String toString() {
    --- End diff --
    
    move @Override to previous line


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/2901/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/3610/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2478/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2446/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    this PR depends on #1952 


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/3246/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    retest this please


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2373/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    LGTM


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/3718/



---

[GitHub] carbondata pull request #1808: [CARBONDATA-2023][DataLoad] Add size base blo...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1808#discussion_r167119457
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -455,83 +481,60 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
         return nodeBlockMapping(blockInfos, -1);
       }
     
    -  /**
    -   * the method returns the number of required executors
    -   *
    -   * @param blockInfos
    -   * @return
    -   */
    -  public static Map<String, List<Distributable>> getRequiredExecutors(
    -      List<Distributable> blockInfos) {
    -    List<NodeBlockRelation> flattenedList =
    -        new ArrayList<NodeBlockRelation>(CarbonCommonConstants.DEFAULT_COLLECTION_SIZE);
    -    for (Distributable blockInfo : blockInfos) {
    -      try {
    -        for (String eachNode : blockInfo.getLocations()) {
    -          NodeBlockRelation nbr = new NodeBlockRelation(blockInfo, eachNode);
    -          flattenedList.add(nbr);
    -        }
    -      } catch (IOException e) {
    -        throw new RuntimeException("error getting location of block: " + blockInfo.toString(), e);
    -      }
    -    }
    -    // sort the flattened data.
    -    Collections.sort(flattenedList);
    -    Map<String, List<Distributable>> nodeAndBlockMapping =
    -        new LinkedHashMap<String, List<Distributable>>(
    -            CarbonCommonConstants.DEFAULT_COLLECTION_SIZE);
    -    // from the flattened list create a mapping of node vs Data blocks.
    -    createNodeVsBlockMapping(flattenedList, nodeAndBlockMapping);
    -    return nodeAndBlockMapping;
    -  }
    -
       /**
        * This method will divide the blocks among the nodes as per the data locality
        *
        * @param blockInfos
        * @param noOfNodesInput -1 if number of nodes has to be decided
        *                       based on block location information
    +   * @param blockAssignmentStrategy strategy used to assign blocks
        * @return
    --- End diff --
    
    please complete the description for @return
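
    One possible way to complete it; the wording below is only a suggestion, not the text that was finally merged:

    ```java
    /**
     * This method will divide the blocks among the nodes as per the data locality
     *
     * @param blockInfos
     * @param noOfNodesInput -1 if number of nodes has to be decided
     *                       based on block location information
     * @param blockAssignmentStrategy strategy used to assign blocks
     * @return a map from node name to the list of blocks assigned to that node
     */
    ```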


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/2849/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    retest this please


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/2935/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/2910/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/3369/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/2936/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    LGTM


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/2816/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/2824/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/3520/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2417/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2100/



---

[GitHub] carbondata pull request #1808: [CARBONDATA-2023][DataLoad] Add size base blo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin closed the pull request at:

    https://github.com/apache/carbondata/pull/1808


---

[GitHub] carbondata pull request #1808: [CARBONDATA-2023][DataLoad] Add size base blo...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1808#discussion_r166936712
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonLoadOptionConstants.java ---
    @@ -114,4 +114,14 @@
        */
       public static final int MAX_EXTERNAL_DICTIONARY_SIZE = 10000000;
     
    +  /**
    +   * enable block size based block allocation while loading data. By default, carbondata assigns
    +   * blocks to node based on block number. If this option is set to `true`, carbondata will
    +   * consider block size first and make sure that all the nodes will process almost equal size of
    +   * data. This option is especially useful when you encounter skewed data.
    +   */
    +  @CarbonProperty
    +  public static final String ENABLE_CARBON_LOAD_SKEWED_DATA_OPTIMIZATION
    +      = "carbon.load.skewed.data.optimization";
    --- End diff --
    
    change to `carbon.load.skewedDataOptimization.enabled`
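
    For reference, with the suggested key a user would typically turn the feature on through CarbonProperties before loading. This is a usage sketch only; it assumes the standard CarbonProperties.getInstance().addProperty(key, value) API and the reviewer's suggested key string, which may not be the name that was finally merged.

    ```java
    import org.apache.carbondata.core.util.CarbonProperties;

    public class EnableSkewOptimizationExample {
      public static void main(String[] args) {
        // Enable size-based (skew-aware) block allocation for subsequent data loads.
        CarbonProperties.getInstance()
            .addProperty("carbon.load.skewedDataOptimization.enabled", "true");
      }
    }
    ```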


---

[GitHub] carbondata pull request #1808: [CARBONDATA-2023][DataLoad] Add size base blo...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1808#discussion_r167115611
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java ---
    @@ -1184,6 +1184,17 @@ public String getSortTempCompressor() {
           return CarbonCommonConstants.CARBON_SORT_TEMP_COMPRESSOR_DEFAULT;
         }
       }
    +
    +  /**
    +   * whether optimization for skewed data is enabled
    +   * @return true, if enabled; false for not enabled.
    +   */
    +  public boolean isCarbonLoadSkewedDataOptimizationEnabled() {
    --- End diff --
    
    change to `isLoadSkewedDataOptimizationEnabled`
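
    The renamed getter might look roughly like this inside CarbonProperties; a sketch only, assuming the usual getProperty(key, defaultValue) lookup, with "false" as an assumed default:

    ```java
    /**
     * whether optimization for skewed data is enabled
     * @return true, if enabled; false for not enabled.
     */
    public boolean isLoadSkewedDataOptimizationEnabled() {
      // The "false" default is an assumption; the PR may define a *_DEFAULT constant.
      String value = getProperty(
          CarbonLoadOptionConstants.ENABLE_CARBON_LOAD_SKEWED_DATA_OPTIMIZATION, "false");
      return Boolean.parseBoolean(value);
    }
    ```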


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/3613/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    retest this please


---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/1740/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/3494/



---

[GitHub] carbondata issue #1808: [CARBONDATA-2023][DataLoad] Add size base block allo...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1808
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2379/



---