You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@carbondata.apache.org by manishgupta88 <gi...@git.apache.org> on 2018/04/23 07:06:22 UTC

[GitHub] carbondata pull request #2210: [CARBONDATA-2381] Improve compaction performa...

GitHub user manishgupta88 opened a pull request:

    https://github.com/apache/carbondata/pull/2210

    [CARBONDATA-2381] Improve compaction performance by filling batch result in columnar format and performing IO at blocklet level

    Problem: Compaction performance is slow as compared to data load
    Analysis:
    1. During compaction result filling is done in row format. Due to this as the number of columns increases the dimension and measure data filling time increases. This happens because in row filling we are not able to take advantage of OS cacheable buffers as we continuously read data for next column.
    2. As compaction uses a page level reader flow wherein both IO and uncompression is done at page level, the IO and uncompression time increases in this model.
    
    Solution:
    1. Implement a columnar format filling data structure for compaction process for filling dimension and measure data.
    2. Perform IO at blocklet level and uncompression at page level.
    
    Be sure to do all of the following checklist to help us incorporate 
    your contribution quickly and easily:
    
     - [ ] Any interfaces changed?
     
     - [ ] Any backward compatibility impacted?
     
     - [ ] Document update required?
    
     - [ ] Testing done
            Please provide details on 
            - Whether new unit test cases have been added or why no new tests are required?
            - How it is tested? Please attach test report.
            - Is it a performance related change? Please attach the performance test report.
            - Any additional information to help reviewers in testing this change.
           
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. 
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/manishgupta88/carbondata compaction_slow_fix_1.4

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2210.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2210
    
----
commit 857e5d2a75eaf16457023f5a04a2e420bb8cbff2
Author: manishgupta88 <to...@...>
Date:   2018-04-23T07:06:39Z

    Problem: Compaction performance is slow as compared to data load
    
    Analysis:
    1. During compaction result filling is done in row format. Due to this as the number of columns increases the dimension and measure data filling time
    increases. This happens because in row filling we are not able to take advantage of OS cacheable buffers as we continuously read data for next column.
    2. As compaction uses a page level reader flow wherein both IO and uncompression is done at page level, the IO and uncompression time
    increases in this model.
    
    Solution:
    1. Implement a columnar format filling data structure for compaction process for filling dimension and measure data.
    2. Perform IO at blocklet level and uncompression at page level.

----


---

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4289/



---

[GitHub] carbondata pull request #2210: [WIP] [CARBONDATA-2381] Improve compaction pe...

Posted by manishgupta88 <gi...@git.apache.org>.

Github user manishgupta88 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2210#discussion_r184326083
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/stats/QueryStatisticsConstants.java ---
    @@ -58,6 +58,28 @@
     
       String PAGE_SCANNED = "The number of page scanned";
     
    +  /**
    +   * measure filling time includes time taken for reading all measures data from a given offset
    +   * and adding each column data to an array. Includes total time for 1 query result iterator.
    +   */
    +  String MEASURE_FILLING_TIME = "measure filling time";
    +
    +  /**
    +   * dimension filling time includes time taken for reading all dimensions data from a given offset
    +   * and filling each column data to byte array. Includes total time for 1 query result iterator.
    +   */
    +  String DIMENSION_FILLING_TIME = "dimension filling time";
    --- End diff --
    
    ok


---

[GitHub] carbondata pull request #2210: [CARBONDATA-2381] Improve compaction performa...

Posted by manishgupta88 <gi...@git.apache.org>.

Github user manishgupta88 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2210#discussion_r184610515
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/scan/collector/ResultCollectorFactory.java ---
    @@ -45,31 +46,37 @@
        * @return
        */
       public static AbstractScannedResultCollector getScannedResultCollector(
    -      BlockExecutionInfo blockExecutionInfo) {
    +      BlockExecutionInfo blockExecutionInfo, QueryStatisticsModel queryStatisticsModel) {
    --- End diff --
    
    ok


---

[GitHub] carbondata pull request #2210: [CARBONDATA-2381] Improve compaction performa...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/carbondata/pull/2210


---

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

Posted by ravipesala <gi...@git.apache.org>.

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4575/



---

[GitHub] carbondata issue #2210: [WIP] [CARBONDATA-2381] Improve compaction performan...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4150/



---

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

Posted by manishgupta88 <gi...@git.apache.org>.

Github user manishgupta88 commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    retest sdv please


---

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

Posted by ravipesala <gi...@git.apache.org>.

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4558/



---

[GitHub] carbondata pull request #2210: [WIP] [CARBONDATA-2381] Improve compaction pe...

Posted by kumarvishal09 <gi...@git.apache.org>.

Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2210#discussion_r183822226
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/scan/collector/ResultCollectorFactory.java ---
    @@ -45,31 +46,37 @@
        * @return
        */
       public static AbstractScannedResultCollector getScannedResultCollector(
    -      BlockExecutionInfo blockExecutionInfo) {
    +      BlockExecutionInfo blockExecutionInfo, QueryStatisticsModel queryStatisticsModel) {
    --- End diff --
    
    Instead of changing constructor, can we add statistics model to block execution info??


---

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5477/



---

[GitHub] carbondata issue #2210: [WIP] [CARBONDATA-2381] Improve compaction performan...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5348/



---

[GitHub] carbondata issue #2210: [WIP] [CARBONDATA-2381] Improve compaction performan...

Posted by kumarvishal09 <gi...@git.apache.org>.

Github user kumarvishal09 commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    I think we need these changes in case of DictionaryBasedResultCollector also, as when number of column is more than 100 it goes to dictionarybasedresult collector, it will improve the query performance when number of projection columns is more than 100 


---

[GitHub] carbondata issue #2210: [WIP] [CARBONDATA-2381] Improve compaction performan...

Posted by ravipesala <gi...@git.apache.org>.

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4491/



---

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5454/



---

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4310/



---

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4301/



---

[GitHub] carbondata pull request #2210: [WIP] [CARBONDATA-2381] Improve compaction pe...

Posted by kumarvishal09 <gi...@git.apache.org>.

Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2210#discussion_r183822969
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/stats/QueryStatisticsConstants.java ---
    @@ -58,6 +58,28 @@
     
       String PAGE_SCANNED = "The number of page scanned";
     
    +  /**
    +   * measure filling time includes time taken for reading all measures data from a given offset
    +   * and adding each column data to an array. Includes total time for 1 query result iterator.
    +   */
    +  String MEASURE_FILLING_TIME = "measure filling time";
    +
    +  /**
    +   * dimension filling time includes time taken for reading all dimensions data from a given offset
    +   * and filling each column data to byte array. Includes total time for 1 query result iterator.
    +   */
    +  String DIMENSION_FILLING_TIME = "dimension filling time";
    --- End diff --
    
    Change this to key column filling time


---

[GitHub] carbondata issue #2210: [WIP] [CARBONDATA-2381] Improve compaction performan...

Posted by manishgupta88 <gi...@git.apache.org>.

Github user manishgupta88 commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    @kumarvishal09 ...I agree with you that we need these changes in DictionaryBasedResultCollector also to improve the query performance when number of columns are greater than 100.
    As this PR is specifically for compaction performance fix, I will raise a separate PR for this. I have raised a sub jira task under same jira to track the issue.
    https://issues.apache.org/jira/browse/CARBONDATA-2405



---

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

Posted by ravipesala <gi...@git.apache.org>.

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    retest this please


---

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5420/



---

[GitHub] carbondata issue #2210: [WIP] [CARBONDATA-2381] Improve compaction performan...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4176/



---

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

Posted by ravipesala <gi...@git.apache.org>.

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4625/



---

[GitHub] carbondata issue #2210: [WIP] [CARBONDATA-2381] Improve compaction performan...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4145/



---

[GitHub] carbondata pull request #2210: [CARBONDATA-2381] Improve compaction performa...

Posted by kumarvishal09 <gi...@git.apache.org>.

Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2210#discussion_r184449793
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/scan/collector/ResultCollectorFactory.java ---
    @@ -45,31 +46,37 @@
        * @return
        */
       public static AbstractScannedResultCollector getScannedResultCollector(
    -      BlockExecutionInfo blockExecutionInfo) {
    +      BlockExecutionInfo blockExecutionInfo, QueryStatisticsModel queryStatisticsModel) {
    --- End diff --
    
    For one task we are creating only one QueryStatisticsModel even one task is handling multiple block (in case of merge small files) . Passing in constructor or keeping in blockexecutioninfo is same cost(same reference). Just to hold the reference we are changing multiple classes constructor. BlockExecutionInfo is also just a holder. If we add in abstract class constructor In future every concrete class needs to maintain querystatisticsmodel even if it doesn't use


---

[GitHub] carbondata issue #2210: [WIP] [CARBONDATA-2381] Improve compaction performan...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5329/



---

[GitHub] carbondata pull request #2210: [WIP] [CARBONDATA-2381] Improve compaction pe...

Posted by manishgupta88 <gi...@git.apache.org>.

Github user manishgupta88 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2210#discussion_r184327009
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/scan/collector/ResultCollectorFactory.java ---
    @@ -45,31 +46,37 @@
        * @return
        */
       public static AbstractScannedResultCollector getScannedResultCollector(
    -      BlockExecutionInfo blockExecutionInfo) {
    +      BlockExecutionInfo blockExecutionInfo, QueryStatisticsModel queryStatisticsModel) {
    --- End diff --
    
    I think it is better to pass in constructor instead of making the blockExecutionInfo object more heavier. Also because the QueryStatisticsModel is at the reader level and one reader can have multiple blocks so in my opinion keeping in blockExecutionInfo will not be a good idea.


---

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4265/



---

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

Posted by kumarvishal09 <gi...@git.apache.org>.

Github user kumarvishal09 commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    LGTM


---

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5432/



---

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4252/



---

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5464/



---