You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@carbondata.apache.org by kunal642 <gi...@git.apache.org> on 2019/01/02 10:11:52 UTC

[GitHub] carbondata pull request #3046: [WIP] Added check to start fallback based on ...

GitHub user kunal642 opened a pull request:

    https://github.com/apache/carbondata/pull/3046

    [WIP] Added check to start fallback based on size

    Be sure to do all of the following checklist to help us incorporate 
    your contribution quickly and easily:
    
     - [ ] Any interfaces changed?
     
     - [ ] Any backward compatibility impacted?
     
     - [ ] Document update required?
    
     - [ ] Testing done
            Please provide details on 
            - Whether new unit test cases have been added or why no new tests are required?
            - How it is tested? Please attach test report.
            - Is it a performance related change? Please attach the performance test report.
            - Any additional information to help reviewers in testing this change.
           
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. 
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kunal642/carbondata oom_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/3046.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3046
    
----
commit cfb870789eaa93ebe03393323456cdf2ed4daf17
Author: kunal642 <ku...@...>
Date:   2019-01-02T10:09:20Z

    fixed

----


---

[GitHub] carbondata issue #3046: [WIP] Added check to start fallback based on size

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2116/



---

[GitHub] carbondata issue #3046: [WIP] Fix OOM exception when dictionary map size is ...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Failed  with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10442/



---

[GitHub] carbondata issue #3046: [WIP] Fix OOM exception when dictionary map size is ...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2402/



---

[GitHub] carbondata issue #3046: [CARBONDATA-3231] Fix OOM exception when dictionary ...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2404/



---

[GitHub] carbondata issue #3046: [CARBONDATA-3231] Fix OOM exception when dictionary ...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2188/



---

[GitHub] carbondata pull request #3046: [WIP] Added check to start fallback based on ...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3046#discussion_r245510098
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ---
    @@ -2076,4 +2076,15 @@ private CarbonCommonConstants() {
        */
       public static final String CARBON_QUERY_DATAMAP_BLOOM_CACHE_SIZE_DEFAULT_VAL = "512";
     
    +  public static final String CARBON_LOCAL_DICTIONARY_MAX_THRESHOLD =
    --- End diff --
    
    I think we should optimize this variable name.
    The first time I saw this I thought it was duplicated with another threshold for local dictionary. One is number based, another is storage size based. Please take care of the readability.


---

[GitHub] carbondata issue #3046: [WIP] Added check to start fallback based on size

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2119/



---

[GitHub] carbondata issue #3046: [CARBONDATA-3231] Fix OOM exception when dictionary ...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    @xuchuanyin In near future we are planing to change threshold(currently based on number) to size based local dictionary. Size based threshold will give more control.
    Current changes in the PR is helping in doing that.
    Later Just have to expose the table property in create table command for user to control the size threshold.
    
    Also didn't get the meaning of your comment. these changes are minimal now also.


---

[GitHub] carbondata issue #3046: [WIP] Added check to start fallback based on size

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2113/



---

[GitHub] carbondata issue #3046: [CARBONDATA-3231] Fix OOM exception when dictionary ...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2204/



---

[GitHub] carbondata pull request #3046: [WIP] Added check to start fallback based on ...

Posted by kunal642 <gi...@git.apache.org>.
Github user kunal642 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3046#discussion_r245555497
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ---
    @@ -2076,4 +2076,15 @@ private CarbonCommonConstants() {
        */
       public static final String CARBON_QUERY_DATAMAP_BLOOM_CACHE_SIZE_DEFAULT_VAL = "512";
     
    +  public static final String CARBON_LOCAL_DICTIONARY_MAX_THRESHOLD =
    --- End diff --
    
    changed to CARBON_LOCAL_DICTIONARY_MAX_SIZE_THRESHOLD


---

[GitHub] carbondata pull request #3046: [CARBONDATA-3231] Fix OOM exception when dict...

Posted by kunal642 <gi...@git.apache.org>.
Github user kunal642 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3046#discussion_r245877242
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java ---
    @@ -1491,6 +1491,27 @@ private void validateSortMemorySpillPercentage() {
         }
       }
     
    +  public int getMaxDictionaryThreshold() {
    +    int localDictionaryMaxThreshold = Integer.parseInt(carbonProperties
    +        .getProperty(CarbonCommonConstants.CARBON_LOCAL_DICTIONARY_MAX_SIZE_THRESHOLD,
    +            CarbonCommonConstants.CARBON_LOCAL_DICTIONARY_MAX_SIZE_THRESHOLD_DEFAULT));
    +    if (localDictionaryMaxThreshold
    --- End diff --
    
    ok


---

[GitHub] carbondata issue #3046: [WIP] Added check to start fallback based on size

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10374/



---

[GitHub] carbondata issue #3046: [CARBONDATA-3231] Fix OOM exception when dictionary ...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10461/



---

[GitHub] carbondata pull request #3046: [CARBONDATA-3231] Fix OOM exception when dict...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3046#discussion_r246056127
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ---
    @@ -2076,4 +2076,15 @@ private CarbonCommonConstants() {
        */
       public static final String CARBON_QUERY_DATAMAP_BLOOM_CACHE_SIZE_DEFAULT_VAL = "512";
     
    +  public static final String CARBON_LOCAL_DICTIONARY_MAX_THRESHOLD =
    --- End diff --
    
    It still failed to make it clear that what kind of size it supposed to be since we have a storage size and a counting size.


---

[GitHub] carbondata pull request #3046: [CARBONDATA-3231] Fix OOM exception when dict...

Posted by kunal642 <gi...@git.apache.org>.
Github user kunal642 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3046#discussion_r245653091
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/page/DecoderBasedFallbackEncoder.java ---
    @@ -57,10 +57,7 @@ public DecoderBasedFallbackEncoder(EncodedColumnPage encodedColumnPage, int page
         int pageSize =
             encodedColumnPage.getActualPage().getPageSize();
         int offset = 0;
    -    int[] reverseInvertedIndex = new int[pageSize];
    --- End diff --
    
    In case of No_Sort where inverted index is not there, this variable was being created unnecessarily.


---

[GitHub] carbondata issue #3046: [WIP] Added check to start fallback based on size

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Failed  with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10367/



---

[GitHub] carbondata issue #3046: [WIP] Added check to start fallback based on size

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Failed  with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10380/



---

[GitHub] carbondata pull request #3046: [WIP] Added check to start fallback based on ...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3046#discussion_r244708172
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ---
    @@ -2076,4 +2076,15 @@ private CarbonCommonConstants() {
        */
       public static final String CARBON_QUERY_DATAMAP_BLOOM_CACHE_SIZE_DEFAULT_VAL = "512";
     
    +  public static final String CARBON_LOCAL_DICTIONARY_MAX_THRESHOLD =
    --- End diff --
    
    add a comment currently it is internal 


---

[GitHub] carbondata pull request #3046: [WIP] Added check to start fallback based on ...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3046#discussion_r244708432
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java ---
    @@ -1491,6 +1491,16 @@ private void validateSortMemorySpillPercentage() {
         }
       }
     
    +  public int getMaxDictionaryThreshold() {
    +    int localDictionaryMaxThreshold =  Integer.parseInt(carbonProperties
    --- End diff --
    
    add min max validation


---

[GitHub] carbondata issue #3046: [WIP] Added check to start fallback based on size

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2126/



---

[GitHub] carbondata issue #3046: [WIP] Added check to start fallback based on size

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Failed  with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10370/



---

[GitHub] carbondata issue #3046: [WIP] Added check to start fallback based on size

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Hi @kunal642 ,in your PR, the threshold size for storage of the local dictionary is specified by system (maybe later can be specified by user). But it will come up with an obvious problem that how can the use know the exactly value.
    
    I've read about Parquet that it will compare the dictionary encoded size with the original encoded size, only if the dictionary encoded size is smaller, will Parquet use it, otherwise it will fall back.
    
    So can the current implementation suite this scenario well?



---

[GitHub] carbondata issue #3046: [WIP] Fix OOM exception when dictionary map size is ...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2186/



---

[GitHub] carbondata issue #3046: [WIP] Added check to start fallback based on size

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2332/



---

[GitHub] carbondata pull request #3046: [CARBONDATA-3231] Fix OOM exception when dict...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3046#discussion_r245630539
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java ---
    @@ -1491,6 +1491,27 @@ private void validateSortMemorySpillPercentage() {
         }
       }
     
    +  public int getMaxDictionaryThreshold() {
    +    int localDictionaryMaxThreshold = Integer.parseInt(carbonProperties
    +        .getProperty(CarbonCommonConstants.CARBON_LOCAL_DICTIONARY_MAX_SIZE_THRESHOLD,
    +            CarbonCommonConstants.CARBON_LOCAL_DICTIONARY_MAX_SIZE_THRESHOLD_DEFAULT));
    +    if (localDictionaryMaxThreshold
    --- End diff --
    
    add min check also


---

[GitHub] carbondata pull request #3046: [WIP] Added check to start fallback based on ...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3046#discussion_r244708667
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/steps/CarbonRowDataWriterProcessorStepImpl.java ---
    @@ -128,16 +128,18 @@ public CarbonRowDataWriterProcessorStepImpl(CarbonDataLoadConfiguration configur
           CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
               .recordDictionaryValue2MdkAdd2FileTime(CarbonTablePath.DEPRECATED_PARTITION_ID,
                   System.currentTimeMillis());
    -
    +      ExecutorService fallBackExecutorService =
    +          Executors.newFixedThreadPool(1, new CarbonThreadFactory("FallBackPool:"));
    --- End diff --
    
    pool size cannot be always one ...please refer 
    org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java:203


---

[GitHub] carbondata issue #3046: [WIP] Added check to start fallback based on size

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2318/



---

[GitHub] carbondata issue #3046: [WIP] Added check to start fallback based on size

Posted by kunal642 <gi...@git.apache.org>.
Github user kunal642 commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    @xuchuanyin The problem was that when using varchar column with email data the key for the dictionary map is very huge. When fallback happens the same data is kept in memory twice, which causes the application to throw OOM exception.
    As this is a minor version therefore we do not want to expose this property to user, we can take in the next major version.
    
    Plus parquet also has a size based limitation mechanism which ensures the size does not grow more than what the system can handle.
    
    This PR is just to add the size based limitation so that the map size can be controlled.
    



---

[GitHub] carbondata issue #3046: [CARBONDATA-3231] Fix OOM exception when dictionary ...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2421/



---

[GitHub] carbondata issue #3046: [WIP] Added check to start fallback based on size

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2326/



---

[GitHub] carbondata issue #3046: [CARBONDATA-3231] Fix OOM exception when dictionary ...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    > This PR is just to add the size based limitation so that the map size can be controlled.
    @kunal642 Yeah, I noticed that.  So my proposal is that please make a reservation for minimal changes when we want to implement that feature (automatically size detect and fall back) later.


---

[GitHub] carbondata pull request #3046: [WIP] Added check to start fallback based on ...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3046#discussion_r245510146
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/page/DecoderBasedFallbackEncoder.java ---
    @@ -57,10 +57,7 @@ public DecoderBasedFallbackEncoder(EncodedColumnPage encodedColumnPage, int page
         int pageSize =
             encodedColumnPage.getActualPage().getPageSize();
         int offset = 0;
    -    int[] reverseInvertedIndex = new int[pageSize];
    --- End diff --
    
    What does these changes for?


---

[GitHub] carbondata issue #3046: [CARBONDATA-3231] Fix OOM exception when dictionary ...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    We do not need to expose this threshold to the user. Instead, we can judge ourselves in carbondata.
    
    Step1. We can get the size of non-dictionary-encoded page (say M) and the size of dictionary-encoded page (say N). 
    Step2: if M/N >=1 (or M/N >= 0.9), we can fallback automatically.
    
    Parquet (maybe ORC) behaves like this.



---

[GitHub] carbondata issue #3046: [WIP] Added check to start fallback based on size

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2321/



---

[GitHub] carbondata pull request #3046: [WIP] Added check to start fallback based on ...

Posted by qiuchenjian <gi...@git.apache.org>.
Github user qiuchenjian commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3046#discussion_r244708339
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java ---
    @@ -205,8 +205,10 @@ public AbstractFactDataWriter(CarbonFactDataHandlerModel model) {
           if (model.getNumberOfCores() > 1) {
             numberOfCores = model.getNumberOfCores() / 2;
           }
    -      fallbackExecutorService = Executors.newFixedThreadPool(numberOfCores, new CarbonThreadFactory(
    -          "FallbackPool:" + model.getTableName() + ", range: " + model.getBucketId()));
    +      fallbackExecutorService = model.getFallBackExecutorService() != null ?
    +          model.getFallBackExecutorService() :
    +          Executors.newFixedThreadPool(numberOfCores, new CarbonThreadFactory(
    --- End diff --
    
    Better to use "public CarbonThreadFactory(String name, boolean withTime)", so that diff threads have diff names


---

[GitHub] carbondata issue #3046: [CARBONDATA-3231] Fix OOM exception when dictionary ...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3046
  
    Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10444/



---