You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@carbondata.apache.org by zzcclp <gi...@git.apache.org> on 2017/08/08 09:45:50 UTC

[GitHub] carbondata pull request #1245: [CARBONDATA-1366]Change rdd storage level to ...

GitHub user zzcclp opened a pull request:

    https://github.com/apache/carbondata/pull/1245

    [CARBONDATA-1366]Change rdd storage level to 'MEMORY_AND_DISK_SER' to improve loading performance when sort_scope=global_sort

    My testing env and configs are as followings:
    
    **Env:**
    6 executors, 9G mem + 6 cores per executor 
    
    **Configs:**
    SINGLE_PASS=true
    SORT_SCOPE=GLOBAL_SORT
    spark.memory.fraction=0.5
    
    if using 'convertRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)' in method 'org.apache.carbondata.spark.load.DataLoadProcessBuilderOnSpark.loadDataUsingGlobalSort', it takes about **7.2** min to load 144136697 lines (10.9 G parquet files), and if using 'convertRDD.persist(StorageLevel.MEMORY_AND_DISK)', it takes about **9.5** min to load 144136697 lines.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zzcclp/carbondata CARBONDATA-1366

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1245.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1245
    
----
commit b50a3bb590d53e7749665e2f20b6a78ca963561f
Author: Zhang Zhichao <44...@qq.com>
Date:   2017-08-08T09:42:40Z

    [CARBONDATA-1366]Change rdd storage level to 'MEMORY_AND_DISK_SER' to improve loading performance when sort_scope=global_sort
    
    Change rdd storage level to 'MEMORY_AND_DISK_SER' to improve loading performance when sort_scope=global_sort

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

Posted by zzcclp <gi...@git.apache.org>.
Github user zzcclp commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    retest this please 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    SDV Build Success with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/157/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata pull request #1245: [CARBONDATA-1366]add an option 'carbon.global...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1245#discussion_r132364955
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ---
    @@ -1304,6 +1304,18 @@
        */
       public static final String CARBON_USE_MULTI_TEMP_DIR_DEFAULT = "false";
     
    +  /**
    +   * Which storage level to persist rdd when sort_scope=global_sort
    +   */
    +  @CarbonProperty
    +  public static final String CARBON_GLOBAL_SORT_RDD_STORAGE_LEVEL =
    +      "carbon.global.sort.rdd.storage.level";
    +
    +  /**
    +   * default value for carbon.global.sort.rdd.storage.level
    --- End diff --
    
    In the comment, mention that the default value is designed for machine with big memory, if user environment has less memory, set the  `CARBON_GLOBAL_SORT_RDD_STORAGE_LEVEL` according to `MEMORY_AND_DISK` (see spark recommendation) and give a URL in the comment


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]Change rdd storage level to 'MEMORY...

Posted by watermen <gi...@git.apache.org>.
Github user watermen commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    @jackylk Agreed with @xuchuanyin, Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. So different environment correspond to different storage level. So here we'd better make a conf named 'storage_level', and the default value of it is MEMORY_ONLY(the same to default value in spark). We can get more info about storage level in spark website.(http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence) 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/163/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

Posted by zzcclp <gi...@git.apache.org>.
Github user zzcclp commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    @jackylk thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]Change rdd storage level to 'MEMORY...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/171/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/170/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

Posted by chenliang613 <gi...@git.apache.org>.
Github user chenliang613 commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

Posted by zzcclp <gi...@git.apache.org>.
Github user zzcclp commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

Posted by zzcclp <gi...@git.apache.org>.
Github user zzcclp commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    @jackylk please review, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

Posted by zzcclp <gi...@git.apache.org>.
Github user zzcclp commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/172/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata pull request #1245: [CARBONDATA-1366]add an option 'carbon.global...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1245#discussion_r132364666
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonUtil.java ---
    @@ -1714,6 +1714,45 @@ public static boolean isValidSortOption(String sortScope) {
       }
     
       /**
    +   * validate the storage level
    +   * @param storageLevel
    +   * @return boolean
    +   */
    +  public static boolean isValidStorageLevel(String storageLevel) {
    +    if (null == storageLevel || storageLevel.trim().equals("")) {
    +      return false;
    +    }
    +    switch (storageLevel.toUpperCase()) {
    +      case "DISK_ONLY":
    +        return true;
    --- End diff --
    
    I think for all cases you can keep one `return true` is enough


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata pull request #1245: [CARBONDATA-1366]add an option 'carbon.global...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/carbondata/pull/1245


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

Posted by zzcclp <gi...@git.apache.org>.
Github user zzcclp commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]Change rdd storage level to 'MEMORY...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    Would it be better to make a configuration (or read configuration from spark) for this?cc @jackylk 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

Posted by chenliang613 <gi...@git.apache.org>.
Github user chenliang613 commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/166/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata pull request #1245: [CARBONDATA-1366]add an option 'carbon.global...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1245#discussion_r132364478
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java ---
    @@ -876,6 +876,24 @@ public boolean isUseMultiTempDir() {
       }
     
       /**
    +   * Return valid storage level
    +   * @return String
    +   */
    +  public String getGlobalSortRddStorageLevel() {
    +    String storageLevel = getProperty(CarbonCommonConstants.CARBON_GLOBAL_SORT_RDD_STORAGE_LEVEL,
    +        CarbonCommonConstants.CARBON_GLOBAL_SORT_RDD_STORAGE_LEVEL_DEFAULT);
    +    boolean validateStorageLevel = CarbonUtil.isValidStorageLevel(storageLevel);
    +    if (!validateStorageLevel) {
    +      LOGGER.info("The " + CarbonCommonConstants.CARBON_GLOBAL_SORT_RDD_STORAGE_LEVEL
    --- End diff --
    
    change to error log


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    SDV Build Success with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/174/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]Change rdd storage level to 'MEMORY...

Posted by zzcclp <gi...@git.apache.org>.
Github user zzcclp commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    @jackylk @watermen @xuchuanyin thanks for your suggestion, I have added option 'carbon.global.sort.rdd.storage.level', please review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

Posted by zzcclp <gi...@git.apache.org>.
Github user zzcclp commented on the issue:

    https://github.com/apache/carbondata/pull/1245
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---