Posted to issues@carbondata.apache.org by sraghunandan <gi...@git.apache.org> on 2018/08/01 07:39:12 UTC

[GitHub] carbondata pull request #2592: [WIP]Updated & enhanced Documentation of Carb...

GitHub user sraghunandan opened a pull request:

    https://github.com/apache/carbondata/pull/2592

    [WIP]Updated & enhanced Documentation of CarbonData

    Be sure to complete all of the following checklist items to help us incorporate
    your contribution quickly and easily:
    
     - [X] Any interfaces changed?
    NO 
     - [X] Any backward compatibility impacted?
    NO 
     - [X] Document update required?
    YES
     - [X] Testing done
            Please provide details on 
            - Whether new unit test cases have been added or why no new tests are required?
            - How it is tested? Please attach test report.
            - Is it a performance related change? Please attach the performance test report.
            - Any additional information to help reviewers in testing this change.
         NA 
     - [X] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. 
    NA


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sraghunandan/carbondata-1 update_documentation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2592.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2592
    
----
commit bc18916c91949707ec45a62dc46abcb8bdbf5253
Author: Raghunandan S <ca...@...>
Date:   2018-07-02T14:23:16Z

    Updated & enhanced Documentation of CarbonData

----


---

[GitHub] carbondata issue #2592: [WIP]Updated & enhanced Documentation of CarbonData

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2592
  
    Build Failed with Spark 2.1.0. Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7699/



---

[GitHub] carbondata issue #2592: [WIP]Updated & enhanced Documentation of CarbonData

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2592
  
    Build Success with Spark 2.1.0. Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7703/



---

[GitHub] carbondata issue #2592: [CARBONDATA-2915] Updated & enhanced Documentation o...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2592
  
    Build Success with Spark 2.1.0. Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/36/



---

[GitHub] carbondata issue #2592: [WIP]Updated & enhanced Documentation of CarbonData

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2592
  
    Build Success with Spark 2.2.1. Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6429/



---

[GitHub] carbondata issue #2592: [WIP]Updated & enhanced Documentation of CarbonData

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2592
  
    Build Failed with Spark 2.1.0. Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7702/



---

[GitHub] carbondata pull request #2592: [WIP]Updated & enhanced Documentation of Carb...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2592#discussion_r207084014
  
    --- Diff: docs/useful-tips-on-carbondata.md ---
    @@ -30,16 +30,16 @@
     
       - **Table Column Description**
     
    -  | Column Name | Data Type     | Cardinality | Attribution |
    -  |-------------|---------------|-------------|-------------|
    -  | msisdn      | String        | 30 million  | Dimension   |
    -  | BEGIN_TIME  | BigInt        | 10 Thousand | Dimension   |
    -  | HOST        | String        | 1 million   | Dimension   |
    -  | Dime_1      | String        | 1 Thousand  | Dimension   |
    -  | counter_1   | Decimal       | NA          | Measure     |
    -  | counter_2   | Numeric(20,0) | NA          | Measure     |
    -  | ...         | ...           | NA          | Measure     |
    -  | counter_100 | Decimal       | NA          | Measure     |
    +| Column Name | Data Type     | Cardinality | Attribution |
    +|-------------|---------------|-------------|-------------|
    +| msisdn      | String        | 30 million  | Dimension   |
    +| BEGIN_TIME  | BigInt        | 10 Thousand | Dimension   |
    +| HOST        | String        | 1 million   | Dimension   |
    +| Dime_1      | String        | 1 Thousand  | Dimension   |
    +| counter_1   | Decimal       | NA          | Measure     |
    +| counter_2   | Numeric(20,0) | NA          | Measure     |
    +| ...         | ...           | NA          | Measure     |
    +| counter_100 | Decimal       | NA          | Measure     |
     
     
       - **Put the frequently-used column filter in the beginning**
    --- End diff --
    
    Since we have changed the default behavior of sort_columns, I think this section can be removed, or we can change it to `Put the frequently-used column filter in the beginning of sort_columns`.
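
    For reference, the suggested guidance would look roughly like this (a minimal
    sketch in Spark SQL via Scala; the table and column names follow the example
    table quoted above and are illustrative, and the SparkSession setup is assumed):

        // Scala: lead SORT_COLUMNS with the most frequently filtered column
        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .appName("sort-columns-sketch")
          .getOrCreate()

        // msisdn is filtered most often, so it comes first in the sort order
        spark.sql(
          """CREATE TABLE IF NOT EXISTS carbon_table (
            |  msisdn STRING,
            |  begin_time BIGINT,
            |  host STRING,
            |  dime_1 STRING,
            |  counter_1 DECIMAL(30,10)
            |) STORED BY 'carbondata'
            |TBLPROPERTIES ('SORT_COLUMNS'='msisdn, host, begin_time')""".stripMargin)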


---

[GitHub] carbondata pull request #2592: [CARBONDATA-2915] Updated & enhanced Document...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/carbondata/pull/2592


---

[GitHub] carbondata pull request #2592: [CARBONDATA-2915] Updated & enhanced Document...

Posted by chenliang613 <gi...@git.apache.org>.
Github user chenliang613 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2592#discussion_r215659429
  
    --- Diff: docs/configuration-parameters.md ---
    @@ -16,152 +16,135 @@
     -->
     
     # Configuring CarbonData
    - This tutorial guides you through the advanced configurations of CarbonData :
    - 
    + This guide explains the configurations that can be used to tune CarbonData to achieve better performance. Some of the properties can be set dynamically and are explained in the section Dynamic Configuration In CarbonData Using SET-RESET. Most of the properties that control the internal settings have reasonable default values. They are listed along with an explanation of each property.
    --- End diff --
    
    Suggest removing this sentence: "Some of the properties can be set dynamically and are explained in the section Dynamic Configuration In CarbonData Using SET-RESET".
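
    For context, the dynamic configuration that the sentence refers to works
    roughly as follows (a minimal sketch; the property name is taken from the
    tuning table in this PR, and an existing SparkSession `spark` is assumed):

        // Scala: override a CarbonData property for the current session only
        spark.sql("SET carbon.number.of.cores.while.loading=4")

        // Revert dynamically set properties back to their configured defaults
        spark.sql("RESET")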


---

[GitHub] carbondata pull request #2592: [CARBONDATA-2915] Updated & enhanced Document...

Posted by sraghunandan <gi...@git.apache.org>.
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2592#discussion_r215672435
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ---
    @@ -470,15 +447,6 @@
        */
       @CarbonProperty
       public static final String CARBON_DATE_FORMAT = "carbon.date.format";
    -  /**
    -   * STORE_LOCATION_HDFS
    -   */
    -  @CarbonProperty
    -  public static final String STORE_LOCATION_HDFS = "carbon.storelocation.hdfs";
    --- End diff --
    
    It is not being used, so it can be removed.


---

[GitHub] carbondata pull request #2592: [WIP]Updated & enhanced Documentation of Carb...

Posted by sraghunandan <gi...@git.apache.org>.
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2592#discussion_r214955716
  
    --- Diff: docs/useful-tips-on-carbondata.md ---
    @@ -30,16 +30,16 @@
     
       - **Table Column Description**
     
    -  | Column Name | Data Type     | Cardinality | Attribution |
    -  |-------------|---------------|-------------|-------------|
    -  | msisdn      | String        | 30 million  | Dimension   |
    -  | BEGIN_TIME  | BigInt        | 10 Thousand | Dimension   |
    -  | HOST        | String        | 1 million   | Dimension   |
    -  | Dime_1      | String        | 1 Thousand  | Dimension   |
    -  | counter_1   | Decimal       | NA          | Measure     |
    -  | counter_2   | Numeric(20,0) | NA          | Measure     |
    -  | ...         | ...           | NA          | Measure     |
    -  | counter_100 | Decimal       | NA          | Measure     |
    +| Column Name | Data Type     | Cardinality | Attribution |
    +|-------------|---------------|-------------|-------------|
    +| msisdn      | String        | 30 million  | Dimension   |
    +| BEGIN_TIME  | BigInt        | 10 Thousand | Dimension   |
    +| HOST        | String        | 1 million   | Dimension   |
    +| Dime_1      | String        | 1 Thousand  | Dimension   |
    +| counter_1   | Decimal       | NA          | Measure     |
    +| counter_2   | Numeric(20,0) | NA          | Measure     |
    +| ...         | ...           | NA          | Measure     |
    +| counter_100 | Decimal       | NA          | Measure     |
     
     
       - **Put the frequently-used column filter in the beginning**
    --- End diff --
    
    updated


---

[GitHub] carbondata issue #2592: [CARBONDATA-2915] Updated & enhanced Documentation o...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2592
  
    Build Success with Spark 2.2.1. Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/237/



---

[GitHub] carbondata pull request #2592: [WIP]Updated & enhanced Documentation of Carb...

Posted by sraghunandan <gi...@git.apache.org>.
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2592#discussion_r214954243
  
    --- Diff: docs/useful-tips-on-carbondata.md ---
    @@ -158,18 +156,18 @@
       Recently we did some performance POC on CarbonData for Finance and telecommunication Field. It involved detailed queries and aggregation
       scenarios. After the completion of POC, some of the configurations impacting the performance have been identified and tabulated below :
     
    -  | Parameter | Location | Used For  | Description | Tuning |
    -  |----------------------------------------------|-----------------------------------|---------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
    -  | carbon.sort.intermediate.files.limit | spark/carbonlib/carbon.properties | Data loading | During the loading of data, local temp is used to sort the data. This number specifies the minimum number of intermediate files after which the  merge sort has to be initiated. | Increasing the parameter to a higher value will improve the load performance. For example, when we increase the value from 20 to 100, it increases the data load performance from 35MB/S to more than 50MB/S. Higher values of this parameter consumes  more memory during the load. |
    -  | carbon.number.of.cores.while.loading | spark/carbonlib/carbon.properties | Data loading | Specifies the number of cores used for data processing during data loading in CarbonData. | If you have more number of CPUs, then you can increase the number of CPUs, which will increase the performance. For example if we increase the value from 2 to 4 then the CSV reading performance can increase about 1 times |
    -  | carbon.compaction.level.threshold | spark/carbonlib/carbon.properties | Data loading and Querying | For minor compaction, specifies the number of segments to be merged in stage 1 and number of compacted segments to be merged in stage 2. | Each CarbonData load will create one segment, if every load is small in size it will generate many small file over a period of time impacting the query performance. Configuring this parameter will merge the small segment to one big segment which will sort the data and improve the performance. For Example in one telecommunication scenario, the performance improves about 2 times after minor compaction. |
    -  | spark.sql.shuffle.partitions | spark/conf/spark-defaults.conf | Querying | The number of task started when spark shuffle. | The value can be 1 to 2 times as much as the executor cores. In an aggregation scenario, reducing the number from 200 to 32 reduced the query time from 17 to 9 seconds. |
    -  | spark.executor.instances/spark.executor.cores/spark.executor.memory | spark/conf/spark-defaults.conf | Querying | The number of executors, CPU cores, and memory used for CarbonData query. | In the bank scenario, we provide the 4 CPUs cores and 15 GB for each executor which can get good performance. This 2 value does not mean more the better. It needs to be configured properly in case of limited resources. For example, In the bank scenario, it has enough CPU 32 cores each node but less memory 64 GB each node. So we cannot give more CPU but less memory. For example, when 4 cores and 12GB for each executor. It sometimes happens GC during the query which impact the query performance very much from the 3 second to more than 15 seconds. In this scenario need to increase the memory or decrease the CPU cores. |
    -  | carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data loading | The buffer size to store records, returned from the block scan. | In limit scenario this parameter is very important. For example your query limit is 1000. But if we set this value to 3000 that means we get 3000 records from scan but spark will only take 1000 rows. So the 2000 remaining are useless. In one Finance test case after we set it to 100, in the limit 1000 scenario the performance increase about 2 times in comparison to if we set this value to 12000. |
    -  | carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading | Whether use YARN local directories for multi-table load disk load balance | If this is set it to true CarbonData will use YARN local directories for multi-table load disk load balance, that will improve the data load performance. |
    -  | carbon.use.multiple.temp.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use multiple YARN local directories during table data loading for disk load balance | After enabling 'carbon.use.local.dir', if this is set to true, CarbonData will use all YARN local directories during data load for disk load balance, that will improve the data load performance. Please enable this property when you encounter disk hotspot problem during data loading. |
    -  | carbon.sort.temp.compressor | spark/carbonlib/carbon.properties | Data loading | Specify the name of compressor to compress the intermediate sort temporary files during sort procedure in data loading. | The optional values are 'SNAPPY','GZIP','BZIP2','LZ4','ZSTD' and empty. By default, empty means that Carbondata will not compress the sort temp files. This parameter will be useful if you encounter disk bottleneck. |
    -  | carbon.load.skewedDataOptimization.enabled | spark/carbonlib/carbon.properties | Data loading | Whether to enable size based block allocation strategy for data loading. | When loading, carbondata will use file size based block allocation strategy for task distribution. It will make sure that all the executors process the same size of data -- It's useful if the size of your input data files varies widely, say 1MB~1GB. |
    -  | carbon.load.min.size.enabled | spark/carbonlib/carbon.properties | Data loading | Whether to enable node minumun input data size allocation strategy for data loading.| When loading, carbondata will use node minumun input data size allocation strategy for task distribution. It will make sure the node load the minimum amount of data -- It's useful if the size of your input data files very small, say 1MB~256MB,Avoid generating a large number of small files. |
    -  
    +| Parameter | Location | Used For  | Description | Tuning |
    +|----------------------------------------------|-----------------------------------|---------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
    +| carbon.sort.intermediate.files.limit | spark/carbonlib/carbon.properties | Data loading | During the loading of data, local temp is used to sort the data. This number specifies the minimum number of intermediate files after which the  merge sort has to be initiated. | Increasing the parameter to a higher value will improve the load performance. For example, when we increase the value from 20 to 100, it increases the data load performance from 35MB/S to more than 50MB/S. Higher values of this parameter consumes  more memory during the load. |
    +| carbon.number.of.cores.while.loading | spark/carbonlib/carbon.properties | Data loading | Specifies the number of cores used for data processing during data loading in CarbonData. | If you have more number of CPUs, then you can increase the number of CPUs, which will increase the performance. For example if we increase the value from 2 to 4 then the CSV reading performance can increase about 1 times |
    +| carbon.compaction.level.threshold | spark/carbonlib/carbon.properties | Data loading and Querying | For minor compaction, specifies the number of segments to be merged in stage 1 and number of compacted segments to be merged in stage 2. | Each CarbonData load will create one segment, if every load is small in size it will generate many small file over a period of time impacting the query performance. Configuring this parameter will merge the small segment to one big segment which will sort the data and improve the performance. For Example in one telecommunication scenario, the performance improves about 2 times after minor compaction. |
    +| spark.sql.shuffle.partitions | spark/conf/spark-defaults.conf | Querying | The number of task started when spark shuffle. | The value can be 1 to 2 times as much as the executor cores. In an aggregation scenario, reducing the number from 200 to 32 reduced the query time from 17 to 9 seconds. |
    +| spark.executor.instances/spark.executor.cores/spark.executor.memory | spark/conf/spark-defaults.conf | Querying | The number of executors, CPU cores, and memory used for CarbonData query. | In the bank scenario, we provide the 4 CPUs cores and 15 GB for each executor which can get good performance. This 2 value does not mean more the better. It needs to be configured properly in case of limited resources. For example, In the bank scenario, it has enough CPU 32 cores each node but less memory 64 GB each node. So we cannot give more CPU but less memory. For example, when 4 cores and 12GB for each executor. It sometimes happens GC during the query which impact the query performance very much from the 3 second to more than 15 seconds. In this scenario need to increase the memory or decrease the CPU cores. |
    +| carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data loading | The buffer size to store records, returned from the block scan. | In limit scenario this parameter is very important. For example your query limit is 1000. But if we set this value to 3000 that means we get 3000 records from scan but spark will only take 1000 rows. So the 2000 remaining are useless. In one Finance test case after we set it to 100, in the limit 1000 scenario the performance increase about 2 times in comparison to if we set this value to 12000. |
    +| carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading | Whether use YARN local directories for multi-table load disk load balance | If this is set it to true CarbonData will use YARN local directories for multi-table load disk load balance, that will improve the data load performance. |
    +| carbon.use.multiple.temp.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use multiple YARN local directories during table data loading for disk load balance | After enabling 'carbon.use.local.dir', if this is set to true, CarbonData will use all YARN local directories during data load for disk load balance, that will improve the data load performance. Please enable this property when you encounter disk hotspot problem during data loading. |
    +| carbon.sort.temp.compressor | spark/carbonlib/carbon.properties | Data loading | Specify the name of compressor to compress the intermediate sort temporary files during sort procedure in data loading. | The optional values are 'SNAPPY','GZIP','BZIP2','LZ4' and empty. By default, empty means that Carbondata will not compress the sort temp files. This parameter will be useful if you encounter disk bottleneck. |
    --- End diff --
    
    Missed during rebase; I have added it back.
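
    For reference, enabling the sort temp compressor is a one-line entry in
    spark/carbonlib/carbon.properties (a sketch; the value must be one of the
    supported names listed in the table, e.g. ZSTD once it is added back):

        # Compress intermediate sort temp files during data loading.
        # Empty (the default) means the sort temp files are not compressed.
        carbon.sort.temp.compressor=ZSTD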


---

[GitHub] carbondata pull request #2592: [CARBONDATA-2915] Updated & enhanced Document...

Posted by chenliang613 <gi...@git.apache.org>.
Github user chenliang613 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2592#discussion_r215655761
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ---
    @@ -470,15 +447,6 @@
        */
       @CarbonProperty
       public static final String CARBON_DATE_FORMAT = "carbon.date.format";
    -  /**
    -   * STORE_LOCATION_HDFS
    -   */
    -  @CarbonProperty
    -  public static final String STORE_LOCATION_HDFS = "carbon.storelocation.hdfs";
    --- End diff --
    
    Can you please explain why we need to remove "STORE_LOCATION_HDFS"?


---

[GitHub] carbondata issue #2592: [WIP]Updated & enhanced Documentation of CarbonData

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2592
  
    Build Failed with Spark 2.2.1. Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6428/



---

[GitHub] carbondata pull request #2592: [WIP]Updated & enhanced Documentation of Carb...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2592#discussion_r207084144
  
    --- Diff: docs/useful-tips-on-carbondata.md ---
    @@ -30,16 +30,16 @@
     
       - **Table Column Description**
     
    -  | Column Name | Data Type     | Cardinality | Attribution |
    -  |-------------|---------------|-------------|-------------|
    -  | msisdn      | String        | 30 million  | Dimension   |
    -  | BEGIN_TIME  | BigInt        | 10 Thousand | Dimension   |
    -  | HOST        | String        | 1 million   | Dimension   |
    -  | Dime_1      | String        | 1 Thousand  | Dimension   |
    -  | counter_1   | Decimal       | NA          | Measure     |
    -  | counter_2   | Numeric(20,0) | NA          | Measure     |
    -  | ...         | ...           | NA          | Measure     |
    -  | counter_100 | Decimal       | NA          | Measure     |
    +| Column Name | Data Type     | Cardinality | Attribution |
    +|-------------|---------------|-------------|-------------|
    +| msisdn      | String        | 30 million  | Dimension   |
    +| BEGIN_TIME  | BigInt        | 10 Thousand | Dimension   |
    +| HOST        | String        | 1 million   | Dimension   |
    +| Dime_1      | String        | 1 Thousand  | Dimension   |
    +| counter_1   | Decimal       | NA          | Measure     |
    +| counter_2   | Numeric(20,0) | NA          | Measure     |
    +| ...         | ...           | NA          | Measure     |
    +| counter_100 | Decimal       | NA          | Measure     |
     
     
       - **Put the frequently-used column filter in the beginning**
    --- End diff --
    
    The following section is similar, so the same comment applies there.


---

[GitHub] carbondata issue #2592: [WIP]Updated & enhanced Documentation of CarbonData

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2592
  
    Build Failed with Spark 2.2.1. Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6425/



---

[GitHub] carbondata pull request #2592: [WIP]Updated & enhanced Documentation of Carb...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2592#discussion_r207083747
  
    --- Diff: docs/useful-tips-on-carbondata.md ---
    @@ -158,18 +156,18 @@
       Recently we did some performance POC on CarbonData for Finance and telecommunication Field. It involved detailed queries and aggregation
       scenarios. After the completion of POC, some of the configurations impacting the performance have been identified and tabulated below :
     
    -  | Parameter | Location | Used For  | Description | Tuning |
    -  |----------------------------------------------|-----------------------------------|---------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
    -  | carbon.sort.intermediate.files.limit | spark/carbonlib/carbon.properties | Data loading | During the loading of data, local temp is used to sort the data. This number specifies the minimum number of intermediate files after which the  merge sort has to be initiated. | Increasing the parameter to a higher value will improve the load performance. For example, when we increase the value from 20 to 100, it increases the data load performance from 35MB/S to more than 50MB/S. Higher values of this parameter consumes  more memory during the load. |
    -  | carbon.number.of.cores.while.loading | spark/carbonlib/carbon.properties | Data loading | Specifies the number of cores used for data processing during data loading in CarbonData. | If you have more number of CPUs, then you can increase the number of CPUs, which will increase the performance. For example if we increase the value from 2 to 4 then the CSV reading performance can increase about 1 times |
    -  | carbon.compaction.level.threshold | spark/carbonlib/carbon.properties | Data loading and Querying | For minor compaction, specifies the number of segments to be merged in stage 1 and number of compacted segments to be merged in stage 2. | Each CarbonData load will create one segment, if every load is small in size it will generate many small file over a period of time impacting the query performance. Configuring this parameter will merge the small segment to one big segment which will sort the data and improve the performance. For Example in one telecommunication scenario, the performance improves about 2 times after minor compaction. |
    -  | spark.sql.shuffle.partitions | spark/conf/spark-defaults.conf | Querying | The number of task started when spark shuffle. | The value can be 1 to 2 times as much as the executor cores. In an aggregation scenario, reducing the number from 200 to 32 reduced the query time from 17 to 9 seconds. |
    -  | spark.executor.instances/spark.executor.cores/spark.executor.memory | spark/conf/spark-defaults.conf | Querying | The number of executors, CPU cores, and memory used for CarbonData query. | In the bank scenario, we provide the 4 CPUs cores and 15 GB for each executor which can get good performance. This 2 value does not mean more the better. It needs to be configured properly in case of limited resources. For example, In the bank scenario, it has enough CPU 32 cores each node but less memory 64 GB each node. So we cannot give more CPU but less memory. For example, when 4 cores and 12GB for each executor. It sometimes happens GC during the query which impact the query performance very much from the 3 second to more than 15 seconds. In this scenario need to increase the memory or decrease the CPU cores. |
    -  | carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data loading | The buffer size to store records, returned from the block scan. | In limit scenario this parameter is very important. For example your query limit is 1000. But if we set this value to 3000 that means we get 3000 records from scan but spark will only take 1000 rows. So the 2000 remaining are useless. In one Finance test case after we set it to 100, in the limit 1000 scenario the performance increase about 2 times in comparison to if we set this value to 12000. |
    -  | carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading | Whether use YARN local directories for multi-table load disk load balance | If this is set it to true CarbonData will use YARN local directories for multi-table load disk load balance, that will improve the data load performance. |
    -  | carbon.use.multiple.temp.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use multiple YARN local directories during table data loading for disk load balance | After enabling 'carbon.use.local.dir', if this is set to true, CarbonData will use all YARN local directories during data load for disk load balance, that will improve the data load performance. Please enable this property when you encounter disk hotspot problem during data loading. |
    -  | carbon.sort.temp.compressor | spark/carbonlib/carbon.properties | Data loading | Specify the name of compressor to compress the intermediate sort temporary files during sort procedure in data loading. | The optional values are 'SNAPPY','GZIP','BZIP2','LZ4','ZSTD' and empty. By default, empty means that Carbondata will not compress the sort temp files. This parameter will be useful if you encounter disk bottleneck. |
    -  | carbon.load.skewedDataOptimization.enabled | spark/carbonlib/carbon.properties | Data loading | Whether to enable size based block allocation strategy for data loading. | When loading, carbondata will use file size based block allocation strategy for task distribution. It will make sure that all the executors process the same size of data -- It's useful if the size of your input data files varies widely, say 1MB~1GB. |
    -  | carbon.load.min.size.enabled | spark/carbonlib/carbon.properties | Data loading | Whether to enable node minumun input data size allocation strategy for data loading.| When loading, carbondata will use node minumun input data size allocation strategy for task distribution. It will make sure the node load the minimum amount of data -- It's useful if the size of your input data files very small, say 1MB~256MB,Avoid generating a large number of small files. |
    -  
    +| Parameter | Location | Used For  | Description | Tuning |
    +|----------------------------------------------|-----------------------------------|---------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
    +| carbon.sort.intermediate.files.limit | spark/carbonlib/carbon.properties | Data loading | During the loading of data, local temp is used to sort the data. This number specifies the minimum number of intermediate files after which the  merge sort has to be initiated. | Increasing the parameter to a higher value will improve the load performance. For example, when we increase the value from 20 to 100, it increases the data load performance from 35MB/S to more than 50MB/S. Higher values of this parameter consumes  more memory during the load. |
    +| carbon.number.of.cores.while.loading | spark/carbonlib/carbon.properties | Data loading | Specifies the number of cores used for data processing during data loading in CarbonData. | If you have more number of CPUs, then you can increase the number of CPUs, which will increase the performance. For example if we increase the value from 2 to 4 then the CSV reading performance can increase about 1 times |
    +| carbon.compaction.level.threshold | spark/carbonlib/carbon.properties | Data loading and Querying | For minor compaction, specifies the number of segments to be merged in stage 1 and number of compacted segments to be merged in stage 2. | Each CarbonData load will create one segment, if every load is small in size it will generate many small file over a period of time impacting the query performance. Configuring this parameter will merge the small segment to one big segment which will sort the data and improve the performance. For Example in one telecommunication scenario, the performance improves about 2 times after minor compaction. |
    +| spark.sql.shuffle.partitions | spark/conf/spark-defaults.conf | Querying | The number of task started when spark shuffle. | The value can be 1 to 2 times as much as the executor cores. In an aggregation scenario, reducing the number from 200 to 32 reduced the query time from 17 to 9 seconds. |
    +| spark.executor.instances/spark.executor.cores/spark.executor.memory | spark/conf/spark-defaults.conf | Querying | The number of executors, CPU cores, and memory used for CarbonData query. | In the bank scenario, we provide the 4 CPUs cores and 15 GB for each executor which can get good performance. This 2 value does not mean more the better. It needs to be configured properly in case of limited resources. For example, In the bank scenario, it has enough CPU 32 cores each node but less memory 64 GB each node. So we cannot give more CPU but less memory. For example, when 4 cores and 12GB for each executor. It sometimes happens GC during the query which impact the query performance very much from the 3 second to more than 15 seconds. In this scenario need to increase the memory or decrease the CPU cores. |
    +| carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data loading | The buffer size to store records, returned from the block scan. | In limit scenario this parameter is very important. For example your query limit is 1000. But if we set this value to 3000 that means we get 3000 records from scan but spark will only take 1000 rows. So the 2000 remaining are useless. In one Finance test case after we set it to 100, in the limit 1000 scenario the performance increase about 2 times in comparison to if we set this value to 12000. |
    +| carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading | Whether use YARN local directories for multi-table load disk load balance | If this is set it to true CarbonData will use YARN local directories for multi-table load disk load balance, that will improve the data load performance. |
    +| carbon.use.multiple.temp.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use multiple YARN local directories during table data loading for disk load balance | After enabling 'carbon.use.local.dir', if this is set to true, CarbonData will use all YARN local directories during data load for disk load balance, that will improve the data load performance. Please enable this property when you encounter disk hotspot problem during data loading. |
    +| carbon.sort.temp.compressor | spark/carbonlib/carbon.properties | Data loading | Specify the name of compressor to compress the intermediate sort temporary files during sort procedure in data loading. | The optional values are 'SNAPPY','GZIP','BZIP2','LZ4' and empty. By default, empty means that Carbondata will not compress the sort temp files. This parameter will be useful if you encounter disk bottleneck. |
    --- End diff --
    
    Why was 'ZSTD' removed from the supported compressor list?


---

[GitHub] carbondata issue #2592: [CARBONDATA-2915] Updated & enhanced Documentation o...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2592
  
    Build Success with Spark 2.1.0. Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8307/



---

[GitHub] carbondata issue #2592: [CARBONDATA-2915] Updated & enhanced Documentation o...

Posted by chenliang613 <gi...@git.apache.org>.
Github user chenliang613 commented on the issue:

    https://github.com/apache/carbondata/pull/2592
  
    LGTM


---

[GitHub] carbondata issue #2592: [WIP]Updated & enhanced Documentation of CarbonData

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2592
  
    SDV Build Failed. Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6095/



---