Posted to commits@carbondata.apache.org by ak...@apache.org on 2021/10/28 05:52:24 UTC

[carbondata] branch master updated: [CARBONDATA-4240]: Added missing properties on the configurations page

This is an automated email from the ASF dual-hosted git repository.

akashrn5 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git


The following commit(s) were added to refs/heads/master by this push:
     new 7d94691  [CARBONDATA-4240]: Added missing properties on the configurations page
7d94691 is described below

commit 7d94691deb3300624ce4b22c4563cb4b9da776fa
Author: pratyakshsharma <pr...@gmail.com>
AuthorDate: Wed Oct 27 13:51:07 2021 +0530

    [CARBONDATA-4240]: Added missing properties on the configurations page
    
    Why is this PR needed?
    A few user-facing properties were missing from the configurations page and have now been added.
    
    What changes were proposed in this PR?
    Addition of missing properties
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This Closes #4210
---
 docs/configuration-parameters.md | 31 +++++++++++++++++++++++++++----
 docs/ddl-of-carbondata.md        |  2 +-
 docs/quick-start-guide.md        |  4 ++--
 3 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/docs/configuration-parameters.md b/docs/configuration-parameters.md
index 73bf2ce..c24518a 100644
--- a/docs/configuration-parameters.md
+++ b/docs/configuration-parameters.md
@@ -52,6 +52,11 @@ This section provides the details of all the configurations required for the Car
 | carbon.trash.retention.days | 7 | This parameter specifies the number of days after which the timestamp based subdirectories are expired in the trash folder. Allowed Min value = 0, Allowed Max Value = 365 days|
 | carbon.clean.file.force.allowed | false | This parameter specifies if the clean files operation with force option is allowed or not.|
 | carbon.cdc.minmax.pruning.enabled | false | This parameter defines whether the min max pruning to be performed on the target table based on the source data. It will be useful when data is not sparse across target table which results in better pruning.|
+| carbon.blocklet.size | 64 MB | A CarbonData file consists of blocklets, which in turn consist of column pages. As per the latest V3 format, the default size of a blocklet is 64 MB. It is recommended not to change this value except for specific use cases. |
+| carbon.date.format | yyyy-MM-dd | This property specifies the format used for parsing incoming date field values (see the sketch after this table). |
+| carbon.lock.class | (none) | This specifies the implementation of the ICarbonLock interface to be used for acquiring locks during concurrent operations. |
+| carbon.data.file.version | V3 | This specifies the CarbonData file format version. The CarbonData file format has evolved over time from V1 to V3 in terms of metadata storage and IO-level pruning capabilities. You can find more details [here](https://carbondata.apache.org/file-structure-of-carbondata.html#carbondata-file-format). |
+| spark.carbon.hive.schema.store | false | CarbonData currently supports two different types of metastores for storing schemas. This property specifies whether the Hive metastore is to be used for storing and retrieving table schemas. |
 
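To make the system-level rows above concrete, here is a minimal sketch (editorial, not part of the commit) of setting such properties at runtime instead of editing carbon.properties. It assumes a CarbonData 2.x build on the classpath; the values are illustrative only.

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Minimal sketch: set a couple of the system properties documented above at
// runtime. Keys come from the table; the values shown are only illustrative.
object SystemPropertiesSketch {
  def main(args: Array[String]): Unit = {
    val props = CarbonProperties.getInstance()
    props.addProperty("carbon.date.format", "yyyy-MM-dd")   // format used to parse incoming date fields
    props.addProperty("carbon.data.file.version", "V3")     // carbondata file format version
    // Read a value back; the second argument is the fallback if the key is unset.
    println(props.getProperty("carbon.date.format", "yyyy-MM-dd"))
  }
}
```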
 ## Data Loading Configuration
 
@@ -70,6 +75,7 @@ This section provides the details of all the configurations required for the Car
 | carbon.load.global.sort.partitions | 0 | The number of partitions to use when shuffling data for global sort. Default value 0 means to use same number of map tasks as reduce tasks. **NOTE:** In general, it is recommended to have 2-3 tasks per CPU core in your cluster. |
 | carbon.sort.size | 100000 | Number of records to hold in memory to sort and write intermediate sort temp files. **NOTE:** Memory required for data loading will increase if you set this value higher. Besides, each thread will cache this amount of records. The number of threads is configured by *carbon.number.of.cores.while.loading*. |
 | carbon.options.bad.records.logger.enable | false | CarbonData can identify the records that are not conformant to schema and isolate them as bad records. Enabling this configuration will make CarbonData to log such bad records. **NOTE:** If the input data contains many bad records, logging them will slow down the over all data loading throughput. The data load operation status would depend on the configuration in ***carbon.bad.records.action***. |
+| carbon.options.bad.records.action | FAIL | This property supports four bad record actions: FORCE, REDIRECT, IGNORE and FAIL. If set to FORCE then it auto-corrects the data by storing the bad records as NULL. If set to REDIRECT then bad records are written to the raw CSV instead of being loaded. If set to IGNORE then bad records are neither loaded nor written to the raw CSV. If set to FAIL then data loading fails if any bad records are found. Also this property can be set at differ [...]
 | carbon.bad.records.action | FAIL | CarbonData in addition to identifying the bad records, can take certain actions on such data. This configuration can have four types of actions for bad records namely FORCE, REDIRECT, IGNORE and FAIL. If set to FORCE then it auto-corrects the data by storing the bad records as NULL. If set to REDIRECT then bad records are written to the raw CSV instead of being loaded. If set to IGNORE then bad records are neither loaded nor written to the raw CSV. If [...]
 | carbon.options.is.empty.data.bad.record | false | Based on the business scenarios, empty("" or '' or ,,) data can be valid or invalid. This configuration controls how empty data should be treated by CarbonData. If false, then empty ("" or '' or ,,) data will not be considered as bad record and vice versa. |
 | carbon.options.bad.record.path | (none) | Specifies the HDFS path where bad records are to be stored. By default the value is Null. This path must be configured by the user if ***carbon.options.bad.records.logger.enable*** is **true** or ***carbon.bad.records.action*** is **REDIRECT**. |
@@ -93,12 +99,15 @@ This section provides the details of all the configurations required for the Car
 | carbon.options.serialization.null.format | \N | Based on the business scenarios, some columns might need to be loaded with null values. As null value cannot be written in csv files, some special characters might be adopted to specify null values. This configuration can be used to specify the null values format in the data being loaded. |
 | carbon.column.compressor | snappy | CarbonData will compress the column values using the compressor specified by this configuration. Currently CarbonData supports 'snappy', 'zstd' and 'gzip' compressors. |
 | carbon.minmax.allowed.byte.count | 200 | CarbonData will write the min max values for string/varchar types column using the byte count specified by this configuration. Max value is 1000 bytes(500 characters) and Min value is 10 bytes(5 characters). **NOTE:** This property is useful for reducing the store size thereby improving the query performance but can lead to query degradation if value is not configured properly. | |
-| carbon.merge.index.failure.throw.exception | true | It is used to configure whether or not merge index failure should result in data load failure also. |
 | carbon.binary.decoder | None | Support configurable decode for loading. Two decoders supported: base64 and hex |
 | carbon.local.dictionary.size.threshold.inmb | 4 | size based threshold for local dictionary in MB, maximum allowed size is 16 MB. |
-| carbon.enable.bad.record.handling.for.insert | false | by default, disable the bad record and converter step during "insert into" |
-| carbon.load.si.repair | true | by default, enable loading for failed segments in SI during load/insert command |
+| carbon.enable.bad.record.handling.for.insert | false | By default, disable the bad record and converter step during "insert into" |
+| carbon.load.si.repair | true | By default, enable loading for failed segments in SI during load/insert command |
 | carbon.si.repair.limit | (none) | Number of failed segments to be loaded in SI when repairing missing segments in SI, by default load all the missing segments. Supports value from 0 to 2147483646 |
+| carbon.complex.delimiter.level.1 | # | This delimiter is used for parsing complex data type columns. The level 1 delimiter splits the complex type data column in a row (e.g., a\001b\001c --> Array = {a,b,c}). See the load sketch after this table. |
+| carbon.complex.delimiter.level.2 | $ | This delimiter splits the complex type nested data column in a row. The level_1 delimiter is applied first and then level_2, based on the complex data type (e.g., a\002b\001c\002d --> Array of Array = {{a,b},{c,d}}). |
+| carbon.complex.delimiter.level.3 | @ | This delimiter splits the complex type nested data column in a row. The level_1 delimiter is applied first, then level_2 and then level_3, based on the complex data type. Used in case of nested complex Map types (e.g., 'a\003b\002b\003c\001aa\003bb\002cc\003dd' --> Array of Map = {{a -> b, b -> c},{aa -> bb, cc -> dd}}). |
+| carbon.complex.delimiter.level.4 | (none) | All the levels of delimiters are used for parsing complex data type columns, applied according to the nesting depth of the given data type. The level 4 delimiter is used for parsing complex values after the level 3 delimiter has already been applied. |
 
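As a hedged illustration of the delimiter rows above (editorial, not part of the commit), the same settings can also be passed as per-load options. The sketch below assumes a Spark session with CarbonExtensions enabled; the table name and input path are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: load a table with an array<string> column, overriding the complex
// type delimiters for this load. With COMPLEX_DELIMITER_LEVEL_1 = '#', the
// CSV value "a#b#c" is parsed into Array(a, b, c).
object ComplexDelimiterLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ComplexDelimiterLoadSketch")
      .config("spark.sql.extensions", "org.apache.spark.sql.CarbonExtensions")
      .getOrCreate()

    spark.sql("CREATE TABLE IF NOT EXISTS demo (id INT, tags ARRAY<STRING>) STORED AS carbondata")
    spark.sql(
      """LOAD DATA INPATH 'hdfs://namenode/data/demo.csv' INTO TABLE demo
        |OPTIONS('COMPLEX_DELIMITER_LEVEL_1'='#', 'COMPLEX_DELIMITER_LEVEL_2'='$')""".stripMargin)

    spark.sql("SELECT id, tags FROM demo").show(false)
    spark.stop()
  }
}
```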
 ## Compaction Configuration
 
@@ -113,12 +122,13 @@ This section provides the details of all the configurations required for the Car
 | carbon.numberof.preserve.segments | 0 | If the user wants to preserve some number of segments from being compacted then he can set this configuration. Example: carbon.numberof.preserve.segments = 2 then 2 latest segments will always be excluded from the compaction. No segments will be preserved by default. **NOTE:** This configuration is useful when the chances of input data can be wrong due to environment scenarios. Preserving some of the latest segments from being compacted can help  [...]
 | carbon.allowed.compaction.days | 0 | This configuration is used to control on the number of recent segments that needs to be compacted, ignoring the older ones. This configuration is in days. For Example: If the configuration is 2, then the segments which are loaded in the time frame of past 2 days only will get merged. Segments which are loaded earlier than 2 days will not be merged. This configuration is disabled by default. **NOTE:** This configuration is useful when a bulk of histo [...]
 | carbon.enable.auto.load.merge | false | Compaction can be automatically triggered once data load completes. This ensures that the segments are merged in time and thus query times does not increase with increase in segments. This configuration enables to do compaction along with data loading. **NOTE:** Compaction will be triggered once the data load completes. But the status of data load wait till the compaction is completed. Hence it might look like data loading time has increased, but [...]
-| carbon.enable.page.level.reader.in.compaction|false|Enabling page level reader for compaction reduces the memory usage while compacting more number of segments. It allows reading only page by page instead of reading whole blocklet to memory. **NOTE:** Please refer to [file-structure-of-carbondata](./file-structure-of-carbondata.md#carbondata-file-format) to understand the storage format of CarbonData and concepts of pages.|
+| carbon.enable.page.level.reader.in.compaction | false | Enabling page level reader for compaction reduces the memory usage while compacting more number of segments. It allows reading only page by page instead of reading whole blocklet to memory. **NOTE:** Please refer to [file-structure-of-carbondata](./file-structure-of-carbondata.md#carbondata-file-format) to understand the storage format of CarbonData and concepts of pages.|
 | carbon.concurrent.compaction | true | Compaction of different tables can be executed concurrently. This configuration determines whether to compact all qualifying tables in parallel or not. **NOTE:** Compacting concurrently is a resource demanding operation and needs more resources there by affecting the query performance also. This configuration is **deprecated** and might be removed in future releases. |
 | carbon.compaction.prefetch.enable | false | Compaction operation is similar to Query + data load where in data from qualifying segments are queried and data loading performed to generate a new single segment. This configuration determines whether to query ahead data from segments and feed it for data loading. **NOTE:** This configuration is disabled by default as it needs extra resources for querying extra data. Based on the memory availability on the cluster, user can enable it to imp [...]
 | carbon.enable.range.compaction | true | To configure Range-based Compaction to be used or not for RANGE_COLUMN. If true after compaction also the data would be present in ranges. |
 | carbon.si.segment.merge | false | Making this true degrades the LOAD performance. When the number of small files increase for SI segments(it can happen as number of columns will be less and we store position id and reference columns), user can either set to true which will merge the data files for upcoming loads or run SI refresh command which does this job for all segments. (REFRESH INDEX <index_table>) |
 | carbon.partition.data.on.tasklevel | false | When enabled, tasks launched for Local sort partition load will be based on one node one task. Compaction will be performed based on task level for a partition. Load performance might be degraded, because, the number of tasks launched is equal to number of nodes in case of local sort. For compaction, memory consumption will be less, as more number of tasks will be launched for a partition |
+| carbon.minor.compaction.size | (none) | Minor compaction originally worked based on the number of segments (by default 4). However, in that scenario there was no control over the size of the segments to be compacted. This parameter was introduced to exclude segments whose size is greater than the configured threshold so that the overall IO and time taken decrease. See the sketch after this table. |
 
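A small, hedged sketch (editorial, not part of the commit) of how carbon.minor.compaction.size might be used: cap the segment size considered for minor compaction for the session, then trigger compaction. It assumes a CarbonData-enabled spark-shell session and the hypothetical `demo` table from the earlier sketch; the threshold value is illustrative and its unit is assumed to be MB.

```scala
// Sketch: exclude large segments from minor compaction for this session, then
// trigger it. Assumes a CarbonData-enabled SparkSession named `spark` and a
// table named `demo`; the 512 threshold is illustrative (assumed to be in MB).
spark.sql("SET carbon.minor.compaction.size=512")
spark.sql("ALTER TABLE demo COMPACT 'MINOR'")
```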
 ## Query Configuration
 
@@ -151,6 +161,16 @@ This section provides the details of all the configurations required for the Car
 | carbon.partition.max.driver.lru.cache.size | -1 | Maximum memory **(in MB)** up to which driver can cache partition metadata. Beyond this, least recently used data will be removed from cache before loading a new set of values.
 | carbon.mapOrderPushDown.<db_name>_<table_name>.column| empty | If order by column is in sort column, specify that sort column here to avoid ordering at map task . |
 | carbon.metacache.expiration.seconds | Long.MAX_VALUE | Expiration time **(in seconds)** for tableInfo cache in CarbonMetadata and tableModifiedTime in CarbonFileMetastore, after the time configured since last access to the cache entry, tableInfo and tableModifiedTime will be removed from each cache. Recent access will refresh the timer. Default value of Long.MAX_VALUE means the cache will not be expired by time. **NOTE:** At the time when cache is being expired, queries on the table ma [...]
+| is.driver.instance | false | This parameter decides whether the LRU cache for storing indexes needs to be created on the driver. By default, it is created on the executors. |
+| carbon.input.metrics.update.interval | 500000 | This property determines the number of records queried after which input metrics are updated to Spark. It can also be set dynamically within the Spark session (see the sketch after this table). |
+| carbon.use.bitset.pipe.line | true | CarbonData has various optimizations for faster query execution. This property speeds up filter queries: if set to true, the bitset is passed from one filter to another, resulting in incremental filtering and improving overall performance. |
+
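Since carbon.input.metrics.update.interval is noted above as settable within the session, here is a hedged spark-shell-style sketch (editorial, not part of the commit). It assumes a CarbonData-enabled `spark` session; the value is illustrative.

```scala
// Sketch: update Spark input metrics more frequently for this session only.
// Assumes a CarbonData-enabled SparkSession named `spark`; 100000 is illustrative.
spark.sql("SET carbon.input.metrics.update.interval=100000")
// RESET clears session-level overrides added via SET.
spark.sql("RESET")
```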
+## Index Configuration
+
+| Parameter | Default Value | Description |
+|--------------------------------------|---------------|---------------------------------------------------|
+| carbon.lucene.index.stop.words | false | By default, Lucene does not create an index for stop words like 'is', 'the' etc. This flag is used to override that behaviour (see the sketch after this table). |
+| carbon.load.dateformat.setlenient.enable | false | This property enables lenient parsing of timestamp/date data during data load when strict parsing fails with an invalid timestamp error. For example, 1941-03-15 00:00:00 is a valid time in the Asia/Calcutta zone, but is invalid and fails to parse in the Asia/Shanghai zone because DST was observed and clocks were turned forward 1 hour to 1941-03-15 01:00:00. |
+| carbon.indexserver.tempfolder.deletetime | 10800000 | This specifies the time period in milliseconds after which the temp folder gets deleted from the index server. |
 
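A hedged sketch (editorial, not part of the commit) for the Lucene stop-word flag above: enable indexing of stop words before building a Lucene index on a string column. It assumes a CarbonData-enabled spark-shell session and that the flag is picked up via CarbonProperties at index creation time; the table and column names are hypothetical.

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Sketch: index stop words ('is', 'the', ...) too, then build a Lucene index.
// Assumes a CarbonData-enabled SparkSession named `spark`; the `reviews` table
// and its `comment_text` STRING column are hypothetical.
CarbonProperties.getInstance()
  .addProperty("carbon.lucene.index.stop.words", "true")

spark.sql("CREATE INDEX review_text_idx ON TABLE reviews (comment_text) AS 'lucene'")
```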
 ## Data Mutation Configuration
 | Parameter | Default Value | Description |
@@ -237,6 +257,9 @@ RESET
 | carbon.enable.index.server                | To use index server for caching and pruning. This property can be used for a session or for a particular table with ***carbon.enable.index.server.<db_name>.<table_name>***. |
 | carbon.reorder.filter                     | This property can be used to enabled/disable filter reordering. Should be disabled only when the user has optimized the filter condition. | 
 | carbon.mapOrderPushDown.<db_name>_<table_name>.column | If order by column is in sort column, specify that sort column here to avoid ordering at map task . |
+| carbon.load.dateformat.setlenient.enable | To enable lenient parsing of timestamp/date data during data load when parsing fails with an invalid timestamp error. **NOTE:** Refer to carbon.load.dateformat.setlenient.enable under [Index Configuration](#index-configuration) for detailed information. |
+| carbon.minor.compaction.size | Puts an upper limit on the size of segments to be included for minor compaction. **NOTE:** Refer to carbon.minor.compaction.size under [Compaction Configuration](#compaction-configuration) for detailed information. |
+| carbon.input.metrics.update.interval | Determines the number of records queried after which input metrics are updated to Spark. **NOTE:** Refer to carbon.input.metrics.update.interval under [Query Configuration](#query-configuration) for detailed information. |
 **Examples:**
 
 * Add or Update:
diff --git a/docs/ddl-of-carbondata.md b/docs/ddl-of-carbondata.md
index b37b3ab..dbf616b 100644
--- a/docs/ddl-of-carbondata.md
+++ b/docs/ddl-of-carbondata.md
@@ -641,7 +641,7 @@ CarbonData DDL statements are documented here,which includes:
   This function creates a new database. By default the database is created in location 'spark.sql.warehouse.dir', but you can also specify custom location by configuring 'spark.sql.warehouse.dir', the configuration 'carbon.storelocation' has been deprecated.
 
   **Note:**
-    For simplicity, we recommended you remove the configuration of carbon.storelocation. If carbon.storelocaiton and spark.sql.warehouse.dir are configured to different paths, exception will be thrown when CREATE DATABASE and DROP DATABASE to avoid inconsistent database location.
+    For simplicity, we recommend you remove the configuration of carbon.storelocation. If carbon.storelocation and spark.sql.warehouse.dir are configured to different paths, an exception will be thrown on CREATE DATABASE and DROP DATABASE to avoid an inconsistent database location.
 
 
   ```
diff --git a/docs/quick-start-guide.md b/docs/quick-start-guide.md
index 4782917..0d9cee1 100644
--- a/docs/quick-start-guide.md
+++ b/docs/quick-start-guide.md
@@ -259,7 +259,7 @@ carbon.sql(
 
 3. Add the carbonlib folder path in the Spark classpath. (Edit `$SPARK_HOME/conf/spark-env.sh` file and modify the value of `SPARK_CLASSPATH` by appending `$SPARK_HOME/carbonlib/*` to the existing value)
 
-4. Copy the `./conf/carbon.properties.template` file from CarbonData repository to `$SPARK_HOME/conf/` folder and rename the file to `carbon.properties`.
+4. Copy the `./conf/carbon.properties.template` file from CarbonData repository to `$SPARK_HOME/conf/` folder and rename the file to `carbon.properties`. All the CarbonData-related properties are configured in this file.
 
 5. Repeat Step 2 to Step 5 in all the nodes of the cluster.
 
@@ -304,7 +304,7 @@ carbon.sql(
 
    **NOTE**: Create the carbonlib folder if it does not exist inside the `$SPARK_HOME` path.
 
-2. Copy the `./conf/carbon.properties.template` file from CarbonData repository to `$SPARK_HOME/conf/` folder and rename the file to `carbon.properties`.
+2. Copy the `./conf/carbon.properties.template` file from CarbonData repository to `$SPARK_HOME/conf/` folder and rename the file to `carbon.properties`. All the CarbonData-related properties are configured in this file.
 
 3. Create `tar.gz` file of carbonlib folder and move it inside the carbonlib folder.