Posted to commits@carbondata.apache.org by ch...@apache.org on 2021/12/20 14:55:46 UTC

[carbondata] 02/14: Add FAQ How to manage mix file format in carbondata table.

This is an automated email from the ASF dual-hosted git repository.

chenliang613 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git

commit 598d1cee348887538a0d5a20dce6c2106a2b23f6
Author: bieremayi <li...@sina.com>
AuthorDate: Mon Nov 29 18:18:05 2021 +0800

    Add FAQ How to manage mix file format in carbondata table.
    
    1. add segment example.
    2. faq.md link to addsegment-guide.md
---
 docs/addsegment-guide.md | 94 +++++++++++++++++++++++++++++++++++++++++++++++-
 docs/faq.md              |  6 ++++
 2 files changed, 99 insertions(+), 1 deletion(-)

diff --git a/docs/addsegment-guide.md b/docs/addsegment-guide.md
index 648c426..3985beb 100644
--- a/docs/addsegment-guide.md
+++ b/docs/addsegment-guide.md
@@ -54,7 +54,7 @@ Users can add the existing data as a segment to the carbon table provided the sc
     ```
 
   - ##### FORMAT:
-   User old table file format. eg : json, parquet, jdbc, orc, csv, text
+   File format of the user's old table, e.g. parquet, orc.
 
     ```
     OPTIONS('FORMAT'='parquet')
@@ -73,6 +73,98 @@ In the above command user can add the existing data to the carbon table as a new
 During add segment, it infers the schema from the data and validates it against the carbon table.
 If the schema doesn’t match, it throws an exception.
 
+**Example:**
+
+Suppose there are existing old Hive partitioned tables, stored in the orc or parquet file format:
+
+
+```sql
+CREATE TABLE default.log_parquet_par (
+	id BIGINT,
+	event_time BIGINT,
+	ip STRING
+)PARTITIONED BY (                              
+	day INT,                                    
+	hour INT,                                   
+	type INT                                    
+)                                              
+STORED AS parquet
+LOCATION 'hdfs://bieremayi/user/hive/warehouse/log_parquet_par';
+```
+
+Parquet file locations:
+
+```
+25.1 K  75.2 K  /user/hive/warehouse/log_parquet_par/day=20211123/hour=12/type=0
+8.7 K   26.2 K  /user/hive/warehouse/log_parquet_par/day=20211123/hour=12/type=1
+```
+
+
+```sql
+CREATE TABLE default.log_orc_par (
+	id BIGINT,
+	event_time BIGINT,
+	ip STRING
+)PARTITIONED BY (                              
+	day INT,                                    
+	hour INT,                                   
+	type INT                                    
+)                                              
+STORED AS orc
+LOCATION 'hdfs://bieremayi/user/hive/warehouse/log_orc_par';
+```
+
+Orc file locations:
+
+```
+19.4 K  58.2 K  /user/hive/warehouse/log_orc_par/day=20211123/hour=12/type=0
+7.1 K   21.4 K  /user/hive/warehouse/log_orc_par/day=20211123/hour=12/type=1
+```
+
+**Steps:**
+
+Step 1: Create a carbondata format table with the same schema.
+
+```sql
+CREATE TABLE default.log_carbon_par (
+	id BIGINT,
+	event_time BIGINT,
+	ip STRING
+)PARTITIONED BY (                              
+	day INT,                                    
+	hour INT,                                   
+	type INT                                    
+)                                              
+STORED AS carbondata
+LOCATION 'hdfs://bieremayi/user/hive/warehouse/log_carbon_par';
+```
+
+Step 2: Execute the ADD SEGMENT SQL.
+
+|SQL|Is Success|Error Message|
+|:---|:---|:---|
+|alter table default.log_carbon_par add segment options ('path'= 'hdfs://bieremayi/user/hive/warehouse/log_parquet_par/','format'='parquet','partition'='day:int,hour:int,type:int');|Yes|/|
+|alter table default.log_carbon_par add segment options ('path'= 'hdfs://bieremayi/user/hive/warehouse/log_orc_par/','format'='orc','partition'='day:int,hour:int,type:int');|Yes|/|
+|alter table default.log_carbon_par add segment options ('path'= 'hdfs://bieremayi/user/hive/warehouse/log_parquet_par/day=20211123/hour=12/type=0','format'='parquet','partition'='day=20211123 and hour=12 and type=0');|No|Error in query: invalid partition option: Map(path -> hdfs://bieremayi/user/hive/warehouse/log_parquet_par/day=20211123/hour=12/type=0, format -> parquet, partition -> day=20211123 and hour=12 and type=0)|
+|alter table default.log_carbon_par add segment options ('path'= 'hdfs://bieremayi/user/hive/warehouse/log_parquet_par/day=20211123/hour=12/type=0','format'='parquet','partition'='day=20211123,hour=12,type=0');|No|Error in query: invalid partition option: Map(path -> hdfs://bieremayi/user/hive/warehouse/log_parquet_par/day=20211123/hour=12/type=0, format -> parquet, partition -> day=20211123,hour=12,type=0)|
+|alter table default.log_carbon_par add segment options ('path'= 'hdfs://bieremayi/user/hive/warehouse/log_parquet_par/day=20211123/hour=12/type=0','format'='parquet');|No|Error in query: partition option is required when adding segment to partition table|
+|alter table default.log_carbon_par add segment options ('path'= 'hdfs://bieremayi/user/hive/warehouse/log_parquet_par/day=20211123/hour=12/type=0','format'='parquet','partition'='day:int,hour:int,type:int');|No|ERROR SparkSQLDriver: Failed in [alter table default.log_carbon_par add segment options ('path'= 'hdfs://bieremayi/user/hive/warehouse/log_parquet_par/day=20211123/hour=12/type=0','format'='parquet','partition'='day:int,hour:int,type:int')]|
+
+Step 3: Check the result.
+
+```sql
+SHOW SEGMENTS FOR TABLE default.log_carbon_par;
+```
+
+|SegmentSequenceId|Status|Load Time|Elapsed Time |Partition|Data Size|Index Size|File Format|
+|:---|:---|:---|:---|:---|:---|:---|:---|
+|4       |Success |2021-11-29 17:59:40.819 |7.026S  |{day=20211123,hour=12,type=1}, ...      |29.86KB| 4.80KB  |columnar_v3|
+|3       |Success |2021-11-29 16:34:28.106 |0.418S  |{day=20211123,hour=12,type=0}   |19.40KB |NA     | orc|
+|2       |Success |2021-11-29 16:34:27.733 |0.222S  |{day=20211123,hour=12,type=1}   |7.12KB  |NA     | orc|
+|1       |Success |2021-11-29 16:30:17.207 |0.275S  |{day=20211123,hour=12,type=0}   |25.08KB |NA     | parquet|
+|0       |Success |2021-11-29 16:30:16.48  |0.452S  |{day=20211123,hour=12,type=1}   |8.72KB  |NA     | parquet|
+
+
 ### Changes to tablestatus file
 Carbon adds the new segment by adding segment information to tablestatus file. In order to add the path and format information to tablestatus, we are going to add `segmentPath`  and `format`  to the tablestatus file. 
 And any extra `options` will be added to the segment file.
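+
+For illustration, a segment entry in the tablestatus file could carry the new fields like this (a hypothetical sketch: only the `segmentPath` and `format` fields are named above; this is not the full tablestatus schema):
+
+```
+{
+  "segmentPath": "hdfs://bieremayi/user/hive/warehouse/log_parquet_par/",
+  "format": "parquet"
+}
+```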
diff --git a/docs/faq.md b/docs/faq.md
index a330e24..d31aad0 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -29,6 +29,7 @@
 * [Why different time zone result for select query output when query SDK writer output?](#why-different-time-zone-result-for-select-query-output-when-query-sdk-writer-output)
 * [How to check LRU cache memory footprint?](#how-to-check-lru-cache-memory-footprint)
 * [How to deal with the trailing task in query?](#How-to-deal-with-the-trailing-task-in-query)
+* [How to manage mix file format in carbondata table?](#how-to-manage-mix-file-format-in-carbondata-table)
 
 # TroubleShooting
 
@@ -213,6 +214,11 @@ spark.speculation is a group of configuration, that can monitor trailing tasks a
 
 spark.blacklist.enabled, avoid reduction of available executors due to blacklist mechanism.
 
+## How to manage mix file format in carbondata table
+
+[Heterogeneous format segments in carbondata](./addsegment-guide.md)
+
+
 ## Getting tablestatus.lock issues When loading data
 
   **Symptom**