You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@carbondata.apache.org by ch...@apache.org on 2017/02/04 02:38:24 UTC
[25/35] incubator-carbondata-site git commit: Updated website for
CarbonData release 1.0.0
http://git-wip-us.apache.org/repos/asf/incubator-carbondata-site/blob/0d4cdb1c/content/docs/latest/dml-operation-on-carbondata.html
----------------------------------------------------------------------
diff --git a/content/docs/latest/dml-operation-on-carbondata.html b/content/docs/latest/dml-operation-on-carbondata.html
new file mode 100644
index 0000000..a6a6d75
--- /dev/null
+++ b/content/docs/latest/dml-operation-on-carbondata.html
@@ -0,0 +1,446 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+<h1>DML Operations on CarbonData</h1><p>This tutorial guides you through the data manipulation
+ language support provided by CarbonData.</p><h2>Overview</h2><p>The following DML operations are
+ supported in CarbonData :</p>
+<ul>
+ <li><a href="#load-data">LOAD DATA</a></li>
+ <li><a href="#insert-data">INSERT DATA INTO A CARBONDATA TABLE</a></li>
+ <li><a href="#show-segments">SHOW SEGMENTS</a></li>
+ <li><a href="#delete-id">DELETE SEGMENT BY ID</a></li>
+ <li><a href="#delete-date">DELETE SEGMENT BY DATE</a></li>
+ <li><a href="#update-carbondata">UPDATE CARBONDATA TABLE</a></li>
+ <li><a href="#delete-table">DELETE RECORDS FROM CARBONDATA TABLE</a></li>
+</ul><h2 id="load-data">LOAD DATA</h2><p>This command loads the user data in raw format to the
+ CarbonData specific data format store, this allows CarbonData to provide good performance while
+ querying the data. Please visit Data Management for more
+ details on LOAD.</p><h3>Syntax</h3><p><code>
+ LOAD DATA [LOCAL] INPATH 'folder_path'
+ INTO TABLE [db_name.]table_name
+ OPTIONS(property_name=property_value, ...)
+</code></p><p>OPTIONS are not mandatory for data loading process. Inside OPTIONS user can provide
+ either of any options like DELIMITER, QUOTECHAR, ESCAPECHAR, MULTILINE as per requirement.</p>
+<p>NOTE: The path shall be canonical path.</p><h3>Parameter Description</h3>
+<table class="table table-striped table-bordered">
+ <thead>
+ <tr>
+ <th>Parameter</th>
+ <th>Description</th>
+ <th>Optional</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>folder_path</td>
+ <td>Path of raw csv data folder or file.</td>
+ <td>NO</td>
+ </tr>
+ <tr>
+ <td>db_name</td>
+ <td>Database name, if it is not specified then it uses the current database.</td>
+ <td>YES</td>
+ </tr>
+ <tr>
+ <td>table_name</td>
+ <td>The name of the table in provided database.</td>
+ <td>NO</td>
+ </tr>
+ <tr>
+ <td>OPTIONS</td>
+ <td>Extra options provided to Load</td>
+ <td>YES</td>
+ </tr>
+ </tbody>
+</table><h3>Usage Guidelines</h3><p>You can use the following options to load data:</p>
+<ul>
+ <li><p><strong>DELIMITER:</strong> Delimiters can be provided in the load command.</p>
+ <p><code>
+ OPTIONS('DELIMITER'=',')
+ </code></p></li>
+ <li><p><strong>QUOTECHAR:</strong> Quote Characters can be provided in the load command.</p>
+ <p><code>
+ OPTIONS('QUOTECHAR'='"')
+ </code></p></li>
+ <li><p><strong>COMMENTCHAR:</strong> Comment Characters can be provided in the load command if
+ user want to comment lines.</p>
+ <p><code>
+ OPTIONS('COMMENTCHAR'='#')
+ </code></p></li>
+ <li><p><strong>FILEHEADER:</strong> Headers can be provided in the LOAD DATA command if headers
+ are missing in the source files.</p>
+ <p><code>
+ OPTIONS('FILEHEADER'='column1,column2')
+ </code></p></li>
+ <li><p><strong>MULTILINE:</strong> CSV with new line character in quotes.</p>
+ <p><code>
+ OPTIONS('MULTILINE'='true')
+ </code></p></li>
+ <li><p><strong>ESCAPECHAR:</strong> Escape char can be provided if user want strict validation
+ of escape character on CSV.</p>
+ <p><code>
+ OPTIONS('ESCAPECHAR'='\')
+ </code></p></li>
+ <li><p><strong>COMPLEX_DELIMITER_LEVEL_1:</strong> Split the complex type data column in a row
+ (eg., a$b$c --> Array = {a,b,c}).</p>
+ <p><code>
+ OPTIONS('COMPLEX_DELIMITER_LEVEL_1'='$')
+ </code></p></li>
+ <li><p><strong>COMPLEX_DELIMITER_LEVEL_2:</strong> Split the complex type nested data column in
+ a row. Applies level_1 delimiter & applies level_2 based on complex data type (eg.,
+ a:b$c:d --> Array> = {{a,b},{c,d}}).</p>
+ <p><code>
+ OPTIONS('COMPLEX_DELIMITER_LEVEL_2'=':')
+ </code></p></li>
+ <li><p><strong>ALL_DICTIONARY_PATH:</strong> All dictionary files path.</p>
+ <p><code>
+ OPTIONS('ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary')
+ </code></p></li>
+ <li><p><strong>COLUMNDICT:</strong> Dictionary file path for specified column.</p>
+ <p><code>
+ OPTIONS('COLUMNDICT'='column1:dictionaryFilePath1,
+ column2:dictionaryFilePath2')
+ </code></p>
+ <p>NOTE: ALL_DICTIONARY_PATH and COLUMNDICT can not be used together.</p></li>
+ <li><p><strong>DATEFORMAT:</strong> Date format for specified column.</p>
+ <p><code>
+ OPTIONS('DATEFORMAT'='column1:dateFormat1, column2:dateFormat2')
+ </code></p>
+ <p>NOTE: Date formats are specified by date pattern strings. The date pattern letters in
+ CarbonData are same as in JAVA. Refer to <a
+ href="http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html">SimpleDateFormat</a>.
+ </p></li>
+ <li><p><strong>USE_KETTLE:</strong> This option is used to specify whether to use kettle for
+ loading data or not. By default kettle is not used for data loading.</p>
+ <p><code>
+ OPTIONS('USE_KETTLE'='FALSE')
+ </code></p></li>
+ <p>Note : It is recommended to set the value for this option as false.</p>
+ <li><p><strong>SINGLE_PASS:</strong> Single Pass Loading enables single job to finish data
+ loading with dictionary generation on the fly. It enhances performance in the scenarios
+ where the subsequent data loading after initial load involves fewer incremental updates on
+ the dictionary.</p>
+
+ <p>This option specifies whether to use single pass for loading data or not. By default this
+ option is set to FALSE.</p>
+ <p><code>
+ OPTIONS('SINGLE_PASS'='TRUE')
+ </code></p>
+ <p>Note :</p>
+ <ul>
+ <li><p>If this option is set to TRUE then data loading will take less time.</p></li>
+ <li><p>If this option is set to some invalid value other than TRUE or FALSE then it uses
+ the default value.</p></li>
+ </ul>
+ </li>
+</ul>
+
+<h3>Example:</h3><p><pre><code>
+ LOAD DATA local inpath '/opt/rawdata/data.csv' INTO table carbontable
+ options('DELIMITER'=',', 'QUOTECHAR'='"','COMMENTCHAR'='#',
+ 'FILEHEADER'='empno,empname,designation,doj,workgroupcategory,workgroupcategoryname,deptno,deptname,projectcode,
+ projectjoindate,projectenddate,attendance,utilization,salary',
+ 'MULTILINE'='true','ESCAPECHAR'='\','COMPLEX_DELIMITER_LEVEL_1'='$',
+ 'COMPLEX_DELIMITER_LEVEL_2'=':',
+ 'ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary')
+</code></pre></p><h2 id="insert-data">INSERT DATA INTO A CARBONDATA TABLE</h2><p>This command inserts data
+ into a CarbonData table. It is defined as a combination of two queries Insert and Select query
+ respectively. It inserts records from a source table into a target CarbonData table. The source
+ table can be a Hive table, Parquet table or a CarbonData table itself. It comes with the
+ functionality to aggregate the records of a table by performing Select query on source table and
+ load its corresponding resultant records into a CarbonData table.</p><p><strong>NOTE</strong> :
+ The client node where the INSERT command is executing, must be part of the cluster.</p><h3>
+ Syntax</h3><p><code>
+ INSERT INTO TABLE <CARBONDATA TABLE> SELECT * FROM sourceTableName
+ [ WHERE { <filter_condition> } ];
+</code></p><p>You can also omit the <code>table</code> keyword and write your query as:</p><p><code>
+ INSERT INTO <CARBONDATA TABLE> SELECT * FROM sourceTableName
+ [ WHERE { <filter_condition> } ];
+</code></p><h3>Parameter Description</h3>
+<table class="table table-striped table-bordered">
+ <thead>
+ <tr>
+ <th>Parameter</th>
+ <th>Description</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>CARBON TABLE</td>
+ <td>The name of the Carbon table in which you want to perform the insert operation.</td>
+ </tr>
+ <tr>
+ <td>sourceTableName</td>
+ <td>The table from which the records are read and inserted into destination CarbonData
+ table.
+ </td>
+ </tr>
+ </tbody>
+</table><h3>Usage Guidelines</h3><p>The following condition must be met for successful insert
+ operation :</p>
+<ul>
+ <li>The source table and the CarbonData table must have the same table schema.</li>
+ <li>The table must be created.</li>
+ <li>Overwrite is not supported for CarbonData table.</li>
+ <li>The data type of source and destination table columns should be same, else the data from
+ source table will be treated as bad records and the INSERT command fails.
+ </li>
+ <li>INSERT INTO command does not support partial success if bad records are found, it will
+ fail.
+ </li>
+ <li>Data cannot be loaded or updated in source table while insert from source table to target
+ table is in progress.
+ </li>
+</ul><p>To enable data load or update during insert operation, configure the following property to
+ true.</p><p><code>
+ carbon.insert.persist.enable=true
+</code></p><p>By default the above configuration will be false.</p><p><strong>NOTE</strong>:
+ Enabling this property will reduce the performance.</p><h3>Examples</h3><p><code>
+ INSERT INTO table1 SELECT item1 ,sum(item2 + 1000) as result FROM
+ table2 group by item1;
+</code></p><p><code>
+ INSERT INTO table1 SELECT item1, item2, item3 FROM table2
+ where item2='xyz';
+</code></p><p><code>
+ INSERT INTO table1 SELECT * FROM table2
+ where exists (select * from table3
+ where table2.item1 = table3.item1);
+</code></p><p><strong>The Status Success/Failure shall be captured in the driver log.</strong></p>
+<h2 id="show-segments">SHOW SEGMENTS</h2><p>This command is used to get the segments of CarbonData
+ table.</p><p><code>
+ SHOW SEGMENTS FOR TABLE [db_name.]table_name
+ LIMIT number_of_segments;
+</code></p><h3>Parameter Description</h3>
+<table class="table table-striped table-bordered">
+ <thead>
+ <tr>
+ <th>Parameter</th>
+ <th>Description</th>
+ <th>Optional</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>db_name</td>
+ <td>Database name, if it is not specified then it uses the current database.</td>
+ <td>YES</td>
+ </tr>
+ <tr>
+ <td>table_name</td>
+ <td>The name of the table in provided database.</td>
+ <td>NO</td>
+ </tr>
+ <tr>
+ <td>number_of_segments</td>
+ <td>Limit the output to this number.</td>
+ <td>YES</td>
+ </tr>
+ </tbody>
+</table><h3>Example:</h3><p><code>
+ SHOW SEGMENTS FOR TABLE CarbonDatabase.CarbonTable LIMIT 4;
+</code></p><h2 id="delete-id">DELETE SEGMENT BY ID</h2><p>This command is used to delete segment by
+ using the segment ID. Each segment has a unique segment ID associated with it. Using this
+ segment ID, you can remove the segment.</p><p>The following command will get the segmentID.</p>
+<p><code>
+ SHOW SEGMENTS FOR Table dbname.tablename LIMIT number_of_segments
+</code></p><p>After you retrieve the segment ID of the segment that you want to delete, execute the
+ following command to delete the selected segment.</p><p><code>
+ DELETE SEGMENT segment_sequence_id1, segments_sequence_id2, ....
+ FROM TABLE tableName
+</code></p><h3>Parameter Description</h3>
+<table class="table table-striped table-bordered">
+ <thead>
+ <tr>
+ <th>Parameter</th>
+ <th>Description</th>
+ <th>Optional</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>segment_id</td>
+ <td>Segment Id of the load.</td>
+ <td>NO</td>
+ </tr>
+ <tr>
+ <td>db_name</td>
+ <td>Database name, if it is not specified then it uses the current database.</td>
+ <td>YES</td>
+ </tr>
+ <tr>
+ <td>table_name</td>
+ <td>The name of the table in provided database.</td>
+ <td>NO</td>
+ </tr>
+ </tbody>
+</table><h3>Example:</h3><p><code>
+ DELETE SEGMENT 0 FROM TABLE CarbonDatabase.CarbonTable;
+ DELETE SEGMENT 0.1,5,8 FROM TABLE CarbonDatabase.CarbonTable;
+</code> NOTE: Here 0.1 is compacted segment sequence id. </p><h2 id="delete-date">DELETE SEGMENT BY
+ DATE</h2><p>This command will allow to delete the CarbonData segment(s) from the store based on
+ the date provided by the user in the DML command. The segment created before the particular date
+ will be removed from the specific stores.</p><p><code>
+ DELETE FROM TABLE [schema_name.]table_name
+ WHERE[DATE_FIELD]BEFORE [DATE_VALUE]
+</code></p><h3>Parameter Description</h3>
+<table class="table table-striped table-bordered">
+ <thead>
+ <tr>
+ <th>Parameter</th>
+ <th>Description</th>
+ <th>Optional</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>DATE_VALUE</td>
+ <td>Valid segment load start time value. All the segments before this specified date will be
+ deleted.
+ </td>
+ <td>NO</td>
+ </tr>
+ <tr>
+ <td>db_name</td>
+ <td>Database name, if it is not specified then it uses the current database.</td>
+ <td>YES</td>
+ </tr>
+ <tr>
+ <td>table_name</td>
+ <td>The name of the table in provided database.</td>
+ <td>NO</td>
+ </tr>
+ </tbody>
+</table><h3>Example:</h3><p><code>
+ DELETE SEGMENTS FROM TABLE CarbonDatabase.CarbonTable
+ WHERE STARTTIME BEFORE '2017-06-01 12:05:06';
+</code></p><h2 id="update-carbondata">Update CarbonData Table</h2><p>This command will allow to
+ update the carbon table based on the column expression and optional filter conditions.</p><h3>
+ Syntax</h3><p><code>
+ UPDATE <table_name>
+ SET (column_name1, column_name2, ... column_name n) =
+ (column1_expression , column2_expression . .. column n_expression )
+ [ WHERE { <filter_condition> } ];
+</code></p><p>alternatively the following the command can also be used for updating the CarbonData
+ Table :</p><p><code>
+ UPDATE <table_name>
+ SET (column_name1, column_name2,) =
+ (select sourceColumn1, sourceColumn2 from sourceTable
+ [ WHERE { <filter_condition> } ] )
+ [ WHERE { <filter_condition> } ];
+</code></p><h3>Parameter Description</h3>
+<table class="table table-striped table-bordered">
+ <thead>
+ <tr>
+ <th>Parameter</th>
+ <th>Description</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>table_name</td>
+ <td>The name of the Carbon table in which you want to perform the update operation.</td>
+ </tr>
+ <tr>
+ <td>column_name</td>
+ <td>The destination columns to be updated.</td>
+ </tr>
+ <tr>
+ <td>sourceColumn</td>
+ <td>The source table column values to be updated in destination table.</td>
+ </tr>
+ <tr>
+ <td>sourceTable</td>
+ <td>The table from which the records are updated into destination Carbon table.</td>
+ </tr>
+ </tbody>
+</table><h3>Usage Guidelines</h3><p>The following conditions must be met for successful updation
+ :</p>
+<ul>
+ <li>The update command fails if multiple input rows in source table are matched with single row
+ in destination table.
+ </li>
+ <li>If the source table generates empty records, the update operation will complete successfully
+ without updating the table.
+ </li>
+ <li>If a source table row does not correspond to any of the existing rows in a destination
+ table, the update operation will complete successfully without updating the table.
+ </li>
+ <li>In sub-query, if the source table and the target table are same, then the update operation
+ fails.
+ </li>
+ <li>If the sub-query used in UPDATE statement contains aggregate method or group by query, then
+ the UPDATE operation fails.
+ </li>
+</ul><h3>Examples</h3><p>Update is not supported for queries that contain aggregate or group by.</p>
+<p><code>
+ UPDATE t_carbn01 a
+ SET (a.item_type_code, a.profit) = ( SELECT b.item_type_cd,
+ sum(b.profit) from t_carbn01b b
+ WHERE item_type_cd =2 group by item_type_code);
+</code></p><p>Here the Update Operation fails as the query contains aggregate function sum(b.profit)
+ and group by clause in the sub-query.</p><p><code>
+ UPDATE carbonTable1 d
+ SET(d.column3,d.column5 ) = (SELECT s.c33 ,s.c55
+ FROM sourceTable1 s WHERE d.column1 = s.c11)
+ WHERE d.column1 = 'china' EXISTS( SELECT * from table3 o where o.c2 > 1);
+</code></p><p><code>
+ UPDATE carbonTable1 d SET (c3) = (SELECT s.c33 from sourceTable1 s
+ WHERE d.column1 = s.c11)
+ WHERE exists( select * from iud.other o where o.c2 > 1);
+</code></p><p><code>
+ UPDATE carbonTable1 SET (c2, c5 ) = (c2 + 1, concat(c5 , "y" ));
+</code></p><p><code>
+ UPDATE carbonTable1 d SET (c2, c5 ) = (c2 + 1, "xyx")
+ WHERE d.column1 = 'india';
+</code></p><p><code>
+ UPDATE carbonTable1 d SET (c2, c5 ) = (c2 + 1, "xyx")
+ WHERE d.column1 = 'india'
+ and EXISTS( SELECT * FROM table3 o WHERE o.column2 > 1);
+</code></p><p><strong>The Status Success/Failure shall be captured in the driver log and the
+ client.</strong></p><h2 id="delete-table">Delete Records from CarbonData Table</h2><p>This
+ command allows us to delete records from CarbonData table.</p><h3>Syntax</h3><p><code>
+ DELETE FROM table_name [WHERE expression];
+</code></p><h3>Parameter Description</h3>
+<table class="table table-striped table-bordered">
+ <thead>
+ <tr>
+ <th>Parameter</th>
+ <th>Description</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>table_name</td>
+ <td>The name of the Carbon table in which you want to perform the delete.</td>
+ </tr>
+ </tbody>
+</table><h3>Examples</h3><p><code>
+ DELETE FROM columncarbonTable1 d WHERE d.column1 = 'china';
+</code></p><p><code>
+ DELETE FROM dest WHERE column1 IN ('china', 'USA');
+</code></p><p><code>
+ DELETE FROM columncarbonTable1
+ WHERE column1 IN (SELECT column11 FROM sourceTable2);
+</code></p><p><code>
+ DELETE FROM columncarbonTable1
+ WHERE column1 IN (SELECT column11 FROM sourceTable2 WHERE
+ column1 = 'USA');
+</code></p><p><code>
+ DELETE FROM columncarbonTable1 WHERE column2 >= 4
+</code></p><p><strong>The Status Success/Failure shall be captured in the driver log and the
+ client.</strong></p>
\ No newline at end of file