Posted to commits@carbondata.apache.org by ch...@apache.org on 2017/02/04 02:38:24 UTC

[25/35] incubator-carbondata-site git commit: Updated website for CarbonData release 1.0.0

http://git-wip-us.apache.org/repos/asf/incubator-carbondata-site/blob/0d4cdb1c/content/docs/latest/dml-operation-on-carbondata.html
----------------------------------------------------------------------
diff --git a/content/docs/latest/dml-operation-on-carbondata.html b/content/docs/latest/dml-operation-on-carbondata.html
new file mode 100644
index 0000000..a6a6d75
--- /dev/null
+++ b/content/docs/latest/dml-operation-on-carbondata.html
@@ -0,0 +1,446 @@
+<!--
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+-->
+<h1>DML Operations on CarbonData</h1><p>This tutorial guides you through the data manipulation
+    language support provided by CarbonData.</p><h2>Overview</h2><p>The following DML operations are
+    supported in CarbonData:</p>
+<ul>
+    <li><a href="#load-data">LOAD DATA</a></li>
+    <li><a href="#insert-data">INSERT DATA INTO A CARBONDATA TABLE</a></li>
+    <li><a href="#show-segments">SHOW SEGMENTS</a></li>
+    <li><a href="#delete-id">DELETE SEGMENT BY ID</a></li>
+    <li><a href="#delete-date">DELETE SEGMENT BY DATE</a></li>
+    <li><a href="#update-carbondata">UPDATE CARBONDATA TABLE</a></li>
+    <li><a href="#delete-table">DELETE RECORDS FROM CARBONDATA TABLE</a></li>
+</ul><h2 id="load-data">LOAD DATA</h2><p>This command loads the user data in raw format to the
+    CarbonData specific data format store, this allows CarbonData to provide good performance while
+    querying the data. Please visit Data Management for more
+    details on LOAD.</p><h3>Syntax</h3><p><code>
+    LOAD DATA [LOCAL] INPATH &#39;folder_path&#39;
+    INTO TABLE [db_name.]table_name
+    OPTIONS(property_name=property_value, ...)
+</code></p><p>OPTIONS is not mandatory for the data loading process. Inside OPTIONS, the user can provide
+    any of the options such as DELIMITER, QUOTECHAR, ESCAPECHAR, or MULTILINE, as required.</p>
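+<p>For instance, a minimal load that relies on the default option values (re-using the sample path and
+    table name from the example shown later in this section) could look like this:</p>
+<p><code>
+    LOAD DATA LOCAL INPATH &#39;/opt/rawdata/data.csv&#39; INTO TABLE carbontable
+</code></p>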
+<p>NOTE: The path must be a canonical path.</p><h3>Parameter Description</h3>
+<table class="table table-striped table-bordered">
+    <thead>
+    <tr>
+        <th>Parameter</th>
+        <th>Description</th>
+        <th>Optional</th>
+    </tr>
+    </thead>
+    <tbody>
+    <tr>
+        <td>folder_path</td>
+        <td>Path of the raw CSV data folder or file.</td>
+        <td>NO</td>
+    </tr>
+    <tr>
+        <td>db_name</td>
+        <td>Database name. If it is not specified, the current database is used.</td>
+        <td>YES</td>
+    </tr>
+    <tr>
+        <td>table_name</td>
+        <td>The name of the table in the provided database.</td>
+        <td>NO</td>
+    </tr>
+    <tr>
+        <td>OPTIONS</td>
+        <td>Extra options provided to the LOAD command.</td>
+        <td>YES</td>
+    </tr>
+    </tbody>
+</table><h3>Usage Guidelines</h3><p>You can use the following options to load data:</p>
+<ul>
+    <li><p><strong>DELIMITER:</strong> Delimiters can be provided in the load command.</p>
+        <p><code>
+            OPTIONS(&#39;DELIMITER&#39;=&#39;,&#39;)
+        </code></p></li>
+    <li><p><strong>QUOTECHAR:</strong> Quote Characters can be provided in the load command.</p>
+        <p><code>
+            OPTIONS(&#39;QUOTECHAR&#39;=&#39;&quot;&#39;)
+        </code></p></li>
+    <li><p><strong>COMMENTCHAR:</strong> A comment character can be provided in the load command if
+        the user wants to comment out lines.</p>
+        <p><code>
+            OPTIONS(&#39;COMMENTCHAR&#39;=&#39;#&#39;)
+        </code></p></li>
+    <li><p><strong>FILEHEADER:</strong> Headers can be provided in the LOAD DATA command if headers
+        are missing in the source files.</p>
+        <p><code>
+            OPTIONS(&#39;FILEHEADER&#39;=&#39;column1,column2&#39;)
+        </code></p></li>
+    <li><p><strong>MULTILINE:</strong> Set this when the CSV contains newline characters within quoted fields.</p>
+        <p><code>
+            OPTIONS(&#39;MULTILINE&#39;=&#39;true&#39;)
+        </code></p></li>
+    <li><p><strong>ESCAPECHAR:</strong> An escape character can be provided if the user wants strict
+        validation of escape characters in the CSV.</p>
+        <p><code>
+            OPTIONS(&#39;ESCAPECHAR&#39;=&#39;\&#39;)
+        </code></p></li>
+    <li><p><strong>COMPLEX_DELIMITER_LEVEL_1:</strong> Splits a complex type data column in a row
+        (e.g., a$b$c --&gt; Array = {a,b,c}).</p>
+        <p><code>
+            OPTIONS(&#39;COMPLEX_DELIMITER_LEVEL_1&#39;=&#39;$&#39;)
+        </code></p></li>
+    <li><p><strong>COMPLEX_DELIMITER_LEVEL_2:</strong> Splits a nested complex type data column in
+        a row. The level_1 delimiter is applied first, then the level_2 delimiter, based on the complex
+        data type (e.g., a:b$c:d --&gt; Array of Array = {{a,b},{c,d}}).</p>
+        <p><code>
+            OPTIONS(&#39;COMPLEX_DELIMITER_LEVEL_2&#39;=&#39;:&#39;)
+        </code></p></li>
+    <li><p><strong>ALL_DICTIONARY_PATH:</strong> Path of the dictionary file containing dictionary values for all columns.</p>
+        <p><code>
+            OPTIONS(&#39;ALL_DICTIONARY_PATH&#39;=&#39;/opt/alldictionary/data.dictionary&#39;)
+        </code></p></li>
+    <li><p><strong>COLUMNDICT:</strong> Dictionary file path for the specified column(s).</p>
+        <p><code>
+            OPTIONS(&#39;COLUMNDICT&#39;=&#39;column1:dictionaryFilePath1,
+            column2:dictionaryFilePath2&#39;)
+        </code></p>
+        <p>NOTE: ALL_DICTIONARY_PATH and COLUMNDICT cannot be used together.</p></li>
+    <li><p><strong>DATEFORMAT:</strong> Date format for the specified column(s).</p>
+        <p><code>
+            OPTIONS(&#39;DATEFORMAT&#39;=&#39;column1:dateFormat1, column2:dateFormat2&#39;)
+        </code></p>
+        <p>NOTE: Date formats are specified by date pattern strings. The date pattern letters in
+            CarbonData are the same as in Java. Refer to <a
+                    href="http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html">SimpleDateFormat</a>.
+            A sample pattern is shown after this list.
+        </p></li>
+    <li><p><strong>USE_KETTLE:</strong> This option specifies whether to use Kettle for
+        loading data. By default, Kettle is not used for data loading.</p>
+        <p><code>
+            OPTIONS(&#39;USE_KETTLE&#39;=&#39;FALSE&#39;)
+        </code></p>
+        <p>Note: It is recommended to set this option to false.</p></li>
+    <li><p><strong>SINGLE_PASS:</strong> Single-pass loading enables a single job to finish data
+        loading with dictionary generation on the fly. It enhances performance in scenarios
+        where data loads subsequent to the initial load involve fewer incremental updates to
+        the dictionary.</p>
+
+        <p>This option specifies whether to use single-pass loading. By default, this
+            option is set to FALSE.</p>
+        <p><code>
+            OPTIONS(&#39;SINGLE_PASS&#39;=&#39;TRUE&#39;)
+        </code></p>
+        <p>Note:</p>
+        <ul>
+            <li><p>If this option is set to TRUE, data loading takes less time.</p></li>
+            <li><p>If this option is set to an invalid value (anything other than TRUE or FALSE), the
+                default value is used.</p></li>
+        </ul>
+    </li>
+</ul>
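+<p>As a minimal sketch of the DATEFORMAT option referenced above (the column names doj and
+    projectjoindate are taken from the load example that follows; the patterns are standard Java
+    date patterns), the option could be specified as:</p>
+<p><code>
+    OPTIONS(&#39;DATEFORMAT&#39;=&#39;doj:yyyy-MM-dd, projectjoindate:dd-MM-yyyy&#39;)
+</code></p>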
+
+<h3>Example:</h3><p><pre><code>
+    LOAD DATA local inpath &#39;/opt/rawdata/data.csv&#39; INTO table carbontable
+    options(&#39;DELIMITER&#39;=&#39;,&#39;, &#39;QUOTECHAR&#39;=&#39;&quot;&#39;,&#39;COMMENTCHAR&#39;=&#39;#&#39;,
+    &#39;FILEHEADER&#39;=&#39;empno,empname,designation,doj,workgroupcategory,workgroupcategoryname,deptno,deptname,projectcode,
+    projectjoindate,projectenddate,attendance,utilization,salary&#39;,
+    &#39;MULTILINE&#39;=&#39;true&#39;,&#39;ESCAPECHAR&#39;=&#39;\&#39;,&#39;COMPLEX_DELIMITER_LEVEL_1&#39;=&#39;$&#39;,
+    &#39;COMPLEX_DELIMITER_LEVEL_2&#39;=&#39;:&#39;,
+    &#39;ALL_DICTIONARY_PATH&#39;=&#39;/opt/alldictionary/data.dictionary&#39;)
+</code></pre></p><h2 id="insert-data">INSERT DATA INTO A CARBONDATA TABLE</h2><p>This command inserts data
+    into a CarbonData table. It is defined as a combination of two queries Insert and Select query
+    respectively. It inserts records from a source table into a target CarbonData table. The source
+    table can be a Hive table, Parquet table or a CarbonData table itself. It comes with the
+    functionality to aggregate the records of a table by performing Select query on source table and
+    load its corresponding resultant records into a CarbonData table.</p><p><strong>NOTE</strong> :
+    The client node where the INSERT command is executing, must be part of the cluster.</p><h3>
+    Syntax</h3><p><code>
+    INSERT INTO TABLE &lt;CARBONDATA TABLE&gt; SELECT * FROM sourceTableName
+    [ WHERE { &lt;filter_condition&gt; } ];
+</code></p><p>You can also omit the <code>table</code> keyword and write your query as:</p><p><code>
+    INSERT INTO &lt;CARBONDATA TABLE&gt; SELECT * FROM sourceTableName
+    [ WHERE { &lt;filter_condition&gt; } ];
+</code></p><h3>Parameter Description</h3>
+<table class="table table-striped table-bordered">
+    <thead>
+    <tr>
+        <th>Parameter</th>
+        <th>Description</th>
+    </tr>
+    </thead>
+    <tbody>
+    <tr>
+        <td>CARBONDATA TABLE</td>
+        <td>The name of the CarbonData table in which you want to perform the insert operation.</td>
+    </tr>
+    <tr>
+        <td>sourceTableName</td>
+        <td>The table from which the records are read and inserted into destination CarbonData
+            table.
+        </td>
+    </tr>
+    </tbody>
+</table><h3>Usage Guidelines</h3><p>The following conditions must be met for a successful insert
+    operation:</p>
+<ul>
+    <li>The source table and the CarbonData table must have the same table schema.</li>
+    <li>The target table must already be created.</li>
+    <li>Overwrite is not supported for CarbonData tables.</li>
+    <li>The data types of the source and destination table columns should be the same; otherwise, the
+        data from the source table is treated as bad records and the INSERT command fails.
+    </li>
+    <li>The INSERT INTO command does not support partial success: if bad records are found, it
+        fails.
+    </li>
+    <li>Data cannot be loaded or updated in the source table while an insert from the source table to
+        the target table is in progress.
+    </li>
+</ul><p>To enable data load or update during an insert operation, set the following property to
+    true.</p><p><code>
+    carbon.insert.persist.enable=true
+</code></p><p>By default, the above configuration is set to false.</p><p><strong>NOTE</strong>:
+    Enabling this property reduces performance.</p><h3>Examples</h3><p><code>
+    INSERT INTO table1 SELECT item1 ,sum(item2 + 1000) as result FROM
+    table2 group by item1;
+</code></p><p><code>
+    INSERT INTO table1 SELECT item1, item2, item3 FROM table2
+    where item2=&#39;xyz&#39;;
+</code></p><p><code>
+    INSERT INTO table1 SELECT * FROM table2
+    where exists (select * from table3
+    where table2.item1 = table3.item1);
+</code></p><p><strong>The status (Success/Failure) will be captured in the driver log.</strong></p>
+<h2 id="show-segments">SHOW SEGMENTS</h2><p>This command is used to get the segments of CarbonData
+    table.</p><p><code>
+    SHOW SEGMENTS FOR TABLE [db_name.]table_name
+    LIMIT number_of_segments;
+</code></p><h3>Parameter Description</h3>
+<table class="table table-striped table-bordered">
+    <thead>
+    <tr>
+        <th>Parameter</th>
+        <th>Description</th>
+        <th>Optional</th>
+    </tr>
+    </thead>
+    <tbody>
+    <tr>
+        <td>db_name</td>
+        <td>Database name. If it is not specified, the current database is used.</td>
+        <td>YES</td>
+    </tr>
+    <tr>
+        <td>table_name</td>
+        <td>The name of the table in the provided database.</td>
+        <td>NO</td>
+    </tr>
+    <tr>
+        <td>number_of_segments</td>
+        <td>Limits the output to this number of segments.</td>
+        <td>YES</td>
+    </tr>
+    </tbody>
+</table><h3>Example:</h3><p><code>
+    SHOW SEGMENTS FOR TABLE CarbonDatabase.CarbonTable LIMIT 4;
+</code></p><h2 id="delete-id">DELETE SEGMENT BY ID</h2><p>This command is used to delete segment by
+    using the segment ID. Each segment has a unique segment ID associated with it. Using this
+    segment ID, you can remove the segment.</p><p>The following command will get the segmentID.</p>
+<p><code>
+    SHOW SEGMENTS FOR TABLE dbname.tablename LIMIT number_of_segments
+</code></p><p>After you retrieve the segment ID of the segment that you want to delete, execute the
+    following command to delete the selected segment.</p><p><code>
+    DELETE SEGMENT segment_sequence_id1, segment_sequence_id2, ....
+    FROM TABLE tableName
+</code></p><h3>Parameter Description</h3>
+<table class="table table-striped table-bordered">
+    <thead>
+    <tr>
+        <th>Parameter</th>
+        <th>Description</th>
+        <th>Optional</th>
+    </tr>
+    </thead>
+    <tbody>
+    <tr>
+        <td>segment_id</td>
+        <td>Segment ID of the load.</td>
+        <td>NO</td>
+    </tr>
+    <tr>
+        <td>db_name</td>
+        <td>Database name. If it is not specified, the current database is used.</td>
+        <td>YES</td>
+    </tr>
+    <tr>
+        <td>table_name</td>
+        <td>The name of the table in the provided database.</td>
+        <td>NO</td>
+    </tr>
+    </tbody>
+</table><h3>Example:</h3><p><code>
+    DELETE SEGMENT 0 FROM TABLE CarbonDatabase.CarbonTable;
+    DELETE SEGMENT 0.1,5,8 FROM TABLE CarbonDatabase.CarbonTable;
+</code> NOTE: Here 0.1 is the sequence ID of a compacted segment. </p><h2 id="delete-date">DELETE SEGMENT BY
+    DATE</h2><p>This command allows deleting CarbonData segment(s) from the store based on
+    the date provided by the user in the DML command. Segments created before the specified date
+    are removed from the store.</p><p><code>
+    DELETE FROM TABLE [schema_name.]table_name
+    WHERE [DATE_FIELD] BEFORE [DATE_VALUE]
+</code></p><h3>Parameter Description</h3>
+<table class="table table-striped table-bordered">
+    <thead>
+    <tr>
+        <th>Parameter</th>
+        <th>Description</th>
+        <th>Optional</th>
+    </tr>
+    </thead>
+    <tbody>
+    <tr>
+        <td>DATE_VALUE</td>
+        <td>Valid segment load start time value. All segments created before this specified date will be
+            deleted.
+        </td>
+        <td>NO</td>
+    </tr>
+    <tr>
+        <td>db_name</td>
+        <td>Database name. If it is not specified, the current database is used.</td>
+        <td>YES</td>
+    </tr>
+    <tr>
+        <td>table_name</td>
+        <td>The name of the table in the provided database.</td>
+        <td>NO</td>
+    </tr>
+    </tbody>
+</table><h3>Example:</h3><p><code>
+    DELETE SEGMENTS FROM TABLE CarbonDatabase.CarbonTable
+    WHERE STARTTIME BEFORE &#39;2017-06-01 12:05:06&#39;;
+</code></p><h2 id="update-carbondata">Update CarbonData Table</h2><p>This command will allow to
+    update the carbon table based on the column expression and optional filter conditions.</p><h3>
+    Syntax</h3><p><code>
+    UPDATE &lt;table_name&gt;
+    SET (column_name1, column_name2, ... column_name n) =
+    (column1_expression , column2_expression . .. column n_expression )
+    [ WHERE { &lt;filter_condition&gt; } ];
+</code></p><p>Alternatively, the following command can also be used to update the CarbonData
+    table:</p><p><code>
+    UPDATE &lt;table_name&gt;
+    SET (column_name1, column_name2, ...) =
+    (select sourceColumn1, sourceColumn2 from sourceTable
+    [ WHERE { &lt;filter_condition&gt; } ] )
+    [ WHERE { &lt;filter_condition&gt; } ];
+</code></p><h3>Parameter Description</h3>
+<table class="table table-striped table-bordered">
+    <thead>
+    <tr>
+        <th>Parameter</th>
+        <th>Description</th>
+    </tr>
+    </thead>
+    <tbody>
+    <tr>
+        <td>table_name</td>
+        <td>The name of the CarbonData table in which you want to perform the update operation.</td>
+    </tr>
+    <tr>
+        <td>column_name</td>
+        <td>The destination columns to be updated.</td>
+    </tr>
+    <tr>
+        <td>sourceColumn</td>
+        <td>The source table column values to be updated in the destination table.</td>
+    </tr>
+    <tr>
+        <td>sourceTable</td>
+        <td>The table from which the records are updated into the destination CarbonData table.</td>
+    </tr>
+    </tbody>
+</table><h3>Usage Guidelines</h3><p>The following conditions must be met for a successful
+    update:</p>
+<ul>
+    <li>The update command fails if multiple input rows in the source table match a single row
+        in the destination table.
+    </li>
+    <li>If the source table generates empty records, the update operation will complete successfully
+        without updating the table.
+    </li>
+    <li>If a source table row does not correspond to any of the existing rows in a destination
+        table, the update operation will complete successfully without updating the table.
+    </li>
+    <li>If, in a sub-query, the source table and the target table are the same, the update operation
+        fails.
+    </li>
+    <li>If the sub-query used in the UPDATE statement contains an aggregate function or a GROUP BY
+        clause, the UPDATE operation fails.
+    </li>
+</ul><h3>Examples</h3><p>UPDATE is not supported for queries that contain an aggregate function or a GROUP BY clause, as the first example below illustrates.</p>
+<p><code>
+    UPDATE t_carbn01 a
+    SET (a.item_type_code, a.profit) = ( SELECT b.item_type_cd,
+    sum(b.profit) from t_carbn01b b
+    WHERE item_type_cd =2 group by item_type_code);
+</code></p><p>Here the UPDATE operation fails because the sub-query contains the aggregate function
+    sum(b.profit) and a GROUP BY clause.</p><p><code>
+    UPDATE carbonTable1 d
+    SET(d.column3,d.column5 ) = (SELECT s.c33 ,s.c55
+    FROM sourceTable1 s WHERE d.column1 = s.c11)
+    WHERE d.column1 = &#39;china&#39; AND EXISTS( SELECT * from table3 o where o.c2 &gt; 1);
+</code></p><p><code>
+    UPDATE carbonTable1 d SET (c3) = (SELECT s.c33 from sourceTable1 s
+    WHERE d.column1 = s.c11)
+    WHERE exists( select * from iud.other o where o.c2 &gt; 1);
+</code></p><p><code>
+    UPDATE carbonTable1 SET (c2, c5 ) = (c2 + 1, concat(c5 , &quot;y&quot; ));
+</code></p><p><code>
+    UPDATE carbonTable1 d SET (c2, c5 ) = (c2 + 1, &quot;xyx&quot;)
+    WHERE d.column1 = &#39;india&#39;;
+</code></p><p><code>
+    UPDATE carbonTable1 d SET (c2, c5 ) = (c2 + 1, &quot;xyx&quot;)
+    WHERE d.column1 = &#39;india&#39;
+    and EXISTS( SELECT * FROM table3 o WHERE o.column2 &gt; 1);
+</code></p><p><strong>The status (Success/Failure) will be captured in the driver log and on the
+    client.</strong></p><h2 id="delete-table">Delete Records from CarbonData Table</h2><p>This
+    command allows you to delete records from a CarbonData table.</p><h3>Syntax</h3><p><code>
+    DELETE FROM table_name [WHERE expression];
+</code></p><h3>Parameter Description</h3>
+<table class="table table-striped table-bordered">
+    <thead>
+    <tr>
+        <th>Parameter</th>
+        <th>Description</th>
+    </tr>
+    </thead>
+    <tbody>
+    <tr>
+        <td>table_name</td>
+        <td>The name of the CarbonData table in which you want to perform the delete operation.</td>
+    </tr>
+    </tbody>
+</table><h3>Examples</h3><p><code>
+    DELETE FROM columncarbonTable1 d WHERE d.column1 = &#39;china&#39;;
+</code></p><p><code>
+    DELETE FROM dest WHERE column1 IN (&#39;china&#39;, &#39;USA&#39;);
+</code></p><p><code>
+    DELETE FROM columncarbonTable1
+    WHERE column1 IN (SELECT column11 FROM sourceTable2);
+</code></p><p><code>
+    DELETE FROM columncarbonTable1
+    WHERE column1 IN (SELECT column11 FROM sourceTable2 WHERE
+    column1 = &#39;USA&#39;);
+</code></p><p><code>
+    DELETE FROM columncarbonTable1 WHERE column2 &gt;= 4;
+</code></p><p><strong>The status (Success/Failure) will be captured in the driver log and on the
+    client.</strong></p>
\ No newline at end of file