You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@impala.apache.org by ta...@apache.org on 2019/06/25 15:45:26 UTC

[impala] 14/20: IMPALA-7291: [DOCS] Note about no codegen support for CHAR

This is an automated email from the ASF dual-hosted git repository.

tarmstrong pushed a commit to branch 2.x
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 3d6a8959f8077a53194dd5a7b9e339b09046fa49
Author: Alex Rodoni <ar...@cloudera.com>
AuthorDate: Mon Jul 23 15:22:42 2018 -0700

    IMPALA-7291: [DOCS] Note about no codegen support for CHAR
    
    Also, cleaned up confusing examples.
    
    Change-Id: Id89dcf44e31f1bc56d888527585b3ec90229981a
    Reviewed-on: http://gerrit.cloudera.org:8080/11022
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 docs/topics/impala_char.xml | 223 +++++++++++++++++---------------------------
 1 file changed, 87 insertions(+), 136 deletions(-)

diff --git a/docs/topics/impala_char.xml b/docs/topics/impala_char.xml
index 9204812..5286a3c 100644
--- a/docs/topics/impala_char.xml
+++ b/docs/topics/impala_char.xml
@@ -21,7 +21,13 @@ under the License.
 <concept id="char" rev="2.0.0">
 
   <title>CHAR Data Type (<keyword keyref="impala20"/> or higher only)</title>
-  <titlealts audience="PDF"><navtitle>CHAR</navtitle></titlealts>
+
+  <titlealts audience="PDF">
+
+    <navtitle>CHAR</navtitle>
+
+  </titlealts>
+
   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
@@ -36,9 +42,9 @@ under the License.
   <conbody>
 
     <p rev="2.0.0">
-      <indexterm audience="hidden">CHAR data type</indexterm>
-      A fixed-length character type, padded with trailing spaces if necessary to achieve the specified length. If
-      values are longer than the specified length, Impala truncates any trailing characters.
+      A fixed-length character type, padded with trailing spaces if necessary to achieve the
+      specified length. If values are longer than the specified length, Impala truncates any
+      trailing characters.
     </p>
 
     <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
@@ -50,7 +56,7 @@ under the License.
 <codeblock><varname>column_name</varname> CHAR(<varname>length</varname>)</codeblock>
 
     <p>
-      The maximum length you can specify is 255.
+      The maximum <varname>length</varname> you can specify is 255.
     </p>
 
     <p>
@@ -59,21 +65,54 @@ under the License.
 
     <ul>
       <li>
-        When you store a <codeph>CHAR</codeph> value shorter than the specified length in a table, queries return
-        the value padded with trailing spaces if necessary; the resulting value has the same length as specified in
-        the column definition.
+        When you store a <codeph>CHAR</codeph> value shorter than the specified length in a
+        table, queries return the value padded with trailing spaces if necessary; the resulting
+        value has the same length as specified in the column definition.
+      </li>
+
+      <li>
+        Leading spaces in <codeph>CHAR</codeph> are preserved within the data file.
+      </li>
+
+      <li>
+        If you store a <codeph>CHAR</codeph> value containing trailing spaces in a table, those
+        trailing spaces are not stored in the data file. When the value is retrieved by a query,
+        the result could have a different number of trailing spaces. That is, the value includes
+        however many spaces are needed to pad it to the specified length of the column.
       </li>
 
       <li>
-        If you store a <codeph>CHAR</codeph> value containing trailing spaces in a table, those trailing spaces are
-        not stored in the data file. When the value is retrieved by a query, the result could have a different
-        number of trailing spaces. That is, the value includes however many spaces are needed to pad it to the
-        specified length of the column.
+        If you compare two <codeph>CHAR</codeph> values that differ only in the number of
+        trailing spaces, those values are considered identical.
       </li>
 
       <li>
-        If you compare two <codeph>CHAR</codeph> values that differ only in the number of trailing spaces, those
-        values are considered identical.
+        When comparing or processing <codeph>CHAR</codeph> values:
+        <ul>
+          <li>
+            <codeph>CAST()</codeph> truncates any longer string to fit within
+            the defined length. For example:
+<codeblock>SELECT CAST('x' AS CHAR(4)) = CAST('x        ' AS CHAR(4)); -- Returns TRUE.
+</codeblock>
+          </li>
+          <li>
+            If a <codeph>CHAR</codeph> value is shorter than the specified
+            length, it is padded on the right with spaces until it matches the
+            specified length.
+          </li>
+          <li>
+            <codeph>CHAR_LENGTH()</codeph> returns the length including any
+            trailing spaces.
+          </li>
+          <li>
+            <codeph>LENGTH()</codeph> returns the length excluding trailing
+            spaces.
+          </li>
+          <li>
+            <codeph>CONCAT()</codeph> returns the length including trailing
+            spaces.
+          </li>
+        </ul>
       </li>
     </ul>
 
@@ -93,18 +132,19 @@ under the License.
       </li>
 
       <li>
-        Parquet files generated by Impala and containing this type can be freely interchanged with other components
-        such as Hive and MapReduce.
+        Parquet files generated by Impala and containing this type can be freely interchanged
+        with other components such as Hive and MapReduce.
       </li>
 
       <li>
-        Any trailing spaces, whether implicitly or explicitly specified, are not written to the Parquet data files.
+        Any trailing spaces, whether implicitly or explicitly specified, are not written to the
+        Parquet data files.
       </li>
 
       <li>
         Parquet data files might contain values that are longer than allowed by the
-        <codeph>CHAR(<varname>n</varname>)</codeph> length limit. Impala ignores any extra trailing characters when
-        it processes those values during a query.
+        <codeph>CHAR(<varname>n</varname>)</codeph> length limit. Impala ignores any extra
+        trailing characters when it processes those values during a query.
       </li>
     </ul>
 
@@ -112,14 +152,18 @@ under the License.
 
     <p>
       Text data files might contain values that are longer than allowed for a particular
-      <codeph>CHAR(<varname>n</varname>)</codeph> column. Any extra trailing characters are ignored when Impala
-      processes those values during a query. Text data files can also contain values that are shorter than the
-      defined length limit, and Impala pads them with trailing spaces up to the specified length. Any text data
-      files produced by Impala <codeph>INSERT</codeph> statements do not include any trailing blanks for
+      <codeph>CHAR(<varname>n</varname>)</codeph> column. Any extra trailing characters are
+      ignored when Impala processes those values during a query. Text data files can also
+      contain values that are shorter than the defined length limit, and Impala pads them with
+      trailing spaces up to the specified length. Any text data files produced by Impala
+      <codeph>INSERT</codeph> statements do not include any trailing blanks for
       <codeph>CHAR</codeph> columns.
     </p>
 
-    <p><b>Avro considerations:</b></p>
+    <p>
+      <b>Avro considerations:</b>
+    </p>
+
     <p conref="../shared/impala_common.xml#common/avro_2gb_strings"/>
 
     <p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
@@ -129,7 +173,8 @@ under the License.
     </p>
 
     <p>
-      Some other database systems make the length specification optional. For Impala, the length is required.
+      Some other database systems make the length specification optional. For Impala, the length
+      is required.
     </p>
 
 <!--
@@ -146,142 +191,46 @@ it silently treats the value as length 255.
 
     <p conref="../shared/impala_common.xml#common/column_stats_constant"/>
 
-<!-- Seems like a logical design decision but don't think it's currently implemented like this.
-<p>
-Because both the maximum and average length are always known and always the same for
-any given <codeph>CHAR(<varname>n</varname>)</codeph> column, those fields are always filled
-in for <codeph>SHOW COLUMN STATS</codeph> output, even before you run
-<codeph>COMPUTE STATS</codeph> on the table.
-</p>
--->
-
     <p conref="../shared/impala_common.xml#common/udf_blurb_no"/>
 
-    <p conref="../shared/impala_common.xml#common/example_blurb"/>
-
-    <p>
-      These examples show how trailing spaces are not considered significant when comparing or processing
-      <codeph>CHAR</codeph> values. <codeph>CAST()</codeph> truncates any longer string to fit within the defined
-      length. If a <codeph>CHAR</codeph> value is shorter than the specified length, it is padded on the right with
-      spaces until it matches the specified length. Therefore, <codeph>LENGTH()</codeph> represents the length
-      including any trailing spaces, and <codeph>CONCAT()</codeph> also treats the column value as if it has
-      trailing spaces.
-    </p>
-
-<codeblock>select cast('x' as char(4)) = cast('x   ' as char(4)) as "unpadded equal to padded";
-+--------------------------+
-| unpadded equal to padded |
-+--------------------------+
-| true                     |
-+--------------------------+
-
-create table char_length(c char(3));
-insert into char_length values (cast('1' as char(3))), (cast('12' as char(3))), (cast('123' as char(3))), (cast('123456' as char(3)));
-select concat("[",c,"]") as c, length(c) from char_length;
-+-------+-----------+
-| c     | length(c) |
-+-------+-----------+
-| [1  ] | 3         |
-| [12 ] | 3         |
-| [123] | 3         |
-| [123] | 3         |
-+-------+-----------+
-</codeblock>
-
-    <p>
-      This example shows a case where data values are known to have a specific length, where <codeph>CHAR</codeph>
-      is a logical data type to use.
-<!--
-Because all the <codeph>CHAR</codeph> values have a constant predictable length,
-Impala can efficiently analyze how best to use these values in join queries,
-aggregation queries, and other contexts where column length is significant.
--->
-    </p>
+    <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
 
-<codeblock>create table addresses
-  (id bigint,
-   street_name string,
-   state_abbreviation char(2),
-   country_abbreviation char(2));
-</codeblock>
+    <p conref="../shared/impala_common.xml#common/kudu_unsupported_data_type"/>
 
     <p>
-      The following example shows how values written by Impala do not physically include the trailing spaces. It
-      creates a table using text format, with <codeph>CHAR</codeph> values much shorter than the declared length,
-      and then prints the resulting data file to show that the delimited values are not separated by spaces. The
-      same behavior applies to binary-format Parquet data files.
+      <b>Performance consideration:</b>
     </p>
 
-<codeblock>create table char_in_text (a char(20), b char(30), c char(40))
-  row format delimited fields terminated by ',';
-
-insert into char_in_text values (cast('foo' as char(20)), cast('bar' as char(30)), cast('baz' as char(40))), (cast('hello' as char(20)), cast('goodbye' as char(30)), cast('aloha' as char(40)));
-
--- Running this Linux command inside impala-shell using the ! shortcut.
-!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
-foo,bar,baz
-hello,goodbye,aloha
-</codeblock>
-
     <p>
-      The following example further illustrates the treatment of spaces. It replaces the contents of the previous
-      table with some values including leading spaces, trailing spaces, or both. Any leading spaces are preserved
-      within the data file, but trailing spaces are discarded. Then when the values are retrieved by a query, the
-      leading spaces are retrieved verbatim while any necessary trailing spaces are supplied by Impala.
+      The <codeph>CHAR</codeph> type currently does not have the Impala Codegen support, and we
+      recommend using <codeph>VARCHAR</codeph> or <codeph>STRING</codeph> over
+      <codeph>CHAR</codeph> as the performance gain of Codegen outweighs the benefits of fixed
+      width <codeph>CHAR</codeph>.
     </p>
 
-<codeblock>insert overwrite char_in_text values (cast('trailing   ' as char(20)), cast('   leading and trailing   ' as char(30)), cast('   leading' as char(40)));
-!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
-trailing,   leading and trailing,   leading
-
-select concat('[',a,']') as a, concat('[',b,']') as b, concat('[',c,']') as c from char_in_text;
-+------------------------+----------------------------------+--------------------------------------------+
-| a                      | b                                | c                                          |
-+------------------------+----------------------------------+--------------------------------------------+
-| [trailing            ] | [   leading and trailing       ] | [   leading                              ] |
-+------------------------+----------------------------------+--------------------------------------------+
-</codeblock>
-
-    <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
-    <p conref="../shared/impala_common.xml#common/kudu_unsupported_data_type"/>
-
     <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
 
     <p>
-      Because the blank-padding behavior requires allocating the maximum length for each value in memory, for
-      scalability reasons avoid declaring <codeph>CHAR</codeph> columns that are much longer than typical values in
-      that column.
+      Because the blank-padding behavior requires allocating the maximum length for each value
+      in memory, for scalability reasons, you should avoid declaring <codeph>CHAR</codeph>
+      columns that are much longer than typical values in that column.
     </p>
 
     <p conref="../shared/impala_common.xml#common/blobs_are_strings"/>
 
     <p>
       When an expression compares a <codeph>CHAR</codeph> with a <codeph>STRING</codeph> or
-      <codeph>VARCHAR</codeph>, the <codeph>CHAR</codeph> value is implicitly converted to <codeph>STRING</codeph>
-      first, with trailing spaces preserved.
+      <codeph>VARCHAR</codeph>, the <codeph>CHAR</codeph> value is implicitly converted to
+      <codeph>STRING</codeph> first, with trailing spaces preserved.
     </p>
 
-<codeblock>select cast("foo  " as char(5)) = 'foo' as "char equal to string";
-+----------------------+
-| char equal to string |
-+----------------------+
-| false                |
-+----------------------+
-</codeblock>
-
     <p>
       This behavior differs from other popular database systems. To get the expected result of
-      <codeph>TRUE</codeph>, cast the expressions on both sides to <codeph>CHAR</codeph> values of the appropriate
-      length:
+      <codeph>TRUE</codeph>, cast the expressions on both sides to <codeph>CHAR</codeph> values
+      of the appropriate length. For example:
     </p>
 
-<codeblock>select cast("foo  " as char(5)) = cast('foo' as char(3)) as "char equal to string";
-+----------------------+
-| char equal to string |
-+----------------------+
-| true                 |
-+----------------------+
-</codeblock>
+<codeblock>SELECT CAST("foo  " AS CHAR(5)) = CAST('foo' AS CHAR(3)); -- Returns TRUE.</codeblock>
 
     <p>
       This behavior is subject to change in future releases.
@@ -294,5 +243,7 @@ select concat('[',a,']') as a, concat('[',b,']') as b, concat('[',c,']') as c fr
       <xref href="impala_literals.xml#string_literals"/>,
       <xref href="impala_string_functions.xml#string_functions"/>
     </p>
+
   </conbody>
+
 </concept>