You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@impala.apache.org by ta...@apache.org on 2019/06/25 15:45:26 UTC
[impala] 14/20: IMPALA-7291: [DOCS] Note about no codegen support
for CHAR
This is an automated email from the ASF dual-hosted git repository.
tarmstrong pushed a commit to branch 2.x
in repository https://gitbox.apache.org/repos/asf/impala.git
commit 3d6a8959f8077a53194dd5a7b9e339b09046fa49
Author: Alex Rodoni <ar...@cloudera.com>
AuthorDate: Mon Jul 23 15:22:42 2018 -0700
IMPALA-7291: [DOCS] Note about no codegen support for CHAR
Also, cleaned up confusing examples.
Change-Id: Id89dcf44e31f1bc56d888527585b3ec90229981a
Reviewed-on: http://gerrit.cloudera.org:8080/11022
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
docs/topics/impala_char.xml | 223 +++++++++++++++++---------------------------
1 file changed, 87 insertions(+), 136 deletions(-)
diff --git a/docs/topics/impala_char.xml b/docs/topics/impala_char.xml
index 9204812..5286a3c 100644
--- a/docs/topics/impala_char.xml
+++ b/docs/topics/impala_char.xml
@@ -21,7 +21,13 @@ under the License.
<concept id="char" rev="2.0.0">
<title>CHAR Data Type (<keyword keyref="impala20"/> or higher only)</title>
- <titlealts audience="PDF"><navtitle>CHAR</navtitle></titlealts>
+
+ <titlealts audience="PDF">
+
+ <navtitle>CHAR</navtitle>
+
+ </titlealts>
+
<prolog>
<metadata>
<data name="Category" value="Impala"/>
@@ -36,9 +42,9 @@ under the License.
<conbody>
<p rev="2.0.0">
- <indexterm audience="hidden">CHAR data type</indexterm>
- A fixed-length character type, padded with trailing spaces if necessary to achieve the specified length. If
- values are longer than the specified length, Impala truncates any trailing characters.
+ A fixed-length character type, padded with trailing spaces if necessary to achieve the
+ specified length. If values are longer than the specified length, Impala truncates any
+ trailing characters.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
@@ -50,7 +56,7 @@ under the License.
<codeblock><varname>column_name</varname> CHAR(<varname>length</varname>)</codeblock>
<p>
- The maximum length you can specify is 255.
+ The maximum <varname>length</varname> you can specify is 255.
</p>
<p>
@@ -59,21 +65,54 @@ under the License.
<ul>
<li>
- When you store a <codeph>CHAR</codeph> value shorter than the specified length in a table, queries return
- the value padded with trailing spaces if necessary; the resulting value has the same length as specified in
- the column definition.
+ When you store a <codeph>CHAR</codeph> value shorter than the specified length in a
+ table, queries return the value padded with trailing spaces if necessary; the resulting
+ value has the same length as specified in the column definition.
+ </li>
+
+ <li>
+ Leading spaces in <codeph>CHAR</codeph> are preserved within the data file.
+ </li>
+
+ <li>
+ If you store a <codeph>CHAR</codeph> value containing trailing spaces in a table, those
+ trailing spaces are not stored in the data file. When the value is retrieved by a query,
+ the result could have a different number of trailing spaces. That is, the value includes
+ however many spaces are needed to pad it to the specified length of the column.
</li>
<li>
- If you store a <codeph>CHAR</codeph> value containing trailing spaces in a table, those trailing spaces are
- not stored in the data file. When the value is retrieved by a query, the result could have a different
- number of trailing spaces. That is, the value includes however many spaces are needed to pad it to the
- specified length of the column.
+ If you compare two <codeph>CHAR</codeph> values that differ only in the number of
+ trailing spaces, those values are considered identical.
</li>
<li>
- If you compare two <codeph>CHAR</codeph> values that differ only in the number of trailing spaces, those
- values are considered identical.
+ When comparing or processing <codeph>CHAR</codeph> values:
+ <ul>
+ <li>
+ <codeph>CAST()</codeph> truncates any longer string to fit within
+ the defined length. For example:
+<codeblock>SELECT CAST('x' AS CHAR(4)) = CAST('x ' AS CHAR(4)); -- Returns TRUE.
+</codeblock>
+ </li>
+ <li>
+ If a <codeph>CHAR</codeph> value is shorter than the specified
+ length, it is padded on the right with spaces until it matches the
+ specified length.
+ </li>
+ <li>
+ <codeph>CHAR_LENGTH()</codeph> returns the length including any
+ trailing spaces.
+ </li>
+ <li>
+ <codeph>LENGTH()</codeph> returns the length excluding trailing
+ spaces.
+ </li>
+ <li>
+ <codeph>CONCAT()</codeph> returns the length including trailing
+ spaces.
+ </li>
+ </ul>
</li>
</ul>
@@ -93,18 +132,19 @@ under the License.
</li>
<li>
- Parquet files generated by Impala and containing this type can be freely interchanged with other components
- such as Hive and MapReduce.
+ Parquet files generated by Impala and containing this type can be freely interchanged
+ with other components such as Hive and MapReduce.
</li>
<li>
- Any trailing spaces, whether implicitly or explicitly specified, are not written to the Parquet data files.
+ Any trailing spaces, whether implicitly or explicitly specified, are not written to the
+ Parquet data files.
</li>
<li>
Parquet data files might contain values that are longer than allowed by the
- <codeph>CHAR(<varname>n</varname>)</codeph> length limit. Impala ignores any extra trailing characters when
- it processes those values during a query.
+ <codeph>CHAR(<varname>n</varname>)</codeph> length limit. Impala ignores any extra
+ trailing characters when it processes those values during a query.
</li>
</ul>
@@ -112,14 +152,18 @@ under the License.
<p>
Text data files might contain values that are longer than allowed for a particular
- <codeph>CHAR(<varname>n</varname>)</codeph> column. Any extra trailing characters are ignored when Impala
- processes those values during a query. Text data files can also contain values that are shorter than the
- defined length limit, and Impala pads them with trailing spaces up to the specified length. Any text data
- files produced by Impala <codeph>INSERT</codeph> statements do not include any trailing blanks for
+ <codeph>CHAR(<varname>n</varname>)</codeph> column. Any extra trailing characters are
+ ignored when Impala processes those values during a query. Text data files can also
+ contain values that are shorter than the defined length limit, and Impala pads them with
+ trailing spaces up to the specified length. Any text data files produced by Impala
+ <codeph>INSERT</codeph> statements do not include any trailing blanks for
<codeph>CHAR</codeph> columns.
</p>
- <p><b>Avro considerations:</b></p>
+ <p>
+ <b>Avro considerations:</b>
+ </p>
+
<p conref="../shared/impala_common.xml#common/avro_2gb_strings"/>
<p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
@@ -129,7 +173,8 @@ under the License.
</p>
<p>
- Some other database systems make the length specification optional. For Impala, the length is required.
+ Some other database systems make the length specification optional. For Impala, the length
+ is required.
</p>
<!--
@@ -146,142 +191,46 @@ it silently treats the value as length 255.
<p conref="../shared/impala_common.xml#common/column_stats_constant"/>
-<!-- Seems like a logical design decision but don't think it's currently implemented like this.
-<p>
-Because both the maximum and average length are always known and always the same for
-any given <codeph>CHAR(<varname>n</varname>)</codeph> column, those fields are always filled
-in for <codeph>SHOW COLUMN STATS</codeph> output, even before you run
-<codeph>COMPUTE STATS</codeph> on the table.
-</p>
--->
-
<p conref="../shared/impala_common.xml#common/udf_blurb_no"/>
- <p conref="../shared/impala_common.xml#common/example_blurb"/>
-
- <p>
- These examples show how trailing spaces are not considered significant when comparing or processing
- <codeph>CHAR</codeph> values. <codeph>CAST()</codeph> truncates any longer string to fit within the defined
- length. If a <codeph>CHAR</codeph> value is shorter than the specified length, it is padded on the right with
- spaces until it matches the specified length. Therefore, <codeph>LENGTH()</codeph> represents the length
- including any trailing spaces, and <codeph>CONCAT()</codeph> also treats the column value as if it has
- trailing spaces.
- </p>
-
-<codeblock>select cast('x' as char(4)) = cast('x ' as char(4)) as "unpadded equal to padded";
-+--------------------------+
-| unpadded equal to padded |
-+--------------------------+
-| true |
-+--------------------------+
-
-create table char_length(c char(3));
-insert into char_length values (cast('1' as char(3))), (cast('12' as char(3))), (cast('123' as char(3))), (cast('123456' as char(3)));
-select concat("[",c,"]") as c, length(c) from char_length;
-+-------+-----------+
-| c | length(c) |
-+-------+-----------+
-| [1 ] | 3 |
-| [12 ] | 3 |
-| [123] | 3 |
-| [123] | 3 |
-+-------+-----------+
-</codeblock>
-
- <p>
- This example shows a case where data values are known to have a specific length, where <codeph>CHAR</codeph>
- is a logical data type to use.
-<!--
-Because all the <codeph>CHAR</codeph> values have a constant predictable length,
-Impala can efficiently analyze how best to use these values in join queries,
-aggregation queries, and other contexts where column length is significant.
--->
- </p>
+ <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
-<codeblock>create table addresses
- (id bigint,
- street_name string,
- state_abbreviation char(2),
- country_abbreviation char(2));
-</codeblock>
+ <p conref="../shared/impala_common.xml#common/kudu_unsupported_data_type"/>
<p>
- The following example shows how values written by Impala do not physically include the trailing spaces. It
- creates a table using text format, with <codeph>CHAR</codeph> values much shorter than the declared length,
- and then prints the resulting data file to show that the delimited values are not separated by spaces. The
- same behavior applies to binary-format Parquet data files.
+ <b>Performance consideration:</b>
</p>
-<codeblock>create table char_in_text (a char(20), b char(30), c char(40))
- row format delimited fields terminated by ',';
-
-insert into char_in_text values (cast('foo' as char(20)), cast('bar' as char(30)), cast('baz' as char(40))), (cast('hello' as char(20)), cast('goodbye' as char(30)), cast('aloha' as char(40)));
-
--- Running this Linux command inside impala-shell using the ! shortcut.
-!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
-foo,bar,baz
-hello,goodbye,aloha
-</codeblock>
-
<p>
- The following example further illustrates the treatment of spaces. It replaces the contents of the previous
- table with some values including leading spaces, trailing spaces, or both. Any leading spaces are preserved
- within the data file, but trailing spaces are discarded. Then when the values are retrieved by a query, the
- leading spaces are retrieved verbatim while any necessary trailing spaces are supplied by Impala.
+ The <codeph>CHAR</codeph> type currently does not have the Impala Codegen support, and we
+ recommend using <codeph>VARCHAR</codeph> or <codeph>STRING</codeph> over
+ <codeph>CHAR</codeph> as the performance gain of Codegen outweighs the benefits of fixed
+ width <codeph>CHAR</codeph>.
</p>
-<codeblock>insert overwrite char_in_text values (cast('trailing ' as char(20)), cast(' leading and trailing ' as char(30)), cast(' leading' as char(40)));
-!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
-trailing, leading and trailing, leading
-
-select concat('[',a,']') as a, concat('[',b,']') as b, concat('[',c,']') as c from char_in_text;
-+------------------------+----------------------------------+--------------------------------------------+
-| a | b | c |
-+------------------------+----------------------------------+--------------------------------------------+
-| [trailing ] | [ leading and trailing ] | [ leading ] |
-+------------------------+----------------------------------+--------------------------------------------+
-</codeblock>
-
- <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
- <p conref="../shared/impala_common.xml#common/kudu_unsupported_data_type"/>
-
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
<p>
- Because the blank-padding behavior requires allocating the maximum length for each value in memory, for
- scalability reasons avoid declaring <codeph>CHAR</codeph> columns that are much longer than typical values in
- that column.
+ Because the blank-padding behavior requires allocating the maximum length for each value
+ in memory, for scalability reasons, you should avoid declaring <codeph>CHAR</codeph>
+ columns that are much longer than typical values in that column.
</p>
<p conref="../shared/impala_common.xml#common/blobs_are_strings"/>
<p>
When an expression compares a <codeph>CHAR</codeph> with a <codeph>STRING</codeph> or
- <codeph>VARCHAR</codeph>, the <codeph>CHAR</codeph> value is implicitly converted to <codeph>STRING</codeph>
- first, with trailing spaces preserved.
+ <codeph>VARCHAR</codeph>, the <codeph>CHAR</codeph> value is implicitly converted to
+ <codeph>STRING</codeph> first, with trailing spaces preserved.
</p>
-<codeblock>select cast("foo " as char(5)) = 'foo' as "char equal to string";
-+----------------------+
-| char equal to string |
-+----------------------+
-| false |
-+----------------------+
-</codeblock>
-
<p>
This behavior differs from other popular database systems. To get the expected result of
- <codeph>TRUE</codeph>, cast the expressions on both sides to <codeph>CHAR</codeph> values of the appropriate
- length:
+ <codeph>TRUE</codeph>, cast the expressions on both sides to <codeph>CHAR</codeph> values
+ of the appropriate length. For example:
</p>
-<codeblock>select cast("foo " as char(5)) = cast('foo' as char(3)) as "char equal to string";
-+----------------------+
-| char equal to string |
-+----------------------+
-| true |
-+----------------------+
-</codeblock>
+<codeblock>SELECT CAST("foo " AS CHAR(5)) = CAST('foo' AS CHAR(3)); -- Returns TRUE.</codeblock>
<p>
This behavior is subject to change in future releases.
@@ -294,5 +243,7 @@ select concat('[',a,']') as a, concat('[',b,']') as b, concat('[',c,']') as c fr
<xref href="impala_literals.xml#string_literals"/>,
<xref href="impala_string_functions.xml#string_functions"/>
</p>
+
</conbody>
+
</concept>