Posted to commits@impala.apache.org by jr...@apache.org on 2016/07/26 23:05:09 UTC
[17/22] incubator-impala git commit: First try at porting over the
source files necessary for the Impala SQL Reference.
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_complex_types.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_complex_types.xml b/docs/topics/impala_complex_types.xml
new file mode 100644
index 0000000..9fe7362
--- /dev/null
+++ b/docs/topics/impala_complex_types.xml
@@ -0,0 +1,2725 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="2.3.0" id="complex_types">
+
+ <title id="nested_types">Complex Types (CDH 5.5 and higher only)</title>
+
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="Impala Data Types"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">complex types</indexterm>
+
+ <indexterm audience="Cloudera">nested types</indexterm>
+ <term>Complex types</term> (also referred to as <term>nested types</term>) let you represent multiple data values within a single
+ row/column position. They differ from the familiar column types such as <codeph>BIGINT</codeph> and <codeph>STRING</codeph>, known as
+ <term>scalar types</term> or <term>primitive types</term>, which represent a single data value within a given row/column position.
+ Impala supports the complex types <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, and <codeph>STRUCT</codeph> in Impala 2.3 / CDH 5.5
+ and higher. The Hive <codeph>UNION</codeph> type is not currently supported.
+ </p>
+
+ <p outputclass="toc inpage"/>
+
+ <p>
+ Once you understand the basics of complex types, refer to the individual type topics when you need to refresh your memory about syntax
+ and examples:
+ </p>
+
+ <ul>
+ <li>
+ <xref href="impala_array.xml#array"/>
+ </li>
+
+ <li>
+ <xref href="impala_struct.xml#struct"/>
+ </li>
+
+ <li>
+ <xref href="impala_map.xml#map"/>
+ </li>
+ </ul>
+
+ </conbody>
+
+ <concept id="complex_types_benefits">
+
+ <title>Benefits of Impala Complex Types</title>
+
+ <conbody>
+
+ <p>
+ The reasons for using Impala complex types include the following:
+ </p>
+
+ <ul>
+ <li>
+ <p>
+            You already have data produced by Hive or another non-Impala component that uses the complex type column names. You might need to
+ convert the underlying data to Parquet to use it with Impala.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Your data model originates with a non-SQL programming language or a NoSQL data management system. For example, if you are
+ representing Python data expressed as nested lists, dictionaries, and tuples, those data structures correspond closely to Impala
+ <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, and <codeph>STRUCT</codeph> types.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Your analytic queries involving multiple tables could benefit from greater locality during join processing. By packing more
+ related data items within each HDFS data block, complex types let join queries avoid the network overhead of the traditional
+ Hadoop shuffle or broadcast join techniques.
+ </p>
+ </li>
+ </ul>
+
+ <p>
+ The Impala complex type support produces result sets with all scalar values, and the scalar components of complex types can be used
+ with all SQL clauses, such as <codeph>GROUP BY</codeph>, <codeph>ORDER BY</codeph>, all kinds of joins, subqueries, and inline
+ views. The ability to process complex type data entirely in SQL reduces the need to write application-specific code in Java or other
+ programming languages to deconstruct the underlying data structures.
+ </p>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_overview">
+
+ <title>Overview of Impala Complex Types</title>
+
+ <conbody>
+
+ <p>
+<!--
+ Each <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, or <codeph>STRUCT</codeph> column can include multiple instances of scalar types
+ such as <codeph>BIGINT</codeph> and <codeph>STRING</codeph>.
+-->
+ The <codeph>ARRAY</codeph> and <codeph>MAP</codeph> types are closely related: they represent collections with arbitrary numbers of
+ elements, where each element is the same type. In contrast, <codeph>STRUCT</codeph> groups together a fixed number of items into a
+ single element. The parts of a <codeph>STRUCT</codeph> element (the <term>fields</term>) can be of different types, and each field
+ has a name.
+ </p>
+
+ <p>
+ The elements of an <codeph>ARRAY</codeph> or <codeph>MAP</codeph>, or the fields of a <codeph>STRUCT</codeph>, can also be other
+ complex types. You can construct elaborate data structures with up to 100 levels of nesting. For example, you can make an
+ <codeph>ARRAY</codeph> whose elements are <codeph>STRUCT</codeph>s. Within each <codeph>STRUCT</codeph>, you can have some fields
+ that are <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, or another kind of <codeph>STRUCT</codeph>. The Impala documentation uses the
+ terms complex and nested types interchangeably; for simplicity, it primarily uses the term complex types, to encompass all the
+ properties of these types.
+ </p>
+
+ <p>
+ When visualizing your data model in familiar SQL terms, you can think of each <codeph>ARRAY</codeph> or <codeph>MAP</codeph> as a
+ miniature table, and each <codeph>STRUCT</codeph> as a row within such a table. By default, the table represented by an
+ <codeph>ARRAY</codeph> has two columns, <codeph>POS</codeph> to represent ordering of elements, and <codeph>ITEM</codeph>
+ representing the value of each element. Likewise, by default, the table represented by a <codeph>MAP</codeph> encodes key-value
+ pairs, and therefore has two columns, <codeph>KEY</codeph> and <codeph>VALUE</codeph>.
+<!--
+ When you use a <codeph>STRUCT</codeph> as an
+ <codeph>ARRAY</codeph> element or the <codeph>VALUE</codeph> part of a <codeph>MAP</codeph>, the field names of the
+ <codeph>STRUCT</codeph> become additional columns in the result set.
+-->
+ </p>
+
+ <p>
+ The <codeph>ITEM</codeph> and <codeph>VALUE</codeph> names are only required for the very simplest kinds of <codeph>ARRAY</codeph>
+ and <codeph>MAP</codeph> columns, ones that hold only scalar values. When the elements within the <codeph>ARRAY</codeph> or
+ <codeph>MAP</codeph> are of type <codeph>STRUCT</codeph> rather than a scalar type, then the result set contains columns with names
+ corresponding to the <codeph>STRUCT</codeph> fields rather than <codeph>ITEM</codeph> or <codeph>VALUE</codeph>.
+ </p>
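+
+      <p>
+        As a concrete sketch of this tabular model, the following hypothetical table (the table and column names are illustrative, not from
+        any shipped sample schema) declares one column of each complex type, with a comment noting the pseudocolumns or field names you use
+        when querying each one:
+      </p>
+
+<codeblock>
+-- Hypothetical table; names are illustrative only.
+CREATE TABLE complex_demo
+(
+  id BIGINT
+  , a ARRAY &lt;STRING&gt;                  -- queried through the ITEM and POS pseudocolumns
+  , m MAP &lt;STRING,BIGINT&gt;             -- queried through the KEY and VALUE pseudocolumns
+  , s STRUCT &lt;city: STRING, zip: INT&gt; -- queried through the field names, such as s.city
+) STORED AS PARQUET;
+</codeblock>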
+
+<!--
+ <p>
+ <codeph>ARRAY</codeph> and <codeph>MAP</codeph> are both <term>collection</term> types, which can have a variable number of
+ elements; <codeph>ARRAY</codeph> and <codeph>MAP</codeph> are typically used as the top-level type of a table column.
+ <codeph>STRUCT</codeph> represents a single element and has a fixed number of fields; <codeph>STRUCT</codeph> is typically used as
+ the final, lowest level of a nested type definition.
+ </p>
+-->
+
+ <p>
+ You write most queries that process complex type columns using familiar join syntax, even though the data for both sides of the join
+ resides in a single table. The join notation brings together the scalar values from a row with the values from the complex type
+ columns for that same row. The final result set contains all scalar values, allowing you to do all the familiar filtering,
+ aggregation, ordering, and so on for the complex data entirely in SQL or using business intelligence tools that issue SQL queries.
+<!--
+ Instead of pulling together values from different tables, the join selects the specified values from both
+ the scalar columns, and from inside the complex type columns, producing a flattened result set consisting of all scalar values. When
+ doing a join query involving a complex type column, Impala derives the join key automatically, without the need to create additional
+ ID columns in the table.
+-->
+ </p>
+
+ <p>
+ Behind the scenes, Impala ensures that the processing for each row is done efficiently on a single host, without the network traffic
+ involved in broadcast or shuffle joins. The most common type of join query for tables with complex type columns is <codeph>INNER
+ JOIN</codeph>, which returns results only in those cases where the complex type contains some elements. Therefore, most query
+ examples in this section use either the <codeph>INNER JOIN</codeph> clause or the equivalent comma notation.
+ </p>
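+
+      <p>
+        The following sketch shows the shape of such a query. It assumes a hypothetical Parquet table <codeph>t</codeph> with a scalar
+        <codeph>id</codeph> column and an <codeph>ARRAY&lt;STRING&gt;</codeph> column <codeph>a</codeph>; the names are illustrative only:
+      </p>
+
+<codeblock>
+-- Join the scalar column with the elements of the array column from the same row.
+SELECT t.id, a.pos, a.item
+  FROM t INNER JOIN t.a a;
+
+-- Equivalent comma notation:
+SELECT t.id, a.pos, a.item
+  FROM t, t.a a;
+</codeblock>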
+
+ <note>
+ <p>
+ Although Impala can query complex types that are present in Parquet files, Impala currently cannot create new Parquet files
+ containing complex types. Therefore, the discussion and examples presume that you are working with existing Parquet data produced
+ through Hive, Spark, or some other source. See <xref href="#complex_types_ex_hive_etl"/> for examples of constructing Parquet data
+ files with complex type columns.
+ </p>
+
+ <p>
+ For learning purposes, you can create empty tables with complex type columns and practice query syntax, even if you do not have
+ sample data with the required structure.
+ </p>
+ </note>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_design">
+
+ <title>Design Considerations for Complex Types</title>
+
+ <conbody>
+
+ <p>
+ When planning to use Impala complex types, and designing the Impala schema, first learn how this kind of schema differs from
+ traditional table layouts from the relational database and data warehousing fields. Because you might have already encountered
+ complex types in a Hadoop context while using Hive for ETL, also learn how to write high-performance analytic queries for complex
+ type data using Impala SQL syntax.
+ </p>
+
+ <p outputclass="toc inpage"/>
+
+ </conbody>
+
+ <concept id="complex_types_vs_rdbms">
+
+ <title>How Complex Types Differ from Traditional Data Warehouse Schemas</title>
+
+ <conbody>
+
+ <p>
+ Complex types let you associate arbitrary data structures with a particular row. If you are familiar with schema design for
+ relational database management systems or data warehouses, a schema with complex types has the following differences:
+ </p>
+
+ <ul>
+ <li>
+ <p>
+ Logically, related values can now be grouped tightly together in the same table.
+ </p>
+
+ <p>
+ In traditional data warehousing, related values were typically arranged in one of two ways:
+ </p>
+ <ul>
+ <li>
+ <p>
+ Split across multiple normalized tables. Foreign key columns specified which rows from each table were associated with
+ each other. This arrangement avoided duplicate data and therefore the data was compact, but join queries could be
+ expensive because the related data had to be retrieved from separate locations. (In the case of distributed Hadoop
+ queries, the joined tables might even be transmitted between different hosts in a cluster.)
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Flattened into a single denormalized table. Although this layout eliminated some potential performance issues by removing
+ the need for join queries, the table typically became larger because values were repeated. The extra data volume could
+ cause performance issues in other parts of the workflow, such as longer ETL cycles or more expensive full-table scans
+ during queries.
+ </p>
+ </li>
+ </ul>
+ <p>
+ Complex types represent a middle ground that addresses these performance and volume concerns. By physically locating related
+ data within the same data files, complex types increase locality and reduce the expense of join queries. By associating an
+ arbitrary amount of data with a single row, complex types avoid the need to repeat lengthy values such as strings. Because
+ Impala knows which complex type values are associated with each row, you can save storage by avoiding artificial foreign key
+ values that are only used for joins. The flexibility of the <codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, and
+ <codeph>MAP</codeph> types lets you model familiar constructs such as fact and dimension tables from a data warehouse, and
+            wide tables representing sparse matrices.
+ </p>
+ </li>
+ </ul>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_physical">
+
+ <title>Physical Storage for Complex Types</title>
+
+ <conbody>
+
+ <p>
+ Physically, the scalar and complex columns in each row are located adjacent to each other in the same Parquet data file, ensuring
+ that they are processed on the same host rather than being broadcast across the network when cross-referenced within a query. This
+        co-location simplifies the process of copying, converting, and backing up all the columns at once. Because of the column-oriented
+ layout of Parquet files, you can still query only the scalar columns of a table without imposing the I/O penalty of reading the
+ (possibly large) values of the composite columns.
+ </p>
+
+ <p>
+ Within each Parquet data file, the constituent parts of complex type columns are stored in column-oriented format:
+ </p>
+
+ <ul>
+ <li>
+ <p>
+ Each field of a <codeph>STRUCT</codeph> type is stored like a column, with all the scalar values adjacent to each other and
+ encoded, compressed, and so on using the Parquet space-saving techniques.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ For an <codeph>ARRAY</codeph> containing scalar values, all those values (represented by the <codeph>ITEM</codeph>
+ pseudocolumn) are stored adjacent to each other.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ For a <codeph>MAP</codeph>, the values of the <codeph>KEY</codeph> pseudocolumn are stored adjacent to each other. If the
+ <codeph>VALUE</codeph> pseudocolumn is a scalar type, its values are also stored adjacent to each other.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ If an <codeph>ARRAY</codeph> element, <codeph>STRUCT</codeph> field, or <codeph>MAP</codeph> <codeph>VALUE</codeph> part is
+ another complex type, the column-oriented storage applies to the next level down (or the next level after that, and so on for
+ deeply nested types) where the final elements, fields, or values are of scalar types.
+ </p>
+ </li>
+ </ul>
+
+ <p>
+ The numbers represented by the <codeph>POS</codeph> pseudocolumn of an <codeph>ARRAY</codeph> are not physically stored in the
+ data files. They are synthesized at query time based on the order of the <codeph>ARRAY</codeph> elements associated with each row.
+ </p>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_file_formats">
+
+ <title>File Format Support for Impala Complex Types</title>
+
+ <conbody>
+
+ <p>
+ Currently, Impala queries support complex type data only in the Parquet file format. See <xref href="impala_parquet.xml#parquet"/>
+ for details about the performance benefits and physical layout of this file format.
+ </p>
+
+ <p>
+ Each table, or each partition within a table, can have a separate file format, and you can change file format at the table or
+ partition level through an <codeph>ALTER TABLE</codeph> statement. Because this flexibility makes it difficult to guarantee ahead
+ of time that all the data files for a table or partition are in a compatible format, Impala does not throw any errors when you
+ change the file format for a table or partition using <codeph>ALTER TABLE</codeph>. Any errors come at runtime when Impala
+ actually processes a table or partition that contains nested types and is not in one of the supported formats. If a query on a
+ partitioned table only processes some partitions, and all those partitions are in one of the supported formats, the query
+ succeeds.
+ </p>
+
+ <p>
+ Because Impala does not parse the data structures containing nested types for unsupported formats such as text, Avro,
+ SequenceFile, or RCFile, you cannot use data files in these formats with Impala, even if the query does not refer to the nested
+ type columns. Also, if a table using an unsupported format originally contained nested type columns, and then those columns were
+ dropped from the table using <codeph>ALTER TABLE ... DROP COLUMN</codeph>, any existing data files in the table still contain the
+ nested type data and Impala queries on that table will generate errors.
+ </p>
+
+ <p>
+ You can perform DDL operations (even <codeph>CREATE TABLE</codeph>) for tables involving complex types in file formats other than
+ Parquet. The DDL support lets you set up intermediate tables in your ETL pipeline, to be populated by Hive, before the final stage
+ where the data resides in a Parquet table and is queryable by Impala. Also, you can have a partitioned table with complex type
+ columns that uses a non-Parquet format, and use <codeph>ALTER TABLE</codeph> to change the file format to Parquet for individual
+ partitions. When you put Parquet data files into those partitions, Impala can execute queries against that data as long as the
+ query does not involve any of the non-Parquet partitions.
+ </p>
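+
+      <p>
+        A sketch of that sequence, using illustrative table and partition names, might look like this:
+      </p>
+
+<codeblock>
+-- The table starts out in text format, for use as a staging area by Hive.
+CREATE TABLE events (id BIGINT, tags ARRAY &lt;STRING&gt;)
+  PARTITIONED BY (year INT) STORED AS TEXTFILE;
+ALTER TABLE events ADD PARTITION (year = 2016);
+-- Switch this one partition to Parquet; then load Parquet data files into it.
+ALTER TABLE events PARTITION (year = 2016) SET FILEFORMAT PARQUET;
+</codeblock>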
+
+ <p>
+ If you use the <cmdname>parquet-tools</cmdname> command to examine the structure of a Parquet data file that includes complex
+ types, you see that both <codeph>ARRAY</codeph> and <codeph>MAP</codeph> are represented as a <codeph>Bag</codeph> in Parquet
+ terminology, with all fields marked <codeph>Optional</codeph> because Impala allows any column to be nullable.
+ </p>
+
+ <p>
+        Impala supports both 2-level and 3-level encoding within each Parquet data file. When constructing Parquet data files outside
+ Impala, use either encoding style but do not mix 2-level and 3-level encoding within the same data file.
+ </p>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_vs_normalization">
+
+ <title>Choosing Between Complex Types and Normalized Tables</title>
+
+ <conbody>
+
+ <p>
+ Choosing between multiple normalized fact and dimension tables, or a single table containing complex types, is an important design
+ decision.
+ </p>
+
+ <ul>
+ <li>
+ <p>
+ If you are coming from a traditional database or data warehousing background, you might be familiar with how to split up data
+ between tables. Your business intelligence tools might already be optimized for dealing with this kind of multi-table scenario
+ through join queries.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ If you are pulling data from Impala into an application written in a programming language that has data structures analogous
+ to the complex types, such as Python or Java, complex types in Impala could simplify data interchange and improve
+ understandability and reliability of your program logic.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ You might already be faced with existing infrastructure or receive high volumes of data that assume one layout or the other.
+            For example, complex types are popular with web-oriented applications, such as keeping information about an online user
+            all in one place for convenient lookup and analysis, or dealing with sparse or constantly evolving data fields.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ If some parts of the data change over time while related data remains constant, using multiple normalized tables lets you
+ replace certain parts of the data without reloading the entire data set. Conversely, if you receive related data all bundled
+ together, such as in JSON files, using complex types can save the overhead of splitting the related items across multiple
+ tables.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ From a performance perspective:
+ </p>
+ <ul>
+ <li>
+ <p>
+ In Parquet tables, Impala can skip columns that are not referenced in a query, avoiding the I/O penalty of reading the
+ embedded data. When complex types are nested within a column, the data is physically divided at a very granular level; for
+ example, a query referring to data nested multiple levels deep in a complex type column does not have to read all the data
+ from that column, only the data for the relevant parts of the column type hierarchy.
+<!-- Avro not supported in 5.5 / 2.3: Avro tables might experience some performance overhead due to
+ the need to skip past the complex type columns in each row when reading the data. -->
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Complex types avoid the possibility of expensive join queries when data from fact and dimension tables is processed in
+                parallel across multiple hosts. All the information for a row containing complex types typically resides in the same data
+ block, and therefore does not need to be transmitted across the network when joining fields that are all part of the same
+ row.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ The tradeoff with complex types is that fewer rows fit in each data block. Whether it is better to have more data blocks
+ with fewer rows, or fewer data blocks with many rows, depends on the distribution of your data and the characteristics of
+ your query workload. If the complex columns are rarely referenced, using them might lower efficiency. If you are seeing
+ low parallelism due to a small volume of data (relatively few data blocks) in each table partition, increasing the row
+ size by including complex columns might produce more data blocks and thus spread the work more evenly across the cluster.
+ See <xref href="impala_scalability.xml#scalability"/> for more on this advanced topic.
+ </p>
+ </li>
+ </ul>
+ </li>
+ </ul>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_hive">
+
+ <title>Differences Between Impala and Hive Complex Types</title>
+
+ <conbody>
+
+<!-- HiveQL functions like nested type constructors and posexplode(): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -->
+
+<!-- HiveQL complex types: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-ComplexTypes -->
+
+<!-- HiveQL lateral views: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView -->
+
+ <p>
+ Impala can query Parquet tables containing <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph> columns
+ produced by Hive. There are some differences to be aware of between the Impala SQL and HiveQL syntax for complex types, primarily
+ for queries.
+ </p>
+
+ <p>
+ The syntax for specifying <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph> types in a <codeph>CREATE
+ TABLE</codeph> statement is compatible between Impala and Hive.
+ </p>
+
+ <p>
+ Because Impala <codeph>STRUCT</codeph> columns include user-specified field names, you use the <codeph>NAMED_STRUCT()</codeph>
+ constructor in Hive rather than the <codeph>STRUCT()</codeph> constructor when you populate an Impala <codeph>STRUCT</codeph>
+ column using a Hive <codeph>INSERT</codeph> statement.
+ </p>
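+
+      <p>
+        For example, assuming a hypothetical table with a column declared as <codeph>s STRUCT&lt;city: STRING, zip: INT&gt;</codeph>
+        (the table and field names are illustrative), the Hive statement might look like:
+      </p>
+
+<codeblock>
+-- Run in Hive, not Impala. NAMED_STRUCT() takes alternating field names and values.
+INSERT INTO TABLE struct_demo
+  SELECT 1, NAMED_STRUCT('city', 'Anytown', 'zip', 12345);
+</codeblock>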
+
+ <p>
+ The Hive <codeph>UNION</codeph> type is not currently supported in Impala.
+ </p>
+
+ <p>
+ While Impala usually aims for a high degree of compatibility with HiveQL query syntax, Impala syntax differs from Hive for queries
+ involving complex types. The differences are intended to provide extra flexibility for queries involving these kinds of tables.
+ </p>
+
+ <ul>
+ <li>
+ Impala uses dot notation for referring to element names or elements within complex types, and join notation for
+ cross-referencing scalar columns with the elements of complex types within the same row, rather than the <codeph>LATERAL
+ VIEW</codeph> clause and <codeph>EXPLODE()</codeph> function of HiveQL.
+ </li>
+
+ <li>
+ Using join notation lets you use all the kinds of join queries with complex type columns. For example, you can use a
+ <codeph>LEFT OUTER JOIN</codeph>, <codeph>LEFT ANTI JOIN</codeph>, or <codeph>LEFT SEMI JOIN</codeph> query to evaluate
+ different scenarios where the complex columns do or do not contain any elements.
+ </li>
+
+ <li>
+ You can include references to collection types inside subqueries and inline views. For example, you can construct a
+ <codeph>FROM</codeph> clause where one of the <q>tables</q> is a subquery against a complex type column, or use a subquery
+ against a complex type column as the argument to an <codeph>IN</codeph> or <codeph>EXISTS</codeph> clause.
+ </li>
+
+ <li>
+ The Impala pseudocolumn <codeph>POS</codeph> lets you retrieve the position of elements in an array along with the elements
+ themselves, equivalent to the <codeph>POSEXPLODE()</codeph> function of HiveQL. You do not use index notation to retrieve a
+ single array element in a query; the join query loops through the array elements and you use <codeph>WHERE</codeph> clauses to
+ specify which elements to return.
+ </li>
+
+ <li>
+ <p>
+ Join clauses involving complex type columns do not require an <codeph>ON</codeph> or <codeph>USING</codeph> clause. Impala
+ implicitly applies the join key so that the correct array entries or map elements are associated with the correct row from the
+ table.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Impala does not currently support the <codeph>UNION</codeph> complex type.
+ </p>
+ </li>
+ </ul>
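+
+      <p>
+        To illustrate the last points, the following sketch (assuming a hypothetical table <codeph>t</codeph> with a
+        <codeph>BIGINT</codeph> column <codeph>id</codeph> and an <codeph>ARRAY&lt;STRING&gt;</codeph> column <codeph>a</codeph>)
+        retrieves the first array element of each row by filtering on <codeph>POS</codeph> rather than using index notation, and
+        needs no <codeph>ON</codeph> clause:
+      </p>
+
+<codeblock>
+SELECT t.id, a.item
+  FROM t, t.a a
+WHERE a.pos = 0;
+</codeblock>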
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_limits">
+
+ <title>Limitations and Restrictions for Complex Types</title>
+
+ <conbody>
+
+ <p>
+ Complex type columns can only be used in tables or partitions with the Parquet file format.
+ </p>
+
+ <p>
+ Complex type columns cannot be used as partition key columns in a partitioned table.
+ </p>
+
+ <p>
+ When you use complex types with the <codeph>ORDER BY</codeph>, <codeph>GROUP BY</codeph>, <codeph>HAVING</codeph>, or
+ <codeph>WHERE</codeph> clauses, you cannot refer to the column name by itself. Instead, you refer to the names of the scalar
+ values within the complex type, such as the <codeph>ITEM</codeph>, <codeph>POS</codeph>, <codeph>KEY</codeph>, or
+ <codeph>VALUE</codeph> pseudocolumns, or the field names from a <codeph>STRUCT</codeph>.
+ </p>
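+
+      <p>
+        For example, with a hypothetical table <codeph>t</codeph> containing an <codeph>ARRAY&lt;STRING&gt;</codeph> column
+        <codeph>a</codeph>, grouping and ordering refer to <codeph>a.ITEM</codeph> rather than to <codeph>a</codeph> itself:
+      </p>
+
+<codeblock>
+SELECT a.item, COUNT(*)
+  FROM t, t.a a
+GROUP BY a.item
+ORDER BY a.item;
+</codeblock>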
+
+ <p>
+ The maximum depth of nesting for complex types is 100 levels.
+ </p>
+
+ <p>
+ For ideal performance and scalability, use small or medium-sized collections, where all the complex columns contain at most a few
+ hundred megabytes per row. Remember, all the columns of a row are stored in the same HDFS data block, whose size in Parquet files
+ typically ranges from 256 MB to 1 GB.
+ </p>
+
+ <p>
+ Including complex type columns in a table introduces some overhead that might make queries that do not reference those columns
+ somewhat slower than Impala queries against tables without any complex type columns. Expect at most a 2x slowdown compared to
+ tables that do not have any complex type columns.
+ </p>
+
+ <p>
+ Currently, the <codeph>COMPUTE STATS</codeph> statement does not collect any statistics for columns containing complex types.
+ Impala uses heuristics to construct execution plans involving complex type columns.
+ </p>
+
+ <p>
+ Currently, Impala built-in functions and user-defined functions cannot accept complex types as parameters or produce them as
+ function return values. (When the complex type values are materialized in an Impala result set, the result set contains the scalar
+ components of the values, such as the <codeph>POS</codeph> or <codeph>ITEM</codeph> for an <codeph>ARRAY</codeph>, the
+ <codeph>KEY</codeph> or <codeph>VALUE</codeph> for a <codeph>MAP</codeph>, or the fields of a <codeph>STRUCT</codeph>; these
+ scalar data items <i>can</i> be used with built-in functions and UDFs as usual.)
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/complex_types_read_only"/>
+
+ <p>
+ Currently, Impala can query complex type columns only from Parquet tables or Parquet partitions within partitioned tables.
+ Although you can use complex types in tables with Avro, text, and other file formats as part of your ETL pipeline, for example as
+ intermediate tables populated through Hive, doing analytics through Impala requires that the data eventually ends up in a Parquet
+ table. The requirement for Parquet data files means that you can use complex types with Impala tables hosted on other kinds of
+ file storage systems such as Isilon and Amazon S3, but you cannot use Impala to query complex types from HBase tables. See
+ <xref href="impala_complex_types.xml#complex_types_file_formats"/> for more details.
+ </p>
+
+ </conbody>
+
+ </concept>
+
+ </concept>
+
+ <concept id="complex_types_using">
+
+ <title>Using Complex Types from SQL</title>
+
+ <conbody>
+
+ <p>
+      When using complex types through SQL in Impala, you learn the notation for <codeph>&lt; &gt;</codeph> delimiters for the complex
+ type columns in <codeph>CREATE TABLE</codeph> statements, and how to construct join queries to <q>unpack</q> the scalar values
+ nested inside the complex data structures. You might need to condense a traditional RDBMS or data warehouse schema into a smaller
+ number of Parquet tables, and use Hive, Spark, Pig, or other mechanism outside Impala to populate the tables with data.
+ </p>
+
+ <p outputclass="toc inpage"/>
+
+ </conbody>
+
+ <concept id="nested_types_ddl">
+
+ <title>Complex Type Syntax for DDL Statements</title>
+
+ <conbody>
+
+ <p>
+ The definition of <varname>data_type</varname>, as seen in the <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>
+ statements, now includes complex types in addition to primitive types:
+ </p>
+
+<codeblock> primitive_type
+| array_type
+| map_type
+| struct_type
+</codeblock>
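+
+      <p>
+        The complex type productions expand as follows:
+      </p>
+
+<codeblock>array_type: ARRAY &lt;data_type&gt;
+
+map_type: MAP &lt;primitive_type, data_type&gt;
+
+struct_type: STRUCT &lt;name : data_type [COMMENT 'comment_string'], ...&gt;
+</codeblock>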
+
+ <p>
+ Unions are not currently supported.
+ </p>
+
+ <p>
+ Array, struct, and map column type declarations are specified in the <codeph>CREATE TABLE</codeph> statement. You can also add or
+ change the type of complex columns through the <codeph>ALTER TABLE</codeph> statement.
+ </p>
+
+ <note>
+ <p>
+ Currently, Impala queries allow complex types only in tables that use the Parquet format. If an Impala query encounters complex
+ types in a table or partition using another file format, the query returns a runtime error.
+ </p>
+
+ <p>
+ The Impala DDL support for complex types works for all file formats, so that you can create tables using text or other
+ non-Parquet formats for Hive to use as staging tables in an ETL cycle that ends with the data in a Parquet table. You can also
+ use <codeph>ALTER TABLE ... SET FILEFORMAT PARQUET</codeph> to change the file format of an existing table containing complex
+ types to Parquet, after which Impala can query it. Make sure to load Parquet files into the table after changing the file
+ format, because the <codeph>ALTER TABLE ... SET FILEFORMAT</codeph> statement does not convert existing data to the new file
+ format.
+ </p>
+ </note>
+
+ <p conref="../shared/impala_common.xml#common/complex_types_partitioning"/>
+
+ <p>
+ Because use cases for Impala complex types require that you already have Parquet data files produced outside of Impala, you can
+ use the Impala <codeph>CREATE TABLE LIKE PARQUET</codeph> syntax to produce a table with columns that match the structure of an
+ existing Parquet file, including complex type columns for nested data structures. Remember to include the <codeph>STORED AS
+ PARQUET</codeph> clause in this case, because even with <codeph>CREATE TABLE LIKE PARQUET</codeph>, the default file format of the
+ resulting table is still text.
+ </p>
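+
+      <p>
+        For example (the HDFS path here is hypothetical; substitute the location of an actual Parquet data file with complex type
+        columns):
+      </p>
+
+<codeblock>
+CREATE TABLE new_complex_table LIKE PARQUET '/user/hive/staging/complex_data.parquet'
+  STORED AS PARQUET;
+</codeblock>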
+
+ <p>
+ Because the complex columns are omitted from the result set of an Impala <codeph>SELECT *</codeph> or <codeph>SELECT
+ <varname>col_name</varname></codeph> query, and because Impala currently does not support writing Parquet files with complex type
+ columns, you cannot use the <codeph>CREATE TABLE AS SELECT</codeph> syntax to create a table with nested type columns.
+ </p>
+
+ <note>
+ <p>
+ Once you have a table set up with complex type columns, use the <codeph>DESCRIBE</codeph> and <codeph>SHOW CREATE TABLE</codeph>
+          statements to see the correct notation with <codeph>&lt;</codeph> and <codeph>&gt;</codeph> delimiters and comma and colon
+ separators within the complex type definitions. If you do not have existing data with the same layout as the table, you can
+ query the empty table to practice with the notation for the <codeph>SELECT</codeph> statement. In the <codeph>SELECT</codeph>
+ list, you use dot notation and pseudocolumns such as <codeph>ITEM</codeph>, <codeph>KEY</codeph>, and <codeph>VALUE</codeph> for
+ referring to items within the complex type columns. In the <codeph>FROM</codeph> clause, you use join notation to construct
+ table aliases for any referenced <codeph>ARRAY</codeph> and <codeph>MAP</codeph> columns.
+ </p>
+ </note>
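+
+ <p>
+ For example, using the <codeph>CONTACTS_DETAILED_PHONES</codeph> table defined later in this topic, you could practice the
+ join and dot notation against the empty table before any data is loaded:
+ </p>
+
+<codeblock>SELECT name, p.category, p.area_code
+ FROM contacts_detailed_phones, contacts_detailed_phones.phone p;
+</codeblock>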
+
+<!-- To do: show some simple CREATE TABLE statements for each of the complex types, without so much backstory for the schema. -->
+
+ <p>
+ For example, when defining a table that holds contact information, you might represent phone numbers differently depending on the
+ expected layout and relationships of the data, and how well you can predict those properties in advance.
+ </p>
+
+ <p>
+ Here are different ways that you might represent phone numbers in a traditional relational schema, with equivalent representations
+ using complex types.
+ </p>
+
+ <fig id="complex_types_phones_flat_fixed">
+
+ <title>Traditional Relational Representation of Phone Numbers: Single Table</title>
+
+ <p>
+ The traditional, simplest way to represent phone numbers in a relational table is to store all contact info in a single table,
+ with all columns having scalar types, and each potential phone number represented as a separate column. In this example, each
+ person can have only these three types of phone numbers. If the person does not have a particular kind of phone number, the
+ corresponding column is <codeph>NULL</codeph> for that row.
+ </p>
+
+<codeblock>
+CREATE TABLE contacts_fixed_phones
+(
+ id BIGINT
+ , name STRING
+ , address STRING
+ , home_phone STRING
+ , work_phone STRING
+ , mobile_phone STRING
+) STORED AS PARQUET;
+</codeblock>
+
+ </fig>
+
+ <fig id="complex_types_phones_array">
+
+ <title>An Array of Phone Numbers</title>
+
+ <p>
+ Using a complex type column to represent the phone numbers adds some extra flexibility. Now there could be an unlimited number
+ of phone numbers. Because the array elements have an order but not symbolic names, you could decide in advance that
+ phone_number[0] is the home number, [1] is the work number, [2] is the mobile number, and so on. (In subsequent examples, you
+ will see how to create a more flexible naming scheme using other complex type variations, such as a <codeph>MAP</codeph> or an
+ <codeph>ARRAY</codeph> where each element is a <codeph>STRUCT</codeph>.)
+ </p>
+
+<codeblock><![CDATA[
+CREATE TABLE contacts_array_of_phones
+(
+ id BIGINT
+ , name STRING
+ , address STRING
+ , phone_number ARRAY < STRING >
+) STORED AS PARQUET;
+]]>
+</codeblock>
+
+ </fig>
+
+ <fig id="complex_types_phones_map">
+
+ <title>A Map of Phone Numbers</title>
+
+ <p>
+ Another way to represent an arbitrary set of phone numbers is with a <codeph>MAP</codeph> column. With a <codeph>MAP</codeph>,
+ each element is associated with a key value that you specify, which could be a numeric, string, or other scalar type. This
+ example uses a <codeph>STRING</codeph> key to give each phone number a name, such as <codeph>'home'</codeph> or
+ <codeph>'mobile'</codeph>. A query could filter the data based on the key values, or display the key values in reports.
+ </p>
+
+<codeblock><![CDATA[
+CREATE TABLE contacts_unlimited_phones
+(
+ id BIGINT, name STRING, address STRING, phone_number MAP < STRING,STRING >
+) STORED AS PARQUET;
+]]>
+</codeblock>
+
+ </fig>
+
+ <fig id="complex_types_phones_flat_normalized">
+
+ <title>Traditional Relational Representation of Phone Numbers: Normalized Tables</title>
+
+ <p>
+ If you are an experienced database designer, you already know how to work around the limitations of the single-table schema from
+ <xref href="#nested_types_ddl/complex_types_phones_flat_fixed"/>. By normalizing the schema, with the phone numbers in their own
+ table, you can associate an arbitrary set of phone numbers with each person, and associate additional details with each phone
+ number, such as whether it is a home, work, or mobile phone.
+ </p>
+
+ <p>
+ The flexibility of this approach comes with some drawbacks. Reconstructing all the data for a particular person requires a join
+ query, which might require performance tuning on Hadoop because the data from each table might be transmitted from a different
+ host. Data management tasks such as backups and refreshing the data require dealing with multiple tables instead of a single
+ table.
+ </p>
+
+ <p>
+ This example illustrates a traditional database schema that stores contact info normalized across two tables. The fact table
+ establishes the identity and basic information about each person. A dimension table stores information only about phone numbers,
+ using an ID value to associate each phone number with a person ID from the fact table. Each person can have 0, 1, or many
+ phones; the categories are not restricted to a few predefined ones; and the phone table can contain as many columns as desired,
+ to represent all sorts of details about each phone number.
+ </p>
+
+<codeblock>
+CREATE TABLE fact_contacts (id BIGINT, name STRING, address STRING) STORED AS PARQUET;
+CREATE TABLE dim_phones
+(
+ contact_id BIGINT
+ , category STRING
+ , international_code STRING
+ , area_code STRING
+ , exchange STRING
+ , extension STRING
+ , mobile BOOLEAN
+ , carrier STRING
+ , current BOOLEAN
+ , service_start_date TIMESTAMP
+ , service_end_date TIMESTAMP
+)
+STORED AS PARQUET;
+</codeblock>
+
+ </fig>
+
+ <fig id="complex_types_phones_array_struct">
+
+ <title>Phone Numbers Represented as an Array of Structs</title>
+
+ <p>
+ To represent a schema equivalent to the one from <xref href="#nested_types_ddl/complex_types_phones_flat_normalized"/> using
+ complex types, this example uses an <codeph>ARRAY</codeph> where each array element is a <codeph>STRUCT</codeph>. As with the
+ earlier complex type examples, each person can have an arbitrary set of associated phone numbers. Making each array element into
+ a <codeph>STRUCT</codeph> lets us associate multiple data items with each phone number, and give a separate name and type to
+ each data item. The <codeph>STRUCT</codeph> fields of the <codeph>ARRAY</codeph> elements reproduce the columns of the dimension
+ table from the previous example.
+ </p>
+
+ <p>
+ You can do all the same kinds of queries with the complex type schema as with the normalized schema from the previous example.
+ The advantages of the complex type design are in the areas of convenience and performance. Now your backup and ETL processes
+ only deal with a single table. When a query uses a join to cross-reference the information about a person with their associated
+ phone numbers, all the relevant data for each row resides in the same HDFS data block, meaning each row can be processed on a
+ single host without requiring network transmission.
+ </p>
+
+<codeblock><![CDATA[
+CREATE TABLE contacts_detailed_phones
+(
+ id BIGINT, name STRING, address STRING
+ , phone ARRAY < STRUCT <
+ category: STRING
+ , international_code: STRING
+ , area_code: STRING
+ , exchange: STRING
+ , extension: STRING
+ , mobile: BOOLEAN
+ , carrier: STRING
+ , current: BOOLEAN
+ , service_start_date: TIMESTAMP
+ , service_end_date: TIMESTAMP
+ >>
+) STORED AS PARQUET;
+]]>
+</codeblock>
+
+ </fig>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_sql">
+
+ <title>SQL Statements that Support Complex Types</title>
+
+ <conbody>
+
+ <p>
+ The Impala SQL statements that support complex types are currently
+ <codeph><xref href="impala_create_table.xml#create_table">CREATE TABLE</xref></codeph>,
+ <codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE</xref></codeph>,
+ <codeph><xref href="impala_describe.xml#describe">DESCRIBE</xref></codeph>,
+ <codeph><xref href="impala_load_data.xml#load_data">LOAD DATA</xref></codeph>, and
+ <codeph><xref href="impala_select.xml#select">SELECT</xref></codeph>. That is, currently Impala can create or alter tables
+ containing complex type columns, examine the structure of a table containing complex type columns, import existing data files
+ containing complex type columns into a table, and query Parquet tables containing complex types.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/complex_types_read_only"/>
+
+ <p outputclass="toc inpage"/>
+
+ </conbody>
+
+ <concept id="complex_types_ddl">
+
+ <title>DDL Statements and Complex Types</title>
+
+ <conbody>
+
+ <p>
+ Column specifications for complex or nested types use <codeph><</codeph> and <codeph>></codeph> delimiters:
+ </p>
+
+<codeblock><![CDATA[-- What goes inside the < > for an ARRAY is a single type, either a scalar or another
+-- complex type (ARRAY, STRUCT, or MAP).
+CREATE TABLE array_t
+(
+ id BIGINT,
+ a1 ARRAY <STRING>,
+ a2 ARRAY <BIGINT>,
+ a3 ARRAY <TIMESTAMP>,
+ a4 ARRAY <STRUCT <f1: STRING, f2: INT, f3: BOOLEAN>>
+)
+STORED AS PARQUET;
+
+-- What goes inside the < > for a MAP is two comma-separated types specifying the types of the key-value pair:
+-- a scalar type representing the key, and a scalar or complex type representing the value.
+CREATE TABLE map_t
+(
+ id BIGINT,
+ m1 MAP <STRING, STRING>,
+ m2 MAP <STRING, BIGINT>,
+ m3 MAP <BIGINT, STRING>,
+ m4 MAP <BIGINT, BIGINT>,
+ m5 MAP <STRING, ARRAY <STRING>>
+)
+STORED AS PARQUET;
+
+-- What goes inside the < > for a STRUCT is a comma-separated list of fields, each field defined as
+-- name:type. The type can be a scalar or a complex type. The field names for each STRUCT do not clash
+-- with the names of table columns or fields in other STRUCTs. A STRUCT is most often used inside
+-- an ARRAY or a MAP rather than as a top-level column.
+CREATE TABLE struct_t
+(
+ id BIGINT,
+ s1 STRUCT <f1: STRING, f2: BIGINT>,
+ s2 ARRAY <STRUCT <f1: INT, f2: TIMESTAMP>>,
+ s3 MAP <BIGINT, STRUCT <name: STRING, birthday: TIMESTAMP>>
+)
+STORED AS PARQUET;
+]]>
+</codeblock>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_queries">
+
+ <title>Queries and Complex Types</title>
+
+ <conbody>
+
+<!-- Hive does the JSON output business: http://www.datascience-labs.com/hive/hiveql-data-manipulation/ -->
+
+<!-- SELECT * works but skips any nested type coloumns. -->
+
+ <p>
+ The result set of an Impala query always contains only scalar values; the elements and fields within any complex type column
+ must be <q>unpacked</q> using join queries. A query cannot directly retrieve the entire value for a complex type column; Impala
+ returns an error in this case. Queries using <codeph>SELECT *</codeph> are allowed for tables with complex types, but the
+ columns with complex types are skipped.
+ </p>
+
+ <p>
+ The following example shows how referring directly to a complex type column returns an error, while <codeph>SELECT *</codeph> on
+ the same table succeeds, but only retrieves the scalar columns.
+ </p>
+
+ <note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/>
+
+<!-- Original error message:
+ERROR: AnalysisException: Expr 'c_orders' in select list returns a complex type 'ARRAY<STRUCT<o_orderkey:BIGINT,o_orderstatus:STRING,o_totalprice:DECIMAL(12,2),o_orderdate:STRING,o_orderpriority:STRING,o_clerk:STRING,o_shippriority:INT,o_comment:STRING,o_lineitems:ARRAY<STRUCT<l_partkey:BIGINT,l_suppkey:BIGINT,l_linenumber:INT,l_quantity:DECIMAL(12,2),l_extendedprice:DECIMAL(12,2),l_discount:DECIMAL(12,2),l_tax:DECIMAL(12,2),l_returnflag:STRING,l_linestatus:STRING,l_shipdate:STRING,l_commitdate:STRING,l_receiptdate:STRING,l_shipinstruct:STRING,l_shipmode:STRING,l_comment:STRING>>>>'.
+-->
+
+<codeblock><![CDATA[SELECT c_orders FROM customer LIMIT 1;
+ERROR: AnalysisException: Expr 'c_orders' in select list returns a complex type 'ARRAY<STRUCT<o_orderkey:BIGINT,o_orderstatus:STRING, ... l_receiptdate:STRING,l_shipinstruct:STRING,l_shipmode:STRING,l_comment:STRING>>>>'.
+Only scalar types are allowed in the select list.
+
+-- Original column has several scalar and one complex column.
+DESCRIBE customer;
++--------------+------------------------------------+
+| name | type |
++--------------+------------------------------------+
+| c_custkey | bigint |
+| c_name | string |
+...
+| c_orders | array<struct< |
+| | o_orderkey:bigint, |
+| | o_orderstatus:string, |
+| | o_totalprice:decimal(12,2), |
+...
+| | >> |
++--------------+------------------------------------+
+
+-- When we SELECT * from that table, only the scalar columns come back in the result set.
+CREATE TABLE select_star_customer STORED AS PARQUET AS SELECT * FROM customer;
++------------------------+
+| summary |
++------------------------+
+| Inserted 150000 row(s) |
++------------------------+
+
+-- The c_orders column, being of complex type, was not included in the SELECT * result set.
+DESC select_star_customer;
++--------------+---------------+
+| name | type |
++--------------+---------------+
+| c_custkey | bigint |
+| c_name | string |
+| c_address | string |
+| c_nationkey | smallint |
+| c_phone | string |
+| c_acctbal | decimal(12,2) |
+| c_mktsegment | string |
+| c_comment | string |
++--------------+---------------+
+]]>
+</codeblock>
+
+<!-- To do: These "references to..." bits could be promoted to their own 'expressions' subheads. -->
+
+ <p>
+ References to fields within <codeph>STRUCT</codeph> columns use dot notation. If the field name is unambiguous, you can omit
+ qualifiers such as table name, column name, or even the <codeph>ITEM</codeph> or <codeph>VALUE</codeph> pseudocolumn names for
+ <codeph>STRUCT</codeph> elements inside an <codeph>ARRAY</codeph> or a <codeph>MAP</codeph>.
+ </p>
+
+<!-- To do: rewrite example to use CUSTOMER table. -->
+
+<!-- Don't think TPC-H schema has a bare STRUCT to use in such a simple query though. -->
+
+<!-- Perhaps reuse the STRUCT_DEMO example here. -->
+
+<codeblock>SELECT id, address.city FROM customers WHERE address.zip = 94305;
+</codeblock>
+
+ <p>
+ References to elements within <codeph>ARRAY</codeph> columns use the <codeph>ITEM</codeph> pseudocolumn:
+ </p>
+
+<!-- To do: shorten qualified names. -->
+
+<codeblock>SELECT r_name, r_nations.item.n_name FROM region, region.r_nations LIMIT 7;
++--------+----------------+
+| r_name | item.n_name |
++--------+----------------+
+| EUROPE | UNITED KINGDOM |
+| EUROPE | RUSSIA |
+| EUROPE | ROMANIA |
+| EUROPE | GERMANY |
+| EUROPE | FRANCE |
+| ASIA | VIETNAM |
+| ASIA | CHINA |
++--------+----------------+
+</codeblock>
+
+ <p>
+ References to fields within <codeph>MAP</codeph> columns use the <codeph>KEY</codeph> and <codeph>VALUE</codeph> pseudocolumns.
+ In this example, once the query establishes the alias <codeph>MAP_FIELD</codeph> for a <codeph>MAP</codeph> column with a
+ <codeph>STRING</codeph> key and an <codeph>INT</codeph> value, the query can refer to <codeph>MAP_FIELD.KEY</codeph> and
+ <codeph>MAP_FIELD.VALUE</codeph>, which have zero, one, or many instances for each row from the containing table.
+ </p>
+
+<codeblock><![CDATA[DESCRIBE table_0;
++---------+-----------------------+
+| name | type |
++---------+-----------------------+
+| field_0 | string |
+| field_1 | map<string,int> |
+...
+
+SELECT field_0, map_field.key, map_field.value
+ FROM table_0, table_0.field_1 AS map_field
+WHERE length(field_0) = 1
+LIMIT 10;
++---------+-----------+-------+
+| field_0 | key | value |
++---------+-----------+-------+
+| b | gshsgkvd | NULL |
+| b | twrtcxj6 | 18 |
+| b | 2vp5 | 39 |
+| b | fh0s | 13 |
+| v | 2 | 41 |
+| v | 8b58mz | 20 |
+| v | hw | 16 |
+| v | 65l388pyt | 29 |
+| v | 03k68g91z | 30 |
+| v | r2hlg5b | NULL |
++---------+-----------+-------+
+]]>
+</codeblock>
+
+<!-- To do: refer to or reuse examples from the other subtopics that discuss pseudocolumns etc. -->
+
+ <p>
+ When complex types are nested inside each other, you use a combination of joins, pseudocolumn names, and dot notation to refer
+ to specific fields at the appropriate level. This is the most frequent form of query syntax for complex columns, because the
+ typical use case involves two levels of complex types, such as an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> elements.
+ </p>
+
+<!-- To do: rewrite example to use CUSTOMER table. -->
+
+<!-- This is my own manufactured example so I have the table, and the query works, but I don't have sample data to show. -->
+
+<codeblock>SELECT id, phone_numbers.area_code FROM contact_info_many_structs INNER JOIN contact_info_many_structs.phone_numbers phone_numbers LIMIT 3;
+</codeblock>
+
+ <p>
+ You can express relationships between <codeph>ARRAY</codeph> and <codeph>MAP</codeph> columns at different levels as joins. You
+ include comparison operators between fields at the top level and within the nested type columns so that Impala can do the
+ appropriate join operation.
+ </p>
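+
+ <p>
+ For example, with a hypothetical <codeph>CUSTOMERS</codeph> table where each row includes an <codeph>ORDERS</codeph> array of
+ <codeph>STRUCT</codeph> elements, a comparison between a top-level column and a field inside the array acts as an explicit
+ join condition:
+ </p>
+
+<codeblock><![CDATA[-- Hypothetical schema: customers(id BIGINT, preferred_cc STRING,
+-- orders ARRAY <STRUCT <txn_id: BIGINT, cc: STRING>>).
+-- Return only the orders paid with each customer's preferred credit card.
+SELECT o.txn_id FROM customers c, c.orders o WHERE o.cc = c.preferred_cc;
+]]>
+</codeblock>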
+
+<!-- Don't think TPC-H schema has any instances where outer field matches up with inner one though. -->
+
+<!-- But don't think this usage is important enough to call out at this early point. Hide the example for now. -->
+
+<!--
+<codeblock>SELECT o.txn_id FROM customers c, c.orders o WHERE o.cc = c.preferred_cc;
+SELECT c.id, o.txn_id FROM customers c, c.orders o;
+</codeblock>
+-->
+
+<!-- To do: move these examples down, to the examples subtopic at the end. -->
+
+ <note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/>
+
+ <p>
+ For example, the following queries work equivalently. They each return customer and order data for customers that have at least
+ one order.
+ </p>
+
+<codeblock>SELECT c.c_name, o.o_orderkey FROM customer c, c.c_orders o LIMIT 5;
++--------------------+------------+
+| c_name | o_orderkey |
++--------------------+------------+
+| Customer#000072578 | 558821 |
+| Customer#000072578 | 2079810 |
+| Customer#000072578 | 5768068 |
+| Customer#000072578 | 1805604 |
+| Customer#000072578 | 3436389 |
++--------------------+------------+
+
+SELECT c.c_name, o.o_orderkey FROM customer c INNER JOIN c.c_orders o LIMIT 5;
++--------------------+------------+
+| c_name | o_orderkey |
++--------------------+------------+
+| Customer#000072578 | 558821 |
+| Customer#000072578 | 2079810 |
+| Customer#000072578 | 5768068 |
+| Customer#000072578 | 1805604 |
+| Customer#000072578 | 3436389 |
++--------------------+------------+
+</codeblock>
+
+ <p>
+ The following query using an outer join returns customers that have orders, plus customers with no orders (no entries in the
+ <codeph>C_ORDERS</codeph> array):
+ </p>
+
+<codeblock><![CDATA[SELECT c.c_custkey, o.o_orderkey
+ FROM customer c LEFT OUTER JOIN c.c_orders o
+LIMIT 5;
++-----------+------------+
+| c_custkey | o_orderkey |
++-----------+------------+
+| 60210 | NULL |
+| 147873 | NULL |
+| 72578 | 558821 |
+| 72578 | 2079810 |
+| 72578 | 5768068 |
++-----------+------------+
+]]>
+</codeblock>
+
+ <p>
+ The following query returns <i>only</i> customers that have no orders. (With <codeph>LEFT ANTI JOIN</codeph> or <codeph>LEFT
+ SEMI JOIN</codeph>, the query can only refer to columns from the left-hand table, because by definition there is no matching
+ information in the right-hand table.)
+ </p>
+
+<codeblock><![CDATA[SELECT c.c_custkey, c.c_name
+ FROM customer c LEFT ANTI JOIN c.c_orders o
+LIMIT 5;
++-----------+--------------------+
+| c_custkey | c_name |
++-----------+--------------------+
+| 60210 | Customer#000060210 |
+| 147873 | Customer#000147873 |
+| 141576 | Customer#000141576 |
+| 85365 | Customer#000085365 |
+| 70998 | Customer#000070998 |
++-----------+--------------------+
+]]>
+</codeblock>
+
+<!-- To do: promote the correlated subquery aspect into its own subtopic. -->
+
+ <p>
+ You can also perform correlated subqueries to examine the properties of complex type columns for each row in the result set.
+ </p>
+
+ <p>
+ Count the number of orders per customer. Note the correlated reference to the table alias <codeph>C</codeph>. The
+ <codeph>COUNT(*)</codeph> operation applies to all the elements of the <codeph>C_ORDERS</codeph> array for the corresponding
+ row, avoiding the need for a <codeph>GROUP BY</codeph> clause.
+ </p>
+
+<codeblock>SELECT c_name, howmany FROM customer c, (SELECT COUNT(*) howmany FROM c.c_orders) v LIMIT 5;
++--------------------+---------+
+| c_name | howmany |
++--------------------+---------+
+| Customer#000030065 | 15 |
+| Customer#000065455 | 18 |
+| Customer#000113644 | 21 |
+| Customer#000111078 | 0 |
+| Customer#000024621 | 0 |
++--------------------+---------+
+</codeblock>
+
+ <p>
+ Count the number of orders per customer, ignoring any customers that have not placed any orders:
+ </p>
+
+<codeblock>SELECT c_name, howmany_orders
+FROM
+ customer c,
+ (SELECT COUNT(*) howmany_orders FROM c.c_orders) subq1
+WHERE howmany_orders > 0
+LIMIT 5;
++--------------------+----------------+
+| c_name | howmany_orders |
++--------------------+----------------+
+| Customer#000072578 | 7 |
+| Customer#000046378 | 26 |
+| Customer#000069815 | 11 |
+| Customer#000079058 | 12 |
+| Customer#000092239 | 26 |
++--------------------+----------------+
+</codeblock>
+
+ <p>
+ Count the number of line items in each customer's orders. The reference to <codeph>C.C_ORDERS</codeph> in the <codeph>FROM</codeph> clause
+ is needed because the <codeph>O_ORDERKEY</codeph> field is a member of the elements in the <codeph>C_ORDERS</codeph> array. The
+ subquery labelled <codeph>SUBQ1</codeph> is correlated: it is re-evaluated for the <codeph>C_ORDERS.O_LINEITEMS</codeph> array
+ from each row of the <codeph>CUSTOMER</codeph> table, so the count covers all line items across that customer's orders.
+ </p>
+
+<codeblock>SELECT c_name, o_orderkey, howmany_line_items
+FROM
+ customer c,
+ c.c_orders t2,
+ (SELECT COUNT(*) howmany_line_items FROM c.c_orders.o_lineitems) subq1
+WHERE howmany_line_items > 0
+LIMIT 5;
++--------------------+------------+--------------------+
+| c_name | o_orderkey | howmany_line_items |
++--------------------+------------+--------------------+
+| Customer#000020890 | 1884930 | 95 |
+| Customer#000020890 | 4570754 | 95 |
+| Customer#000020890 | 3771072 | 95 |
+| Customer#000020890 | 2555489 | 95 |
+| Customer#000020890 | 919171 | 95 |
++--------------------+------------+--------------------+
+</codeblock>
+
+ <p>
+ Get the number of orders, the average order price, and the maximum items in any order per customer. For this example, the
+ subqueries labelled <codeph>SUBQ1</codeph> and <codeph>SUBQ2</codeph> are correlated: they are re-evaluated for each row from
+ the original <codeph>CUSTOMER</codeph> table, and only apply to the complex columns associated with that row.
+ </p>
+
+<codeblock>SELECT c_name, howmany, average_price, most_items
+FROM
+ customer c,
+ (SELECT COUNT(*) howmany, AVG(o_totalprice) average_price FROM c.c_orders) subq1,
+ (SELECT MAX(l_quantity) most_items FROM c.c_orders.o_lineitems ) subq2
+LIMIT 5;
++--------------------+---------+---------------+------------+
+| c_name | howmany | average_price | most_items |
++--------------------+---------+---------------+------------+
+| Customer#000030065 | 15 | 128908.34 | 50.00 |
+| Customer#000088191 | 0 | NULL | NULL |
+| Customer#000101555 | 10 | 164250.31 | 50.00 |
+| Customer#000022092 | 0 | NULL | NULL |
+| Customer#000036277 | 27 | 166040.06 | 50.00 |
++--------------------+---------+---------------+------------+
+</codeblock>
+
+ <p>
+ For example, these queries show how to access information about the <codeph>ARRAY</codeph> elements within the
+ <codeph>CUSTOMER</codeph> table from the <q>nested TPC-H</q> schema, starting with the initial <codeph>ARRAY</codeph> elements
+ and progressing to examine the <codeph>STRUCT</codeph> fields of the <codeph>ARRAY</codeph>, and then the elements nested within
+ another <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>:
+ </p>
+
+<codeblock><![CDATA[-- How many orders does each customer have?
+-- The type of the ARRAY column doesn't matter; this is just counting the elements.
+SELECT c_custkey, count(*)
+ FROM customer, customer.c_orders
+GROUP BY c_custkey
+LIMIT 5;
++-----------+----------+
+| c_custkey | count(*) |
++-----------+----------+
+| 61081 | 21 |
+| 115987 | 15 |
+| 69685 | 19 |
+| 109124 | 15 |
+| 50491 | 12 |
++-----------+----------+
+
+-- How many line items are part of each customer order?
+-- Now we examine a field from a STRUCT nested inside the ARRAY.
+SELECT c_custkey, c_orders.o_orderkey, count(*)
+ FROM customer, customer.c_orders c_orders, c_orders.o_lineitems
+GROUP BY c_custkey, c_orders.o_orderkey
+LIMIT 5;
++-----------+------------+----------+
+| c_custkey | o_orderkey | count(*) |
++-----------+------------+----------+
+| 63367 | 4985959 | 7 |
+| 53989 | 1972230 | 2 |
+| 143513 | 5750498 | 5 |
+| 17849 | 4857989 | 1 |
+| 89881 | 1046437 | 1 |
++-----------+------------+----------+
+
+-- What are the line items in each customer order?
+-- One of the STRUCT fields inside the ARRAY is another
+-- ARRAY containing STRUCT elements. The join finds
+-- all the related items from both levels of ARRAY.
+SELECT c_custkey, o_orderkey, l_partkey
+ FROM customer, customer.c_orders, c_orders.o_lineitems
+LIMIT 5;
++-----------+------------+-----------+
+| c_custkey | o_orderkey | l_partkey |
++-----------+------------+-----------+
+| 113644 | 2738497 | 175846 |
+| 113644 | 2738497 | 27309 |
+| 113644 | 2738497 | 175873 |
+| 113644 | 2738497 | 88559 |
+| 113644 | 2738497 | 8032 |
++-----------+------------+-----------+
+]]>
+</codeblock>
+
+ </conbody>
+
+ </concept>
+
+ </concept>
+
+ <concept id="pseudocolumns">
+
+ <title>Pseudocolumns for ARRAY and MAP Types</title>
+
+ <conbody>
+
+ <p>
+ Each element in an <codeph>ARRAY</codeph> type has a position, indexed starting from zero, and a value. Each element in a
+ <codeph>MAP</codeph> type represents a key-value pair. Impala provides pseudocolumns that let you retrieve this metadata as part
+ of a query, or filter query results by including such things in a <codeph>WHERE</codeph> clause. You refer to the pseudocolumns as
+ part of qualified column names in queries:
+ </p>
+
+ <ul>
+ <li>
+ <codeph>ITEM</codeph>: The value of an array element. If the <codeph>ARRAY</codeph> contains <codeph>STRUCT</codeph> elements,
+ you can refer to either <codeph><varname>array_name</varname>.ITEM.<varname>field_name</varname></codeph> or use the shorthand
+ <codeph><varname>array_name</varname>.<varname>field_name</varname></codeph>.
+ </li>
+
+ <li>
+ <codeph>POS</codeph>: The position of an element within an array.
+ </li>
+
+ <li>
+ <codeph>KEY</codeph>: The value forming the first part of a key-value pair in a map. It is not necessarily unique.
+ </li>
+
+ <li>
+ <codeph>VALUE</codeph>: The data item forming the second part of a key-value pair in a map. If the <codeph>VALUE</codeph> part
+ of the <codeph>MAP</codeph> element is a <codeph>STRUCT</codeph>, you can refer to either
+ <codeph><varname>map_name</varname>.VALUE.<varname>field_name</varname></codeph> or use the shorthand
+ <codeph><varname>map_name</varname>.<varname>field_name</varname></codeph>.
+ </li>
+ </ul>
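+
+ <p>
+ As a compact illustration (the table and column names here are hypothetical), the pseudocolumns appear in queries as
+ qualified names on the table aliases established in the <codeph>FROM</codeph> clause:
+ </p>
+
+<codeblock><![CDATA[-- Hypothetical table: t1(id BIGINT, a ARRAY <STRING>, m MAP <STRING,INT>).
+SELECT id, arr.POS, arr.ITEM FROM t1, t1.a arr;
+SELECT id, mp.KEY, mp.VALUE FROM t1, t1.m mp;
+]]>
+</codeblock>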
+
+<!-- To do: Consider whether to move the detailed subtopics underneath ARRAY and MAP instead of embedded here. -->
+
+ <p outputclass="toc inpage"/>
+
+ </conbody>
+
+ <concept id="item">
+
+ <title id="pos">ITEM and POS Pseudocolumns</title>
+
+ <conbody>
+
+ <p>
+ When an <codeph>ARRAY</codeph> column contains <codeph>STRUCT</codeph> elements, you can refer to a field within the
+ <codeph>STRUCT</codeph> using a qualified name of the form
+ <codeph><varname>array_column</varname>.<varname>field_name</varname></codeph>. If the <codeph>ARRAY</codeph> contains scalar
+ values, Impala recognizes the special name <codeph><varname>array_column</varname>.ITEM</codeph> to represent the value of each
+ scalar array element. For example, if a column contained an <codeph>ARRAY</codeph> where each element was a
+ <codeph>STRING</codeph>, you would use <codeph><varname>array_name</varname>.ITEM</codeph> to refer to each scalar value in the
+ <codeph>SELECT</codeph> list, or the <codeph>WHERE</codeph> or other clauses.
+ </p>
+
+ <p>
+ This example shows a table with two <codeph>ARRAY</codeph> columns whose elements are of the scalar type
+ <codeph>STRING</codeph>. When referring to the values of the array elements in the <codeph>SELECT</codeph> list,
+ <codeph>WHERE</codeph> clause, or <codeph>ORDER BY</codeph> clause, you use the <codeph>ITEM</codeph> pseudocolumn because
+ within the array, the individual elements have no defined names.
+ </p>
+
+<codeblock><![CDATA[CREATE TABLE persons_of_interest
+(
+person_id BIGINT,
+aliases ARRAY <STRING>,
+associates ARRAY <STRING>,
+real_name STRING
+)
+STORED AS PARQUET;
+
+-- Get all the aliases of each person.
+SELECT real_name, aliases.ITEM
+ FROM persons_of_interest, persons_of_interest.aliases
+ORDER BY real_name, aliases.item;
+
+-- Search for particular associates of each person.
+SELECT real_name, associates.ITEM
+ FROM persons_of_interest, persons_of_interest.associates
+WHERE associates.item LIKE '% MacGuffin';
+]]>
+</codeblock>
+
+ <p>
+ Because an array is inherently an ordered data structure, Impala recognizes the special name
+ <codeph><varname>array_column</varname>.POS</codeph> to represent the numeric position of each element within the array. The
+ <codeph>POS</codeph> pseudocolumn lets you filter or reorder the result set based on the sequence of array elements.
+ </p>
+
+ <p>
+ The following example uses a table from a flattened version of the TPC-H schema. The <codeph>REGION</codeph> table only has a
+ few rows, such as one row for Europe and one for Asia. The row for each region represents all the countries in that region as an
+ <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> elements:
+ </p>
+
+<codeblock><![CDATA[[localhost:21000] > desc region;
++-------------+--------------------------------------------------------------------+
+| name | type |
++-------------+--------------------------------------------------------------------+
+| r_regionkey | smallint |
+| r_name | string |
+| r_comment | string |
+| r_nations | array<struct<n_nationkey:smallint,n_name:string,n_comment:string>> |
++-------------+--------------------------------------------------------------------+
+]]>
+</codeblock>
+
+ <p>
+ To find the countries within a specific region, you use a join query. To find out the order of elements in the array, you also
+ refer to the <codeph>POS</codeph> pseudocolumn in the select list:
+ </p>
+
+<codeblock>[localhost:21000] > SELECT r1.r_name, r2.n_name, <b>r2.POS</b>
+ > FROM region r1 INNER JOIN r1.r_nations r2
+ > WHERE r1.r_name = 'ASIA';
++--------+-----------+-----+
+| r_name | n_name | pos |
++--------+-----------+-----+
+| ASIA | VIETNAM | 0 |
+| ASIA | CHINA | 1 |
+| ASIA | JAPAN | 2 |
+| ASIA | INDONESIA | 3 |
+| ASIA | INDIA | 4 |
++--------+-----------+-----+
+</codeblock>
+
+ <p>
+ Once you know the positions of the elements, you can use that information in subsequent queries, for example to change the
+ ordering of results from the complex type column or to filter certain elements from the array:
+ </p>
+
+<codeblock>[localhost:21000] > SELECT r1.r_name, r2.n_name, r2.POS
+ > FROM region r1 INNER JOIN r1.r_nations r2
+ > WHERE r1.r_name = 'ASIA'
+ > <b>ORDER BY r2.POS DESC</b>;
++--------+-----------+-----+
+| r_name | n_name | pos |
++--------+-----------+-----+
+| ASIA | INDIA | 4 |
+| ASIA | INDONESIA | 3 |
+| ASIA | JAPAN | 2 |
+| ASIA | CHINA | 1 |
+| ASIA | VIETNAM | 0 |
++--------+-----------+-----+
+[localhost:21000] > SELECT r1.r_name, r2.n_name, r2.POS
+ > FROM region r1 INNER JOIN r1.r_nations r2
+ > WHERE r1.r_name = 'ASIA' AND <b>r2.POS BETWEEN 1 and 3</b>;
++--------+-----------+-----+
+| r_name | n_name | pos |
++--------+-----------+-----+
+| ASIA | CHINA | 1 |
+| ASIA | JAPAN | 2 |
+| ASIA | INDONESIA | 3 |
++--------+-----------+-----+
+</codeblock>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="key">
+
+ <title id="value">KEY and VALUE Pseudocolumns</title>
+
+ <conbody>
+
+ <p>
+ The <codeph>MAP</codeph> data type is suitable for representing sparse or wide data structures, where each row might only have
+ entries for a small subset of named fields. Because the element names (the map keys) vary depending on the row, a query must be
+ able to refer to both the key and the value parts of each key-value pair. The <codeph>KEY</codeph> and <codeph>VALUE</codeph>
+ pseudocolumns let you refer to the parts of the key-value pair independently within the query, as
+ <codeph><varname>map_column</varname>.KEY</codeph> and <codeph><varname>map_column</varname>.VALUE</codeph>.
+ </p>
+
+ <p>
+ The <codeph>KEY</codeph> must always be a scalar type, such as <codeph>STRING</codeph>, <codeph>BIGINT</codeph>, or
+ <codeph>TIMESTAMP</codeph>. It can be <codeph>NULL</codeph>. Values of the <codeph>KEY</codeph> field are not necessarily unique
+ within the same <codeph>MAP</codeph>. You apply any required <codeph>DISTINCT</codeph>, <codeph>GROUP BY</codeph>, and other
+ clauses in the query, and loop through the result set to process all the values matching any specified keys.
+ </p>
+
+ <p>
+ The <codeph>VALUE</codeph> can be either a scalar type or another complex type. If the <codeph>VALUE</codeph> is a
+ <codeph>STRUCT</codeph>, you can construct a qualified name
+ <codeph><varname>map_column</varname>.VALUE.<varname>struct_field</varname></codeph> to refer to the individual fields inside
+ the value part. If the <codeph>VALUE</codeph> is an <codeph>ARRAY</codeph> or another <codeph>MAP</codeph>, you must include
+ another join condition that establishes a table alias for <codeph><varname>map_column</varname>.VALUE</codeph>, and then
+ construct another qualified name using that alias, for example <codeph><varname>table_alias</varname>.ITEM</codeph> or
+      <codeph><varname>table_alias</varname>.KEY</codeph> and <codeph><varname>table_alias</varname>.VALUE</codeph>.
+      </p>
+
+ <p>
+ The following example shows different ways to access a <codeph>MAP</codeph> column using the <codeph>KEY</codeph> and
+ <codeph>VALUE</codeph> pseudocolumns. The <codeph>DETAILS</codeph> column has a <codeph>STRING</codeph> first part with short,
+ standardized values such as <codeph>'Recurring'</codeph>, <codeph>'Lucid'</codeph>, or <codeph>'Anxiety'</codeph>. This is the
+ <q>key</q> that is used to look up particular kinds of elements from the <codeph>MAP</codeph>. The second part, also a
+ <codeph>STRING</codeph>, is a longer free-form explanation. Impala gives you the standard pseudocolumn names
+ <codeph>KEY</codeph> and <codeph>VALUE</codeph> for the two parts, and you apply your own conventions and interpretations to the
+ underlying values.
+ </p>
+
+ <note>
+ If you find that the single-item nature of the <codeph>VALUE</codeph> makes it difficult to model your data accurately, the
+ solution is typically to add some nesting to the complex type. For example, to have several sets of key-value pairs, make the
+ column an <codeph>ARRAY</codeph> whose elements are <codeph>MAP</codeph>. To make a set of key-value pairs that holds more
+ elaborate information, make a <codeph>MAP</codeph> column whose <codeph>VALUE</codeph> part contains an <codeph>ARRAY</codeph>
+ or a <codeph>STRUCT</codeph>.
+ </note>
+
+<codeblock><![CDATA[CREATE TABLE dream_journal
+(
+ dream_id BIGINT,
+ details MAP <STRING,STRING>
+)
+STORED AS PARQUET;
+]]>
+
+-- What are all the types of dreams that are recorded?
+SELECT DISTINCT details.KEY FROM dream_journal, dream_journal.details;
+
+-- How many lucid dreams were recorded?
+-- Because there is no GROUP BY, we count the 'Lucid' keys across all rows.
+SELECT <b>COUNT(details.KEY)</b>
+ FROM dream_journal, dream_journal.details
+WHERE <b>details.KEY = 'Lucid'</b>;
+
+-- Print a report of a subset of dreams, filtering based on both the lookup key
+-- and the detailed value.
+SELECT dream_id, <b>details.KEY AS "Dream Type"</b>, <b>details.VALUE AS "Dream Summary"</b>
+ FROM dream_journal, dream_journal.details
+WHERE
+ <b>details.KEY IN ('Happy', 'Pleasant', 'Joyous')</b>
+ AND <b>details.VALUE LIKE '%childhood%'</b>;
+</codeblock>
+
+ <p>
+ The following example shows a more elaborate version of the previous table, where the <codeph>VALUE</codeph> part of the
+ <codeph>MAP</codeph> entry is a <codeph>STRUCT</codeph> rather than a scalar type. Now instead of referring to the
+ <codeph>VALUE</codeph> pseudocolumn directly, you use dot notation to refer to the <codeph>STRUCT</codeph> fields inside it.
+ </p>
+
+<codeblock><![CDATA[CREATE TABLE better_dream_journal
+(
+ dream_id BIGINT,
+ details MAP <STRING,STRUCT <summary: STRING, when_happened: TIMESTAMP, duration: DECIMAL(5,2), woke_up: BOOLEAN> >
+)
+STORED AS PARQUET;
+]]>
+
+-- Do more elaborate reporting and filtering by examining multiple attributes within the same dream.
+SELECT dream_id, <b>details.KEY AS "Dream Type"</b>, <b>details.VALUE.summary AS "Dream Summary"</b>, <b>details.VALUE.duration AS "Duration"</b>
+ FROM better_dream_journal, better_dream_journal.details
+WHERE
+ <b>details.KEY IN ('Anxiety', 'Nightmare')</b>
+ AND <b>details.VALUE.duration > 60</b>
+ AND <b>details.VALUE.woke_up = TRUE</b>;
+
+-- Remember that if the ITEM or VALUE contains a STRUCT, you can reference
+-- the STRUCT fields directly without the .ITEM or .VALUE qualifier.
+SELECT dream_id, <b>details.KEY AS "Dream Type"</b>, <b>details.summary AS "Dream Summary"</b>, <b>details.duration AS "Duration"</b>
+ FROM better_dream_journal, better_dream_journal.details
+WHERE
+ <b>details.KEY IN ('Anxiety', 'Nightmare')</b>
+ AND <b>details.duration > 60</b>
+ AND <b>details.woke_up = TRUE</b>;
+</codeblock>
+
+ </conbody>
+
+ </concept>
+
+ </concept>
+
+ <concept id="complex_types_etl">
+
+<!-- This topic overlaps in many ways with the preceding one. See which theme resonates with users, and combine them under the better title. -->
+
+ <title>Loading Data Containing Complex Types</title>
+
+ <conbody>
+
+    <p>
+      The Impala <codeph>INSERT</codeph> statement does not currently support creating new data with complex type columns, or
+      copying existing complex type values from one table to another. Therefore, you primarily use Impala to query Parquet tables
+      containing complex types where the data was inserted through Hive, or to create tables with complex types on top of existing
+      Parquet data files.
+    </p>
+
+ <p>
+ If you have created a Hive table with the Parquet file format and containing complex types, use the same table for Impala queries
+ with no changes. If you have such a Hive table in some other format, use a Hive <codeph>CREATE TABLE AS SELECT ... STORED AS
+ PARQUET</codeph> or <codeph>INSERT ... SELECT</codeph> statement to produce an equivalent Parquet table that Impala can query.
+ </p>
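+
+    <p>
+      For example, a Hive session could produce a Parquet equivalent of a complex type table stored in some other file format
+      like this. (The table names here are illustrative, not from an actual schema.)
+    </p>
+
+<codeblock>-- Run in Hive, not Impala: produce a Parquet equivalent of a complex type
+-- table stored in some other file format.
+CREATE TABLE events_parquet STORED AS PARQUET AS SELECT * FROM events_avro;
+</codeblock>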
+
+ <p>
+ If you have existing Parquet data files containing complex types, located outside of any Impala or Hive table, such as data files
+ created by Spark jobs, you can use an Impala <codeph>CREATE TABLE ... STORED AS PARQUET</codeph> statement, followed by an Impala
+ <codeph>LOAD DATA</codeph> statement to move the data files into the table. As an alternative, you can use an Impala
+ <codeph>CREATE EXTERNAL TABLE</codeph> statement to create a table pointing to the HDFS directory that already contains the data
+ files.
+ </p>
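+
+    <p>
+      The following sketch shows both techniques. The column definitions and the HDFS path are illustrative; substitute the
+      schema and location of your own data files.
+    </p>
+
+<codeblock><![CDATA[-- Technique 1: create the table, then move the existing data files into it.
+CREATE TABLE spark_events (id BIGINT, tags ARRAY<STRING>) STORED AS PARQUET;
+LOAD DATA INPATH '/user/etl/spark_output' INTO TABLE spark_events;
+
+-- Technique 2: leave the data files in place and point an external table at them.
+CREATE EXTERNAL TABLE spark_events_ext (id BIGINT, tags ARRAY<STRING>)
+  STORED AS PARQUET LOCATION '/user/etl/spark_output';
+]]>
+</codeblock>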
+
+ <p>
+ Perhaps the simplest way to get started with complex type data is to take a denormalized table containing duplicated values, and
+ use an <codeph>INSERT ... SELECT</codeph> statement to copy the data into a Parquet table and condense the repeated values into
+ complex types. With the Hive <codeph>INSERT</codeph> statement, you use the <codeph>COLLECT_LIST()</codeph>,
+ <codeph>NAMED_STRUCT()</codeph>, and <codeph>MAP()</codeph> constructor functions within a <codeph>GROUP BY</codeph> query to
+ produce the complex type values. <codeph>COLLECT_LIST()</codeph> turns a sequence of values into an <codeph>ARRAY</codeph>.
+ <codeph>NAMED_STRUCT()</codeph> uses the first, third, and so on arguments as the field names for a <codeph>STRUCT</codeph>, to
+ match the field names from the <codeph>CREATE TABLE</codeph> statement.
+ </p>
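+
+    <p>
+      For example, assuming a flat table <codeph>flat_nations</codeph> with one row per nation, and a target table
+      <codeph>region_nested</codeph> whose last column is an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> like the
+      <codeph>R_NATIONS</codeph> column shown earlier, a Hive statement such as the following could condense the repeated
+      region values. (Both table names and the column list are illustrative.)
+    </p>
+
+<codeblock>-- Run in Hive, not Impala.
+INSERT INTO TABLE region_nested
+  SELECT r_regionkey, r_name,
+    COLLECT_LIST(NAMED_STRUCT('n_nationkey', n_nationkey, 'n_name', n_name, 'n_comment', n_comment))
+  FROM flat_nations
+  GROUP BY r_regionkey, r_name;
+</codeblock>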
+
+ <note>
+ Because Hive currently cannot construct individual rows using complex types through the <codeph>INSERT ... VALUES</codeph> syntax,
+ you prepare the data in flat form in a separate table, then copy it to the table with complex columns using <codeph>INSERT ...
+ SELECT</codeph> and the complex type constructors. See <xref href="impala_complex_types.xml#complex_types_ex_hive_etl"/> for
+ examples.
+ </note>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_nesting">
+
+ <title>Using Complex Types as Nested Types</title>
+
+ <conbody>
+
+ <p>
+ The <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph> types can be the top-level types for <q>nested
+ type</q> columns. That is, each of these types can contain other complex or scalar types, with multiple levels of nesting to a
+ maximum depth of 100. For example, you can have an array of structures, a map containing other maps, a structure containing an
+ array of other structures, and so on. At the lowest level, there are always scalar types making up the fields of a
+ <codeph>STRUCT</codeph>, elements of an <codeph>ARRAY</codeph>, and keys and values of a <codeph>MAP</codeph>.
+ </p>
+
+ <p>
+ Schemas involving complex types typically use some level of nesting for the complex type columns.
+ </p>
+
+ <p>
+ For example, to model a relationship like a dimension table and a fact table, you typically use an <codeph>ARRAY</codeph> where
+ each array element is a <codeph>STRUCT</codeph>. The <codeph>STRUCT</codeph> fields represent what would traditionally be columns
+ in a separate joined table. It makes little sense to use a <codeph>STRUCT</codeph> as the top-level type for a column, because you
+ could just make the fields of the <codeph>STRUCT</codeph> into regular table columns.
+ </p>
+
+<!-- To do: this example might move somewhere else, under STRUCT itself or in a tips-and-tricks section. -->
+
+ <p>
+      Perhaps the only use case for a top-level <codeph>STRUCT</codeph> would be to allow <codeph>STRUCT</codeph> fields with the
+ same name as columns to coexist in the same table. The following example shows how a table could have a column named
+ <codeph>ID</codeph>, and two separate <codeph>STRUCT</codeph> fields also named <codeph>ID</codeph>. Because the
+ <codeph>STRUCT</codeph> fields are always referenced using qualified names, the identical <codeph>ID</codeph> names do not cause a
+ conflict.
+ </p>
+
+<codeblock><![CDATA[CREATE TABLE struct_namespaces
+(
+ id BIGINT
+ , s1 STRUCT < id: BIGINT, field1: STRING >
+ , s2 STRUCT < id: BIGINT, when_happened: TIMESTAMP >
+)
+STORED AS PARQUET;
+
+select id, s1.id, s2.id from struct_namespaces;
+]]>
+</codeblock>
+
+ <p>
+ It is common to make the value portion of each key-value pair in a <codeph>MAP</codeph> a <codeph>STRUCT</codeph>,
+ <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>, or other complex type variation. That way, each key in the <codeph>MAP</codeph>
+ can be associated with a flexible and extensible data structure. The key values are not predefined ahead of time (other than by
+      specifying their data type). Therefore, the <codeph>MAP</codeph> can accommodate a rapidly evolving schema, or sparse data
+ structures where each row contains only a few data values drawn from a large set of possible choices.
+ </p>
+
+ <p>
+ Although you can use an <codeph>ARRAY</codeph> of scalar values as the top-level column in a table, such a simple array is
+ typically of limited use for analytic queries. The only property of the array elements, aside from the element value, is the
+      ordering sequence available through the <codeph>POS</codeph> pseudocolumn. To record any additional detail about each array element,
+ such as a <codeph>TIMESTAMP</codeph> or a symbolic name, you use an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> rather than
+ of scalar values.
+ </p>
+
+ <p>
+ If you are considering having multiple <codeph>ARRAY</codeph> or <codeph>MAP</codeph> columns, with related items under the same
+ position in each <codeph>ARRAY</codeph> or the same key in each <codeph>MAP</codeph>, prefer to use a <codeph>STRUCT</codeph> to
+ group all the related items into a single <codeph>ARRAY</codeph> or <codeph>MAP</codeph>. Doing so avoids the additional storage
+ overhead and potential duplication of key values from having an extra complex type column. Also, because each
+ <codeph>ARRAY</codeph> or <codeph>MAP</codeph> that you reference in the query <codeph>SELECT</codeph> list requires an additional
+ join clause, minimizing the number of complex type columns also makes the query easier to read and maintain, relying more on dot
+ notation to refer to the relevant fields rather than a sequence of join clauses.
+ </p>
+
+ <p>
+ For example, here is a table with several complex type columns all at the top level and containing only scalar types. To retrieve
+ every data item for the row requires a separate join for each <codeph>ARRAY</codeph> or <codeph>MAP</codeph> column. The fields of
+ the <codeph>STRUCT</codeph> can be referenced using dot notation, but there is no real advantage to using the
+ <codeph>STRUCT</codeph> at the top level rather than just making separate columns <codeph>FIELD1</codeph> and
+ <codeph>FIELD2</codeph>.
+ </p>
+
+<codeblock><![CDATA[CREATE TABLE complex_types_top_level
+(
+ id BIGINT,
+ a1 ARRAY<INT>,
+ a2 ARRAY<STRING>,
+ s STRUCT<field1: INT, field2: STRING>,
+-- Numeric lookup key for a string value.
+ m1 MAP<INT,STRING>,
+-- String lookup key for a numeric value.
+ m2 MAP<STRING,INT>
+)
+STORED AS PARQUET;
+
+describe complex_types_top_level;
++------+-----------------+
+| name | type |
++------+-----------------+
+| id | bigint |
+| a1 | array<int> |
+| a2 | array<string> |
+| s | struct< |
+| | field1:int, |
+| | field2:string |
+| | > |
+| m1 | map<int,string> |
+| m2 | map<string,int> |
++------+-----------------+
+
+select
+ id,
+ a1.item,
+ a2.item,
+ s.field1,
+ s.field2,
+ m1.key,
+ m1.value,
+ m2.key,
+ m2.value
+from
+ complex_types_top_level,
+ complex_types_top_level.a1,
+ complex_types_top_level.a2,
+ complex_types_top_level.m1,
+ complex_types_top_level.m2;
+]]>
+</codeblock>
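+
+    <p>
+      Following the guideline above, if the elements of <codeph>A1</codeph> and <codeph>A2</codeph> are related by position, a
+      single <codeph>ARRAY</codeph> of a two-field <codeph>STRUCT</codeph> keeps the related values together and requires only
+      one join clause to retrieve both. (The table and field names in this sketch are illustrative.)
+    </p>
+
+<codeblock><![CDATA[CREATE TABLE complex_types_consolidated
+(
+  id BIGINT,
+-- One ARRAY of STRUCT replaces the two parallel arrays A1 and A2.
+  a ARRAY<STRUCT<field1: INT, field2: STRING>>
+)
+STORED AS PARQUET;
+
+select id, a.pos, a.field1, a.field2
+  from complex_types_consolidated, complex_types_consolidated.a;
+]]>
+</codeblock>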
+
+ <p>
+ For example, here is a table with columns containing an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>, a <codeph>MAP</codeph>
+ where each key value is a <codeph>STRUCT</codeph>, and a <codeph>MAP</codeph> where each key value is an <codeph>ARRAY</codeph> of
+ <codeph>STRUCT</codeph>.
+ </p>
+
+<codeblock><![CDATA[CREATE TABLE nesting_demo
+(
+ user_id BIGINT,
+ family_members ARRAY < STRUCT < name: STRING, email: STRING, date_joined: TIMESTAMP >>,
+ foo map < STRING, STRUCT < f1: INT, f2: INT, f3: TIMESTAMP, f4: BOOLEAN >>,
+ gameplay MAP < STRING , ARRAY < STRUCT <
+ name: STRING, highest: BIGINT, lives_used: INT, total_spent: DECIMAL(16,2)
+ >>>
+)
+STORED AS PARQUET;
+]]>
+</codeblock>
+
+ <p>
+      The <codeph>DESCRIBE</codeph> statement spreads the <codeph><</codeph> and <codeph>></codeph> separators and the field
+      names within each <codeph>STRUCT</codeph> across multiple lines for easy readability:
+ </p>
+
+<codeblock><![CDATA[DESCRIBE nesting_demo;
++----------------+-----------------------------+
+| name | type |
++----------------+-----------------------------+
+| user_id | bigint |
+| family_members | array<struct< |
+| | name:string, |
+| | email:string, |
+| | date_joined:timestamp |
+| | >>
<TRUNCATED>