Posted to commits@impala.apache.org by jr...@apache.org on 2016/07/26 23:05:09 UTC
[17/22] incubator-impala git commit: First try at porting over the
source files necessary for the Impala SQL Reference.
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_complex_types.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_complex_types.xml b/docs/topics/impala_complex_types.xml
new file mode 100644
index 0000000..9fe7362
--- /dev/null
+++ b/docs/topics/impala_complex_types.xml
@@ -0,0 +1,2725 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="2.3.0" id="complex_types">
+
+ <title id="nested_types">Complex Types (CDH 5.5 and higher only)</title>
+
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="Impala Data Types"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">complex types</indexterm>
+
+ <indexterm audience="Cloudera">nested types</indexterm>
+ <term>Complex types</term> (also referred to as <term>nested types</term>) let you represent multiple data values within a single
+ row/column position. They differ from the familiar column types such as <codeph>BIGINT</codeph> and <codeph>STRING</codeph>, known as
+ <term>scalar types</term> or <term>primitive types</term>, which represent a single data value within a given row/column position.
+ Impala supports the complex types <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, and <codeph>STRUCT</codeph> in Impala 2.3 / CDH 5.5
+ and higher. The Hive <codeph>UNION</codeph> type is not currently supported.
+ </p>
+
+ <p outputclass="toc inpage"/>
+
+ <p>
+ Once you understand the basics of complex types, refer to the individual type topics when you need to refresh your memory about syntax
+ and examples:
+ </p>
+
+ <ul>
+ <li>
+ <xref href="impala_array.xml#array"/>
+ </li>
+
+ <li>
+ <xref href="impala_struct.xml#struct"/>
+ </li>
+
+ <li>
+ <xref href="impala_map.xml#map"/>
+ </li>
+ </ul>
+
+ </conbody>
+
+ <concept id="complex_types_benefits">
+
+ <title>Benefits of Impala Complex Types</title>
+
+ <conbody>
+
+ <p>
+ The reasons for using Impala complex types include the following:
+ </p>
+
+ <ul>
+ <li>
+ <p>
+            You already have data produced by Hive or another non-Impala component that uses the complex type column names. You might need to
+ convert the underlying data to Parquet to use it with Impala.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Your data model originates with a non-SQL programming language or a NoSQL data management system. For example, if you are
+ representing Python data expressed as nested lists, dictionaries, and tuples, those data structures correspond closely to Impala
+ <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, and <codeph>STRUCT</codeph> types.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Your analytic queries involving multiple tables could benefit from greater locality during join processing. By packing more
+ related data items within each HDFS data block, complex types let join queries avoid the network overhead of the traditional
+ Hadoop shuffle or broadcast join techniques.
+ </p>
+ </li>
+ </ul>
+
+ <p>
+ The Impala complex type support produces result sets with all scalar values, and the scalar components of complex types can be used
+ with all SQL clauses, such as <codeph>GROUP BY</codeph>, <codeph>ORDER BY</codeph>, all kinds of joins, subqueries, and inline
+ views. The ability to process complex type data entirely in SQL reduces the need to write application-specific code in Java or other
+ programming languages to deconstruct the underlying data structures.
+ </p>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_overview">
+
+ <title>Overview of Impala Complex Types</title>
+
+ <conbody>
+
+ <p>
+<!--
+ Each <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, or <codeph>STRUCT</codeph> column can include multiple instances of scalar types
+ such as <codeph>BIGINT</codeph> and <codeph>STRING</codeph>.
+-->
+ The <codeph>ARRAY</codeph> and <codeph>MAP</codeph> types are closely related: they represent collections with arbitrary numbers of
+ elements, where each element is the same type. In contrast, <codeph>STRUCT</codeph> groups together a fixed number of items into a
+ single element. The parts of a <codeph>STRUCT</codeph> element (the <term>fields</term>) can be of different types, and each field
+ has a name.
+ </p>
+
+ <p>
+ The elements of an <codeph>ARRAY</codeph> or <codeph>MAP</codeph>, or the fields of a <codeph>STRUCT</codeph>, can also be other
+ complex types. You can construct elaborate data structures with up to 100 levels of nesting. For example, you can make an
+ <codeph>ARRAY</codeph> whose elements are <codeph>STRUCT</codeph>s. Within each <codeph>STRUCT</codeph>, you can have some fields
+ that are <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, or another kind of <codeph>STRUCT</codeph>. The Impala documentation uses the
+ terms complex and nested types interchangeably; for simplicity, it primarily uses the term complex types, to encompass all the
+ properties of these types.
+ </p>
+
+ <p>
+ When visualizing your data model in familiar SQL terms, you can think of each <codeph>ARRAY</codeph> or <codeph>MAP</codeph> as a
+ miniature table, and each <codeph>STRUCT</codeph> as a row within such a table. By default, the table represented by an
+ <codeph>ARRAY</codeph> has two columns, <codeph>POS</codeph> to represent ordering of elements, and <codeph>ITEM</codeph>
+ representing the value of each element. Likewise, by default, the table represented by a <codeph>MAP</codeph> encodes key-value
+ pairs, and therefore has two columns, <codeph>KEY</codeph> and <codeph>VALUE</codeph>.
+<!--
+ When you use a <codeph>STRUCT</codeph> as an
+ <codeph>ARRAY</codeph> element or the <codeph>VALUE</codeph> part of a <codeph>MAP</codeph>, the field names of the
+ <codeph>STRUCT</codeph> become additional columns in the result set.
+-->
+ </p>
+
+ <p>
+ The <codeph>ITEM</codeph> and <codeph>VALUE</codeph> names are only required for the very simplest kinds of <codeph>ARRAY</codeph>
+ and <codeph>MAP</codeph> columns, ones that hold only scalar values. When the elements within the <codeph>ARRAY</codeph> or
+ <codeph>MAP</codeph> are of type <codeph>STRUCT</codeph> rather than a scalar type, then the result set contains columns with names
+ corresponding to the <codeph>STRUCT</codeph> fields rather than <codeph>ITEM</codeph> or <codeph>VALUE</codeph>.
+ </p>
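+
+      <p>
+        As a concrete sketch of this tabular model, the following hypothetical table (the table and column names are illustrative, not from
+        any shipped sample schema) declares one column of each complex type, with a comment noting the pseudocolumns or field names you use
+        when querying each one:
+      </p>
+
+<codeblock>
+-- Hypothetical table; names are illustrative only.
+CREATE TABLE complex_demo
+(
+  id BIGINT
+  , a ARRAY &lt;STRING&gt;                  -- queried through the ITEM and POS pseudocolumns
+  , m MAP &lt;STRING,BIGINT&gt;             -- queried through the KEY and VALUE pseudocolumns
+  , s STRUCT &lt;city: STRING, zip: INT&gt; -- queried through the field names, such as s.city
+) STORED AS PARQUET;
+</codeblock>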
+
+<!--
+ <p>
+ <codeph>ARRAY</codeph> and <codeph>MAP</codeph> are both <term>collection</term> types, which can have a variable number of
+ elements; <codeph>ARRAY</codeph> and <codeph>MAP</codeph> are typically used as the top-level type of a table column.
+ <codeph>STRUCT</codeph> represents a single element and has a fixed number of fields; <codeph>STRUCT</codeph> is typically used as
+ the final, lowest level of a nested type definition.
+ </p>
+-->
+
+ <p>
+ You write most queries that process complex type columns using familiar join syntax, even though the data for both sides of the join
+ resides in a single table. The join notation brings together the scalar values from a row with the values from the complex type
+ columns for that same row. The final result set contains all scalar values, allowing you to do all the familiar filtering,
+ aggregation, ordering, and so on for the complex data entirely in SQL or using business intelligence tools that issue SQL queries.
+<!--
+ Instead of pulling together values from different tables, the join selects the specified values from both
+ the scalar columns, and from inside the complex type columns, producing a flattened result set consisting of all scalar values. When
+ doing a join query involving a complex type column, Impala derives the join key automatically, without the need to create additional
+ ID columns in the table.
+-->
+ </p>
+
+ <p>
+ Behind the scenes, Impala ensures that the processing for each row is done efficiently on a single host, without the network traffic
+ involved in broadcast or shuffle joins. The most common type of join query for tables with complex type columns is <codeph>INNER
+ JOIN</codeph>, which returns results only in those cases where the complex type contains some elements. Therefore, most query
+ examples in this section use either the <codeph>INNER JOIN</codeph> clause or the equivalent comma notation.
+ </p>
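+
+      <p>
+        The following sketch shows the shape of such a query. It assumes a hypothetical Parquet table <codeph>t</codeph> with a scalar
+        <codeph>id</codeph> column and an <codeph>ARRAY&lt;STRING&gt;</codeph> column <codeph>a</codeph>; the names are illustrative only:
+      </p>
+
+<codeblock>
+-- Join the scalar column with the elements of the array column from the same row.
+SELECT t.id, a.pos, a.item
+  FROM t INNER JOIN t.a a;
+
+-- Equivalent comma notation:
+SELECT t.id, a.pos, a.item
+  FROM t, t.a a;
+</codeblock>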
+
+ <note>
+ <p>
+ Although Impala can query complex types that are present in Parquet files, Impala currently cannot create new Parquet files
+ containing complex types. Therefore, the discussion and examples presume that you are working with existing Parquet data produced
+ through Hive, Spark, or some other source. See <xref href="#complex_types_ex_hive_etl"/> for examples of constructing Parquet data
+ files with complex type columns.
+ </p>
+
+ <p>
+ For learning purposes, you can create empty tables with complex type columns and practice query syntax, even if you do not have
+ sample data with the required structure.
+ </p>
+ </note>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_design">
+
+ <title>Design Considerations for Complex Types</title>
+
+ <conbody>
+
+ <p>
+ When planning to use Impala complex types, and designing the Impala schema, first learn how this kind of schema differs from
+ traditional table layouts from the relational database and data warehousing fields. Because you might have already encountered
+ complex types in a Hadoop context while using Hive for ETL, also learn how to write high-performance analytic queries for complex
+ type data using Impala SQL syntax.
+ </p>
+
+ <p outputclass="toc inpage"/>
+
+ </conbody>
+
+ <concept id="complex_types_vs_rdbms">
+
+ <title>How Complex Types Differ from Traditional Data Warehouse Schemas</title>
+
+ <conbody>
+
+ <p>
+ Complex types let you associate arbitrary data structures with a particular row. If you are familiar with schema design for
+ relational database management systems or data warehouses, a schema with complex types has the following differences:
+ </p>
+
+ <ul>
+ <li>
+ <p>
+ Logically, related values can now be grouped tightly together in the same table.
+ </p>
+
+ <p>
+ In traditional data warehousing, related values were typically arranged in one of two ways:
+ </p>
+ <ul>
+ <li>
+ <p>
+ Split across multiple normalized tables. Foreign key columns specified which rows from each table were associated with
+ each other. This arrangement avoided duplicate data and therefore the data was compact, but join queries could be
+ expensive because the related data had to be retrieved from separate locations. (In the case of distributed Hadoop
+ queries, the joined tables might even be transmitted between different hosts in a cluster.)
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Flattened into a single denormalized table. Although this layout eliminated some potential performance issues by removing
+ the need for join queries, the table typically became larger because values were repeated. The extra data volume could
+ cause performance issues in other parts of the workflow, such as longer ETL cycles or more expensive full-table scans
+ during queries.
+ </p>
+ </li>
+ </ul>
+ <p>
+ Complex types represent a middle ground that addresses these performance and volume concerns. By physically locating related
+ data within the same data files, complex types increase locality and reduce the expense of join queries. By associating an
+ arbitrary amount of data with a single row, complex types avoid the need to repeat lengthy values such as strings. Because
+ Impala knows which complex type values are associated with each row, you can save storage by avoiding artificial foreign key
+ values that are only used for joins. The flexibility of the <codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, and
+ <codeph>MAP</codeph> types lets you model familiar constructs such as fact and dimension tables from a data warehouse, and
+            wide tables representing sparse matrices.
+ </p>
+ </li>
+ </ul>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_physical">
+
+ <title>Physical Storage for Complex Types</title>
+
+ <conbody>
+
+ <p>
+ Physically, the scalar and complex columns in each row are located adjacent to each other in the same Parquet data file, ensuring
+ that they are processed on the same host rather than being broadcast across the network when cross-referenced within a query. This
+        co-location simplifies the process of copying, converting, and backing up all the columns at once. Because of the column-oriented
+ layout of Parquet files, you can still query only the scalar columns of a table without imposing the I/O penalty of reading the
+ (possibly large) values of the composite columns.
+ </p>
+
+ <p>
+ Within each Parquet data file, the constituent parts of complex type columns are stored in column-oriented format:
+ </p>
+
+ <ul>
+ <li>
+ <p>
+ Each field of a <codeph>STRUCT</codeph> type is stored like a column, with all the scalar values adjacent to each other and
+ encoded, compressed, and so on using the Parquet space-saving techniques.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ For an <codeph>ARRAY</codeph> containing scalar values, all those values (represented by the <codeph>ITEM</codeph>
+ pseudocolumn) are stored adjacent to each other.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ For a <codeph>MAP</codeph>, the values of the <codeph>KEY</codeph> pseudocolumn are stored adjacent to each other. If the
+ <codeph>VALUE</codeph> pseudocolumn is a scalar type, its values are also stored adjacent to each other.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ If an <codeph>ARRAY</codeph> element, <codeph>STRUCT</codeph> field, or <codeph>MAP</codeph> <codeph>VALUE</codeph> part is
+ another complex type, the column-oriented storage applies to the next level down (or the next level after that, and so on for
+ deeply nested types) where the final elements, fields, or values are of scalar types.
+ </p>
+ </li>
+ </ul>
+
+ <p>
+ The numbers represented by the <codeph>POS</codeph> pseudocolumn of an <codeph>ARRAY</codeph> are not physically stored in the
+ data files. They are synthesized at query time based on the order of the <codeph>ARRAY</codeph> elements associated with each row.
+ </p>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_file_formats">
+
+ <title>File Format Support for Impala Complex Types</title>
+
+ <conbody>
+
+ <p>
+ Currently, Impala queries support complex type data only in the Parquet file format. See <xref href="impala_parquet.xml#parquet"/>
+ for details about the performance benefits and physical layout of this file format.
+ </p>
+
+ <p>
+ Each table, or each partition within a table, can have a separate file format, and you can change file format at the table or
+ partition level through an <codeph>ALTER TABLE</codeph> statement. Because this flexibility makes it difficult to guarantee ahead
+ of time that all the data files for a table or partition are in a compatible format, Impala does not throw any errors when you
+ change the file format for a table or partition using <codeph>ALTER TABLE</codeph>. Any errors come at runtime when Impala
+ actually processes a table or partition that contains nested types and is not in one of the supported formats. If a query on a
+ partitioned table only processes some partitions, and all those partitions are in one of the supported formats, the query
+ succeeds.
+ </p>
+
+ <p>
+ Because Impala does not parse the data structures containing nested types for unsupported formats such as text, Avro,
+ SequenceFile, or RCFile, you cannot use data files in these formats with Impala, even if the query does not refer to the nested
+ type columns. Also, if a table using an unsupported format originally contained nested type columns, and then those columns were
+ dropped from the table using <codeph>ALTER TABLE ... DROP COLUMN</codeph>, any existing data files in the table still contain the
+ nested type data and Impala queries on that table will generate errors.
+ </p>
+
+ <p>
+ You can perform DDL operations (even <codeph>CREATE TABLE</codeph>) for tables involving complex types in file formats other than
+ Parquet. The DDL support lets you set up intermediate tables in your ETL pipeline, to be populated by Hive, before the final stage
+ where the data resides in a Parquet table and is queryable by Impala. Also, you can have a partitioned table with complex type
+ columns that uses a non-Parquet format, and use <codeph>ALTER TABLE</codeph> to change the file format to Parquet for individual
+ partitions. When you put Parquet data files into those partitions, Impala can execute queries against that data as long as the
+ query does not involve any of the non-Parquet partitions.
+ </p>
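+
+      <p>
+        A sketch of that sequence, using illustrative table and partition names, might look like this:
+      </p>
+
+<codeblock>
+-- The table starts out in text format, for use as a staging area by Hive.
+CREATE TABLE events (id BIGINT, tags ARRAY &lt;STRING&gt;)
+  PARTITIONED BY (year INT) STORED AS TEXTFILE;
+ALTER TABLE events ADD PARTITION (year = 2016);
+-- Switch this one partition to Parquet; then load Parquet data files into it.
+ALTER TABLE events PARTITION (year = 2016) SET FILEFORMAT PARQUET;
+</codeblock>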
+
+ <p>
+ If you use the <cmdname>parquet-tools</cmdname> command to examine the structure of a Parquet data file that includes complex
+ types, you see that both <codeph>ARRAY</codeph> and <codeph>MAP</codeph> are represented as a <codeph>Bag</codeph> in Parquet
+ terminology, with all fields marked <codeph>Optional</codeph> because Impala allows any column to be nullable.
+ </p>
+
+ <p>
+        Impala supports both 2-level and 3-level encoding within each Parquet data file. When constructing Parquet data files outside
+ Impala, use either encoding style but do not mix 2-level and 3-level encoding within the same data file.
+ </p>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_vs_normalization">
+
+ <title>Choosing Between Complex Types and Normalized Tables</title>
+
+ <conbody>
+
+ <p>
+ Choosing between multiple normalized fact and dimension tables, or a single table containing complex types, is an important design
+ decision.
+ </p>
+
+ <ul>
+ <li>
+ <p>
+ If you are coming from a traditional database or data warehousing background, you might be familiar with how to split up data
+ between tables. Your business intelligence tools might already be optimized for dealing with this kind of multi-table scenario
+ through join queries.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ If you are pulling data from Impala into an application written in a programming language that has data structures analogous
+ to the complex types, such as Python or Java, complex types in Impala could simplify data interchange and improve
+ understandability and reliability of your program logic.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ You might already be faced with existing infrastructure or receive high volumes of data that assume one layout or the other.
+            For example, complex types are popular with web-oriented applications, such as keeping information about an online user
+            all in one place for convenient lookup and analysis, or dealing with sparse or constantly evolving data fields.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ If some parts of the data change over time while related data remains constant, using multiple normalized tables lets you
+ replace certain parts of the data without reloading the entire data set. Conversely, if you receive related data all bundled
+ together, such as in JSON files, using complex types can save the overhead of splitting the related items across multiple
+ tables.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ From a performance perspective:
+ </p>
+ <ul>
+ <li>
+ <p>
+ In Parquet tables, Impala can skip columns that are not referenced in a query, avoiding the I/O penalty of reading the
+ embedded data. When complex types are nested within a column, the data is physically divided at a very granular level; for
+ example, a query referring to data nested multiple levels deep in a complex type column does not have to read all the data
+ from that column, only the data for the relevant parts of the column type hierarchy.
+<!-- Avro not supported in 5.5 / 2.3: Avro tables might experience some performance overhead due to
+ the need to skip past the complex type columns in each row when reading the data. -->
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Complex types avoid the possibility of expensive join queries when data from fact and dimension tables is processed in
+                parallel across multiple hosts. All the information for a row containing complex types typically resides in the same data
+ block, and therefore does not need to be transmitted across the network when joining fields that are all part of the same
+ row.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ The tradeoff with complex types is that fewer rows fit in each data block. Whether it is better to have more data blocks
+ with fewer rows, or fewer data blocks with many rows, depends on the distribution of your data and the characteristics of
+ your query workload. If the complex columns are rarely referenced, using them might lower efficiency. If you are seeing
+ low parallelism due to a small volume of data (relatively few data blocks) in each table partition, increasing the row
+ size by including complex columns might produce more data blocks and thus spread the work more evenly across the cluster.
+ See <xref href="impala_scalability.xml#scalability"/> for more on this advanced topic.
+ </p>
+ </li>
+ </ul>
+ </li>
+ </ul>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_hive">
+
+ <title>Differences Between Impala and Hive Complex Types</title>
+
+ <conbody>
+
+<!-- HiveQL functions like nested type constructors and posexplode(): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -->
+
+<!-- HiveQL complex types: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-ComplexTypes -->
+
+<!-- HiveQL lateral views: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView -->
+
+ <p>
+ Impala can query Parquet tables containing <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph> columns
+ produced by Hive. There are some differences to be aware of between the Impala SQL and HiveQL syntax for complex types, primarily
+ for queries.
+ </p>
+
+ <p>
+ The syntax for specifying <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph> types in a <codeph>CREATE
+ TABLE</codeph> statement is compatible between Impala and Hive.
+ </p>
+
+ <p>
+ Because Impala <codeph>STRUCT</codeph> columns include user-specified field names, you use the <codeph>NAMED_STRUCT()</codeph>
+ constructor in Hive rather than the <codeph>STRUCT()</codeph> constructor when you populate an Impala <codeph>STRUCT</codeph>
+ column using a Hive <codeph>INSERT</codeph> statement.
+ </p>
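+
+      <p>
+        For example, assuming a hypothetical table with a column declared as <codeph>s STRUCT&lt;city: STRING, zip: INT&gt;</codeph>
+        (the table and field names are illustrative), the Hive statement might look like:
+      </p>
+
+<codeblock>
+-- Run in Hive, not Impala. NAMED_STRUCT() takes alternating field names and values.
+INSERT INTO TABLE struct_demo
+  SELECT 1, NAMED_STRUCT('city', 'Anytown', 'zip', 12345);
+</codeblock>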
+
+ <p>
+ The Hive <codeph>UNION</codeph> type is not currently supported in Impala.
+ </p>
+
+ <p>
+ While Impala usually aims for a high degree of compatibility with HiveQL query syntax, Impala syntax differs from Hive for queries
+ involving complex types. The differences are intended to provide extra flexibility for queries involving these kinds of tables.
+ </p>
+
+ <ul>
+ <li>
+ Impala uses dot notation for referring to element names or elements within complex types, and join notation for
+ cross-referencing scalar columns with the elements of complex types within the same row, rather than the <codeph>LATERAL
+ VIEW</codeph> clause and <codeph>EXPLODE()</codeph> function of HiveQL.
+ </li>
+
+ <li>
+ Using join notation lets you use all the kinds of join queries with complex type columns. For example, you can use a
+ <codeph>LEFT OUTER JOIN</codeph>, <codeph>LEFT ANTI JOIN</codeph>, or <codeph>LEFT SEMI JOIN</codeph> query to evaluate
+ different scenarios where the complex columns do or do not contain any elements.
+ </li>
+
+ <li>
+ You can include references to collection types inside subqueries and inline views. For example, you can construct a
+ <codeph>FROM</codeph> clause where one of the <q>tables</q> is a subquery against a complex type column, or use a subquery
+ against a complex type column as the argument to an <codeph>IN</codeph> or <codeph>EXISTS</codeph> clause.
+ </li>
+
+ <li>
+ The Impala pseudocolumn <codeph>POS</codeph> lets you retrieve the position of elements in an array along with the elements
+ themselves, equivalent to the <codeph>POSEXPLODE()</codeph> function of HiveQL. You do not use index notation to retrieve a
+ single array element in a query; the join query loops through the array elements and you use <codeph>WHERE</codeph> clauses to
+ specify which elements to return.
+ </li>
+
+ <li>
+ <p>
+ Join clauses involving complex type columns do not require an <codeph>ON</codeph> or <codeph>USING</codeph> clause. Impala
+ implicitly applies the join key so that the correct array entries or map elements are associated with the correct row from the
+ table.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Impala does not currently support the <codeph>UNION</codeph> complex type.
+ </p>
+ </li>
+ </ul>
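+
+      <p>
+        To illustrate the last points, the following sketch (assuming a hypothetical table <codeph>t</codeph> with a
+        <codeph>BIGINT</codeph> column <codeph>id</codeph> and an <codeph>ARRAY&lt;STRING&gt;</codeph> column <codeph>a</codeph>)
+        retrieves the first array element of each row by filtering on <codeph>POS</codeph> rather than using index notation, and
+        needs no <codeph>ON</codeph> clause:
+      </p>
+
+<codeblock>
+SELECT t.id, a.item
+  FROM t, t.a a
+WHERE a.pos = 0;
+</codeblock>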
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_limits">
+
+ <title>Limitations and Restrictions for Complex Types</title>
+
+ <conbody>
+
+ <p>
+ Complex type columns can only be used in tables or partitions with the Parquet file format.
+ </p>
+
+ <p>
+ Complex type columns cannot be used as partition key columns in a partitioned table.
+ </p>
+
+ <p>
+ When you use complex types with the <codeph>ORDER BY</codeph>, <codeph>GROUP BY</codeph>, <codeph>HAVING</codeph>, or
+ <codeph>WHERE</codeph> clauses, you cannot refer to the column name by itself. Instead, you refer to the names of the scalar
+ values within the complex type, such as the <codeph>ITEM</codeph>, <codeph>POS</codeph>, <codeph>KEY</codeph>, or
+ <codeph>VALUE</codeph> pseudocolumns, or the field names from a <codeph>STRUCT</codeph>.
+ </p>
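+
+      <p>
+        For example, with a hypothetical table <codeph>t</codeph> containing an <codeph>ARRAY&lt;STRING&gt;</codeph> column
+        <codeph>a</codeph>, grouping and ordering refer to <codeph>a.ITEM</codeph> rather than to <codeph>a</codeph> itself:
+      </p>
+
+<codeblock>
+SELECT a.item, COUNT(*)
+  FROM t, t.a a
+GROUP BY a.item
+ORDER BY a.item;
+</codeblock>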
+
+ <p>
+ The maximum depth of nesting for complex types is 100 levels.
+ </p>
+
+ <p>
+ For ideal performance and scalability, use small or medium-sized collections, where all the complex columns contain at most a few
+ hundred megabytes per row. Remember, all the columns of a row are stored in the same HDFS data block, whose size in Parquet files
+ typically ranges from 256 MB to 1 GB.
+ </p>
+
+ <p>
+ Including complex type columns in a table introduces some overhead that might make queries that do not reference those columns
+ somewhat slower than Impala queries against tables without any complex type columns. Expect at most a 2x slowdown compared to
+ tables that do not have any complex type columns.
+ </p>
+
+ <p>
+ Currently, the <codeph>COMPUTE STATS</codeph> statement does not collect any statistics for columns containing complex types.
+ Impala uses heuristics to construct execution plans involving complex type columns.
+ </p>
+
+ <p>
+ Currently, Impala built-in functions and user-defined functions cannot accept complex types as parameters or produce them as
+ function return values. (When the complex type values are materialized in an Impala result set, the result set contains the scalar
+ components of the values, such as the <codeph>POS</codeph> or <codeph>ITEM</codeph> for an <codeph>ARRAY</codeph>, the
+ <codeph>KEY</codeph> or <codeph>VALUE</codeph> for a <codeph>MAP</codeph>, or the fields of a <codeph>STRUCT</codeph>; these
+ scalar data items <i>can</i> be used with built-in functions and UDFs as usual.)
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/complex_types_read_only"/>
+
+ <p>
+ Currently, Impala can query complex type columns only from Parquet tables or Parquet partitions within partitioned tables.
+ Although you can use complex types in tables with Avro, text, and other file formats as part of your ETL pipeline, for example as
+ intermediate tables populated through Hive, doing analytics through Impala requires that the data eventually ends up in a Parquet
+ table. The requirement for Parquet data files means that you can use complex types with Impala tables hosted on other kinds of
+ file storage systems such as Isilon and Amazon S3, but you cannot use Impala to query complex types from HBase tables. See
+ <xref href="impala_complex_types.xml#complex_types_file_formats"/> for more details.
+ </p>
+
+ </conbody>
+
+ </concept>
+
+ </concept>
+
+ <concept id="complex_types_using">
+
+ <title>Using Complex Types from SQL</title>
+
+ <conbody>
+
+ <p>
+      When using complex types through SQL in Impala, you learn the notation for <codeph>&lt; &gt;</codeph> delimiters for the complex
+ type columns in <codeph>CREATE TABLE</codeph> statements, and how to construct join queries to <q>unpack</q> the scalar values
+ nested inside the complex data structures. You might need to condense a traditional RDBMS or data warehouse schema into a smaller
+ number of Parquet tables, and use Hive, Spark, Pig, or other mechanism outside Impala to populate the tables with data.
+ </p>
+
+ <p outputclass="toc inpage"/>
+
+ </conbody>
+
+ <concept id="nested_types_ddl">
+
+ <title>Complex Type Syntax for DDL Statements</title>
+
+ <conbody>
+
+ <p>
+ The definition of <varname>data_type</varname>, as seen in the <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>
+ statements, now includes complex types in addition to primitive types:
+ </p>
+
+<codeblock> primitive_type
+| array_type
+| map_type
+| struct_type
+</codeblock>
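+
+      <p>
+        The complex type productions expand as follows:
+      </p>
+
+<codeblock>array_type: ARRAY &lt;data_type&gt;
+
+map_type: MAP &lt;primitive_type, data_type&gt;
+
+struct_type: STRUCT &lt;name : data_type [COMMENT 'comment_string'], ...&gt;
+</codeblock>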
+
+ <p>
+ Unions are not currently supported.
+ </p>
+
+ <p>
+ Array, struct, and map column type declarations are specified in the <codeph>CREATE TABLE</codeph> statement. You can also add or
+ change the type of complex columns through the <codeph>ALTER TABLE</codeph> statement.
+ </p>
+
+ <note>
+ <p>
+ Currently, Impala queries allow complex types only in tables that use the Parquet format. If an Impala query encounters complex
+ types in a table or partition using another file format, the query returns a runtime error.
+ </p>
+
+ <p>
+ The Impala DDL support for complex types works for all file formats, so that you can create tables using text or other
+ non-Parquet formats for Hive to use as staging tables in an ETL cycle that ends with the data in a Parquet table. You can also
+ use <codeph>ALTER TABLE ... SET FILEFORMAT PARQUET</codeph> to change the file format of an existing table containing complex
+ types to Parquet, after which Impala can query it. Make sure to load Parquet files into the table after changing the file
+ format, because the <codeph>ALTER TABLE ... SET FILEFORMAT</codeph> statement does not convert existing data to the new file
+ format.
+ </p>
+ </note>
+
+ <p conref="../shared/impala_common.xml#common/complex_types_partitioning"/>
+
+ <p>
+ Because use cases for Impala complex types require that you already have Parquet data files produced outside of Impala, you can
+ use the Impala <codeph>CREATE TABLE LIKE PARQUET</codeph> syntax to produce a table with columns that match the structure of an
+ existing Parquet file, including complex type columns for nested data structures. Remember to include the <codeph>STORED AS
+ PARQUET</codeph> clause in this case, because even with <codeph>CREATE TABLE LIKE PARQUET</codeph>, the default file format of the
+ resulting table is still text.
+ </p>
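+
+      <p>
+        For example (the HDFS path here is hypothetical; substitute the location of an actual Parquet data file with complex type
+        columns):
+      </p>
+
+<codeblock>
+CREATE TABLE new_complex_table LIKE PARQUET '/user/hive/staging/complex_data.parquet'
+  STORED AS PARQUET;
+</codeblock>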
+
+ <p>
+ Because the complex columns are omitted from the result set of an Impala <codeph>SELECT *</codeph> or <codeph>SELECT
+ <varname>col_name</varname></codeph> query, and because Impala currently does not support writing Parquet files with complex type
+ columns, you cannot use the <codeph>CREATE TABLE AS SELECT</codeph> syntax to create a table with nested type columns.
+ </p>
+
+ <note>
+ <p>
+ Once you have a table set up with complex type columns, use the <codeph>DESCRIBE</codeph> and <codeph>SHOW CREATE TABLE</codeph>
+          statements to see the correct notation with <codeph>&lt;</codeph> and <codeph>&gt;</codeph> delimiters and comma and colon
+ separators within the complex type definitions. If you do not have existing data with the same layout as the table, you can
+ query the empty table to practice with the notation for the <codeph>SELECT</codeph> statement. In the <codeph>SELECT</codeph>
+ list, you use dot notation and pseudocolumns such as <codeph>ITEM</codeph>, <codeph>KEY</codeph>, and <codeph>VALUE</codeph> for
+ referring to items within the complex type columns. In the <codeph>FROM</codeph> clause, you use join notation to construct
+ table aliases for any referenced <codeph>ARRAY</codeph> and <codeph>MAP</codeph> columns.
+ </p>
+ </note>
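+
+ <p>
+ For example, using the <codeph>CONTACTS_DETAILED_PHONES</codeph> table defined later in this topic, you could practice the
+ join and dot notation against the empty table before any data is loaded:
+ </p>
+
+<codeblock>SELECT name, p.category, p.area_code
+ FROM contacts_detailed_phones, contacts_detailed_phones.phone p;
+</codeblock>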
+
+<!-- To do: show some simple CREATE TABLE statements for each of the complex types, without so much backstory for the schema. -->
+
+ <p>
+ For example, when defining a table that holds contact information, you might represent phone numbers differently depending on the
+ expected layout and relationships of the data, and how well you can predict those properties in advance.
+ </p>
+
+ <p>
+ Here are different ways that you might represent phone numbers in a traditional relational schema, with equivalent representations
+ using complex types.
+ </p>
+
+ <fig id="complex_types_phones_flat_fixed">
+
+ <title>Traditional Relational Representation of Phone Numbers: Single Table</title>
+
+ <p>
+ The traditional, simplest way to represent phone numbers in a relational table is to store all contact info in a single table,
+ with all columns having scalar types, and each potential phone number represented as a separate column. In this example, each
+ person can have only these three types of phone numbers. If the person does not have a particular kind of phone number, the
+ corresponding column is <codeph>NULL</codeph> for that row.
+ </p>
+
+<codeblock>
+CREATE TABLE contacts_fixed_phones
+(
+ id BIGINT
+ , name STRING
+ , address STRING
+ , home_phone STRING
+ , work_phone STRING
+ , mobile_phone STRING
+) STORED AS PARQUET;
+</codeblock>
+
+ </fig>
+
+ <fig id="complex_types_phones_array">
+
+ <title>An Array of Phone Numbers</title>
+
+ <p>
+ Using a complex type column to represent the phone numbers adds some extra flexibility. Now there could be an unlimited number
+ of phone numbers. Because the array elements have an order but not symbolic names, you could decide in advance that
+ phone_number[0] is the home number, [1] is the work number, [2] is the mobile number, and so on. (In subsequent examples, you
+ will see how to create a more flexible naming scheme using other complex type variations, such as a <codeph>MAP</codeph> or an
+ <codeph>ARRAY</codeph> where each element is a <codeph>STRUCT</codeph>.)
+ </p>
+
+<codeblock><![CDATA[
+CREATE TABLE contacts_array_of_phones
+(
+ id BIGINT
+ , name STRING
+ , address STRING
+ , phone_number ARRAY < STRING >
+) STORED AS PARQUET;
+]]>
+</codeblock>
+
+ </fig>
+
+ <fig id="complex_types_phones_map">
+
+ <title>A Map of Phone Numbers</title>
+
+ <p>
+ Another way to represent an arbitrary set of phone numbers is with a <codeph>MAP</codeph> column. With a <codeph>MAP</codeph>,
+ each element is associated with a key value that you specify, which could be a numeric, string, or other scalar type. This
+ example uses a <codeph>STRING</codeph> key to give each phone number a name, such as <codeph>'home'</codeph> or
+ <codeph>'mobile'</codeph>. A query could filter the data based on the key values, or display the key values in reports.
+ </p>
+
+<codeblock><![CDATA[
+CREATE TABLE contacts_unlimited_phones
+(
+ id BIGINT, name STRING, address STRING, phone_number MAP < STRING,STRING >
+) STORED AS PARQUET;
+]]>
+</codeblock>
+
+ </fig>
+
+ <fig id="complex_types_phones_flat_normalized">
+
+ <title>Traditional Relational Representation of Phone Numbers: Normalized Tables</title>
+
+ <p>
+ If you are an experienced database designer, you already know how to work around the limitations of the single-table schema from
+ <xref href="#nested_types_ddl/complex_types_phones_flat_fixed"/>. By normalizing the schema, with the phone numbers in their own
+ table, you can associate an arbitrary set of phone numbers with each person, and associate additional details with each phone
+ number, such as whether it is a home, work, or mobile phone.
+ </p>
+
+ <p>
+ The flexibility of this approach comes with some drawbacks. Reconstructing all the data for a particular person requires a join
+ query, which might require performance tuning on Hadoop because the data from each table might be transmitted from a different
+ host. Data management tasks such as backups and refreshing the data require dealing with multiple tables instead of a single
+ table.
+ </p>
+
+ <p>
+ This example illustrates a traditional database schema that stores contact info normalized across two tables. The fact table
+ establishes the identity and basic information about each person. A dimension table stores information only about phone numbers,
+ using an ID value to associate each phone number with a person ID from the fact table. Each person can have 0, 1, or many
+ phones; the categories are not restricted to a few predefined ones; and the phone table can contain as many columns as desired,
+ to represent all sorts of details about each phone number.
+ </p>
+
+<codeblock>
+CREATE TABLE fact_contacts (id BIGINT, name STRING, address STRING) STORED AS PARQUET;
+CREATE TABLE dim_phones
+(
+ contact_id BIGINT
+ , category STRING
+ , international_code STRING
+ , area_code STRING
+ , exchange STRING
+ , extension STRING
+ , mobile BOOLEAN
+ , carrier STRING
+ , current BOOLEAN
+ , service_start_date TIMESTAMP
+ , service_end_date TIMESTAMP
+)
+STORED AS PARQUET;
+</codeblock>
+
+ </fig>
+
+ <fig id="complex_types_phones_array_struct">
+
+ <title>Phone Numbers Represented as an Array of Structs</title>
+
+ <p>
+ To represent a schema equivalent to the one from <xref href="#nested_types_ddl/complex_types_phones_flat_normalized"/> using
+ complex types, this example uses an <codeph>ARRAY</codeph> where each array element is a <codeph>STRUCT</codeph>. As with the
+ earlier complex type examples, each person can have an arbitrary set of associated phone numbers. Making each array element into
+ a <codeph>STRUCT</codeph> lets us associate multiple data items with each phone number, and give a separate name and type to
+ each data item. The <codeph>STRUCT</codeph> fields of the <codeph>ARRAY</codeph> elements reproduce the columns of the dimension
+ table from the previous example.
+ </p>
+
+ <p>
+ You can do all the same kinds of queries with the complex type schema as with the normalized schema from the previous example.
+ The advantages of the complex type design are in the areas of convenience and performance. Now your backup and ETL processes
+ only deal with a single table. When a query uses a join to cross-reference the information about a person with their associated
+ phone numbers, all the relevant data for each row resides in the same HDFS data block, meaning each row can be processed on a
+ single host without requiring network transmission.
+ </p>
+
+<codeblock><![CDATA[
+CREATE TABLE contacts_detailed_phones
+(
+ id BIGINT, name STRING, address STRING
+ , phone ARRAY < STRUCT <
+ category: STRING
+ , international_code: STRING
+ , area_code: STRING
+ , exchange: STRING
+ , extension: STRING
+ , mobile: BOOLEAN
+ , carrier: STRING
+ , current: BOOLEAN
+ , service_start_date: TIMESTAMP
+ , service_end_date: TIMESTAMP
+ >>
+) STORED AS PARQUET;
+]]>
+</codeblock>
+
+ </fig>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_sql">
+
+ <title>SQL Statements that Support Complex Types</title>
+
+ <conbody>
+
+ <p>
+ The Impala SQL statements that support complex types are currently
+ <codeph><xref href="impala_create_table.xml#create_table">CREATE TABLE</xref></codeph>,
+ <codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE</xref></codeph>,
+ <codeph><xref href="impala_describe.xml#describe">DESCRIBE</xref></codeph>,
+ <codeph><xref href="impala_load_data.xml#load_data">LOAD DATA</xref></codeph>, and
+ <codeph><xref href="impala_select.xml#select">SELECT</xref></codeph>. That is, currently Impala can create or alter tables
+ containing complex type columns, examine the structure of a table containing complex type columns, import existing data files
+ containing complex type columns into a table, and query Parquet tables containing complex types.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/complex_types_read_only"/>
+
+ <p outputclass="toc inpage"/>
+
+ </conbody>
+
+ <concept id="complex_types_ddl">
+
+ <title>DDL Statements and Complex Types</title>
+
+ <conbody>
+
+ <p>
+ Column specifications for complex or nested types use <codeph><</codeph> and <codeph>></codeph> delimiters:
+ </p>
+
+<codeblock><![CDATA[-- What goes inside the < > for an ARRAY is a single type, either a scalar or another
+-- complex type (ARRAY, STRUCT, or MAP).
+CREATE TABLE array_t
+(
+ id BIGINT,
+ a1 ARRAY <STRING>,
+ a2 ARRAY <BIGINT>,
+ a3 ARRAY <TIMESTAMP>,
+ a4 ARRAY <STRUCT <f1: STRING, f2: INT, f3: BOOLEAN>>
+)
+STORED AS PARQUET;
+
+-- What goes inside the < > for a MAP is two comma-separated types specifying the types of the key-value pair:
+-- a scalar type representing the key, and a scalar or complex type representing the value.
+CREATE TABLE map_t
+(
+ id BIGINT,
+ m1 MAP <STRING, STRING>,
+ m2 MAP <STRING, BIGINT>,
+ m3 MAP <BIGINT, STRING>,
+ m4 MAP <BIGINT, BIGINT>,
+ m5 MAP <STRING, ARRAY <STRING>>
+)
+STORED AS PARQUET;
+
+-- What goes inside the < > for a STRUCT is a comma-separated list of fields, each field defined as
+-- name:type. The type can be a scalar or a complex type. The field names for each STRUCT do not clash
+-- with the names of table columns or fields in other STRUCTs. A STRUCT is most often used inside
+-- an ARRAY or a MAP rather than as a top-level column.
+CREATE TABLE struct_t
+(
+ id BIGINT,
+ s1 STRUCT <f1: STRING, f2: BIGINT>,
+ s2 ARRAY <STRUCT <f1: INT, f2: TIMESTAMP>>,
+ s3 MAP <BIGINT, STRUCT <name: STRING, birthday: TIMESTAMP>>
+)
+STORED AS PARQUET;
+]]>
+</codeblock>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_queries">
+
+ <title>Queries and Complex Types</title>
+
+ <conbody>
+
+<!-- Hive does the JSON output business: http://www.datascience-labs.com/hive/hiveql-data-manipulation/ -->
+
+<!-- SELECT * works but skips any nested type coloumns. -->
+
+ <p>
+ The result set of an Impala query always contains only scalar values; the elements and fields within any complex type column
+ must be <q>unpacked</q> using join queries. A query cannot directly retrieve the entire value for a complex type column; Impala
+ returns an error in this case. Queries using <codeph>SELECT *</codeph> are allowed for tables with complex types, but the
+ columns with complex types are skipped.
+ </p>
+
+ <p>
+ The following example shows how referring directly to a complex type column returns an error, while <codeph>SELECT *</codeph> on
+ the same table succeeds, but only retrieves the scalar columns.
+ </p>
+
+ <note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/>
+
+<!-- Original error message:
+ERROR: AnalysisException: Expr 'c_orders' in select list returns a complex type 'ARRAY<STRUCT<o_orderkey:BIGINT,o_orderstatus:STRING,o_totalprice:DECIMAL(12,2),o_orderdate:STRING,o_orderpriority:STRING,o_clerk:STRING,o_shippriority:INT,o_comment:STRING,o_lineitems:ARRAY<STRUCT<l_partkey:BIGINT,l_suppkey:BIGINT,l_linenumber:INT,l_quantity:DECIMAL(12,2),l_extendedprice:DECIMAL(12,2),l_discount:DECIMAL(12,2),l_tax:DECIMAL(12,2),l_returnflag:STRING,l_linestatus:STRING,l_shipdate:STRING,l_commitdate:STRING,l_receiptdate:STRING,l_shipinstruct:STRING,l_shipmode:STRING,l_comment:STRING>>>>'.
+-->
+
+<codeblock><![CDATA[SELECT c_orders FROM customer LIMIT 1;
+ERROR: AnalysisException: Expr 'c_orders' in select list returns a complex type 'ARRAY<STRUCT<o_orderkey:BIGINT,o_orderstatus:STRING, ... l_receiptdate:STRING,l_shipinstruct:STRING,l_shipmode:STRING,l_comment:STRING>>>>'.
+Only scalar types are allowed in the select list.
+
+-- Original column has several scalar and one complex column.
+DESCRIBE customer;
++--------------+------------------------------------+
+| name | type |
++--------------+------------------------------------+
+| c_custkey | bigint |
+| c_name | string |
+...
+| c_orders | array<struct< |
+| | o_orderkey:bigint, |
+| | o_orderstatus:string, |
+| | o_totalprice:decimal(12,2), |
+...
+| | >> |
++--------------+------------------------------------+
+
+-- When we SELECT * from that table, only the scalar columns come back in the result set.
+CREATE TABLE select_star_customer STORED AS PARQUET AS SELECT * FROM customer;
++------------------------+
+| summary |
++------------------------+
+| Inserted 150000 row(s) |
++------------------------+
+
+-- The c_orders column, being of complex type, was not included in the SELECT * result set.
+DESC select_star_customer;
++--------------+---------------+
+| name | type |
++--------------+---------------+
+| c_custkey | bigint |
+| c_name | string |
+| c_address | string |
+| c_nationkey | smallint |
+| c_phone | string |
+| c_acctbal | decimal(12,2) |
+| c_mktsegment | string |
+| c_comment | string |
++--------------+---------------+
+]]>
+</codeblock>
+
+<!-- To do: These "references to..." bits could be promoted to their own 'expressions' subheads. -->
+
+ <p>
+ References to fields within <codeph>STRUCT</codeph> columns use dot notation. If the field name is unambiguous, you can omit
+ qualifiers such as table name, column name, or even the <codeph>ITEM</codeph> or <codeph>VALUE</codeph> pseudocolumn names for
+ <codeph>STRUCT</codeph> elements inside an <codeph>ARRAY</codeph> or a <codeph>MAP</codeph>.
+ </p>
+
+<!-- To do: rewrite example to use CUSTOMER table. -->
+
+<!-- Don't think TPC-H schema has a bare STRUCT to use in such a simple query though. -->
+
+<!-- Perhaps reuse the STRUCT_DEMO example here. -->
+
+<codeblock>SELECT id, address.city FROM customers WHERE address.zip = 94305;
+</codeblock>
+
+ <p>
+ References to elements within <codeph>ARRAY</codeph> columns use the <codeph>ITEM</codeph> pseudocolumn:
+ </p>
+
+<!-- To do: shorten qualified names. -->
+
+<codeblock>SELECT r_name, r_nations.item.n_name FROM region, region.r_nations LIMIT 7;
++--------+----------------+
+| r_name | item.n_name |
++--------+----------------+
+| EUROPE | UNITED KINGDOM |
+| EUROPE | RUSSIA |
+| EUROPE | ROMANIA |
+| EUROPE | GERMANY |
+| EUROPE | FRANCE |
+| ASIA | VIETNAM |
+| ASIA | CHINA |
++--------+----------------+
+</codeblock>
+
+ <p>
+ References to fields within <codeph>MAP</codeph> columns use the <codeph>KEY</codeph> and <codeph>VALUE</codeph> pseudocolumns.
+ In this example, once the query establishes the alias <codeph>MAP_FIELD</codeph> for a <codeph>MAP</codeph> column with a
+ <codeph>STRING</codeph> key and an <codeph>INT</codeph> value, the query can refer to <codeph>MAP_FIELD.KEY</codeph> and
+ <codeph>MAP_FIELD.VALUE</codeph>, which have zero, one, or many instances for each row from the containing table.
+ </p>
+
+<codeblock><![CDATA[DESCRIBE table_0;
++---------+-----------------------+
+| name | type |
++---------+-----------------------+
+| field_0 | string |
+| field_1 | map<string,int> |
+...
+
+SELECT field_0, map_field.key, map_field.value
+ FROM table_0, table_0.field_1 AS map_field
+WHERE length(field_0) = 1
+LIMIT 10;
++---------+-----------+-------+
+| field_0 | key | value |
++---------+-----------+-------+
+| b | gshsgkvd | NULL |
+| b | twrtcxj6 | 18 |
+| b | 2vp5 | 39 |
+| b | fh0s | 13 |
+| v | 2 | 41 |
+| v | 8b58mz | 20 |
+| v | hw | 16 |
+| v | 65l388pyt | 29 |
+| v | 03k68g91z | 30 |
+| v | r2hlg5b | NULL |
++---------+-----------+-------+
+]]>
+</codeblock>
+
+<!-- To do: refer to or reuse examples from the other subtopics that discuss pseudocolumns etc. -->
+
+ <p>
+ When complex types are nested inside each other, you use a combination of joins, pseudocolumn names, and dot notation to refer
+ to specific fields at the appropriate level. This is the most frequent form of query syntax for complex columns, because the
+ typical use case involves two levels of complex types, such as an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> elements.
+ </p>
+
+<!-- To do: rewrite example to use CUSTOMER table. -->
+
+<!-- This is my own manufactured example so I have the table, and the query works, but I don't have sample data to show. -->
+
+<codeblock>SELECT id, phone_numbers.area_code FROM contact_info_many_structs INNER JOIN contact_info_many_structs.phone_numbers phone_numbers LIMIT 3;
+</codeblock>
+
+ <p>
+ You can express relationships between <codeph>ARRAY</codeph> and <codeph>MAP</codeph> columns at different levels as joins. You
+ include comparison operators between fields at the top level and within the nested type columns so that Impala can do the
+ appropriate join operation.
+ </p>
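+
+ <p>
+ For example, with a hypothetical <codeph>CUSTOMERS</codeph> table where each row includes an <codeph>ORDERS</codeph> array of
+ <codeph>STRUCT</codeph> elements, a comparison between a top-level column and a field inside the array acts as an explicit
+ join condition:
+ </p>
+
+<codeblock><![CDATA[-- Hypothetical schema: customers(id BIGINT, preferred_cc STRING,
+-- orders ARRAY <STRUCT <txn_id: BIGINT, cc: STRING>>).
+-- Return only the orders paid with each customer's preferred credit card.
+SELECT o.txn_id FROM customers c, c.orders o WHERE o.cc = c.preferred_cc;
+]]>
+</codeblock>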
+
+<!-- Don't think TPC-H schema has any instances where outer field matches up with inner one though. -->
+
+<!-- But don't think this usage is important enough to call out at this early point. Hide the example for now. -->
+
+<!--
+<codeblock>SELECT o.txn_id FROM customers c, c.orders o WHERE o.cc = c.preferred_cc;
+SELECT c.id, o.txn_id FROM customers c, c.orders o;
+</codeblock>
+-->
+
+<!-- To do: move these examples down, to the examples subtopic at the end. -->
+
+ <note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/>
+
+ <p>
+ For example, the following queries work equivalently. They each return customer and order data for customers that have at least
+ one order.
+ </p>
+
+<codeblock>SELECT c.c_name, o.o_orderkey FROM customer c, c.c_orders o LIMIT 5;
++--------------------+------------+
+| c_name | o_orderkey |
++--------------------+------------+
+| Customer#000072578 | 558821 |
+| Customer#000072578 | 2079810 |
+| Customer#000072578 | 5768068 |
+| Customer#000072578 | 1805604 |
+| Customer#000072578 | 3436389 |
++--------------------+------------+
+
+SELECT c.c_name, o.o_orderkey FROM customer c INNER JOIN c.c_orders o LIMIT 5;
++--------------------+------------+
+| c_name | o_orderkey |
++--------------------+------------+
+| Customer#000072578 | 558821 |
+| Customer#000072578 | 2079810 |
+| Customer#000072578 | 5768068 |
+| Customer#000072578 | 1805604 |
+| Customer#000072578 | 3436389 |
++--------------------+------------+
+</codeblock>
+
+ <p>
+ The following query using an outer join returns customers that have orders, plus customers with no orders (no entries in the
+ <codeph>C_ORDERS</codeph> array):
+ </p>
+
+<codeblock><![CDATA[SELECT c.c_custkey, o.o_orderkey
+ FROM customer c LEFT OUTER JOIN c.c_orders o
+LIMIT 5;
++-----------+------------+
+| c_custkey | o_orderkey |
++-----------+------------+
+| 60210 | NULL |
+| 147873 | NULL |
+| 72578 | 558821 |
+| 72578 | 2079810 |
+| 72578 | 5768068 |
++-----------+------------+
+]]>
+</codeblock>
+
+ <p>
+ The following query returns <i>only</i> customers that have no orders. (With <codeph>LEFT ANTI JOIN</codeph> or <codeph>LEFT
+ SEMI JOIN</codeph>, the query can only refer to columns from the left-hand table, because by definition there is no matching
+ information in the right-hand table.)
+ </p>
+
+<codeblock><![CDATA[SELECT c.c_custkey, c.c_name
+ FROM customer c LEFT ANTI JOIN c.c_orders o
+LIMIT 5;
++-----------+--------------------+
+| c_custkey | c_name |
++-----------+--------------------+
+| 60210 | Customer#000060210 |
+| 147873 | Customer#000147873 |
+| 141576 | Customer#000141576 |
+| 85365 | Customer#000085365 |
+| 70998 | Customer#000070998 |
++-----------+--------------------+
+]]>
+</codeblock>
+
+<!-- To do: promote the correlated subquery aspect into its own subtopic. -->
+
+ <p>
+ You can also perform correlated subqueries to examine the properties of complex type columns for each row in the result set.
+ </p>
+
+ <p>
+ Count the number of orders per customer. Note the correlated reference to the table alias <codeph>C</codeph>. The
+ <codeph>COUNT(*)</codeph> operation applies to all the elements of the <codeph>C_ORDERS</codeph> array for the corresponding
+ row, avoiding the need for a <codeph>GROUP BY</codeph> clause.
+ </p>
+
+<codeblock>SELECT c_name, howmany FROM customer c, (SELECT COUNT(*) howmany FROM c.c_orders) v LIMIT 5;
++--------------------+---------+
+| c_name | howmany |
++--------------------+---------+
+| Customer#000030065 | 15 |
+| Customer#000065455 | 18 |
+| Customer#000113644 | 21 |
+| Customer#000111078 | 0 |
+| Customer#000024621 | 0 |
++--------------------+---------+
+</codeblock>
+
+ <p>
+ Count the number of orders per customer, ignoring any customers that have not placed any orders:
+ </p>
+
+<codeblock>SELECT c_name, howmany_orders
+FROM
+ customer c,
+ (SELECT COUNT(*) howmany_orders FROM c.c_orders) subq1
+WHERE howmany_orders > 0
+LIMIT 5;
++--------------------+----------------+
+| c_name | howmany_orders |
++--------------------+----------------+
+| Customer#000072578 | 7 |
+| Customer#000046378 | 26 |
+| Customer#000069815 | 11 |
+| Customer#000079058 | 12 |
+| Customer#000092239 | 26 |
++--------------------+----------------+
+</codeblock>
+
+ <p>
+ Count the number of line items in each customer's orders. The reference to <codeph>C.C_ORDERS</codeph> in the <codeph>FROM</codeph> clause
+ is needed because the <codeph>O_ORDERKEY</codeph> field is a member of the elements in the <codeph>C_ORDERS</codeph> array. The
+ subquery labelled <codeph>SUBQ1</codeph> is correlated: it is re-evaluated for the <codeph>C_ORDERS.O_LINEITEMS</codeph> array
+ from each row of the <codeph>CUSTOMER</codeph> table, so the count covers all line items across that customer's orders.
+ </p>
+
+<codeblock>SELECT c_name, o_orderkey, howmany_line_items
+FROM
+ customer c,
+ c.c_orders t2,
+ (SELECT COUNT(*) howmany_line_items FROM c.c_orders.o_lineitems) subq1
+WHERE howmany_line_items > 0
+LIMIT 5;
++--------------------+------------+--------------------+
+| c_name | o_orderkey | howmany_line_items |
++--------------------+------------+--------------------+
+| Customer#000020890 | 1884930 | 95 |
+| Customer#000020890 | 4570754 | 95 |
+| Customer#000020890 | 3771072 | 95 |
+| Customer#000020890 | 2555489 | 95 |
+| Customer#000020890 | 919171 | 95 |
++--------------------+------------+--------------------+
+</codeblock>
+
+ <p>
+ Get the number of orders, the average order price, and the maximum items in any order per customer. For this example, the
+ subqueries labelled <codeph>SUBQ1</codeph> and <codeph>SUBQ2</codeph> are correlated: they are re-evaluated for each row from
+ the original <codeph>CUSTOMER</codeph> table, and only apply to the complex columns associated with that row.
+ </p>
+
+<codeblock>SELECT c_name, howmany, average_price, most_items
+FROM
+ customer c,
+ (SELECT COUNT(*) howmany, AVG(o_totalprice) average_price FROM c.c_orders) subq1,
+ (SELECT MAX(l_quantity) most_items FROM c.c_orders.o_lineitems ) subq2
+LIMIT 5;
++--------------------+---------+---------------+------------+
+| c_name | howmany | average_price | most_items |
++--------------------+---------+---------------+------------+
+| Customer#000030065 | 15 | 128908.34 | 50.00 |
+| Customer#000088191 | 0 | NULL | NULL |
+| Customer#000101555 | 10 | 164250.31 | 50.00 |
+| Customer#000022092 | 0 | NULL | NULL |
+| Customer#000036277 | 27 | 166040.06 | 50.00 |
++--------------------+---------+---------------+------------+
+</codeblock>
+
+ <p>
+ For example, these queries show how to access information about the <codeph>ARRAY</codeph> elements within the
+ <codeph>CUSTOMER</codeph> table from the <q>nested TPC-H</q> schema, starting with the initial <codeph>ARRAY</codeph> elements
+ and progressing to examine the <codeph>STRUCT</codeph> fields of the <codeph>ARRAY</codeph>, and then the elements nested within
+ another <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>:
+ </p>
+
+<codeblock><![CDATA[-- How many orders does each customer have?
+-- The type of the ARRAY column doesn't matter; this is just counting the elements.
+SELECT c_custkey, count(*)
+ FROM customer, customer.c_orders
+GROUP BY c_custkey
+LIMIT 5;
++-----------+----------+
+| c_custkey | count(*) |
++-----------+----------+
+| 61081 | 21 |
+| 115987 | 15 |
+| 69685 | 19 |
+| 109124 | 15 |
+| 50491 | 12 |
++-----------+----------+
+
+-- How many line items are part of each customer order?
+-- Now we examine a field from a STRUCT nested inside the ARRAY.
+SELECT c_custkey, c_orders.o_orderkey, count(*)
+ FROM customer, customer.c_orders c_orders, c_orders.o_lineitems
+GROUP BY c_custkey, c_orders.o_orderkey
+LIMIT 5;
++-----------+------------+----------+
+| c_custkey | o_orderkey | count(*) |
++-----------+------------+----------+
+| 63367 | 4985959 | 7 |
+| 53989 | 1972230 | 2 |
+| 143513 | 5750498 | 5 |
+| 17849 | 4857989 | 1 |
+| 89881 | 1046437 | 1 |
++-----------+------------+----------+
+
+-- What are the line items in each customer order?
+-- One of the STRUCT fields inside the ARRAY is another
+-- ARRAY containing STRUCT elements. The join finds
+-- all the related items from both levels of ARRAY.
+SELECT c_custkey, o_orderkey, l_partkey
+ FROM customer, customer.c_orders, c_orders.o_lineitems
+LIMIT 5;
++-----------+------------+-----------+
+| c_custkey | o_orderkey | l_partkey |
++-----------+------------+-----------+
+| 113644 | 2738497 | 175846 |
+| 113644 | 2738497 | 27309 |
+| 113644 | 2738497 | 175873 |
+| 113644 | 2738497 | 88559 |
+| 113644 | 2738497 | 8032 |
++-----------+------------+-----------+
+]]>
+</codeblock>
+
+ </conbody>
+
+ </concept>
+
+ </concept>
+
+ <concept id="pseudocolumns">
+
+ <title>Pseudocolumns for ARRAY and MAP Types</title>
+
+ <conbody>
+
+ <p>
+ Each element in an <codeph>ARRAY</codeph> type has a position, indexed starting from zero, and a value. Each element in a
+ <codeph>MAP</codeph> type represents a key-value pair. Impala provides pseudocolumns that let you retrieve this metadata as part
+ of a query, or filter query results by including such things in a <codeph>WHERE</codeph> clause. You refer to the pseudocolumns as
+ part of qualified column names in queries:
+ </p>
+
+ <ul>
+ <li>
+ <codeph>ITEM</codeph>: The value of an array element. If the <codeph>ARRAY</codeph> contains <codeph>STRUCT</codeph> elements,
+ you can refer to either <codeph><varname>array_name</varname>.ITEM.<varname>field_name</varname></codeph> or use the shorthand
+ <codeph><varname>array_name</varname>.<varname>field_name</varname></codeph>.
+ </li>
+
+ <li>
+ <codeph>POS</codeph>: The position of an element within an array.
+ </li>
+
+ <li>
+ <codeph>KEY</codeph>: The value forming the first part of a key-value pair in a map. It is not necessarily unique.
+ </li>
+
+ <li>
+ <codeph>VALUE</codeph>: The data item forming the second part of a key-value pair in a map. If the <codeph>VALUE</codeph> part
+ of the <codeph>MAP</codeph> element is a <codeph>STRUCT</codeph>, you can refer to either
+ <codeph><varname>map_name</varname>.VALUE.<varname>field_name</varname></codeph> or use the shorthand
+ <codeph><varname>map_name</varname>.<varname>field_name</varname></codeph>.
+ </li>
+ </ul>
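+
+ <p>
+ As a compact illustration (the table and column names here are hypothetical), the pseudocolumns appear in queries as
+ qualified names on the table aliases established in the <codeph>FROM</codeph> clause:
+ </p>
+
+<codeblock><![CDATA[-- Hypothetical table: t1(id BIGINT, a ARRAY <STRING>, m MAP <STRING,INT>).
+SELECT id, arr.POS, arr.ITEM FROM t1, t1.a arr;
+SELECT id, mp.KEY, mp.VALUE FROM t1, t1.m mp;
+]]>
+</codeblock>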
+
+<!-- To do: Consider whether to move the detailed subtopics underneath ARRAY and MAP instead of embedded here. -->
+
+ <p outputclass="toc inpage"/>
+
+ </conbody>
+
+ <concept id="item">
+
+ <title id="pos">ITEM and POS Pseudocolumns</title>
+
+ <conbody>
+
+ <p>
+ When an <codeph>ARRAY</codeph> column contains <codeph>STRUCT</codeph> elements, you can refer to a field within the
+ <codeph>STRUCT</codeph> using a qualified name of the form
+ <codeph><varname>array_column</varname>.<varname>field_name</varname></codeph>. If the <codeph>ARRAY</codeph> contains scalar
+ values, Impala recognizes the special name <codeph><varname>array_column</varname>.ITEM</codeph> to represent the value of each
+ scalar array element. For example, if a column contained an <codeph>ARRAY</codeph> where each element was a
+ <codeph>STRING</codeph>, you would use <codeph><varname>array_name</varname>.ITEM</codeph> to refer to each scalar value in the
+ <codeph>SELECT</codeph> list, or the <codeph>WHERE</codeph> or other clauses.
+ </p>
+
+ <p>
+ This example shows a table with two <codeph>ARRAY</codeph> columns whose elements are of the scalar type
+ <codeph>STRING</codeph>. When referring to the values of the array elements in the <codeph>SELECT</codeph> list,
+ <codeph>WHERE</codeph> clause, or <codeph>ORDER BY</codeph> clause, you use the <codeph>ITEM</codeph> pseudocolumn because
+ within the array, the individual elements have no defined names.
+ </p>
+
+<codeblock><![CDATA[CREATE TABLE persons_of_interest
+(
+person_id BIGINT,
+aliases ARRAY <STRING>,
+associates ARRAY <STRING>,
+real_name STRING
+)
+STORED AS PARQUET;
+
+-- Get all the aliases of each person.
+SELECT real_name, aliases.ITEM
+ FROM persons_of_interest, persons_of_interest.aliases
+ORDER BY real_name, aliases.item;
+
+-- Search for particular associates of each person.
+SELECT real_name, associates.ITEM
+ FROM persons_of_interest, persons_of_interest.associates
+WHERE associates.item LIKE '% MacGuffin';
+]]>
+</codeblock>
+
+ <p>
+ Because an array is inherently an ordered data structure, Impala recognizes the special name
+ <codeph><varname>array_column</varname>.POS</codeph> to represent the numeric position of each element within the array. The
+ <codeph>POS</codeph> pseudocolumn lets you filter or reorder the result set based on the sequence of array elements.
+ </p>
+
+ <p>
+ The following example uses a table from a flattened version of the TPC-H schema. The <codeph>REGION</codeph> table only has a
+ few rows, such as one row for Europe and one for Asia. The row for each region represents all the countries in that region as an
+ <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> elements:
+ </p>
+
+<codeblock><![CDATA[[localhost:21000] > desc region;
++-------------+--------------------------------------------------------------------+
+| name | type |
++-------------+--------------------------------------------------------------------+
+| r_regionkey | smallint |
+| r_name | string |
+| r_comment | string |
+| r_nations | array<struct<n_nationkey:smallint,n_name:string,n_comment:string>> |
++-------------+--------------------------------------------------------------------+
+]]>
+</codeblock>
+
+ <p>
+ To find the countries within a specific region, you use a join query. To find out the order of elements in the array, you also
+ refer to the <codeph>POS</codeph> pseudocolumn in the select list:
+ </p>
+
+<codeblock>[localhost:21000] > SELECT r1.r_name, r2.n_name, <b>r2.POS</b>
+ > FROM region r1 INNER JOIN r1.r_nations r2
+ > WHERE r1.r_name = 'ASIA';
++--------+-----------+-----+
+| r_name | n_name | pos |
++--------+-----------+-----+
+| ASIA | VIETNAM | 0 |
+| ASIA | CHINA | 1 |
+| ASIA | JAPAN | 2 |
+| ASIA | INDONESIA | 3 |
+| ASIA | INDIA | 4 |
++--------+-----------+-----+
+</codeblock>
+
+ <p>
+ Once you know the positions of the elements, you can use that information in subsequent queries, for example to change the
+ ordering of results from the complex type column or to filter certain elements from the array:
+ </p>
+
+<codeblock>[localhost:21000] > SELECT r1.r_name, r2.n_name, r2.POS
+ > FROM region r1 INNER JOIN r1.r_nations r2
+ > WHERE r1.r_name = 'ASIA'
+ > <b>ORDER BY r2.POS DESC</b>;
++--------+-----------+-----+
+| r_name | n_name | pos |
++--------+-----------+-----+
+| ASIA | INDIA | 4 |
+| ASIA | INDONESIA | 3 |
+| ASIA | JAPAN | 2 |
+| ASIA | CHINA | 1 |
+| ASIA | VIETNAM | 0 |
++--------+-----------+-----+
+[localhost:21000] > SELECT r1.r_name, r2.n_name, r2.POS
+ > FROM region r1 INNER JOIN r1.r_nations r2
+ > WHERE r1.r_name = 'ASIA' AND <b>r2.POS BETWEEN 1 and 3</b>;
++--------+-----------+-----+
+| r_name | n_name | pos |
++--------+-----------+-----+
+| ASIA | CHINA | 1 |
+| ASIA | JAPAN | 2 |
+| ASIA | INDONESIA | 3 |
++--------+-----------+-----+
+</codeblock>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="key">
+
+ <title id="value">KEY and VALUE Pseudocolumns</title>
+
+ <conbody>
+
+ <p>
+ The <codeph>MAP</codeph> data type is suitable for representing sparse or wide data structures, where each row might only have
+ entries for a small subset of named fields. Because the element names (the map keys) vary depending on the row, a query must be
+ able to refer to both the key and the value parts of each key-value pair. The <codeph>KEY</codeph> and <codeph>VALUE</codeph>
+ pseudocolumns let you refer to the parts of the key-value pair independently within the query, as
+ <codeph><varname>map_column</varname>.KEY</codeph> and <codeph><varname>map_column</varname>.VALUE</codeph>.
+ </p>
+
+ <p>
+ The <codeph>KEY</codeph> must always be a scalar type, such as <codeph>STRING</codeph>, <codeph>BIGINT</codeph>, or
+ <codeph>TIMESTAMP</codeph>. It can be <codeph>NULL</codeph>. Values of the <codeph>KEY</codeph> field are not necessarily unique
+ within the same <codeph>MAP</codeph>. You apply any required <codeph>DISTINCT</codeph>, <codeph>GROUP BY</codeph>, and other
+ clauses in the query, and loop through the result set to process all the values matching any specified keys.
+ </p>
+
+ <p>
+ The <codeph>VALUE</codeph> can be either a scalar type or another complex type. If the <codeph>VALUE</codeph> is a
+ <codeph>STRUCT</codeph>, you can construct a qualified name
+ <codeph><varname>map_column</varname>.VALUE.<varname>struct_field</varname></codeph> to refer to the individual fields inside
+ the value part. If the <codeph>VALUE</codeph> is an <codeph>ARRAY</codeph> or another <codeph>MAP</codeph>, you must include
+ another join condition that establishes a table alias for <codeph><varname>map_column</varname>.VALUE</codeph>, and then
+ construct another qualified name using that alias, for example <codeph><varname>table_alias</varname>.ITEM</codeph> or
+      <codeph><varname>table_alias</varname>.KEY</codeph> and <codeph><varname>table_alias</varname>.VALUE</codeph>.
+      </p>
+
+ <p>
+ The following example shows different ways to access a <codeph>MAP</codeph> column using the <codeph>KEY</codeph> and
+ <codeph>VALUE</codeph> pseudocolumns. The <codeph>DETAILS</codeph> column has a <codeph>STRING</codeph> first part with short,
+ standardized values such as <codeph>'Recurring'</codeph>, <codeph>'Lucid'</codeph>, or <codeph>'Anxiety'</codeph>. This is the
+ <q>key</q> that is used to look up particular kinds of elements from the <codeph>MAP</codeph>. The second part, also a
+ <codeph>STRING</codeph>, is a longer free-form explanation. Impala gives you the standard pseudocolumn names
+ <codeph>KEY</codeph> and <codeph>VALUE</codeph> for the two parts, and you apply your own conventions and interpretations to the
+ underlying values.
+ </p>
+
+ <note>
+ If you find that the single-item nature of the <codeph>VALUE</codeph> makes it difficult to model your data accurately, the
+ solution is typically to add some nesting to the complex type. For example, to have several sets of key-value pairs, make the
+ column an <codeph>ARRAY</codeph> whose elements are <codeph>MAP</codeph>. To make a set of key-value pairs that holds more
+ elaborate information, make a <codeph>MAP</codeph> column whose <codeph>VALUE</codeph> part contains an <codeph>ARRAY</codeph>
+ or a <codeph>STRUCT</codeph>.
+ </note>
+
+<codeblock><![CDATA[CREATE TABLE dream_journal
+(
+ dream_id BIGINT,
+ details MAP <STRING,STRING>
+)
+STORED AS PARQUET;
+]]>
+
+-- What are all the types of dreams that are recorded?
+SELECT DISTINCT details.KEY FROM dream_journal, dream_journal.details;
+
+-- How many lucid dreams were recorded?
+-- Because there is no GROUP BY, we count the 'Lucid' keys across all rows.
+SELECT <b>COUNT(details.KEY)</b>
+ FROM dream_journal, dream_journal.details
+WHERE <b>details.KEY = 'Lucid'</b>;
+
+-- Print a report of a subset of dreams, filtering based on both the lookup key
+-- and the detailed value.
+SELECT dream_id, <b>details.KEY AS "Dream Type"</b>, <b>details.VALUE AS "Dream Summary"</b>
+ FROM dream_journal, dream_journal.details
+WHERE
+ <b>details.KEY IN ('Happy', 'Pleasant', 'Joyous')</b>
+ AND <b>details.VALUE LIKE '%childhood%'</b>;
+</codeblock>
+
+ <p>
+ The following example shows a more elaborate version of the previous table, where the <codeph>VALUE</codeph> part of the
+ <codeph>MAP</codeph> entry is a <codeph>STRUCT</codeph> rather than a scalar type. Now instead of referring to the
+ <codeph>VALUE</codeph> pseudocolumn directly, you use dot notation to refer to the <codeph>STRUCT</codeph> fields inside it.
+ </p>
+
+<codeblock><![CDATA[CREATE TABLE better_dream_journal
+(
+ dream_id BIGINT,
+ details MAP <STRING,STRUCT <summary: STRING, when_happened: TIMESTAMP, duration: DECIMAL(5,2), woke_up: BOOLEAN> >
+)
+STORED AS PARQUET;
+]]>
+
+-- Do more elaborate reporting and filtering by examining multiple attributes within the same dream.
+SELECT dream_id, <b>details.KEY AS "Dream Type"</b>, <b>details.VALUE.summary AS "Dream Summary"</b>, <b>details.VALUE.duration AS "Duration"</b>
+ FROM better_dream_journal, better_dream_journal.details
+WHERE
+ <b>details.KEY IN ('Anxiety', 'Nightmare')</b>
+ AND <b>details.VALUE.duration > 60</b>
+ AND <b>details.VALUE.woke_up = TRUE</b>;
+
+-- Remember that if the ITEM or VALUE contains a STRUCT, you can reference
+-- the STRUCT fields directly without the .ITEM or .VALUE qualifier.
+SELECT dream_id, <b>details.KEY AS "Dream Type"</b>, <b>details.summary AS "Dream Summary"</b>, <b>details.duration AS "Duration"</b>
+ FROM better_dream_journal, better_dream_journal.details
+WHERE
+ <b>details.KEY IN ('Anxiety', 'Nightmare')</b>
+ AND <b>details.duration > 60</b>
+ AND <b>details.woke_up = TRUE</b>;
+</codeblock>
+
+ </conbody>
+
+ </concept>
+
+ </concept>
+
+ <concept id="complex_types_etl">
+
+<!-- This topic overlaps in many ways with the preceding one. See which theme resonates with users, and combine them under the better title. -->
+
+ <title>Loading Data Containing Complex Types</title>
+
+ <conbody>
+
+    <p>
+      The Impala <codeph>INSERT</codeph> statement does not currently support creating new data with complex type columns, or
+      copying existing complex type values from one table to another. Therefore, you primarily use Impala to query Parquet tables
+      containing complex types where the data was inserted through Hive, or to create tables with complex types on top of existing
+      Parquet data files.
+    </p>
+
+ <p>
+ If you have created a Hive table with the Parquet file format and containing complex types, use the same table for Impala queries
+ with no changes. If you have such a Hive table in some other format, use a Hive <codeph>CREATE TABLE AS SELECT ... STORED AS
+ PARQUET</codeph> or <codeph>INSERT ... SELECT</codeph> statement to produce an equivalent Parquet table that Impala can query.
+ </p>
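+
+    <p>
+      For example, a Hive session could produce a Parquet equivalent of a complex type table stored in some other file format
+      like this. (The table names here are illustrative, not from an actual schema.)
+    </p>
+
+<codeblock>-- Run in Hive, not Impala: produce a Parquet equivalent of a complex type
+-- table stored in some other file format.
+CREATE TABLE events_parquet STORED AS PARQUET AS SELECT * FROM events_avro;
+</codeblock>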
+
+ <p>
+ If you have existing Parquet data files containing complex types, located outside of any Impala or Hive table, such as data files
+ created by Spark jobs, you can use an Impala <codeph>CREATE TABLE ... STORED AS PARQUET</codeph> statement, followed by an Impala
+ <codeph>LOAD DATA</codeph> statement to move the data files into the table. As an alternative, you can use an Impala
+ <codeph>CREATE EXTERNAL TABLE</codeph> statement to create a table pointing to the HDFS directory that already contains the data
+ files.
+ </p>
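+
+    <p>
+      The following sketch shows both techniques. The column definitions and the HDFS path are illustrative; substitute the
+      schema and location of your own data files.
+    </p>
+
+<codeblock><![CDATA[-- Technique 1: create the table, then move the existing data files into it.
+CREATE TABLE spark_events (id BIGINT, tags ARRAY<STRING>) STORED AS PARQUET;
+LOAD DATA INPATH '/user/etl/spark_output' INTO TABLE spark_events;
+
+-- Technique 2: leave the data files in place and point an external table at them.
+CREATE EXTERNAL TABLE spark_events_ext (id BIGINT, tags ARRAY<STRING>)
+  STORED AS PARQUET LOCATION '/user/etl/spark_output';
+]]>
+</codeblock>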
+
+ <p>
+ Perhaps the simplest way to get started with complex type data is to take a denormalized table containing duplicated values, and
+ use an <codeph>INSERT ... SELECT</codeph> statement to copy the data into a Parquet table and condense the repeated values into
+ complex types. With the Hive <codeph>INSERT</codeph> statement, you use the <codeph>COLLECT_LIST()</codeph>,
+ <codeph>NAMED_STRUCT()</codeph>, and <codeph>MAP()</codeph> constructor functions within a <codeph>GROUP BY</codeph> query to
+ produce the complex type values. <codeph>COLLECT_LIST()</codeph> turns a sequence of values into an <codeph>ARRAY</codeph>.
+ <codeph>NAMED_STRUCT()</codeph> uses the first, third, and so on arguments as the field names for a <codeph>STRUCT</codeph>, to
+ match the field names from the <codeph>CREATE TABLE</codeph> statement.
+ </p>
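+
+    <p>
+      For example, assuming a flat table <codeph>flat_nations</codeph> with one row per nation, and a target table
+      <codeph>region_nested</codeph> whose last column is an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> like the
+      <codeph>R_NATIONS</codeph> column shown earlier, a Hive statement such as the following could condense the repeated
+      region values. (Both table names and the column list are illustrative.)
+    </p>
+
+<codeblock>-- Run in Hive, not Impala.
+INSERT INTO TABLE region_nested
+  SELECT r_regionkey, r_name,
+    COLLECT_LIST(NAMED_STRUCT('n_nationkey', n_nationkey, 'n_name', n_name, 'n_comment', n_comment))
+  FROM flat_nations
+  GROUP BY r_regionkey, r_name;
+</codeblock>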
+
+ <note>
+ Because Hive currently cannot construct individual rows using complex types through the <codeph>INSERT ... VALUES</codeph> syntax,
+ you prepare the data in flat form in a separate table, then copy it to the table with complex columns using <codeph>INSERT ...
+ SELECT</codeph> and the complex type constructors. See <xref href="impala_complex_types.xml#complex_types_ex_hive_etl"/> for
+ examples.
+ </note>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="complex_types_nesting">
+
+ <title>Using Complex Types as Nested Types</title>
+
+ <conbody>
+
+ <p>
+ The <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph> types can be the top-level types for <q>nested
+ type</q> columns. That is, each of these types can contain other complex or scalar types, with multiple levels of nesting to a
+ maximum depth of 100. For example, you can have an array of structures, a map containing other maps, a structure containing an
+ array of other structures, and so on. At the lowest level, there are always scalar types making up the fields of a
+ <codeph>STRUCT</codeph>, elements of an <codeph>ARRAY</codeph>, and keys and values of a <codeph>MAP</codeph>.
+ </p>
+
+ <p>
+ Schemas involving complex types typically use some level of nesting for the complex type columns.
+ </p>
+
+ <p>
+ For example, to model a relationship like a dimension table and a fact table, you typically use an <codeph>ARRAY</codeph> where
+ each array element is a <codeph>STRUCT</codeph>. The <codeph>STRUCT</codeph> fields represent what would traditionally be columns
+ in a separate joined table. It makes little sense to use a <codeph>STRUCT</codeph> as the top-level type for a column, because you
+ could just make the fields of the <codeph>STRUCT</codeph> into regular table columns.
+ </p>
+
+<!-- To do: this example might move somewhere else, under STRUCT itself or in a tips-and-tricks section. -->
+
+ <p>
+      Perhaps the only use case for a top-level <codeph>STRUCT</codeph> would be to allow <codeph>STRUCT</codeph> fields with the
+ same name as columns to coexist in the same table. The following example shows how a table could have a column named
+ <codeph>ID</codeph>, and two separate <codeph>STRUCT</codeph> fields also named <codeph>ID</codeph>. Because the
+ <codeph>STRUCT</codeph> fields are always referenced using qualified names, the identical <codeph>ID</codeph> names do not cause a
+ conflict.
+ </p>
+
+<codeblock><![CDATA[CREATE TABLE struct_namespaces
+(
+ id BIGINT
+ , s1 STRUCT < id: BIGINT, field1: STRING >
+ , s2 STRUCT < id: BIGINT, when_happened: TIMESTAMP >
+)
+STORED AS PARQUET;
+
+select id, s1.id, s2.id from struct_namespaces;
+]]>
+</codeblock>
+
+ <p>
+ It is common to make the value portion of each key-value pair in a <codeph>MAP</codeph> a <codeph>STRUCT</codeph>,
+ <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>, or other complex type variation. That way, each key in the <codeph>MAP</codeph>
+ can be associated with a flexible and extensible data structure. The key values are not predefined ahead of time (other than by
+      specifying their data type). Therefore, the <codeph>MAP</codeph> can accommodate a rapidly evolving schema, or sparse data
+ structures where each row contains only a few data values drawn from a large set of possible choices.
+ </p>
+
+ <p>
+ Although you can use an <codeph>ARRAY</codeph> of scalar values as the top-level column in a table, such a simple array is
+ typically of limited use for analytic queries. The only property of the array elements, aside from the element value, is the
+      ordering sequence available through the <codeph>POS</codeph> pseudocolumn. To record any additional detail about each array element,
+ such as a <codeph>TIMESTAMP</codeph> or a symbolic name, you use an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> rather than
+ of scalar values.
+ </p>
+
+ <p>
+ If you are considering having multiple <codeph>ARRAY</codeph> or <codeph>MAP</codeph> columns, with related items under the same
+ position in each <codeph>ARRAY</codeph> or the same key in each <codeph>MAP</codeph>, prefer to use a <codeph>STRUCT</codeph> to
+ group all the related items into a single <codeph>ARRAY</codeph> or <codeph>MAP</codeph>. Doing so avoids the additional storage
+ overhead and potential duplication of key values from having an extra complex type column. Also, because each
+ <codeph>ARRAY</codeph> or <codeph>MAP</codeph> that you reference in the query <codeph>SELECT</codeph> list requires an additional
+ join clause, minimizing the number of complex type columns also makes the query easier to read and maintain, relying more on dot
+ notation to refer to the relevant fields rather than a sequence of join clauses.
+ </p>
+
+ <p>
+ For example, here is a table with several complex type columns all at the top level and containing only scalar types. To retrieve
+ every data item for the row requires a separate join for each <codeph>ARRAY</codeph> or <codeph>MAP</codeph> column. The fields of
+ the <codeph>STRUCT</codeph> can be referenced using dot notation, but there is no real advantage to using the
+ <codeph>STRUCT</codeph> at the top level rather than just making separate columns <codeph>FIELD1</codeph> and
+ <codeph>FIELD2</codeph>.
+ </p>
+
+<codeblock><![CDATA[CREATE TABLE complex_types_top_level
+(
+ id BIGINT,
+ a1 ARRAY<INT>,
+ a2 ARRAY<STRING>,
+ s STRUCT<field1: INT, field2: STRING>,
+-- Numeric lookup key for a string value.
+ m1 MAP<INT,STRING>,
+-- String lookup key for a numeric value.
+ m2 MAP<STRING,INT>
+)
+STORED AS PARQUET;
+
+describe complex_types_top_level;
++------+-----------------+
+| name | type |
++------+-----------------+
+| id | bigint |
+| a1 | array<int> |
+| a2 | array<string> |
+| s | struct< |
+| | field1:int, |
+| | field2:string |
+| | > |
+| m1 | map<int,string> |
+| m2 | map<string,int> |
++------+-----------------+
+
+select
+ id,
+ a1.item,
+ a2.item,
+ s.field1,
+ s.field2,
+ m1.key,
+ m1.value,
+ m2.key,
+ m2.value
+from
+ complex_types_top_level,
+ complex_types_top_level.a1,
+ complex_types_top_level.a2,
+ complex_types_top_level.m1,
+ complex_types_top_level.m2;
+]]>
+</codeblock>
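+
+    <p>
+      Following the guideline above, if the elements of <codeph>A1</codeph> and <codeph>A2</codeph> are related by position, a
+      single <codeph>ARRAY</codeph> of a two-field <codeph>STRUCT</codeph> keeps the related values together and requires only
+      one join clause to retrieve both. (The table and field names in this sketch are illustrative.)
+    </p>
+
+<codeblock><![CDATA[CREATE TABLE complex_types_consolidated
+(
+  id BIGINT,
+-- One ARRAY of STRUCT replaces the two parallel arrays A1 and A2.
+  a ARRAY<STRUCT<field1: INT, field2: STRING>>
+)
+STORED AS PARQUET;
+
+select id, a.pos, a.field1, a.field2
+  from complex_types_consolidated, complex_types_consolidated.a;
+]]>
+</codeblock>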
+
+ <p>
+ For example, here is a table with columns containing an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>, a <codeph>MAP</codeph>
+ where each key value is a <codeph>STRUCT</codeph>, and a <codeph>MAP</codeph> where each key value is an <codeph>ARRAY</codeph> of
+ <codeph>STRUCT</codeph>.
+ </p>
+
+<codeblock><![CDATA[CREATE TABLE nesting_demo
+(
+ user_id BIGINT,
+ family_members ARRAY < STRUCT < name: STRING, email: STRING, date_joined: TIMESTAMP >>,
+ foo map < STRING, STRUCT < f1: INT, f2: INT, f3: TIMESTAMP, f4: BOOLEAN >>,
+ gameplay MAP < STRING , ARRAY < STRUCT <
+ name: STRING, highest: BIGINT, lives_used: INT, total_spent: DECIMAL(16,2)
+ >>>
+)
+STORED AS PARQUET;
+]]>
+</codeblock>
+
+ <p>
+      The <codeph>DESCRIBE</codeph> statement spreads the <codeph><</codeph> and <codeph>></codeph> separators and the field
+      names within each <codeph>STRUCT</codeph> across multiple lines for easy readability:
+ </p>
+
+<codeblock><![CDATA[DESCRIBE nesting_demo;
++----------------+-----------------------------+
+| name | type |
++----------------+-----------------------------+
+| user_id | bigint |
+| family_members | array<struct< |
+| | name:string, |
+| | email:string, |
+| | date_joined:timestamp |
+| | >>
<TRUNCATED>