Posted to commits@impala.apache.org by jr...@apache.org on 2016/10/31 05:53:38 UTC
[3/6] incubator-impala git commit: Add files that weren't needed
during initial build testing of SQL Reference.
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/bb88fdc0/docs/topics/impala_glossary.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_glossary.xml b/docs/topics/impala_glossary.xml
new file mode 100644
index 0000000..f713738
--- /dev/null
+++ b/docs/topics/impala_glossary.xml
@@ -0,0 +1,834 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="glossary">
+
+ <title>Impala Glossary</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="Administrators"/>
+ <data name="Category" value="Developers"/>
+ <data name="Category" value="Data Analysts"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <section id="glos_ddl">
+
+ <title>DDL</title>
+
+ <p>
+ A category of SQL statements involving administrative or metadata-related operations. Many of these
+ statements start with the keywords <codeph>CREATE</codeph>, <codeph>ALTER</codeph>, or
+ <codeph>DROP</codeph>. Contrast with DML.
+ </p>
+ </section>
+
+ <section id="glos_dml">
+
+ <title>DML</title>
+
+ <p>
+ A category of SQL statements that modify table data. Impala has a limited set of DML statements, primarily
+ the <codeph>INSERT</codeph> statement. Contrast with DDL.
+ </p>
+ </section>
+
+ <section id="glos_catalog">
+
+ <title>catalog</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_catalogd">
+
+ <title>catalogd</title>
+
+ <p>
+ A daemon.
+ </p>
+ </section>
+
+ <section id="glos_impalad">
+
+ <title>impalad</title>
+
+ <p>
+ A daemon.
+ </p>
+ </section>
+
+ <section id="glos_statestored">
+
+ <title>statestored</title>
+
+ <p>
+ A daemon.
+ </p>
+ </section>
+
+ <section id="glos_statestore">
+
+ <title>statestore</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_metastore">
+
+ <title>metastore</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_table">
+
+ <title>table</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_database">
+
+ <title>database</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_hive">
+
+ <title>Hive</title>
+
+ <p>
+ Its full trademarked name is Apache Hive.
+ </p>
+ </section>
+
+ <section id="glos_hbase">
+
+ <title>HBase</title>
+
+ <p>
+ Its full trademarked name is Apache HBase.
+ </p>
+ </section>
+
+ <section id="glos_coordinator_node">
+
+ <title>coordinator node</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_sql">
+
+ <title>SQL</title>
+
+ <p>
+ An acronym for Structured Query Language, the industry standard language for database operations.
+ Implemented by components such as Impala and Hive.
+ </p>
+ </section>
+
+ <section id="glos_avro">
+
+ <title>Avro</title>
+
+ <p>
+ A file format.
+ </p>
+ </section>
+
+ <section id="glos_text">
+
+ <title>text</title>
+
+ <p>
+ A file format.
+ </p>
+ </section>
+
+ <section id="glos_hdfs">
+
+ <title>HDFS</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_block">
+
+ <title>block</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_parquet">
+
+ <title>Parquet</title>
+
+ <p>
+ A file format.
+ </p>
+ </section>
+
+ <section id="glos_rcfile">
+
+ <title>RCFile</title>
+
+ <p>
+ A file format.
+ </p>
+ </section>
+
+ <section id="glos_sequencefile">
+
+ <title>SequenceFile</title>
+
+ <p>
+ A file format.
+ </p>
+ </section>
+
+ <section id="glos_snappy">
+
+ <title>Snappy</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_file_format">
+
+ <title>file format</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_compression">
+
+ <title>compression</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_codec">
+
+ <title>codec</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_lzo">
+
+ <title>LZO</title>
+
+ <p>
+ A compression codec.
+ </p>
+ </section>
+
+ <section id="glos_gzip">
+
+ <title>gzip</title>
+
+ <p>
+ A compression codec.
+ </p>
+ </section>
+
+ <section id="glos_bzip2">
+
+ <title>BZip2</title>
+
+ <p>
+ A compression codec.
+ </p>
+ </section>
+
+ <section id="glos_sentry">
+
+    <title>Sentry</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_authorization">
+
+ <title>authorization</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_authentication">
+
+ <title>authentication</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_delegation">
+
+ <title>delegation</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_skew">
+
+ <title>skew</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_profile">
+
+ <title>profile</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_query">
+
+ <title>query</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_external_table">
+
+ <title>external table</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_internal_table">
+
+ <title>internal table</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_view">
+
+ <title>view</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_function">
+
+ <title>function</title>
+
+ <p>
+ In Impala, a kind of database object that performs customized processing or business logic. The kinds of
+ functions include built-in functions, UDFs, and UDAs.
+ </p>
+ </section>
+
+ <section id="glos_udf">
+
+ <title>UDF</title>
+
+ <p>
+ Acronym for user-defined function. In Impala, a function that you write in C++ or Java. A UDF (as opposed
+ to a UDA) returns a single scalar value for each row evaluated in the result set. On occasion, UDF is used
+ in a broad context to include all kinds of user-written functions, including UDAs.
+ </p>
+ </section>
+
+ <section id="glos_uda">
+
+ <title>UDA</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_aggregate_function">
+
+ <title>aggregate function</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_built_in_function">
+
+ <title>built-in function</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_analytic_function">
+
+ <title>analytic function</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_admission_control">
+
+ <title>admission control</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_llama">
+
+ <title>llama</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_impala">
+
+ <title>Impala</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_table_statistics">
+
+ <title>table statistics</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_column_statistics">
+
+ <title>column statistics</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_statistics">
+
+ <title>statistics</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_incremental_statistics">
+
+ <title>incremental statistics</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_hdfs_caching">
+
+ <title>HDFS caching</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_join_query">
+
+ <title>join query</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_straight_join">
+
+ <title>straight join</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_inner_join">
+
+ <title>inner join</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_outer_join">
+
+ <title>outer join</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_left_join">
+
+ <title>left join</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_right_join">
+
+ <title>right join</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_equijoin">
+
+ <title>equijoin</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_antijoin">
+
+ <title>antijoin</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_semijoin">
+
+ <title>semijoin</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_cross_join">
+
+ <title>cross join</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_hint">
+
+ <title>hint</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_distinct">
+
+ <title>distinct</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_impala_shell">
+
+ <title>impala-shell</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_odbc">
+
+ <title>ODBC</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_jdbc">
+
+ <title>JDBC</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_data_types">
+
+ <title>data types</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_performance">
+
+ <title>performance</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_scalability">
+
+ <title>scalability</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_hue">
+
+ <title>Hue</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_partition">
+
+ <title>partition</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_partitioning">
+
+ <title>partitioning</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_partitioned_table">
+
+ <title>partitioned table</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_hdfs_trashcan">
+
+ <title>HDFS trashcan</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_explain_plan">
+
+ <title>EXPLAIN plan</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_summary">
+
+ <title>summary</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_executive_summary">
+
+ <title>executive summary</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_alias">
+
+ <title>alias</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_identifier">
+
+ <title>identifier</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_query_option">
+
+ <title>query option</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_daemon">
+
+ <title>daemon</title>
+
+ <p>
+ The daemons that make up the Impala service are <cmdname>impalad</cmdname>, <cmdname>statestored</cmdname>,
+ and <cmdname>catalogd</cmdname>.
+ </p>
+ </section>
+
+ <section id="impala_service">
+
+ <title>Impala service</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_session">
+
+ <title>session</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_schema">
+
+ <title>schema</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_schema_evolution">
+
+ <title>schema evolution</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_literal">
+
+ <title>literal</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_string">
+
+ <title>string</title>
+
+ <p>
+ String values are represented in Impala by the data types <codeph>STRING</codeph>,
+ <codeph>VARCHAR</codeph>, and <codeph>CHAR</codeph>.
+ </p>
+ </section>
+
+ <section id="glos_number">
+
+ <title>number</title>
+
+ <p>
+ Numeric values are represented in Impala by the data types <codeph>INT</codeph>, <codeph>BIGINT</codeph>,
+ <codeph>SMALLINT</codeph>, <codeph>TINYINT</codeph>, <codeph>DECIMAL</codeph>, <codeph>DOUBLE</codeph>, and
+ <codeph>FLOAT</codeph>.
+ </p>
+ </section>
+
+ <section id="glos_date_time">
+
+ <title>date/time</title>
+
+ <p>
+ Date and time values are represented in Impala by the <codeph>TIMESTAMP</codeph> data type. Sometimes, for
+ convenience or as part of a partitioning plan, dates are represented as formatted <codeph>STRING</codeph>
+ columns or as separate integer columns for year, month, and day.
+ </p>
+ </section>
+
+ <section id="glos_cancel">
+
+ <title>cancel</title>
+
+ <p>
+ Halting an Impala query before it is completely finished.
+ </p>
+ </section>
+
+ <section id="glos_streaming">
+
+ <title>streaming</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_sorting">
+
+ <title>sorting</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_aggregation">
+
+ <title>aggregation</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_fragment">
+
+ <title>fragment</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_source">
+
+ <title>source</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_sink">
+
+ <title>sink</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_thrift">
+
+ <title>Thrift</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_partition_pruning">
+
+ <title>partition pruning</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_heartbeat">
+
+ <title>heartbeat</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_subscriber">
+
+ <title>subscriber</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_topic">
+
+ <title>topic</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_broadcast_join">
+
+ <title>broadcast join</title>
+
+ <p></p>
+ </section>
+
+ <section id="glos_shuffle_join">
+
+ <title>shuffle join</title>
+
+ <p></p>
+ </section>
+
+<!-- Use as a template for new definitions:
+ <section id="glos_">
+ <title></title>
+ <p>
+ </p>
+ </section>
+
+-->
+ </conbody>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/bb88fdc0/docs/topics/impala_howto_rm.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_howto_rm.xml b/docs/topics/impala_howto_rm.xml
new file mode 100644
index 0000000..4c3facc
--- /dev/null
+++ b/docs/topics/impala_howto_rm.xml
@@ -0,0 +1,420 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="howto_impala_rm">
+
+ <title>How to Configure Resource Management for Impala</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="Admission Control"/>
+ <data name="Category" value="Resource Management"/>
+ </metadata>
+ </prolog>
+ <conbody>
+    <p>Impala includes features that balance and maximize resources in your CDH cluster. This topic describes
+      how to configure these features to improve cluster efficiency.</p>
+ <p outputclass="toc inpage">A typical deployment uses the following.</p>
+ <ul audience="HTML">
+ <li audience="HTML">Creating Static Service Pools</li>
+ <li audience="HTML">Using Admission Control <ul>
+ <li audience="HTML">Setting Per-query Memory Limits</li>
+ <li audience="HTML">Creating Dynamic Resource Pools</li>
+ </ul></li>
+
+ </ul>
+ </conbody>
+ <concept id="static_service_pools">
+ <title>Creating Static Service Pools</title>
+ <conbody>
+ <p>Use Static Service Pools to allocate dedicated resources for Impala and other services to allow
+ for predictable resource availability. </p>
+ <p>Static service pools isolate services from one another, so that high load on one service has
+ bounded impact on other services. You can use Cloudera Manager to configure static service pools
+      that control memory, CPU, and disk I/O.</p>
+ <p >The following screenshot shows a sample configuration for Static Service Pools in
+ Cloudera Manager:</p>
+ <draft-comment author="ddawson">
+ <p>Need accurate numbers - or can we remove services other than HDFS, Impala, and YARN? Matt
+ Jacobs is going to run these numbers by someone in the field.</p>
+ </draft-comment>
+ <p>
+ <image href="../images/howto_static_server_pools_config.png" placement="break"
+ id="PDF_33" align="center" scale="33" audience="HTML"/>
+ <image href="../images/howto_static_server_pools_config.png" placement="break"
+ id="HTML_SCALEFIT" align="center" scalefit="yes" audience="PDF"/>
+ </p>
+ <p>
+ <ul id="ul_tkw_4rs_pw">
+ <li >
+ <p >HDFS always needs to have a minimum of 5-10% of the resources.</p>
+ </li>
+ <li >
+ Generally, YARN and Impala split the rest of the resources.
+
+ <ul id="ul_ukw_4rs_pw">
+ <li >
+ <p >For mostly batch workloads, you might allocate YARN 60%, Impala 30%, and HDFS
+ 10%. </p>
+ </li>
+ <li >
+ <p >For mostly ad hoc query workloads, you might allocate Impala 60%, YARN 30%, and
+ HDFS 10%.</p>
+ </li>
+ </ul></li>
+ </ul>
+ </p>
+ </conbody>
+ </concept>
+ <concept id="enable_admission_control">
+ <title>Using Admission Control</title>
+ <conbody>
+ <p>Within the constraints of the static service pool, you can further subdivide Impala's
+ resources using Admission Control. You configure Impala Admission Control pools in the Cloudera
+ Manager Dynamic Resource Pools page.</p>
+ <p>You use Admission Control to divide usage between Dynamic Resource Pools in multitenant use
+ cases. Allocating resources judiciously allows your most important queries to run faster and
+ more reliably.</p>
+ <p>
+      <note>In this context, Impala Dynamic Resource Pools are different from the default YARN Dynamic
+ Resource Pools. You can turn on Dynamic Resource Pools that are exclusively for use by
+ Impala.</note>
+ </p>
+ <p>Admission Control is enabled by default.</p>
+ <p>A Dynamic Resource Pool has the following properties:<ul id="ul_blk_jjg_sw">
+          <li><b>Max Running Queries</b>: Maximum number of concurrently executing queries in the pool
+            before incoming queries are queued.</li>
+ <li><b>Max Memory Resources</b>: Maximum memory used by queries in the pool before incoming
+ queries are queued. This value is used at the time of admission and is not enforced at query
+ runtime.</li>
+ <li><b>Default Query Memory Limit</b>: Defines the maximum amount of memory a query can
+ allocate on each node. This is enforced at runtime. If the query attempts to use more memory,
+ it is forced to spill, if possible. Otherwise, it is cancelled. The total memory that can be
+ used by a query is the <codeph>MEM_LIMIT</codeph> times the number of nodes.</li>
+ <li><b>Max Queued Queries</b>: Maximum number of queries that can be queued in the pool before
+ additional queries are rejected.</li>
+ <li><b>Queue Timeout</b>: Specifies how long queries can wait in the queue before they are
+ cancelled with a timeout error.</li>
+ </ul></p>
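The way these pool properties interact at admission time can be sketched as follows. This is a simplified illustration, not Impala's actual admission-control code; the function name and structure are hypothetical, but the arithmetic (per-node memory limit times number of nodes, checked against the pool's Max Memory) follows the property descriptions above.

```python
# Simplified sketch of how the pool settings above interact at admission
# time. Hypothetical helper; not Impala's actual admission-control logic.

def admit(query_mem_limit_mib, num_nodes, pool_mem_used_mib,
          pool_max_mem_mib, running, max_running, queued, max_queued):
    """Return 'run', 'queue', or 'reject' for an incoming query."""
    # Total memory reserved for the query = per-node limit * number of nodes.
    needed = query_mem_limit_mib * num_nodes
    if running < max_running and pool_mem_used_mib + needed <= pool_max_mem_mib:
        return "run"
    if queued < max_queued:
        return "queue"  # wait until memory or a query slot frees up
    return "reject"    # queue is full; reject with an error

# A 3-node pool with 48000 MiB Max Memory and a 1600 MiB per-node limit
# admits up to 10 queries (10 * 1600 * 3 = 48000 MiB).
print(admit(1600, 3, 43200, 48000, 9, 10, 0, 200))   # 10th query fits: run
print(admit(1600, 3, 48000, 48000, 10, 10, 0, 200))  # 11th query: queue
```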
+ </conbody>
+ </concept>
+ <concept id="set_per_query_memory_limits">
+ <title>Setting Per-query Memory Limits</title>
+ <conbody>
+ <p>Use per-query memory limits to prevent queries from consuming excessive memory resources that
+ impact other queries. Cloudera recommends that you set the query memory limits whenever
+ possible.</p>
+ <p>If you set the <b>Pool Max Mem Resources</b> for a resource pool, Impala attempts to throttle
+ queries if there is not enough memory to run them within the specified resources.</p>
+      <p>Only use admission control with maximum memory resources if you can ensure queries have
+        memory limits. Set the pool <b>Default Query Memory Limit</b> to be certain. You can override
+        this setting with the <codeph>MEM_LIMIT</codeph> query option, if necessary.</p>
+      <p>Typically, you set query memory limits using the <codeph>SET MEM_LIMIT=Xg;</codeph> query
+ option. When you find the right value for your business case, memory-based admission control
+ works well. The potential downside is that queries that attempt to use more memory might perform
+ poorly or even be cancelled.</p>
+ <p>To find a reasonable default query memory limit:<ol id="ol_ydt_xhy_pw">
+ <li>Run the workload.</li>
+ <li>In Cloudera Manager, go to <menucascade>
+ <uicontrol>Impala</uicontrol>
+ <uicontrol>Queries</uicontrol>
+ </menucascade>.</li>
+ <li>Click <uicontrol>Select Attributes</uicontrol>.</li>
+ <li>Select <uicontrol>Per Node Peak Memory Usage</uicontrol> and click
+ <uicontrol>Update</uicontrol>.</li>
+ <li>Allow the system time to gather information, then click the <uicontrol>Show
+ Histogram</uicontrol> icon to see the results.<image placement="break"
+ href="../images/howto_show_histogram.png" align="center" id="image_hmv_xky_pw"/></li>
+ <li>Use the histogram to find a value that accounts for most queries. Queries that require more
+ resources than this limit should explicitly set the memory limit to ensure they can run to
+ completion.<draft-comment author="ddawson">This chart uses bad sample data - we will change
+ the chart when we have real numbers from the sample use case.</draft-comment><image
+ placement="break" href="../images/howto_per_node_peak_memory_usage.png" align="center"
+ id="image_ehn_hly_pw" scalefit="yes"/></li>
+ </ol></p>
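The last step, picking a value from the histogram that accounts for most queries, amounts to choosing a high percentile of the observed per-node peak memory values. A rough sketch of that selection, using hypothetical sample numbers rather than real workload data:

```python
# Picking a default query memory limit that covers ~90% of queries.
# The peak values below are hypothetical, standing in for the histogram data.
peaks_mib = [90, 120, 300, 450, 800, 1200, 1500, 2100, 2400, 9800]

peaks_mib.sort()
# Choose a value that covers about 90% of queries; the rare outliers
# (like the 9800 MiB query) should set their own memory limit explicitly.
idx = int(len(peaks_mib) * 0.9) - 1
limit = peaks_mib[idx]
print(limit)  # 2400
```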
+ </conbody>
+ </concept>
+ <concept id="concept_en4_3sy_pw">
+ <title>Creating Dynamic Resource Pools</title>
+ <conbody>
+ <p>A dynamic resource pool is a named configuration of resources and a policy for scheduling the
+ resources among Impala queries running in the pool. Dynamic resource pools allow you to schedule
+ and allocate resources to Impala queries based on a user's access to specific pools and the
+ resources available to those pools. </p>
+      <p>This example creates production and development resource pools (queues). It assumes you
+        have three worker nodes with 24 GiB of RAM each, for an aggregate memory of roughly 72000 MiB. This pool
+        configuration allocates the Production queue twice the memory resources of the Development
+        queue, and a higher number of concurrent queries.</p>
+ <p>To create a Production dynamic resource pool for Impala:</p>
+ <ol>
+ <li>In Cloudera Manager, select <menucascade>
+ <uicontrol>Clusters</uicontrol>
+ <uicontrol>Dynamic Resource Pool Configuration</uicontrol>
+ </menucascade>.</li>
+ <li>Click the <uicontrol>Impala Admission Control</uicontrol> tab.</li>
+ <li>Click <b>Create Resource Pool</b>.</li>
+ <li>Specify a name and resource limits for the Production pool:<ul id="ul_rjt_wqv_2v">
+ <li>In the <b>Resource Pool Name</b> field, enter <userinput>Production</userinput>.</li>
+ <li>In the <uicontrol>Max Memory</uicontrol> field, enter <userinput>48000</userinput>.</li>
+ <li>In the <uicontrol>Default Query Memory Limit</uicontrol> field, enter
+ <userinput>1600</userinput>.</li>
+ <li>In the <uicontrol>Max Running Queries</uicontrol> field, enter
+ <userinput>10</userinput>.</li>
+ <li>In the <uicontrol>Max Queued Queries</uicontrol> field, enter
+ <userinput>200</userinput>.</li>
+ </ul></li>
+ <li>Click <uicontrol>Create</uicontrol>.</li>
+ <li>Click <uicontrol>Refresh Dynamic Resource Pools</uicontrol>.</li>
+ </ol>
+ <p>The Production queue runs up to 10 queries at once. If the total memory requested
+ by these queries exceeds 48000 MiB, it holds the next query in the queue until the memory is
+ released. It also prevents a query from running if it needs more memory than is currently
+ available. Admission Control holds the next query if either Max Running Queries is reached, or
+ the pool Max Memory limit is reached.</p>
+      <p>Here, the Max Memory resources and the Default Query Memory Limit together throttle the pool to 10 queries,
+        so setting Max Running Queries might not be necessary, though it does not hurt to do so. Most
+        users set Max Running Queries when they cannot pick good numbers for memory. Because users can
+        override the <varname>MEM_LIMIT</varname> query option, setting the Max Running Queries property
+        provides a useful safeguard.</p>
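The claim that the memory settings alone throttle the Production pool to 10 queries can be checked with the example's numbers (3 worker nodes, 48000 MiB pool Max Memory, 1600 MiB Default Query Memory Limit):

```python
# Why Max Memory and Default Query Memory Limit already cap the Production
# pool at 10 concurrent queries, given the example's 3-node cluster.
pool_max_memory_mib = 48000
default_query_mem_limit_mib = 1600
num_nodes = 3

# Each admitted query reserves its per-node limit on every node.
per_query_mib = default_query_mem_limit_mib * num_nodes  # 4800 MiB
concurrent = pool_max_memory_mib // per_query_mib
print(concurrent)  # 10
```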
+ <p>To create a Development dynamic resource pool for Impala:</p>
+
+ <ol>
+ <li>In Cloudera Manager, select <menucascade>
+ <uicontrol>Clusters</uicontrol>
+ <uicontrol>Dynamic Resource Pool Configuration</uicontrol>
+ </menucascade>.</li>
+ <li>Click the <uicontrol>Impala Admission Control</uicontrol> tab.</li>
+ <li>Click <b>Create Resource Pool</b>.</li>
+ <li>Specify a name and resource limits for the Development pool:<ul id="ul_j42_q3z_pw">
+ <li>In the <b>Resource Pool Name</b> field, enter <userinput>Development</userinput>.</li>
+ <li>In the <uicontrol>Max Memory</uicontrol> field, enter <userinput>24000</userinput>.</li>
+ <li>In the <uicontrol>Default Query Memory Limit</uicontrol> field, enter
+ <userinput>8000</userinput>.</li>
+          <li>In the <uicontrol>Max Running Queries</uicontrol> field, enter <userinput>1</userinput>.</li>
+          <li>In the <uicontrol>Max Queued Queries</uicontrol> field, enter <userinput>100</userinput>.</li>
+ </ul></li>
+ <li>Click <uicontrol>Create</uicontrol>.</li>
+ <li>Click <uicontrol>Refresh Dynamic Resource Pools</uicontrol>.<p>The Development queue runs
+ one query at a time. If the total memory required by the query exceeds 24000 MiB, it holds the
+ query until memory is released.</p></li>
+ </ol>
+ </conbody>
+ <concept id="setting_placement_rules">
+ <title>Understanding Placement Rules</title>
+ <conbody>
+        <p>Placement rules determine how queries are mapped to resource pools. The standard behavior is
+          to use the pool specified at query time; otherwise, the default pool is used.</p>
+ <p>For example, you can use the SET statement to select the pool in which to run a
+ query.<codeblock>SET REQUEST_POOL=Production;</codeblock></p>
+ <p>If you do not use a <codeph>SET</codeph> statement, queries are run in the default pool.</p>
+ </conbody>
+ </concept>
+ <concept id="setting_access_control_on_pools">
+ <title>Setting Access Control on Pools</title>
+ <conbody>
+        <p>You can specify that only certain users and groups are allowed to use the pools you define.</p>
+        <p>To control access to the Production pool:</p>
+ <ol>
+ <li>In Cloudera Manager, select <menucascade>
+ <uicontrol>Clusters</uicontrol>
+ <uicontrol>Dynamic Resource Pool Configuration</uicontrol>
+ </menucascade>.</li>
+ <li>Click the <uicontrol>Impala Admission Control</uicontrol> tab.</li>
+ <li>Click the <uicontrol>Edit</uicontrol> button for the Production pool.</li>
+          <li>Click the <uicontrol>Submission Access Control</uicontrol> tab.</li>
+ <li>Select <uicontrol>Allow these users and groups to submit to this pool</uicontrol>.</li>
+ <li>Enter a comma-separated list of users who can use the pool.
+ <image placement="break"
+ href="../images/howto_access_control.png" align="center" scalefit="yes"/>
+ </li>
+
+ <li>Click <uicontrol>Save</uicontrol>.</li>
+ </ol>
+ </conbody>
+ </concept>
+ </concept>
+<concept id="impala_resource_management_example">
+ <title>Impala Resource Management Example</title>
+ <conbody>
+ <p>Anne Chang is administrator for an enterprise data hub that runs a number of workloads,
+ including Impala. </p>
+ <p>Anne has a 20-node cluster that uses Cloudera Manager static partitioning. Because of the
+ heavy Impala workload, Anne needs to make sure Impala gets enough resources. While the best
+ configuration values might not be known in advance, she decides to start by allocating 50% of
+ resources to Impala. Each node has 128 GiB dedicated to each impalad. Impala has 2560 GiB in
+ aggregate that can be shared across the resource pools she creates.</p>
+ <p>Next, Anne studies the workload in more detail. After some research, she might choose to
+ revisit these initial values for static partitioning. </p>
+    <p>To figure out how to further allocate Impala's resources, Anne needs to consider the workloads
+ and users, and determine their requirements. There are a few main sources of Impala queries: <ul
+ id="ul_ml3_sf2_5w">
+ <li>Large reporting queries executed by an external process/tool. These are critical business
+ intelligence queries that are important for business decisions. It is important that they get
+ the resources they need to run. There typically are not many of these queries at a given
+ time.</li>
+ <li>Frequent, small queries generated by a web UI. These queries scan a limited amount of data
+        and do not require expensive joins or aggregations. These queries are important, but not as
+        critical; if one fails, the client can resend the query or the end user can refresh the
+        page.</li>
+ <li>Occasionally, expert users might run ad-hoc queries. The queries can vary significantly in
+ their resource requirements. While Anne wants a good experience for these users, it is hard to
+ control what they do (for example, submitting inefficient or incorrect queries by mistake).
+ Anne restricts these queries by default and tells users to reach out to her if they need more
+ resources. </li>
+ </ul></p>
+ <p>To set up admission control for this workload, Anne first runs the workloads independently, so
+      that she can observe the workload's resource usage in Cloudera Manager. If they could not easily
+ be run manually, but had been run in the past, Anne uses the history information from Cloudera
+ Manager. It can be helpful to use other search criteria (for example, <i>user</i>) to isolate
+ queries by workload. Anne uses the Cloudera Manager chart for Per-Node Peak Memory usage to
+ identify the maximum memory requirements for the queries. </p>
+ <p>From this data, Anne observes the following about the queries in the groups above:<ul
+ id="ul_amq_ng2_5w">
+ <li> Large reporting queries use up to 32 GiB per node. There are typically 1 or 2 queries
+ running at a time. On one occasion, she observed that 3 of these queries were running
+ concurrently. Queries can take 3 minutes to complete.</li>
+        <li>Web UI-generated queries typically use between 100 MiB and 4 GiB of memory per node, but
+          occasionally as much as 10 GiB per node. Queries take, on average, 5 seconds,
+ and there can be as many as 140 incoming queries per minute.</li>
+ <li>Anne has little data on ad hoc queries, but some are trivial (approximately 100 MiB per
+ node), others join several tables (requiring a few GiB per node), and one user submitted a
+ huge cross join of all tables that used all system resources (that was likely a mistake).</li>
+ </ul></p>
+ <p>Based on these observations, Anne creates the admission control configuration with the
+ following pools: </p>
+ <section id="section_yjc_h32_5w">
+ <title>XL_Reporting</title>
+ <p>
+ <table frame="all" rowsep="1" colsep="1" id="XL_Reporting_Table">
+ <tgroup cols="2">
+ <colspec colname="c1" colnum="1" colwidth="1.0*"/>
+ <colspec colname="c2" colnum="2" colwidth="1.0*"/>
+ <thead>
+ <row>
+ <entry>Property</entry>
+ <entry>Value</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>Max Memory</entry>
+ <entry>1280 GiB</entry>
+ </row>
+ <row>
+ <entry>Default Query Memory Limit</entry>
+ <entry>32 GiB</entry>
+ </row>
+ <row>
+ <entry>Max Running Queries</entry>
+ <entry>2</entry>
+ </row>
+ <row>
+ <entry>Queue Timeout</entry>
+ <entry>5 minutes</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </p>
+ <p>This pool is for large reporting queries. In order to support running 2 queries at a time,
+ the pool memory resources are set to 1280 GiB (aggregate cluster memory). This is for 2
+        queries, each with 32 GiB per node, across 20 nodes. Anne sets the pool's Default Query Memory
+ Limit to 32 GiB so that no query uses more than 32 GiB on any given node. She sets Max Running
+        Queries to 2 (though it is not strictly necessary that she do so). She increases the pool's queue timeout to
+        5 minutes in case a third query comes in and has to wait. She does not expect more than 3
+        concurrent queries, and she does not want them to wait longer than that anyway, so she does not
+        increase the queue timeout further. If the workload increases in the future, she might choose to adjust
+        the configuration or buy more hardware.</p>
+ </section>
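The XL_Reporting sizing follows directly from the observed workload: 2 concurrent queries, each limited to 32 GiB per node, across Anne's 20 nodes.

```python
# Checking the XL_Reporting pool's Max Memory figure.
queries, per_node_gib, nodes = 2, 32, 20
pool_max_memory_gib = queries * per_node_gib * nodes
print(pool_max_memory_gib)  # 1280
```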
+ <section id="section_xm3_j32_5w"><title>HighThroughput_UI</title>
+ <p>
+ <table frame="all" rowsep="1" colsep="1" id="High_Throughput_UI_Table">
+ <tgroup cols="2">
+ <colspec colname="c1" colnum="1" colwidth="1.0*"/>
+ <colspec colname="c2" colnum="2" colwidth="1.0*"/>
+ <thead>
+ <row>
+ <entry>Property</entry>
+ <entry>Value</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>Max Memory</entry>
+ <entry>960 GiB (inferred)</entry>
+ </row>
+ <row>
+ <entry>Default Query Memory Limit</entry>
+ <entry>4 GiB</entry>
+ </row>
+ <row>
+ <entry>Max Running Queries</entry>
+ <entry>12</entry>
+ </row>
+ <row>
+ <entry>Queue Timeout</entry>
+ <entry>5 minutes</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </p>
+ <p>This pool is used for the small, high throughput queries generated by the web tool. Anne sets
+ the Default Query Memory Limit to 4 GiB per node, and sets Max Running Queries to 12. This
+ implies a maximum amount of memory per node used by the queries in this pool: 48 GiB per node
+        (12 queries * 4 GiB per node memory limit).</p><p>Notice that Anne does not set the pool memory resources, but does set the pool's Default Query
+ Memory Limit. This is intentional: admission control processes queries faster when a pool uses
+        the Max Running Queries limit instead of the peak memory resources.</p><p>This should be enough memory for most queries, since only a few go over 4 GiB per node. Queries
+        that do require more memory can probably still complete with less memory (spilling if
+        necessary). If, on occasion, a query cannot run within this limit and fails, Anne might
+        reconsider this configuration later, or perhaps she does not need to worry about a few rare
+ failures from this web UI.</p><p>With regard to throughput, since these queries take around 5 seconds and she is allowing 12
+ concurrent queries, the pool should be able to handle approximately 144 queries per minute,
+ which is enough for the peak maximum expected of 140 queries per minute. In case there is a
+ large burst of queries, Anne wants them to queue. The default maximum size of the queue is
+ already 200, which should be more than large enough. Anne does not need to change it.</p></section>
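The per-node memory ceiling and the throughput estimate for HighThroughput_UI both follow from the table's values:

```python
# HighThroughput_UI sizing checks: per-node memory ceiling and throughput.
max_running, per_node_gib = 12, 4
avg_query_seconds = 5

# Worst-case memory used by this pool on any one node.
peak_per_node_gib = max_running * per_node_gib         # 48 GiB per node
# 12 concurrent slots, each turning over every ~5 seconds.
queries_per_minute = max_running * (60 / avg_query_seconds)
print(peak_per_node_gib, queries_per_minute)  # 48 144.0
```

The 144 queries per minute comfortably covers the observed peak of 140.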
+ <section id="section_asm_yj2_5w"><title>Default</title>
+ <p>
+ <table frame="all" rowsep="1" colsep="1" id="default_table">
+ <tgroup cols="2">
+ <colspec colname="c1" colnum="1" colwidth="1.0*"/>
+ <colspec colname="c2" colnum="2" colwidth="1.0*"/>
+ <thead>
+ <row>
+ <entry>Property</entry>
+ <entry>Value</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>Max Memory</entry>
+ <entry>320 GiB</entry>
+ </row>
+ <row>
+ <entry>Default Query Memory Limit</entry>
+ <entry>4 GiB</entry>
+ </row>
+ <row>
+ <entry>Max Running Queries</entry>
+ <entry>Unlimited</entry>
+ </row>
+ <row>
+ <entry>Queue Timeout</entry>
+ <entry>60 Seconds</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+      </p><p>The default pool (which already exists) is a catch-all for ad hoc queries. Anne wants to use the
+        remaining memory not used by the first two pools, 16 GiB per node (XL_Reporting uses 64 GiB per
+        node, HighThroughput_UI uses 48 GiB per node). For the other pools to get the resources they
+ expect, she must still set the Max Memory resources and the Default Query Memory Limit. She
+ sets the Max Memory resources to 320 GiB (16 * 20). She sets the Default Query Memory Limit to
+ 4 GiB per node for now. That is somewhat arbitrary, but satisfies some of the ad hoc queries
+ she observed. If someone writes a bad query by mistake, she does not actually want it using all
+ the system resources. If a user has a large query to submit, an expert user can override the
+ Default Query Memory Limit (up to 16 GiB per node, since that is bound by the pool Max Memory
+        resources). If that is still insufficient for this user's workload, the user should work with
+ Anne to adjust the settings and perhaps create a dedicated pool for the workload.</p></section>
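The Default pool's numbers, and the fact that the three pools exactly exhaust Impala's budget, can be verified from the figures in the example:

```python
# Default pool sizing and overall budget check for Anne's 20-node cluster
# (128 GiB per node reserved for Impala, 2560 GiB in aggregate).
nodes, per_node_total_gib = 20, 128
xl_per_node, ui_per_node = 64, 48  # per-node usage of the two pools above

default_per_node = per_node_total_gib - xl_per_node - ui_per_node
default_pool_max_gib = default_per_node * nodes
print(default_per_node, default_pool_max_gib)  # 16 320

# The three pools together account for the full 2560 GiB aggregate.
assert 1280 + 960 + default_pool_max_gib == nodes * per_node_total_gib
```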
+ </conbody>
+</concept>
+</concept>