Posted to commits@impala.apache.org by ar...@apache.org on 2019/07/03 20:24:18 UTC
[impala] 01/02: IMPALA-8519: [DOCS] Doc the limitation in insert events from SparkSQL
This is an automated email from the ASF dual-hosted git repository.
arodoni pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git
commit 7bbf8344e174ef0ff948c6aa9e55e1bd91348f79
Author: Alex Rodoni <ar...@cloudera.com>
AuthorDate: Mon Jul 1 18:06:02 2019 -0700
IMPALA-8519: [DOCS] Doc the limitation in insert events from SparkSQL
- Also made a few formatting changes.
- Removed the Preview Release note for Invalidation of Metadata cache.
Change-Id: I36cfc7e592ed2588a8c1f8375033d60492b27a4f
Reviewed-on: http://gerrit.cloudera.org:8080/13777
Reviewed-by: Vihang Karajgaonkar <vi...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
docs/topics/impala_metadata.xml | 102 ++++++++++++++++++++++++++--------------
1 file changed, 68 insertions(+), 34 deletions(-)
diff --git a/docs/topics/impala_metadata.xml b/docs/topics/impala_metadata.xml
index 4139e80..98cb6fd 100644
--- a/docs/topics/impala_metadata.xml
+++ b/docs/topics/impala_metadata.xml
@@ -44,34 +44,54 @@ under the License.
<concept id="auto_invalidate_metadata">
- <title>Startup Options for Automatic Invalidation of Metadata</title>
+ <title>Automatic Invalidation of Metadata Cache</title>
<conbody>
<p>
To keep the size of metadata bounded, <codeph>catalogd</codeph> periodically scans all
the tables and invalidates those not recently used. There are two types of
- configurations in <codeph>catalogd</codeph>.
+ configurations for <codeph>catalogd</codeph> and <codeph>impalad</codeph>.
</p>
- <ul>
- <li>
- Time-based invalidation with the
- <codeph>‑‑invalidate_tables_timeout_s</codeph> flag:
- <codeph>Catalogd</codeph> invalidates tables that are not recently used in the
- specified time period (in seconds). This flag needs to be applied to both
- <codeph>impalad</codeph> and <codeph>catalogd</codeph>.
- </li>
+ <dl>
+ <dlentry>
- <li>
- Memory-based invalidation with the
- <codeph>‑‑invalidate_tables_on_memory_pressure</codeph> flag: When the
- memory pressure reaches 60% of JVM heap size after a Java garbage collection in
- <codeph>catalogd</codeph>, Impala invalidates 10% of the least recently used tables.
- This flag needs to be applied to both <codeph>impalad</codeph> and
- <codeph>catalogd</codeph>.
- </li>
- </ul>
+ <dt>
+ Time-based cache invalidation
+ </dt>
+
+ <dd>
+ <codeph>Catalogd</codeph> invalidates tables that are not recently used in the
+ specified time period (in seconds).
+ </dd>
+
+ <dd>
+ The <codeph>‑‑invalidate_tables_timeout_s</codeph> flag needs to be
+ applied to both <codeph>impalad</codeph> and <codeph>catalogd</codeph>.
+ </dd>
+
+ </dlentry>
+
+ <dlentry>
+
+ <dt>
+ Memory-based cache invalidation
+ </dt>
+
+ <dd>
+ When the memory pressure reaches 60% of JVM heap size after a Java garbage
+ collection in <codeph>catalogd</codeph>, Impala invalidates 10% of the least
+ recently used tables.
+ </dd>
+
+ <dd>
+ The <codeph>‑‑invalidate_tables_on_memory_pressure</codeph> flag needs
+ to be applied to both <codeph>impalad</codeph> and <codeph>catalogd</codeph>.
+ </dd>
+
+ </dlentry>
+ </dl>
<p>
Automatic invalidation of metadata provides more stability with lower chances of running
@@ -79,23 +99,28 @@ under the License.
require tuning.
</p>
- <note>
- This is a preview feature in Impala 3.1 and not generally available.
- </note>
-
</conbody>
</concept>
<concept id="auto_poll_hms_notification">
- <title>Automatic Metadata Sync using Hive Metastore Notification Events</title>
+ <title>Automatic Invalidation/Refresh of Metadata</title>
<conbody>
<p>
- When this feature is enabled, <codeph>catalogd</codeph> polls Hive Metastore (HMS)
- notification events at a configurable interval and processes the following changes:
+ When tools such as Hive and Spark are used to process the raw data ingested into Hive
+ tables, new HMS metadata (database, tables, partitions) and filesystem metadata (new
+ files in existing partitions/tables) is generated. In previous versions of Impala, in
+ order to pick up this new information, Impala users needed to manually issue an
+ <codeph>INVALIDATE</codeph> or <codeph>REFRESH</codeph> command.
+ </p>
+
+ <p>
+ When automatic invalidate/refresh of metadata is enabled, <codeph>catalogd</codeph>
+ polls Hive Metastore (HMS) notification events at a configurable interval and processes
+ the following changes:
</p>
<note>
@@ -109,8 +134,8 @@ under the License.
</li>
<li>
- Refreshes the table when it receives the <codeph>ALTER</codeph>, <codeph>ADD</codeph>,
- or <codeph>DROP</codeph> its partitions.
+ Refreshes the partition when it receives an <codeph>ALTER</codeph>,
+ <codeph>ADD</codeph>, or <codeph>DROP</codeph> partition event.
</li>
<li>
@@ -176,11 +201,6 @@ under the License.
<ul>
<li>
- The operations that do not generate events in HMS, such as adding new data to existing
- tables/partitions from Spark, are not supported.
- </li>
-
- <li>
When you bypass HMS and add or remove data into table by adding files directly on the
filesystem, HMS does not generate the <codeph>INSERT</codeph> event, and the event
processor will not invalidate the corresponding table or refresh the corresponding
@@ -191,6 +211,12 @@ under the License.
<codeph>LOAD</codeph> command.
</p>
</li>
+
+ <li>
+ The Spark APIs that save data to a specified location do not generate events in
+ HMS and are therefore not supported. For example:
+<codeblock>Seq((1, 2)).toDF("i", "j").write.save("/user/hive/warehouse/spark_etl.db/customers/date=01012019")</codeblock>
+ </li>
</ul>
<p>
@@ -236,7 +262,15 @@ under the License.
</li>
<li>
- Restart the HiveServer2 and Hive Metastore services.
+ If applicable, set the <codeph>hive.metastore.dml.events</codeph> configuration key
+ to <codeph>true</codeph> in <codeph>hive-site.xml</codeph> used by the Spark
+ applications (typically, <codeph>/etc/hive/conf/hive-site.xml</codeph>) so that the
+ <codeph>INSERT</codeph> events are generated when the Spark application inserts data
+ into existing tables and partitions.
+ </li>
+
+ <li>
+ Restart the HiveServer2, Hive Metastore, and Spark (if applicable) services.
</li>
</ol>