You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/06/08 23:40:42 UTC
[Hadoop Wiki] Trivial Update of "Hive/ViewDev" by CarlSteinbach

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/ViewDev" page has been changed by CarlSteinbach.
http://wiki.apache.org/hadoop/Hive/ViewDev?action=diff&rev1=14&rev2=15

--------------------------------------------------

+ = Hive Views =
+ 
+ <<TableOfContents>>
+ 
- = Use Cases =
+ == Use Cases ==
  
  Views (http://issues.apache.org/jira/browse/HIVE-972) are a standard DBMS feature and their uses are well understood.  A typical use case might be to create an interface layer with a consistent entity/attribute naming scheme on top of an existing set of inconsistently named tables, without having to cause disruption due to direct modification of the tables.  More advanced use cases would involve predefined filters, joins, aggregations, etc for simplifying query construction by end users, as well as sharing common definitions within ETL pipelines.
  
- = Scope =
+ == Scope ==
  
  At a minimum, we want to 
  
@@ -17, +21 @@

  
   * expose metadata about view definitions and dependencies (at table-level or column-level) in a way that makes them consumable by metadata-driven tools
  
- = Syntax =
+ == Syntax ==
  
  {{{
  CREATE VIEW [IF NOT EXISTS] view_name [ (column_name [COMMENT column_comment], ...) ]
@@ -27, +31 @@

  DROP VIEW view_name
  }}}
  
- = Implementation Sketch =
+ == Implementation Sketch ==
  
  The basics of view implementation are very easy due to the fact that Hive already supports subselects in the FROM clause.
  
   * For '''CREATE VIEW v AS view-def-select''', we extend SemanticAnalyzer to behave similarly to '''CREATE TABLE t AS select''', except that we don't actually execute the query (we stop after plan generation).  It's necessary to perform all of plan generation (even though we're not actually going to execute the plan) since currently some validations such as type compatibility-checking are only performed during plan generation.  After successful validation, the text of the view is saved in the metastore (the simplest approach snips out the text from the parser's token stream, but this approach introduces problems described in the issues section below).
   * For '''select ... from view-reference''', we detect the view reference in SemanticAnalyzer.getMetaData, load the text of its definition from the metastore, parse it back into an AST, prepare a QBExpr to hold it, and then plug this into the referencing query's QB, resulting in a tree equivalent to '''select ... from (view-def-select)'''; plan generation can then be carried out on the combined tree.
  
- = Issues =
+ == Issues ==
  
  Some of these are related to functionality/scope; others are related to implementation approaches.  Opinions are welcome on all of them.
  
- == Stored View Definition ==
+ === Stored View Definition ===
  
  In SQL:200n, a view definition is supposed to be frozen at the time it is created, so that if the view is defined as select * from t, where t is a table with two columns a and b, then later requests to select * from the view should return just columns a and b, even if a new column c is later added to the table.  This is implemented correctly by most DBMS products.
  
@@ -52, +56 @@

  
  This approach will break if we ever need to perform more drastic (AST-based) rewrites as part of view expansion in the future.
  
- == Metastore Modeling ==
+ === Metastore Modeling ===
  
  The metastore model will need to be augmented in order to allow view definitions to be saved.  An important issue to be resolved is whether to model this via inheritance, or just shoehorn views in as a special kind of table.
  
@@ -81, +85 @@

  
  '''Update 20-Jan-2010''':  After further discussion with Prasad, we decided to put the view definition on the table object instead; for details, see discussion in [[https://issues.apache.org/jira/browse/HIVE-972|HIVE-972]].  Also, per [[https://issues.apache.org/jira/browse/HIVE-1068|HIVE-1068]], we added an attribute to store the type (view, managed table, external table) for each table descriptor.
  
- == Dependency Tracking ==
+ === Dependency Tracking ===
  
  It's necessary to track dependencies from a view to objects it references in the metastore:
  
@@ -101, +105 @@

  
  '''Update 30-Dec-2009''':  Based on a design review meeting, we'll start with the bare-minimum MySQL approach (with no metastore support for dependency tracking), then if time allows, add dependency analysis and storage, followed by CASCADE support.  See HIVE-1073 and HIVE-1074.
  
- == Dependency Invalidation ==
+ === Dependency Invalidation ===
  
  What happens when an object is modified underneath a view?  For example, suppose a view references a table's column, and then ALTER TABLE is used to drop or replace that column.  Note that if the column's datatype changes, the view definition may remain meaningful, but the view's schema may need to be updated to match.  Here are two possible options:
  
@@ -112, +116 @@

  
  '''Update 30-Dec-2009''':  Based on a design review meeting, we'll start with the lenient approach, without any support for marking objects invalid in the metastore, then if time allows, follow up with strict support and possibly metastore support for tracking object validity.  See HIVE-1077.
  
- == View Modification ==
+ === View Modification ===
  
  In SQL:200n, there's no standard way to update a view definition.  MySQL supports both
  
@@ -123, +127 @@

  
  '''Update 30-Dec-2009''':  Based on a design review meeting, we'll start with an Oracle-style ALTER VIEW v RECOMPILE, which can be used to revalidate a view definition, as well as to re-expand the original definition for clauses such as select *.  Then if time allows, we'll follow up with CREATE OR REPLACE VIEW support.  (The latter is less important since we're going with the lenient invalidation model, making DROP and re-CREATE possible without having to deal with downstream dependencies.)  See HIVE-1077 and HIVE-1078.
  
- == Fast Path Execution ==
+ === Fast Path Execution ===
  
  For '''select * from t''', hive supports fast-path execution (skipping Map/Reduce).  Is it important for this to work for '''select * from v''' as well?
  
  '''Update 30-Dec-2009''':  Based on feedback in JIRA, we'll leave this as dependent on getting the fast-path working for the underlying filters and projections.
  
- == ORDER BY and LIMIT in view definition ==
+ === ORDER BY and LIMIT in view definition ===
  
  SQL:200n prohibits ORDER BY in a view definition, since a view is supposed to be a virtual (unordered) table, not a query alias.  However, many DBMS's ignore this rule; for example, MySQL allows ORDER BY, but ignores it in the case where it is superceded by an ORDER BY in the query.  Should we prevent ORDER BY?  This question also applies to the LIMIT clause.
  
  '''Update 30-Dec-2009''':  Based on feedback in JIRA, ORDER BY is important as forward-looking to materialized views.  LIMIT may be less important, but we should probably support it too for consistency.
  
- == Underlying Partition Dependencies ==
+ === Underlying Partition Dependencies ===
  
  '''Update 30-Dec-2009''':  Prasad pointed out that even without supporting materialized views, it may be necessary to provide users with metadata about data dependencies between views and underlying table partitions so that users can avoid seeing inconsistent results during the window when not all partitions have been refreshed with the latest data.  One option is to attempt to derive this information automatically (using an overconservative guess in cases where the dependency analysis can't be made smart enough); another is to allow view creators to declare the dependency rules in some fashion as part of the view definition.  Based on a design review meeting, we will probably go with the automatic analysis approach once dependency tracking is implemented.  The analysis will be performed on-demand, perhaps as part of describing the view or submitting a query job against it.  Until this becomes available, users may be able to do their own analysis either via empirical lineage tools or via view->table dependency tracking metadata once it is implemented.  See HIVE-1079.
  
- = Metastore Upgrades =
+ == Metastore Upgrades ==
  
  Since we are adding new columns to the TBLS table in the metastore schema, existing metastore deployments will need to be upgraded.  There are two ways this can happen.
  
- == Automatic ALTER TABLE ==
+ === Automatic ALTER TABLE ===
  
  If the following property is set in the Hive configuration file, JDO will notice the difference between the persistent schema and the model and ALTER the tables automatically:
  
@@ -154, +158 @@

  </property>
  }}}
  
- == Explicit ALTER TABLE ==
+ === Explicit ALTER TABLE ===
  
  However, if the {{{datanucleus.autoCreateSchema}}} property is set to {{{false}}}, then the ALTER statements must be executed explicitly.  (An administrator may have set this property for safety in production configurations.)
  
@@ -170, +174 @@

  
  Note that it should be safe to execute this script and continue operations BEFORE upgrading Hive; the old Hive version will simply ignore/nullify the columns it doesn't recognize.
  
- == Existing Row UPDATE ==
+ === Existing Row UPDATE ===
  
  After the tables are altered, the new columns will contain NULL values for existing rows describing previously created tables.  This is correct for VIEW_ORIGINAL_TEXT and VIEW_EXPANDED_TEXT (since views did not previously exist), but is incorrect for the TBL_TYPE column introduced by [[https://issues.apache.org/jira/browse/HIVE-1068|HIVE-1068]].  The new Hive code is capable of handling this (automatically filling in the correct value for the new field when a descriptor is retrieved), but it does not "fix" the stored rows.  This could be an issue if in the future other tools are used to retrieve information directly from the metastore database rather than accessing the metastore API.