Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/01/24 17:34:00 UTC

[jira] [Commented] (IMPALA-9068) Impala should respect the distinction between the managed warehouse and the external warehouse

    [ https://issues.apache.org/jira/browse/IMPALA-9068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023135#comment-17023135 ] 

ASF subversion and git services commented on IMPALA-9068:
---------------------------------------------------------

Commit 0163a10332cc534f3a355a662a50de2d50b01c82 in impala's branch refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=0163a10 ]

IMPALA-9068: Use different directories for external vs managed warehouse

Hive 3 changed the typical storage model for tables to split them
between two directories:
 - hive.metastore.warehouse.dir stores managed tables (which is now
   defined to be only transactional tables)
 - hive.metastore.warehouse.external.dir stores external tables
   (everything that is not a transactional table)
More recent commits of Hive validate that external tables are not
stored in the managed directory. To adopt these newer versions of
Hive, we need to use separate directories for the external and
managed warehouses.

Most of our test tables are not transactional, so they would reside
in the external directory. To keep the test changes small, this uses
/test-warehouse for the external directory and /test-warehouse/managed
for the managed directory. Having the managed directory be a subdirectory
of /test-warehouse means that the data snapshot code should not need to
change.
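
The layout above works because the external directory may contain the
managed directory, while external tables themselves must stay out of the
managed directory. A minimal sketch of that containment rule follows. It is
a simplified illustration, not Hive's actual validation code; the helper
names and the example table paths are hypothetical.

```python
import posixpath

# Directories taken from the commit message above.
EXTERNAL_DIR = "/test-warehouse"
MANAGED_DIR = "/test-warehouse/managed"


def is_under(path, parent):
    """Return True if path equals parent or lies somewhere below it."""
    path = posixpath.normpath(path)
    parent = posixpath.normpath(parent)
    return path == parent or path.startswith(parent.rstrip("/") + "/")


def external_location_allowed(location):
    """Simplified rule: an external table may not live in the managed dir."""
    return not is_under(location, MANAGED_DIR)
```

Under this rule a hypothetical external table at /test-warehouse/some_table
is accepted, while /test-warehouse/managed/some_table is rejected, even
though the managed directory itself sits inside /test-warehouse.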

The Hive 2 configuration doesn't change as it does not have this concept.
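
The version-dependent configuration described above could be sketched
roughly as follows. The function name warehouse_properties and its integer
argument are assumptions made for illustration; the property names and
directories come from this commit message, and the Hive 2 value is inferred
from the statement that its configuration is unchanged. Impala's real
configuration is produced by its own build tooling, not by this code.

```python
def warehouse_properties(hive_major_version):
    """Return illustrative warehouse settings for a Hive major version."""
    if hive_major_version >= 3:
        # Hive 3: managed (transactional) and external tables are split.
        return {
            "hive.metastore.warehouse.dir": "/test-warehouse/managed",
            "hive.metastore.warehouse.external.dir": "/test-warehouse",
        }
    # Hive 2 has no external warehouse property, so a single directory
    # holds every table (assumed here to be /test-warehouse).
    return {"hive.metastore.warehouse.dir": "/test-warehouse"}
```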

Since this changes the dataload layout, this also sets the CDH_MAJOR_VERSION
to 7 for USE_CDP_HIVE=true. This means that dataload will use a separate
location for data as compared to USE_CDP_HIVE=false. That should reduce
conflicts between the two configurations.

Testing:
 - Ran exhaustive tests with USE_CDP_HIVE=false
 - Ran exhaustive tests with USE_CDP_HIVE=true (with current Hive version)
 - Verified that dataload succeeds and tests are able to run with a newer
   Hive version.

Change-Id: I3db69f1b8ca07ae98670429954f5f7a1a359eaec
Reviewed-on: http://gerrit.cloudera.org:8080/15026
Reviewed-by: Joe McDonnell <jo...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Impala should respect the distinction between the managed warehouse and the external warehouse
> ----------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-9068
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9068
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 3.4.0
>            Reporter: Joe McDonnell
>            Priority: Blocker
>
> Recent versions of Hive 3 make a distinction between the directory for managed tables and the directory for external tables.
> {code:java}
> WAREHOUSE("metastore.warehouse.dir", "hive.metastore.warehouse.dir", "/user/hive/warehouse",
>         "location of default database for the warehouse"),
> WAREHOUSE_EXTERNAL("metastore.warehouse.external.dir",
>         "hive.metastore.warehouse.external.dir", "",
>         "Default location for external tables created in the warehouse. " +
>         "If not set or null, then the normal warehouse location will be used as the default location."),
> {code}
> With HIVE-22189, Hive strictly enforces the distinction: it no longer allows external tables in hive.metastore.warehouse.dir (the managed directory). Create table statements are currently translated to create external table statements with appropriate table properties, but for this to work correctly, hive.metastore.warehouse.external.dir must be set to a different location than hive.metastore.warehouse.dir. A sensible approach is to set hive.metastore.warehouse.external.dir to /test-warehouse and change hive.metastore.warehouse.dir to something else, like /test-warehouse-managed.
> This will require further changes in our test infrastructure to incorporate this distinction. For example, tests/comparison/cluster/cluster.py's warehouse_dir needs to handle this appropriately (this is needed for testdata/bin/load_nested.py). It may also require changes to some paths for tests that use managed tables.
> hive.metastore.warehouse.external.dir does not exist in Hive 2, so this will require some Hive 3 specific logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
