Posted to issues@hive.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2017/11/04 21:26:00 UTC
[jira] [Commented] (HIVE-17983) Make the standalone metastore generate tarballs etc.

    [ https://issues.apache.org/jira/browse/HIVE-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239245#comment-16239245 ] 

Alan Gates commented on HIVE-17983:
-----------------------------------

I have committed a patch for this to the standalone-metastore branch.  There is much in this patch to comment on.

On the standalone-metastore side, this patch adds several things.  
# creation of tarballs for both source and binary distributions, hopefully along with the necessary license information;
# install and upgrade scripts for the various RDBMS types (more on this below);
# a version of HiveSchemaTool (called MetastoreSchemaTool) for the standalone metastore;
# docker files for the various RDBMS types for testing (more on this below);
# shell scripts for running the schema tool and starting the metastore;
# log4j configuration for the metastore server process (this config is not what we want, as it dumps the logfile in the current working directory; it should be changed to match Hive's behavior for where the file is written).

On the docker files, I have added ones for mysql and postgres.  I haven't done oracle or sqlserver yet.  The docker files are generated as part of the build, but the build does not call {{docker build}}.  This is partly because building the images is time consuming and eats up several GB of disk space.  But for oracle there are also license issues that prevent the automatic inclusion of the docker images.  Right now the resulting containers just have the required RDBMS, Hadoop, and the metastore distribution tarball (unpacked).  My goal is to get to a point where different docker files are created for users to test against the metastore, and for automated testing of installation and upgrade of each RDBMS type.

On the installation and upgrade scripts, I have only copied the 2.3 and 3.0 installation scripts and the 2.3->3.0 upgrade script.  My assumption is that the standalone metastore will be used with Hive 3.0 or later, so it doesn't make sense to copy all the older scripts.  I copied in the 2.3 installation scripts so that we could test the upgrade procedure.

Also on these scripts, I have unrolled them so that scripts no longer invoke other scripts.  For example, the 2.3->3.0 upgrade script now includes all the create table and alter table statements itself rather than calling run on the various 0XX-HIVE-XXXXX.rdbms.sql scripts.  The main reason for this is that HiveSchemaTool went to a lot of work to do the unrolling on these scripts.  As part of copying HiveSchemaTool I had to convert it to use SqlLine rather than Beeline (since the metastore does not have access to beeline), and I did not want to go through the work of making the unrolling work for SqlLine.  I also saw no advantage to having every DB change in a separate script.  Our tools only support upgrade between versions.  I suspect these separate updates are a holdover from the days when Facebook used to run Hive top of trunk internally and thus wanted to be able to apply each change discretely.
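
To make "unrolling" concrete, here is a minimal Python sketch of the general idea: a script that pulls in other scripts is flattened into one list of statements.  This is not the code in HiveSchemaTool or MetastoreSchemaTool.  The function name {{unroll}} is hypothetical, and the sketch assumes a MySQL-style {{SOURCE file;}} include line; other RDBMS types use different include syntax.

```python
import re
from pathlib import Path

# Assumption for illustration: an include is a line like "SOURCE child.sql;".
INCLUDE = re.compile(r"^\s*SOURCE\s+(\S+?);?\s*$", re.IGNORECASE)


def unroll(script, seen=None):
    """Return the lines of `script` with every nested include inlined."""
    script = Path(script)
    seen = set() if seen is None else seen
    resolved = script.resolve()
    if resolved in seen:
        # A script that (directly or indirectly) includes itself never ends.
        raise ValueError(f"circular include: {script}")
    seen = seen | {resolved}

    lines = []
    for line in script.read_text().splitlines():
        match = INCLUDE.match(line)
        if match:
            # Included paths are resolved relative to the including script.
            lines.extend(unroll(script.parent / match.group(1), seen))
        else:
            lines.append(line)
    return lines
```

Calling {{unroll}} on a top-level upgrade script would return the flat statement list that a pre-unrolled script simply contains already, which is why shipping the scripts unrolled removes the need for this step in the tool.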

This patch does not remove the RDBMS scripts from metastore or HiveSchemaTool from beeline.  There are two reasons for this.  One, the Hive information schema depends on HiveSchemaTool to set up a series of tables in Hive via beeline.  The metastore version of SchemaTool can't do this, because it doesn't have access to beeline.

But the second and much larger reason is that this brings up the question of how Hive and the standalone metastore should be installed.  Do we completely separate them out and require users to install the standalone metastore and then Hive?  This is easier for devs but harder on ops and packagers.  But it also gives users maximum flexibility.  Or do we modify the Hive build process to pull in the standalone metastore packages and produce a distribution that includes the metastore?  This is more work for us devs.  It gives users a seamless experience between older and newer versions of Hive.  It also matches user expectations (I can't think of any database that requires you to install its data catalog as a separate package).  On the other hand it locksteps a version of Hive with a version of the metastore, which may not be what people want.  I don't propose to answer these questions in this JIRA, but I wanted to bring them up so we can start discussing them.

Leaving copies of the installation and upgrade scripts in metastore and HiveSchemaTool in beeline that duplicate much code that's also in standalone-metastore is obviously not a viable long-term solution.  We will need some combination of separating things cleanly and refactoring so that a minimum amount of code is duplicated.  But until we answer the questions above we won't know which way to go, so I've left it like this for the moment.

> Make the standalone metastore generate tarballs etc.
> ----------------------------------------------------
>
>                 Key: HIVE-17983
>                 URL: https://issues.apache.org/jira/browse/HIVE-17983
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Standalone Metastore
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Major
>
> In order to be separately installable the standalone metastore needs its own tarballs, startup scripts, etc.  All of the SQL installation and upgrade scripts also need to move from metastore to standalone-metastore.
> I also plan to create Dockerfiles for different database types so that developers can test the SQL installation and upgrade scripts.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)