Posted to user@hive.apache.org by David Beveridge <db...@cylance.com> on 2014/08/12 02:52:52 UTC

Altering the Metastore on EC2

We are creating a Hive schema for reading massive JSON files. Our JSON schema is rather large, and we have found that the default metastore schema for Hive cannot work for us as-is.

To be specific, one field in our schema has about 17KB of nested structs within it. Unfortunately, it appears that Hive's metastore limits the column that stores the resulting type definition to varchar(4000):

    CREATE TABLE "COLUMNS_V2" (
    "CD_ID" bigint NOT NULL,
    "COMMENT" varchar(4000),
    "COLUMN_NAME" varchar(128) NOT NULL,
    "TYPE_NAME" varchar(4000),
    "INTEGER_IDX" INTEGER NOT NULL,
    PRIMARY KEY ("CD_ID", "COLUMN_NAME")
    );

We are running this on Amazon Elastic MapReduce (Hive 0.11 with the default Derby metastore).

So, our initial questions are:

*         Is there a reason that TYPE_NAME is limited to 4000? (IIUC, varchar on Derby can grow to 32672, which would be sufficient for a long time.)

*         Can we alter the metastore schema without hacking/reinstalling Hive? (if so, how?)

*         If so, is there a proper way to update the schema on all nodes?


Thanks in advance!
--DB

Re: Altering the Metastore on EC2

Posted by Stephen Sprague <sp...@gmail.com>.
I'll take a stab at this.

- Probably no reason.

- If you can. Is there a Derby client through which you can issue something
like "alter table COLUMNS_V2 modify TYPE_NAME varchar(32672)"? Otherwise,
maybe use the MySQL or Postgres metastore (instead of Derby) and run that
alter command after the install.
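
For what it's worth, Derby ships with a command-line client (ij), but its
ALTER syntax differs from the MySQL-style MODIFY quoted above. A minimal
sketch of the engine-specific statements, assuming the default embedded
metastore directory (paths and sizes here are illustrative, not from this
thread):

    -- Derby, via the ij tool: Derby uses ALTER COLUMN ... SET DATA TYPE,
    -- and its VARCHAR can grow up to 32672.
    connect 'jdbc:derby:/path/to/metastore_db';
    ALTER TABLE COLUMNS_V2 ALTER COLUMN TYPE_NAME SET DATA TYPE VARCHAR(32672);

    -- MySQL metastore equivalent (MODIFY syntax; the usable maximum depends
    -- on MySQL's row-size limit and the table's character set).
    ALTER TABLE COLUMNS_V2 MODIFY TYPE_NAME VARCHAR(32672);

    -- Postgres metastore equivalent (the Hive DDL creates quoted,
    -- case-sensitive identifiers, so quote them here too).
    ALTER TABLE "COLUMNS_V2" ALTER COLUMN "TYPE_NAME" TYPE VARCHAR(32672);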

- The schema only exists in one place, and that's the metastore (which is
probably on your namenode for Derby). For MySQL or Postgres it can be
anywhere you want, but again, most examples will show localhost (the
namenode).

That's a mighty big schema! Do you not just want to use a string type and
get_json_object to pull data out of it dynamically? Not as elegant as static
syntax like nested structs, but it's better than nothing. Something to think
about anyway.
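
A minimal sketch of that string-plus-get_json_object approach; the table
name, columns, and JSON paths below are hypothetical, not from this thread:

    -- Keep the oversized nested field as a plain STRING column so the
    -- metastore never has to store a 17KB type definition.
    CREATE TABLE events_raw (
      id BIGINT,
      payload STRING
    );

    -- Pull nested values out at query time with JSONPath-style expressions.
    SELECT id,
           get_json_object(payload, '$.device.os.version') AS os_version,
           get_json_object(payload, '$.device.model')      AS model
    FROM events_raw;

If several keys are needed from the same string, json_tuple with a LATERAL
VIEW avoids re-parsing the JSON once per extracted field.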

I'm guessing that, given a nested struct that large, you'll get over one hump
only to be faced with another. Hive needs to do some crazy mapping there for
every record; hopefully that's optimized. :)

Good luck! I'd be curious how it goes.

