Posted to user@hive.apache.org by Edward Capriolo <ed...@gmail.com> on 2022/04/02 21:33:52 UTC

Next gen metastore

While not active in the development community as much I have been using
hive in the field as well as spark and impala for some time.

My anecdotal opinion is that the current metastore needs a significant
rewrite to deal with "next generation" workloads. By next generation I
actually mean last generation.

Currently Cloudera's Impala advice is: no more than 1k rows in a table. And
tables with lots of partitions are problematic.

That really "won't get it done" at the "new" web scale. Hive server can have
memory problems with tables with 2k columns and 5k partitions.

It feels like design ideas like "surely we can fetch all the columns of a
table in one go" don't make sense universally.

Amazon has Glue, which can scale to Amazon scale. The Hive metastore can't
even really scale to a single organization. So what are the next steps? I
don't think it's as simple as "move it to NoSQL"; I think it has to be
reworked from the ground up.


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.

Re: Next gen metastore

Posted by Peter Vary <pv...@cloudera.com>.
Hi Edward,

We are currently working on integrating Apache Iceberg tables to Hive.
In the latest release, Hive 4.0.0-alpha-1, it is possible to create tables backed by Iceberg, and those can be queried by Hive. You can define the partitioning using the Iceberg specification like this:

CREATE EXTERNAL TABLE ice_table (id bigint, year_field date) PARTITIONED BY SPEC (year(year_field)) STORED BY ICEBERG;

These partitions are handled by Iceberg, and no partitions are stored in the HMS.
This removes a significant part of the load from the HMS and allows a higher number of partitions for a single table.
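
Because Iceberg uses hidden partitioning, writers and readers only ever touch the source column; the `year(year_field)` transform is applied by Iceberg internally. A small sketch of what that looks like in practice (the table and column names assume the example above, and the exact behavior depends on the Hive/Iceberg versions in use):

```sql
-- Insert rows: Iceberg derives the partition value via the
-- year(year_field) transform, so no PARTITION clause is needed.
INSERT INTO ice_table VALUES
  (1, DATE '2021-06-15'),
  (2, DATE '2022-01-03');

-- Filtering on the raw column lets Iceberg prune data files using its
-- own metadata; the HMS is not consulted for partition information.
SELECT id FROM ice_table WHERE year_field >= DATE '2022-01-01';
```

The key difference from classic Hive partitioning is that there is no partition column to get wrong and no `MSCK REPAIR`/`ADD PARTITION` bookkeeping; partition metadata lives in Iceberg's manifest files instead of the HMS backend database.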

There is also ongoing work in the Impala project to read/write these Hive-Iceberg tables.

https://blog.cloudera.com/introducing-apache-iceberg-in-cloudera-data-platform/

I hope this helps,
Peter

