Posted to dev@iceberg.apache.org by Laszlo Pinter <lp...@cloudera.com.INVALID> on 2020/11/25 09:40:24 UTC

Iceberg/Hive properties handling

Hi All,

I would like to start a discussion, how should we handle properties from
various sources like Iceberg, Hive or global configuration. I've put
together a short document
<https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
please have a look and let me know what you think.

Thanks,
Laszlo

Re: Iceberg/Hive properties handling

Posted by Laszlo Pinter <lp...@cloudera.com.INVALID>.
Hi Team,

Based on the discussion on this thread, I implemented the Spark-like
catalog configuration in Hive. You can check the PR here
<https://github.com/apache/iceberg/pull/2129>.
Basically, I extended the current Hive global-config-based catalog
configuration with table-level catalog configuration, with the latter
taking higher priority. I kept the old logic because I wanted to make sure
that I don't break any already implemented use cases that rely on this
config approach. Sooner or later I would like to remove the old one,
because that would reduce code complexity and improve readability, but I'm
not sure when the right time is.
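
For illustration, here is a minimal sketch of the two configuration levels
being combined; the property names follow the ones discussed in this thread
(the global iceberg.mr.catalog HiveConf setting and the table-level
iceberg.catalog property), and the exact names and values in the final PR
may differ:

  -- global, HiveConf-based catalog setting (the old mechanism; value shown is illustrative):
  SET iceberg.mr.catalog=hive.catalog;

  -- table-level catalog setting, which takes precedence for this table:
  CREATE EXTERNAL TABLE customers (id BIGINT, name STRING)
  STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
  TBLPROPERTIES ('iceberg.catalog'='hadoop.tables');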

Can I introduce such breaking changes between releases? Frankly, I would
keep both for the time being, but deprecate the old one, and in a future
release completely remove it.

Thanks,
Laszlo




On Tue, Jan 12, 2021 at 4:20 AM Jacques Nadeau <ja...@gmail.com>
wrote:

> Hey Peter,
>
> Despite best intentions I made only nominal progress.
>
>
>
> On Sun, Jan 10, 2021 at 10:33 PM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
>> Hi Team,
>>
>> @Jacques Nadeau <ja...@dremio.com>: you mentioned that you might
>> consolidate the thoughts in a document for the path forward. Did you have
>> time for that, or have the holidays overwritten all of the plans as usual :)
>>
> Other: Ryan convinced me that it would be good to move forward with the
>> synchronised Hive-Iceberg property list whenever it is possible, and use
>> the Iceberg Table properties as a master when not. This would be the
> solution which aligns most with the other integration solutions.
>>
>> Thanks, Peter
>>
>> Peter Vary <pv...@cloudera.com> ezt írta (időpont: 2020. dec. 10., Csü
>> 8:27):
>>
>>> I like the strong coupling between Hive and Iceberg if we can make it
>>> work. It could be beneficial for the end users, but I still have some
>>> concerns.
>>> We should consider the following aspects:
>>> - Where has the change initiated (Hive or Spark)
>>> - Which Catalog is used (HiveCatalog or other)
>>> - Which Hive version is used (Hive 2/3)
>>>
>>> Some current constraints I think we have:
>>> - There could be multiple Hive tables above a single Iceberg table with
>>> most of the Catalogs (HiveCatalog being the single exception)
>>> - I see no way to propagate Spark changes to HMS if the Catalog is not
>>> HiveCatalog
>>> - Only Hive3 has ways to propagate changes to the Iceberg table after
>>> creation
>>> - Hive inserts modify the table data (one Iceberg commit) and then the
>>> table metadata (another Iceberg commit). This could be suboptimal but
>>> solvable.
>>>
>>> My feeling is that the tight coupling could work as expected with only
>>> the HiveCatalog using Hive3. In every other case the Iceberg and the HMS
>>> properties will deviate. That is why I think it would be easier to
>>> understand for the user that Iceberg and Hive are different systems with
>>> different properties.
>>>
>>> All that said we will use Hive3 and HiveCatalog so I think we are fine
>>> with 1-on-1 mapping too.
>>> If we move this way we should remove the current property filtering from
>>> the HiveCatalog and from the HiveIcebergMetaHook, so we are consistent.
>>>
>>> Thanks, Peter
>>>
>>> Jacques Nadeau <ja...@dremio.com> ezt írta (időpont: 2020. dec. 9.,
>>> Sze 23:01):
>>>
>>>> Who cares if there are a few extra properties from Hive? Users may
>>>>> expect those properties to be there anyway.
>>>>
>>>>
>>>> Yeah, what is the key argument against letting them leak? What problem
>>>> are people trying to solve?
>>>>
>>>>
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
>>>>
>>>>
>>>> On Wed, Dec 9, 2020 at 12:47 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> I agree that #2 doesn’t really work. I also think that #4 can’t work
>>>>> either. There is no way to add a prefix for HMS properties that already
>>>>> exist, so the only option is to have a list of properties to suppress,
>>>>> which is option #1.
>>>>>
>>>>> I think that option #3 is a bad idea because it would lead to
>>>>> surprising behavior for users. If a user creates a table using Hive DDL and
>>>>> sets table properties, those properties should be present in the source of
>>>>> truth Iceberg table. If a prefix was required to forward them to Iceberg,
>>>>> that would create a situation where properties appear to be missing because
>>>>> the user tried to use syntax that works for nearly every other table.
>>>>>
>>>>> That leaves either option #1 or doing nothing. I actually think that
>>>>> there’s a strong argument to do nothing here and allow Hive and Iceberg
>>>>> properties to be mixed in the Iceberg table. Who cares if there are a few
>>>>> extra properties from Hive? Users may expect those properties to be there
>>>>> anyway.
>>>>>
>>>>> rb
>>>>>
>>>>> On Mon, Dec 7, 2020 at 9:58 AM Jacques Nadeau jacques@dremio.com
>>>>> <ht...@dremio.com> wrote:
>>>>>
>>>>> Hey Peter, thanks for updating the doc and your heads up in the other
>>>>>> thread on your capacity to look at this before EOY.
>>>>>>
>>>>>> I'm going to try to create a specification document based on the
>>>>>> discussion document you put together. I think there is general consensus
>>>>>> around what you call "Spark-like catalog configuration" so I'd like to
>>>>>> formalize that more.
>>>>>>
>>>>>> It seems like there is less consensus around the whitelist/blacklist
>>>>>> side of things. You outline four approaches:
>>>>>>
>>>>>>    1. Hard coded HMS only property list
>>>>>>    2. Hard coded Iceberg only property list
>>>>>>    3. Prefix for Iceberg properties
>>>>>>    4. Prefix for HMS only properties
>>>>>>
>>>>>> I generally think #2 is a no-go as it creates too much coupling
>>>>>> between catalog implementations and core iceberg. It seems like Ryan Blue
>>>>>> would prefer #4 (correct?). Any other strong opinions?
>>>>>> --
>>>>>> Jacques Nadeau
>>>>>> CTO and Co-Founder, Dremio
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 3, 2020 at 9:27 AM Peter Vary <pv...@cloudera.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> As Jacques suggested (with the help of Zoltan) I have collected the
>>>>>>> current state and the proposed solutions in a document:
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
>>>>>>>
>>>>>>> My feeling is that we do not have a final decision, so I tried to list
>>>>>>> all the possible solutions.
>>>>>>> Please comment!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Peter
>>>>>>>
>>>>>>> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
>>>>>>>
>>>>>>> When I was working on the CREATE TABLE patch I found the following
>>>>>>> TBLPROPERTIES on newly created tables:
>>>>>>>
>>>>>>>    - external.table.purge
>>>>>>>    - EXTERNAL
>>>>>>>    - bucketing_version
>>>>>>>    - numRows
>>>>>>>    - rawDataSize
>>>>>>>    - totalSize
>>>>>>>    - numFiles
>>>>>>>    - numFileErasureCoded
>>>>>>>
>>>>>>>
>>>>>>> I am afraid that we can not change the name of most of these
>>>>>>> properties, and it might not be useful to have most of them along with the
>>>>>>> Iceberg statistics already there. Also my feeling is that this is only the
>>>>>>> tip of the Iceberg (pun intended :)), so this is why I think we should have
>>>>>>> a more targeted way to push properties to the Iceberg tables.
>>>>>>>
>>>>>>> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>
>>>>>>> Sorry, I accidentally didn’t copy the dev list on this reply.
>>>>>>> Resending:
>>>>>>>
>>>>>>> Also I expect that we want to add Hive write specific configs to
>>>>>>> table level when the general engine independent configuration is not ideal
>>>>>>> for Hive, but every Hive query for a given table should use some specific
>>>>>>> config.
>>>>>>>
>>>>>>> Hive may need configuration, but I think these should still be kept
>>>>>>> in the Iceberg table. There is no reason to make Hive config inaccessible
>>>>>>> from other engines. If someone wants to view all of the config for a table
>>>>>>> from Spark, the Hive config should also be included right?
>>>>>>>
>>>>>>> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I will ask Laszlo if he wants to update his doc.
>>>>>>>>
>>>>>>>> I see both pros and cons of catalog definition in config files. If
>>>>>>>> there is an easy default then I do not mind any of the proposed solutions.
>>>>>>>>
>>>>>>>> OTOH I am in favor of the "use prefix for Iceberg table properties"
>>>>>>>> solution, because in Hive it is common to add new keys to the property list
>>>>>>>> - no restriction is in place (I am not even sure that the currently
>>>>>>>> implemented blacklist for preventing properties from being propagated to Iceberg
>>>>>>>> tables is complete). Also I expect that we want to add Hive write specific
>>>>>>>> configs to table level when the general engine independent configuration is
>>>>>>>> not ideal for Hive, but every Hive query for a given table should use some
>>>>>>>> specific config.
>>>>>>>>
>>>>>>>> Thanks, Peter
>>>>>>>>
>>>>>>>> Jacques Nadeau <ja...@dremio.com> ezt írta (időpont: 2020. dec.
>>>>>>>> 1., Ke 17:06):
>>>>>>>>
>>>>>>>>> Would someone be willing to create a document that states the
>>>>>>>>> current proposal?
>>>>>>>>>
>>>>>>>>> It is becoming somewhat difficult to follow this thread. I also
>>>>>>>>> worry that without a complete statement of the current shape that people
>>>>>>>>> may be incorrectly thinking they are in alignment.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jacques Nadeau
>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <
>>>>>>>>> boroknagyz@cloudera.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks, Ryan. I answered inline.
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> This sounds like a good plan overall, but I have a couple of
>>>>>>>>>>> notes:
>>>>>>>>>>>
>>>>>>>>>>>    1. We need to keep in mind that users plug in their own
>>>>>>>>>>>    catalogs, so iceberg.catalog could be a Glue or Nessie
>>>>>>>>>>>    catalog, not just Hive or Hadoop. I don’t think it makes much sense to use
>>>>>>>>>>>    separate hadoop.catalog and hive.catalog values. Those should just be names
>>>>>>>>>>>    for catalogs configured in Configuration, i.e., via
>>>>>>>>>>>    hive-site.xml. We then only need a special value for loading
>>>>>>>>>>>    Hadoop tables from paths.
>>>>>>>>>>>
>>>>>>>>>>> About extensibility, I think the usual Hive way is to use Java
>>>>>>>>>> class names. So this way the value for 'iceberg.catalog' could be e.g.
>>>>>>>>>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
>>>>>>>>>> would need to have a factory method that constructs the catalog object from
>>>>>>>>>> a properties object (Map<String, String>). E.g.
>>>>>>>>>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
>>>>>>>>>> 'iceberg.catalog_location' to be present in properties.
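
A hedged sketch of what the class-name approach could look like in DDL
(property names as proposed in this thread, not a released interface):

  CREATE EXTERNAL TABLE db.ice_t (id BIGINT)
  STORED AS ICEBERG
  TBLPROPERTIES ('iceberg.catalog'='org.apache.iceberg.hadoop.HadoopCatalog',
                 'iceberg.catalog_location'='hdfs://namenode:8020/warehouse/hadoop_catalog');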
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    1. I don’t think that catalog configuration should be kept
>>>>>>>>>>>    in table properties. A catalog should not be loaded for each table. So I
>>>>>>>>>>>    don’t think we need iceberg.catalog_location. Instead, we
>>>>>>>>>>>    should have a way to define catalogs in the Configuration
>>>>>>>>>>>    for tables in the metastore to reference.
>>>>>>>>>>>
>>>>>>>>>>>  I think it makes sense, on the other hand it would make adding
>>>>>>>>>> new catalogs more heavy-weight, i.e. now you'd need to edit configuration
>>>>>>>>>> files and restart/reinit services. Maybe it can be cumbersome in some
>>>>>>>>>> environments.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    1. I’d rather use a prefix to exclude properties from being
>>>>>>>>>>>    passed to Iceberg than to include them. Otherwise, users don’t know what to
>>>>>>>>>>>    do to pass table properties from Hive or Impala. If we exclude a prefix or
>>>>>>>>>>>    specific properties, then everything but the properties reserved for
>>>>>>>>>>>    locating the table are passed as the user would expect.
>>>>>>>>>>>
>>>>>>>>>>> I don't have a strong opinion about this, but yeah, maybe this
>>>>>>>>>> behavior would cause the least surprises.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <
>>>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Peter. I answered inline.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <
>>>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Zoltan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Answers below:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>>>>>>>>>>> boroknagyz@cloudera.com.INVALID> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the replies. My take for the above questions are as
>>>>>>>>>>>>> follows
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - Should 'iceberg.catalog' be a required property?
>>>>>>>>>>>>>    - Yeah, I think it would be nice if this would be required
>>>>>>>>>>>>>       to avoid any implicit behavior
>>>>>>>>>>>>>
>>>>>>>>>>>>> Currently we have a Catalogs class to get/initialize/use the
>>>>>>>>>>>>> different Catalogs. At that time the decision was to use HadoopTables as a
>>>>>>>>>>>>> default catalog.
>>>>>>>>>>>>> It might be worthwhile to use the same class in Impala as
>>>>>>>>>>>>> well, so the behavior is consistent.
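
As a concrete illustration of that default (a sketch only): with no
iceberg.catalog property set anywhere, the table LOCATION itself is loaded
as an Iceberg HadoopTables table.

  CREATE EXTERNAL TABLE db.ice_t (id BIGINT)
  STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
  LOCATION 'hdfs://namenode:8020/warehouse/db/ice_t';
  -- neither iceberg.catalog nor iceberg.mr.catalog is set, so the location
  -- above is treated as a HadoopTables path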
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Yeah, I think it'd be beneficial for us to use the Iceberg
>>>>>>>>>>>> classes whenever possible. The Catalogs class is very similar to what we
>>>>>>>>>>>> have currently in Impala.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - 'hadoop.catalog' LOCATION and catalog_location
>>>>>>>>>>>>>       - In Impala we don't allow setting LOCATION for tables
>>>>>>>>>>>>>       stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the
>>>>>>>>>>>>>       Iceberg table's actual location. We were also thinking about using only the
>>>>>>>>>>>>>       table LOCATION, and set it to the catalog location, but we also found it
>>>>>>>>>>>>>       confusing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It could definitely work, but it is somewhat strange that we
>>>>>>>>>>>>> have an external table location set to an arbitrary path, and we have a
>>>>>>>>>>>>> different location generated by other configs. It would be nice to have the
>>>>>>>>>>>>> real location set in the external table location as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Impala sets the real Iceberg table location for external
>>>>>>>>>>>> tables. E.g. if the user issues
>>>>>>>>>>>>
>>>>>>>>>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>>>>>>>>>> STORED AS ICEBERG
>>>>>>>>>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>>>>>>>>>
>>>>>>>>>>>> 'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>>>>>>>>>
>>>>>>>>>>>> 'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>>>>>>>>>
>>>>>>>>>>>> If the end user had specified LOCATION, then Impala would have
>>>>>>>>>>>> raised an error. But the above DDL statement is correct, so Impala loads
>>>>>>>>>>>> the iceberg table via Iceberg API, then creates the HMS table and sets
>>>>>>>>>>>> LOCATION to the Iceberg table location (something like
>>>>>>>>>>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I like the flexibility of setting the table_identifier on
>>>>>>>>>>>>> table level, which could help removing naming conflicts. We might want to
>>>>>>>>>>>>> have this in the Iceberg Catalog implementation.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - 'iceberg.table_identifier' for HiveCatalog
>>>>>>>>>>>>>       - Yeah, it doesn't add much if we only allow using the
>>>>>>>>>>>>>       current HMS. I think it can be only useful if we are allowing external
>>>>>>>>>>>>>       HMSes.
>>>>>>>>>>>>>    - Moving properties to SERDEPROPERTIES
>>>>>>>>>>>>>       - I see that these properties are used by the SerDe
>>>>>>>>>>>>>       classes in Hive, but I feel that these properties are just not about
>>>>>>>>>>>>>       serialization and deserialization. And as I see the current SERDEPROPERTIES
>>>>>>>>>>>>>       are things like 'field.delim', 'separatorChar', 'quoteChar', etc. So
>>>>>>>>>>>>>       properties about table loading more naturally belong to TBLPROPERTIES in my
>>>>>>>>>>>>>       opinion.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have seen it used both ways for HBaseSerDe. (even the wiki
>>>>>>>>>>>>> page uses both :) ). Since Impala prefers TBLPROPERTIES and if we start
>>>>>>>>>>>>> using prefix for separating real Iceberg table properties from other
>>>>>>>>>>>>> properties, then we can keep it at TBLPROPERTIES.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> In the google doc I also had a comment about prefixing iceberg
>>>>>>>>>>>> table properties. We could use a prefix like 'iceberg.tblproperties.', and
>>>>>>>>>>>> pass every property with this prefix to the Iceberg table. Currently Impala
>>>>>>>>>>>> passes every table property to the Iceberg table.
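
A short sketch of how the prefix idea could look ('iceberg.tblproperties.'
is only the proposal above; write.format.default is used as an example
Iceberg table property):

  ALTER TABLE db.ice_t
  SET TBLPROPERTIES ('iceberg.tblproperties.write.format.default'='orc');
  -- under the proposal, only the prefixed key is forwarded to the Iceberg
  -- table (as write.format.default=orc); unprefixed keys stay in HMS only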
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>     Zoltan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <
>>>>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Based on the discussion below I understand we have the
>>>>>>>>>>>>>> following kinds of properties:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    1. Iceberg table properties - Engine independent, storage
>>>>>>>>>>>>>>    related parameters
>>>>>>>>>>>>>>    2. "how to get to" - I think these are mostly Hive table
>>>>>>>>>>>>>>    specific properties, since for Spark, the Spark catalog configuration
>>>>>>>>>>>>>>    serves the same purpose. I think the best place for storing these would
>>>>>>>>>>>>>>    be the Hive SERDEPROPERTIES, as this describes the access information for
>>>>>>>>>>>>>>    the SerDe. Sidenote: I think we should decide if we allow
>>>>>>>>>>>>>>    HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier'
>>>>>>>>>>>>>>    would make sense only if we allow having multiple catalogs.
>>>>>>>>>>>>>>    3. Query specific properties - These are engine specific
>>>>>>>>>>>>>>    and might be mapped to / even override the Iceberg table properties on the
>>>>>>>>>>>>>>    engine specific code paths, but currently these properties have independent
>>>>>>>>>>>>>>    names and are mapped on a case-by-case basis.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Based on this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - Shall we move the "how to get to" properties to
>>>>>>>>>>>>>>    SERDEPROPERTIES?
>>>>>>>>>>>>>>    - Shall we define a prefix for setting Iceberg table
>>>>>>>>>>>>>>    properties from Hive queries and omitting other engine specific properties?
>>>>>>>>>>>>>>
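
As a hedged sketch of the first question in DDL terms (whether the Hive
integration would actually read these keys from SERDEPROPERTIES is exactly
what is being asked here):

  CREATE EXTERNAL TABLE db.ice_t (id BIGINT)
  STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
  WITH SERDEPROPERTIES ('iceberg.catalog'='hadoop.catalog',
                        'iceberg.catalog_location'='hdfs://namenode:8020/warehouse/catalog')
  TBLPROPERTIES ('write.format.default'='parquet');
  -- "how to get to" keys in SERDEPROPERTIES, Iceberg table properties in TBLPROPERTIES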
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I like these suggestions, comments inline below on the last
>>>>>>>>>>>>>> round...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <
>>>>>>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>>>>>>>>>>> information about table loading in HMS table properties. We are just a bit
>>>>>>>>>>>>>>> more explicit about which catalog to use.
>>>>>>>>>>>>>>> We have table property 'iceberg.catalog' to determine the
>>>>>>>>>>>>>>> catalog type, right now the supported values are 'hadoop.tables',
>>>>>>>>>>>>>>> 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be
>>>>>>>>>>>>>>> set based on the catalog type.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm all for renaming this, having "mr" in the property name
>>>>>>>>>>>>>> is confusing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - hadoop.tables
>>>>>>>>>>>>>>>       - the table location is used to load the table
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The only question I have is should we have this as the
>>>>>>>>>>>>>> default? i.e. if you don't set a catalog it will assume it's HadoopTables
>>>>>>>>>>>>>> and use the location? Or should we require this property to be here to be
>>>>>>>>>>>>>> consistent and avoid any "magic"?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - hadoop.catalog
>>>>>>>>>>>>>>>       - Required table property 'iceberg.catalog_location'
>>>>>>>>>>>>>>>       specifies the location of the hadoop catalog in the file system
>>>>>>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>>>>       is used as table identifier
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I like this as it would allow you to use a different
>>>>>>>>>>>>>> database and table name in Hive as opposed to the Hadoop Catalog - at the
>>>>>>>>>>>>>> moment they have to match. The only thing here is that I think Hive
>>>>>>>>>>>>>> requires a table LOCATION to be set and it's then confusing as there are
>>>>>>>>>>>>>> now two locations on the table. I'm not sure whether in the Hive storage
>>>>>>>>>>>>>> handler or SerDe etc. we can get Hive to not require that and maybe even
>>>>>>>>>>>>>> disallow it from being set. That would probably be best in conjunction with
>>>>>>>>>>>>>> this. Another solution would be to not have the 'iceberg.catalog_location'
>>>>>>>>>>>>>> property but instead use the table LOCATION for this but that's a bit
>>>>>>>>>>>>>> confusing from a Hive point of view.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - hive.catalog
>>>>>>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>>>>       is used as table identifier
>>>>>>>>>>>>>>>       - We have the assumption that the current Hive
>>>>>>>>>>>>>>>       metastore stores the table, i.e. we don't support external Hive
>>>>>>>>>>>>>>>       metastores currently
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These sound fine for Hive catalog tables that are created
>>>>>>>>>>>>>> outside of the automatic Hive table creation (see
>>>>>>>>>>>>>> https://iceberg.apache.org/hive/ -> Using Hive Catalog) we'd
>>>>>>>>>>>>>> just need to document how you can create these yourself and that one could
>>>>>>>>>>>>>> use a different Hive database and table etc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Independent of catalog implementations, we also have
>>>>>>>>>>>>>>> table property 'iceberg.file_format' to specify the file format for the
>>>>>>>>>>>>>>> data files.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> OK, I don't think we need that for Hive?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We haven't released it yet, so we are open to changes, but I
>>>>>>>>>>>>>>> think these properties are reasonable and it would be great if we could
>>>>>>>>>>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>>>>>>>>>>> metastore of tables.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If others agree I think we should create an issue where we
>>>>>>>>>>>>>> document the above changes so it's very clear what we're doing and can then
>>>>>>>>>>>>>> go and implement them and update the docs etc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>     Zoltan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <
>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> #4 is correct because we provide some information that is
>>>>>>>>>>>>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>>>>>>>>>>>>> current user). I also agree that it would be good to have a table
>>>>>>>>>>>>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>>>>>>>>>>>>> gives us a way to handle name conflicts.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <
>>>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> db1.table1_etl_branch =>
>>>>>>>>>>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <
>>>>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I agree with Ryan on the core principles here. As I
>>>>>>>>>>>>>>>>>> understand them:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    1. Iceberg metadata describes all properties of a
>>>>>>>>>>>>>>>>>>    table
>>>>>>>>>>>>>>>>>>    2. Hive table properties describe "how to get to"
>>>>>>>>>>>>>>>>>>    Iceberg metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>>>>>>>>>>>    3. There could be default "how to get to" information
>>>>>>>>>>>>>>>>>>    set at a global level
>>>>>>>>>>>>>>>>>>    4. Best-effort schema should be stored in the table
>>>>>>>>>>>>>>>>>>    properties in HMS. This should be done for information schema retrieval
>>>>>>>>>>>>>>>>>>    purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is that a fair summary of your statements Ryan (except 4,
>>>>>>>>>>>>>>>>>> which I just added)?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One comment I have on #2 is that for different catalogs
>>>>>>>>>>>>>>>>>> and use cases, I think it can be somewhat more complex: it would be
>>>>>>>>>>>>>>>>>> desirable for a table that initially existed without Hive and was later
>>>>>>>>>>>>>>>>>> exposed in Hive to support a ptr/path/token for how the table is named
>>>>>>>>>>>>>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>>>>>>>>>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>>>>>>>>>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Similarly, you might want to expose a particular branch
>>>>>>>>>>>>>>>>>> version of a table. So it might say:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Just saying that the address to the table in the catalog
>>>>>>>>>>>>>>>>>> could itself have several properties. The key being that no matter what
>>>>>>>>>>>>>>>>>> those are, we should follow #1 and only store properties that are about the
>>>>>>>>>>>>>>>>>> ptr, not the content/metadata.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it.
>>>>>>>>>>>>>>>>>> Can someone confirm that it is true? And that it is possible/not
>>>>>>>>>>>>>>>>>> problematic?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking
>>>>>>>>>>>>>>>>>>> about these problems as well, so this is a good time to have a discussion
>>>>>>>>>>>>>>>>>>> about Hive config.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think that Hive configuration should work mostly like
>>>>>>>>>>>>>>>>>>> other engines, where different configurations are used for different
>>>>>>>>>>>>>>>>>>> purposes. Different purposes means that there is not a global configuration
>>>>>>>>>>>>>>>>>>> priority. Hopefully, I can explain how we use the different config sources
>>>>>>>>>>>>>>>>>>> elsewhere to clarify.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it
>>>>>>>>>>>>>>>>>>> has a Hadoop Configuration, but it also has its own global configuration.
>>>>>>>>>>>>>>>>>>> There are also Iceberg table properties, and all of the various Hive
>>>>>>>>>>>>>>>>>>> properties if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The first step is to simplify where we can, so we
>>>>>>>>>>>>>>>>>>> effectively eliminate 2 sources of config:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>    - The Hadoop Configuration is only used to
>>>>>>>>>>>>>>>>>>>    instantiate Hadoop classes, like FileSystem. Iceberg should not use it for
>>>>>>>>>>>>>>>>>>>    any other config.
>>>>>>>>>>>>>>>>>>>    - Config in the Hive MetaStore is only used to
>>>>>>>>>>>>>>>>>>>    identify that a table is Iceberg and point to its metadata location. All
>>>>>>>>>>>>>>>>>>>    other config in HMS is informational. For example, the input format is
>>>>>>>>>>>>>>>>>>>    FileInputFormat so that non-Iceberg readers cannot actually instantiate the
>>>>>>>>>>>>>>>>>>>    format (it’s abstract) but it is available so they also don’t fail trying
>>>>>>>>>>>>>>>>>>>    to load the class. Table-specific config should not be stored in table or
>>>>>>>>>>>>>>>>>>>    serde properties.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> That leaves Spark configuration and Iceberg table
>>>>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Iceberg differs from other tables because it is
>>>>>>>>>>>>>>>>>>> opinionated: data configuration should be maintained at the table level.
>>>>>>>>>>>>>>>>>>> This is cleaner for users because config is standardized across engines and
>>>>>>>>>>>>>>>>>>> in one place. And it also enables services that analyze a table and update
>>>>>>>>>>>>>>>>>>> its configuration to tune options that users almost never do, like row
>>>>>>>>>>>>>>>>>>> group or stripe size in the columnar formats. Iceberg table configuration
>>>>>>>>>>>>>>>>>>> is used to configure table-specific concerns and behavior.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Spark configuration is used for engine-specific
>>>>>>>>>>>>>>>>>>> concerns, and runtime overrides. A good example of an engine-specific
>>>>>>>>>>>>>>>>>>> concern is the catalogs that are available to load Iceberg tables. Spark
>>>>>>>>>>>>>>>>>>> has a way to load and configure catalog implementations and Iceberg uses
>>>>>>>>>>>>>>>>>>> that for all catalog-level config. Runtime overrides are things like target
>>>>>>>>>>>>>>>>>>> split size. Iceberg has a table-level default split size in table
>>>>>>>>>>>>>>>>>>> properties, but this can be overridden by a Spark option for each table, as
>>>>>>>>>>>>>>>>>>> well as an option passed to the individual read. Note that these
>>>>>>>>>>>>>>>>>>> necessarily have different config names for how they are used: Iceberg uses
>>>>>>>>>>>>>>>>>>> read.split.target-size and the read-specific option is
>>>>>>>>>>>>>>>>>>> target-size.
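
For reference, a sketch of how that layering looks on the Spark side; the
spark.sql.catalog.* keys and the Iceberg read.split.target-size table
property are real settings, while the per-read option is only referenced as
described above:

  -- engine level: define an Iceberg catalog in the Spark configuration
  -- (usually provided via spark-defaults.conf or --conf at launch)
  SET spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog;
  SET spark.sql.catalog.my_catalog.type=hive;

  -- table level: Iceberg table property holding the default split size
  ALTER TABLE my_catalog.db.t
  SET TBLPROPERTIES ('read.split.target-size'='134217728');

  -- read level: a per-query override is passed as a read option on the
  -- individual scan (the target-size option mentioned above), outside of SQL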
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Applying this to Hive is a little strange for a couple
>>>>>>>>>>>>>>>>>>> reasons. First, Hive’s engine configuration *is* a
>>>>>>>>>>>>>>>>>>> Hadoop Configuration. As a result, I think the right place to store
>>>>>>>>>>>>>>>>>>> engine-specific config is there, including Iceberg catalogs using a
>>>>>>>>>>>>>>>>>>> strategy similar to what Spark does: what external Iceberg catalogs are
>>>>>>>>>>>>>>>>>>> available and their configuration should come from the HiveConf.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The second way Hive is strange is that Hive needs to use
>>>>>>>>>>>>>>>>>>> its own MetaStore to track Hive table concerns. The MetaStore may have
>>>>>>>>>>>>>>>>>>> tables created by an Iceberg HiveCatalog, and Hive also needs to be able to
>>>>>>>>>>>>>>>>>>> load tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>    - There should be a default HiveCatalog that uses
>>>>>>>>>>>>>>>>>>>    the current MetaStore URI to be used for HiveCatalog tables tracked in the
>>>>>>>>>>>>>>>>>>>    MetaStore
>>>>>>>>>>>>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>>>>>>>>>>>>    - HMS table properties should be used to determine
>>>>>>>>>>>>>>>>>>>    how to load a table: using a Hadoop location, using the default metastore
>>>>>>>>>>>>>>>>>>>    catalog, or using an external Iceberg catalog
>>>>>>>>>>>>>>>>>>>       - If there is a metadata_location, then use the
>>>>>>>>>>>>>>>>>>>       HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>>>>>>>>>>       - If there is a catalog property, then load that
>>>>>>>>>>>>>>>>>>>       catalog and use it to load the table identifier, or maybe an identifier
>>>>>>>>>>>>>>>>>>>       from HMS table properties
>>>>>>>>>>>>>>>>>>>       - If there is no catalog or metadata_location,
>>>>>>>>>>>>>>>>>>>       then use HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>>>>>>>>>>
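
A hedged end-to-end sketch of that proposal; the iceberg.catalog.<name>.*
HiveConf keys below are illustrative names for the "catalogs defined in
HiveConf" idea, not an existing interface:

  -- hive-site.xml / HiveConf: define an external Iceberg catalog by name
  SET iceberg.catalog.prod_hadoop.type=hadoop;
  SET iceberg.catalog.prod_hadoop.warehouse=hdfs://namenode:8020/warehouse/prod;

  -- HMS entry pointing at that named catalog
  CREATE EXTERNAL TABLE db.t2 (id BIGINT)
  STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
  TBLPROPERTIES ('iceberg.catalog'='prod_hadoop');

  -- HMS entry with no catalog and no metadata_location: LOCATION is loaded
  -- directly via HadoopTables
  CREATE EXTERNAL TABLE db.t3 (id BIGINT)
  STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
  LOCATION 'hdfs://namenode:8020/warehouse/db/t3';
  -- a table tracked by the default HiveCatalog simply carries its
  -- metadata_location and needs neither of the above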
>>>>>>>>>>>>>>>>>>> This would make it possible to access all types of
>>>>>>>>>>>>>>>>>>> Iceberg tables in the same query, and would match how Spark and Flink
>>>>>>>>>>>>>>>>>>> configure catalogs. Other than the configuration above, I don’t think that
>>>>>>>>>>>>>>>>>>> config in HMS should be used at all, like how the other engines work.
>>>>>>>>>>>>>>>>>>> Iceberg is the source of truth for table metadata, HMS stores how to load
>>>>>>>>>>>>>>>>>>> the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This isn’t quite how configuration works right now.
>>>>>>>>>>>>>>>>>>> Currently, the catalog is controlled by a HiveConf property,
>>>>>>>>>>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables
>>>>>>>>>>>>>>>>>>> will be used to load table locations. If it is set, then that catalog will
>>>>>>>>>>>>>>>>>>> be used to load all tables by name. This makes it impossible to load tables
>>>>>>>>>>>>>>>>>>> from different catalogs at the same time. That’s why I think the Iceberg
>>>>>>>>>>>>>>>>>>> catalog for a table should be stored in HMS table properties.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I should also explain iceberg.hive.engine.enabled flag,
>>>>>>>>>>>>>>>>>>> but I think this is long enough for now.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>>>>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I would like to start a discussion, how should we
>>>>>>>>>>>>>>>>>>>> handle properties from various sources like Iceberg, Hive or global
>>>>>>>>>>>>>>>>>>>> configuration. I've put together a short document
>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> Netflix
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>

Re: Iceberg/Hive properties handling

Posted by Jacques Nadeau <ja...@gmail.com>.
Hey Peter,

Despite best intentions I made only nominal progress.



On Sun, Jan 10, 2021 at 10:33 PM Peter Vary <pv...@cloudera.com.invalid>
wrote:

> Hi Team,
>
> @Jacques Nadeau <ja...@dremio.com>: you mentioned that you might
> consolidate the thoughts in a document for the path forward. Did you have
> time for that, or the holidays overwritten all of the plans as usual :)
>
> Orher: Ryan convinced me that it would be good to move forward with the
> synchronised Hive-Iceberg property list whenever it is possible, and use
> the Iceberg Table properties as a master when not. This would be the
> solution which is alligns most with the other integration solutios.
>
> Thanks, Peter
>
> Peter Vary <pv...@cloudera.com> ezt írta (időpont: 2020. dec. 10., Csü
> 8:27):
>
>> I like the strong coupling between Hive and Iceberg if we can make it
>> work. It could be beneficial for the end users, but I still have some
>> concerns.
>> We should consider the following aspects:
>> - Where has the change initiated (Hive or Spark)
>> - Which Catalog is used (HiveCatalog or other)
>> - Which Hive version is used (Hive 2/3)
>>
>> Some current constraints I think we have:
>> - There could be multiple Hive tables above a single Iceberg table with
>> most of the Catalogs (HiveCatalog being the single exception)
>> - I see no ways to propagate Spark changes for HMS if the Catalog is not
>> HiveCatalog
>> - Only Hive3 has ways to propagate changes to the Iceberg table after
>> creation
>> - Hive inserts modify the table data (one Iceberg commit) and then the
>> table metadata (another Iceberg commit). This could be suboptimal but
>> solvable.
>>
>> My feeling is that the tight coupling could work as expected with only
>> the HiveCatalog using Hive3. In every other case the Iceberg and the HMS
>> properties will deviate. That is why I think it would be easier to
>> understand for the user that Iceberg and Hive is a different system with
>> different properties.
>>
>> All that said we will use Hive3 and HiveCatalog so I think we are fine
>> with 1-on-1 mapping too.
>> If we move this way we should remove the current property filtering from
>> the HiveCatalog and from the HiveIcebergMetaHook, so we are consistent.
>>
>> Thanks, Peter
>>
>> Jacques Nadeau <ja...@dremio.com> ezt írta (időpont: 2020. dec. 9.,
>> Sze 23:01):
>>
>>> Who cares if there are a few extra properties from Hive? Users may
>>>> expect those properties to be there anyway.
>>>
>>>
>>> Yeah, what is the key argument against letting them leak? What problem
>>> are people trying to solve?
>>>
>>>
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>>
>>> On Wed, Dec 9, 2020 at 12:47 PM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> I agree that #2 doesn’t really work. I also think that #4 can’t work
>>>> either. There is no way to add a prefix for HMS properties that already
>>>> exist, so the only option is to have a list of properties to suppress,
>>>> which is option #1.
>>>>
>>>> I think that option #3 is a bad idea because it would lead to
>>>> surprising behavior for users. If a user creates a table using Hive DDL and
>>>> sets table properties, those properties should be present in the source of
>>>> truth Iceberg table. If a prefix was required to forward them to Iceberg,
>>>> that would create a situation where properties appear to be missing because
>>>> the user tried to use syntax that works for nearly every other table.
>>>>
>>>> That leaves either option #1 or doing nothing. I actually think that
>>>> there’s a strong argument to do nothing here and allow Hive and Iceberg
>>>> properties to be mixed in the Iceberg table. Who cares if there are a few
>>>> extra properties from Hive? Users may expect those properties to be there
>>>> anyway.
>>>>
>>>> rb
>>>>
>>>> On Mon, Dec 7, 2020 at 9:58 AM Jacques Nadeau jacques@dremio.com
>>>> <ht...@dremio.com> wrote:
>>>>
>>>> Hey Peter, thanks for updating the doc and your heads up in the other
>>>>> thread on your capacity to look at this before EOY.
>>>>>
>>>>> I'm going to try to create a specification document based on the
>>>>> discussion document you put together. I think there is general consensus
>>>>> around what you call "Spark-like catalog configuration" so I'd like to
>>>>> formalize that more.
>>>>>
>>>>> It seems like there is less consensus around the whitelist/blacklist
>>>>> side of things. You outline four approaches:
>>>>>
>>>>>    1. Hard coded HMS only property list
>>>>>    2. Hard coded Iceberg only property list
>>>>>    3. Prefix for Iceberg properties
>>>>>    4. Prefix for HMS only properties
>>>>>
>>>>> I generally think #2 is a no-go as it creates too much coupling
>>>>> between catalog implementations and core iceberg. It seems like Ryan Blue
>>>>> would prefer #4 (correct?). Any other strong opinions?
>>>>> --
>>>>> Jacques Nadeau
>>>>> CTO and Co-Founder, Dremio
>>>>>
>>>>>
>>>>> On Thu, Dec 3, 2020 at 9:27 AM Peter Vary <pv...@cloudera.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> As Jacques suggested (with the help of Zoltan) I have collected the
>>>>>> current state and the proposed solutions in a document:
>>>>>>
>>>>>> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
>>>>>>
>>>>>> My feeling is that we do not have a final decision, so tried to list
>>>>>> all the possible solutions.
>>>>>> Please comment!
>>>>>>
>>>>>> Thanks,
>>>>>> Peter
>>>>>>
>>>>>> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
>>>>>>
>>>>>> When I was working on the CREATE TABLE patch I found the following
>>>>>> TBLPROPERTIES on newly created tables:
>>>>>>
>>>>>>    - external.table.purge
>>>>>>    - EXTERNAL
>>>>>>    - bucketing_version
>>>>>>    - numRows
>>>>>>    - rawDataSize
>>>>>>    - totalSize
>>>>>>    - numFiles
>>>>>>    - numFileErasureCoded
>>>>>>
>>>>>>
>>>>>> I am afraid that we can not change the name of most of these
>>>>>> properties, and might not be useful to have most of them along with Iceberg
>>>>>> statistics already there. Also my feeling is that this is only the top of
>>>>>> the Iceberg (pun intended :)) so this is why I think we should be more
>>>>>> targeted way to push properties to the Iceberg tables.
>>>>>>
>>>>>> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
>>>>>>
>>>>>> Sorry, I accidentally didn’t copy the dev list on this reply.
>>>>>> Resending:
>>>>>>
>>>>>> Also I expect that we want to add Hive write specific configs to
>>>>>> table level when the general engine independent configuration is not ideal
>>>>>> for Hive, but every Hive query for a given table should use some specific
>>>>>> config.
>>>>>>
>>>>>> Hive may need configuration, but I think these should still be kept
>>>>>> in the Iceberg table. There is no reason to make Hive config inaccessible
>>>>>> from other engines. If someone wants to view all of the config for a table
>>>>>> from Spark, the Hive config should also be included right?
>>>>>>
>>>>>> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I will ask Laszlo if he wants to update his doc.
>>>>>>>
>>>>>>> I see both pros and cons of catalog definition in config files. If
>>>>>>> there is an easy default then I do not mind any of the proposed solutions.
>>>>>>>
>>>>>>> OTOH I am in favor of the "use prefix for Iceberg table properties"
>>>>>>> solution, because in Hive it is common to add new keys to the property list
>>>>>>> - no restriction is in place (I am not even sure that the currently
>>>>>>> implemented blacklist for preventing to propagate properties to Iceberg
>>>>>>> tables is complete). Also I expect that we want to add Hive write specific
>>>>>>> configs to table level when the general engine independent configuration is
>>>>>>> not ideal for Hive, but every Hive query for a given table should use some
>>>>>>> specific config.
>>>>>>>
>>>>>>> Thanks, Peter
>>>>>>>
>>>>>>> Jacques Nadeau <ja...@dremio.com> ezt írta (időpont: 2020. dec.
>>>>>>> 1., Ke 17:06):
>>>>>>>
>>>>>>>> Would someone be willing to create a document that states the
>>>>>>>> current proposal?
>>>>>>>>
>>>>>>>> It is becoming somewhat difficult to follow this thread. I also
>>>>>>>> worry that without a complete statement of the current shape that people
>>>>>>>> may be incorrectly thinking they are in alignment.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jacques Nadeau
>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <
>>>>>>>> boroknagyz@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks, Ryan. I answered inline.
>>>>>>>>>
>>>>>>>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> This sounds like a good plan overall, but I have a couple of
>>>>>>>>>> notes:
>>>>>>>>>>
>>>>>>>>>>    1. We need to keep in mind that users plug in their own
>>>>>>>>>>    catalogs, so iceberg.catalog could be a Glue or Nessie
>>>>>>>>>>    catalog, not just Hive or Hadoop. I don’t think it makes much sense to use
>>>>>>>>>>    separate hadoop.catalog and hive.catalog values. Those should just be names
>>>>>>>>>>    for catalogs configured in Configuration, i.e., via
>>>>>>>>>>    hive-site.xml. We then only need a special value for loading
>>>>>>>>>>    Hadoop tables from paths.
>>>>>>>>>>
>>>>>>>>>> About extensibility, I think the usual Hive way is to use Java
>>>>>>>>> class names. So this way the value for 'iceberg.catalog' could be e.g.
>>>>>>>>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
>>>>>>>>> would need to have a factory method that constructs the catalog object from
>>>>>>>>> a properties object (Map<String, String>). E.g.
>>>>>>>>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
>>>>>>>>> 'iceberg.catalog_location' to be present in properties.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    1. I don’t think that catalog configuration should be kept in
>>>>>>>>>>    table properties. A catalog should not be loaded for each table. So I don’t
>>>>>>>>>>    think we need iceberg.catalog_location. Instead, we should
>>>>>>>>>>    have a way to define catalogs in the Configuration for tables
>>>>>>>>>>    in the metastore to reference.
>>>>>>>>>>
>>>>>>>>>>  I think it makes sense, on the other hand it would make adding
>>>>>>>>> new catalogs more heavy-weight, i.e. now you'd need to edit configuration
>>>>>>>>> files and restart/reinit services. Maybe it can be cumbersome in some
>>>>>>>>> environments.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    1. I’d rather use a prefix to exclude properties from being
>>>>>>>>>>    passed to Iceberg than to include them. Otherwise, users don’t know what to
>>>>>>>>>>    do to pass table properties from Hive or Impala. If we exclude a prefix or
>>>>>>>>>>    specific properties, then everything but the properties reserved for
>>>>>>>>>>    locating the table are passed as the user would expect.
>>>>>>>>>>
>>>>>>>>>> I don't have a strong opinion about this, but yeah, maybe this
>>>>>>>>> behavior would cause the least surprises.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <
>>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks, Peter. I answered inline.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <
>>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Zoltan,
>>>>>>>>>>>>
>>>>>>>>>>>> Answers below:
>>>>>>>>>>>>
>>>>>>>>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>>>>>>>>>> boroknagyz@cloudera.com.INVALID> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the replies. My take for the above questions are as
>>>>>>>>>>>> follows
>>>>>>>>>>>>
>>>>>>>>>>>>    - Should 'iceberg.catalog' be a required property?
>>>>>>>>>>>>    - Yeah, I think it would be nice if this would be required
>>>>>>>>>>>>       to avoid any implicit behavior
>>>>>>>>>>>>
>>>>>>>>>>>> Currently we have a Catalogs class to get/initialize/use the
>>>>>>>>>>>> different Catalogs. At that time the decision was to use HadoopTables as a
>>>>>>>>>>>> default catalog.
>>>>>>>>>>>> It might be worthwhile to use the same class in Impala as well,
>>>>>>>>>>>> so the behavior is consistent.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Yeah, I think it'd be beneficial for us to use the Iceberg
>>>>>>>>>>> classes whenever possible. The Catalogs class is very similar to what we
>>>>>>>>>>> have currently in Impala.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>    - 'hadoop.catalog' LOCATION and catalog_location
>>>>>>>>>>>>       - In Impala we don't allow setting LOCATION for tables
>>>>>>>>>>>>       stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the
>>>>>>>>>>>>       Iceberg table's actual location. We were also thinking about using only the
>>>>>>>>>>>>       table LOCATION, and set it to the catalog location, but we also found it
>>>>>>>>>>>>       confusing.
>>>>>>>>>>>>
>>>>>>>>>>>> It could definitely work, but it is somewhat strange that we
>>>>>>>>>>>> have an external table location set to an arbitrary path, and we have a
>>>>>>>>>>>> different location generated by other configs. It would be nice to have the
>>>>>>>>>>>> real location set in the external table location as well.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Impala sets the real Iceberg table location for external tables.
>>>>>>>>>>> E.g. if the user issues
>>>>>>>>>>>
>>>>>>>>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>>>>>>>>> STORED AS ICEBERG
>>>>>>>>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>>>>>>>>
>>>>>>>>>>> 'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>>>>>>>>
>>>>>>>>>>> 'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>>>>>>>>
>>>>>>>>>>> If the end user had specified LOCATION, then Impala would have
>>>>>>>>>>> raised an error. But the above DDL statement is correct, so Impala loads
>>>>>>>>>>> the iceberg table via Iceberg API, then creates the HMS table and sets
>>>>>>>>>>> LOCATION to the Iceberg table location (something like
>>>>>>>>>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I like the flexibility of setting the table_identifier on table
>>>>>>>>>>>> level, which could help removing naming conflicts. We might want to have
>>>>>>>>>>>> this in the Iceberg Catalog implementation.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>    - 'iceberg.table_identifier' for HiveCatalog
>>>>>>>>>>>>       - Yeah, it doesn't add much if we only allow using the
>>>>>>>>>>>>       current HMS. I think it can be only useful if we are allowing external
>>>>>>>>>>>>       HMSes.
>>>>>>>>>>>>    - Moving properties to SERDEPROPERTIES
>>>>>>>>>>>>       - I see that these properties are used by the SerDe
>>>>>>>>>>>>       classes in Hive, but I feel that these properties are just not about
>>>>>>>>>>>>       serialization and deserialization. And as I see the current SERDEPROPERTIES
>>>>>>>>>>>>       are things like 'field.delim', 'separatorChar', 'quoteChar', etc. So
>>>>>>>>>>>>       properties about table loading more naturally belong to TBLPROPERTIES in my
>>>>>>>>>>>>       opinion.
>>>>>>>>>>>>
>>>>>>>>>>>> I have seen it used both ways for HBaseSerDe. (even the wiki
>>>>>>>>>>>> page uses both :) ). Since Impala prefers TBLPROPERTIES and if we start
>>>>>>>>>>>> using prefix for separating real Iceberg table properties from other
>>>>>>>>>>>> properties, then we can keep it at TBLPROPERTIES.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> In the google doc I also had a comment about prefixing iceberg
>>>>>>>>>>> table properties. We could use a prefix like 'iceberg.tblproperties.', and
>>>>>>>>>>> pass every property with this prefix to the Iceberg table. Currently Impala
>>>>>>>>>>> passes every table property to the Iceberg table.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>     Zoltan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <
>>>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Based on the discussion below I understand we have the
>>>>>>>>>>>>> following kinds of properties:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    1. Iceberg table properties - Engine independent, storage
>>>>>>>>>>>>>    related parameters
>>>>>>>>>>>>>    2. "how to get to" - I think these are mostly Hive table
>>>>>>>>>>>>>    specific properties, since for Spark, the Spark catalog configuration
>>>>>>>>>>>>>    serves for the same purpose. I think the best place for storing these would
>>>>>>>>>>>>>    be the Hive SERDEPROPERTIES, as this describes the access information for
>>>>>>>>>>>>>    the SerDe. Sidenote: I think we should decide if we allow
>>>>>>>>>>>>>    HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier'
>>>>>>>>>>>>>    would make sense only if we allow having multiple catalogs.
>>>>>>>>>>>>>    3. Query specific properties - These are engine specific
>>>>>>>>>>>>>    and might be mapped to / even override the Iceberg table properties on the
>>>>>>>>>>>>>    engine specific code paths, but currently these properties have independent
>>>>>>>>>>>>>    names and mapped on a case-by-case basis.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Based on this:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - Shall we move the "how to get to" properties to
>>>>>>>>>>>>>    SERDEPROPERTIES?
>>>>>>>>>>>>>    - Shall we define a prefix for setting Iceberg table
>>>>>>>>>>>>>    properties from Hive queries and omitting other engine specific properties?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I like these suggestions, comments inline below on the last
>>>>>>>>>>>>> round...
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <
>>>>>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>>>>>>>>>> information about table loading in HMS table properties. We are just a bit
>>>>>>>>>>>>>> more explicit about which catalog to use.
>>>>>>>>>>>>>> We have table property 'iceberg.catalog' to determine the
>>>>>>>>>>>>>> catalog type, right now the supported values are 'hadoop.tables',
>>>>>>>>>>>>>> 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be
>>>>>>>>>>>>>> set based on the catalog type.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm all for renaming this, having "mr" in the property name is
>>>>>>>>>>>>> confusing.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - hadoop.tables
>>>>>>>>>>>>>>       - the table location is used to load the table
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The only question I have is should we have this as the
>>>>>>>>>>>>> default? i.e. if you don't set a catalog it will assume its HadoopTables
>>>>>>>>>>>>> and use the location? Or should we require this property to be here to be
>>>>>>>>>>>>> consistent and avoid any "magic"?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - hadoop.catalog
>>>>>>>>>>>>>>       - Required table property 'iceberg.catalog_location'
>>>>>>>>>>>>>>       specifies the location of the hadoop catalog in the file system
>>>>>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>>>       is used as table identifier
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I like this as it would allow you to use a different database
>>>>>>>>>>>>> and table name in Hive as opposed to the Hadoop Catalog - at the moment
>>>>>>>>>>>>> they have to match. The only thing here is that I think Hive requires a
>>>>>>>>>>>>> table LOCATION to be set and it's then confusing as there are now two
>>>>>>>>>>>>> locations on the table. I'm not sure whether in the Hive storage handler or
>>>>>>>>>>>>> SerDe etc. we can get Hive to not require that and maybe even disallow it
>>>>>>>>>>>>> from being set. That would probably be best in conjunction with this.
>>>>>>>>>>>>> Another solution would be to not have the 'iceberg.catalog_location'
>>>>>>>>>>>>> property but instead use the table LOCATION for this but that's a bit
>>>>>>>>>>>>> confusing from a Hive point of view.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - hive.catalog
>>>>>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>>>       is used as table identifier
>>>>>>>>>>>>>>       - We have the assumption that the current Hive
>>>>>>>>>>>>>>       metastore stores the table, i.e. we don't support external Hive
>>>>>>>>>>>>>>       metastores currently
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> These sound fine for Hive catalog tables that are created
>>>>>>>>>>>>> outside of the automatic Hive table creation (see
>>>>>>>>>>>>> https://iceberg.apache.org/hive/ -> Using Hive Catalog) we'd
>>>>>>>>>>>>> just need to document how you can create these yourself and that one could
>>>>>>>>>>>>> use a different Hive database and table etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Independent of catalog implementations, but we also have
>>>>>>>>>>>>>> table property 'iceberg.file_format' to specify the file format for the
>>>>>>>>>>>>>> data files.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK, I don't think we need that for Hive?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> We haven't released it yet, so we are open to changes, but I
>>>>>>>>>>>>>> think these properties are reasonable and it would be great if we could
>>>>>>>>>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>>>>>>>>>> metastore of tables.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> If others agree I think we should create an issue where we
>>>>>>>>>>>>> document the above changes so it's very clear what we're doing and can then
>>>>>>>>>>>>> go and implement them and update the docs etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>     Zoltan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <
>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> #4 is correct because we provide some information that is
>>>>>>>>>>>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>>>>>>>>>>>> current user). I also agree that it would be good to have a table
>>>>>>>>>>>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>>>>>>>>>>>> gives us a way to handle name conflicts.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <
>>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> db1.table1_etl_branch =>
>>>>>>>>>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <
>>>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I agree with Ryan on the core principles here. As I
>>>>>>>>>>>>>>>>> understand them:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>>>>>>>>>>>>>    2. Hive table properties describe "how to get to"
>>>>>>>>>>>>>>>>>    Iceberg metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>>>>>>>>>>    3. There could be default "how to get to" information
>>>>>>>>>>>>>>>>>    set at a global level
>>>>>>>>>>>>>>>>>    4. Best-effort schema should be stored in the table
>>>>>>>>>>>>>>>>>    properties in HMS. This should be done for information schema retrieval
>>>>>>>>>>>>>>>>>    purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is that a fair summary of your statements Ryan (except 4,
>>>>>>>>>>>>>>>>> which I just added)?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> One comment I have on #2 is that for different catalogs
>>>>>>>>>>>>>>>>> and use cases, I think it can be somewhat more complex where it would be
>>>>>>>>>>>>>>>>> desirable for a table that initially existed without Hive that was later
>>>>>>>>>>>>>>>>> exposed in Hive to support a ptr/path/token for how the table is named
>>>>>>>>>>>>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>>>>>>>>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>>>>>>>>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Similarly, you might want to expose a particular branch
>>>>>>>>>>>>>>>>> version of a table. So it might say:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Just saying that the address to the table in the catalog
>>>>>>>>>>>>>>>>> could itself have several properties. The key being that no matter what
>>>>>>>>>>>>>>>>> those are, we should follow #1 and only store properties that are about the
>>>>>>>>>>>>>>>>> ptr, not the content/metadata.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it.
>>>>>>>>>>>>>>>>> Can someone confirm that it is true? And that it is possible/not
>>>>>>>>>>>>>>>>> problematic?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking
>>>>>>>>>>>>>>>>>> about these problems as well, so this is a good time to have a discussion
>>>>>>>>>>>>>>>>>> about Hive config.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think that Hive configuration should work mostly like
>>>>>>>>>>>>>>>>>> other engines, where different configurations are used for different
>>>>>>>>>>>>>>>>>> purposes. Different purposes means that there is not a global configuration
>>>>>>>>>>>>>>>>>> priority. Hopefully, I can explain how we use the different config sources
>>>>>>>>>>>>>>>>>> elsewhere to clarify.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it
>>>>>>>>>>>>>>>>>> has a Hadoop Configuration, but it also has its own global configuration.
>>>>>>>>>>>>>>>>>> There are also Iceberg table properties, and all of the various Hive
>>>>>>>>>>>>>>>>>> properties if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The first step is to simplify where we can, so we
>>>>>>>>>>>>>>>>>> effectively eliminate 2 sources of config:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    - The Hadoop Configuration is only used to
>>>>>>>>>>>>>>>>>>    instantiate Hadoop classes, like FileSystem. Iceberg should not use it for
>>>>>>>>>>>>>>>>>>    any other config.
>>>>>>>>>>>>>>>>>>    - Config in the Hive MetaStore is only used to
>>>>>>>>>>>>>>>>>>    identify that a table is Iceberg and point to its metadata location. All
>>>>>>>>>>>>>>>>>>    other config in HMS is informational. For example, the input format is
>>>>>>>>>>>>>>>>>>    FileInputFormat so that non-Iceberg readers cannot actually instantiate the
>>>>>>>>>>>>>>>>>>    format (it’s abstract) but it is available so they also don’t fail trying
>>>>>>>>>>>>>>>>>>    to load the class. Table-specific config should not be stored in table or
>>>>>>>>>>>>>>>>>>    serde properties.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> That leaves Spark configuration and Iceberg table
>>>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Iceberg differs from other tables because it is
>>>>>>>>>>>>>>>>>> opinionated: data configuration should be maintained at the table level.
>>>>>>>>>>>>>>>>>> This is cleaner for users because config is standardized across engines and
>>>>>>>>>>>>>>>>>> in one place. And it also enables services that analyze a table and update
>>>>>>>>>>>>>>>>>> its configuration to tune options that users almost never do, like row
>>>>>>>>>>>>>>>>>> group or stripe size in the columnar formats. Iceberg table configuration
>>>>>>>>>>>>>>>>>> is used to configure table-specific concerns and behavior.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Spark configuration is used for engine-specific concerns,
>>>>>>>>>>>>>>>>>> and runtime overrides. A good example of an engine-specific concern is the
>>>>>>>>>>>>>>>>>> catalogs that are available to load Iceberg tables. Spark has a way to load
>>>>>>>>>>>>>>>>>> and configure catalog implementations and Iceberg uses that for all
>>>>>>>>>>>>>>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>>>>>>>>>>>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>>>>>>>>>>>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>>>>>>>>>>>>>>> passed to the individual read. Note that these necessarily have different
>>>>>>>>>>>>>>>>>> config names for how they are used: Iceberg uses
>>>>>>>>>>>>>>>>>> read.split.target-size and the read-specific option is
>>>>>>>>>>>>>>>>>> target-size.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Applying this to Hive is a little strange for a couple
>>>>>>>>>>>>>>>>>> reasons. First, Hive’s engine configuration *is* a
>>>>>>>>>>>>>>>>>> Hadoop Configuration. As a result, I think the right place to store
>>>>>>>>>>>>>>>>>> engine-specific config is there, including Iceberg catalogs using a
>>>>>>>>>>>>>>>>>> strategy similar to what Spark does: what external Iceberg catalogs are
>>>>>>>>>>>>>>>>>> available and their configuration should come from the HiveConf.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The second way Hive is strange is that Hive needs to use
>>>>>>>>>>>>>>>>>> its own MetaStore to track Hive table concerns. The MetaStore may have
>>>>>>>>>>>>>>>>>> tables created by an Iceberg HiveCatalog, and Hive also needs to be able to
>>>>>>>>>>>>>>>>>> load tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    - There should be a default HiveCatalog that uses the
>>>>>>>>>>>>>>>>>>    current MetaStore URI to be used for HiveCatalog tables tracked in the
>>>>>>>>>>>>>>>>>>    MetaStore
>>>>>>>>>>>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>>>>>>>>>>>    - HMS table properties should be used to determine
>>>>>>>>>>>>>>>>>>    how to load a table: using a Hadoop location, using the default metastore
>>>>>>>>>>>>>>>>>>    catalog, or using an external Iceberg catalog
>>>>>>>>>>>>>>>>>>       - If there is a metadata_location, then use the
>>>>>>>>>>>>>>>>>>       HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>>>>>>>>>       - If there is a catalog property, then load that
>>>>>>>>>>>>>>>>>>       catalog and use it to load the table identifier, or maybe an identifier
>>>>>>>>>>>>>>>>>>       from HMS table properties
>>>>>>>>>>>>>>>>>>       - If there is no catalog or metadata_location,
>>>>>>>>>>>>>>>>>>       then use HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This would make it possible to access all types of
>>>>>>>>>>>>>>>>>> Iceberg tables in the same query, and would match how Spark and Flink
>>>>>>>>>>>>>>>>>> configure catalogs. Other than the configuration above, I don’t think that
>>>>>>>>>>>>>>>>>> config in HMS should be used at all, like how the other engines work.
>>>>>>>>>>>>>>>>>> Iceberg is the source of truth for table metadata, HMS stores how to load
>>>>>>>>>>>>>>>>>> the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
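>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> As a rough sketch (none of these property names are final, and the
>>>>>>>>>>>>>>>>>> catalog name, paths and identifier are made up), the three loading
>>>>>>>>>>>>>>>>>> modes could look like this in the HMS table properties:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -- tracked by this metastore's HiveCatalog
>>>>>>>>>>>>>>>>>> TBLPROPERTIES ('metadata_location'='/warehouse/db/tbl/metadata/00003-abc.metadata.json')
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -- loaded through an external Iceberg catalog defined in HiveConf
>>>>>>>>>>>>>>>>>> TBLPROPERTIES ('iceberg.catalog'='nessie_prod',
>>>>>>>>>>>>>>>>>>                'iceberg.table_identifier'='folder1.folder2.table1')
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -- neither is set: fall back to HadoopTables and the table LOCATION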
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This isn’t quite how configuration works right now.
>>>>>>>>>>>>>>>>>> Currently, the catalog is controlled by a HiveConf property,
>>>>>>>>>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables will
>>>>>>>>>>>>>>>>>> be used to load table locations. If it is set, then that catalog will be
>>>>>>>>>>>>>>>>>> used to load all tables by name. This makes it impossible to load tables
>>>>>>>>>>>>>>>>>> from different catalogs at the same time. That’s why I think the Iceberg
>>>>>>>>>>>>>>>>>> catalog for a table should be stored in HMS table properties.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I should also explain iceberg.hive.engine.enabled flag,
>>>>>>>>>>>>>>>>>> but I think this is long enough for now.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>>>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I would like to start a discussion, how should we handle
>>>>>>>>>>>>>>>>>>> properties from various sources like Iceberg, Hive or global configuration.
>>>>>>>>>>>>>>>>>>> I've put together a short document
>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>>
>>>>>>
>>>>>>

Re: Iceberg/Hive properties handling

Posted by Peter Vary <pv...@cloudera.com.INVALID>.
Hi Team,

@Jacques Nadeau <ja...@dremio.com>: you mentioned that you might
consolidate the thoughts in a document for the path forward. Did you have
time for that, or have the holidays overwritten all of the plans as usual :)

Other: Ryan convinced me that it would be good to move forward with the
synchronised Hive-Iceberg property list whenever it is possible, and use
the Iceberg Table properties as a master when not. This would be the
solution which aligns most with the other integration solutions.

Thanks, Peter

Peter Vary <pv...@cloudera.com> wrote (on Dec 10, 2020, Thu 8:27):

> I like the strong coupling between Hive and Iceberg if we can make it
> work. It could be beneficial for the end users, but I still have some
> concerns.
> We should consider the following aspects:
> - Where has the change initiated (Hive or Spark)
> - Which Catalog is used (HiveCatalog or other)
> - Which Hive version is used (Hive 2/3)
>
> Some current constraints I think we have:
> - There could be multiple Hive tables above a single Iceberg table with
> most of the Catalogs (HiveCatalog being the single exception)
> - I see no ways to propagate Spark changes for HMS if the Catalog is not
> HiveCatalog
> - Only Hive3 has ways to propagate changes to the Iceberg table after
> creation
> - Hive inserts modify the table data (one Iceberg commit) and then the
> table metadata (another Iceberg commit). This could be suboptimal but
> solvable.
>
> My feeling is that the tight coupling could work as expected with only the
> HiveCatalog using Hive3. In every other case the Iceberg and the HMS
> properties will deviate. That is why I think it would be easier to
> understand for the user that Iceberg and Hive are different systems with
> different properties.
>
> All that said we will use Hive3 and HiveCatalog so I think we are fine
> with 1-on-1 mapping too.
> If we move this way we should remove the current property filtering from
> the HiveCatalog and from the HiveIcebergMetaHook, so we are consistent.
>
> Thanks, Peter
>
> Jacques Nadeau <ja...@dremio.com> wrote (on Dec 9, 2020, Wed 23:01):
>
>> Who cares if there are a few extra properties from Hive? Users may expect
>>> those properties to be there anyway.
>>
>>
>> Yeah, what is the key argument against letting them leak? What problem
>> are people trying to solve?
>>
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>
>> On Wed, Dec 9, 2020 at 12:47 PM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> I agree that #2 doesn’t really work. I also think that #4 can’t work
>>> either. There is no way to add a prefix for HMS properties that already
>>> exist, so the only option is to have a list of properties to suppress,
>>> which is option #1.
>>>
>>> I think that option #3 is a bad idea because it would lead to surprising
>>> behavior for users. If a user creates a table using Hive DDL and sets table
>>> properties, those properties should be present in the source of truth
>>> Iceberg table. If a prefix was required to forward them to Iceberg, that
>>> would create a situation where properties appear to be missing because the
>>> user tried to use syntax that works for nearly every other table.
>>>
>>> That leaves either option #1 or doing nothing. I actually think that
>>> there’s a strong argument to do nothing here and allow Hive and Iceberg
>>> properties to be mixed in the Iceberg table. Who cares if there are a few
>>> extra properties from Hive? Users may expect those properties to be there
>>> anyway.
>>>
>>> rb
>>>
>>> On Mon, Dec 7, 2020 at 9:58 AM Jacques Nadeau jacques@dremio.com
>>> <ht...@dremio.com> wrote:
>>>
>>> Hey Peter, thanks for updating the doc and your heads up in the other
>>>> thread on your capacity to look at this before EOY.
>>>>
>>>> I'm going to try to create a specification document based on the
>>>> discussion document you put together. I think there is general consensus
>>>> around what you call "Spark-like catalog configuration" so I'd like to
>>>> formalize that more.
>>>>
>>>> It seems like there is less consensus around the whitelist/blacklist
>>>> side of things. You outline four approaches:
>>>>
>>>>    1. Hard coded HMS only property list
>>>>    2. Hard coded Iceberg only property list
>>>>    3. Prefix for Iceberg properties
>>>>    4. Prefix for HMS only properties
>>>>
>>>> I generally think #2 is a no-go as it creates too much coupling between
>>>> catalog implementations and core iceberg. It seems like Ryan Blue would
>>>> prefer #4 (correct?). Any other strong opinions?
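>>>>
>>>> To make #3 and #4 concrete, a sketch of what each would mean at the DDL
>>>> level (the prefixes here are invented purely for illustration):
>>>>
>>>> -- #3: only prefixed keys are forwarded to the Iceberg table
>>>> ALTER TABLE t SET TBLPROPERTIES ('iceberg.commit.retry.num-retries'='10'); -- forwarded as commit.retry.num-retries
>>>> ALTER TABLE t SET TBLPROPERTIES ('some.engine.knob'='x');                  -- stays in HMS only
>>>>
>>>> -- #4: everything is forwarded unless it carries the HMS-only prefix
>>>> ALTER TABLE t SET TBLPROPERTIES ('commit.retry.num-retries'='10');         -- forwarded as-is
>>>> ALTER TABLE t SET TBLPROPERTIES ('hms-only.some.engine.knob'='x');         -- kept out of the Iceberg table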
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
>>>>
>>>>
>>>> On Thu, Dec 3, 2020 at 9:27 AM Peter Vary <pv...@cloudera.com.invalid>
>>>> wrote:
>>>>
>>>>> As Jacques suggested (with the help of Zoltan) I have collected the
>>>>> current state and the proposed solutions in a document:
>>>>>
>>>>> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
>>>>>
>>>>> My feeling is that we do not have a final decision, so I tried to list
>>>>> all the possible solutions.
>>>>> Please comment!
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
>>>>>
>>>>> When I was working on the CREATE TABLE patch I found the following
>>>>> TBLPROPERTIES on newly created tables:
>>>>>
>>>>>    - external.table.purge
>>>>>    - EXTERNAL
>>>>>    - bucketing_version
>>>>>    - numRows
>>>>>    - rawDataSize
>>>>>    - totalSize
>>>>>    - numFiles
>>>>>    - numFileErasureCoded
>>>>>
>>>>>
>>>>> I am afraid that we cannot change the name of most of these
>>>>> properties, and it might not be useful to have most of them, since the Iceberg
>>>>> statistics are already there. Also my feeling is that this is only the tip of
>>>>> the Iceberg (pun intended :)), so I think we should have a more
>>>>> targeted way to push properties to the Iceberg tables.
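>>>>>
>>>>> To illustrate: even a minimal statement like the one below (a sketch, the
>>>>> column is only an example) ends up with EXTERNAL, bucketing_version and the
>>>>> various statistics keys in HMS without the user asking for any of them, so
>>>>> without filtering they would all flow into the Iceberg table properties:
>>>>>
>>>>> CREATE TABLE default.ice_demo (id BIGINT)
>>>>> STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
>>>>> -- HMS then adds e.g. EXTERNAL='TRUE', bucketing_version='2',
>>>>> -- numFiles/numRows/rawDataSize/totalSize on top of whatever the user set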
>>>>>
>>>>> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>> Sorry, I accidentally didn’t copy the dev list on this reply.
>>>>> Resending:
>>>>>
>>>>> Also I expect that we want to add Hive write specific configs to table
>>>>> level when the general engine independent configuration is not ideal for
>>>>> Hive, but every Hive query for a given table should use some specific
>>>>> config.
>>>>>
>>>>> Hive may need configuration, but I think these should still be kept in
>>>>> the Iceberg table. There is no reason to make Hive config inaccessible from
>>>>> other engines. If someone wants to view all of the config for a table from
>>>>> Spark, the Hive config should also be included right?
>>>>>
>>>>> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com> wrote:
>>>>>
>>>>>> I will ask Laszlo if he wants to update his doc.
>>>>>>
>>>>>> I see both pros and cons of catalog definition in config files. If
>>>>>> there is an easy default then I do not mind any of the proposed solutions.
>>>>>>
>>>>>> OTOH I am in favor of the "use prefix for Iceberg table properties"
>>>>>> solution, because in Hive it is common to add new keys to the property list
>>>>>> - no restriction is in place (I am not even sure that the currently
>>>>>> implemented blacklist for preventing to propagate properties to Iceberg
>>>>>> implemented blacklist for preventing the propagation of properties to Iceberg
>>>>>> configs to table level when the general engine independent configuration is
>>>>>> not ideal for Hive, but every Hive query for a given table should use some
>>>>>> specific config.
>>>>>>
>>>>>> Thanks, Peter
>>>>>>
>>>>>> Jacques Nadeau <ja...@dremio.com> wrote (on Dec 1, 2020, Tue 17:06):
>>>>>>
>>>>>>> Would someone be willing to create a document that states the
>>>>>>> current proposal?
>>>>>>>
>>>>>>> It is becoming somewhat difficult to follow this thread. I also
>>>>>>> worry that without a complete statement of the current shape that people
>>>>>>> may be incorrectly thinking they are in alignment.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jacques Nadeau
>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <
>>>>>>> boroknagyz@cloudera.com> wrote:
>>>>>>>
>>>>>>>> Thanks, Ryan. I answered inline.
>>>>>>>>
>>>>>>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This sounds like a good plan overall, but I have a couple of notes:
>>>>>>>>>
>>>>>>>>>    1. We need to keep in mind that users plug in their own
>>>>>>>>>    catalogs, so iceberg.catalog could be a Glue or Nessie
>>>>>>>>>    catalog, not just Hive or Hadoop. I don’t think it makes much sense to use
>>>>>>>>>    separate hadoop.catalog and hive.catalog values. Those should just be names
>>>>>>>>>    for catalogs configured in Configuration, i.e., via
>>>>>>>>>    hive-site.xml. We then only need a special value for loading
>>>>>>>>>    Hadoop tables from paths.
>>>>>>>>>
>>>>>>>>> About extensibility, I think the usual Hive way is to use Java
>>>>>>>> class names. So this way the value for 'iceberg.catalog' could be e.g.
>>>>>>>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
>>>>>>>> would need to have a factory method that constructs the catalog object from
>>>>>>>> a properties object (Map<String, String>). E.g.
>>>>>>>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
>>>>>>>> 'iceberg.catalog_location' to be present in properties.
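>>>>>>>>
>>>>>>>> On the DDL side that idea would look something like this (only a sketch,
>>>>>>>> the paths and the identifier are made up):
>>>>>>>>
>>>>>>>> CREATE EXTERNAL TABLE db.t (id BIGINT)
>>>>>>>> STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
>>>>>>>> TBLPROPERTIES (
>>>>>>>>   'iceberg.catalog'='org.apache.iceberg.hadoop.HadoopCatalog',
>>>>>>>>   'iceberg.catalog_location'='/warehouse/hadoop_catalog',
>>>>>>>>   'iceberg.table_identifier'='ns1.ns2.t');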
>>>>>>>>
>>>>>>>>>
>>>>>>>>>    1. I don’t think that catalog configuration should be kept in
>>>>>>>>>    table properties. A catalog should not be loaded for each table. So I don’t
>>>>>>>>>    think we need iceberg.catalog_location. Instead, we should
>>>>>>>>>    have a way to define catalogs in the Configuration for tables
>>>>>>>>>    in the metastore to reference.
>>>>>>>>>
>>>>>>>>>  I think it makes sense, on the other hand it would make adding
>>>>>>>> new catalogs more heavy-weight, i.e. now you'd need to edit configuration
>>>>>>>> files and restart/reinit services. Maybe it can be cumbersome in some
>>>>>>>> environments.
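>>>>>>>>
>>>>>>>> For comparison, the Configuration-based approach could look roughly like
>>>>>>>> this (the key names are made up just to show the shape, e.g. in
>>>>>>>> hive-site.xml or per session), with the table only referencing the
>>>>>>>> catalog by name:
>>>>>>>>
>>>>>>>> SET iceberg.catalog.lake.type=hadoop;
>>>>>>>> SET iceberg.catalog.lake.warehouse=hdfs://namenode:8020/warehouse/iceberg;
>>>>>>>> SET iceberg.catalog.prod_hive.type=hive;
>>>>>>>> SET iceberg.catalog.prod_hive.uri=thrift://metastore:9083;
>>>>>>>>
>>>>>>>> ALTER TABLE db.t SET TBLPROPERTIES ('iceberg.catalog'='lake');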
>>>>>>>>
>>>>>>>>>
>>>>>>>>>    1. I’d rather use a prefix to exclude properties from being
>>>>>>>>>    passed to Iceberg than to include them. Otherwise, users don’t know what to
>>>>>>>>>    do to pass table properties from Hive or Impala. If we exclude a prefix or
>>>>>>>>>    specific properties, then everything but the properties reserved for
>>>>>>>>>    locating the table are passed as the user would expect.
>>>>>>>>>
>>>>>>>>> I don't have a strong opinion about this, but yeah, maybe this
>>>>>>>> behavior would cause the least surprises.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <
>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks, Peter. I answered inline.
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <
>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Zoltan,
>>>>>>>>>>>
>>>>>>>>>>> Answers below:
>>>>>>>>>>>
>>>>>>>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>>>>>>>>> boroknagyz@cloudera.com.INVALID> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the replies. My take on the above questions is as
>>>>>>>>>>> follows
>>>>>>>>>>>
>>>>>>>>>>>    - Should 'iceberg.catalog' be a required property?
>>>>>>>>>>>    - Yeah, I think it would be nice if this would be required
>>>>>>>>>>>       to avoid any implicit behavior
>>>>>>>>>>>
>>>>>>>>>>> Currently we have a Catalogs class to get/initialize/use the
>>>>>>>>>>> different Catalogs. At that time the decision was to use HadoopTables as a
>>>>>>>>>>> default catalog.
>>>>>>>>>>> It might be worthwhile to use the same class in Impala as well,
>>>>>>>>>>> so the behavior is consistent.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yeah, I think it'd be beneficial for us to use the Iceberg
>>>>>>>>>> classes whenever possible. The Catalogs class is very similar to what we
>>>>>>>>>> have currently in Impala.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - 'hadoop.catalog' LOCATION and catalog_location
>>>>>>>>>>>       - In Impala we don't allow setting LOCATION for tables
>>>>>>>>>>>       stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the
>>>>>>>>>>>       Iceberg table's actual location. We were also thinking about using only the
>>>>>>>>>>>       table LOCATION, and set it to the catalog location, but we also found it
>>>>>>>>>>>       confusing.
>>>>>>>>>>>
>>>>>>>>>>> It could definitely work, but it is somewhat strange that we
>>>>>>>>>>> have an external table location set to an arbitrary path, and we have a
>>>>>>>>>>> different location generated by other configs. It would be nice to have the
>>>>>>>>>>> real location set in the external table location as well.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Impala sets the real Iceberg table location for external tables.
>>>>>>>>>> E.g. if the user issues
>>>>>>>>>>
>>>>>>>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>>>>>>>> STORED AS ICEBERG
>>>>>>>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>>>>>>>
>>>>>>>>>> 'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>>>>>>>
>>>>>>>>>> 'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>>>>>>>
>>>>>>>>>> If the end user had specified LOCATION, then Impala would have
>>>>>>>>>> raised an error. But the above DDL statement is correct, so Impala loads
>>>>>>>>>> the iceberg table via Iceberg API, then creates the HMS table and sets
>>>>>>>>>> LOCATION to the Iceberg table location (something like
>>>>>>>>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I like the flexibility of setting the table_identifier on table
>>>>>>>>>>> level, which could help removing naming conflicts. We might want to have
>>>>>>>>>>> this in the Iceberg Catalog implementation.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - 'iceberg.table_identifier' for HiveCatalog
>>>>>>>>>>>       - Yeah, it doesn't add much if we only allow using the
>>>>>>>>>>>       current HMS. I think it can be only useful if we are allowing external
>>>>>>>>>>>       HMSes.
>>>>>>>>>>>    - Moving properties to SERDEPROPERTIES
>>>>>>>>>>>       - I see that these properties are used by the SerDe
>>>>>>>>>>>       classes in Hive, but I feel that these properties are just not about
>>>>>>>>>>>       serialization and deserialization. And as I see the current SERDEPROPERTIES
>>>>>>>>>>>       are things like 'field.delim', 'separatorChar', 'quoteChar', etc. So
>>>>>>>>>>>       properties about table loading more naturally belong to TBLPROPERTIES in my
>>>>>>>>>>>       opinion.
>>>>>>>>>>>
>>>>>>>>>>> I have seen it used both ways for HBaseSerDe. (even the wiki
>>>>>>>>>>> page uses both :) ). Since Impala prefers TBLPROPERTIES and if we start
>>>>>>>>>>> using prefix for separating real Iceberg table properties from other
>>>>>>>>>>> properties, then we can keep it at TBLPROPERTIES.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In the google doc I also had a comment about prefixing iceberg
>>>>>>>>>> table properties. We could use a prefix like 'iceberg.tblproperties.', and
>>>>>>>>>> pass every property with this prefix to the Iceberg table. Currently Impala
>>>>>>>>>> passes every table property to the Iceberg table.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>     Zoltan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <
>>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Based on the discussion below I understand we have the
>>>>>>>>>>>> following kinds of properties:
>>>>>>>>>>>>
>>>>>>>>>>>>    1. Iceberg table properties - Engine independent, storage
>>>>>>>>>>>>    related parameters
>>>>>>>>>>>>    2. "how to get to" - I think these are mostly Hive table
>>>>>>>>>>>>    specific properties, since for Spark, the Spark catalog configuration
>>>>>>>>>>>>    serves for the same purpose. I think the best place for storing these would
>>>>>>>>>>>>    be the Hive SERDEPROPERTIES, as this describes the access information for
>>>>>>>>>>>>    the SerDe. Sidenote: I think we should decide if we allow
>>>>>>>>>>>>    HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier'
>>>>>>>>>>>>    would make sense only if we allow having multiple catalogs.
>>>>>>>>>>>>    3. Query specific properties - These are engine specific
>>>>>>>>>>>>    and might be mapped to / even override the Iceberg table properties on the
>>>>>>>>>>>>    engine specific code paths, but currently these properties have independent
>>>>>>>>>>>>    names and are mapped on a case-by-case basis.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Based on this:
>>>>>>>>>>>>
>>>>>>>>>>>>    - Shall we move the "how to get to" properties to
>>>>>>>>>>>>    SERDEPROPERTIES?
>>>>>>>>>>>>    - Shall we define a prefix for setting Iceberg table
>>>>>>>>>>>>    properties from Hive queries and omitting other engine specific properties?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Peter
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I like these suggestions, comments inline below on the last
>>>>>>>>>>>> round...
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <
>>>>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>>>>>>>>> information about table loading in HMS table properties. We are just a bit
>>>>>>>>>>>>> more explicit about which catalog to use.
>>>>>>>>>>>>> We have table property 'iceberg.catalog' to determine the
>>>>>>>>>>>>> catalog type, right now the supported values are 'hadoop.tables',
>>>>>>>>>>>>> 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be
>>>>>>>>>>>>> set based on the catalog type.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I'm all for renaming this, having "mr" in the property name is
>>>>>>>>>>>> confusing.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - hadoop.tables
>>>>>>>>>>>>>       - the table location is used to load the table
>>>>>>>>>>>>>
>>>>>>>>>>>>> The only question I have is should we have this as the
>>>>>>>>>>>> default? i.e. if you don't set a catalog it will assume it's HadoopTables
>>>>>>>>>>>> and use the location? Or should we require this property to be here to be
>>>>>>>>>>>> consistent and avoid any "magic"?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - hadoop.catalog
>>>>>>>>>>>>>       - Required table property 'iceberg.catalog_location'
>>>>>>>>>>>>>       specifies the location of the hadoop catalog in the file system
>>>>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>>       is used as table identifier
>>>>>>>>>>>>>
>>>>>>>>>>>>> I like this as it would allow you to use a different database
>>>>>>>>>>>> and table name in Hive as opposed to the Hadoop Catalog - at the moment
>>>>>>>>>>>> they have to match. The only thing here is that I think Hive requires a
>>>>>>>>>>>> table LOCATION to be set and it's then confusing as there are now two
>>>>>>>>>>>> locations on the table. I'm not sure whether in the Hive storage handler or
>>>>>>>>>>>> SerDe etc. we can get Hive to not require that and maybe even disallow it
>>>>>>>>>>>> from being set. That would probably be best in conjunction with this.
>>>>>>>>>>>> Another solution would be to not have the 'iceberg.catalog_location'
>>>>>>>>>>>> property but instead use the table LOCATION for this but that's a bit
>>>>>>>>>>>> confusing from a Hive point of view.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>    - hive.catalog
>>>>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>>       is used as table identifier
>>>>>>>>>>>>>       - We have the assumption that the current Hive
>>>>>>>>>>>>>       metastore stores the table, i.e. we don't support external Hive
>>>>>>>>>>>>>       metastores currently
>>>>>>>>>>>>>
>>>>>>>>>>>>> These sound fine for Hive catalog tables that are created
>>>>>>>>>>>> outside of the automatic Hive table creation (see
>>>>>>>>>>>> https://iceberg.apache.org/hive/ -> Using Hive Catalog) we'd
>>>>>>>>>>>> just need to document how you can create these yourself and that one could
>>>>>>>>>>>> use a different Hive database and table etc.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Independent of catalog implementations, but we also have table
>>>>>>>>>>>>> property 'iceberg.file_format' to specify the file format for the data
>>>>>>>>>>>>> files.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> OK, I don't think we need that for Hive?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> We haven't released it yet, so we are open to changes, but I
>>>>>>>>>>>>> think these properties are reasonable and it would be great if we could
>>>>>>>>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>>>>>>>>> metastore of tables.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> If others agree I think we should create an issue where we
>>>>>>>>>>>> document the above changes so it's very clear what we're doing and can then
>>>>>>>>>>>> go and implement them and update the docs etc.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>     Zoltan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <
>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #4 is correct because we provide some information that is
>>>>>>>>>>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>>>>>>>>>>> current user). I also agree that it would be good to have a table
>>>>>>>>>>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>>>>>>>>>>> gives us a way to handle name conflicts.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <
>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> db1.table1_etl_branch =>
>>>>>>>>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <
>>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree with Ryan on the core principles here. As I
>>>>>>>>>>>>>>>> understand them:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>>>>>>>>>>>>    2. Hive table properties describe "how to get to"
>>>>>>>>>>>>>>>>    Iceberg metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>>>>>>>>>    3. There could be default "how to get to" information
>>>>>>>>>>>>>>>>    set at a global level
>>>>>>>>>>>>>>>>    4. Best-effort schema should be stored in the table
>>>>>>>>>>>>>>>>    properties in HMS. This should be done for information schema retrieval
>>>>>>>>>>>>>>>>    purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is that a fair summary of your statements Ryan (except 4,
>>>>>>>>>>>>>>>> which I just added)?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One comment I have on #2 is that for different catalogs and
>>>>>>>>>>>>>>>> use cases, I think it can be somewhat more complex where it would be
>>>>>>>>>>>>>>>> desirable for a table that initially existed without Hive that was later
>>>>>>>>>>>>>>>> exposed in Hive to support a ptr/path/token for how the table is named
>>>>>>>>>>>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>>>>>>>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>>>>>>>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Similarly, you might want to expose a particular branch
>>>>>>>>>>>>>>>> version of a table. So it might say:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Just saying that the address to the table in the catalog
>>>>>>>>>>>>>>>> could itself have several properties. The key being that no matter what
>>>>>>>>>>>>>>>> those are, we should follow #1 and only store properties that are about the
>>>>>>>>>>>>>>>> ptr, not the content/metadata.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can
>>>>>>>>>>>>>>>> someone confirm that it is true? And that it is possible/not problematic?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking
>>>>>>>>>>>>>>>>> about these problems as well, so this is a good time to have a discussion
>>>>>>>>>>>>>>>>> about Hive config.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think that Hive configuration should work mostly like
>>>>>>>>>>>>>>>>> other engines, where different configurations are used for different
>>>>>>>>>>>>>>>>> purposes. Different purposes means that there is not a global configuration
>>>>>>>>>>>>>>>>> priority. Hopefully, I can explain how we use the different config sources
>>>>>>>>>>>>>>>>> elsewhere to clarify.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it
>>>>>>>>>>>>>>>>> has a Hadoop Configuration, but it also has its own global configuration.
>>>>>>>>>>>>>>>>> There are also Iceberg table properties, and all of the various Hive
>>>>>>>>>>>>>>>>> properties if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The first step is to simplify where we can, so we
>>>>>>>>>>>>>>>>> effectively eliminate 2 sources of config:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - The Hadoop Configuration is only used to instantiate
>>>>>>>>>>>>>>>>>    Hadoop classes, like FileSystem. Iceberg should not use it for any other
>>>>>>>>>>>>>>>>>    config.
>>>>>>>>>>>>>>>>>    - Config in the Hive MetaStore is only used to
>>>>>>>>>>>>>>>>>    identify that a table is Iceberg and point to its metadata location. All
>>>>>>>>>>>>>>>>>    other config in HMS is informational. For example, the input format is
>>>>>>>>>>>>>>>>>    FileInputFormat so that non-Iceberg readers cannot actually instantiate the
>>>>>>>>>>>>>>>>>    format (it’s abstract) but it is available so they also don’t fail trying
>>>>>>>>>>>>>>>>>    to load the class. Table-specific config should not be stored in table or
>>>>>>>>>>>>>>>>>    serde properties.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> That leaves Spark configuration and Iceberg table
>>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Iceberg differs from other tables because it is
>>>>>>>>>>>>>>>>> opinionated: data configuration should be maintained at the table level.
>>>>>>>>>>>>>>>>> This is cleaner for users because config is standardized across engines and
>>>>>>>>>>>>>>>>> in one place. And it also enables services that analyze a table and update
>>>>>>>>>>>>>>>>> its configuration to tune options that users almost never do, like row
>>>>>>>>>>>>>>>>> group or stripe size in the columnar formats. Iceberg table configuration
>>>>>>>>>>>>>>>>> is used to configure table-specific concerns and behavior.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Spark configuration is used for engine-specific concerns,
>>>>>>>>>>>>>>>>> and runtime overrides. A good example of an engine-specific concern is the
>>>>>>>>>>>>>>>>> catalogs that are available to load Iceberg tables. Spark has a way to load
>>>>>>>>>>>>>>>>> and configure catalog implementations and Iceberg uses that for all
>>>>>>>>>>>>>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>>>>>>>>>>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>>>>>>>>>>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>>>>>>>>>>>>>> passed to the individual read. Note that these necessarily have different
>>>>>>>>>>>>>>>>> config names for how they are used: Iceberg uses
>>>>>>>>>>>>>>>>> read.split.target-size and the read-specific option is
>>>>>>>>>>>>>>>>> target-size.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Applying this to Hive is a little strange for a couple
>>>>>>>>>>>>>>>>> reasons. First, Hive’s engine configuration *is* a Hadoop
>>>>>>>>>>>>>>>>> Configuration. As a result, I think the right place to store
>>>>>>>>>>>>>>>>> engine-specific config is there, including Iceberg catalogs using a
>>>>>>>>>>>>>>>>> strategy similar to what Spark does: what external Iceberg catalogs are
>>>>>>>>>>>>>>>>> available and their configuration should come from the HiveConf.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The second way Hive is strange is that Hive needs to use
>>>>>>>>>>>>>>>>> its own MetaStore to track Hive table concerns. The MetaStore may have
>>>>>>>>>>>>>>>>> tables created by an Iceberg HiveCatalog, and Hive also needs to be able to
>>>>>>>>>>>>>>>>> load tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - There should be a default HiveCatalog that uses the
>>>>>>>>>>>>>>>>>    current MetaStore URI to be used for HiveCatalog tables tracked in the
>>>>>>>>>>>>>>>>>    MetaStore
>>>>>>>>>>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>>>>>>>>>>    - HMS table properties should be used to determine how
>>>>>>>>>>>>>>>>>    to load a table: using a Hadoop location, using the default metastore
>>>>>>>>>>>>>>>>>    catalog, or using an external Iceberg catalog
>>>>>>>>>>>>>>>>>       - If there is a metadata_location, then use the
>>>>>>>>>>>>>>>>>       HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>>>>>>>>       - If there is a catalog property, then load that
>>>>>>>>>>>>>>>>>       catalog and use it to load the table identifier, or maybe an identifier
>>>>>>>>>>>>>>>>>       from HMS table properties
>>>>>>>>>>>>>>>>>       - If there is no catalog or metadata_location, then
>>>>>>>>>>>>>>>>>       use HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This would make it possible to access all types of Iceberg
>>>>>>>>>>>>>>>>> tables in the same query, and would match how Spark and Flink configure
>>>>>>>>>>>>>>>>> catalogs. Other than the configuration above, I don’t think that config in
>>>>>>>>>>>>>>>>> HMS should be used at all, like how the other engines work. Iceberg is the
>>>>>>>>>>>>>>>>> source of truth for table metadata, HMS stores how to load the Iceberg
>>>>>>>>>>>>>>>>> table, and HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This isn’t quite how configuration works right now.
>>>>>>>>>>>>>>>>> Currently, the catalog is controlled by a HiveConf property,
>>>>>>>>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables will
>>>>>>>>>>>>>>>>> be used to load table locations. If it is set, then that catalog will be
>>>>>>>>>>>>>>>>> used to load all tables by name. This makes it impossible to load tables
>>>>>>>>>>>>>>>>> from different catalogs at the same time. That’s why I think the Iceberg
>>>>>>>>>>>>>>>>> catalog for a table should be stored in HMS table properties.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I should also explain iceberg.hive.engine.enabled flag,
>>>>>>>>>>>>>>>>> but I think this is long enough for now.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I would like to start a discussion, how should we handle
>>>>>>>>>>>>>>>>>> properties from various sources like Iceberg, Hive or global configuration.
>>>>>>>>>>>>>>>>>> I've put together a short document
>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>>
>>>>>
>>>>>

Re: Iceberg/Hive properties handling

Posted by Peter Vary <pv...@cloudera.com.INVALID>.
I like the strong coupling between Hive and Iceberg if we can make it work.
It could be beneficial for the end users, but I still have some concerns.
We should consider the following aspects:
- Where has the change initiated (Hive or Spark)
- Which Catalog is used (HiveCatalog or other)
- Which Hive version is used (Hive 2/3)

Some current constraints I think we have:
- There could be multiple Hive tables above a single Iceberg table with
most of the Catalogs (HiveCatalog being the single exception)
- I see no ways to propagate Spark changes for HMS if the Catalog is not
HiveCatalog
- Only Hive3 has ways to propagate changes to the Iceberg table after
creation
- Hive inserts modify the table data (one Iceberg commit) and then the
table metadata (another Iceberg commit). This could be suboptimal but
solvable.

My feeling is that the tight coupling could work as expected with only the
HiveCatalog using Hive3. In every other case the Iceberg and the HMS
properties will deviate. That is why I think it would be easier to
understand for the user that Iceberg and Hive are different systems with
different properties.

All that said we will use Hive3 and HiveCatalog so I think we are fine with
1-on-1 mapping too.
If we move this way we should remove the current property filtering from
the HiveCatalog and from the HiveIcebergMetaHook, so we are consistent.

Thanks, Peter

Jacques Nadeau <ja...@dremio.com> wrote (on Dec 9, 2020, Wed 23:01):

> Who cares if there are a few extra properties from Hive? Users may expect
>> those properties to be there anyway.
>
>
> Yeah, what is the key argument against letting them leak? What problem are
> people trying to solve?
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
>
> On Wed, Dec 9, 2020 at 12:47 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> I agree that #2 doesn’t really work. I also think that #4 can’t work
>> either. There is no way to add a prefix for HMS properties that already
>> exist, so the only option is to have a list of properties to suppress,
>> which is option #1.
>>
>> I think that option #3 is a bad idea because it would lead to surprising
>> behavior for users. If a user creates a table using Hive DDL and sets table
>> properties, those properties should be present in the source of truth
>> Iceberg table. If a prefix was required to forward them to Iceberg, that
>> would create a situation where properties appear to be missing because the
>> user tried to use syntax that works for nearly every other table.
>>
>> That leaves either option #1 or doing nothing. I actually think that
>> there’s a strong argument to do nothing here and allow Hive and Iceberg
>> properties to be mixed in the Iceberg table. Who cares if there are a few
>> extra properties from Hive? Users may expect those properties to be there
>> anyway.
>>
>> rb
>>
>> On Mon, Dec 7, 2020 at 9:58 AM Jacques Nadeau jacques@dremio.com
>> <ht...@dremio.com> wrote:
>>
>> Hey Peter, thanks for updating the doc and your heads up in the other
>>> thread on your capacity to look at this before EOY.
>>>
>>> I'm going to try to create a specification document based on the
>>> discussion document you put together. I think there is general consensus
>>> around what you call "Spark-like catalog configuration" so I'd like to
>>> formalize that more.
>>>
>>> It seems like there is less consensus around the whitelist/blacklist
>>> side of things. You outline four approaches:
>>>
>>>    1. Hard coded HMS only property list
>>>    2. Hard coded Iceberg only property list
>>>    3. Prefix for Iceberg properties
>>>    4. Prefix for HMS only properties
>>>
>>> I generally think #2 is a no-go as it creates too much coupling between
>>> catalog implementations and core iceberg. It seems like Ryan Blue would
>>> prefer #4 (correct?). Any other strong opinions?
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>>
>>> On Thu, Dec 3, 2020 at 9:27 AM Peter Vary <pv...@cloudera.com.invalid>
>>> wrote:
>>>
>>>> As Jacques suggested (with the help of Zoltan) I have collected the
>>>> current state and the proposed solutions in a document:
>>>>
>>>> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
>>>>
>>>> My feeling is that we do not have a final decision, so I tried to list
>>>> all the possible solutions.
>>>> Please comment!
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
>>>>
>>>> When I was working on the CREATE TABLE patch I found the following
>>>> TBLPROPERTIES on newly created tables:
>>>>
>>>>    - external.table.purge
>>>>    - EXTERNAL
>>>>    - bucketing_version
>>>>    - numRows
>>>>    - rawDataSize
>>>>    - totalSize
>>>>    - numFiles
>>>>    - numFileErasureCoded
>>>>
>>>>
>>>> I am afraid that we cannot change the name of most of these
>>>> properties, and it might not be useful to have most of them, since the Iceberg
>>>> statistics are already there. Also my feeling is that this is only the tip of
>>>> the Iceberg (pun intended :)), so I think we should have a more
>>>> targeted way to push properties to the Iceberg tables.
>>>>
>>>> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>> Sorry, I accidentally didn’t copy the dev list on this reply. Resending:
>>>>
>>>> Also I expect that we want to add Hive write specific configs to table
>>>> level when the general engine independent configuration is not ideal for
>>>> Hive, but every Hive query for a given table should use some specific
>>>> config.
>>>>
>>>> Hive may need configuration, but I think these should still be kept in
>>>> the Iceberg table. There is no reason to make Hive config inaccessible from
>>>> other engines. If someone wants to view all of the config for a table from
>>>> Spark, the Hive config should also be included right?
>>>>
>>>> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com> wrote:
>>>>
>>>>> I will ask Laszlo if he wants to update his doc.
>>>>>
>>>>> I see both pros and cons of catalog definition in config files. If
>>>>> there is an easy default then I do not mind any of the proposed solutions.
>>>>>
>>>>> OTOH I am in favor of the "use prefix for Iceberg table properties"
>>>>> solution, because in Hive it is common to add new keys to the property list
>>>>> - no restriction is in place (I am not even sure that the currently
>>>>> implemented blacklist for preventing the propagation of properties to Iceberg
>>>>> tables is complete). Also I expect that we want to add Hive write specific
>>>>> configs to table level when the general engine independent configuration is
>>>>> not ideal for Hive, but every Hive query for a given table should use some
>>>>> specific config.
>>>>>
>>>>> Thanks, Peter
>>>>>
>>>>> Jacques Nadeau <ja...@dremio.com> wrote (on Dec 1, 2020, Tue 17:06):
>>>>>
>>>>>> Would someone be willing to create a document that states the current
>>>>>> proposal?
>>>>>>
>>>>>> It is becoming somewhat difficult to follow this thread. I also worry
>>>>>> that without a complete statement of the current shape that people may be
>>>>>> incorrectly thinking they are in alignment.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jacques Nadeau
>>>>>> CTO and Co-Founder, Dremio
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <
>>>>>> boroknagyz@cloudera.com> wrote:
>>>>>>
>>>>>>> Thanks, Ryan. I answered inline.
>>>>>>>
>>>>>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>
>>>>>>>> This sounds like a good plan overall, but I have a couple of notes:
>>>>>>>>
>>>>>>>>    1. We need to keep in mind that users plug in their own
>>>>>>>>    catalogs, so iceberg.catalog could be a Glue or Nessie catalog,
>>>>>>>>    not just Hive or Hadoop. I don’t think it makes much sense to use separate
>>>>>>>>    hadoop.catalog and hive.catalog values. Those should just be names for
>>>>>>>>    catalogs configured in Configuration, i.e., via hive-site.xml.
>>>>>>>>    We then only need a special value for loading Hadoop tables from paths.
>>>>>>>>
>>>>>>>> About extensibility, I think the usual Hive way is to use Java
>>>>>>> class names. So this way the value for 'iceberg.catalog' could be e.g.
>>>>>>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
>>>>>>> would need to have a factory method that constructs the catalog object from
>>>>>>> a properties object (Map<String, String>). E.g.
>>>>>>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
>>>>>>> 'iceberg.catalog_location' to be present in properties.
>>>>>>>
>>>>>>>>
>>>>>>>>    1. I don’t think that catalog configuration should be kept in
>>>>>>>>    table properties. A catalog should not be loaded for each table. So I don’t
>>>>>>>>    think we need iceberg.catalog_location. Instead, we should have
>>>>>>>>    a way to define catalogs in the Configuration for tables in the
>>>>>>>>    metastore to reference.
>>>>>>>>
>>>>>>>>  I think it makes sense, on the other hand it would make adding new
>>>>>>> catalogs more heavy-weight, i.e. now you'd need to edit configuration files
>>>>>>> and restart/reinit services. Maybe it can be cumbersome in some
>>>>>>> environments.
>>>>>>>
>>>>>>>>
>>>>>>>>    1. I’d rather use a prefix to exclude properties from being
>>>>>>>>    passed to Iceberg than to include them. Otherwise, users don’t know what to
>>>>>>>>    do to pass table properties from Hive or Impala. If we exclude a prefix or
>>>>>>>>    specific properties, then everything but the properties reserved for
>>>>>>>>    locating the table are passed as the user would expect.
>>>>>>>>
>>>>>>>> I don't have a strong opinion about this, but yeah, maybe this
>>>>>>> behavior would cause the least surprises.
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <
>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Thanks, Peter. I answered inline.
>>>>>>>>>
>>>>>>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <
>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Zoltan,
>>>>>>>>>>
>>>>>>>>>> Answers below:
>>>>>>>>>>
>>>>>>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>>>>>>>> boroknagyz@cloudera.com.INVALID> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Thanks for the replies. My take on the above questions is as
>>>>>>>>>> follows
>>>>>>>>>>
>>>>>>>>>>    - Should 'iceberg.catalog' be a required property?
>>>>>>>>>>    - Yeah, I think it would be nice if this would be required to
>>>>>>>>>>       avoid any implicit behavior
>>>>>>>>>>
>>>>>>>>>> Currently we have a Catalogs class to get/initialize/use the
>>>>>>>>>> different Catalogs. At that time the decision was to use HadoopTables as a
>>>>>>>>>> default catalog.
>>>>>>>>>> It might be worthwhile to use the same class in Impala as well,
>>>>>>>>>> so the behavior is consistent.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yeah, I think it'd be beneficial for us to use the Iceberg classes
>>>>>>>>> whenever possible. The Catalogs class is very similar to what we have
>>>>>>>>> currently in Impala.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - 'hadoop.catalog' LOCATION and catalog_location
>>>>>>>>>>       - In Impala we don't allow setting LOCATION for tables
>>>>>>>>>>       stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the
>>>>>>>>>>       Iceberg table's actual location. We were also thinking about using only the
>>>>>>>>>>       table LOCATION, and set it to the catalog location, but we also found it
>>>>>>>>>>       confusing.
>>>>>>>>>>
>>>>>>>>>> It could definitely work, but it is somewhat strange that we have
>>>>>>>>>> an external table location set to an arbitrary path, and we have a
>>>>>>>>>> different location generated by other configs. It would be nice to have the
>>>>>>>>>> real location set in the external table location as well.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Impala sets the real Iceberg table location for external tables.
>>>>>>>>> E.g. if the user issues
>>>>>>>>>
>>>>>>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>>>>>>> STORED AS ICEBERG
>>>>>>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>>>>>>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>>>>>>
>>>>>>>>> 'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>>>>>>
>>>>>>>>> If the end user had specified LOCATION, then Impala would have
>>>>>>>>> raised an error. But the above DDL statement is correct, so Impala loads
>>>>>>>>> the iceberg table via Iceberg API, then creates the HMS table and sets
>>>>>>>>> LOCATION to the Iceberg table location (something like
>>>>>>>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I like the flexibility of setting the table_identifier on table
>>>>>>>>>> level, which could help remove naming conflicts. We might want to have
>>>>>>>>>> this in the Iceberg Catalog implementation.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - 'iceberg.table_identifier' for HiveCatalog
>>>>>>>>>>       - Yeah, it doesn't add much if we only allow using the
>>>>>>>>>>       current HMS. I think it can be only useful if we are allowing external
>>>>>>>>>>       HMSes.
>>>>>>>>>>    - Moving properties to SERDEPROPERTIES
>>>>>>>>>>       - I see that these properties are used by the SerDe
>>>>>>>>>>       classes in Hive, but I feel that these properties are just not about
>>>>>>>>>>       serialization and deserialization. And as I see the current SERDEPROPERTIES
>>>>>>>>>>       are things like 'field.delim', 'separatorChar', 'quoteChar', etc. So
>>>>>>>>>>       properties about table loading more naturally belong to TBLPROPERTIES in my
>>>>>>>>>>       opinion.
>>>>>>>>>>
>>>>>>>>>> I have seen it used both ways for HBaseSerDe. (even the wiki page
>>>>>>>>>> uses both :) ). Since Impala prefers TBLPROPERTIES and if we start using
>>>>>>>>>> prefix for separating real Iceberg table properties from other properties,
>>>>>>>>>> then we can keep it at TBLPROPERTIES.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In the google doc I also had a comment about prefixing iceberg
>>>>>>>>> table properties. We could use a prefix like 'iceberg.tblproperties.', and
>>>>>>>>> pass every property with this prefix to the Iceberg table. Currently Impala
>>>>>>>>> passes every table property to the Iceberg table.
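>>>>>>>>>
>>>>>>>>> For illustration, forwarding only the prefixed keys could be as simple
>>>>>>>>> as something like this (just a sketch; the prefix is only the one
>>>>>>>>> suggested above, nothing final):
>>>>>>>>>
>>>>>>>>> import java.util.HashMap;
>>>>>>>>> import java.util.Map;
>>>>>>>>>
>>>>>>>>> class IcebergPropertyFilter {
>>>>>>>>>   private static final String PREFIX = "iceberg.tblproperties.";
>>>>>>>>>
>>>>>>>>>   // Copy only prefixed HMS properties to the Iceberg table, with the
>>>>>>>>>   // prefix stripped; everything else stays Hive-only.
>>>>>>>>>   static Map<String, String> icebergProps(Map<String, String> hmsProps) {
>>>>>>>>>     Map<String, String> result = new HashMap<>();
>>>>>>>>>     for (Map.Entry<String, String> e : hmsProps.entrySet()) {
>>>>>>>>>       if (e.getKey().startsWith(PREFIX)) {
>>>>>>>>>         result.put(e.getKey().substring(PREFIX.length()), e.getValue());
>>>>>>>>>       }
>>>>>>>>>     }
>>>>>>>>>     return result;
>>>>>>>>>   }
>>>>>>>>> }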
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>     Zoltan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <
>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Based on the discussion below I understand we have the following
>>>>>>>>>>> kinds of properties:
>>>>>>>>>>>
>>>>>>>>>>>    1. Iceberg table properties - Engine independent, storage
>>>>>>>>>>>    related parameters
>>>>>>>>>>>    2. "how to get to" - I think these are mostly Hive table
>>>>>>>>>>>    specific properties, since for Spark, the Spark catalog configuration
>>>>>>>>>>>    serves for the same purpose. I think the best place for storing these would
>>>>>>>>>>>    be the Hive SERDEPROPERTIES, as this describes the access information for
>>>>>>>>>>>    the SerDe. Sidenote: I think we should decide if we allow
>>>>>>>>>>>    HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier'
>>>>>>>>>>>    would make sense only if we allow having multiple catalogs.
>>>>>>>>>>>    3. Query specific properties - These are engine specific and
>>>>>>>>>>>    might be mapped to / even override the Iceberg table properties on the
>>>>>>>>>>>    engine specific code paths, but currently these properties have independent
>>>>>>>>>>>    names and mapped on a case-by-case basis.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Based on this:
>>>>>>>>>>>
>>>>>>>>>>>    - Shall we move the "how to get to" properties to
>>>>>>>>>>>    SERDEPROPERTIES?
>>>>>>>>>>>    - Shall we define a prefix for setting Iceberg table
>>>>>>>>>>>    properties from Hive queries and omitting other engine specific properties?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Peter
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I like these suggestions, comments inline below on the last
>>>>>>>>>>> round...
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <
>>>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>>>>>>>> information about table loading in HMS table properties. We are just a bit
>>>>>>>>>>>> more explicit about which catalog to use.
>>>>>>>>>>>> We have table property 'iceberg.catalog' to determine the
>>>>>>>>>>>> catalog type, right now the supported values are 'hadoop.tables',
>>>>>>>>>>>> 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be
>>>>>>>>>>>> set based on the catalog type.
>>>>>>>>>>>>
>>>>>>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'm all for renaming this, having "mr" in the property name is
>>>>>>>>>>> confusing.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>    - hadoop.tables
>>>>>>>>>>>>       - the table location is used to load the table
>>>>>>>>>>>>
>>>>>>>>>>>> The only question I have is should we have this as the default?
>>>>>>>>>>> i.e. if you don't set a catalog it will assume its HadoopTables and use the
>>>>>>>>>>> location? Or should we require this property to be here to be consistent
>>>>>>>>>>> and avoid any "magic"?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>    - hadoop.catalog
>>>>>>>>>>>>       - Required table property 'iceberg.catalog_location'
>>>>>>>>>>>>       specifies the location of the hadoop catalog in the file system
>>>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>       is used as table identifier
>>>>>>>>>>>>
>>>>>>>>>>>> I like this as it would allow you to use a different database
>>>>>>>>>>> and table name in Hive as opposed to the Hadoop Catalog - at the moment
>>>>>>>>>>> they have to match. The only thing here is that I think Hive requires a
>>>>>>>>>>> table LOCATION to be set and it's then confusing as there are now two
>>>>>>>>>>> locations on the table. I'm not sure whether in the Hive storage handler or
>>>>>>>>>>> SerDe etc. we can get Hive to not require that and maybe even disallow it
>>>>>>>>>>> from being set. That would probably be best in conjunction with this.
>>>>>>>>>>> Another solution would be to not have the 'iceberg.catalog_location'
>>>>>>>>>>> property but instead use the table LOCATION for this but that's a bit
>>>>>>>>>>> confusing from a Hive point of view.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>    - hive.catalog
>>>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>       is used as table identifier
>>>>>>>>>>>>       - We have the assumption that the current Hive metastore
>>>>>>>>>>>>       stores the table, i.e. we don't support external Hive metastores currently
>>>>>>>>>>>>
>>>>>>>>>>>> These sound fine for Hive catalog tables that are created
>>>>>>>>>>> outside of the automatic Hive table creation (see
>>>>>>>>>>> https://iceberg.apache.org/hive/ -> Using Hive Catalog); we'd
>>>>>>>>>>> just need to document how you can create these yourself and that one could
>>>>>>>>>>> use a different Hive database and table etc.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Independent of catalog implementations, but we also have table
>>>>>>>>>>>> property 'iceberg.file_format' to specify the file format for the data
>>>>>>>>>>>> files.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> OK, I don't think we need that for Hive?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> We haven't released it yet, so we are open to changes, but I
>>>>>>>>>>>> think these properties are reasonable and it would be great if we could
>>>>>>>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>>>>>>>> metastore of tables.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> If others agree I think we should create an issue where we
>>>>>>>>>>> document the above changes so it's very clear what we're doing and can then
>>>>>>>>>>> go and implement them and update the docs etc.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>     Zoltan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <
>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>>>>>>
>>>>>>>>>>>>> #4 is correct because we provide some information that is
>>>>>>>>>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>>>>>>>>>> current user). I also agree that it would be good to have a table
>>>>>>>>>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>>>>>>>>>> gives us a way to handle name conflicts.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <
>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> db1.table1_etl_branch =>
>>>>>>>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <
>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I agree with Ryan on the core principles here. As I
>>>>>>>>>>>>>>> understand them:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>>>>>>>>>>>    2. Hive table properties describe "how to get to"
>>>>>>>>>>>>>>>    Iceberg metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>>>>>>>>    3. There could be default "how to get to" information
>>>>>>>>>>>>>>>    set at a global level
>>>>>>>>>>>>>>>    4. Best-effort schema should be stored in the table
>>>>>>>>>>>>>>>    properties in HMS. This should be done for information schema retrieval
>>>>>>>>>>>>>>>    purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is that a fair summary of your statements Ryan (except 4,
>>>>>>>>>>>>>>> which I just added)?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One comment I have on #2 is that for different catalogs and
>>>>>>>>>>>>>>> use cases, I think it can be somewhat more complex where it would be
>>>>>>>>>>>>>>> desirable for a table that initially existed without Hive that was later
>>>>>>>>>>>>>>> exposed in Hive to support a ptr/path/token for how the table is named
>>>>>>>>>>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>>>>>>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>>>>>>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Similarly, you might want to expose a particular branch
>>>>>>>>>>>>>>> version of a table. So it might say:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Just saying that the address to the table in the catalog
>>>>>>>>>>>>>>> could itself have several properties. The key being that no matter what
>>>>>>>>>>>>>>> those are, we should follow #1 and only store properties that are about the
>>>>>>>>>>>>>>> ptr, not the content/metadata.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can
>>>>>>>>>>>>>>> someone confirm that it is true? And that it is possible/not problematic?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking
>>>>>>>>>>>>>>>> about these problems as well, so this is a good time to have a discussion
>>>>>>>>>>>>>>>> about Hive config.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think that Hive configuration should work mostly like
>>>>>>>>>>>>>>>> other engines, where different configurations are used for different
>>>>>>>>>>>>>>>> purposes. Different purposes means that there is not a global configuration
>>>>>>>>>>>>>>>> priority. Hopefully, I can explain how we use the different config sources
>>>>>>>>>>>>>>>> elsewhere to clarify.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it
>>>>>>>>>>>>>>>> has a Hadoop Configuration, but it also has its own global configuration.
>>>>>>>>>>>>>>>> There are also Iceberg table properties, and all of the various Hive
>>>>>>>>>>>>>>>> properties if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The first step is to simplify where we can, so we
>>>>>>>>>>>>>>>> effectively eliminate 2 sources of config:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    - The Hadoop Configuration is only used to instantiate
>>>>>>>>>>>>>>>>    Hadoop classes, like FileSystem. Iceberg should not use it for any other
>>>>>>>>>>>>>>>>    config.
>>>>>>>>>>>>>>>>    - Config in the Hive MetaStore is only used to identify
>>>>>>>>>>>>>>>>    that a table is Iceberg and point to its metadata location. All other
>>>>>>>>>>>>>>>>    config in HMS is informational. For example, the input format is
>>>>>>>>>>>>>>>>    FileInputFormat so that non-Iceberg readers cannot actually instantiate the
>>>>>>>>>>>>>>>>    format (it’s abstract) but it is available so they also don’t fail trying
>>>>>>>>>>>>>>>>    to load the class. Table-specific config should not be stored in table or
>>>>>>>>>>>>>>>>    serde properties.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That leaves Spark configuration and Iceberg table
>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Iceberg differs from other tables because it is
>>>>>>>>>>>>>>>> opinionated: data configuration should be maintained at the table level.
>>>>>>>>>>>>>>>> This is cleaner for users because config is standardized across engines and
>>>>>>>>>>>>>>>> in one place. And it also enables services that analyze a table and update
>>>>>>>>>>>>>>>> its configuration to tune options that users almost never do, like row
>>>>>>>>>>>>>>>> group or stripe size in the columnar formats. Iceberg table configuration
>>>>>>>>>>>>>>>> is used to configure table-specific concerns and behavior.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Spark configuration is used for engine-specific concerns,
>>>>>>>>>>>>>>>> and runtime overrides. A good example of an engine-specific concern is the
>>>>>>>>>>>>>>>> catalogs that are available to load Iceberg tables. Spark has a way to load
>>>>>>>>>>>>>>>> and configure catalog implementations and Iceberg uses that for all
>>>>>>>>>>>>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>>>>>>>>>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>>>>>>>>>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>>>>>>>>>>>>> passed to the individual read. Note that these necessarily have different
>>>>>>>>>>>>>>>> config names for how they are used: Iceberg uses
>>>>>>>>>>>>>>>> read.split.target-size and the read-specific option is
>>>>>>>>>>>>>>>> target-size.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Applying this to Hive is a little strange for a couple
>>>>>>>>>>>>>>>> reasons. First, Hive’s engine configuration *is* a Hadoop
>>>>>>>>>>>>>>>> Configuration. As a result, I think the right place to store
>>>>>>>>>>>>>>>> engine-specific config is there, including Iceberg catalogs using a
>>>>>>>>>>>>>>>> strategy similar to what Spark does: what external Iceberg catalogs are
>>>>>>>>>>>>>>>> available and their configuration should come from the HiveConf.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The second way Hive is strange is that Hive needs to use
>>>>>>>>>>>>>>>> its own MetaStore to track Hive table concerns. The MetaStore may have
>>>>>>>>>>>>>>>> tables created by an Iceberg HiveCatalog, and Hive also needs to be able to
>>>>>>>>>>>>>>>> load tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    - There should be a default HiveCatalog that uses the
>>>>>>>>>>>>>>>>    current MetaStore URI to be used for HiveCatalog tables tracked in the
>>>>>>>>>>>>>>>>    MetaStore
>>>>>>>>>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>>>>>>>>>    - HMS table properties should be used to determine how
>>>>>>>>>>>>>>>>    to load a table: using a Hadoop location, using the default metastore
>>>>>>>>>>>>>>>>    catalog, or using an external Iceberg catalog
>>>>>>>>>>>>>>>>       - If there is a metadata_location, then use the
>>>>>>>>>>>>>>>>       HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>>>>>>>       - If there is a catalog property, then load that
>>>>>>>>>>>>>>>>       catalog and use it to load the table identifier, or maybe an identifier
>>>>>>>>>>>>>>>>       from HMS table properties
>>>>>>>>>>>>>>>>       - If there is no catalog or metadata_location, then
>>>>>>>>>>>>>>>>       use HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This would make it possible to access all types of Iceberg
>>>>>>>>>>>>>>>> tables in the same query, and would match how Spark and Flink configure
>>>>>>>>>>>>>>>> catalogs. Other than the configuration above, I don’t think that config in
>>>>>>>>>>>>>>>> HMS should be used at all, like how the other engines work. Iceberg is the
>>>>>>>>>>>>>>>> source of truth for table metadata, HMS stores how to load the Iceberg
>>>>>>>>>>>>>>>> table, and HiveConf defines the catalogs (or runtime overrides).
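>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As a rough sketch of that resolution order (the helper arguments and
>>>>>>>>>>>>>>>> the exact property keys here are illustrative only, not a concrete
>>>>>>>>>>>>>>>> proposal):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> import java.util.Map;
>>>>>>>>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>>>>>>> import org.apache.iceberg.Table;
>>>>>>>>>>>>>>>> import org.apache.iceberg.catalog.Catalog;
>>>>>>>>>>>>>>>> import org.apache.iceberg.catalog.TableIdentifier;
>>>>>>>>>>>>>>>> import org.apache.iceberg.hadoop.HadoopTables;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> class TableResolver {
>>>>>>>>>>>>>>>>   // 'hiveCatalog' is the default catalog for this metastore; 'external'
>>>>>>>>>>>>>>>>   // holds the catalogs declared in HiveConf, keyed by name.
>>>>>>>>>>>>>>>>   static Table load(Map<String, String> hmsProps, Catalog hiveCatalog,
>>>>>>>>>>>>>>>>       Map<String, Catalog> external, Configuration conf) {
>>>>>>>>>>>>>>>>     // In practice the identifier would default to <database>.<table>.
>>>>>>>>>>>>>>>>     TableIdentifier id =
>>>>>>>>>>>>>>>>         TableIdentifier.parse(hmsProps.get("iceberg.table_identifier"));
>>>>>>>>>>>>>>>>     if (hmsProps.containsKey("metadata_location")) {
>>>>>>>>>>>>>>>>       // Tracked by this metastore: load through the default HiveCatalog.
>>>>>>>>>>>>>>>>       return hiveCatalog.loadTable(id);
>>>>>>>>>>>>>>>>     } else if (hmsProps.containsKey("iceberg.catalog")) {
>>>>>>>>>>>>>>>>       // Load through the named external Iceberg catalog from HiveConf.
>>>>>>>>>>>>>>>>       return external.get(hmsProps.get("iceberg.catalog")).loadTable(id);
>>>>>>>>>>>>>>>>     } else {
>>>>>>>>>>>>>>>>       // Neither is set: treat the location as a HadoopTables path.
>>>>>>>>>>>>>>>>       return new HadoopTables(conf).load(hmsProps.get("location"));
>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>> }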
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This isn’t quite how configuration works right now.
>>>>>>>>>>>>>>>> Currently, the catalog is controlled by a HiveConf property,
>>>>>>>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables will
>>>>>>>>>>>>>>>> be used to load table locations. If it is set, then that catalog will be
>>>>>>>>>>>>>>>> used to load all tables by name. This makes it impossible to load tables
>>>>>>>>>>>>>>>> from different catalogs at the same time. That’s why I think the Iceberg
>>>>>>>>>>>>>>>> catalog for a table should be stored in HMS table properties.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I should also explain iceberg.hive.engine.enabled flag,
>>>>>>>>>>>>>>>> but I think this is long enough for now.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would like to start a discussion, how should we handle
>>>>>>>>>>>>>>>>> properties from various sources like Iceberg, Hive or global configuration.
>>>>>>>>>>>>>>>>> I've put together a short document
>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>>
>>>>
>>>>

Re: Iceberg/Hive properties handling

Posted by Jacques Nadeau <ja...@dremio.com>.
>
> Who cares if there are a few extra properties from Hive? Users may expect
> those properties to be there anyway.


Yeah, what is the key argument against letting them leak? What problem are
people trying to solve?


--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Wed, Dec 9, 2020 at 12:47 PM Ryan Blue <rb...@netflix.com> wrote:

> I agree that #2 doesn’t really work. I also think that #4 can’t work
> either. There is no way to add a prefix for HMS properties that already
> exist, so the only option is to have a list of properties to suppress,
> which is option #1.
>
> I think that option #3 is a bad idea because it would lead to surprising
> behavior for users. If a user creates a table using Hive DDL and sets table
> properties, those properties should be present in the source of truth
> Iceberg table. If a prefix was required to forward them to Iceberg, that
> would create a situation where properties appear to be missing because the
> user tried to use syntax that works for nearly every other table.
>
> That leaves either option #1 or doing nothing. I actually think that
> there’s a strong argument to do nothing here and allow Hive and Iceberg
> properties to be mixed in the Iceberg table. Who cares if there are a few
> extra properties from Hive? Users may expect those properties to be there
> anyway.
>
> rb
>
> On Mon, Dec 7, 2020 at 9:58 AM Jacques Nadeau <ja...@dremio.com> wrote:
>
> Hey Peter, thanks for updating the doc and your heads up in the other
>> thread on your capacity to look at this before EOY.
>>
>> I'm going to try to create a specification document based on the
>> discussion document you put together. I think there is general consensus
>> around what you call "Spark-like catalog configuration" so I'd like to
>> formalize that more.
>>
>> It seems like there is less consensus around the whitelist/blacklist side
>> of things. You outline four approaches:
>>
>>    1. Hard coded HMS only property list
>>    2. Hard coded Iceberg only property list
>>    3. Prefix for Iceberg properties
>>    4. Prefix for HMS only properties
>>
>> I generally think #2 is a no-go as it creates too much coupling between
>> catalog implementations and core iceberg. It seems like Ryan Blue would
>> prefer #4 (correct?). Any other strong opinions?
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>
>> On Thu, Dec 3, 2020 at 9:27 AM Peter Vary <pv...@cloudera.com.invalid>
>> wrote:
>>
>>> As Jacques suggested (with the help of Zoltan) I have collected the
>>> current state and the proposed solutions in a document:
>>>
>>> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
>>>
>>> My feeling is that we do not have a final decision, so I tried to list all
>>> the possible solutions.
>>> Please comment!
>>>
>>> Thanks,
>>> Peter
>>>
>>> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
>>>
>>> When I was working on the CREATE TABLE patch I found the following
>>> TBLPROPERTIES on newly created tables:
>>>
>>>    - external.table.purge
>>>    - EXTERNAL
>>>    - bucketing_version
>>>    - numRows
>>>    - rawDataSize
>>>    - totalSize
>>>    - numFiles
>>>    - numFileErasureCoded
>>>
>>>
>>> I am afraid that we cannot change the names of most of these properties,
>>> and it might not be useful to have most of them alongside the Iceberg
>>> statistics that are already there. Also, my feeling is that this is only
>>> the tip of the Iceberg (pun intended :)), which is why I think we should
>>> have a more targeted way to push properties to the Iceberg tables.
>>>
>>> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
>>>
>>> Sorry, I accidentally didn’t copy the dev list on this reply. Resending:
>>>
>>> Also I expect that we want to add Hive write specific configs to table
>>> level when the general engine independent configuration is not ideal for
>>> Hive, but every Hive query for a given table should use some specific
>>> config.
>>>
>>> Hive may need configuration, but I think these should still be kept in
>>> the Iceberg table. There is no reason to make Hive config inaccessible from
>>> other engines. If someone wants to view all of the config for a table from
>>> Spark, the Hive config should also be included right?
>>>
>>> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com> wrote:
>>>
>>>> I will ask Laszlo if he wants to update his doc.
>>>>
>>>> I see both pros and cons of catalog definition in config files. If
>>>> there is an easy default then I do not mind any of the proposed solutions.
>>>>
>>>> OTOH I am in favor of the "use prefix for Iceberg table properties"
>>>> solution, because in Hive it is common to add new keys to the property list
>>>> - no restriction is in place (I am not even sure that the currently
>>>> implemented blacklist for preventing properties from propagating to Iceberg
>>>> tables is complete). Also I expect that we want to add Hive write specific
>>>> configs to table level when the general engine independent configuration is
>>>> not ideal for Hive, but every Hive query for a given table should use some
>>>> specific config.
>>>>
>>>> Thanks, Peter
>>>>
>>>> Jacques Nadeau <ja...@dremio.com> ezt írta (időpont: 2020. dec. 1.,
>>>> Ke 17:06):
>>>>
>>>>> Would someone be willing to create a document that states the current
>>>>> proposal?
>>>>>
>>>>> It is becoming somewhat difficult to follow this thread. I also worry
>>>>> that without a complete statement of the current shape that people may be
>>>>> incorrectly thinking they are in alignment.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jacques Nadeau
>>>>> CTO and Co-Founder, Dremio
>>>>>
>>>>>
>>>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <
>>>>> boroknagyz@cloudera.com> wrote:
>>>>>
>>>>>> Thanks, Ryan. I answered inline.
>>>>>>
>>>>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>
>>>>>>> This sounds like a good plan overall, but I have a couple of notes:
>>>>>>>
>>>>>>>    1. We need to keep in mind that users plug in their own
>>>>>>>    catalogs, so iceberg.catalog could be a Glue or Nessie catalog,
>>>>>>>    not just Hive or Hadoop. I don’t think it makes much sense to use separate
>>>>>>>    hadoop.catalog and hive.catalog values. Those should just be names for
>>>>>>>    catalogs configured in Configuration, i.e., via hive-site.xml.
>>>>>>>    We then only need a special value for loading Hadoop tables from paths.
>>>>>>>
>>>>>>> About extensibility, I think the usual Hive way is to use Java class
>>>>>> names. So this way the value for 'iceberg.catalog' could be e.g.
>>>>>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
>>>>>> would need to have a factory method that constructs the catalog object from
>>>>>> a properties object (Map<String, String>). E.g.
>>>>>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
>>>>>> 'iceberg.catalog_location' to be present in properties.
>>>>>>
>>>>>>>
>>>>>>>    1. I don’t think that catalog configuration should be kept in
>>>>>>>    table properties. A catalog should not be loaded for each table. So I don’t
>>>>>>>    think we need iceberg.catalog_location. Instead, we should have
>>>>>>>    a way to define catalogs in the Configuration for tables in the
>>>>>>>    metastore to reference.
>>>>>>>
>>>>>>>  I think it makes sense, on the other hand it would make adding new
>>>>>> catalogs more heavy-weight, i.e. now you'd need to edit configuration files
>>>>>> and restart/reinit services. Maybe it can be cumbersome in some
>>>>>> environments.
>>>>>>
>>>>>>>
>>>>>>>    1. I’d rather use a prefix to exclude properties from being
>>>>>>>    passed to Iceberg than to include them. Otherwise, users don’t know what to
>>>>>>>    do to pass table properties from Hive or Impala. If we exclude a prefix or
>>>>>>>    specific properties, then everything but the properties reserved for
>>>>>>>    locating the table are passed as the user would expect.
>>>>>>>
>>>>>>> I don't have a strong opinion about this, but yeah, maybe this
>>>>>> behavior would cause the least surprises.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <
>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>
>>>>>>>> Thanks, Peter. I answered inline.
>>>>>>>>
>>>>>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <
>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Hi Zoltan,
>>>>>>>>>
>>>>>>>>> Answers below:
>>>>>>>>>
>>>>>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>>>>>>> boroknagyz@cloudera.com.INVALID> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Thanks for the replies. My take on the above questions is as
>>>>>>>>> follows
>>>>>>>>>
>>>>>>>>>    - Should 'iceberg.catalog' be a required property?
>>>>>>>>>    - Yeah, I think it would be nice if this would be required to
>>>>>>>>>       avoid any implicit behavior
>>>>>>>>>
>>>>>>>>> Currently we have a Catalogs class to get/initialize/use the
>>>>>>>>> different Catalogs. At that time the decision was to use HadoopTables as a
>>>>>>>>> default catalog.
>>>>>>>>> It might be worthwhile to use the same class in Impala as well, so
>>>>>>>>> the behavior is consistent.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yeah, I think it'd be beneficial for us to use the Iceberg classes
>>>>>>>> whenever possible. The Catalogs class is very similar to what we have
>>>>>>>> currently in Impala.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - 'hadoop.catalog' LOCATION and catalog_location
>>>>>>>>>       - In Impala we don't allow setting LOCATION for tables
>>>>>>>>>       stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the
>>>>>>>>>       Iceberg table's actual location. We were also thinking about using only the
>>>>>>>>>       table LOCATION, and set it to the catalog location, but we also found it
>>>>>>>>>       confusing.
>>>>>>>>>
>>>>>>>>> It could definitely work, but it is somewhat strange that we have
>>>>>>>>> an external table location set to an arbitrary path, and we have a
>>>>>>>>> different location generated by other configs. It would be nice to have the
>>>>>>>>> real location set in the external table location as well.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Impala sets the real Iceberg table location for external tables.
>>>>>>>> E.g. if the user issues
>>>>>>>>
>>>>>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>>>>>> STORED AS ICEBERG
>>>>>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>>>>>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>>>>>
>>>>>>>> 'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>>>>>
>>>>>>>> If the end user had specified LOCATION, then Impala would have
>>>>>>>> raised an error. But the above DDL statement is correct, so Impala loads
>>>>>>>> the iceberg table via Iceberg API, then creates the HMS table and sets
>>>>>>>> LOCATION to the Iceberg table location (something like
>>>>>>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>>>>>>
>>>>>>>>
>>>>>>>>> I like the flexibility of setting the table_identifier on table
>>>>>>>>> level, which could help removing naming conflicts. We might want to have
>>>>>>>>> this in the Iceberg Catalog implementation.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - 'iceberg.table_identifier' for HiveCatalog
>>>>>>>>>       - Yeah, it doesn't add much if we only allow using the
>>>>>>>>>       current HMS. I think it can be only useful if we are allowing external
>>>>>>>>>       HMSes.
>>>>>>>>>    - Moving properties to SERDEPROPERTIES
>>>>>>>>>       - I see that these properties are used by the SerDe classes
>>>>>>>>>       in Hive, but I feel that these properties are just not about serialization
>>>>>>>>>       and deserialization. And as I see the current SERDEPROPERTIES are things
>>>>>>>>>       like 'field.delim', 'separatorChar', 'quoteChar', etc. So properties about
>>>>>>>>>       table loading more naturally belong to TBLPROPERTIES in my opinion.
>>>>>>>>>
>>>>>>>>> I have seen it used both ways for HBaseSerDe. (even the wiki page
>>>>>>>>> uses both :) ). Since Impala prefers TBLPROPERTIES and if we start using
>>>>>>>>> prefix for separating real Iceberg table properties from other properties,
>>>>>>>>> then we can keep it at TBLPROPERTIES.
>>>>>>>>>
>>>>>>>>
>>>>>>>> In the google doc I also had a comment about prefixing iceberg
>>>>>>>> table properties. We could use a prefix like 'iceberg.tblproperties.', and
>>>>>>>> pass every property with this prefix to the Iceberg table. Currently Impala
>>>>>>>> passes every table property to the Iceberg table.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>     Zoltan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <
>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Based on the discussion below I understand we have the following
>>>>>>>>>> kinds of properties:
>>>>>>>>>>
>>>>>>>>>>    1. Iceberg table properties - Engine independent, storage
>>>>>>>>>>    related parameters
>>>>>>>>>>    2. "how to get to" - I think these are mostly Hive table
>>>>>>>>>>    specific properties, since for Spark, the Spark catalog configuration
>>>>>>>>>>    serves for the same purpose. I think the best place for storing these would
>>>>>>>>>>    be the Hive SERDEPROPERTIES, as this describes the access information for
>>>>>>>>>>    the SerDe. Sidenote: I think we should decide if we allow
>>>>>>>>>>    HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier'
>>>>>>>>>>    would make sense only if we allow having multiple catalogs.
>>>>>>>>>>    3. Query specific properties - These are engine specific and
>>>>>>>>>>    might be mapped to / even override the Iceberg table properties on the
>>>>>>>>>>    engine specific code paths, but currently these properties have independent
>>>>>>>>>>    names and mapped on a case-by-case basis.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Based on this:
>>>>>>>>>>
>>>>>>>>>>    - Shall we move the "how to get to" properties to
>>>>>>>>>>    SERDEPROPERTIES?
>>>>>>>>>>    - Shall we define a prefix for setting Iceberg table
>>>>>>>>>>    properties from Hive queries and omitting other engine specific properties?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Peter
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I like these suggestions, comments inline below on the last
>>>>>>>>>> round...
>>>>>>>>>>
>>>>>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <
>>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>>>>>>> information about table loading in HMS table properties. We are just a bit
>>>>>>>>>>> more explicit about which catalog to use.
>>>>>>>>>>> We have table property 'iceberg.catalog' to determine the
>>>>>>>>>>> catalog type, right now the supported values are 'hadoop.tables',
>>>>>>>>>>> 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be
>>>>>>>>>>> set based on the catalog type.
>>>>>>>>>>>
>>>>>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'm all for renaming this, having "mr" in the property name is
>>>>>>>>>> confusing.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - hadoop.tables
>>>>>>>>>>>       - the table location is used to load the table
>>>>>>>>>>>
>>>>>>>>>>> The only question I have is should we have this as the default?
>>>>>>>>>> i.e. if you don't set a catalog it will assume its HadoopTables and use the
>>>>>>>>>> location? Or should we require this property to be here to be consistent
>>>>>>>>>> and avoid any "magic"?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - hadoop.catalog
>>>>>>>>>>>       - Required table property 'iceberg.catalog_location'
>>>>>>>>>>>       specifies the location of the hadoop catalog in the file system
>>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>       is used as table identifier
>>>>>>>>>>>
>>>>>>>>>>> I like this as it would allow you to use a different database
>>>>>>>>>> and table name in Hive as opposed to the Hadoop Catalog - at the moment
>>>>>>>>>> they have to match. The only thing here is that I think Hive requires a
>>>>>>>>>> table LOCATION to be set and it's then confusing as there are now two
>>>>>>>>>> locations on the table. I'm not sure whether in the Hive storage handler or
>>>>>>>>>> SerDe etc. we can get Hive to not require that and maybe even disallow it
>>>>>>>>>> from being set. That would probably be best in conjunction with this.
>>>>>>>>>> Another solution would be to not have the 'iceberg.catalog_location'
>>>>>>>>>> property but instead use the table LOCATION for this but that's a bit
>>>>>>>>>> confusing from a Hive point of view.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>    - hive.catalog
>>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>       is used as table identifier
>>>>>>>>>>>       - We have the assumption that the current Hive metastore
>>>>>>>>>>>       stores the table, i.e. we don't support external Hive metastores currently
>>>>>>>>>>>
>>>>>>>>>>> These sound fine for Hive catalog tables that are created
>>>>>>>>>> outside of the automatic Hive table creation (see
>>>>>>>>>> https://iceberg.apache.org/hive/ -> Using Hive Catalog); we'd
>>>>>>>>>> just need to document how you can create these yourself and that one could
>>>>>>>>>> use a different Hive database and table etc.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Independent of catalog implementations, but we also have table
>>>>>>>>>>> property 'iceberg.file_format' to specify the file format for the data
>>>>>>>>>>> files.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> OK, I don't think we need that for Hive?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> We haven't released it yet, so we are open to changes, but I
>>>>>>>>>>> think these properties are reasonable and it would be great if we could
>>>>>>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>>>>>>> metastore of tables.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> If others agree I think we should create an issue where we
>>>>>>>>>> document the above changes so it's very clear what we're doing and can then
>>>>>>>>>> go and implement them and update the docs etc.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>     Zoltan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <
>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>>>>>
>>>>>>>>>>>> #4 is correct because we provide some information that is
>>>>>>>>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>>>>>>>>> current user). I also agree that it would be good to have a table
>>>>>>>>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>>>>>>>>> gives us a way to handle name conflicts.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <
>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>>>>>
>>>>>>>>>>>>> db1.table1_etl_branch =>
>>>>>>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <
>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree with Ryan on the core principles here. As I
>>>>>>>>>>>>>> understand them:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>>>>>>>>>>    2. Hive table properties describe "how to get to" Iceberg
>>>>>>>>>>>>>>    metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>>>>>>>    3. There could be default "how to get to" information set
>>>>>>>>>>>>>>    at a global level
>>>>>>>>>>>>>>    4. Best-effort schema should be stored in the table
>>>>>>>>>>>>>>    properties in HMS. This should be done for information schema retrieval
>>>>>>>>>>>>>>    purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is that a fair summary of your statements Ryan (except 4,
>>>>>>>>>>>>>> which I just added)?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One comment I have on #2 is that for different catalogs and
>>>>>>>>>>>>>> use cases, I think it can be somewhat more complex where it would be
>>>>>>>>>>>>>> desirable for a table that initially existed without Hive that was later
>>>>>>>>>>>>>> exposed in Hive to support a ptr/path/token for how the table is named
>>>>>>>>>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>>>>>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>>>>>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Similarly, you might want to expose a particular branch
>>>>>>>>>>>>>> version of a table. So it might say:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just saying that the address to the table in the catalog
>>>>>>>>>>>>>> could itself have several properties. The key being that no matter what
>>>>>>>>>>>>>> those are, we should follow #1 and only store properties that are about the
>>>>>>>>>>>>>> ptr, not the content/metadata.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can
>>>>>>>>>>>>>> someone confirm that it is true? And that it is possible/not problematic?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking about
>>>>>>>>>>>>>>> these problems as well, so this is a good time to have a discussion about
>>>>>>>>>>>>>>> Hive config.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think that Hive configuration should work mostly like
>>>>>>>>>>>>>>> other engines, where different configurations are used for different
>>>>>>>>>>>>>>> purposes. Different purposes means that there is not a global configuration
>>>>>>>>>>>>>>> priority. Hopefully, I can explain how we use the different config sources
>>>>>>>>>>>>>>> elsewhere to clarify.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has
>>>>>>>>>>>>>>> a Hadoop Configuration, but it also has its own global configuration. There
>>>>>>>>>>>>>>> are also Iceberg table properties, and all of the various Hive properties
>>>>>>>>>>>>>>> if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The first step is to simplify where we can, so we
>>>>>>>>>>>>>>> effectively eliminate 2 sources of config:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - The Hadoop Configuration is only used to instantiate
>>>>>>>>>>>>>>>    Hadoop classes, like FileSystem. Iceberg should not use it for any other
>>>>>>>>>>>>>>>    config.
>>>>>>>>>>>>>>>    - Config in the Hive MetaStore is only used to identify
>>>>>>>>>>>>>>>    that a table is Iceberg and point to its metadata location. All other
>>>>>>>>>>>>>>>    config in HMS is informational. For example, the input format is
>>>>>>>>>>>>>>>    FileInputFormat so that non-Iceberg readers cannot actually instantiate the
>>>>>>>>>>>>>>>    format (it’s abstract) but it is available so they also don’t fail trying
>>>>>>>>>>>>>>>    to load the class. Table-specific config should not be stored in table or
>>>>>>>>>>>>>>>    serde properties.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That leaves Spark configuration and Iceberg table
>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Iceberg differs from other tables because it is opinionated:
>>>>>>>>>>>>>>> data configuration should be maintained at the table level. This is cleaner
>>>>>>>>>>>>>>> for users because config is standardized across engines and in one place.
>>>>>>>>>>>>>>> And it also enables services that analyze a table and update its
>>>>>>>>>>>>>>> configuration to tune options that users almost never do, like row group or
>>>>>>>>>>>>>>> stripe size in the columnar formats. Iceberg table configuration is used to
>>>>>>>>>>>>>>> configure table-specific concerns and behavior.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Spark configuration is used for engine-specific concerns,
>>>>>>>>>>>>>>> and runtime overrides. A good example of an engine-specific concern is the
>>>>>>>>>>>>>>> catalogs that are available to load Iceberg tables. Spark has a way to load
>>>>>>>>>>>>>>> and configure catalog implementations and Iceberg uses that for all
>>>>>>>>>>>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>>>>>>>>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>>>>>>>>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>>>>>>>>>>>> passed to the individual read. Note that these necessarily have different
>>>>>>>>>>>>>>> config names for how they are used: Iceberg uses
>>>>>>>>>>>>>>> read.split.target-size and the read-specific option is
>>>>>>>>>>>>>>> target-size.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Applying this to Hive is a little strange for a couple
>>>>>>>>>>>>>>> reasons. First, Hive’s engine configuration *is* a Hadoop
>>>>>>>>>>>>>>> Configuration. As a result, I think the right place to store
>>>>>>>>>>>>>>> engine-specific config is there, including Iceberg catalogs using a
>>>>>>>>>>>>>>> strategy similar to what Spark does: what external Iceberg catalogs are
>>>>>>>>>>>>>>> available and their configuration should come from the HiveConf.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The second way Hive is strange is that Hive needs to use its
>>>>>>>>>>>>>>> own MetaStore to track Hive table concerns. The MetaStore may have tables
>>>>>>>>>>>>>>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>>>>>>>>>>>>>>> tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - There should be a default HiveCatalog that uses the
>>>>>>>>>>>>>>>    current MetaStore URI to be used for HiveCatalog tables tracked in the
>>>>>>>>>>>>>>>    MetaStore
>>>>>>>>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>>>>>>>>    - HMS table properties should be used to determine how
>>>>>>>>>>>>>>>    to load a table: using a Hadoop location, using the default metastore
>>>>>>>>>>>>>>>    catalog, or using an external Iceberg catalog
>>>>>>>>>>>>>>>       - If there is a metadata_location, then use the
>>>>>>>>>>>>>>>       HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>>>>>>       - If there is a catalog property, then load that
>>>>>>>>>>>>>>>       catalog and use it to load the table identifier, or maybe an identifier
>>>>>>>>>>>>>>>       from HMS table properties
>>>>>>>>>>>>>>>       - If there is no catalog or metadata_location, then
>>>>>>>>>>>>>>>       use HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This would make it possible to access all types of Iceberg
>>>>>>>>>>>>>>> tables in the same query, and would match how Spark and Flink configure
>>>>>>>>>>>>>>> catalogs. Other than the configuration above, I don’t think that config in
>>>>>>>>>>>>>>> HMS should be used at all, like how the other engines work. Iceberg is the
>>>>>>>>>>>>>>> source of truth for table metadata, HMS stores how to load the Iceberg
>>>>>>>>>>>>>>> table, and HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This isn’t quite how configuration works right now.
>>>>>>>>>>>>>>> Currently, the catalog is controlled by a HiveConf property,
>>>>>>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables will be
>>>>>>>>>>>>>>> used to load table locations. If it is set, then that catalog will be used
>>>>>>>>>>>>>>> to load all tables by name. This makes it impossible to load tables from
>>>>>>>>>>>>>>> different catalogs at the same time. That’s why I think the Iceberg catalog
>>>>>>>>>>>>>>> for a table should be stored in HMS table properties.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I should also explain iceberg.hive.engine.enabled flag, but
>>>>>>>>>>>>>>> I think this is long enough for now.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would like to start a discussion, how should we handle
>>>>>>>>>>>>>>>> properties from various sources like Iceberg, Hive or global configuration.
>>>>>>>>>>>>>>>> I've put together a short document
>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>> Netflix
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>>
>>>
>>> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Iceberg/Hive properties handling

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I agree that #2 doesn’t really work. I also think that #4 can’t work
either. There is no way to add a prefix for HMS properties that already
exist, so the only option is to have a list of properties to suppress,
which is option #1.

I think that option #3 is a bad idea because it would lead to surprising
behavior for users. If a user creates a table using Hive DDL and sets table
properties, those properties should be present in the source of truth
Iceberg table. If a prefix was required to forward them to Iceberg, that
would create a situation where properties appear to be missing because the
user tried to use syntax that works for nearly every other table.

That leaves either option #1 or doing nothing. I actually think that
there’s a strong argument to do nothing here and allow Hive and Iceberg
properties to be mixed in the Iceberg table. Who cares if there are a few
extra properties from Hive? Users may expect those properties to be there
anyway.
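
To make the comparison concrete, option #1 boils down to something like the
following when copying DDL properties into the Iceberg table (the suppress
list here is only an example, not a proposed final set), and "doing nothing"
is the same code with an empty list:

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class HivePropertySuppressor {
  // Example HMS-managed keys that could be kept out of the Iceberg table.
  private static final Set<String> HMS_ONLY = Set.of(
      "EXTERNAL", "external.table.purge", "bucketing_version",
      "numRows", "rawDataSize", "totalSize", "numFiles", "numFileErasureCoded");

  // Copy everything the user set in the Hive DDL except the suppressed keys.
  static Map<String, String> toIcebergProps(Map<String, String> ddlProps) {
    Map<String, String> props = new HashMap<>(ddlProps);
    props.keySet().removeAll(HMS_ONLY);
    return props;
  }
}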

rb

On Mon, Dec 7, 2020 at 9:58 AM Jacques Nadeau <ja...@dremio.com> wrote:

Hey Peter, thanks for updating the doc and your heads up in the other
> thread on your capacity to look at this before EOY.
>
> I'm going to try to create a specification document based on the
> discussion document you put together. I think there is general consensus
> around what you call "Spark-like catalog configuration" so I'd like to
> formalize that more.
>
> It seems like there is less consensus around the whitelist/blacklist side
> of things. You outline four approaches:
>
>    1. Hard coded HMS only property list
>    2. Hard coded Iceberg only property list
>    3. Prefix for Iceberg properties
>    4. Prefix for HMS only properties
>
> I generally think #2 is a no-go as it creates too much coupling between
> catalog implementations and core iceberg. It seems like Ryan Blue would
> prefer #4 (correct?). Any other strong opinions?
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
>
> On Thu, Dec 3, 2020 at 9:27 AM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
>> As Jacques suggested (with the help of Zoltan) I have collected the
>> current state and the proposed solutions in a document:
>>
>> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
>>
>> My feeling is that we do not have a final decision, so I tried to list all
>> the possible solutions.
>> Please comment!
>>
>> Thanks,
>> Peter
>>
>> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
>>
>> When I was working on the CREATE TABLE patch I found the following
>> TBLPROPERTIES on newly created tables:
>>
>>    - external.table.purge
>>    - EXTERNAL
>>    - bucketing_version
>>    - numRows
>>    - rawDataSize
>>    - totalSize
>>    - numFiles
>>    - numFileErasureCoded
>>
>>
>> I am afraid that we cannot change the names of most of these properties,
>> and it might not be useful to have most of them alongside the Iceberg
>> statistics that are already there. Also, my feeling is that this is only
>> the tip of the Iceberg (pun intended :)), which is why I think we should
>> have a more targeted way to push properties to the Iceberg tables.
>>
>> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
>>
>> Sorry, I accidentally didn’t copy the dev list on this reply. Resending:
>>
>> Also I expect that we want to add Hive write specific configs to table
>> level when the general engine independent configuration is not ideal for
>> Hive, but every Hive query for a given table should use some specific
>> config.
>>
>> Hive may need configuration, but I think these should still be kept in
>> the Iceberg table. There is no reason to make Hive config inaccessible from
>> other engines. If someone wants to view all of the config for a table from
>> Spark, the Hive config should also be included right?
>>
>> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com> wrote:
>>
>>> I will ask Laszlo if he wants to update his doc.
>>>
>>> I see both pros and cons of catalog definition in config files. If there
>>> is an easy default then I do not mind any of the proposed solutions.
>>>
>>> OTOH I am in favor of the "use prefix for Iceberg table properties"
>>> solution, because in Hive it is common to add new keys to the property list
>>> - no restriction is in place (I am not even sure that the currently
>>> implemented blacklist that prevents properties from being propagated to
>>> Iceberg tables is complete). Also I expect that we want to add Hive write specific
>>> configs to table level when the general engine independent configuration is
>>> not ideal for Hive, but every Hive query for a given table should use some
>>> specific config.
>>>
>>> Thanks, Peter
>>>
>>> Jacques Nadeau <ja...@dremio.com> ezt írta (időpont: 2020. dec. 1.,
>>> Ke 17:06):
>>>
>>>> Would someone be willing to create a document that states the current
>>>> proposal?
>>>>
>>>> It is becoming somewhat difficult to follow this thread. I also worry
>>>> that without a complete statement of the current shape that people may be
>>>> incorrectly thinking they are in alignment.
>>>>
>>>>
>>>>
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
>>>>
>>>>
>>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <
>>>> boroknagyz@cloudera.com> wrote:
>>>>
>>>>> Thanks, Ryan. I answered inline.
>>>>>
>>>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>>> This sounds like a good plan overall, but I have a couple of notes:
>>>>>>
>>>>>>    1. We need to keep in mind that users plug in their own catalogs,
>>>>>>    so iceberg.catalog could be a Glue or Nessie catalog, not just
>>>>>>    Hive or Hadoop. I don’t think it makes much sense to use separate
>>>>>>    hadoop.catalog and hive.catalog values. Those should just be names for
>>>>>>    catalogs configured in Configuration, i.e., via hive-site.xml. We
>>>>>>    then only need a special value for loading Hadoop tables from paths.
>>>>>>
>>>>>> About extensibility, I think the usual Hive way is to use Java class
>>>>> names. So this way the value for 'iceberg.catalog' could be e.g.
>>>>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
>>>>> would need to have a factory method that constructs the catalog object from
>>>>> a properties object (Map<String, String>). E.g.
>>>>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
>>>>> 'iceberg.catalog_location' to be present in properties.
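
A minimal sketch of that class-name-based lookup, assuming a hypothetical configure hook on
the catalog implementations (the existing Catalogs class works differently, so treat the
names below as illustration only):

import java.util.Map;
import org.apache.iceberg.catalog.Catalog;

// Hypothetical resolver: 'iceberg.catalog' carries an implementation class name and the
// remaining table properties are handed to that implementation to configure itself.
public class CatalogResolver {

  // Hypothetical contract each catalog implementation would expose; e.g. a Hadoop catalog
  // would validate that 'iceberg.catalog_location' is present in the map.
  public interface Configurable {
    void configure(Map<String, String> props);
  }

  public static Catalog resolve(Map<String, String> props) {
    String impl = props.get("iceberg.catalog");
    try {
      Object catalog = Class.forName(impl).getDeclaredConstructor().newInstance();
      ((Configurable) catalog).configure(props);
      return (Catalog) catalog;
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot instantiate catalog: " + impl, e);
    }
  }
}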
>>>>>
>>>>>>
>>>>>>    1. I don’t think that catalog configuration should be kept in
>>>>>>    table properties. A catalog should not be loaded for each table. So I don’t
>>>>>>    think we need iceberg.catalog_location. Instead, we should have a
>>>>>>    way to define catalogs in the Configuration for tables in the
>>>>>>    metastore to reference.
>>>>>>
>>>>>>  I think it makes sense, on the other hand it would make adding new
>>>>> catalogs more heavy-weight, i.e. now you'd need to edit configuration files
>>>>> and restart/reinit services. Maybe it can be cumbersome in some
>>>>> environments.
>>>>>
>>>>>>
>>>>>>    1. I’d rather use a prefix to exclude properties from being
>>>>>>    passed to Iceberg than to include them. Otherwise, users don’t know what to
>>>>>>    do to pass table properties from Hive or Impala. If we exclude a prefix or
>>>>>>    specific properties, then everything but the properties reserved for
>>>>>>    locating the table are passed as the user would expect.
>>>>>>
>>>>>> I don't have a strong opinion about this, but yeah, maybe this
>>>>> behavior would cause the least surprises.
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <
>>>>>> boroknagyz@apache.org> wrote:
>>>>>>
>>>>>>> Thanks, Peter. I answered inline.
>>>>>>>
>>>>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <
>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi Zoltan,
>>>>>>>>
>>>>>>>> Answers below:
>>>>>>>>
>>>>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>>>>>> boroknagyz@cloudera.com.INVALID> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Thanks for the replies. My take for the above questions are as
>>>>>>>> follows
>>>>>>>>
>>>>>>>>    - Should 'iceberg.catalog' be a required property?
>>>>>>>>    - Yeah, I think it would be nice if this would be required to
>>>>>>>>       avoid any implicit behavior
>>>>>>>>
>>>>>>>> Currently we have a Catalogs class to get/initialize/use the
>>>>>>>> different Catalogs. At that time the decision was to use HadoopTables as a
>>>>>>>> default catalog.
>>>>>>>> It might be worthwhile to use the same class in Impala as well, so
>>>>>>>> the behavior is consistent.
>>>>>>>>
>>>>>>>
>>>>>>> Yeah, I think it'd be beneficial for us to use the Iceberg classes
>>>>>>> whenever possible. The Catalogs class is very similar to what we have
>>>>>>> currently in Impala.
>>>>>>>
>>>>>>>>
>>>>>>>>    - 'hadoop.catalog' LOCATION and catalog_location
>>>>>>>>       - In Impala we don't allow setting LOCATION for tables
>>>>>>>>       stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the
>>>>>>>>       Iceberg table's actual location. We were also thinking about using only the
>>>>>>>>       table LOCATION, and set it to the catalog location, but we also found it
>>>>>>>>       confusing.
>>>>>>>>
>>>>>>>> It could definitely work, but it is somewhat strange that we have
>>>>>>>> an external table location set to an arbitrary path, and we have a
>>>>>>>> different location generated by other configs. It would be nice to have the
>>>>>>>> real location set in the external table location as well.
>>>>>>>>
>>>>>>>
>>>>>>> Impala sets the real Iceberg table location for external tables.
>>>>>>> E.g. if the user issues
>>>>>>>
>>>>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>>>>> STORED AS ICEBERG
>>>>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>>>>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>>>>
>>>>>>> 'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>>>>
>>>>>>> If the end user had specified LOCATION, then Impala would have
>>>>>>> raised an error. But the above DDL statement is correct, so Impala loads
>>>>>>> the iceberg table via Iceberg API, then creates the HMS table and sets
>>>>>>> LOCATION to the Iceberg table location (something like
>>>>>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>>>>>
>>>>>>>
>>>>>>>> I like the flexibility of setting the table_identifier on table
>>>>>>>> level, which could help removing naming conflicts. We might want to have
>>>>>>>> this in the Iceberg Catalog implementation.
>>>>>>>>
>>>>>>>>
>>>>>>>>    - 'iceberg.table_identifier' for HiveCatalog
>>>>>>>>       - Yeah, it doesn't add much if we only allow using the
>>>>>>>>       current HMS. I think it can be only useful if we are allowing external
>>>>>>>>       HMSes.
>>>>>>>>    - Moving properties to SERDEPROPERTIES
>>>>>>>>       - I see that these properties are used by the SerDe classes
>>>>>>>>       in Hive, but I feel that these properties are just not about serialization
>>>>>>>>       and deserialization. And as I see the current SERDEPROPERTIES are things
>>>>>>>>       like 'field.delim', 'separatorChar', 'quoteChar', etc. So properties about
>>>>>>>>       table loading more naturally belong to TBLPROPERTIES in my opinion.
>>>>>>>>
>>>>>>>> I have seen it used both ways for HBaseSerDe. (even the wiki page
>>>>>>>> uses both :) ). Since Impala prefers TBLPROPERTIES and if we start using
>>>>>>>> prefix for separating real Iceberg table properties from other properties,
>>>>>>>> then we can keep it at TBLPROPERTIES.
>>>>>>>>
>>>>>>>
>>>>>>> In the google doc I also had a comment about prefixing iceberg table
>>>>>>> properties. We could use a prefix like 'iceberg.tblproperties.', and pass
>>>>>>> every property with this prefix to the Iceberg table. Currently Impala
>>>>>>> passes every table property to the Iceberg table.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>     Zoltan
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <
>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Based on the discussion below I understand we have the following
>>>>>>>>> kinds of properties:
>>>>>>>>>
>>>>>>>>>    1. Iceberg table properties - Engine independent, storage
>>>>>>>>>    related parameters
>>>>>>>>>    2. "how to get to" - I think these are mostly Hive table
>>>>>>>>>    specific properties, since for Spark, the Spark catalog configuration
>>>>>>>>>    serves for the same purpose. I think the best place for storing these would
>>>>>>>>>    be the Hive SERDEPROPERTIES, as this describes the access information for
>>>>>>>>>    the SerDe. Sidenote: I think we should decide if we allow
>>>>>>>>>    HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier'
>>>>>>>>>    would make sense only if we allow having multiple catalogs.
>>>>>>>>>    3. Query specific properties - These are engine specific and
>>>>>>>>>    might be mapped to / even override the Iceberg table properties on the
>>>>>>>>>    engine specific code paths, but currently these properties have independent
>>>>>>>>>    names and mapped on a case-by-case basis.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Based on this:
>>>>>>>>>
>>>>>>>>>    - Shall we move the "how to get to" properties to
>>>>>>>>>    SERDEPROPERTIES?
>>>>>>>>>    - Shall we define a prefix for setting Iceberg table
>>>>>>>>>    properties from Hive queries and omitting other engine specific properties?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I like these suggestions, comments inline below on the last
>>>>>>>>> round...
>>>>>>>>>
>>>>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <
>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>>>>>> information about table loading in HMS table properties. We are just a bit
>>>>>>>>>> more explicit about which catalog to use.
>>>>>>>>>> We have table property 'iceberg.catalog' to determine the catalog
>>>>>>>>>> type, right now the supported values are 'hadoop.tables', 'hadoop.catalog',
>>>>>>>>>> and 'hive.catalog'. Additional table properties can be set based on the
>>>>>>>>>> catalog type.
>>>>>>>>>>
>>>>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm all for renaming this, having "mr" in the property name is
>>>>>>>>> confusing.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - hadoop.tables
>>>>>>>>>>       - the table location is used to load the table
>>>>>>>>>>
>>>>>>>>>> The only question I have is should we have this as the default?
>>>>>>>>> i.e. if you don't set a catalog it will assume its HadoopTables and use the
>>>>>>>>> location? Or should we require this property to be here to be consistent
>>>>>>>>> and avoid any "magic"?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - hadoop.catalog
>>>>>>>>>>       - Required table property 'iceberg.catalog_location'
>>>>>>>>>>       specifies the location of the hadoop catalog in the file system
>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>       is used as table identifier
>>>>>>>>>>
>>>>>>>>>> I like this as it would allow you to use a different database and
>>>>>>>>> table name in Hive as opposed to the Hadoop Catalog - at the moment they
>>>>>>>>> have to match. The only thing here is that I think Hive requires a table
>>>>>>>>> LOCATION to be set and it's then confusing as there are now two locations
>>>>>>>>> on the table. I'm not sure whether in the Hive storage handler or SerDe
>>>>>>>>> etc. we can get Hive to not require that and maybe even disallow it from
>>>>>>>>> being set. That would probably be best in conjunction with this. Another
>>>>>>>>> solution would be to not have the 'iceberg.catalog_location' property but
>>>>>>>>> instead use the table LOCATION for this but that's a bit confusing from a
>>>>>>>>> Hive point of view.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>    - hive.catalog
>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>       is used as table identifier
>>>>>>>>>>       - We have the assumption that the current Hive metastore
>>>>>>>>>>       stores the table, i.e. we don't support external Hive metastores currently
>>>>>>>>>>
>>>>>>>>>> These sound fine for Hive catalog tables that are created outside
>>>>>>>>> of the automatic Hive table creation (see
>>>>>>>>> https://iceberg.apache.org/hive/ -> Using Hive Catalog) we'd just
>>>>>>>>> need to document how you can create these yourself and that one could use a
>>>>>>>>> different Hive database and table etc.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Independent of catalog implementations, but we also have table
>>>>>>>>>> property 'iceberg.file_format' to specify the file format for the data
>>>>>>>>>> files.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> OK, I don't think we need that for Hive?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> We haven't released it yet, so we are open to changes, but I
>>>>>>>>>> think these properties are reasonable and it would be great if we could
>>>>>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>>>>>> metastore of tables.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> If others agree I think we should create an issue where we
>>>>>>>>> document the above changes so it's very clear what we're doing and can then
>>>>>>>>> go and implement them and update the docs etc.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>     Zoltan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <
>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>>>>
>>>>>>>>>>> #4 is correct because we provide some information that is
>>>>>>>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>>>>>>>> current user). I also agree that it would be good to have a table
>>>>>>>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>>>>>>>> gives us a way to handle name conflicts.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <
>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>>>>
>>>>>>>>>>>> db1.table1_etl_branch =>
>>>>>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <
>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I agree with Ryan on the core principles here. As I understand
>>>>>>>>>>>>> them:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>>>>>>>>>    2. Hive table properties describe "how to get to" Iceberg
>>>>>>>>>>>>>    metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>>>>>>    3. There could be default "how to get to" information set
>>>>>>>>>>>>>    at a global level
>>>>>>>>>>>>>    4. Best-effort schema should be stored in the table
>>>>>>>>>>>>>    properties in HMS. This should be done for information schema retrieval
>>>>>>>>>>>>>    purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is that a fair summary of your statements Ryan (except 4,
>>>>>>>>>>>>> which I just added)?
>>>>>>>>>>>>>
>>>>>>>>>>>>> One comment I have on #2 is that for different catalogs and
>>>>>>>>>>>>> use cases, I think it can be somewhat more complex where it would be
>>>>>>>>>>>>> desirable for a table that initially existed without Hive that was later
>>>>>>>>>>>>> exposed in Hive to support a ptr/path/token for how the table is named
>>>>>>>>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>>>>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>>>>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>>>>>>>>
>>>>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>>>>
>>>>>>>>>>>>> Similarly, you might want to expose a particular branch
>>>>>>>>>>>>> version of a table. So it might say:
>>>>>>>>>>>>>
>>>>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just saying that the address to the table in the catalog could
>>>>>>>>>>>>> itself have several properties. The key being that no matter what those
>>>>>>>>>>>>> are, we should follow #1 and only store properties that are about the ptr,
>>>>>>>>>>>>> not the content/metadata.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can
>>>>>>>>>>>>> someone confirm that it is true? And that it is possible/not problematic?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking about
>>>>>>>>>>>>>> these problems as well, so this is a good time to have a discussion about
>>>>>>>>>>>>>> Hive config.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think that Hive configuration should work mostly like other
>>>>>>>>>>>>>> engines, where different configurations are used for different purposes.
>>>>>>>>>>>>>> Different purposes means that there is not a global configuration priority.
>>>>>>>>>>>>>> Hopefully, I can explain how we use the different config sources elsewhere
>>>>>>>>>>>>>> to clarify.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has
>>>>>>>>>>>>>> a Hadoop Configuration, but it also has its own global configuration. There
>>>>>>>>>>>>>> are also Iceberg table properties, and all of the various Hive properties
>>>>>>>>>>>>>> if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The first step is to simplify where we can, so we effectively
>>>>>>>>>>>>>> eliminate 2 sources of config:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - The Hadoop Configuration is only used to instantiate
>>>>>>>>>>>>>>    Hadoop classes, like FileSystem. Iceberg should not use it for any other
>>>>>>>>>>>>>>    config.
>>>>>>>>>>>>>>    - Config in the Hive MetaStore is only used to identify
>>>>>>>>>>>>>>    that a table is Iceberg and point to its metadata location. All other
>>>>>>>>>>>>>>    config in HMS is informational. For example, the input format is
>>>>>>>>>>>>>>    FileInputFormat so that non-Iceberg readers cannot actually instantiate the
>>>>>>>>>>>>>>    format (it’s abstract) but it is available so they also don’t fail trying
>>>>>>>>>>>>>>    to load the class. Table-specific config should not be stored in table or
>>>>>>>>>>>>>>    serde properties.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That leaves Spark configuration and Iceberg table
>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Iceberg differs from other tables because it is opinionated:
>>>>>>>>>>>>>> data configuration should be maintained at the table level. This is cleaner
>>>>>>>>>>>>>> for users because config is standardized across engines and in one place.
>>>>>>>>>>>>>> And it also enables services that analyze a table and update its
>>>>>>>>>>>>>> configuration to tune options that users almost never do, like row group or
>>>>>>>>>>>>>> stripe size in the columnar formats. Iceberg table configuration is used to
>>>>>>>>>>>>>> configure table-specific concerns and behavior.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Spark configuration is used for engine-specific concerns, and
>>>>>>>>>>>>>> runtime overrides. A good example of an engine-specific concern is the
>>>>>>>>>>>>>> catalogs that are available to load Iceberg tables. Spark has a way to load
>>>>>>>>>>>>>> and configure catalog implementations and Iceberg uses that for all
>>>>>>>>>>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>>>>>>>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>>>>>>>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>>>>>>>>>>> passed to the individual read. Note that these necessarily have different
>>>>>>>>>>>>>> config names for how they are used: Iceberg uses
>>>>>>>>>>>>>> read.split.target-size and the read-specific option is
>>>>>>>>>>>>>> target-size.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Applying this to Hive is a little strange for a couple
>>>>>>>>>>>>>> reasons. First, Hive’s engine configuration *is* a Hadoop
>>>>>>>>>>>>>> Configuration. As a result, I think the right place to store
>>>>>>>>>>>>>> engine-specific config is there, including Iceberg catalogs using a
>>>>>>>>>>>>>> strategy similar to what Spark does: what external Iceberg catalogs are
>>>>>>>>>>>>>> available and their configuration should come from the HiveConf.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The second way Hive is strange is that Hive needs to use its
>>>>>>>>>>>>>> own MetaStore to track Hive table concerns. The MetaStore may have tables
>>>>>>>>>>>>>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>>>>>>>>>>>>>> tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - There should be a default HiveCatalog that uses the
>>>>>>>>>>>>>>    current MetaStore URI to be used for HiveCatalog tables tracked in the
>>>>>>>>>>>>>>    MetaStore
>>>>>>>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>>>>>>>    - HMS table properties should be used to determine how to
>>>>>>>>>>>>>>    load a table: using a Hadoop location, using the default metastore catalog,
>>>>>>>>>>>>>>    or using an external Iceberg catalog
>>>>>>>>>>>>>>       - If there is a metadata_location, then use the
>>>>>>>>>>>>>>       HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>>>>>       - If there is a catalog property, then load that
>>>>>>>>>>>>>>       catalog and use it to load the table identifier, or maybe an identifier
>>>>>>>>>>>>>>       from HMS table properties
>>>>>>>>>>>>>>       - If there is no catalog or metadata_location, then
>>>>>>>>>>>>>>       use HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This would make it possible to access all types of Iceberg
>>>>>>>>>>>>>> tables in the same query, and would match how Spark and Flink configure
>>>>>>>>>>>>>> catalogs. Other than the configuration above, I don’t think that config in
>>>>>>>>>>>>>> HMS should be used at all, like how the other engines work. Iceberg is the
>>>>>>>>>>>>>> source of truth for table metadata, HMS stores how to load the Iceberg
>>>>>>>>>>>>>> table, and HiveConf defines the catalogs (or runtime overrides).
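
To make the "catalogs defined in HiveConf" idea concrete, one possible shape is a per-catalog
key namespace in hive-site.xml plus a small lookup helper; the key names and the helper below
are hypothetical illustrations of the proposal, not an agreed or existing API:

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.hive.HiveCatalog;

// Hypothetical: hive-site.xml carries per-catalog configuration, e.g.
//   iceberg.catalog.prod.type      = hive
//   iceberg.catalog.lake.type      = hadoop
//   iceberg.catalog.lake.warehouse = hdfs://nameservice/warehouse
// HMS table properties then only have to name the catalog ('prod', 'lake', ...).
public class HiveConfCatalogs {
  public static Catalog load(Configuration conf, String name) {
    String prefix = "iceberg.catalog." + name + ".";
    if ("hadoop".equals(conf.get(prefix + "type"))) {
      return new HadoopCatalog(conf, conf.get(prefix + "warehouse"));
    }
    // Fall back to the metastore-backed catalog of the current cluster.
    return new HiveCatalog(conf);
  }
}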
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This isn’t quite how configuration works right now.
>>>>>>>>>>>>>> Currently, the catalog is controlled by a HiveConf property,
>>>>>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables will be
>>>>>>>>>>>>>> used to load table locations. If it is set, then that catalog will be used
>>>>>>>>>>>>>> to load all tables by name. This makes it impossible to load tables from
>>>>>>>>>>>>>> different catalogs at the same time. That’s why I think the Iceberg catalog
>>>>>>>>>>>>>> for a table should be stored in HMS table properties.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I should also explain iceberg.hive.engine.enabled flag, but
>>>>>>>>>>>>>> I think this is long enough for now.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would like to start a discussion, how should we handle
>>>>>>>>>>>>>>> properties from various sources like Iceberg, Hive or global configuration.
>>>>>>>>>>>>>>> I've put together a short document
>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> Netflix
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>>
>>
>> --
Ryan Blue
Software Engineer
Netflix

Re: Iceberg/Hive properties handling

Posted by Peter Vary <pv...@cloudera.com.INVALID>.
Hi Jacques,

Thanks for picking this up!

Coming from the Hive side I prefer solution #3.
Knowing Hive, there will be plenty of surprising new properties which we either handle
specifically (hard-coded omission) or which will litter the quite neat list of Iceberg table
properties. This will become even more pronounced if we implement HiveMetaHook.preAlterTable
<https://hive.apache.org/javadocs/r3.1.2/api/org/apache/hadoop/hive/metastore/HiveMetaHook.html#preAlterTable-org.apache.hadoop.hive.metastore.api.Table-org.apache.hadoop.hive.metastore.api.EnvironmentContext->
for propagating Hive ALTER TABLE changes to Iceberg tables. ALTER TABLE is called by Hive
every time it updates statistics on the table - if we forget to filter out anything that
changes continuously, we would end up creating two modifications on the Iceberg table every
time we insert a single row of data.
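
To make solution #3 concrete: with a reserved prefix, a preAlterTable-style hook only has to
look at the keys it recognizes, so the continuously changing statistics never trigger an
Iceberg commit. A rough sketch (the 'iceberg.tblproperties.' prefix is borrowed from Zoltan's
earlier suggestion; the hook wiring is simplified and hypothetical):

import java.util.HashMap;
import java.util.Map;

// Hypothetical extract of a preAlterTable-style hook: only keys carrying the reserved
// prefix are forwarded to Iceberg; HMS bookkeeping such as numRows or totalSize is ignored.
public class PrefixedPropertySync {
  private static final String PREFIX = "iceberg.tblproperties.";

  static Map<String, String> icebergUpdates(Map<String, String> alteredHmsProps) {
    Map<String, String> updates = new HashMap<>();
    for (Map.Entry<String, String> entry : alteredHmsProps.entrySet()) {
      if (entry.getKey().startsWith(PREFIX)) {
        updates.put(entry.getKey().substring(PREFIX.length()), entry.getValue());
      }
    }
    return updates;  // empty map => no extra commit on the Iceberg table
  }
}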

Thanks,
Peter



Re: Iceberg/Hive properties handling

Posted by Jacques Nadeau <ja...@dremio.com>.
Hey Peter, thanks for updating the doc and your heads up in the other
thread on your capacity to look at this before EOY.

I'm going to try to create a specification document based on the discussion
document you put together. I think there is general consensus around what
you call "Spark-like catalog configuration" so I'd like to formalize that
more.

It seems like there is less consensus around the whitelist/blacklist side
of things. You outline four approaches:

   1. Hard coded HMS only property list
   2. Hard coded Iceberg only property list
   3. Prefix for Iceberg properties
   4. Prefix for HMS only properties

I generally think #2 is a no-go as it creates too much coupling between
catalog implementations and core iceberg. It seems like Ryan Blue would
prefer #4 (correct?). Any other strong opinions?
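>
If it helps the comparison: #3 and #4 differ only in which side carries the prefix. A tiny
hypothetical sketch of #4, where keys marked with a made-up HMS-only prefix stay in the
metastore and everything else is forwarded to Iceberg:

import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical option #4: properties under a reserved HMS-only prefix never reach Iceberg;
// every other user-set property is passed through to the Iceberg table.
public class HmsPrefixFilter {
  private static final String HMS_ONLY_PREFIX = "hms.";  // illustrative name only

  static Map<String, String> toIceberg(Map<String, String> hmsProps) {
    return hmsProps.entrySet().stream()
        .filter(e -> !e.getKey().startsWith(HMS_ONLY_PREFIX))
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
  }
}

The catch, raised elsewhere in this thread, is that Hive's existing bookkeeping keys
(numRows, totalSize, ...) carry no such prefix today, so they would still leak through.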
--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Thu, Dec 3, 2020 at 9:27 AM Peter Vary <pv...@cloudera.com.invalid>
wrote:

> As Jacques suggested (with the help of Zoltan) I have collected the
> current state and the proposed solutions in a document:
>
> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
>
> My feeling is that we do not have a final decision, so I tried to list all
> the possible solutions.
> Please comment!
>
> Thanks,
> Peter
>
> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
>
> When I was working on the CREATE TABLE patch I found the following
> TBLPROPERTIES on newly created tables:
>
>    - external.table.purge
>    - EXTERNAL
>    - bucketing_version
>    - numRows
>    - rawDataSize
>    - totalSize
>    - numFiles
>    - numFileErasureCoded
>
>
> I am afraid that we cannot change the names of most of these properties,
> and it might not be useful to have most of them alongside the Iceberg
> statistics that are already there. Also, my feeling is that this is only
> the tip of the iceberg (pun intended :)), so I think we need a more
> targeted way to push properties to the Iceberg tables.
>
> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
>
> Sorry, I accidentally didn’t copy the dev list on this reply. Resending:
>
> Also I expect that we want to add Hive write specific configs to table
> level when the general engine independent configuration is not ideal for
> Hive, but every Hive query for a given table should use some specific
> config.
>
> Hive may need configuration, but I think these should still be kept in the
> Iceberg table. There is no reason to make Hive config inaccessible from
> other engines. If someone wants to view all of the config for a table from
> Spark, the Hive config should also be included right?
>
> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com> wrote:
>
>> I will ask Laszlo if he wants to update his doc.
>>
>> I see both pros and cons of catalog definition in config files. If there
>> is an easy default then I do not mind any of the proposed solutions.
>>
>> OTOH I am in favor of the "use prefix for Iceberg table properties"
>> solution, because in Hive it is common to add new keys to the property list
>> - no restriction is in place (I am not even sure that the currently
>> implemented blacklist that prevents properties from being propagated to
>> Iceberg tables is complete). Also I expect that we want to add Hive write specific
>> configs to table level when the general engine independent configuration is
>> not ideal for Hive, but every Hive query for a given table should use some
>> specific config.
>>
>> Thanks, Peter
>>
>> Jacques Nadeau <ja...@dremio.com> ezt írta (időpont: 2020. dec. 1., Ke
>> 17:06):
>>
>>> Would someone be willing to create a document that states the current
>>> proposal?
>>>
>>> It is becoming somewhat difficult to follow this thread. I also worry
>>> that without a complete statement of the current shape that people may be
>>> incorrectly thinking they are in alignment.
>>>
>>>
>>>
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>>
>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <
>>> boroknagyz@cloudera.com> wrote:
>>>
>>>> Thanks, Ryan. I answered inline.
>>>>
>>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> This sounds like a good plan overall, but I have a couple of notes:
>>>>>
>>>>>    1. We need to keep in mind that users plug in their own catalogs,
>>>>>    so iceberg.catalog could be a Glue or Nessie catalog, not just
>>>>>    Hive or Hadoop. I don’t think it makes much sense to use separate
>>>>>    hadoop.catalog and hive.catalog values. Those should just be names for
>>>>>    catalogs configured in Configuration, i.e., via hive-site.xml. We
>>>>>    then only need a special value for loading Hadoop tables from paths.
>>>>>
>>>>> About extensibility, I think the usual Hive way is to use Java class
>>>> names. So this way the value for 'iceberg.catalog' could be e.g.
>>>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
>>>> would need to have a factory method that constructs the catalog object from
>>>> a properties object (Map<String, String>). E.g.
>>>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
>>>> 'iceberg.catalog_location' to be present in properties.
>>>>
>>>>>
>>>>>    1. I don’t think that catalog configuration should be kept in
>>>>>    table properties. A catalog should not be loaded for each table. So I don’t
>>>>>    think we need iceberg.catalog_location. Instead, we should have a
>>>>>    way to define catalogs in the Configuration for tables in the
>>>>>    metastore to reference.
>>>>>
>>>>>  I think it makes sense, on the other hand it would make adding new
>>>> catalogs more heavy-weight, i.e. now you'd need to edit configuration files
>>>> and restart/reinit services. Maybe it can be cumbersome in some
>>>> environments.
>>>>
>>>>>
>>>>>    1. I’d rather use a prefix to exclude properties from being passed
>>>>>    to Iceberg than to include them. Otherwise, users don’t know what to do to
>>>>>    pass table properties from Hive or Impala. If we exclude a prefix or
>>>>>    specific properties, then everything but the properties reserved for
>>>>>    locating the table are passed as the user would expect.
>>>>>
>>>>> I don't have a strong opinion about this, but yeah, maybe this
>>>> behavior would cause the least surprises.
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <
>>>>> boroknagyz@apache.org> wrote:
>>>>>
>>>>>> Thanks, Peter. I answered inline.
>>>>>>
>>>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <
>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>
>>>>>>> Hi Zoltan,
>>>>>>>
>>>>>>> Answers below:
>>>>>>>
>>>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>>>>> boroknagyz@cloudera.com.INVALID> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Thanks for the replies. My takes on the above questions are as
>>>>>>> follows
>>>>>>>
>>>>>>>    - Should 'iceberg.catalog' be a required property?
>>>>>>>    - Yeah, I think it would be nice if this would be required to
>>>>>>>       avoid any implicit behavior
>>>>>>>
>>>>>>> Currently we have a Catalogs class to get/initialize/use the
>>>>>>> different Catalogs. At that time the decision was to use HadoopTables as a
>>>>>>> default catalog.
>>>>>>> It might be worthwhile to use the same class in Impala as well, so
>>>>>>> the behavior is consistent.
>>>>>>>
>>>>>>
>>>>>> Yeah, I think it'd be beneficial for us to use the Iceberg classes
>>>>>> whenever possible. The Catalogs class is very similar to what we have
>>>>>> currently in Impala.
>>>>>>
>>>>>>>
>>>>>>>    - 'hadoop.catalog' LOCATION and catalog_location
>>>>>>>       - In Impala we don't allow setting LOCATION for tables stored
>>>>>>>       in 'hadoop.catalog'. But Impala internally sets LOCATION to the Iceberg
>>>>>>>       table's actual location. We were also thinking about using only the table
>>>>>>>       LOCATION, and set it to the catalog location, but we also found it
>>>>>>>       confusing.
>>>>>>>
>>>>>>> It could definitely work, but it is somewhat strange that we have an
>>>>>>> external table location set to an arbitrary path, and we have a different
>>>>>>> location generated by other configs. It would be nice to have the real
>>>>>>> location set in the external table location as well.
>>>>>>>
>>>>>>
>>>>>> Impala sets the real Iceberg table location for external tables. E.g.
>>>>>> if the user issues
>>>>>>
>>>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>>>> STORED AS ICEBERG
>>>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>>>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>>>
>>>>>> 'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>>>
>>>>>> If the end user had specified LOCATION, then Impala would have raised
>>>>>> an error. But the above DDL statement is correct, so Impala loads the
>>>>>> Iceberg table via the Iceberg API, then creates the HMS table and sets LOCATION
>>>>>> to the Iceberg table location (something like
>>>>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>>>>
>>>>>>
>>>>>>> I like the flexibility of setting the table_identifier at the table
>>>>>>> level, which could help remove naming conflicts. We might want to have
>>>>>>> this in the Iceberg Catalog implementation.
>>>>>>>
>>>>>>>
>>>>>>>    - 'iceberg.table_identifier' for HiveCatalog
>>>>>>>       - Yeah, it doesn't add much if we only allow using the
>>>>>>>       current HMS. I think it can be only useful if we are allowing external
>>>>>>>       HMSes.
>>>>>>>    - Moving properties to SERDEPROPERTIES
>>>>>>>       - I see that these properties are used by the SerDe classes
>>>>>>>       in Hive, but I feel that these properties are just not about serialization
>>>>>>>       and deserialization. And as I see the current SERDEPROPERTIES are things
>>>>>>>       like 'field.delim', 'separatorChar', 'quoteChar', etc. So properties about
>>>>>>>       table loading more naturally belong to TBLPROPERTIES in my opinion.
>>>>>>>
>>>>>>> I have seen it used both ways for HBaseSerDe (even the wiki page
>>>>>>> uses both :) ). Since Impala prefers TBLPROPERTIES, and if we start using a
>>>>>>> prefix for separating real Iceberg table properties from other properties,
>>>>>>> then we can keep it in TBLPROPERTIES.
>>>>>>>
>>>>>>
>>>>>> In the google doc I also had a comment about prefixing iceberg table
>>>>>> properties. We could use a prefix like 'iceberg.tblproperties.', and pass
>>>>>> every property with this prefix to the Iceberg table. Currently Impala
>>>>>> passes every table property to the Iceberg table.
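>>>>>>
>>>>>> (A minimal sketch of how such an include-prefix could be applied when
>>>>>> copying HMS properties to the Iceberg table; 'iceberg.tblproperties.' is
>>>>>> only the prefix proposed here, nothing like this is implemented yet:)
>>>>>>
>>>>>> import java.util.HashMap;
>>>>>> import java.util.Map;
>>>>>>
>>>>>> class PrefixedProps {
>>>>>>   private static final String PREFIX = "iceberg.tblproperties.";
>>>>>>
>>>>>>   // Only properties carrying the prefix are forwarded, with the prefix
>>>>>>   // stripped, e.g. 'iceberg.tblproperties.write.format.default' becomes
>>>>>>   // the Iceberg table property 'write.format.default'.
>>>>>>   static Map<String, String> toIceberg(Map<String, String> hmsProps) {
>>>>>>     Map<String, String> result = new HashMap<>();
>>>>>>     hmsProps.forEach((key, value) -> {
>>>>>>       if (key.startsWith(PREFIX)) {
>>>>>>         result.put(key.substring(PREFIX.length()), value);
>>>>>>       }
>>>>>>     });
>>>>>>     return result;
>>>>>>   }
>>>>>> }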
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>     Zoltan
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <
>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Based on the discussion below I understand we have the following
>>>>>>>> kinds of properties:
>>>>>>>>
>>>>>>>>    1. Iceberg table properties - Engine independent, storage
>>>>>>>>    related parameters
>>>>>>>>    2. "how to get to" - I think these are mostly Hive table
>>>>>>>>    specific properties, since for Spark, the Spark catalog configuration
>>>>>>>>    serves the same purpose. I think the best place for storing these would
>>>>>>>>    be the Hive SERDEPROPERTIES, as this describes the access information for
>>>>>>>>    the SerDe. Sidenote: I think we should decide if we allow
>>>>>>>>    HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier'
>>>>>>>>    would make sense only if we allow having multiple catalogs.
>>>>>>>>    3. Query-specific properties - These are engine-specific and
>>>>>>>>    might be mapped to / even override the Iceberg table properties on the
>>>>>>>>    engine-specific code paths, but currently these properties have independent
>>>>>>>>    names and are mapped on a case-by-case basis.
>>>>>>>>
>>>>>>>>
>>>>>>>> Based on this:
>>>>>>>>
>>>>>>>>    - Shall we move the "how to get to" properties to
>>>>>>>>    SERDEPROPERTIES?
>>>>>>>>    - Shall we define a prefix for setting Iceberg table properties
>>>>>>>>    from Hive queries while omitting other engine-specific properties?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Peter
>>>>>>>>
>>>>>>>>
>>>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I like these suggestions, comments inline below on the last round...
>>>>>>>>
>>>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <
>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>>>>> information about table loading in HMS table properties. We are just a bit
>>>>>>>>> more explicit about which catalog to use.
>>>>>>>>> We have table property 'iceberg.catalog' to determine the catalog
>>>>>>>>> type, right now the supported values are 'hadoop.tables', 'hadoop.catalog',
>>>>>>>>> and 'hive.catalog'. Additional table properties can be set based on the
>>>>>>>>> catalog type.
>>>>>>>>>
>>>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>>>
>>>>>>>>
>>>>>>>> I'm all for renaming this; having "mr" in the property name is
>>>>>>>> confusing.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - hadoop.tables
>>>>>>>>>       - the table location is used to load the table
>>>>>>>>>
>>>>>>>>> The only question I have is should we have this as the default?
>>>>>>>> i.e. if you don't set a catalog it will assume it's HadoopTables and use the
>>>>>>>> location? Or should we require this property to be here to be consistent
>>>>>>>> and avoid any "magic"?
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - hadoop.catalog
>>>>>>>>>       - Required table property 'iceberg.catalog_location'
>>>>>>>>>       specifies the location of the hadoop catalog in the file system
>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>       is used as table identifier
>>>>>>>>>
>>>>>>>>> I like this as it would allow you to use a different database and
>>>>>>>> table name in Hive as opposed to the Hadoop Catalog - at the moment they
>>>>>>>> have to match. The only thing here is that I think Hive requires a table
>>>>>>>> LOCATION to be set and it's then confusing as there are now two locations
>>>>>>>> on the table. I'm not sure whether in the Hive storage handler or SerDe
>>>>>>>> etc. we can get Hive to not require that and maybe even disallow it from
>>>>>>>> being set. That would probably be best in conjunction with this. Another
>>>>>>>> solution would be to not have the 'iceberg.catalog_location' property but
>>>>>>>> instead use the table LOCATION for this but that's a bit confusing from a
>>>>>>>> Hive point of view.
>>>>>>>>
>>>>>>>>
>>>>>>>>>    - hive.catalog
>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>       is used as table identifier
>>>>>>>>>       - We have the assumption that the current Hive metastore
>>>>>>>>>       stores the table, i.e. we don't support external Hive metastores currently
>>>>>>>>>
>>>>>>>>> These sound fine for Hive catalog tables that are created outside
>>>>>>>> of the automatic Hive table creation (see
>>>>>>>> https://iceberg.apache.org/hive/ -> Using Hive Catalog) we'd just
>>>>>>>> need to document how you can create these yourself and that one could use a
>>>>>>>> different Hive database and table etc.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Independent of catalog implementations, but we also have table
>>>>>>>>> property 'iceberg.file_format' to specify the file format for the data
>>>>>>>>> files.
>>>>>>>>>
>>>>>>>>
>>>>>>>> OK, I don't think we need that for Hive?
>>>>>>>>
>>>>>>>>
>>>>>>>>> We haven't released it yet, so we are open to changes, but I think
>>>>>>>>> these properties are reasonable and it would be great if we could
>>>>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>>>>> metastore of tables.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> If others agree I think we should create an issue where we document
>>>>>>>> the above changes so it's very clear what we're doing and can then go and
>>>>>>>> implement them and update the docs etc.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>     Zoltan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <
>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>>>
>>>>>>>>>> #4 is correct because we provide some information that is
>>>>>>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>>>>>>> current user). I also agree that it would be good to have a table
>>>>>>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>>>>>>> gives us a way to handle name conflicts.
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <
>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>>>
>>>>>>>>>>> db1.table1_etl_branch =>
>>>>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <
>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I agree with Ryan on the core principles here. As I understand
>>>>>>>>>>>> them:
>>>>>>>>>>>>
>>>>>>>>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>>>>>>>>    2. Hive table properties describe "how to get to" Iceberg
>>>>>>>>>>>>    metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>>>>>    3. There could be default "how to get to" information set
>>>>>>>>>>>>    at a global level
>>>>>>>>>>>>    4. Best-effort schema should be stored in the table
>>>>>>>>>>>>    properties in HMS. This should be done for information schema retrieval
>>>>>>>>>>>>    purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>>>>>
>>>>>>>>>>>> Is that a fair summary of your statements Ryan (except 4, which
>>>>>>>>>>>> I just added)?
>>>>>>>>>>>>
>>>>>>>>>>>> One comment I have on #2 is that for different catalogs and use
>>>>>>>>>>>> cases, I think it can be somewhat more complex: it would be
>>>>>>>>>>>> desirable for a table that initially existed without Hive and was later
>>>>>>>>>>>> exposed in Hive to support a ptr/path/token for how the table is named
>>>>>>>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>>>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>>>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>>>>>>>
>>>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>>>
>>>>>>>>>>>> Similarly, you might want to expose a particular branch version
>>>>>>>>>>>> of a table. So it might say:
>>>>>>>>>>>>
>>>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>>>
>>>>>>>>>>>> Just saying that the address to the table in the catalog could
>>>>>>>>>>>> itself have several properties. The key being that no matter what those
>>>>>>>>>>>> are, we should follow #1 and only store properties that are about the ptr,
>>>>>>>>>>>> not the content/metadata.
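>>>>>>>>>>>>
>>>>>>>>>>>> (To make the "pointer only" idea concrete, a hypothetical set of HMS
>>>>>>>>>>>> table properties for the Nessie mapping above; the key names are
>>>>>>>>>>>> invented for illustration and shown as a jshell-style Java fragment:)
>>>>>>>>>>>>
>>>>>>>>>>>> import java.util.Map;
>>>>>>>>>>>>
>>>>>>>>>>>> // Only "how to get to" information -- no schema, no table config.
>>>>>>>>>>>> Map<String, String> howToGetTo = Map.of(
>>>>>>>>>>>>     "iceberg.catalog", "nessie",
>>>>>>>>>>>>     "iceberg.table_identifier", "folder1.folder2.folder3.table1",
>>>>>>>>>>>>     "iceberg.reference", "etl_branch");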
>>>>>>>>>>>>
>>>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can
>>>>>>>>>>>> someone confirm that it is true? And that it is possible/not problematic?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking about
>>>>>>>>>>>>> these problems as well, so this is a good time to have a discussion about
>>>>>>>>>>>>> Hive config.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think that Hive configuration should work mostly like other
>>>>>>>>>>>>> engines, where different configurations are used for different purposes.
>>>>>>>>>>>>> Different purposes means that there is not a global configuration priority.
>>>>>>>>>>>>> Hopefully, I can explain how we use the different config sources elsewhere
>>>>>>>>>>>>> to clarify.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a
>>>>>>>>>>>>> Hadoop Configuration, but it also has its own global configuration. There
>>>>>>>>>>>>> are also Iceberg table properties, and all of the various Hive properties
>>>>>>>>>>>>> if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The first step is to simplify where we can, so we effectively
>>>>>>>>>>>>> eliminate 2 sources of config:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - The Hadoop Configuration is only used to instantiate
>>>>>>>>>>>>>    Hadoop classes, like FileSystem. Iceberg should not use it for any other
>>>>>>>>>>>>>    config.
>>>>>>>>>>>>>    - Config in the Hive MetaStore is only used to identify
>>>>>>>>>>>>>    that a table is Iceberg and point to its metadata location. All other
>>>>>>>>>>>>>    config in HMS is informational. For example, the input format is
>>>>>>>>>>>>>    FileInputFormat so that non-Iceberg readers cannot actually instantiate the
>>>>>>>>>>>>>    format (it’s abstract) but it is available so they also don’t fail trying
>>>>>>>>>>>>>    to load the class. Table-specific config should not be stored in table or
>>>>>>>>>>>>>    serde properties.
>>>>>>>>>>>>>
>>>>>>>>>>>>> That leaves Spark configuration and Iceberg table
>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Iceberg differs from other tables because it is opinionated:
>>>>>>>>>>>>> data configuration should be maintained at the table level. This is cleaner
>>>>>>>>>>>>> for users because config is standardized across engines and in one place.
>>>>>>>>>>>>> And it also enables services that analyze a table and update its
>>>>>>>>>>>>> configuration to tune options that users almost never do, like row group or
>>>>>>>>>>>>> stripe size in the columnar formats. Iceberg table configuration is used to
>>>>>>>>>>>>> configure table-specific concerns and behavior.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Spark configuration is used for engine-specific concerns, and
>>>>>>>>>>>>> runtime overrides. A good example of an engine-specific concern is the
>>>>>>>>>>>>> catalogs that are available to load Iceberg tables. Spark has a way to load
>>>>>>>>>>>>> and configure catalog implementations and Iceberg uses that for all
>>>>>>>>>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>>>>>>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>>>>>>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>>>>>>>>>> passed to the individual read. Note that these necessarily have different
>>>>>>>>>>>>> config names for how they are used: Iceberg uses
>>>>>>>>>>>>> read.split.target-size and the read-specific option is
>>>>>>>>>>>>> target-size.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Applying this to Hive is a little strange for a couple
>>>>>>>>>>>>> reasons. First, Hive’s engine configuration *is* a Hadoop
>>>>>>>>>>>>> Configuration. As a result, I think the right place to store
>>>>>>>>>>>>> engine-specific config is there, including Iceberg catalogs using a
>>>>>>>>>>>>> strategy similar to what Spark does: what external Iceberg catalogs are
>>>>>>>>>>>>> available and their configuration should come from the HiveConf.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The second way Hive is strange is that Hive needs to use its
>>>>>>>>>>>>> own MetaStore to track Hive table concerns. The MetaStore may have tables
>>>>>>>>>>>>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>>>>>>>>>>>>> tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - There should be a default HiveCatalog that uses the
>>>>>>>>>>>>>    current MetaStore URI to be used for HiveCatalog tables tracked in the
>>>>>>>>>>>>>    MetaStore
>>>>>>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>>>>>>    - HMS table properties should be used to determine how to
>>>>>>>>>>>>>    load a table: using a Hadoop location, using the default metastore catalog,
>>>>>>>>>>>>>    or using an external Iceberg catalog
>>>>>>>>>>>>>       - If there is a metadata_location, then use the
>>>>>>>>>>>>>       HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>>>>       - If there is a catalog property, then load that
>>>>>>>>>>>>>       catalog and use it to load the table identifier, or maybe an identifier
>>>>>>>>>>>>>       from HMS table properties
>>>>>>>>>>>>>       - If there is no catalog or metadata_location, then use
>>>>>>>>>>>>>       HadoopTables to load the table location as an Iceberg table
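>>>>>>>>>>>>>
>>>>>>>>>>>>> (A compact Java sketch of that resolution order; 'metadata_location' and
>>>>>>>>>>>>> the catalog property are used as discussed here, while the helper
>>>>>>>>>>>>> methods are placeholders rather than real Iceberg APIs:)
>>>>>>>>>>>>>
>>>>>>>>>>>>> import java.util.Map;
>>>>>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>>>> import org.apache.iceberg.Table;
>>>>>>>>>>>>> import org.apache.iceberg.hadoop.HadoopTables;
>>>>>>>>>>>>>
>>>>>>>>>>>>> class HmsTableResolver {
>>>>>>>>>>>>>   static Table load(Map<String, String> hmsProps, String tableLocation,
>>>>>>>>>>>>>                     Configuration conf) {
>>>>>>>>>>>>>     if (hmsProps.containsKey("metadata_location")) {
>>>>>>>>>>>>>       // tracked by this metastore: use the default HiveCatalog
>>>>>>>>>>>>>       return loadFromDefaultHiveCatalog(hmsProps, conf);
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>     String catalogName = hmsProps.get("iceberg.catalog");
>>>>>>>>>>>>>     if (catalogName != null) {
>>>>>>>>>>>>>       // external catalog defined in HiveConf
>>>>>>>>>>>>>       return loadFromConfiguredCatalog(catalogName, hmsProps, conf);
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>     // neither: treat the table location as a plain Iceberg table
>>>>>>>>>>>>>     return new HadoopTables(conf).load(tableLocation);
>>>>>>>>>>>>>   }
>>>>>>>>>>>>>
>>>>>>>>>>>>>   // Placeholders standing in for catalog lookup/loading:
>>>>>>>>>>>>>   static Table loadFromDefaultHiveCatalog(Map<String, String> props,
>>>>>>>>>>>>>                                           Configuration conf) {
>>>>>>>>>>>>>     throw new UnsupportedOperationException("sketch only");
>>>>>>>>>>>>>   }
>>>>>>>>>>>>>
>>>>>>>>>>>>>   static Table loadFromConfiguredCatalog(String name, Map<String, String> props,
>>>>>>>>>>>>>                                          Configuration conf) {
>>>>>>>>>>>>>     throw new UnsupportedOperationException("sketch only");
>>>>>>>>>>>>>   }
>>>>>>>>>>>>> }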
>>>>>>>>>>>>>
>>>>>>>>>>>>> This would make it possible to access all types of Iceberg
>>>>>>>>>>>>> tables in the same query, and would match how Spark and Flink configure
>>>>>>>>>>>>> catalogs. Other than the configuration above, I don’t think that config in
>>>>>>>>>>>>> HMS should be used at all, like how the other engines work. Iceberg is the
>>>>>>>>>>>>> source of truth for table metadata, HMS stores how to load the Iceberg
>>>>>>>>>>>>> table, and HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>>>>>>
>>>>>>>>>>>>> This isn’t quite how configuration works right now. Currently,
>>>>>>>>>>>>> the catalog is controlled by a HiveConf property,
>>>>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables will be
>>>>>>>>>>>>> used to load table locations. If it is set, then that catalog will be used
>>>>>>>>>>>>> to load all tables by name. This makes it impossible to load tables from
>>>>>>>>>>>>> different catalogs at the same time. That’s why I think the Iceberg catalog
>>>>>>>>>>>>> for a table should be stored in HMS table properties.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I should also explain iceberg.hive.engine.enabled flag, but I
>>>>>>>>>>>>> think this is long enough for now.
>>>>>>>>>>>>>
>>>>>>>>>>>>> rb
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to start a discussion, how should we handle
>>>>>>>>>>>>>> properties from various sources like Iceberg, Hive or global configuration.
>>>>>>>>>>>>>> I've put together a short document
>>>>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>
>

Re: Iceberg/Hive properties handling

Posted by Peter Vary <pv...@cloudera.com.INVALID>.
As Jacques suggested (with the help of Zoltan) I have collected the current state and the proposed solutions in a document:
https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing <https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing>

My feeling is that we do not have a final decision, so I tried to list all the possible solutions.
Please comment!

Thanks,
Peter

> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
> 
> When I was working on the CREATE TABLE patch I found the following TBLPROPERTIES on newly created tables:
> external.table.purge
> EXTERNAL
> bucketing_version
> numRows
> rawDataSize
> totalSize
> numFiles
> numFileErasureCoded
> 
> I am afraid that we cannot change the names of most of these properties, and it might not be useful to have most of them when the Iceberg statistics are already there. Also, my feeling is that this is only the tip of the Iceberg (pun intended :)), which is why I think we should have a more targeted way of pushing properties to the Iceberg tables.
> 
>> On Dec 2, 2020, at 18:04, Ryan Blue <rblue@netflix.com <ma...@netflix.com>> wrote:
>> 
>> Sorry, I accidentally didn’t copy the dev list on this reply. Resending:
>> 
>> Also I expect that we want to add Hive write specific configs to table level when the general engine independent configuration is not ideal for Hive, but every Hive query for a given table should use some specific config.
>> 
>> Hive may need configuration, but I think these should still be kept in the Iceberg table. There is no reason to make Hive config inaccessible from other engines. If someone wants to view all of the config for a table from Spark, the Hive config should also be included right?
>> 
>> 
>> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pvary@cloudera.com <ma...@cloudera.com>> wrote:
>> I will ask Laszlo if he wants to update his doc.
>> 
>> I see both pros and cons of catalog definition in config files. If there is an easy default then I do not mind any of the proposed solutions.
>> 
>> OTOH I am in favor of the "use prefix for Iceberg table properties" solution, because in Hive it is common to add new keys to the property list - no restriction is in place (I am not even sure that the currently implemented blacklist for preventing to propagate properties to Iceberg tables is complete). Also I expect that we want to add Hive write specific configs to table level when the general engine independent configuration is not ideal for Hive, but every Hive query for a given table should use some specific config.
>> 
>> Thanks, Peter 
>> 
>> Jacques Nadeau <jacques@dremio.com <ma...@dremio.com>> ezt írta (időpont: 2020. dec. 1., Ke 17:06):
>> Would someone be willing to create a document that states the current proposal? 
>> 
>> It is becoming somewhat difficult to follow this thread. I also worry that without a complete statement of the current shape that people may be incorrectly thinking they are in alignment.
>> 
>> 
>> 
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>> 
>> 
>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <boroknagyz@cloudera.com <ma...@cloudera.com>> wrote:
>> Thanks, Ryan. I answered inline.
>> 
>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rblue@netflix.com <ma...@netflix.com>> wrote:
>> This sounds like a good plan overall, but I have a couple of notes:
>> 
>> We need to keep in mind that users plug in their own catalogs, so iceberg.catalog could be a Glue or Nessie catalog, not just Hive or Hadoop. I don’t think it makes much sense to use separate hadoop.catalog and hive.catalog values. Those should just be names for catalogs configured in Configuration, i.e., via hive-site.xml. We then only need a special value for loading Hadoop tables from paths.
>> About extensibility, I think the usual Hive way is to use Java class names. So this way the value for 'iceberg.catalog' could be e.g. 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation would need to have a factory method that constructs the catalog object from a properties object (Map<String, String>). E.g. 'org.apache.iceberg.hadoop.HadoopCatalog' would require 'iceberg.catalog_location' to be present in properties.
>> I don’t think that catalog configuration should be kept in table properties. A catalog should not be loaded for each table. So I don’t think we need iceberg.catalog_location. Instead, we should have a way to define catalogs in the Configuration for tables in the metastore to reference.
>>  I think it makes sense, on the other hand it would make adding new catalogs more heavy-weight, i.e. now you'd need to edit configuration files and restart/reinit services. Maybe it can be cumbersome in some environments.
>> I’d rather use a prefix to exclude properties from being passed to Iceberg than to include them. Otherwise, users don’t know what to do to pass table properties from Hive or Impala. If we exclude a prefix or specific properties, then everything but the properties reserved for locating the table are passed as the user would expect.
>> I don't have a strong opinion about this, but yeah, maybe this behavior would cause the least surprises.
>> 
>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <boroknagyz@apache.org <ma...@apache.org>> wrote:
>> Thanks, Peter. I answered inline.
>> 
>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <pvary@cloudera.com.invalid <ma...@cloudera.com.invalid>> wrote:
>> Hi Zoltan,
>> 
>> Answers below:
>> 
>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <boroknagyz@cloudera.com.INVALID <ma...@cloudera.com.INVALID>> wrote:
>>> 
>>> Hi,
>>> 
>>> Thanks for the replies. My take for the above questions are as follows
>>> Should 'iceberg.catalog' be a required property?
>>> Yeah, I think it would be nice if this would be required to avoid any implicit behavior
>> Currently we have a Catalogs class to get/initialize/use the different Catalogs. At that time the decision was to use HadoopTables as a default catalog.
>> It might be worthwhile to use the same class in Impala as well, so the behavior is consistent.
>>  
>> Yeah, I think it'd be beneficial for us to use the Iceberg classes whenever possible. The Catalogs class is very similar to what we have currently in Impala.  
>>> 'hadoop.catalog' LOCATION and catalog_location
>>> In Impala we don't allow setting LOCATION for tables stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the Iceberg table's actual location. We were also thinking about using only the table LOCATION, and set it to the catalog location, but we also found it confusing.
>> It could definitely work, but it is somewhat strange that we have an external table location set to an arbitrary path, and we have a different location generated by other configs. It would be nice to have the real location set in the external table location as well.
>> 
>> Impala sets the real Iceberg table location for external tables. E.g. if the user issues
>> 
>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>> STORED AS ICEBERG
>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>               'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>> 
>> If the end user had specified LOCATION, then Impala would have raised an error. But the above DDL statement is correct, so Impala loads the iceberg table via Iceberg API, then creates the HMS table and sets LOCATION to the Iceberg table location (something like /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>> 
>> 
>> I like the flexibility of setting the table_identifier on table level, which could help removing naming conflicts. We might want to have this in the Iceberg Catalog implementation.
>>> 'iceberg.table_identifier' for HiveCatalog
>>> Yeah, it doesn't add much if we only allow using the current HMS. I think it can be only useful if we are allowing external HMSes.
>>> Moving properties to SERDEPROPERTIES
>>> I see that these properties are used by the SerDe classes in Hive, but I feel that these properties are just not about serialization and deserialization. And as I see the current SERDEPROPERTIES are things like 'field.delim', 'separatorChar', 'quoteChar', etc. So properties about table loading more naturally belong to TBLPROPERTIES in my opinion.
>> I have seen it used both ways for HBaseSerDe. (even the wiki page uses both :) ). Since Impala prefers TBLPROPERTIES and if we start using prefix for separating real Iceberg table properties from other properties, then we can keep it at TBLPROPERTIES.
>> 
>> In the google doc I also had a comment about prefixing iceberg table properties. We could use a prefix like 'iceberg.tblproperties.', and pass every property with this prefix to the Iceberg table. Currently Impala passes every table property to the Iceberg table.
>> 
>> 
>> 
>>> Thanks,
>>>     Zoltan
>>> 
>>> 
>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pvary@cloudera.com.invalid <ma...@cloudera.com.invalid>> wrote:
>>> Hi,
>>> 
>>> Based on the discussion below I understand we have the following kinds of properties:
>>> Iceberg table properties - Engine independent, storage related parameters
>>> "how to get to" - I think these are mostly Hive table specific properties, since for Spark, the Spark catalog configuration serves for the same purpose. I think the best place for storing these would be the Hive SERDEPROPERTIES, as this describes the access information for the SerDe. Sidenote: I think we should decide if we allow HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier' would make sense only if we allow having multiple catalogs.
>>> Query specific properties - These are engine specific and might be mapped to / even override the Iceberg table properties on the engine specific code paths, but currently these properties have independent names and mapped on a case-by-case basis. 
>>> 
>>> Based on this:
>>> Shall we move the "how to get to" properties to SERDEPROPERTIES?
>>> Shall we define a prefix for setting Iceberg table properties from Hive queries and omitting other engine specific properties?
>>> 
>>> Thanks,
>>> Peter
>>> 
>>> 
>>>> On Nov 27, 2020, at 17:45, Mass Dosage <massdosage@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> I like these suggestions, comments inline below on the last round...
>>>> 
>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <boroknagyz@apache.org <ma...@apache.org>> wrote:
>>>> Hi,
>>>> 
>>>> The above aligns with what we did in Impala, i.e. we store information about table loading in HMS table properties. We are just a bit more explicit about which catalog to use.
>>>> We have table property 'iceberg.catalog' to determine the catalog type, right now the supported values are 'hadoop.tables', 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be set based on the catalog type.
>>>> 
>>>> So, if the value of 'iceberg.catalog' is
>>>> 
>>>> I'm all for renaming this, having "mr" in the property name is confusing.
>>>>  
>>>> hadoop.tables
>>>> the table location is used to load the table
>>>> The only question I have is should we have this as the default? i.e. if you don't set a catalog it will assume its HadoopTables and use the location? Or should we require this property to be here to be consistent and avoid any "magic"?
>>>>  
>>>> hadoop.catalog
>>>> Required table property 'iceberg.catalog_location' specifies the location of the hadoop catalog in the file system
>>>> Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as table identifier
>>>> I like this as it would allow you to use a different database and table name in Hive as opposed to the Hadoop Catalog - at the moment they have to match. The only thing here is that I think Hive requires a table LOCATION to be set and it's then confusing as there are now two locations on the table. I'm not sure whether in the Hive storage handler or SerDe etc. we can get Hive to not require that and maybe even disallow it from being set. That would probably be best in conjunction with this. Another solution would be to not have the 'iceberg.catalog_location' property but instead use the table LOCATION for this but that's a bit confusing from a Hive point of view.
>>>> 
>>>> hive.catalog
>>>> Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as table identifier
>>>> We have the assumption that the current Hive metastore stores the table, i.e. we don't support external Hive metastores currently
>>>> These sound fine for Hive catalog tables that are created outside of the automatic Hive table creation (see https://iceberg.apache.org/hive/ <https://iceberg.apache.org/hive/> -> Using Hive Catalog) we'd just need to document how you can create these yourself and that one could use a different Hive database and table etc.
>>>>  
>>>> Independent of catalog implementations, but we also have table property 'iceberg.file_format' to specify the file format for the data files.
>>>> 
>>>> OK, I don't think we need that for Hive?
>>>>  
>>>> We haven't released it yet, so we are open to changes, but I think these properties are reasonable and it would be great if we could standardize the properties across engines that use HMS as the primary metastore of tables.
>>>> 
>>>> 
>>>> If others agree I think we should create an issue where we document the above changes so it's very clear what we're doing and can then go an implement them and update the docs etc.
>>>>  
>>>> Cheers,
>>>>     Zoltan
>>>> 
>>>> 
>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rblue@netflix.com.invalid <ma...@netflix.com.invalid>> wrote:
>>>> Yes, I think that is a good summary of the principles.
>>>> 
>>>> #4 is correct because we provide some information that is informational (Hive schema) or tracked only by the metastore (best-effort current user). I also agree that it would be good to have a table identifier in HMS table metadata when loading from an external table. That gives us a way to handle name conflicts.
>>>> 
>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacques@dremio.com <ma...@dremio.com>> wrote:
>>>> Minor error, my last example should have been:
>>>> 
>>>> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>>>> 
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
>>>> 
>>>> 
>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacques@dremio.com <ma...@dremio.com>> wrote:
>>>> I agree with Ryan on the core principles here. As I understand them:
>>>> Iceberg metadata describes all properties of a table
>>>> Hive table properties describe "how to get to" Iceberg metadata (which catalog + possibly ptr, path, token, etc)
>>>> There could be default "how to get to" information set at a global level
>>>> Best-effort schema should stored be in the table properties in HMS. This should be done for information schema retrieval purposes within Hive but should be ignored during Hive/other tool execution.
>>>> Is that a fair summary of your statements Ryan (except 4, which I just added)?
>>>> 
>>>> One comment I have on #2 is that for different catalogs and use cases, I think it can be somewhat more complex where it would be desirable for a table that initially existed without Hive that was later exposed in Hive to support a ptr/path/token for how the table is named externally. For example, in a Nessie context we support arbitrary paths for an Iceberg table (such as folder1.folder2.folder3.table1). If you then want to expose that table to Hive, you might have this mapping for #2
>>>> 
>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>> 
>>>> Similarly, you might want to expose a particular branch version of a table. So it might say:
>>>> 
>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>> 
>>>> Just saying that the address to the table in the catalog could itself have several properties. The key being that no matter what those are, we should follow #1 and only store properties that are about the ptr, not the content/metadata.
>>>> 
>>>> Lastly, I believe #4 is the case but haven't tested it. Can someone confirm that it is true? And that it is possible/not problematic?
>>>> 
>>>> 
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
>>>> 
>>>> 
>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rblue@netflix.com.invalid <ma...@netflix.com.invalid>> wrote:
>>>> Thanks for working on this, Laszlo. I’ve been thinking about these problems as well, so this is a good time to have a discussion about Hive config.
>>>> 
>>>> I think that Hive configuration should work mostly like other engines, where different configurations are used for different purposes. Different purposes means that there is not a global configuration priority. Hopefully, I can explain how we use the different config sources elsewhere to clarify.
>>>> 
>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop Configuration, but it also has its own global configuration. There are also Iceberg table properties, and all of the various Hive properties if you’re tracking tables with a Hive MetaStore.
>>>> 
>>>> The first step is to simplify where we can, so we effectively eliminate 2 sources of config:
>>>> 
>>>> The Hadoop Configuration is only used to instantiate Hadoop classes, like FileSystem. Iceberg should not use it for any other config.
>>>> Config in the Hive MetaStore is only used to identify that a table is Iceberg and point to its metadata location. All other config in HMS is informational. For example, the input format is FileInputFormat so that non-Iceberg readers cannot actually instantiate the format (it’s abstract) but it is available so they also don’t fail trying to load the class. Table-specific config should not be stored in table or serde properties.
>>>> That leaves Spark configuration and Iceberg table configuration.
>>>> 
>>>> Iceberg differs from other tables because it is opinionated: data configuration should be maintained at the table level. This is cleaner for users because config is standardized across engines and in one place. And it also enables services that analyze a table and update its configuration to tune options that users almost never do, like row group or stripe size in the columnar formats. Iceberg table configuration is used to configure table-specific concerns and behavior.
>>>> 
>>>> Spark configuration is used for engine-specific concerns, and runtime overrides. A good example of an engine-specific concern is the catalogs that are available to load Iceberg tables. Spark has a way to load and configure catalog implementations and Iceberg uses that for all catalog-level config. Runtime overrides are things like target split size. Iceberg has a table-level default split size in table properties, but this can be overridden by a Spark option for each table, as well as an option passed to the individual read. Note that these necessarily have different config names for how they are used: Iceberg uses read.split.target-size and the read-specific option is target-size.
>>>> 
>>>> Applying this to Hive is a little strange for a couple reasons. First, Hive’s engine configuration is a Hadoop Configuration. As a result, I think the right place to store engine-specific config is there, including Iceberg catalogs using a strategy similar to what Spark does: what external Iceberg catalogs are available and their configuration should come from the HiveConf.
>>>> 
>>>> The second way Hive is strange is that Hive needs to use its own MetaStore to track Hive table concerns. The MetaStore may have tables created by an Iceberg HiveCatalog, and Hive also needs to be able to load tables from other Iceberg catalogs by creating table entries for them.
>>>> 
>>>> Here’s how I think Hive should work:
>>>> 
>>>> There should be a default HiveCatalog that uses the current MetaStore URI to be used for HiveCatalog tables tracked in the MetaStore
>>>> Other catalogs should be defined in HiveConf
>>>> HMS table properties should be used to determine how to load a table: using a Hadoop location, using the default metastore catalog, or using an external Iceberg catalog
>>>> If there is a metadata_location, then use the HiveCatalog for this metastore (where it is tracked)
>>>> If there is a catalog property, then load that catalog and use it to load the table identifier, or maybe an identifier from HMS table properties
>>>> If there is no catalog or metadata_location, then use HadoopTables to load the table location as an Iceberg table
>>>> This would make it possible to access all types of Iceberg tables in the same query, and would match how Spark and Flink configure catalogs. Other than the configuration above, I don’t think that config in HMS should be used at all, like how the other engines work. Iceberg is the source of truth for table metadata, HMS stores how to load the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
>>>> 
>>>> This isn’t quite how configuration works right now. Currently, the catalog is controlled by a HiveConf property, iceberg.mr.catalog. If that isn’t set, HadoopTables will be used to load table locations. If it is set, then that catalog will be used to load all tables by name. This makes it impossible to load tables from different catalogs at the same time. That’s why I think the Iceberg catalog for a table should be stored in HMS table properties.
>>>> 
>>>> I should also explain iceberg.hive.engine.enabled flag, but I think this is long enough for now.
>>>> 
>>>> rb
>>>> 
>>>> 
>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lpinter@cloudera.com.invalid <ma...@cloudera.com.invalid>> wrote:
>>>> Hi All,
>>>> 
>>>> I would like to start a discussion, how should we handle properties from various sources like Iceberg, Hive or global configuration. I've put together a short document <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>, please have a look and let me know what you think.
>>>> 
>>>> Thanks,
>>>> Laszlo
>>>> 
>>>> 
>>>> -- 
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>> 
>>>> 
>>>> -- 
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>> 
>> 
>> 
>> 
>> -- 
>> Ryan Blue
>> Software Engineer
>> Netflix
>> 
>> 
>> -- 
>> Ryan Blue
>> Software Engineer
>> Netflix
> 


Re: Iceberg/Hive properties handling

Posted by Peter Vary <pv...@cloudera.com.INVALID>.
When I was working on the CREATE TABLE patch I found the following TBLPROPERTIES on newly created tables:
external.table.purge
EXTERNAL
bucketing_version
numRows
rawDataSize
totalSize
numFiles
numFileErasureCoded

I am afraid that we cannot change the names of most of these properties, and it might not be useful to have most of them when the Iceberg statistics are already there. Also, my feeling is that this is only the tip of the Iceberg (pun intended :)), which is why I think we should have a more targeted way of pushing properties to the Iceberg tables.
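
For illustration, a minimal Java sketch of the kind of blacklist-based filtering that would otherwise be needed; the excluded keys below are just the HMS-managed ones listed above, and the set is almost certainly incomplete:

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class HmsPropertyFilter {
  // HMS-managed keys observed on newly created tables; they should not leak
  // into the Iceberg table properties. Keeping such a list complete is hard,
  // which is the argument for a more targeted (e.g. prefix-based) approach.
  private static final Set<String> HMS_MANAGED = Set.of(
      "external.table.purge", "EXTERNAL", "bucketing_version", "numRows",
      "rawDataSize", "totalSize", "numFiles", "numFileErasureCoded");

  static Map<String, String> propertiesToPush(Map<String, String> hmsProps) {
    Map<String, String> toPush = new HashMap<>();
    hmsProps.forEach((key, value) -> {
      if (!HMS_MANAGED.contains(key)) {
        toPush.put(key, value);
      }
    });
    return toPush;
  }
}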

> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
> 
> Sorry, I accidentally didn’t copy the dev list on this reply. Resending:
> 
> Also I expect that we want to add Hive write specific configs to table level when the general engine independent configuration is not ideal for Hive, but every Hive query for a given table should use some specific config.
> 
> Hive may need configuration, but I think these should still be kept in the Iceberg table. There is no reason to make Hive config inaccessible from other engines. If someone wants to view all of the config for a table from Spark, the Hive config should also be included right?
> 
> 
> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pvary@cloudera.com <ma...@cloudera.com>> wrote:
> I will ask Laszlo if he wants to update his doc.
> 
> I see both pros and cons of catalog definition in config files. If there is an easy default then I do not mind any of the proposed solutions.
> 
> OTOH I am in favor of the "use prefix for Iceberg table properties" solution, because in Hive it is common to add new keys to the property list - no restriction is in place (I am not even sure that the currently implemented blacklist for preventing to propagate properties to Iceberg tables is complete). Also I expect that we want to add Hive write specific configs to table level when the general engine independent configuration is not ideal for Hive, but every Hive query for a given table should use some specific config.
> 
> Thanks, Peter 
> 
> Jacques Nadeau <jacques@dremio.com <ma...@dremio.com>> ezt írta (időpont: 2020. dec. 1., Ke 17:06):
> Would someone be willing to create a document that states the current proposal? 
> 
> It is becoming somewhat difficult to follow this thread. I also worry that without a complete statement of the current shape that people may be incorrectly thinking they are in alignment.
> 
> 
> 
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
> 
> 
> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <boroknagyz@cloudera.com <ma...@cloudera.com>> wrote:
> Thanks, Ryan. I answered inline.
> 
> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rblue@netflix.com <ma...@netflix.com>> wrote:
> This sounds like a good plan overall, but I have a couple of notes:
> 
> We need to keep in mind that users plug in their own catalogs, so iceberg.catalog could be a Glue or Nessie catalog, not just Hive or Hadoop. I don’t think it makes much sense to use separate hadoop.catalog and hive.catalog values. Those should just be names for catalogs configured in Configuration, i.e., via hive-site.xml. We then only need a special value for loading Hadoop tables from paths.
> About extensibility, I think the usual Hive way is to use Java class names. So this way the value for 'iceberg.catalog' could be e.g. 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation would need to have a factory method that constructs the catalog object from a properties object (Map<String, String>). E.g. 'org.apache.iceberg.hadoop.HadoopCatalog' would require 'iceberg.catalog_location' to be present in properties.
> I don’t think that catalog configuration should be kept in table properties. A catalog should not be loaded for each table. So I don’t think we need iceberg.catalog_location. Instead, we should have a way to define catalogs in the Configuration for tables in the metastore to reference.
>  I think it makes sense, on the other hand it would make adding new catalogs more heavy-weight, i.e. now you'd need to edit configuration files and restart/reinit services. Maybe it can be cumbersome in some environments.
> I’d rather use a prefix to exclude properties from being passed to Iceberg than to include them. Otherwise, users don’t know what to do to pass table properties from Hive or Impala. If we exclude a prefix or specific properties, then everything but the properties reserved for locating the table are passed as the user would expect.
> I don't have a strong opinion about this, but yeah, maybe this behavior would cause the least surprises.
> 
> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <boroknagyz@apache.org <ma...@apache.org>> wrote:
> Thanks, Peter. I answered inline.
> 
> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
> Hi Zoltan,
> 
> Answers below:
> 
>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <boroknagyz@cloudera.com.INVALID <ma...@cloudera.com.INVALID>> wrote:
>> 
>> Hi,
>> 
>> Thanks for the replies. My take for the above questions are as follows
>> Should 'iceberg.catalog' be a required property?
>> Yeah, I think it would be nice if this would be required to avoid any implicit behavior
> Currently we have a Catalogs class to get/initialize/use the different Catalogs. At that time the decision was to use HadoopTables as a default catalog.
> It might be worthwhile to use the same class in Impala as well, so the behavior is consistent.
>  
> Yeah, I think it'd be beneficial for us to use the Iceberg classes whenever possible. The Catalogs class is very similar to what we have currently in Impala.  
>> 'hadoop.catalog' LOCATION and catalog_location
>> In Impala we don't allow setting LOCATION for tables stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the Iceberg table's actual location. We were also thinking about using only the table LOCATION, and set it to the catalog location, but we also found it confusing.
> It could definitely work, but it is somewhat strange that we have an external table location set to an arbitrary path, and we have a different location generated by other configs. It would be nice to have the real location set in the external table location as well.
> 
> Impala sets the real Iceberg table location for external tables. E.g. if the user issues
> 
> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
> STORED AS ICEBERG
> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>               'iceberg.table_identifier'='namespace1.namespace2.ice_t');
> 
> If the end user had specified LOCATION, then Impala would have raised an error. But the above DDL statement is correct, so Impala loads the iceberg table via Iceberg API, then creates the HMS table and sets LOCATION to the Iceberg table location (something like /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
> 
> 
> I like the flexibility of setting the table_identifier on table level, which could help removing naming conflicts. We might want to have this in the Iceberg Catalog implementation.
>> 'iceberg.table_identifier' for HiveCatalog
>> Yeah, it doesn't add much if we only allow using the current HMS. I think it can be only useful if we are allowing external HMSes.
>> Moving properties to SERDEPROPERTIES
>> I see that these properties are used by the SerDe classes in Hive, but I feel that these properties are just not about serialization and deserialization. And as I see the current SERDEPROPERTIES are things like 'field.delim', 'separatorChar', 'quoteChar', etc. So properties about table loading more naturally belong to TBLPROPERTIES in my opinion.
> I have seen it used both ways for HBaseSerDe. (even the wiki page uses both :) ). Since Impala prefers TBLPROPERTIES and if we start using prefix for separating real Iceberg table properties from other properties, then we can keep it at TBLPROPERTIES.
> 
> In the google doc I also had a comment about prefixing iceberg table properties. We could use a prefix like 'iceberg.tblproperties.', and pass every property with this prefix to the Iceberg table. Currently Impala passes every table property to the Iceberg table.
> 
> 
> 
>> Thanks,
>>     Zoltan
>> 
>> 
>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pvary@cloudera.com.invalid <ma...@cloudera.com.invalid>> wrote:
>> Hi,
>> 
>> Based on the discussion below I understand we have the following kinds of properties:
>> Iceberg table properties - Engine independent, storage related parameters
>> "how to get to" - I think these are mostly Hive table specific properties, since for Spark, the Spark catalog configuration serves for the same purpose. I think the best place for storing these would be the Hive SERDEPROPERTIES, as this describes the access information for the SerDe. Sidenote: I think we should decide if we allow HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier' would make sense only if we allow having multiple catalogs.
>> Query specific properties - These are engine specific and might be mapped to / even override the Iceberg table properties on the engine specific code paths, but currently these properties have independent names and mapped on a case-by-case basis. 
>> 
>> Based on this:
>> Shall we move the "how to get to" properties to SERDEPROPERTIES?
>> Shall we define a prefix for setting Iceberg table properties from Hive queries and omitting other engine specific properties?
>> 
>> Thanks,
>> Peter
>> 
>> 
>>> On Nov 27, 2020, at 17:45, Mass Dosage <massdosage@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> I like these suggestions, comments inline below on the last round...
>>> 
>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <boroknagyz@apache.org <ma...@apache.org>> wrote:
>>> Hi,
>>> 
>>> The above aligns with what we did in Impala, i.e. we store information about table loading in HMS table properties. We are just a bit more explicit about which catalog to use.
>>> We have table property 'iceberg.catalog' to determine the catalog type, right now the supported values are 'hadoop.tables', 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be set based on the catalog type.
>>> 
>>> So, if the value of 'iceberg.catalog' is
>>> 
>>> I'm all for renaming this, having "mr" in the property name is confusing.
>>>  
>>> hadoop.tables
>>> the table location is used to load the table
>>> The only question I have is should we have this as the default? i.e. if you don't set a catalog it will assume its HadoopTables and use the location? Or should we require this property to be here to be consistent and avoid any "magic"?
>>>  
>>> hadoop.catalog
>>> Required table property 'iceberg.catalog_location' specifies the location of the hadoop catalog in the file system
>>> Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as table identifier
>>> I like this as it would allow you to use a different database and table name in Hive as opposed to the Hadoop Catalog - at the moment they have to match. The only thing here is that I think Hive requires a table LOCATION to be set and it's then confusing as there are now two locations on the table. I'm not sure whether in the Hive storage handler or SerDe etc. we can get Hive to not require that and maybe even disallow it from being set. That would probably be best in conjunction with this. Another solution would be to not have the 'iceberg.catalog_location' property but instead use the table LOCATION for this but that's a bit confusing from a Hive point of view.
>>> 
>>> hive.catalog
>>> Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as table identifier
>>> We have the assumption that the current Hive metastore stores the table, i.e. we don't support external Hive metastores currently
>>> These sound fine for Hive catalog tables that are created outside of the automatic Hive table creation (see https://iceberg.apache.org/hive/ <https://iceberg.apache.org/hive/> -> Using Hive Catalog) we'd just need to document how you can create these yourself and that one could use a different Hive database and table etc.
>>>  
>>> Independent of catalog implementations, but we also have table property 'iceberg.file_format' to specify the file format for the data files.
>>> 
>>> OK, I don't think we need that for Hive?
>>>  
>>> We haven't released it yet, so we are open to changes, but I think these properties are reasonable and it would be great if we could standardize the properties across engines that use HMS as the primary metastore of tables.
>>> 
>>> 
>>> If others agree I think we should create an issue where we document the above changes so it's very clear what we're doing and can then go an implement them and update the docs etc.
>>>  
>>> Cheers,
>>>     Zoltan
>>> 
>>> 
>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rblue@netflix.com.invalid> wrote:
>>> Yes, I think that is a good summary of the principles.
>>> 
>>> #4 is correct because we provide some information that is informational (Hive schema) or tracked only by the metastore (best-effort current user). I also agree that it would be good to have a table identifier in HMS table metadata when loading from an external table. That gives us a way to handle name conflicts.
>>> 
>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacques@dremio.com> wrote:
>>> Minor error, my last example should have been:
>>> 
>>> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>>> 
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>> 
>>> 
>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacques@dremio.com> wrote:
>>> I agree with Ryan on the core principles here. As I understand them:
>>> 1. Iceberg metadata describes all properties of a table
>>> 2. Hive table properties describe "how to get to" Iceberg metadata (which catalog + possibly ptr, path, token, etc)
>>> 3. There could be default "how to get to" information set at a global level
>>> 4. Best-effort schema should be stored in the table properties in HMS. This should be done for information schema retrieval purposes within Hive but should be ignored during Hive/other tool execution.
>>> Is that a fair summary of your statements Ryan (except 4, which I just added)?
>>> 
>>> One comment I have on #2 is that for different catalogs and use cases, I think it can be somewhat more complex where it would be desirable for a table that initially existed without Hive that was later exposed in Hive to support a ptr/path/token for how the table is named externally. For example, in a Nessie context we support arbitrary paths for an Iceberg table (such as folder1.folder2.folder3.table1). If you then want to expose that table to Hive, you might have this mapping for #2
>>> 
>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>> 
>>> Similarly, you might want to expose a particular branch version of a table. So it might say:
>>> 
>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>> 
>>> Just saying that the address to the table in the catalog could itself have several properties. The key being that no matter what those are, we should follow #1 and only store properties that are about the ptr, not the content/metadata.
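>>>
>>> To make that concrete, such a pointer could be stored as a handful of HMS table properties. A minimal sketch in Java (the key names below, in particular iceberg.table_ref, are only illustrative assumptions for this example, not an agreed convention):
>>>
>>> import java.util.Map;
>>>
>>> // Illustrative only: hypothetical pointer-style HMS properties for a table that
>>> // lives in a Nessie catalog under a longer path and a specific branch. Only the
>>> // "how to get to" information is stored here; the table metadata stays in Iceberg.
>>> class NessiePointerExample {
>>>   static final Map<String, String> HMS_TABLE_PROPERTIES = Map.of(
>>>       "iceberg.catalog", "nessie",                                  // which catalog loads the table
>>>       "iceberg.table_identifier", "folder1.folder2.folder3.table1", // address inside that catalog
>>>       "iceberg.table_ref", "etl_branch");                           // branch/tag, if the catalog supports it
>>> }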
>>> 
>>> Lastly, I believe #4 is the case but haven't tested it. Can someone confirm that it is true? And that it is possible/not problematic?
>>> 
>>> 
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>> 
>>> 
>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rblue@netflix.com.invalid> wrote:
>>> Thanks for working on this, Laszlo. I’ve been thinking about these problems as well, so this is a good time to have a discussion about Hive config.
>>> 
>>> I think that Hive configuration should work mostly like other engines, where different configurations are used for different purposes. Different purposes means that there is not a global configuration priority. Hopefully, I can explain how we use the different config sources elsewhere to clarify.
>>> 
>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop Configuration, but it also has its own global configuration. There are also Iceberg table properties, and all of the various Hive properties if you’re tracking tables with a Hive MetaStore.
>>> 
>>> The first step is to simplify where we can, so we effectively eliminate 2 sources of config:
>>> 
>>> - The Hadoop Configuration is only used to instantiate Hadoop classes, like FileSystem. Iceberg should not use it for any other config.
>>> - Config in the Hive MetaStore is only used to identify that a table is Iceberg and point to its metadata location. All other config in HMS is informational. For example, the input format is FileInputFormat so that non-Iceberg readers cannot actually instantiate the format (it’s abstract) but it is available so they also don’t fail trying to load the class. Table-specific config should not be stored in table or serde properties.
>>> That leaves Spark configuration and Iceberg table configuration.
>>> 
>>> Iceberg differs from other tables because it is opinionated: data configuration should be maintained at the table level. This is cleaner for users because config is standardized across engines and in one place. And it also enables services that analyze a table and update its configuration to tune options that users almost never do, like row group or stripe size in the columnar formats. Iceberg table configuration is used to configure table-specific concerns and behavior.
>>> 
>>> Spark configuration is used for engine-specific concerns, and runtime overrides. A good example of an engine-specific concern is the catalogs that are available to load Iceberg tables. Spark has a way to load and configure catalog implementations and Iceberg uses that for all catalog-level config. Runtime overrides are things like target split size. Iceberg has a table-level default split size in table properties, but this can be overridden by a Spark option for each table, as well as an option passed to the individual read. Note that these necessarily have different config names for how they are used: Iceberg uses read.split.target-size and the read-specific option is target-size.
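>>>
>>> As a rough illustration of that layering (a sketch only; the per-read option name follows the description above and can differ between Iceberg releases, so treat it as an assumption):
>>>
>>> import org.apache.spark.sql.Dataset;
>>> import org.apache.spark.sql.Row;
>>> import org.apache.spark.sql.SparkSession;
>>>
>>> class SplitSizeOverrideSketch {
>>>   // The table-level default lives in the Iceberg table property
>>>   // read.split.target-size; a single read can override it with a read option.
>>>   static Dataset<Row> readWithOverride(SparkSession spark) {
>>>     return spark.read()
>>>         .format("iceberg")
>>>         .option("target-size", Long.toString(512L * 1024 * 1024)) // per-read override, name as described above
>>>         .load("db.table");
>>>   }
>>> }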
>>> 
>>> Applying this to Hive is a little strange for a couple reasons. First, Hive’s engine configuration is a Hadoop Configuration. As a result, I think the right place to store engine-specific config is there, including Iceberg catalogs using a strategy similar to what Spark does: what external Iceberg catalogs are available and their configuration should come from the HiveConf.
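>>>
>>> For example, catalog definitions could live in hive-site.xml / HiveConf keyed by a catalog name. A minimal sketch (the iceberg.catalog.<name>.* key scheme below is an assumption modeled on the Spark approach, not an existing Hive/Iceberg convention):
>>>
>>> import org.apache.hadoop.hive.conf.HiveConf;
>>>
>>> class HiveCatalogConfSketch {
>>>   // Sketch: register two external Iceberg catalogs in the engine configuration.
>>>   // In practice these keys would be set in hive-site.xml rather than in code.
>>>   static HiveConf withCatalogs(HiveConf conf) {
>>>     conf.set("iceberg.catalog.prod_hadoop.type", "hadoop");
>>>     conf.set("iceberg.catalog.prod_hadoop.warehouse", "hdfs://namenode:8020/warehouse/iceberg");
>>>     conf.set("iceberg.catalog.custom_cat.catalog-impl", "com.example.CustomCatalog"); // hypothetical impl class
>>>     return conf;
>>>   }
>>> }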
>>> 
>>> The second way Hive is strange is that Hive needs to use its own MetaStore to track Hive table concerns. The MetaStore may have tables created by an Iceberg HiveCatalog, and Hive also needs to be able to load tables from other Iceberg catalogs by creating table entries for them.
>>> 
>>> Here’s how I think Hive should work:
>>> 
>>> - There should be a default HiveCatalog that uses the current MetaStore URI to be used for HiveCatalog tables tracked in the MetaStore
>>> - Other catalogs should be defined in HiveConf
>>> - HMS table properties should be used to determine how to load a table: using a Hadoop location, using the default metastore catalog, or using an external Iceberg catalog
>>>   - If there is a metadata_location, then use the HiveCatalog for this metastore (where it is tracked)
>>>   - If there is a catalog property, then load that catalog and use it to load the table identifier, or maybe an identifier from HMS table properties
>>>   - If there is no catalog or metadata_location, then use HadoopTables to load the table location as an Iceberg table
>>> This would make it possible to access all types of Iceberg tables in the same query, and would match how Spark and Flink configure catalogs. Other than the configuration above, I don’t think that config in HMS should be used at all, like how the other engines work. Iceberg is the source of truth for table metadata, HMS stores how to load the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
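>>>
>>> A minimal sketch of that resolution order, just to make it concrete (loadHiveCatalog and loadNamedCatalog are hypothetical helpers that would build catalogs from the HiveConf definitions above; the property names follow the bullets and are assumptions):
>>>
>>> import java.util.Properties;
>>>
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.iceberg.Table;
>>> import org.apache.iceberg.catalog.Catalog;
>>> import org.apache.iceberg.catalog.TableIdentifier;
>>> import org.apache.iceberg.hadoop.HadoopTables;
>>>
>>> class TableResolutionSketch {
>>>   // Sketch of the proposed resolution order based on HMS table properties.
>>>   // loadHiveCatalog/loadNamedCatalog are hypothetical helpers, not existing APIs.
>>>   static Table resolve(Configuration conf, Properties hmsProps) {
>>>     String metadataLocation = hmsProps.getProperty("metadata_location");
>>>     String catalogName = hmsProps.getProperty("iceberg.catalog");
>>>     String location = hmsProps.getProperty("location");
>>>
>>>     if (metadataLocation != null) {
>>>       // HiveCatalog table tracked in this metastore
>>>       Catalog hiveCatalog = loadHiveCatalog(conf);
>>>       return hiveCatalog.loadTable(TableIdentifier.parse(hmsProps.getProperty("name"))); // "name" as identifier is an assumption
>>>     } else if (catalogName != null) {
>>>       // table registered in an external catalog defined in HiveConf
>>>       Catalog catalog = loadNamedCatalog(conf, catalogName);
>>>       String id = hmsProps.getProperty("iceberg.table_identifier", hmsProps.getProperty("name"));
>>>       return catalog.loadTable(TableIdentifier.parse(id));
>>>     } else {
>>>       // fall back to a path-based table
>>>       return new HadoopTables(conf).load(location);
>>>     }
>>>   }
>>>
>>>   static Catalog loadHiveCatalog(Configuration conf) { throw new UnsupportedOperationException(); }
>>>   static Catalog loadNamedCatalog(Configuration conf, String name) { throw new UnsupportedOperationException(); }
>>> }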
>>> 
>>> This isn’t quite how configuration works right now. Currently, the catalog is controlled by a HiveConf property, iceberg.mr.catalog. If that isn’t set, HadoopTables will be used to load table locations. If it is set, then that catalog will be used to load all tables by name. This makes it impossible to load tables from different catalogs at the same time. That’s why I think the Iceberg catalog for a table should be stored in HMS table properties.
>>> 
>>> I should also explain iceberg.hive.engine.enabled flag, but I think this is long enough for now.
>>> 
>>> rb
>>> 
>>> 
>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lpinter@cloudera.com.invalid> wrote:
>>> Hi All,
>>> 
>>> I would like to start a discussion, how should we handle properties from various sources like Iceberg, Hive or global configuration. I've put together a short document <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>, please have a look and let me know what you think.
>>> 
>>> Thanks,
>>> Laszlo
>>> 
>>> 
>>> -- 
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>> 
>>> 
>>> -- 
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: Iceberg/Hive properties handling

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Sorry, I accidentally didn’t copy the dev list on this reply. Resending:

Also I expect that we want to add Hive write specific configs to table
level when the general engine independent configuration is not ideal for
Hive, but every Hive query for a given table should use some specific
config.

Hive may need configuration, but I think these should still be kept in the
Iceberg table. There is no reason to make Hive config inaccessible from
other engines. If someone wants to view all of the config for a table from
Spark, the Hive config should also be included, right?
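
To illustrate what I mean (a sketch; the hive.vectorized-read.enabled key below
is an invented example property, not an existing one): a Hive-specific setting
stored as an Iceberg table property is just another entry in the table metadata,
so any engine can read it back.

import org.apache.iceberg.Table;

class HiveConfigInTableSketch {
  // Sketch: store a Hive-specific setting in the Iceberg table itself.
  // "hive.vectorized-read.enabled" is an invented key for illustration only.
  static void setHiveOption(Table table) {
    table.updateProperties()
        .set("hive.vectorized-read.enabled", "false")
        .commit();
  }

  // Any engine (e.g. a Spark job) can read it back from the same table metadata.
  static String readHiveOption(Table table) {
    return table.properties().get("hive.vectorized-read.enabled");
  }
}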

On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com> wrote:

> I will ask Laszlo if he wants to update his doc.
>
> I see both pros and cons of catalog definition in config files. If there
> is an easy default then I do not mind any of the proposed solutions.
>
> OTOH I am in favor of the "use prefix for Iceberg table properties"
> solution, because in Hive it is common to add new keys to the property list
> - no restriction is in place (I am not even sure that the currently
> implemented blacklist for preventing properties from being propagated to
> Iceberg tables is complete). Also I expect that we want to add Hive write
> specific configs to table level when the general engine independent
> configuration is not ideal for Hive, but every Hive query for a given table
> should use some specific config.
>
> Thanks, Peter
>
> Jacques Nadeau <ja...@dremio.com> ezt írta (időpont: 2020. dec. 1., Ke
> 17:06):
>
>> Would someone be willing to create a document that states the current
>> proposal?
>>
>> It is becoming somewhat difficult to follow this thread. I also worry
>> that without a complete statement of the current shape that people may be
>> incorrectly thinking they are in alignment.
>>
>>
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>
>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <bo...@cloudera.com>
>> wrote:
>>
>>> Thanks, Ryan. I answered inline.
>>>
>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> This sounds like a good plan overall, but I have a couple of notes:
>>>>
>>>>    1. We need to keep in mind that users plug in their own catalogs,
>>>>    so iceberg.catalog could be a Glue or Nessie catalog, not just Hive
>>>>    or Hadoop. I don’t think it makes much sense to use separate hadoop.catalog
>>>>    and hive.catalog values. Those should just be names for catalogs configured
>>>>    in Configuration, i.e., via hive-site.xml. We then only need a
>>>>    special value for loading Hadoop tables from paths.
>>>>
>>>> About extensibility, I think the usual Hive way is to use Java class
>>> names. So this way the value for 'iceberg.catalog' could be e.g.
>>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
>>> would need to have a factory method that constructs the catalog object from
>>> a properties object (Map<String, String>). E.g.
>>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
>>> 'iceberg.catalog_location' to be present in properties.
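>>>
>>> A rough sketch of that factory pattern (illustrative only; it assumes the implementation class has a no-arg constructor and an initialize(name, properties) hook like the one on Iceberg's Catalog interface, so double-check the real API before relying on this):
>>>
>>> import java.util.Map;
>>>
>>> import org.apache.iceberg.catalog.Catalog;
>>>
>>> class CatalogFactorySketch {
>>>   // Sketch: build a catalog from a fully qualified class name found in the
>>>   // 'iceberg.catalog' table property, passing the remaining properties to it.
>>>   // Assumes a no-arg constructor plus an initialize hook; treat as an illustration.
>>>   static Catalog fromClassName(String impl, Map<String, String> properties) {
>>>     try {
>>>       Catalog catalog = (Catalog) Class.forName(impl)
>>>           .getDeclaredConstructor()
>>>           .newInstance();
>>>       catalog.initialize("hms", properties); // e.g. HadoopCatalog would expect iceberg.catalog_location here
>>>       return catalog;
>>>     } catch (ReflectiveOperationException e) {
>>>       throw new IllegalArgumentException("Cannot instantiate catalog: " + impl, e);
>>>     }
>>>   }
>>> }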
>>>
>>>>
>>>>    1. I don’t think that catalog configuration should be kept in table
>>>>    properties. A catalog should not be loaded for each table. So I don’t think
>>>>    we need iceberg.catalog_location. Instead, we should have a way to
>>>>    define catalogs in the Configuration for tables in the metastore to
>>>>    reference.
>>>>
>>>>  I think it makes sense, on the other hand it would make adding new
>>> catalogs more heavy-weight, i.e. now you'd need to edit configuration files
>>> and restart/reinit services. Maybe it can be cumbersome in some
>>> environments.
>>>
>>>>
>>>>    1. I’d rather use a prefix to exclude properties from being passed
>>>>    to Iceberg than to include them. Otherwise, users don’t know what to do to
>>>>    pass table properties from Hive or Impala. If we exclude a prefix or
>>>>    specific properties, then everything but the properties reserved for
>>>>    locating the table are passed as the user would expect.
>>>>
>>>> I don't have a strong opinion about this, but yeah, maybe this behavior
>>> would cause the least surprises.
>>>
>>>>
>>>>
>>>>
>>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <
>>>> boroknagyz@apache.org> wrote:
>>>>
>>>>> Thanks, Peter. I answered inline.
>>>>>
>>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <pv...@cloudera.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Hi Zoltan,
>>>>>>
>>>>>> Answers below:
>>>>>>
>>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>>>> boroknagyz@cloudera.com.INVALID> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Thanks for the replies. My take for the above questions are as follows
>>>>>>
>>>>>>    - Should 'iceberg.catalog' be a required property?
>>>>>>    - Yeah, I think it would be nice if this would be required to
>>>>>>       avoid any implicit behavior
>>>>>>
>>>>>> Currently we have a Catalogs class to get/initialize/use the
>>>>>> different Catalogs. At that time the decision was to use HadoopTables as a
>>>>>> default catalog.
>>>>>> It might be worthwhile to use the same class in Impala as well, so
>>>>>> the behavior is consistent.
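>>>>>>
>>>>>> From an engine's side the entry point is roughly the following (a sketch; it assumes the iceberg-mr Catalogs helper exposes a loadTable(Configuration, Properties) style method, so please verify the exact signature before relying on it):
>>>>>>
>>>>>> import java.util.Properties;
>>>>>>
>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>> import org.apache.iceberg.Table;
>>>>>> import org.apache.iceberg.mr.Catalogs;
>>>>>>
>>>>>> class EngineLoadSketch {
>>>>>>   // Sketch of engine-side usage: hand the HMS table properties to the shared
>>>>>>   // Catalogs helper and let it pick the catalog (HadoopTables by default).
>>>>>>   static Table load(Configuration conf, Properties hmsTableProps) {
>>>>>>     return Catalogs.loadTable(conf, hmsTableProps);
>>>>>>   }
>>>>>> }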
>>>>>>
>>>>>
>>>>> Yeah, I think it'd be beneficial for us to use the Iceberg classes
>>>>> whenever possible. The Catalogs class is very similar to what we have
>>>>> currently in Impala.
>>>>>
>>>>>>
>>>>>>    - 'hadoop.catalog' LOCATION and catalog_location
>>>>>>       - In Impala we don't allow setting LOCATION for tables stored
>>>>>>       in 'hadoop.catalog'. But Impala internally sets LOCATION to the Iceberg
>>>>>>       table's actual location. We were also thinking about using only the table
>>>>>>       LOCATION, and set it to the catalog location, but we also found it
>>>>>>       confusing.
>>>>>>
>>>>>> It could definitely work, but it is somewhat strange that we have an
>>>>>> external table location set to an arbitrary path, and we have a different
>>>>>> location generated by other configs. It would be nice to have the real
>>>>>> location set in the external table location as well.
>>>>>>
>>>>>
>>>>> Impala sets the real Iceberg table location for external tables. E.g.
>>>>> if the user issues
>>>>>
>>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>>> STORED AS ICEBERG
>>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>>
>>>>> 'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>>
>>>>> If the end user had specified LOCATION, then Impala would have raised
>>>>> an error. But the above DDL statement is correct, so Impala loads the
>>>>> iceberg table via Iceberg API, then creates the HMS table and sets LOCATION
>>>>> to the Iceberg table location (something like
>>>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>>>
>>>>>
>>>>>> I like the flexibility of setting the table_identifier on table
>>>>>> level, which could help removing naming conflicts. We might want to have
>>>>>> this in the Iceberg Catalog implementation.
>>>>>>
>>>>>>
>>>>>>    - 'iceberg.table_identifier' for HiveCatalog
>>>>>>       - Yeah, it doesn't add much if we only allow using the current
>>>>>>       HMS. I think it can be only useful if we are allowing external HMSes.
>>>>>>    - Moving properties to SERDEPROPERTIES
>>>>>>       - I see that these properties are used by the SerDe classes in
>>>>>>       Hive, but I feel that these properties are just not about serialization and
>>>>>>       deserialization. And as I see the current SERDEPROPERTIES are things like
>>>>>>       'field.delim', 'separatorChar', 'quoteChar', etc. So properties about table
>>>>>>       loading more naturally belong to TBLPROPERTIES in my opinion.
>>>>>>
>>>>>> I have seen it used both ways for HBaseSerDe. (even the wiki page
>>>>>> uses both :) ). Since Impala prefers TBLPROPERTIES and if we start using
>>>>>> prefix for separating real Iceberg table properties from other properties,
>>>>>> then we can keep it at TBLPROPERTIES.
>>>>>>
>>>>>
>>>>> In the google doc I also had a comment about prefixing iceberg table
>>>>> properties. We could use a prefix like 'iceberg.tblproperties.', and pass
>>>>> every property with this prefix to the Iceberg table. Currently Impala
>>>>> passes every table property to the Iceberg table.
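>>>>>
>>>>> A small sketch of how such a prefix could be applied when syncing HMS properties to the Iceberg table (the filtering/stripping logic here is hypothetical; only the 'iceberg.tblproperties.' prefix comes from the suggestion above):
>>>>>
>>>>> import java.util.HashMap;
>>>>> import java.util.Map;
>>>>> import java.util.Properties;
>>>>>
>>>>> class PropertyPrefixSketch {
>>>>>   private static final String PREFIX = "iceberg.tblproperties.";
>>>>>
>>>>>   // Sketch: pass only the HMS table properties carrying the agreed prefix to the
>>>>>   // Iceberg table, with the prefix stripped, so
>>>>>   // 'iceberg.tblproperties.write.format.default' becomes 'write.format.default'.
>>>>>   static Map<String, String> toIcebergProperties(Properties hmsProps) {
>>>>>     Map<String, String> icebergProps = new HashMap<>();
>>>>>     for (String key : hmsProps.stringPropertyNames()) {
>>>>>       if (key.startsWith(PREFIX)) {
>>>>>         icebergProps.put(key.substring(PREFIX.length()), hmsProps.getProperty(key));
>>>>>       }
>>>>>     }
>>>>>     return icebergProps;
>>>>>   }
>>>>> }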
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>     Zoltan
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <
>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Based on the discussion below I understand we have the following
>>>>>>> kinds of properties:
>>>>>>>
>>>>>>>    1. Iceberg table properties - Engine independent, storage
>>>>>>>    related parameters
>>>>>>>    2. "how to get to" - I think these are mostly Hive table
>>>>>>>    specific properties, since for Spark, the Spark catalog configuration
>>>>>>>    serves for the same purpose. I think the best place for storing these would
>>>>>>>    be the Hive SERDEPROPERTIES, as this describes the access information for
>>>>>>>    the SerDe. Sidenote: I think we should decide if we allow
>>>>>>>    HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier'
>>>>>>>    would make sense only if we allow having multiple catalogs.
>>>>>>>    3. Query specific properties - These are engine specific and
>>>>>>>    might be mapped to / even override the Iceberg table properties on the
>>>>>>>    engine specific code paths, but currently these properties have independent
>>>>>>>    names and mapped on a case-by-case basis.
>>>>>>>
>>>>>>>
>>>>>>> Based on this:
>>>>>>>
>>>>>>>    - Shall we move the "how to get to" properties to
>>>>>>>    SERDEPROPERTIES?
>>>>>>>    - Shall we define a prefix for setting Iceberg table properties
>>>>>>>    from Hive queries and omitting other engine specific properties?
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Peter
>>>>>>>
>>>>>>>
>>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com> wrote:
>>>>>>>
>>>>>>> I like these suggestions, comments inline below on the last round...
>>>>>>>
>>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <
>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>>>> information about table loading in HMS table properties. We are just a bit
>>>>>>>> more explicit about which catalog to use.
>>>>>>>> We have table property 'iceberg.catalog' to determine the catalog
>>>>>>>> type, right now the supported values are 'hadoop.tables', 'hadoop.catalog',
>>>>>>>> and 'hive.catalog'. Additional table properties can be set based on the
>>>>>>>> catalog type.
>>>>>>>>
>>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>>
>>>>>>>
>>>>>>> I'm all for renaming this, having "mr" in the property name is
>>>>>>> confusing.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>    - hadoop.tables
>>>>>>>>       - the table location is used to load the table
>>>>>>>>
>>>>>>>> The only question I have is should we have this as the default?
>>>>>>> i.e. if you don't set a catalog it will assume it's HadoopTables and use the
>>>>>>> location? Or should we require this property to be here to be consistent
>>>>>>> and avoid any "magic"?
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>    - hadoop.catalog
>>>>>>>>       - Required table property 'iceberg.catalog_location'
>>>>>>>>       specifies the location of the hadoop catalog in the file system
>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>       is used as table identifier
>>>>>>>>
>>>>>>>> I like this as it would allow you to use a different database and
>>>>>>> table name in Hive as opposed to the Hadoop Catalog - at the moment they
>>>>>>> have to match. The only thing here is that I think Hive requires a table
>>>>>>> LOCATION to be set and it's then confusing as there are now two locations
>>>>>>> on the table. I'm not sure whether in the Hive storage handler or SerDe
>>>>>>> etc. we can get Hive to not require that and maybe even disallow it from
>>>>>>> being set. That would probably be best in conjunction with this. Another
>>>>>>> solution would be to not have the 'iceberg.catalog_location' property but
>>>>>>> instead use the table LOCATION for this but that's a bit confusing from a
>>>>>>> Hive point of view.
>>>>>>>
>>>>>>>
>>>>>>>>    - hive.catalog
>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>       is used as table identifier
>>>>>>>>       - We have the assumption that the current Hive metastore
>>>>>>>>       stores the table, i.e. we don't support external Hive metastores currently
>>>>>>>>
>>>>>>>> These sound fine for Hive catalog tables that are created outside
>>>>>>> of the automatic Hive table creation (see
>>>>>>> https://iceberg.apache.org/hive/ -> Using Hive Catalog) we'd just
>>>>>>> need to document how you can create these yourself and that one could use a
>>>>>>> different Hive database and table etc.
>>>>>>>
>>>>>>>
>>>>>>>> Independent of catalog implementations, but we also have table
>>>>>>>> property 'iceberg.file_format' to specify the file format for the data
>>>>>>>> files.
>>>>>>>>
>>>>>>>
>>>>>>> OK, I don't think we need that for Hive?
>>>>>>>
>>>>>>>
>>>>>>>> We haven't released it yet, so we are open to changes, but I think
>>>>>>>> these properties are reasonable and it would be great if we could
>>>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>>>> metastore of tables.
>>>>>>>>
>>>>>>>>
>>>>>>> If others agree I think we should create an issue where we document
>>>>>>> the above changes so it's very clear what we're doing and can then go an
>>>>>>> implement them and update the docs etc.
>>>>>>>
>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>     Zoltan
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <
>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>>
>>>>>>>>> #4 is correct because we provide some information that is
>>>>>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>>>>>> current user). I also agree that it would be good to have a table
>>>>>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>>>>>> gives us a way to handle name conflicts.
>>>>>>>>>
>>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <ja...@dremio.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>>
>>>>>>>>>> db1.table1_etl_branch =>
>>>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jacques Nadeau
>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <
>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I agree with Ryan on the core principles here. As I understand
>>>>>>>>>>> them:
>>>>>>>>>>>
>>>>>>>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>>>>>>>    2. Hive table properties describe "how to get to" Iceberg
>>>>>>>>>>>    metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>>>>    3. There could be default "how to get to" information set at
>>>>>>>>>>>    a global level
>>>>>>>>>>>    4. Best-effort schema should be stored in the table
>>>>>>>>>>>    properties in HMS. This should be done for information schema retrieval
>>>>>>>>>>>    purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>>>>
>>>>>>>>>>> Is that a fair summary of your statements Ryan (except 4, which
>>>>>>>>>>> I just added)?
>>>>>>>>>>>
>>>>>>>>>>> One comment I have on #2 is that for different catalogs and use
>>>>>>>>>>> cases, I think it can be somewhat more complex where it would be
>>>>>>>>>>> desirable for a table that initially existed without Hive that was later
>>>>>>>>>>> exposed in Hive to support a ptr/path/token for how the table is named
>>>>>>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>>>>>>
>>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>>
>>>>>>>>>>> Similarly, you might want to expose a particular branch version
>>>>>>>>>>> of a table. So it might say:
>>>>>>>>>>>
>>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>>
>>>>>>>>>>> Just saying that the address to the table in the catalog could
>>>>>>>>>>> itself have several properties. The key being that no matter what those
>>>>>>>>>>> are, we should follow #1 and only store properties that are about the ptr,
>>>>>>>>>>> not the content/metadata.
>>>>>>>>>>>
>>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can
>>>>>>>>>>> someone confirm that it is true? And that it is possible/not problematic?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking about
>>>>>>>>>>>> these problems as well, so this is a good time to have a discussion about
>>>>>>>>>>>> Hive config.
>>>>>>>>>>>>
>>>>>>>>>>>> I think that Hive configuration should work mostly like other
>>>>>>>>>>>> engines, where different configurations are used for different purposes.
>>>>>>>>>>>> Different purposes means that there is not a global configuration priority.
>>>>>>>>>>>> Hopefully, I can explain how we use the different config sources elsewhere
>>>>>>>>>>>> to clarify.
>>>>>>>>>>>>
>>>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a
>>>>>>>>>>>> Hadoop Configuration, but it also has its own global configuration. There
>>>>>>>>>>>> are also Iceberg table properties, and all of the various Hive properties
>>>>>>>>>>>> if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>>>
>>>>>>>>>>>> The first step is to simplify where we can, so we effectively
>>>>>>>>>>>> eliminate 2 sources of config:
>>>>>>>>>>>>
>>>>>>>>>>>>    - The Hadoop Configuration is only used to instantiate
>>>>>>>>>>>>    Hadoop classes, like FileSystem. Iceberg should not use it for any other
>>>>>>>>>>>>    config.
>>>>>>>>>>>>    - Config in the Hive MetaStore is only used to identify
>>>>>>>>>>>>    that a table is Iceberg and point to its metadata location. All other
>>>>>>>>>>>>    config in HMS is informational. For example, the input format is
>>>>>>>>>>>>    FileInputFormat so that non-Iceberg readers cannot actually instantiate the
>>>>>>>>>>>>    format (it’s abstract) but it is available so they also don’t fail trying
>>>>>>>>>>>>    to load the class. Table-specific config should not be stored in table or
>>>>>>>>>>>>    serde properties.
>>>>>>>>>>>>
>>>>>>>>>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>>>>>>>>>
>>>>>>>>>>>> Iceberg differs from other tables because it is opinionated:
>>>>>>>>>>>> data configuration should be maintained at the table level. This is cleaner
>>>>>>>>>>>> for users because config is standardized across engines and in one place.
>>>>>>>>>>>> And it also enables services that analyze a table and update its
>>>>>>>>>>>> configuration to tune options that users almost never do, like row group or
>>>>>>>>>>>> stripe size in the columnar formats. Iceberg table configuration is used to
>>>>>>>>>>>> configure table-specific concerns and behavior.
>>>>>>>>>>>>
>>>>>>>>>>>> Spark configuration is used for engine-specific concerns, and
>>>>>>>>>>>> runtime overrides. A good example of an engine-specific concern is the
>>>>>>>>>>>> catalogs that are available to load Iceberg tables. Spark has a way to load
>>>>>>>>>>>> and configure catalog implementations and Iceberg uses that for all
>>>>>>>>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>>>>>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>>>>>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>>>>>>>>> passed to the individual read. Note that these necessarily have different
>>>>>>>>>>>> config names for how they are used: Iceberg uses
>>>>>>>>>>>> read.split.target-size and the read-specific option is
>>>>>>>>>>>> target-size.
>>>>>>>>>>>>
>>>>>>>>>>>> Applying this to Hive is a little strange for a couple reasons.
>>>>>>>>>>>> First, Hive’s engine configuration *is* a Hadoop
>>>>>>>>>>>> Configuration. As a result, I think the right place to store
>>>>>>>>>>>> engine-specific config is there, including Iceberg catalogs using a
>>>>>>>>>>>> strategy similar to what Spark does: what external Iceberg catalogs are
>>>>>>>>>>>> available and their configuration should come from the HiveConf.
>>>>>>>>>>>>
>>>>>>>>>>>> The second way Hive is strange is that Hive needs to use its
>>>>>>>>>>>> own MetaStore to track Hive table concerns. The MetaStore may have tables
>>>>>>>>>>>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>>>>>>>>>>>> tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>>>
>>>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>>>
>>>>>>>>>>>>    - There should be a default HiveCatalog that uses the
>>>>>>>>>>>>    current MetaStore URI to be used for HiveCatalog tables tracked in the
>>>>>>>>>>>>    MetaStore
>>>>>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>>>>>    - HMS table properties should be used to determine how to
>>>>>>>>>>>>    load a table: using a Hadoop location, using the default metastore catalog,
>>>>>>>>>>>>    or using an external Iceberg catalog
>>>>>>>>>>>>       - If there is a metadata_location, then use the
>>>>>>>>>>>>       HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>>>       - If there is a catalog property, then load that catalog
>>>>>>>>>>>>       and use it to load the table identifier, or maybe an identifier from HMS
>>>>>>>>>>>>       table properties
>>>>>>>>>>>>       - If there is no catalog or metadata_location, then use
>>>>>>>>>>>>       HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>>>
>>>>>>>>>>>> This would make it possible to access all types of Iceberg
>>>>>>>>>>>> tables in the same query, and would match how Spark and Flink configure
>>>>>>>>>>>> catalogs. Other than the configuration above, I don’t think that config in
>>>>>>>>>>>> HMS should be used at all, like how the other engines work. Iceberg is the
>>>>>>>>>>>> source of truth for table metadata, HMS stores how to load the Iceberg
>>>>>>>>>>>> table, and HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>>>>>
>>>>>>>>>>>> This isn’t quite how configuration works right now. Currently,
>>>>>>>>>>>> the catalog is controlled by a HiveConf property,
>>>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables will be
>>>>>>>>>>>> used to load table locations. If it is set, then that catalog will be used
>>>>>>>>>>>> to load all tables by name. This makes it impossible to load tables from
>>>>>>>>>>>> different catalogs at the same time. That’s why I think the Iceberg catalog
>>>>>>>>>>>> for a table should be stored in HMS table properties.
>>>>>>>>>>>>
>>>>>>>>>>>> I should also explain iceberg.hive.engine.enabled flag, but I
>>>>>>>>>>>> think this is long enough for now.
>>>>>>>>>>>>
>>>>>>>>>>>> rb
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would like to start a discussion, how should we handle
>>>>>>>>>>>>> properties from various sources like Iceberg, Hive or global configuration.
>>>>>>>>>>>>> I've put together a short document
>>>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>> Netflix
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix

Re: Iceberg/Hive properties handling

Posted by Peter Vary <pv...@cloudera.com.INVALID>.
I will ask Laszlo if he wants to update his doc.

I see both pros and cons of catalog definition in config files. If there is
an easy default then I do not mind any of the proposed solutions.

OTOH I am in favor of the "use prefix for Iceberg table properties"
solution, because in Hive it is common to add new keys to the property list
- no restriction is in place (I am not even sure that the currently
implemented blacklist for preventing properties from being propagated to
Iceberg tables is complete). Also I expect that we want to add Hive write
specific configs to table level when the general engine independent
configuration is not ideal for Hive, but every Hive query for a given table
should use some specific config.

Thanks, Peter

Jacques Nadeau <ja...@dremio.com> ezt írta (időpont: 2020. dec. 1., Ke
17:06):

> Would someone be willing to create a document that states the current
> proposal?
>
> It is becoming somewhat difficult to follow this thread. I also worry that
> without a complete statement of the current shape that people may be
> incorrectly thinking they are in alignment.
>
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
>
> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <bo...@cloudera.com>
> wrote:
>
>> Thanks, Ryan. I answered inline.
>>
>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> This sounds like a good plan overall, but I have a couple of notes:
>>>
>>>    1. We need to keep in mind that users plug in their own catalogs, so
>>>    iceberg.catalog could be a Glue or Nessie catalog, not just Hive or
>>>    Hadoop. I don’t think it makes much sense to use separate hadoop.catalog
>>>    and hive.catalog values. Those should just be names for catalogs configured
>>>    in Configuration, i.e., via hive-site.xml. We then only need a
>>>    special value for loading Hadoop tables from paths.
>>>
>>> About extensibility, I think the usual Hive way is to use Java class
>> names. So this way the value for 'iceberg.catalog' could be e.g.
>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
>> would need to have a factory method that constructs the catalog object from
>> a properties object (Map<String, String>). E.g.
>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
>> 'iceberg.catalog_location' to be present in properties.
>>
>>>
>>>    1. I don’t think that catalog configuration should be kept in table
>>>    properties. A catalog should not be loaded for each table. So I don’t think
>>>    we need iceberg.catalog_location. Instead, we should have a way to
>>>    define catalogs in the Configuration for tables in the metastore to
>>>    reference.
>>>
>>>  I think it makes sense, on the other hand it would make adding new
>> catalogs more heavy-weight, i.e. now you'd need to edit configuration files
>> and restart/reinit services. Maybe it can be cumbersome in some
>> environments.
>>
>>>
>>>    1. I’d rather use a prefix to exclude properties from being passed
>>>    to Iceberg than to include them. Otherwise, users don’t know what to do to
>>>    pass table properties from Hive or Impala. If we exclude a prefix or
>>>    specific properties, then everything but the properties reserved for
>>>    locating the table are passed as the user would expect.
>>>
>>> I don't have a strong opinion about this, but yeah, maybe this behavior
>> would cause the least surprises.
>>
>>>
>>>
>>>
>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <bo...@apache.org>
>>> wrote:
>>>
>>>> Thanks, Peter. I answered inline.
>>>>
>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <pv...@cloudera.com.invalid>
>>>> wrote:
>>>>
>>>>> Hi Zoltan,
>>>>>
>>>>> Answers below:
>>>>>
>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>>> boroknagyz@cloudera.com.INVALID> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for the replies. My take for the above questions are as follows
>>>>>
>>>>>    - Should 'iceberg.catalog' be a required property?
>>>>>    - Yeah, I think it would be nice if this would be required to
>>>>>       avoid any implicit behavior
>>>>>
>>>>> Currently we have a Catalogs class to get/initialize/use the different
>>>>> Catalogs. At that time the decision was to use HadoopTables as a default
>>>>> catalog.
>>>>> It might be worthwhile to use the same class in Impala as well, so the
>>>>> behavior is consistent.
>>>>>
>>>>
>>>> Yeah, I think it'd be beneficial for us to use the Iceberg classes
>>>> whenever possible. The Catalogs class is very similar to what we have
>>>> currently in Impala.
>>>>
>>>>>
>>>>>    - 'hadoop.catalog' LOCATION and catalog_location
>>>>>       - In Impala we don't allow setting LOCATION for tables stored
>>>>>       in 'hadoop.catalog'. But Impala internally sets LOCATION to the Iceberg
>>>>>       table's actual location. We were also thinking about using only the table
>>>>>       LOCATION, and set it to the catalog location, but we also found it
>>>>>       confusing.
>>>>>
>>>>> It could definitely work, but it is somewhat strange that we have an
>>>>> external table location set to an arbitrary path, and we have a different
>>>>> location generated by other configs. It would be nice to have the real
>>>>> location set in the external table location as well.
>>>>>
>>>>
>>>> Impala sets the real Iceberg table location for external tables. E.g.
>>>> if the user issues
>>>>
>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>> STORED AS ICEBERG
>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>               'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>
>>>> If the end user had specified LOCATION, then Impala would have raised
>>>> an error. But the above DDL statement is correct, so Impala loads the
>>>> iceberg table via Iceberg API, then creates the HMS table and sets LOCATION
>>>> to the Iceberg table location (something like
>>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>>
>>>>
>>>>> I like the flexibility of setting the table_identifier on table level,
>>>>> which could help removing naming conflicts. We might want to have this in
>>>>> the Iceberg Catalog implementation.
>>>>>
>>>>>
>>>>>    - 'iceberg.table_identifier' for HiveCatalog
>>>>>       - Yeah, it doesn't add much if we only allow using the current
>>>>>       HMS. I think it can be only useful if we are allowing external HMSes.
>>>>>    - Moving properties to SERDEPROPERTIES
>>>>>       - I see that these properties are used by the SerDe classes in
>>>>>       Hive, but I feel that these properties are just not about serialization and
>>>>>       deserialization. And as I see the current SERDEPROPERTIES are things like
>>>>>       'field.delim', 'separatorChar', 'quoteChar', etc. So properties about table
>>>>>       loading more naturally belong to TBLPROPERTIES in my opinion.
>>>>>
>>>>> I have seen it used both ways for HBaseSerDe. (even the wiki page uses
>>>>> both :) ). Since Impala prefers TBLPROPERTIES and if we start using prefix
>>>>> for separating real Iceberg table properties from other properties, then we
>>>>> can keep it at TBLPROPERTIES.
>>>>>
>>>>
>>>> In the google doc I also had a comment about prefixing iceberg table
>>>> properties. We could use a prefix like 'iceberg.tblproperties.', and pass
>>>> every property with this prefix to the Iceberg table. Currently Impala
>>>> passes every table property to the Iceberg table.
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>>     Zoltan
>>>>>
>>>>>
>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pv...@cloudera.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Based on the discussion below I understand we have the following
>>>>>> kinds of properties:
>>>>>>
>>>>>>    1. Iceberg table properties - Engine independent, storage related
>>>>>>    parameters
>>>>>>    2. "how to get to" - I think these are mostly Hive table specific
>>>>>>    properties, since for Spark, the Spark catalog configuration serves for the
>>>>>>    same purpose. I think the best place for storing these would be the
>>>>>>    Hive SERDEPROPERTIES, as this describes the access information for the
>>>>>>    SerDe. Sidenote: I think we should decide if we allow
>>>>>>    HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier'
>>>>>>    would make sense only if we allow having multiple catalogs.
>>>>>>    3. Query specific properties - These are engine specific and
>>>>>>    might be mapped to / even override the Iceberg table properties on the
>>>>>>    engine specific code paths, but currently these properties have independent
>>>>>>    names and mapped on a case-by-case basis.
>>>>>>
>>>>>>
>>>>>> Based on this:
>>>>>>
>>>>>>    - Shall we move the "how to get to" properties to SERDEPROPERTIES?
>>>>>>    - Shall we define a prefix for setting Iceberg table properties
>>>>>>    from Hive queries and omitting other engine specific properties?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Peter
>>>>>>
>>>>>>
>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com> wrote:
>>>>>>
>>>>>> I like these suggestions, comments inline below on the last round...
>>>>>>
>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <
>>>>>> boroknagyz@apache.org> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>>> information about table loading in HMS table properties. We are just a bit
>>>>>>> more explicit about which catalog to use.
>>>>>>> We have table property 'iceberg.catalog' to determine the catalog
>>>>>>> type, right now the supported values are 'hadoop.tables', 'hadoop.catalog',
>>>>>>> and 'hive.catalog'. Additional table properties can be set based on the
>>>>>>> catalog type.
>>>>>>>
>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>
>>>>>>
>>>>>> I'm all for renaming this, having "mr" in the property name is
>>>>>> confusing.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>    - hadoop.tables
>>>>>>>       - the table location is used to load the table
>>>>>>>
>>>>>>> The only question I have is should we have this as the default? i.e.
>>>>>> if you don't set a catalog it will assume it's HadoopTables and use the
>>>>>> location? Or should we require this property to be here to be consistent
>>>>>> and avoid any "magic"?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>    - hadoop.catalog
>>>>>>>       - Required table property 'iceberg.catalog_location'
>>>>>>>       specifies the location of the hadoop catalog in the file system
>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>       is used as table identifier
>>>>>>>
>>>>>>> I like this as it would allow you to use a different database and
>>>>>> table name in Hive as opposed to the Hadoop Catalog - at the moment they
>>>>>> have to match. The only thing here is that I think Hive requires a table
>>>>>> LOCATION to be set and it's then confusing as there are now two locations
>>>>>> on the table. I'm not sure whether in the Hive storage handler or SerDe
>>>>>> etc. we can get Hive to not require that and maybe even disallow it from
>>>>>> being set. That would probably be best in conjunction with this. Another
>>>>>> solution would be to not have the 'iceberg.catalog_location' property but
>>>>>> instead use the table LOCATION for this but that's a bit confusing from a
>>>>>> Hive point of view.
>>>>>>
>>>>>>
>>>>>>>    - hive.catalog
>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>       is used as table identifier
>>>>>>>       - We have the assumption that the current Hive metastore
>>>>>>>       stores the table, i.e. we don't support external Hive metastores currently
>>>>>>>
>>>>>>> These sound fine for Hive catalog tables that are created outside of
>>>>>> the automatic Hive table creation (see
>>>>>> https://iceberg.apache.org/hive/ -> Using Hive Catalog) we'd just
>>>>>> need to document how you can create these yourself and that one could use a
>>>>>> different Hive database and table etc.
>>>>>>
>>>>>>
>>>>>>> Independent of catalog implementations, but we also have table
>>>>>>> property 'iceberg.file_format' to specify the file format for the data
>>>>>>> files.
>>>>>>>
>>>>>>
>>>>>> OK, I don't think we need that for Hive?
>>>>>>
>>>>>>
>>>>>>> We haven't released it yet, so we are open to changes, but I think
>>>>>>> these properties are reasonable and it would be great if we could
>>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>>> metastore of tables.
>>>>>>>
>>>>>>>
>>>>>> If others agree I think we should create an issue where we document
>>>>>> the above changes so it's very clear what we're doing and can then go an
>>>>>> implement them and update the docs etc.
>>>>>>
>>>>>>
>>>>>>> Cheers,
>>>>>>>     Zoltan
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>
>>>>>>>> #4 is correct because we provide some information that is
>>>>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>>>>> current user). I also agree that it would be good to have a table
>>>>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>>>>> gives us a way to handle name conflicts.
>>>>>>>>
>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <ja...@dremio.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>
>>>>>>>>> db1.table1_etl_branch =>
>>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jacques Nadeau
>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <ja...@dremio.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I agree with Ryan on the core principles here. As I understand
>>>>>>>>>> them:
>>>>>>>>>>
>>>>>>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>>>>>>    2. Hive table properties describe "how to get to" Iceberg
>>>>>>>>>>    metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>>>    3. There could be default "how to get to" information set at
>>>>>>>>>>    a global level
>>>>>>>>>>    4. Best-effort schema should be stored in the table
>>>>>>>>>>    properties in HMS. This should be done for information schema retrieval
>>>>>>>>>>    purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>>>
>>>>>>>>>> Is that a fair summary of your statements Ryan (except 4, which I
>>>>>>>>>> just added)?
>>>>>>>>>>
>>>>>>>>>> One comment I have on #2 is that for different catalogs and use
>>>>>>>>>> cases, I think it can be somewhat more complex where it would be
>>>>>>>>>> desirable for a table that initially existed without Hive that was later
>>>>>>>>>> exposed in Hive to support a ptr/path/token for how the table is named
>>>>>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>>>>>
>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>
>>>>>>>>>> Similarly, you might want to expose a particular branch version
>>>>>>>>>> of a table. So it might say:
>>>>>>>>>>
>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>
>>>>>>>>>> Just saying that the address to the table in the catalog could
>>>>>>>>>> itself have several properties. The key being that no matter what those
>>>>>>>>>> are, we should follow #1 and only store properties that are about the ptr,
>>>>>>>>>> not the content/metadata.
>>>>>>>>>>
>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can
>>>>>>>>>> someone confirm that it is true? And that it is possible/not problematic?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jacques Nadeau
>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking about
>>>>>>>>>>> these problems as well, so this is a good time to have a discussion about
>>>>>>>>>>> Hive config.
>>>>>>>>>>>
>>>>>>>>>>> I think that Hive configuration should work mostly like other
>>>>>>>>>>> engines, where different configurations are used for different purposes.
>>>>>>>>>>> Different purposes means that there is not a global configuration priority.
>>>>>>>>>>> Hopefully, I can explain how we use the different config sources elsewhere
>>>>>>>>>>> to clarify.
>>>>>>>>>>>
>>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a
>>>>>>>>>>> Hadoop Configuration, but it also has its own global configuration. There
>>>>>>>>>>> are also Iceberg table properties, and all of the various Hive properties
>>>>>>>>>>> if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>>
>>>>>>>>>>> The first step is to simplify where we can, so we effectively
>>>>>>>>>>> eliminate 2 sources of config:
>>>>>>>>>>>
>>>>>>>>>>>    - The Hadoop Configuration is only used to instantiate
>>>>>>>>>>>    Hadoop classes, like FileSystem. Iceberg should not use it for any other
>>>>>>>>>>>    config.
>>>>>>>>>>>    - Config in the Hive MetaStore is only used to identify that
>>>>>>>>>>>    a table is Iceberg and point to its metadata location. All other config in
>>>>>>>>>>>    HMS is informational. For example, the input format is FileInputFormat so
>>>>>>>>>>>    that non-Iceberg readers cannot actually instantiate the format (it’s
>>>>>>>>>>>    abstract) but it is available so they also don’t fail trying to load the
>>>>>>>>>>>    class. Table-specific config should not be stored in table or serde
>>>>>>>>>>>    properties.
>>>>>>>>>>>
>>>>>>>>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>>>>>>>>
>>>>>>>>>>> Iceberg differs from other tables because it is opinionated:
>>>>>>>>>>> data configuration should be maintained at the table level. This is cleaner
>>>>>>>>>>> for users because config is standardized across engines and in one place.
>>>>>>>>>>> And it also enables services that analyze a table and update its
>>>>>>>>>>> configuration to tune options that users almost never do, like row group or
>>>>>>>>>>> stripe size in the columnar formats. Iceberg table configuration is used to
>>>>>>>>>>> configure table-specific concerns and behavior.
>>>>>>>>>>>
>>>>>>>>>>> Spark configuration is used for engine-specific concerns, and
>>>>>>>>>>> runtime overrides. A good example of an engine-specific concern is the
>>>>>>>>>>> catalogs that are available to load Iceberg tables. Spark has a way to load
>>>>>>>>>>> and configure catalog implementations and Iceberg uses that for all
>>>>>>>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>>>>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>>>>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>>>>>>>> passed to the individual read. Note that these necessarily have different
>>>>>>>>>>> config names for how they are used: Iceberg uses
>>>>>>>>>>> read.split.target-size and the read-specific option is
>>>>>>>>>>> target-size.
>>>>>>>>>>>
>>>>>>>>>>> Applying this to Hive is a little strange for a couple reasons.
>>>>>>>>>>> First, Hive’s engine configuration *is* a Hadoop Configuration.
>>>>>>>>>>> As a result, I think the right place to store engine-specific config is
>>>>>>>>>>> there, including Iceberg catalogs using a strategy similar to what Spark
>>>>>>>>>>> does: what external Iceberg catalogs are available and their configuration
>>>>>>>>>>> should come from the HiveConf.
>>>>>>>>>>>
>>>>>>>>>>> The second way Hive is strange is that Hive needs to use its own
>>>>>>>>>>> MetaStore to track Hive table concerns. The MetaStore may have tables
>>>>>>>>>>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>>>>>>>>>>> tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>>
>>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>>
>>>>>>>>>>>    - There should be a default HiveCatalog that uses the
>>>>>>>>>>>    current MetaStore URI to be used for HiveCatalog tables tracked in the
>>>>>>>>>>>    MetaStore
>>>>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>>>>    - HMS table properties should be used to determine how to
>>>>>>>>>>>    load a table: using a Hadoop location, using the default metastore catalog,
>>>>>>>>>>>    or using an external Iceberg catalog
>>>>>>>>>>>       - If there is a metadata_location, then use the
>>>>>>>>>>>       HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>>       - If there is a catalog property, then load that catalog
>>>>>>>>>>>       and use it to load the table identifier, or maybe an identifier from HMS
>>>>>>>>>>>       table properties
>>>>>>>>>>>       - If there is no catalog or metadata_location, then use
>>>>>>>>>>>       HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>>
>>>>>>>>>>> This would make it possible to access all types of Iceberg
>>>>>>>>>>> tables in the same query, and would match how Spark and Flink configure
>>>>>>>>>>> catalogs. Other than the configuration above, I don’t think that config in
>>>>>>>>>>> HMS should be used at all, like how the other engines work. Iceberg is the
>>>>>>>>>>> source of truth for table metadata, HMS stores how to load the Iceberg
>>>>>>>>>>> table, and HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>>>>
>>>>>>>>>>> This isn’t quite how configuration works right now. Currently,
>>>>>>>>>>> the catalog is controlled by a HiveConf property,
>>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables will be
>>>>>>>>>>> used to load table locations. If it is set, then that catalog will be used
>>>>>>>>>>> to load all tables by name. This makes it impossible to load tables from
>>>>>>>>>>> different catalogs at the same time. That’s why I think the Iceberg catalog
>>>>>>>>>>> for a table should be stored in HMS table properties.
>>>>>>>>>>>
>>>>>>>>>>> I should also explain iceberg.hive.engine.enabled flag, but I
>>>>>>>>>>> think this is long enough for now.
>>>>>>>>>>>
>>>>>>>>>>> rb
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to start a discussion, how should we handle
>>>>>>>>>>>> properties from various sources like Iceberg, Hive or global configuration.
>>>>>>>>>>>> I've put together a short document
>>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> Netflix
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>

Re: Iceberg/Hive properties handling

Posted by Jacques Nadeau <ja...@dremio.com>.
Would someone be willing to create a document that states the current
proposal?

It is becoming somewhat difficult to follow this thread. I also worry that,
without a complete statement of the current shape, people may be
incorrectly thinking they are in alignment.



--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <bo...@cloudera.com>
wrote:

> Thanks, Ryan. I answered inline.
>
> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> This sounds like a good plan overall, but I have a couple of notes:
>>
>>    1. We need to keep in mind that users plug in their own catalogs, so
>>    iceberg.catalog could be a Glue or Nessie catalog, not just Hive or
>>    Hadoop. I don’t think it makes much sense to use separate hadoop.catalog
>>    and hive.catalog values. Those should just be names for catalogs configured
>>    in Configuration, i.e., via hive-site.xml. We then only need a
>>    special value for loading Hadoop tables from paths.
>>
>> About extensibility, I think the usual Hive way is to use Java class
> names. So this way the value for 'iceberg.catalog' could be e.g.
> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
> would need to have a factory method that constructs the catalog object from
> a properties object (Map<String, String>). E.g.
> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
> 'iceberg.catalog_location' to be present in properties.
>
>>
>>    2. I don’t think that catalog configuration should be kept in table
>>    properties. A catalog should not be loaded for each table. So I don’t think
>>    we need iceberg.catalog_location. Instead, we should have a way to
>>    define catalogs in the Configuration for tables in the metastore to
>>    reference.
>>
>>  I think it makes sense, on the other hand it would make adding new
> catalogs more heavy-weight, i.e. now you'd need to edit configuration files
> and restart/reinit services. Maybe it can be cumbersome in some
> environments.
>
>>
>>    3. I’d rather use a prefix to exclude properties from being passed to
>>    Iceberg than to include them. Otherwise, users don’t know what to do to
>>    pass table properties from Hive or Impala. If we exclude a prefix or
>>    specific properties, then everything but the properties reserved for
>>    locating the table are passed as the user would expect.
>>
>> I don't have a strong opinion about this, but yeah, maybe this behavior
> would cause the least surprises.
>
>>
>>
>>
>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <bo...@apache.org>
>> wrote:
>>
>>> Thanks, Peter. I answered inline.
>>>
>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <pv...@cloudera.com.invalid>
>>> wrote:
>>>
>>>> Hi Zoltan,
>>>>
>>>> Answers below:
>>>>
>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>> boroknagyz@cloudera.com.INVALID> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Thanks for the replies. My take on the above questions is as follows
>>>>
>>>>    - Should 'iceberg.catalog' be a required property?
>>>>    - Yeah, I think it would be nice if this would be required to avoid
>>>>       any implicit behavior
>>>>
>>>> Currently we have a Catalogs class to get/initialize/use the different
>>>> Catalogs. At that time the decision was to use HadoopTables as a default
>>>> catalog.
>>>> It might be worthwhile to use the same class in Impala as well, so the
>>>> behavior is consistent.
>>>>
>>>
>>> Yeah, I think it'd be beneficial for us to use the Iceberg classes
>>> whenever possible. The Catalogs class is very similar to what we have
>>> currently in Impala.
>>>
>>>>
>>>>    - 'hadoop.catalog' LOCATION and catalog_location
>>>>       - In Impala we don't allow setting LOCATION for tables stored in
>>>>       'hadoop.catalog'. But Impala internally sets LOCATION to the Iceberg
>>>>       table's actual location. We were also thinking about using only the table
>>>>       LOCATION, and set it to the catalog location, but we also found it
>>>>       confusing.
>>>>
>>>> It could definitely work, but it is somewhat strange that we have an
>>>> external table location set to an arbitrary path, and we have a different
>>>> location generated by other configs. It would be nice to have the real
>>>> location set in the external table location as well.
>>>>
>>>
>>> Impala sets the real Iceberg table location for external tables. E.g. if
>>> the user issues
>>>
>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>> STORED AS ICEBERG
>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>               'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>
>>> If the end user had specified LOCATION, then Impala would have raised an
>>> error. But the above DDL statement is correct, so Impala loads the iceberg
>>> table via Iceberg API, then creates the HMS table and sets LOCATION to the
>>> Iceberg table location (something like
>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>
>>>
>>>> I like the flexibility of setting the table_identifier on table level,
>>>> which could help removing naming conflicts. We might want to have this in
>>>> the Iceberg Catalog implementation.
>>>>
>>>>
>>>>    - 'iceberg.table_identifier' for HiveCatalog
>>>>       - Yeah, it doesn't add much if we only allow using the current
>>>>       HMS. I think it can be only useful if we are allowing external HMSes.
>>>>    - Moving properties to SERDEPROPERTIES
>>>>       - I see that these properties are used by the SerDe classes in
>>>>       Hive, but I feel that these properties are just not about serialization and
>>>>       deserialization. And as I see the current SERDEPROPERTIES are things like
>>>>       'field.delim', 'separatorChar', 'quoteChar', etc. So properties about table
>>>>       loading more naturally belong to TBLPROPERTIES in my opinion.
>>>>
>>>> I have seen it used both ways for HBaseSerDe. (even the wiki page uses
>>>> both :) ). Since Impala prefers TBLPROPERTIES and if we start using prefix
>>>> for separating real Iceberg table properties from other properties, then we
>>>> can keep it at TBLPROPERTIES.
>>>>
>>>
>>> In the google doc I also had a comment about prefixing iceberg table
>>> properties. We could use a prefix like 'iceberg.tblproperties.', and pass
>>> every property with this prefix to the Iceberg table. Currently Impala
>>> passes every table property to the Iceberg table.
>>>
>>>
>>>>
>>>> Thanks,
>>>>     Zoltan
>>>>
>>>>
>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pv...@cloudera.com.invalid>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Based on the discussion below I understand we have the following kinds
>>>>> of properties:
>>>>>
>>>>>    1. Iceberg table properties - Engine independent, storage related
>>>>>    parameters
>>>>>    2. "how to get to" - I think these are mostly Hive table specific
>>>>>    properties, since for Spark, the Spark catalog configuration serves for the
>>>>>    same purpose. I think the best place for storing these would be the
>>>>>    Hive SERDEPROPERTIES, as this describes the access information for the
>>>>>    SerDe. Sidenote: I think we should decide if we allow HiveCatalogs
>>>>>    pointing to a different HMS and the 'iceberg.table_identifier' would make
>>>>>    sense only if we allow having multiple catalogs.
>>>>>    3. Query specific properties - These are engine specific and might
>>>>>    be mapped to / even override the Iceberg table properties on the engine
>>>>>    specific code paths, but currently these properties have independent names
>>>>>    and mapped on a case-by-case basis.
>>>>>
>>>>>
>>>>> Based on this:
>>>>>
>>>>>    - Shall we move the "how to get to" properties to SERDEPROPERTIES?
>>>>>    - Shall we define a prefix for setting Iceberg table properties
>>>>>    from Hive queries and omitting other engine specific properties?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>>
>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com> wrote:
>>>>>
>>>>> I like these suggestions, comments inline below on the last round...
>>>>>
>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <bo...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>> information about table loading in HMS table properties. We are just a bit
>>>>>> more explicit about which catalog to use.
>>>>>> We have table property 'iceberg.catalog' to determine the catalog
>>>>>> type, right now the supported values are 'hadoop.tables', 'hadoop.catalog',
>>>>>> and 'hive.catalog'. Additional table properties can be set based on the
>>>>>> catalog type.
>>>>>>
>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>
>>>>>
>>>>> I'm all for renaming this, having "mr" in the property name is
>>>>> confusing.
>>>>>
>>>>>
>>>>>>
>>>>>>    - hadoop.tables
>>>>>>       - the table location is used to load the table
>>>>>>
>>>>>> The only question I have is should we have this as the default? i.e.
>>>>> if you don't set a catalog it will assume its HadoopTables and use the
>>>>> location? Or should we require this property to be here to be consistent
>>>>> and avoid any "magic"?
>>>>>
>>>>>
>>>>>>
>>>>>>    - hadoop.catalog
>>>>>>       - Required table property 'iceberg.catalog_location' specifies
>>>>>>       the location of the hadoop catalog in the file system
>>>>>>       - Optional table property 'iceberg.table_identifier' specifies
>>>>>>       the table id. If it's not set, then <database_name>.<table_name> is used as
>>>>>>       table identifier
>>>>>>
>>>>>> I like this as it would allow you to use a different database and
>>>>> table name in Hive as opposed to the Hadoop Catalog - at the moment they
>>>>> have to match. The only thing here is that I think Hive requires a table
>>>>> LOCATION to be set and it's then confusing as there are now two locations
>>>>> on the table. I'm not sure whether in the Hive storage handler or SerDe
>>>>> etc. we can get Hive to not require that and maybe even disallow it from
>>>>> being set. That would probably be best in conjunction with this. Another
>>>>> solution would be to not have the 'iceberg.catalog_location' property but
>>>>> instead use the table LOCATION for this but that's a bit confusing from a
>>>>> Hive point of view.
>>>>>
>>>>>
>>>>>>    - hive.catalog
>>>>>>       - Optional table property 'iceberg.table_identifier' specifies
>>>>>>       the table id. If it's not set, then <database_name>.<table_name> is used as
>>>>>>       table identifier
>>>>>>       - We have the assumption that the current Hive metastore
>>>>>>       stores the table, i.e. we don't support external Hive metastores currently
>>>>>>
>>>>>> These sound fine for Hive catalog tables that are created outside of
>>>>> the automatic Hive table creation (see
>>>>> https://iceberg.apache.org/hive/ -> Using Hive Catalog) we'd just
>>>>> need to document how you can create these yourself and that one could use a
>>>>> different Hive database and table etc.
>>>>>
>>>>>
>>>>>> Independent of catalog implementations, but we also have table
>>>>>> property 'iceberg.file_format' to specify the file format for the data
>>>>>> files.
>>>>>>
>>>>>
>>>>> OK, I don't think we need that for Hive?
>>>>>
>>>>>
>>>>>> We haven't released it yet, so we are open to changes, but I think
>>>>>> these properties are reasonable and it would be great if we could
>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>> metastore of tables.
>>>>>>
>>>>>>
>>>>> If others agree I think we should create an issue where we document
>>>>> the above changes so it's very clear what we're doing and can then go and
>>>>> implement them and update the docs etc.
>>>>>
>>>>>
>>>>>> Cheers,
>>>>>>     Zoltan
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>
>>>>>>> #4 is correct because we provide some information that is
>>>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>>>> current user). I also agree that it would be good to have a table
>>>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>>>> gives us a way to handle name conflicts.
>>>>>>>
>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <ja...@dremio.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Minor error, my last example should have been:
>>>>>>>>
>>>>>>>> db1.table1_etl_branch =>
>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jacques Nadeau
>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <ja...@dremio.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I agree with Ryan on the core principles here. As I understand
>>>>>>>>> them:
>>>>>>>>>
>>>>>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>>>>>    2. Hive table properties describe "how to get to" Iceberg
>>>>>>>>>    metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>>    3. There could be default "how to get to" information set at a
>>>>>>>>>    global level
>>>>>>>>>    4. Best-effort schema should be stored in the table properties
>>>>>>>>>    in HMS. This should be done for information schema retrieval purposes
>>>>>>>>>    within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>>
>>>>>>>>> Is that a fair summary of your statements Ryan (except 4, which I
>>>>>>>>> just added)?
>>>>>>>>>
>>>>>>>>> One comment I have on #2 is that for different catalogs and use
>>>>>>>>> cases, I think it can be somewhat more complex where it would be
>>>>>>>>> desirable for a table that initially existed without Hive that was later
>>>>>>>>> exposed in Hive to support a ptr/path/token for how the table is named
>>>>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>>>>
>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>
>>>>>>>>> Similarly, you might want to expose a particular branch version of
>>>>>>>>> a table. So it might say:
>>>>>>>>>
>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>
>>>>>>>>> Just saying that the address to the table in the catalog could
>>>>>>>>> itself have several properties. The key being that no matter what those
>>>>>>>>> are, we should follow #1 and only store properties that are about the ptr,
>>>>>>>>> not the content/metadata.
>>>>>>>>>
>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can
>>>>>>>>> someone confirm that it is true? And that it is possible/not problematic?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jacques Nadeau
>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking about
>>>>>>>>>> these problems as well, so this is a good time to have a discussion about
>>>>>>>>>> Hive config.
>>>>>>>>>>
>>>>>>>>>> I think that Hive configuration should work mostly like other
>>>>>>>>>> engines, where different configurations are used for different purposes.
>>>>>>>>>> Different purposes means that there is not a global configuration priority.
>>>>>>>>>> Hopefully, I can explain how we use the different config sources elsewhere
>>>>>>>>>> to clarify.
>>>>>>>>>>
>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a
>>>>>>>>>> Hadoop Configuration, but it also has its own global configuration. There
>>>>>>>>>> are also Iceberg table properties, and all of the various Hive properties
>>>>>>>>>> if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>
>>>>>>>>>> The first step is to simplify where we can, so we effectively
>>>>>>>>>> eliminate 2 sources of config:
>>>>>>>>>>
>>>>>>>>>>    - The Hadoop Configuration is only used to instantiate Hadoop
>>>>>>>>>>    classes, like FileSystem. Iceberg should not use it for any other config.
>>>>>>>>>>    - Config in the Hive MetaStore is only used to identify that
>>>>>>>>>>    a table is Iceberg and point to its metadata location. All other config in
>>>>>>>>>>    HMS is informational. For example, the input format is FileInputFormat so
>>>>>>>>>>    that non-Iceberg readers cannot actually instantiate the format (it’s
>>>>>>>>>>    abstract) but it is available so they also don’t fail trying to load the
>>>>>>>>>>    class. Table-specific config should not be stored in table or serde
>>>>>>>>>>    properties.
>>>>>>>>>>
>>>>>>>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>>>>>>>
>>>>>>>>>> Iceberg differs from other tables because it is opinionated: data
>>>>>>>>>> configuration should be maintained at the table level. This is cleaner for
>>>>>>>>>> users because config is standardized across engines and in one place. And
>>>>>>>>>> it also enables services that analyze a table and update its configuration
>>>>>>>>>> to tune options that users almost never do, like row group or stripe size
>>>>>>>>>> in the columnar formats. Iceberg table configuration is used to configure
>>>>>>>>>> table-specific concerns and behavior.
>>>>>>>>>>
>>>>>>>>>> Spark configuration is used for engine-specific concerns, and
>>>>>>>>>> runtime overrides. A good example of an engine-specific concern is the
>>>>>>>>>> catalogs that are available to load Iceberg tables. Spark has a way to load
>>>>>>>>>> and configure catalog implementations and Iceberg uses that for all
>>>>>>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>>>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>>>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>>>>>>> passed to the individual read. Note that these necessarily have different
>>>>>>>>>> config names for how they are used: Iceberg uses
>>>>>>>>>> read.split.target-size and the read-specific option is
>>>>>>>>>> target-size.
>>>>>>>>>>
>>>>>>>>>> Applying this to Hive is a little strange for a couple reasons.
>>>>>>>>>> First, Hive’s engine configuration *is* a Hadoop Configuration.
>>>>>>>>>> As a result, I think the right place to store engine-specific config is
>>>>>>>>>> there, including Iceberg catalogs using a strategy similar to what Spark
>>>>>>>>>> does: what external Iceberg catalogs are available and their configuration
>>>>>>>>>> should come from the HiveConf.
>>>>>>>>>>
>>>>>>>>>> The second way Hive is strange is that Hive needs to use its own
>>>>>>>>>> MetaStore to track Hive table concerns. The MetaStore may have tables
>>>>>>>>>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>>>>>>>>>> tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>
>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>
>>>>>>>>>>    - There should be a default HiveCatalog that uses the current
>>>>>>>>>>    MetaStore URI to be used for HiveCatalog tables tracked in the MetaStore
>>>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>>>    - HMS table properties should be used to determine how to
>>>>>>>>>>    load a table: using a Hadoop location, using the default metastore catalog,
>>>>>>>>>>    or using an external Iceberg catalog
>>>>>>>>>>       - If there is a metadata_location, then use the
>>>>>>>>>>       HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>       - If there is a catalog property, then load that catalog
>>>>>>>>>>       and use it to load the table identifier, or maybe an identifier from HMS
>>>>>>>>>>       table properties
>>>>>>>>>>       - If there is no catalog or metadata_location, then use
>>>>>>>>>>       HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>
>>>>>>>>>> This would make it possible to access all types of Iceberg tables
>>>>>>>>>> in the same query, and would match how Spark and Flink configure catalogs.
>>>>>>>>>> Other than the configuration above, I don’t think that config in HMS should
>>>>>>>>>> be used at all, like how the other engines work. Iceberg is the source of
>>>>>>>>>> truth for table metadata, HMS stores how to load the Iceberg table, and
>>>>>>>>>> HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>>>
>>>>>>>>>> This isn’t quite how configuration works right now. Currently,
>>>>>>>>>> the catalog is controlled by a HiveConf property,
>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables will be used
>>>>>>>>>> to load table locations. If it is set, then that catalog will be used to
>>>>>>>>>> load all tables by name. This makes it impossible to load tables from
>>>>>>>>>> different catalogs at the same time. That’s why I think the Iceberg catalog
>>>>>>>>>> for a table should be stored in HMS table properties.
>>>>>>>>>>
>>>>>>>>>> I should also explain iceberg.hive.engine.enabled flag, but I
>>>>>>>>>> think this is long enough for now.
>>>>>>>>>>
>>>>>>>>>> rb
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> I would like to start a discussion, how should we handle
>>>>>>>>>>> properties from various sources like Iceberg, Hive or global configuration.
>>>>>>>>>>> I've put together a short document
>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Laszlo
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

Re: Iceberg/Hive properties handling

Posted by Zoltán Borók-Nagy <bo...@cloudera.com.INVALID>.
Thanks, Ryan. I answered inline.

On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com> wrote:

> This sounds like a good plan overall, but I have a couple of notes:
>
>    1. We need to keep in mind that users plug in their own catalogs, so
>    iceberg.catalog could be a Glue or Nessie catalog, not just Hive or
>    Hadoop. I don’t think it makes much sense to use separate hadoop.catalog
>    and hive.catalog values. Those should just be names for catalogs configured
>    in Configuration, i.e., via hive-site.xml. We then only need a special
>    value for loading Hadoop tables from paths.
>
> About extensibility, I think the usual Hive way is to use Java class
names. That way the value of 'iceberg.catalog' could be e.g.
'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
would need to have a factory method that constructs the catalog object from
a properties object (Map<String, String>). For example,
'org.apache.iceberg.hadoop.HadoopCatalog' would require
'iceberg.catalog_location' to be present in the properties.
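
A minimal sketch of that idea, assuming each implementation exposes a no-arg
constructor and validates its own required keys; the factory and the key names
below are illustrative only, not a defined contract:

import java.util.Map;
import org.apache.iceberg.catalog.Catalog;

// Sketch: 'iceberg.catalog' holds a fully qualified class name.
final class ClassNameCatalogFactory {
  static Catalog create(Map<String, String> tableProps) throws Exception {
    String impl = tableProps.get("iceberg.catalog"); // e.g. org.apache.iceberg.hadoop.HadoopCatalog
    Catalog catalog = (Catalog) Class.forName(impl).getDeclaredConstructor().newInstance();
    // A HadoopCatalog-style implementation would fail fast here if
    // 'iceberg.catalog_location' were missing from tableProps.
    return catalog;
  }
}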

>
>    2. I don’t think that catalog configuration should be kept in table
>    properties. A catalog should not be loaded for each table. So I don’t think
>    we need iceberg.catalog_location. Instead, we should have a way to
>    define catalogs in the Configuration for tables in the metastore to
>    reference.
>
>  I think it makes sense; on the other hand, it would make adding new
catalogs more heavyweight, i.e. now you'd need to edit configuration files
and restart/reinit services. That might be cumbersome in some
environments.

>
>    3. I’d rather use a prefix to exclude properties from being passed to
>    Iceberg than to include them. Otherwise, users don’t know what to do to
>    pass table properties from Hive or Impala. If we exclude a prefix or
>    specific properties, then everything but the properties reserved for
>    locating the table are passed as the user would expect.
>
> I don't have a strong opinion about this, but yeah, maybe this behavior
would cause the least surprises.

>
>
>
> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <bo...@apache.org>
> wrote:
>
>> Thanks, Peter. I answered inline.
>>
>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <pv...@cloudera.com.invalid>
>> wrote:
>>
>>> Hi Zoltan,
>>>
>>> Answers below:
>>>
>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>> boroknagyz@cloudera.com.INVALID> wrote:
>>>
>>> Hi,
>>>
>>> Thanks for the replies. My take on the above questions is as follows
>>>
>>>    - Should 'iceberg.catalog' be a required property?
>>>    - Yeah, I think it would be nice if this would be required to avoid
>>>       any implicit behavior
>>>
>>> Currently we have a Catalogs class to get/initialize/use the different
>>> Catalogs. At that time the decision was to use HadoopTables as a default
>>> catalog.
>>> It might be worthwhile to use the same class in Impala as well, so the
>>> behavior is consistent.
>>>
>>
>> Yeah, I think it'd be beneficial for us to use the Iceberg classes
>> whenever possible. The Catalogs class is very similar to what we have
>> currently in Impala.
>>
>>>
>>>    - 'hadoop.catalog' LOCATION and catalog_location
>>>       - In Impala we don't allow setting LOCATION for tables stored in
>>>       'hadoop.catalog'. But Impala internally sets LOCATION to the Iceberg
>>>       table's actual location. We were also thinking about using only the table
>>>       LOCATION, and set it to the catalog location, but we also found it
>>>       confusing.
>>>
>>> It could definitely work, but it is somewhat strange that we have an
>>> external table location set to an arbitrary path, and we have a different
>>> location generated by other configs. It would be nice to have the real
>>> location set in the external table location as well.
>>>
>>
>> Impala sets the real Iceberg table location for external tables. E.g. if
>> the user issues
>>
>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>> STORED AS ICEBERG
>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>               'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>
>> If the end user had specified LOCATION, then Impala would have raised an
>> error. But the above DDL statement is correct, so Impala loads the iceberg
>> table via Iceberg API, then creates the HMS table and sets LOCATION to the
>> Iceberg table location (something like
>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>
>>
>>> I like the flexibility of setting the table_identifier on table level,
>>> which could help removing naming conflicts. We might want to have this in
>>> the Iceberg Catalog implementation.
>>>
>>>
>>>    - 'iceberg.table_identifier' for HiveCatalog
>>>       - Yeah, it doesn't add much if we only allow using the current
>>>       HMS. I think it can be only useful if we are allowing external HMSes.
>>>    - Moving properties to SERDEPROPERTIES
>>>       - I see that these properties are used by the SerDe classes in
>>>       Hive, but I feel that these properties are just not about serialization and
>>>       deserialization. And as I see the current SERDEPROPERTIES are things like
>>>       'field.delim', 'separatorChar', 'quoteChar', etc. So properties about table
>>>       loading more naturally belong to TBLPROPERTIES in my opinion.
>>>
>>> I have seen it used both ways for HBaseSerDe. (even the wiki page uses
>>> both :) ). Since Impala prefers TBLPROPERTIES and if we start using prefix
>>> for separating real Iceberg table properties from other properties, then we
>>> can keep it at TBLPROPERTIES.
>>>
>>
>> In the google doc I also had a comment about prefixing iceberg table
>> properties. We could use a prefix like 'iceberg.tblproperties.', and pass
>> every property with this prefix to the Iceberg table. Currently Impala
>> passes every table property to the Iceberg table.
>>
>>
>>>
>>> Thanks,
>>>     Zoltan
>>>
>>>
>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pv...@cloudera.com.invalid>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Based on the discussion below I understand we have the following kinds
>>>> of properties:
>>>>
>>>>    1. Iceberg table properties - Engine independent, storage related
>>>>    parameters
>>>>    2. "how to get to" - I think these are mostly Hive table specific
>>>>    properties, since for Spark, the Spark catalog configuration serves for the
>>>>    same purpose. I think the best place for storing these would be the
>>>>    Hive SERDEPROPERTIES, as this describes the access information for the
>>>>    SerDe. Sidenote: I think we should decide if we allow HiveCatalogs
>>>>    pointing to a different HMS and the 'iceberg.table_identifier' would make
>>>>    sense only if we allow having multiple catalogs.
>>>>    3. Query specific properties - These are engine specific and might
>>>>    be mapped to / even override the Iceberg table properties on the engine
>>>>    specific code paths, but currently these properties have independent names
>>>>    and mapped on a case-by-case basis.
>>>>
>>>>
>>>> Based on this:
>>>>
>>>>    - Shall we move the "how to get to" properties to SERDEPROPERTIES?
>>>>    - Shall we define a prefix for setting Iceberg table properties
>>>>    from Hive queries and omitting other engine specific properties?
>>>>
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>>
>>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com> wrote:
>>>>
>>>> I like these suggestions, comments inline below on the last round...
>>>>
>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <bo...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The above aligns with what we did in Impala, i.e. we store information
>>>>> about table loading in HMS table properties. We are just a bit more
>>>>> explicit about which catalog to use.
>>>>> We have table property 'iceberg.catalog' to determine the catalog
>>>>> type, right now the supported values are 'hadoop.tables', 'hadoop.catalog',
>>>>> and 'hive.catalog'. Additional table properties can be set based on the
>>>>> catalog type.
>>>>>
>>>>> So, if the value of 'iceberg.catalog' is
>>>>>
>>>>
>>>> I'm all for renaming this, having "mr" in the property name is
>>>> confusing.
>>>>
>>>>
>>>>>
>>>>>    - hadoop.tables
>>>>>       - the table location is used to load the table
>>>>>
>>>>> The only question I have is should we have this as the default? i.e.
>>>> if you don't set a catalog it will assume its HadoopTables and use the
>>>> location? Or should we require this property to be here to be consistent
>>>> and avoid any "magic"?
>>>>
>>>>
>>>>>
>>>>>    - hadoop.catalog
>>>>>       - Required table property 'iceberg.catalog_location' specifies
>>>>>       the location of the hadoop catalog in the file system
>>>>>       - Optional table property 'iceberg.table_identifier' specifies
>>>>>       the table id. If it's not set, then <database_name>.<table_name> is used as
>>>>>       table identifier
>>>>>
>>>>> I like this as it would allow you to use a different database and
>>>> table name in Hive as opposed to the Hadoop Catalog - at the moment they
>>>> have to match. The only thing here is that I think Hive requires a table
>>>> LOCATION to be set and it's then confusing as there are now two locations
>>>> on the table. I'm not sure whether in the Hive storage handler or SerDe
>>>> etc. we can get Hive to not require that and maybe even disallow it from
>>>> being set. That would probably be best in conjunction with this. Another
>>>> solution would be to not have the 'iceberg.catalog_location' property but
>>>> instead use the table LOCATION for this but that's a bit confusing from a
>>>> Hive point of view.
>>>>
>>>>
>>>>>    - hive.catalog
>>>>>       - Optional table property 'iceberg.table_identifier' specifies
>>>>>       the table id. If it's not set, then <database_name>.<table_name> is used as
>>>>>       table identifier
>>>>>       - We have the assumption that the current Hive metastore stores
>>>>>       the table, i.e. we don't support external Hive metastores currently
>>>>>
>>>>> These sound fine for Hive catalog tables that are created outside of
>>>> the automatic Hive table creation (see https://iceberg.apache.org/hive/
>>>> -> Using Hive Catalog) we'd just need to document how you can create these
>>>> yourself and that one could use a different Hive database and table etc.
>>>>
>>>>
>>>>> Independent of catalog implementations, but we also have table
>>>>> property 'iceberg.file_format' to specify the file format for the data
>>>>> files.
>>>>>
>>>>
>>>> OK, I don't think we need that for Hive?
>>>>
>>>>
>>>>> We haven't released it yet, so we are open to changes, but I think
>>>>> these properties are reasonable and it would be great if we could
>>>>> standardize the properties across engines that use HMS as the primary
>>>>> metastore of tables.
>>>>>
>>>>>
>>>> If others agree I think we should create an issue where we document the
>>>> above changes so it's very clear what we're doing and can then go and
>>>> implement them and update the docs etc.
>>>>
>>>>
>>>>> Cheers,
>>>>>     Zoltan
>>>>>
>>>>>
>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>
>>>>>> #4 is correct because we provide some information that is
>>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>>> current user). I also agree that it would be good to have a table
>>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>>> gives us a way to handle name conflicts.
>>>>>>
>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <ja...@dremio.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Minor error, my last example should have been:
>>>>>>>
>>>>>>> db1.table1_etl_branch =>
>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>
>>>>>>> --
>>>>>>> Jacques Nadeau
>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <ja...@dremio.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I agree with Ryan on the core principles here. As I understand them:
>>>>>>>>
>>>>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>>>>    2. Hive table properties describe "how to get to" Iceberg
>>>>>>>>    metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>    3. There could be default "how to get to" information set at a
>>>>>>>>    global level
>>>>>>>>    4. Best-effort schema should be stored in the table properties
>>>>>>>>    in HMS. This should be done for information schema retrieval purposes
>>>>>>>>    within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>
>>>>>>>> Is that a fair summary of your statements Ryan (except 4, which I
>>>>>>>> just added)?
>>>>>>>>
>>>>>>>> One comment I have on #2 is that for different catalogs and use
>>>>>>>> cases, I think it can be somewhat more complex where it would be
>>>>>>>> desirable for a table that initially existed without Hive that was later
>>>>>>>> exposed in Hive to support a ptr/path/token for how the table is named
>>>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>>>
>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>
>>>>>>>> Similarly, you might want to expose a particular branch version of
>>>>>>>> a table. So it might say:
>>>>>>>>
>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>
>>>>>>>> Just saying that the address to the table in the catalog could
>>>>>>>> itself have several properties. The key being that no matter what those
>>>>>>>> are, we should follow #1 and only store properties that are about the ptr,
>>>>>>>> not the content/metadata.
>>>>>>>>
>>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can someone
>>>>>>>> confirm that it is true? And that it is possible/not problematic?
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jacques Nadeau
>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking about these
>>>>>>>>> problems as well, so this is a good time to have a discussion about Hive
>>>>>>>>> config.
>>>>>>>>>
>>>>>>>>> I think that Hive configuration should work mostly like other
>>>>>>>>> engines, where different configurations are used for different purposes.
>>>>>>>>> Different purposes means that there is not a global configuration priority.
>>>>>>>>> Hopefully, I can explain how we use the different config sources elsewhere
>>>>>>>>> to clarify.
>>>>>>>>>
>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a
>>>>>>>>> Hadoop Configuration, but it also has its own global configuration. There
>>>>>>>>> are also Iceberg table properties, and all of the various Hive properties
>>>>>>>>> if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>
>>>>>>>>> The first step is to simplify where we can, so we effectively
>>>>>>>>> eliminate 2 sources of config:
>>>>>>>>>
>>>>>>>>>    - The Hadoop Configuration is only used to instantiate Hadoop
>>>>>>>>>    classes, like FileSystem. Iceberg should not use it for any other config.
>>>>>>>>>    - Config in the Hive MetaStore is only used to identify that a
>>>>>>>>>    table is Iceberg and point to its metadata location. All other config in
>>>>>>>>>    HMS is informational. For example, the input format is FileInputFormat so
>>>>>>>>>    that non-Iceberg readers cannot actually instantiate the format (it’s
>>>>>>>>>    abstract) but it is available so they also don’t fail trying to load the
>>>>>>>>>    class. Table-specific config should not be stored in table or serde
>>>>>>>>>    properties.
>>>>>>>>>
>>>>>>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>>>>>>
>>>>>>>>> Iceberg differs from other tables because it is opinionated: data
>>>>>>>>> configuration should be maintained at the table level. This is cleaner for
>>>>>>>>> users because config is standardized across engines and in one place. And
>>>>>>>>> it also enables services that analyze a table and update its configuration
>>>>>>>>> to tune options that users almost never do, like row group or stripe size
>>>>>>>>> in the columnar formats. Iceberg table configuration is used to configure
>>>>>>>>> table-specific concerns and behavior.
>>>>>>>>>
>>>>>>>>> Spark configuration is used for engine-specific concerns, and
>>>>>>>>> runtime overrides. A good example of an engine-specific concern is the
>>>>>>>>> catalogs that are available to load Iceberg tables. Spark has a way to load
>>>>>>>>> and configure catalog implementations and Iceberg uses that for all
>>>>>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>>>>>> passed to the individual read. Note that these necessarily have different
>>>>>>>>> config names for how they are used: Iceberg uses
>>>>>>>>> read.split.target-size and the read-specific option is target-size.
>>>>>>>>>
>>>>>>>>> Applying this to Hive is a little strange for a couple reasons.
>>>>>>>>> First, Hive’s engine configuration *is* a Hadoop Configuration.
>>>>>>>>> As a result, I think the right place to store engine-specific config is
>>>>>>>>> there, including Iceberg catalogs using a strategy similar to what Spark
>>>>>>>>> does: what external Iceberg catalogs are available and their configuration
>>>>>>>>> should come from the HiveConf.
>>>>>>>>>
>>>>>>>>> The second way Hive is strange is that Hive needs to use its own
>>>>>>>>> MetaStore to track Hive table concerns. The MetaStore may have tables
>>>>>>>>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>>>>>>>>> tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>
>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>
>>>>>>>>>    - There should be a default HiveCatalog that uses the current
>>>>>>>>>    MetaStore URI to be used for HiveCatalog tables tracked in the MetaStore
>>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>>    - HMS table properties should be used to determine how to load
>>>>>>>>>    a table: using a Hadoop location, using the default metastore catalog, or
>>>>>>>>>    using an external Iceberg catalog
>>>>>>>>>       - If there is a metadata_location, then use the HiveCatalog
>>>>>>>>>       for this metastore (where it is tracked)
>>>>>>>>>       - If there is a catalog property, then load that catalog
>>>>>>>>>       and use it to load the table identifier, or maybe an identifier from HMS
>>>>>>>>>       table properties
>>>>>>>>>       - If there is no catalog or metadata_location, then use
>>>>>>>>>       HadoopTables to load the table location as an Iceberg table
>>>>>>>>>
>>>>>>>>> This would make it possible to access all types of Iceberg tables
>>>>>>>>> in the same query, and would match how Spark and Flink configure catalogs.
>>>>>>>>> Other than the configuration above, I don’t think that config in HMS should
>>>>>>>>> be used at all, like how the other engines work. Iceberg is the source of
>>>>>>>>> truth for table metadata, HMS stores how to load the Iceberg table, and
>>>>>>>>> HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>>
>>>>>>>>> This isn’t quite how configuration works right now. Currently, the
>>>>>>>>> catalog is controlled by a HiveConf property, iceberg.mr.catalog.
>>>>>>>>> If that isn’t set, HadoopTables will be used to load table locations. If it
>>>>>>>>> is set, then that catalog will be used to load all tables by name. This
>>>>>>>>> makes it impossible to load tables from different catalogs at the same
>>>>>>>>> time. That’s why I think the Iceberg catalog for a table should be stored
>>>>>>>>> in HMS table properties.
>>>>>>>>>
>>>>>>>>> I should also explain iceberg.hive.engine.enabled flag, but I
>>>>>>>>> think this is long enough for now.
>>>>>>>>>
>>>>>>>>> rb
>>>>>>>>>
>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> I would like to start a discussion, how should we handle
>>>>>>>>>> properties from various sources like Iceberg, Hive or global configuration.
>>>>>>>>>> I've put together a short document
>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Laszlo
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Iceberg/Hive properties handling

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
This sounds like a good plan overall, but I have a couple of notes:

   1. We need to keep in mind that users plug in their own catalogs, so
   iceberg.catalog could be a Glue or Nessie catalog, not just Hive or
   Hadoop. I don’t think it makes much sense to use separate hadoop.catalog
   and hive.catalog values. Those should just be names for catalogs configured
   in Configuration, i.e., via hive-site.xml. We then only need a special
   value for loading Hadoop tables from paths.
   2. I don’t think that catalog configuration should be kept in table
   properties. A catalog should not be loaded for each table. So I don’t think
   we need iceberg.catalog_location. Instead, we should have a way to
   define catalogs in the Configuration for tables in the metastore to
   reference.
   3. I’d rather use a prefix to exclude properties from being passed to
   Iceberg than to include them. Otherwise, users don’t know what to do to
   pass table properties from Hive or Impala. If we exclude a prefix or
   specific properties, then everything but the properties reserved for
   locating the table are passed as the user would expect.
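
As a sketch of the notes above, assuming a hypothetical naming scheme where
catalogs are declared in hive-site.xml as iceberg.catalog.<name>.<key> and
only a small reserved set of keys is withheld from Iceberg table properties
(the key names are assumptions for illustration, not agreed values):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;

final class CatalogConfSketch {
  // Keys reserved for locating the table; everything else passes through to Iceberg.
  private static final Set<String> RESERVED =
      Set.of("iceberg.catalog", "metadata_location", "iceberg.table_identifier");

  // Collect the configuration of one named catalog from HiveConf / hive-site.xml.
  static Map<String, String> catalogProperties(Configuration conf, String catalogName) {
    String prefix = "iceberg.catalog." + catalogName + ".";
    Map<String, String> props = new HashMap<>();
    conf.forEach(entry -> {
      if (entry.getKey().startsWith(prefix)) {
        props.put(entry.getKey().substring(prefix.length()), entry.getValue());
      }
    });
    return props;
  }

  // Exclude only the reserved keys; all other HMS table properties go to Iceberg.
  static Map<String, String> icebergTableProperties(Map<String, String> hmsTableProps) {
    Map<String, String> props = new HashMap<>(hmsTableProps);
    RESERVED.forEach(props::remove);
    return props;
  }
}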


On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <bo...@apache.org>
wrote:

> Thanks, Peter. I answered inline.
>
> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
>> Hi Zoltan,
>>
>> Answers below:
>>
>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>> boroknagyz@cloudera.com.INVALID> wrote:
>>
>> Hi,
>>
>> Thanks for the replies. My take on the above questions is as follows
>>
>>    - Should 'iceberg.catalog' be a required property?
>>    - Yeah, I think it would be nice if this would be required to avoid
>>       any implicit behavior
>>
>> Currently we have a Catalogs class to get/initialize/use the different
>> Catalogs. At that time the decision was to use HadoopTables as a default
>> catalog.
>> It might be worthwhile to use the same class in Impala as well, so the
>> behavior is consistent.
>>
>
> Yeah, I think it'd be beneficial for us to use the Iceberg classes
> whenever possible. The Catalogs class is very similar to what we have
> currently in Impala.
>
>>
>>    - 'hadoop.catalog' LOCATION and catalog_location
>>       - In Impala we don't allow setting LOCATION for tables stored in
>>       'hadoop.catalog'. But Impala internally sets LOCATION to the Iceberg
>>       table's actual location. We were also thinking about using only the table
>>       LOCATION, and set it to the catalog location, but we also found it
>>       confusing.
>>
>> It could definitely work, but it is somewhat strange that we have an
>> external table location set to an arbitrary path, and we have a different
>> location generated by other configs. It would be nice to have the real
>> location set in the external table location as well.
>>
>
> Impala sets the real Iceberg table location for external tables. E.g. if
> the user issues
>
> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
> STORED AS ICEBERG
> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>               'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>
> If the end user had specified LOCATION, then Impala would have raised an
> error. But the above DDL statement is correct, so Impala loads the iceberg
> table via Iceberg API, then creates the HMS table and sets LOCATION to the
> Iceberg table location (something like
> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>
>
>> I like the flexibility of setting the table_identifier on table level,
>> which could help removing naming conflicts. We might want to have this in
>> the Iceberg Catalog implementation.
>>
>>
>>    - 'iceberg.table_identifier' for HiveCatalog
>>       - Yeah, it doesn't add much if we only allow using the current
>>       HMS. I think it can be only useful if we are allowing external HMSes.
>>    - Moving properties to SERDEPROPERTIES
>>       - I see that these properties are used by the SerDe classes in
>>       Hive, but I feel that these properties are just not about serialization and
>>       deserialization. And as I see the current SERDEPROPERTIES are things like
>>       'field.delim', 'separatorChar', 'quoteChar', etc. So properties about table
>>       loading more naturally belong to TBLPROPERTIES in my opinion.
>>
>> I have seen it used both ways for HBaseSerDe. (even the wiki page uses
>> both :) ). Since Impala prefers TBLPROPERTIES and if we start using prefix
>> for separating real Iceberg table properties from other properties, then we
>> can keep it at TBLPROPERTIES.
>>
>
> In the google doc I also had a comment about prefixing iceberg table
> properties. We could use a prefix like 'iceberg.tblproperties.', and pass
> every property with this prefix to the Iceberg table. Currently Impala
> passes every table property to the Iceberg table.
>
>
>>
>> Thanks,
>>     Zoltan
>>
>>
>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pv...@cloudera.com.invalid>
>> wrote:
>>
>>> Hi,
>>>
>>> Based on the discussion below I understand we have the following kinds
>>> of properties:
>>>
>>>    1. Iceberg table properties - Engine independent, storage related
>>>    parameters
>>>    2. "how to get to" - I think these are mostly Hive table specific
>>>    properties, since for Spark, the Spark catalog configuration serves for the
>>>    same purpose. I think the best place for storing these would be the
>>>    Hive SERDEPROPERTIES, as this describes the access information for the
>>>    SerDe. Sidenote: I think we should decide if we allow HiveCatalogs
>>>    pointing to a different HMS and the 'iceberg.table_identifier' would make
>>>    sense only if we allow having multiple catalogs.
>>>    3. Query specific properties - These are engine specific and might
>>>    be mapped to / even override the Iceberg table properties on the engine
>>>    specific code paths, but currently these properties have independent names
>>>    and mapped on a case-by-case basis.
>>>
>>>
>>> Based on this:
>>>
>>>    - Shall we move the "how to get to" properties to SERDEPROPERTIES?
>>>    - Shall we define a prefix for setting Iceberg table properties from
>>>    Hive queries and omitting other engine specific properties?
>>>
>>>
>>> Thanks,
>>> Peter
>>>
>>>
>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com> wrote:
>>>
>>> I like these suggestions, comments inline below on the last round...
>>>
>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <bo...@apache.org>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> The above aligns with what we did in Impala, i.e. we store information
>>>> about table loading in HMS table properties. We are just a bit more
>>>> explicit about which catalog to use.
>>>> We have table property 'iceberg.catalog' to determine the catalog type,
>>>> right now the supported values are 'hadoop.tables', 'hadoop.catalog', and
>>>> 'hive.catalog'. Additional table properties can be set based on the catalog
>>>> type.
>>>>
>>>> So, if the value of 'iceberg.catalog' is
>>>>
>>>
>>> I'm all for renaming this, having "mr" in the property name is confusing.
>>>
>>>
>>>>
>>>>    - hadoop.tables
>>>>       - the table location is used to load the table
>>>>
>>>> The only question I have is should we have this as the default? i.e. if
>>> you don't set a catalog it will assume its HadoopTables and use the
>>> location? Or should we require this property to be here to be consistent
>>> and avoid any "magic"?
>>>
>>>
>>>>
>>>>    - hadoop.catalog
>>>>       - Required table property 'iceberg.catalog_location' specifies
>>>>       the location of the hadoop catalog in the file system
>>>>       - Optional table property 'iceberg.table_identifier' specifies
>>>>       the table id. If it's not set, then <database_name>.<table_name> is used as
>>>>       table identifier
>>>>
>>>> I like this as it would allow you to use a different database and table
>>> name in Hive as opposed to the Hadoop Catalog - at the moment they have to
>>> match. The only thing here is that I think Hive requires a table LOCATION
>>> to be set and it's then confusing as there are now two locations on the
>>> table. I'm not sure whether in the Hive storage handler or SerDe etc. we
>>> can get Hive to not require that and maybe even disallow it from being set.
>>> That would probably be best in conjunction with this. Another solution
>>> would be to not have the 'iceberg.catalog_location' property but instead
>>> use the table LOCATION for this but that's a bit confusing from a Hive
>>> point of view.
>>>
>>>
>>>>    - hive.catalog
>>>>       - Optional table property 'iceberg.table_identifier' specifies
>>>>       the table id. If it's not set, then <database_name>.<table_name> is used as
>>>>       table identifier
>>>>       - We have the assumption that the current Hive metastore stores
>>>>       the table, i.e. we don't support external Hive metastores currently
>>>>
>>>> These sound fine for Hive catalog tables that are created outside of
>>> the automatic Hive table creation (see https://iceberg.apache.org/hive/
>>> -> Using Hive Catalog) we'd just need to document how you can create these
>>> yourself and that one could use a different Hive database and table etc.
>>>
>>>
>>>> Independent of catalog implementations, but we also have table property
>>>> 'iceberg.file_format' to specify the file format for the data files.
>>>>
>>>
>>> OK, I don't think we need that for Hive?
>>>
>>>
>>>> We haven't released it yet, so we are open to changes, but I think
>>>> these properties are reasonable and it would be great if we could
>>>> standardize the properties across engines that use HMS as the primary
>>>> metastore of tables.
>>>>
>>>>
>>> If others agree I think we should create an issue where we document the
>>> above changes so it's very clear what we're doing and can then go and
>>> implement them and update the docs etc.
>>>
>>>
>>>> Cheers,
>>>>     Zoltan
>>>>
>>>>
>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> Yes, I think that is a good summary of the principles.
>>>>>
>>>>> #4 is correct because we provide some information that is
>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>> current user). I also agree that it would be good to have a table
>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>> gives us a way to handle name conflicts.
>>>>>
>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <ja...@dremio.com>
>>>>> wrote:
>>>>>
>>>>>> Minor error, my last example should have been:
>>>>>>
>>>>>> db1.table1_etl_branch =>
>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>
>>>>>> --
>>>>>> Jacques Nadeau
>>>>>> CTO and Co-Founder, Dremio
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <ja...@dremio.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I agree with Ryan on the core principles here. As I understand them:
>>>>>>>
>>>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>>>    2. Hive table properties describe "how to get to" Iceberg
>>>>>>>    metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>    3. There could be default "how to get to" information set at a
>>>>>>>    global level
>>>>>>>    4. Best-effort schema should be stored in the table properties
>>>>>>>    in HMS. This should be done for information schema retrieval purposes
>>>>>>>    within Hive but should be ignored during Hive/other tool execution.
>>>>>>>
>>>>>>> Is that a fair summary of your statements Ryan (except 4, which I
>>>>>>> just added)?
>>>>>>>
>>>>>>> One comment I have on #2 is that for different catalogs and use
>>>>>>> cases, I think it can be somewhat more complex where it would be
>>>>>>> desirable for a table that initially existed without Hive that was later
>>>>>>> exposed in Hive to support a ptr/path/token for how the table is named
>>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>>
>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>
>>>>>>> Similarly, you might want to expose a particular branch version of a
>>>>>>> table. So it might say:
>>>>>>>
>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>
>>>>>>> Just saying that the address to the table in the catalog could
>>>>>>> itself have several properties. The key being that no matter what those
>>>>>>> are, we should follow #1 and only store properties that are about the ptr,
>>>>>>> not the content/metadata.
>>>>>>>
>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can someone
>>>>>>> confirm that it is true? And that it is possible/not problematic?
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jacques Nadeau
>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking about these
>>>>>>>> problems as well, so this is a good time to have a discussion about Hive
>>>>>>>> config.
>>>>>>>>
>>>>>>>> I think that Hive configuration should work mostly like other
>>>>>>>> engines, where different configurations are used for different purposes.
>>>>>>>> Different purposes means that there is not a global configuration priority.
>>>>>>>> Hopefully, I can explain how we use the different config sources elsewhere
>>>>>>>> to clarify.
>>>>>>>>
>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a
>>>>>>>> Hadoop Configuration, but it also has its own global configuration. There
>>>>>>>> are also Iceberg table properties, and all of the various Hive properties
>>>>>>>> if you’re tracking tables with a Hive MetaStore.
>>>>>>>>
>>>>>>>> The first step is to simplify where we can, so we effectively
>>>>>>>> eliminate 2 sources of config:
>>>>>>>>
>>>>>>>>    - The Hadoop Configuration is only used to instantiate Hadoop
>>>>>>>>    classes, like FileSystem. Iceberg should not use it for any other config.
>>>>>>>>    - Config in the Hive MetaStore is only used to identify that a
>>>>>>>>    table is Iceberg and point to its metadata location. All other config in
>>>>>>>>    HMS is informational. For example, the input format is FileInputFormat so
>>>>>>>>    that non-Iceberg readers cannot actually instantiate the format (it’s
>>>>>>>>    abstract) but it is available so they also don’t fail trying to load the
>>>>>>>>    class. Table-specific config should not be stored in table or serde
>>>>>>>>    properties.
>>>>>>>>
>>>>>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>>>>>
>>>>>>>> Iceberg differs from other tables because it is opinionated: data
>>>>>>>> configuration should be maintained at the table level. This is cleaner for
>>>>>>>> users because config is standardized across engines and in one place. And
>>>>>>>> it also enables services that analyze a table and update its configuration
>>>>>>>> to tune options that users almost never do, like row group or stripe size
>>>>>>>> in the columnar formats. Iceberg table configuration is used to configure
>>>>>>>> table-specific concerns and behavior.
>>>>>>>>
>>>>>>>> Spark configuration is used for engine-specific concerns, and
>>>>>>>> runtime overrides. A good example of an engine-specific concern is the
>>>>>>>> catalogs that are available to load Iceberg tables. Spark has a way to load
>>>>>>>> and configure catalog implementations and Iceberg uses that for all
>>>>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>>>>> passed to the individual read. Note that these necessarily have different
>>>>>>>> config names for how they are used: Iceberg uses
>>>>>>>> read.split.target-size and the read-specific option is target-size.
>>>>>>>>
>>>>>>>> Applying this to Hive is a little strange for a couple reasons.
>>>>>>>> First, Hive’s engine configuration *is* a Hadoop Configuration. As
>>>>>>>> a result, I think the right place to store engine-specific config is there,
>>>>>>>> including Iceberg catalogs using a strategy similar to what Spark does:
>>>>>>>> what external Iceberg catalogs are available and their configuration should
>>>>>>>> come from the HiveConf.
>>>>>>>>
>>>>>>>> The second way Hive is strange is that Hive needs to use its own
>>>>>>>> MetaStore to track Hive table concerns. The MetaStore may have tables
>>>>>>>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>>>>>>>> tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>
>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>
>>>>>>>>    - There should be a default HiveCatalog that uses the current
>>>>>>>>    MetaStore URI to be used for HiveCatalog tables tracked in the MetaStore
>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>    - HMS table properties should be used to determine how to load
>>>>>>>>    a table: using a Hadoop location, using the default metastore catalog, or
>>>>>>>>    using an external Iceberg catalog
>>>>>>>>       - If there is a metadata_location, then use the HiveCatalog
>>>>>>>>       for this metastore (where it is tracked)
>>>>>>>>       - If there is a catalog property, then load that catalog and
>>>>>>>>       use it to load the table identifier, or maybe an identifier from HMS table
>>>>>>>>       properties
>>>>>>>>       - If there is no catalog or metadata_location, then use
>>>>>>>>       HadoopTables to load the table location as an Iceberg table
>>>>>>>>
>>>>>>>> This would make it possible to access all types of Iceberg tables
>>>>>>>> in the same query, and would match how Spark and Flink configure catalogs.
>>>>>>>> Other than the configuration above, I don’t think that config in HMS should
>>>>>>>> be used at all, like how the other engines work. Iceberg is the source of
>>>>>>>> truth for table metadata, HMS stores how to load the Iceberg table, and
>>>>>>>> HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>
>>>>>>>> This isn’t quite how configuration works right now. Currently, the
>>>>>>>> catalog is controlled by a HiveConf property, iceberg.mr.catalog.
>>>>>>>> If that isn’t set, HadoopTables will be used to load table locations. If it
>>>>>>>> is set, then that catalog will be used to load all tables by name. This
>>>>>>>> makes it impossible to load tables from different catalogs at the same
>>>>>>>> time. That’s why I think the Iceberg catalog for a table should be stored
>>>>>>>> in HMS table properties.
>>>>>>>>
>>>>>>>> I should also explain iceberg.hive.engine.enabled flag, but I
>>>>>>>> think this is long enough for now.
>>>>>>>>
>>>>>>>> rb
>>>>>>>>
>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I would like to start a discussion, how should we handle
>>>>>>>>> properties from various sources like Iceberg, Hive or global configuration.
>>>>>>>>> I've put together a short document
>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Laszlo
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>

-- 
Ryan Blue
Software Engineer
Netflix

Re: Iceberg/Hive properties handling

Posted by Zoltán Borók-Nagy <bo...@apache.org>.
Thanks, Peter. I answered inline.

On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <pv...@cloudera.com.invalid>
wrote:

> Hi Zoltan,
>
> Answers below:
>
> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
> boroknagyz@cloudera.com.INVALID> wrote:
>
> Hi,
>
> Thanks for the replies. My take on the above questions is as follows
>
>    - Should 'iceberg.catalog' be a required property?
>    - Yeah, I think it would be nice if this would be required to avoid
>       any implicit behavior
>
> Currently we have a Catalogs class to get/initialize/use the different
> Catalogs. At that time the decision was to use HadoopTables as a default
> catalog.
> It might be worthwhile to use the same class in Impala as well, so the
> behavior is consistent.
>

Yeah, I think it'd be beneficial for us to use the Iceberg classes whenever
possible. The Catalogs class is very similar to what we have currently in
Impala.

>
>    - 'hadoop.catalog' LOCATION and catalog_location
>       - In Impala we don't allow setting LOCATION for tables stored in
>       'hadoop.catalog'. But Impala internally sets LOCATION to the Iceberg
>       table's actual location. We were also thinking about using only the table
>       LOCATION, and set it to the catalog location, but we also found it
>       confusing.
>
> It could definitely work, but it is somewhat strange that we have an
> external table location set to an arbitrary path, and we have a different
> location generated by other configs. It would be nice to have the real
> location set in the external table location as well.
>

Impala sets the real Iceberg table location for external tables. E.g. if
the user issues

CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
STORED AS ICEBERG
TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
              'iceberg.catalog_location'='/path/to/hadoop/catalog',
              'iceberg.table_identifier'='namespace1.namespace2.ice_t');

If the end user had specified LOCATION, then Impala would have raised an
error. But the above DDL statement is correct, so Impala loads the Iceberg
table via the Iceberg API, then creates the HMS table and sets LOCATION to the
Iceberg table location (something like
/path/to/hadoop/catalog/namespace1/namespace2/ice_t).
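
For comparison, here is a minimal sketch of the 'hadoop.tables' case, where the
user-supplied LOCATION itself is what gets loaded and no catalog location is
needed (property names as in the Impala example above; the table name and path
are only illustrations):

CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_tables
STORED AS ICEBERG
LOCATION '/path/to/iceberg/table'
TBLPROPERTIES('iceberg.catalog'='hadoop.tables');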


> I like the flexibility of setting the table_identifier at the table level,
> which could help remove naming conflicts. We might want to have this in
> the Iceberg Catalog implementation.
>
>
>    - 'iceberg.table_identifier' for HiveCatalog
>       - Yeah, it doesn't add much if we only allow using the current HMS.
>       I think it can be only useful if we are allowing external HMSes.
>    - Moving properties to SERDEPROPERTIES
>       - I see that these properties are used by the SerDe classes in
>       Hive, but I feel that these properties are just not about serialization and
>       deserialization. And as I see the current SERDEPROPERTIES are things like
>       'field.delim', 'separatorChar', 'quoteChar', etc. So properties about table
>       loading more naturally belong to TBLPROPERTIES in my opinion.
>
> I have seen it used both ways for HBaseSerDe. (even the wiki page uses
> both :) ). Since Impala prefers TBLPROPERTIES and if we start using a prefix
> for separating real Iceberg table properties from other properties, then we
> can keep it at TBLPROPERTIES.
>

In the Google doc I also had a comment about prefixing Iceberg table
properties. We could use a prefix like 'iceberg.tblproperties.', and pass
every property with this prefix to the Iceberg table. Currently Impala
passes every table property to the Iceberg table.
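
As a rough sketch of how such a prefix could work (the 'iceberg.tblproperties.'
prefix is only a proposal at this point, the table name is illustrative, and
'write.format.default' is an existing Iceberg table property used here only as
an example):

ALTER TABLE my_hive_db.ice_t
SET TBLPROPERTIES('iceberg.tblproperties.write.format.default'='parquet');
-- only properties carrying the prefix would be forwarded to the Iceberg table;
-- unprefixed TBLPROPERTIES would remain Hive/engine specific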


>
> Thanks,
>     Zoltan
>
>
> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
>> Hi,
>>
>> Based on the discussion below I understand we have the following kinds of
>> properties:
>>
>>    1. Iceberg table properties - Engine independent, storage related
>>    parameters
>>    2. "how to get to" - I think these are mostly Hive table specific
>>    properties, since for Spark, the Spark catalog configuration serves for the
>>    same purpose. I think the best place for storing these would be the
>>    Hive SERDEPROPERTIES, as this describes the access information for the
>>    SerDe. Sidenote: I think we should decide if we allow HiveCatalogs
>>    pointing to a different HMS and the 'iceberg.table_identifier' would make
>>    sense only if we allow having multiple catalogs.
>>    3. Query specific properties - These are engine specific and might be
>>    mapped to / even override the Iceberg table properties on the engine
>>    specific code paths, but currently these properties have independent names
>>    and mapped on a case-by-case basis.
>>
>>
>> Based on this:
>>
>>    - Shall we move the "how to get to" properties to SERDEPROPERTIES?
>>    - Shall we define a prefix for setting Iceberg table properties from
>>    Hive queries and omitting other engine specific properties?
>>
>>
>> Thanks,
>> Peter
>>
>>
>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com> wrote:
>>
>> I like these suggestions, comments inline below on the last round...
>>
>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <bo...@apache.org>
>> wrote:
>>
>>> Hi,
>>>
>>> The above aligns with what we did in Impala, i.e. we store information
>>> about table loading in HMS table properties. We are just a bit more
>>> explicit about which catalog to use.
>>> We have table property 'iceberg.catalog' to determine the catalog type,
>>> right now the supported values are 'hadoop.tables', 'hadoop.catalog', and
>>> 'hive.catalog'. Additional table properties can be set based on the catalog
>>> type.
>>>
>>> So, if the value of 'iceberg.catalog' is
>>>
>>
>> I'm all for renaming this, having "mr" in the property name is confusing.
>>
>>
>>>
>>>    - hadoop.tables
>>>       - the table location is used to load the table
>>>
>>> The only question I have is should we have this as the default? i.e. if
>> you don't set a catalog it will assume it's HadoopTables and use the
>> location? Or should we require this property to be here to be consistent
>> and avoid any "magic"?
>>
>>
>>>
>>>    - hadoop.catalog
>>>       - Required table property 'iceberg.catalog_location' specifies
>>>       the location of the hadoop catalog in the file system
>>>       - Optional table property 'iceberg.table_identifier' specifies
>>>       the table id. If it's not set, then <database_name>.<table_name> is used as
>>>       table identifier
>>>
>>> I like this as it would allow you to use a different database and table
>> name in Hive as opposed to the Hadoop Catalog - at the moment they have to
>> match. The only thing here is that I think Hive requires a table LOCATION
>> to be set and it's then confusing as there are now two locations on the
>> table. I'm not sure whether in the Hive storage handler or SerDe etc. we
>> can get Hive to not require that and maybe even disallow it from being set.
>> That would probably be best in conjunction with this. Another solution
>> would be to not have the 'iceberg.catalog_location' property but instead
>> use the table LOCATION for this but that's a bit confusing from a Hive
>> point of view.
>>
>>
>>>    - hive.catalog
>>>       - Optional table property 'iceberg.table_identifier' specifies
>>>       the table id. If it's not set, then <database_name>.<table_name> is used as
>>>       table identifier
>>>       - We have the assumption that the current Hive metastore stores
>>>       the table, i.e. we don't support external Hive metastores currently
>>>
>>> These sound fine for Hive catalog tables that are created outside of the
>> automatic Hive table creation (see https://iceberg.apache.org/hive/ ->
>> Using Hive Catalog) we'd just need to document how you can create these
>> yourself and that one could use a different Hive database and table etc.
>>
>>
>>> Independent of catalog implementations, but we also have table property
>>> 'iceberg.file_format' to specify the file format for the data files.
>>>
>>
>> OK, I don't think we need that for Hive?
>>
>>
>>> We haven't released it yet, so we are open to changes, but I think these
>>> properties are reasonable and it would be great if we could standardize the
>>> properties across engines that use HMS as the primary metastore of tables.
>>>
>>>
>> If others agree I think we should create an issue where we document the
>> above changes so it's very clear what we're doing and can then go and
>> implement them and update the docs etc.
>>
>>
>>> Cheers,
>>>     Zoltan
>>>
>>>
>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> Yes, I think that is a good summary of the principles.
>>>>
>>>> #4 is correct because we provide some information that is informational
>>>> (Hive schema) or tracked only by the metastore (best-effort current user).
>>>> I also agree that it would be good to have a table identifier in HMS table
>>>> metadata when loading from an external table. That gives us a way to handle
>>>> name conflicts.
>>>>
>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <ja...@dremio.com>
>>>> wrote:
>>>>
>>>>> Minor error, my last example should have been:
>>>>>
>>>>> db1.table1_etl_branch =>
>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>
>>>>> --
>>>>> Jacques Nadeau
>>>>> CTO and Co-Founder, Dremio
>>>>>
>>>>>
>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <ja...@dremio.com>
>>>>> wrote:
>>>>>
>>>>>> I agree with Ryan on the core principles here. As I understand them:
>>>>>>
>>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>>    2. Hive table properties describe "how to get to" Iceberg
>>>>>>    metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>    3. There could be default "how to get to" information set at a
>>>>>>    global level
>>>>>>    4. Best-effort schema should be stored in the table properties in
>>>>>>    HMS. This should be done for information schema retrieval purposes within
>>>>>>    Hive but should be ignored during Hive/other tool execution.
>>>>>>
>>>>>> Is that a fair summary of your statements Ryan (except 4, which I
>>>>>> just added)?
>>>>>>
>>>>>> One comment I have on #2 is that for different catalogs and use
>>>>>> cases, I think it can be somewhat more complex where it would be
>>>>>> desirable for a table that initially existed without Hive that was later
>>>>>> exposed in Hive to support a ptr/path/token for how the table is named
>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>
>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>
>>>>>> Similarly, you might want to expose a particular branch version of a
>>>>>> table. So it might say:
>>>>>>
>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>
>>>>>> Just saying that the address to the table in the catalog could itself
>>>>>> have several properties. The key being that no matter what those are, we
>>>>>> should follow #1 and only store properties that are about the ptr, not the
>>>>>> content/metadata.
>>>>>>
>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can someone
>>>>>> confirm that it is true? And that it is possible/not problematic?
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jacques Nadeau
>>>>>> CTO and Co-Founder, Dremio
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking about these
>>>>>>> problems as well, so this is a good time to have a discussion about Hive
>>>>>>> config.
>>>>>>>
>>>>>>> I think that Hive configuration should work mostly like other
>>>>>>> engines, where different configurations are used for different purposes.
>>>>>>> Different purposes means that there is not a global configuration priority.
>>>>>>> Hopefully, I can explain how we use the different config sources elsewhere
>>>>>>> to clarify.
>>>>>>>
>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a
>>>>>>> Hadoop Configuration, but it also has its own global configuration. There
>>>>>>> are also Iceberg table properties, and all of the various Hive properties
>>>>>>> if you’re tracking tables with a Hive MetaStore.
>>>>>>>
>>>>>>> The first step is to simplify where we can, so we effectively
>>>>>>> eliminate 2 sources of config:
>>>>>>>
>>>>>>>    - The Hadoop Configuration is only used to instantiate Hadoop
>>>>>>>    classes, like FileSystem. Iceberg should not use it for any other config.
>>>>>>>    - Config in the Hive MetaStore is only used to identify that a
>>>>>>>    table is Iceberg and point to its metadata location. All other config in
>>>>>>>    HMS is informational. For example, the input format is FileInputFormat so
>>>>>>>    that non-Iceberg readers cannot actually instantiate the format (it’s
>>>>>>>    abstract) but it is available so they also don’t fail trying to load the
>>>>>>>    class. Table-specific config should not be stored in table or serde
>>>>>>>    properties.
>>>>>>>
>>>>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>>>>
>>>>>>> Iceberg differs from other tables because it is opinionated: data
>>>>>>> configuration should be maintained at the table level. This is cleaner for
>>>>>>> users because config is standardized across engines and in one place. And
>>>>>>> it also enables services that analyze a table and update its configuration
>>>>>>> to tune options that users almost never do, like row group or stripe size
>>>>>>> in the columnar formats. Iceberg table configuration is used to configure
>>>>>>> table-specific concerns and behavior.
>>>>>>>
>>>>>>> Spark configuration is used for engine-specific concerns, and
>>>>>>> runtime overrides. A good example of an engine-specific concern is the
>>>>>>> catalogs that are available to load Iceberg tables. Spark has a way to load
>>>>>>> and configure catalog implementations and Iceberg uses that for all
>>>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>>>> passed to the individual read. Note that these necessarily have different
>>>>>>> config names for how they are used: Iceberg uses
>>>>>>> read.split.target-size and the read-specific option is target-size.
>>>>>>>
>>>>>>> Applying this to Hive is a little strange for a couple reasons.
>>>>>>> First, Hive’s engine configuration *is* a Hadoop Configuration. As
>>>>>>> a result, I think the right place to store engine-specific config is there,
>>>>>>> including Iceberg catalogs using a strategy similar to what Spark does:
>>>>>>> what external Iceberg catalogs are available and their configuration should
>>>>>>> come from the HiveConf.
>>>>>>>
>>>>>>> The second way Hive is strange is that Hive needs to use its own
>>>>>>> MetaStore to track Hive table concerns. The MetaStore may have tables
>>>>>>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>>>>>>> tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>
>>>>>>> Here’s how I think Hive should work:
>>>>>>>
>>>>>>>    - There should be a default HiveCatalog that uses the current
>>>>>>>    MetaStore URI to be used for HiveCatalog tables tracked in the MetaStore
>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>    - HMS table properties should be used to determine how to load a
>>>>>>>    table: using a Hadoop location, using the default metastore catalog, or
>>>>>>>    using an external Iceberg catalog
>>>>>>>       - If there is a metadata_location, then use the HiveCatalog
>>>>>>>       for this metastore (where it is tracked)
>>>>>>>       - If there is a catalog property, then load that catalog and
>>>>>>>       use it to load the table identifier, or maybe an identifier from HMS table
>>>>>>>       properties
>>>>>>>       - If there is no catalog or metadata_location, then use
>>>>>>>       HadoopTables to load the table location as an Iceberg table
>>>>>>>
>>>>>>> This would make it possible to access all types of Iceberg tables in
>>>>>>> the same query, and would match how Spark and Flink configure catalogs.
>>>>>>> Other than the configuration above, I don’t think that config in HMS should
>>>>>>> be used at all, like how the other engines work. Iceberg is the source of
>>>>>>> truth for table metadata, HMS stores how to load the Iceberg table, and
>>>>>>> HiveConf defines the catalogs (or runtime overrides).
>>>>>>>
>>>>>>> This isn’t quite how configuration works right now. Currently, the
>>>>>>> catalog is controlled by a HiveConf property, iceberg.mr.catalog.
>>>>>>> If that isn’t set, HadoopTables will be used to load table locations. If it
>>>>>>> is set, then that catalog will be used to load all tables by name. This
>>>>>>> makes it impossible to load tables from different catalogs at the same
>>>>>>> time. That’s why I think the Iceberg catalog for a table should be stored
>>>>>>> in HMS table properties.
>>>>>>>
>>>>>>> I should also explain iceberg.hive.engine.enabled flag, but I think
>>>>>>> this is long enough for now.
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I would like to start a discussion, how should we handle properties
>>>>>>>> from various sources like Iceberg, Hive or global configuration. I've put
>>>>>>>> together a short document
>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>> please have a look and let me know what you think.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Laszlo
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>

Re: Iceberg/Hive properties handling

Posted by Peter Vary <pv...@cloudera.com.INVALID>.
Hi Zoltan,

Answers below:

> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <bo...@cloudera.com.INVALID> wrote:
> 
> Hi,
> 
> Thanks for the replies. My take on the above questions is as follows
> Should 'iceberg.catalog' be a required property?
> Yeah, I think it would be nice if this would be required to avoid any implicit behavior
Currently we have a Catalogs class to get/initialize/use the different Catalogs. At that time the decision was to use HadoopTables as a default catalog.
It might be worthwhile to use the same class in Impala as well, so the behavior is consistent.
> 'hadoop.catalog' LOCATION and catalog_location
> In Impala we don't allow setting LOCATION for tables stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the Iceberg table's actual location. We were also thinking about using only the table LOCATION, and set it to the catalog location, but we also found it confusing.
It could definitely work, but it is somewhat strange that we have an external table location set to an arbitrary path, and we have a different location generated by other configs. It would be nice to have the real location set in the external table location as well.

I like the flexibility of setting the table_identifier at the table level, which could help remove naming conflicts. We might want to have this in the Iceberg Catalog implementation.
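
For example, a minimal sketch of exposing a Hadoop catalog table under a
different Hive name (reusing the property names from the Impala example earlier
in the thread; the table names, identifier and path are illustrative only):

CREATE EXTERNAL TABLE my_hive_db.orders_renamed
STORED AS ICEBERG
TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
              'iceberg.catalog_location'='/path/to/hadoop/catalog',
              'iceberg.table_identifier'='sales.orders');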
> 'iceberg.table_identifier' for HiveCatalog
> Yeah, it doesn't add much if we only allow using the current HMS. I think it can be only useful if we are allowing external HMSes.
> Moving properties to SERDEPROPERTIES
> I see that these properties are used by the SerDe classes in Hive, but I feel that these properties are just not about serialization and deserialization. And as I see the current SERDEPROPERTIES are things like 'field.delim', 'separatorChar', 'quoteChar', etc. So properties about table loading more naturally belong to TBLPROPERTIES in my opinion.
I have seen it used both ways for HBaseSerDe. (even the wiki page uses both :) ). Since Impala prefers TBLPROPERTIES and if we start using a prefix for separating real Iceberg table properties from other properties, then we can keep it at TBLPROPERTIES.


> Thanks,
>     Zoltan
> 
> 
> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
> Hi,
> 
> Based on the discussion below I understand we have the following kinds of properties:
> Iceberg table properties - Engine independent, storage related parameters
> "how to get to" - I think these are mostly Hive table specific properties, since for Spark, the Spark catalog configuration serves for the same purpose. I think the best place for storing these would be the Hive SERDEPROPERTIES, as this describes the access information for the SerDe. Sidenote: I think we should decide if we allow HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier' would make sense only if we allow having multiple catalogs.
> Query specific properties - These are engine specific and might be mapped to / even override the Iceberg table properties on the engine specific code paths, but currently these properties have independent names and mapped on a case-by-case basis. 
> 
> Based on this:
> Shall we move the "how to get to" properties to SERDEPROPERTIES?
> Shall we define a prefix for setting Iceberg table properties from Hive queries and omitting other engine specific properties?
> 
> Thanks,
> Peter
> 
> 
>> On Nov 27, 2020, at 17:45, Mass Dosage <massdosage@gmail.com> wrote:
>> 
>> I like these suggestions, comments inline below on the last round...
>> 
>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <boroknagyz@apache.org> wrote:
>> Hi,
>> 
>> The above aligns with what we did in Impala, i.e. we store information about table loading in HMS table properties. We are just a bit more explicit about which catalog to use.
>> We have table property 'iceberg.catalog' to determine the catalog type, right now the supported values are 'hadoop.tables', 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be set based on the catalog type.
>> 
>> So, if the value of 'iceberg.catalog' is
>> 
>> I'm all for renaming this, having "mr" in the property name is confusing.
>>  
>> hadoop.tables
>> the table location is used to load the table
>> The only question I have is should we have this as the default? i.e. if you don't set a catalog it will assume it's HadoopTables and use the location? Or should we require this property to be here to be consistent and avoid any "magic"?
>>  
>> hadoop.catalog
>> Required table property 'iceberg.catalog_location' specifies the location of the hadoop catalog in the file system
>> Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as table identifier
>> I like this as it would allow you to use a different database and table name in Hive as opposed to the Hadoop Catalog - at the moment they have to match. The only thing here is that I think Hive requires a table LOCATION to be set and it's then confusing as there are now two locations on the table. I'm not sure whether in the Hive storage handler or SerDe etc. we can get Hive to not require that and maybe even disallow it from being set. That would probably be best in conjunction with this. Another solution would be to not have the 'iceberg.catalog_location' property but instead use the table LOCATION for this but that's a bit confusing from a Hive point of view.
>> 
>> hive.catalog
>> Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as table identifier
>> We have the assumption that the current Hive metastore stores the table, i.e. we don't support external Hive metastores currently
>> These sound fine for Hive catalog tables that are created outside of the automatic Hive table creation (see https://iceberg.apache.org/hive/ -> Using Hive Catalog) we'd just need to document how you can create these yourself and that one could use a different Hive database and table etc.
>>  
>> Independent of catalog implementations, but we also have table property 'iceberg.file_format' to specify the file format for the data files.
>> 
>> OK, I don't think we need that for Hive?
>>  
>> We haven't released it yet, so we are open to changes, but I think these properties are reasonable and it would be great if we could standardize the properties across engines that use HMS as the primary metastore of tables.
>> 
>> 
>> If others agree I think we should create an issue where we document the above changes so it's very clear what we're doing and can then go and implement them and update the docs etc.
>>  
>> Cheers,
>>     Zoltan
>> 
>> 
>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rblue@netflix.com.invalid> wrote:
>> Yes, I think that is a good summary of the principles.
>> 
>> #4 is correct because we provide some information that is informational (Hive schema) or tracked only by the metastore (best-effort current user). I also agree that it would be good to have a table identifier in HMS table metadata when loading from an external table. That gives us a way to handle name conflicts.
>> 
>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacques@dremio.com> wrote:
>> Minor error, my last example should have been:
>> 
>> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>> 
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>> 
>> 
>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacques@dremio.com> wrote:
>> I agree with Ryan on the core principles here. As I understand them:
>> Iceberg metadata describes all properties of a table
>> Hive table properties describe "how to get to" Iceberg metadata (which catalog + possibly ptr, path, token, etc)
>> There could be default "how to get to" information set at a global level
>> Best-effort schema should be stored in the table properties in HMS. This should be done for information schema retrieval purposes within Hive but should be ignored during Hive/other tool execution.
>> Is that a fair summary of your statements Ryan (except 4, which I just added)?
>> 
>> One comment I have on #2 is that for different catalogs and use cases, I think it can be somewhat more complex where it would be desirable for a table that initially existed without Hive that was later exposed in Hive to support a ptr/path/token for how the table is named externally. For example, in a Nessie context we support arbitrary paths for an Iceberg table (such as folder1.folder2.folder3.table1). If you then want to expose that table to Hive, you might have this mapping for #2
>> 
>> db1.table1 => nessie:folder1.folder2.folder3.table1
>> 
>> Similarly, you might want to expose a particular branch version of a table. So it might say:
>> 
>> db1.table1_etl_branch => nessie.folder1@etl_branch
>> 
>> Just saying that the address to the table in the catalog could itself have several properties. The key being that no matter what those are, we should follow #1 and only store properties that are about the ptr, not the content/metadata.
>> 
>> Lastly, I believe #4 is the case but haven't tested it. Can someone confirm that it is true? And that it is possible/not problematic?
>> 
>> 
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>> 
>> 
>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rblue@netflix.com.invalid> wrote:
>> Thanks for working on this, Laszlo. I’ve been thinking about these problems as well, so this is a good time to have a discussion about Hive config.
>> 
>> I think that Hive configuration should work mostly like other engines, where different configurations are used for different purposes. Different purposes means that there is not a global configuration priority. Hopefully, I can explain how we use the different config sources elsewhere to clarify.
>> 
>> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop Configuration, but it also has its own global configuration. There are also Iceberg table properties, and all of the various Hive properties if you’re tracking tables with a Hive MetaStore.
>> 
>> The first step is to simplify where we can, so we effectively eliminate 2 sources of config:
>> 
>> The Hadoop Configuration is only used to instantiate Hadoop classes, like FileSystem. Iceberg should not use it for any other config.
>> Config in the Hive MetaStore is only used to identify that a table is Iceberg and point to its metadata location. All other config in HMS is informational. For example, the input format is FileInputFormat so that non-Iceberg readers cannot actually instantiate the format (it’s abstract) but it is available so they also don’t fail trying to load the class. Table-specific config should not be stored in table or serde properties.
>> That leaves Spark configuration and Iceberg table configuration.
>> 
>> Iceberg differs from other tables because it is opinionated: data configuration should be maintained at the table level. This is cleaner for users because config is standardized across engines and in one place. And it also enables services that analyze a table and update its configuration to tune options that users almost never do, like row group or stripe size in the columnar formats. Iceberg table configuration is used to configure table-specific concerns and behavior.
>> 
>> Spark configuration is used for engine-specific concerns, and runtime overrides. A good example of an engine-specific concern is the catalogs that are available to load Iceberg tables. Spark has a way to load and configure catalog implementations and Iceberg uses that for all catalog-level config. Runtime overrides are things like target split size. Iceberg has a table-level default split size in table properties, but this can be overridden by a Spark option for each table, as well as an option passed to the individual read. Note that these necessarily have different config names for how they are used: Iceberg uses read.split.target-size and the read-specific option is target-size.
>> 
>> Applying this to Hive is a little strange for a couple reasons. First, Hive’s engine configuration is a Hadoop Configuration. As a result, I think the right place to store engine-specific config is there, including Iceberg catalogs using a strategy similar to what Spark does: what external Iceberg catalogs are available and their configuration should come from the HiveConf.
>> 
>> The second way Hive is strange is that Hive needs to use its own MetaStore to track Hive table concerns. The MetaStore may have tables created by an Iceberg HiveCatalog, and Hive also needs to be able to load tables from other Iceberg catalogs by creating table entries for them.
>> 
>> Here’s how I think Hive should work:
>> 
>> There should be a default HiveCatalog that uses the current MetaStore URI to be used for HiveCatalog tables tracked in the MetaStore
>> Other catalogs should be defined in HiveConf
>> HMS table properties should be used to determine how to load a table: using a Hadoop location, using the default metastore catalog, or using an external Iceberg catalog
>> If there is a metadata_location, then use the HiveCatalog for this metastore (where it is tracked)
>> If there is a catalog property, then load that catalog and use it to load the table identifier, or maybe an identifier from HMS table properties
>> If there is no catalog or metadata_location, then use HadoopTables to load the table location as an Iceberg table
>> This would make it possible to access all types of Iceberg tables in the same query, and would match how Spark and Flink configure catalogs. Other than the configuration above, I don’t think that config in HMS should be used at all, like how the other engines work. Iceberg is the source of truth for table metadata, HMS stores how to load the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
>> 
>> This isn’t quite how configuration works right now. Currently, the catalog is controlled by a HiveConf property, iceberg.mr.catalog. If that isn’t set, HadoopTables will be used to load table locations. If it is set, then that catalog will be used to load all tables by name. This makes it impossible to load tables from different catalogs at the same time. That’s why I think the Iceberg catalog for a table should be stored in HMS table properties.
>> 
>> I should also explain iceberg.hive.engine.enabled flag, but I think this is long enough for now.
>> 
>> rb
>> 
>> 
>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lpinter@cloudera.com.invalid> wrote:
>> Hi All,
>> 
>> I would like to start a discussion, how should we handle properties from various sources like Iceberg, Hive or global configuration. I've put together a short document <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>, please have a look and let me know what you think.
>> 
>> Thanks,
>> Laszlo
>> 
>> 
>> -- 
>> Ryan Blue
>> Software Engineer
>> Netflix
>> 
>> 
>> -- 
>> Ryan Blue
>> Software Engineer
>> Netflix
> 


Re: Iceberg/Hive properties handling

Posted by Zoltán Borók-Nagy <bo...@cloudera.com.INVALID>.
Hi,

Thanks for the replies. My take on the above questions is as follows

   - Should 'iceberg.catalog' be a required property?
   - Yeah, I think it would be nice if this would be required to avoid any
      implicit behavior
   - 'hadoop.catalog' LOCATION and catalog_location
      - In Impala we don't allow setting LOCATION for tables stored in
      'hadoop.catalog'. But Impala internally sets LOCATION to the Iceberg
      table's actual location. We were also thinking about using only the table
      LOCATION, and set it to the catalog location, but we also found it
      confusing.
   - 'iceberg.table_identifier' for HiveCatalog
      - Yeah, it doesn't add much if we only allow using the current HMS. I
      think it can be only useful if we are allowing external HMSes.
   - Moving properties to SERDEPROPERTIES
      - I see that these properties are used by the SerDe classes in Hive,
      but I feel that these properties are just not about serialization and
      deserialization. And as I see the current SERDEPROPERTIES are things like
       'field.delim', 'separatorChar', 'quoteChar', etc. So properties about
       table loading more naturally belong to TBLPROPERTIES in my opinion.

Thanks,
    Zoltan


On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pv...@cloudera.com.invalid>
wrote:

> Hi,
>
> Based on the discussion below I understand we have the following kinds of
> properties:
>
>    1. Iceberg table properties - Engine independent, storage related
>    parameters
>    2. "how to get to" - I think these are mostly Hive table specific
>    properties, since for Spark, the Spark catalog configuration serves for the
>    same purpose. I think the best place for storing these would be the
>    Hive SERDEPROPERTIES, as this describes the access information for the
>    SerDe. Sidenote: I think we should decide if we allow HiveCatalogs
>    pointing to a different HMS and the 'iceberg.table_identifier' would make
>    sense only if we allow having multiple catalogs.
>    3. Query specific properties - These are engine specific and might be
>    mapped to / even override the Iceberg table properties on the engine
>    specific code paths, but currently these properties have independent names
>    and mapped on a case-by-case basis.
>
>
> Based on this:
>
>    - Shall we move the "how to get to" properties to SERDEPROPERTIES?
>    - Shall we define a prefix for setting Iceberg table properties from
>    Hive queries and omitting other engine specific properties?
>
>
> Thanks,
> Peter
>
>
> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com> wrote:
>
> I like these suggestions, comments inline below on the last round...
>
> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <bo...@apache.org>
> wrote:
>
>> Hi,
>>
>> The above aligns with what we did in Impala, i.e. we store information
>> about table loading in HMS table properties. We are just a bit more
>> explicit about which catalog to use.
>> We have table property 'iceberg.catalog' to determine the catalog type,
>> right now the supported values are 'hadoop.tables', 'hadoop.catalog', and
>> 'hive.catalog'. Additional table properties can be set based on the catalog
>> type.
>>
>> So, if the value of 'iceberg.catalog' is
>>
>
> I'm all for renaming this, having "mr" in the property name is confusing.
>
>
>>
>>    - hadoop.tables
>>       - the table location is used to load the table
>>
>> The only question I have is should we have this as the default? i.e. if
> you don't set a catalog it will assume it's HadoopTables and use the
> location? Or should we require this property to be here to be consistent
> and avoid any "magic"?
>
>
>>
>>    - hadoop.catalog
>>       - Required table property 'iceberg.catalog_location' specifies the
>>       location of the hadoop catalog in the file system
>>       - Optional table property 'iceberg.table_identifier' specifies the
>>       table id. If it's not set, then <database_name>.<table_name> is used as
>>       table identifier
>>
>> I like this as it would allow you to use a different database and table
> name in Hive as opposed to the Hadoop Catalog - at the moment they have to
> match. The only thing here is that I think Hive requires a table LOCATION
> to be set and it's then confusing as there are now two locations on the
> table. I'm not sure whether in the Hive storage handler or SerDe etc. we
> can get Hive to not require that and maybe even disallow it from being set.
> That would probably be best in conjunction with this. Another solution
> would be to not have the 'iceberg.catalog_location' property but instead
> use the table LOCATION for this but that's a bit confusing from a Hive
> point of view.
>
>
>>    - hive.catalog
>>       - Optional table property 'iceberg.table_identifier' specifies the
>>       table id. If it's not set, then <database_name>.<table_name> is used as
>>       table identifier
>>       - We have the assumption that the current Hive metastore stores
>>       the table, i.e. we don't support external Hive metastores currently
>>
>> These sound fine for Hive catalog tables that are created outside of the
> automatic Hive table creation (see https://iceberg.apache.org/hive/ ->
> Using Hive Catalog) we'd just need to document how you can create these
> yourself and that one could use a different Hive database and table etc.
>
>
>> Independent of catalog implementations, but we also have table property
>> 'iceberg.file_format' to specify the file format for the data files.
>>
>
> OK, I don't think we need that for Hive?
>
>
>> We haven't released it yet, so we are open to changes, but I think these
>> properties are reasonable and it would be great if we could standardize the
>> properties across engines that use HMS as the primary metastore of tables.
>>
>>
> If others agree I think we should create an issue where we document the
> above changes so it's very clear what we're doing and can then go and
> implement them and update the docs etc.
>
>
>> Cheers,
>>     Zoltan
>>
>>
>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Yes, I think that is a good summary of the principles.
>>>
>>> #4 is correct because we provide some information that is informational
>>> (Hive schema) or tracked only by the metastore (best-effort current user).
>>> I also agree that it would be good to have a table identifier in HMS table
>>> metadata when loading from an external table. That gives us a way to handle
>>> name conflicts.
>>>
>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <ja...@dremio.com>
>>> wrote:
>>>
>>>> Minor error, my last example should have been:
>>>>
>>>> db1.table1_etl_branch =>
>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
>>>>
>>>>
>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <ja...@dremio.com>
>>>> wrote:
>>>>
>>>>> I agree with Ryan on the core principles here. As I understand them:
>>>>>
>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>    2. Hive table properties describe "how to get to" Iceberg metadata
>>>>>    (which catalog + possibly ptr, path, token, etc)
>>>>>    3. There could be default "how to get to" information set at a
>>>>>    global level
>>>>>    4. Best-effort schema should be stored in the table properties in
>>>>>    HMS. This should be done for information schema retrieval purposes within
>>>>>    Hive but should be ignored during Hive/other tool execution.
>>>>>
>>>>> Is that a fair summary of your statements Ryan (except 4, which I just
>>>>> added)?
>>>>>
>>>>> One comment I have on #2 is that for different catalogs and use cases,
>>>>> I think it can be somewhat more complex where it would be desirable for a
>>>>> table that initially existed without Hive that was later exposed in Hive to
>>>>> support a ptr/path/token for how the table is named externally. For
>>>>> example, in a Nessie context we support arbitrary paths for an Iceberg
>>>>> table (such as folder1.folder2.folder3.table1). If you then want to expose
>>>>> that table to Hive, you might have this mapping for #2
>>>>>
>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>
>>>>> Similarly, you might want to expose a particular branch version of a
>>>>> table. So it might say:
>>>>>
>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>
>>>>> Just saying that the address to the table in the catalog could itself
>>>>> have several properties. The key being that no matter what those are, we
>>>>> should follow #1 and only store properties that are about the ptr, not the
>>>>> content/metadata.
>>>>>
>>>>> Lastly, I believe #4 is the case but haven't tested it. Can someone
>>>>> confirm that it is true? And that it is possible/not problematic?
>>>>>
>>>>>
>>>>> --
>>>>> Jacques Nadeau
>>>>> CTO and Co-Founder, Dremio
>>>>>
>>>>>
>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Thanks for working on this, Laszlo. I’ve been thinking about these
>>>>>> problems as well, so this is a good time to have a discussion about Hive
>>>>>> config.
>>>>>>
>>>>>> I think that Hive configuration should work mostly like other
>>>>>> engines, where different configurations are used for different purposes.
>>>>>> Different purposes means that there is not a global configuration priority.
>>>>>> Hopefully, I can explain how we use the different config sources elsewhere
>>>>>> to clarify.
>>>>>>
>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop
>>>>>> Configuration, but it also has its own global configuration. There are also
>>>>>> Iceberg table properties, and all of the various Hive properties if you’re
>>>>>> tracking tables with a Hive MetaStore.
>>>>>>
>>>>>> The first step is to simplify where we can, so we effectively
>>>>>> eliminate 2 sources of config:
>>>>>>
>>>>>>    - The Hadoop Configuration is only used to instantiate Hadoop
>>>>>>    classes, like FileSystem. Iceberg should not use it for any other config.
>>>>>>    - Config in the Hive MetaStore is only used to identify that a
>>>>>>    table is Iceberg and point to its metadata location. All other config in
>>>>>>    HMS is informational. For example, the input format is FileInputFormat so
>>>>>>    that non-Iceberg readers cannot actually instantiate the format (it’s
>>>>>>    abstract) but it is available so they also don’t fail trying to load the
>>>>>>    class. Table-specific config should not be stored in table or serde
>>>>>>    properties.
>>>>>>
>>>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>>>
>>>>>> Iceberg differs from other tables because it is opinionated: data
>>>>>> configuration should be maintained at the table level. This is cleaner for
>>>>>> users because config is standardized across engines and in one place. And
>>>>>> it also enables services that analyze a table and update its configuration
>>>>>> to tune options that users almost never do, like row group or stripe size
>>>>>> in the columnar formats. Iceberg table configuration is used to configure
>>>>>> table-specific concerns and behavior.
>>>>>>
>>>>>> Spark configuration is used for engine-specific concerns, and runtime
>>>>>> overrides. A good example of an engine-specific concern is the catalogs
>>>>>> that are available to load Iceberg tables. Spark has a way to load and
>>>>>> configure catalog implementations and Iceberg uses that for all
>>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>>> passed to the individual read. Note that these necessarily have different
>>>>>> config names for how they are used: Iceberg uses
>>>>>> read.split.target-size and the read-specific option is target-size.
>>>>>>
>>>>>> Applying this to Hive is a little strange for a couple reasons.
>>>>>> First, Hive’s engine configuration *is* a Hadoop Configuration. As a
>>>>>> result, I think the right place to store engine-specific config is there,
>>>>>> including Iceberg catalogs using a strategy similar to what Spark does:
>>>>>> what external Iceberg catalogs are available and their configuration should
>>>>>> come from the HiveConf.
>>>>>>
>>>>>> The second way Hive is strange is that Hive needs to use its own
>>>>>> MetaStore to track Hive table concerns. The MetaStore may have tables
>>>>>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>>>>>> tables from other Iceberg catalogs by creating table entries for them.
>>>>>>
>>>>>> Here’s how I think Hive should work:
>>>>>>
>>>>>>    - There should be a default HiveCatalog that uses the current
>>>>>>    MetaStore URI to be used for HiveCatalog tables tracked in the MetaStore
>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>    - HMS table properties should be used to determine how to load a
>>>>>>    table: using a Hadoop location, using the default metastore catalog, or
>>>>>>    using an external Iceberg catalog
>>>>>>       - If there is a metadata_location, then use the HiveCatalog
>>>>>>       for this metastore (where it is tracked)
>>>>>>       - If there is a catalog property, then load that catalog and
>>>>>>       use it to load the table identifier, or maybe an identifier from HMS table
>>>>>>       properties
>>>>>>       - If there is no catalog or metadata_location, then use
>>>>>>       HadoopTables to load the table location as an Iceberg table
>>>>>>
>>>>>> This would make it possible to access all types of Iceberg tables in
>>>>>> the same query, and would match how Spark and Flink configure catalogs.
>>>>>> Other than the configuration above, I don’t think that config in HMS should
>>>>>> be used at all, like how the other engines work. Iceberg is the source of
>>>>>> truth for table metadata, HMS stores how to load the Iceberg table, and
>>>>>> HiveConf defines the catalogs (or runtime overrides).
>>>>>>
>>>>>> This isn’t quite how configuration works right now. Currently, the
>>>>>> catalog is controlled by a HiveConf property, iceberg.mr.catalog. If
>>>>>> that isn’t set, HadoopTables will be used to load table locations. If it is
>>>>>> set, then that catalog will be used to load all tables by name. This makes
>>>>>> it impossible to load tables from different catalogs at the same time.
>>>>>> That’s why I think the Iceberg catalog for a table should be stored in HMS
>>>>>> table properties.
>>>>>>
>>>>>> I should also explain iceberg.hive.engine.enabled flag, but I think
>>>>>> this is long enough for now.
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I would like to start a discussion, how should we handle properties
>>>>>>> from various sources like Iceberg, Hive or global configuration. I've put
>>>>>>> together a short document
>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>> please have a look and let me know what you think.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Laszlo
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>

Re: Iceberg/Hive properties handling

Posted by Peter Vary <pv...@cloudera.com.INVALID>.
Hi,

Based on the discussion below I understand we have the following kinds of properties:
- Iceberg table properties - engine-independent, storage-related parameters.
- "How to get to" properties - these are mostly Hive-specific table properties, since for Spark the catalog configuration serves the same purpose. I think the best place for storing them would be the Hive SERDEPROPERTIES, as they describe the access information for the SerDe. Side note: we should decide whether we allow HiveCatalogs pointing to a different HMS; the 'iceberg.table_identifier' property only makes sense if we allow multiple catalogs.
- Query-specific properties - these are engine-specific and may map to, or even override, the Iceberg table properties on the engine-specific code paths, but currently they have independent names and are mapped on a case-by-case basis.

Based on this:
- Shall we move the "how to get to" properties to SERDEPROPERTIES?
- Shall we define a prefix for setting Iceberg table properties from Hive queries, omitting other engine-specific properties? (See the sketch below.)
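
To make the prefix question concrete, here is a minimal Java sketch of how such a mapping could work. The prefix value, the class, and the method are assumptions made up for illustration; they are not existing Iceberg or Hive identifiers.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;

    // Minimal sketch of the prefix idea; the prefix value below is hypothetical.
    public class TablePropertyFilter {
      private static final String ICEBERG_PREFIX = "iceberg.table-property.";

      // Returns only the properties carrying the prefix, with the prefix stripped,
      // so they could be applied as Iceberg table properties; everything else
      // (engine-specific or "how to get to" settings) is left out.
      public static Map<String, String> icebergTableProperties(Properties hmsProps) {
        Map<String, String> result = new HashMap<>();
        for (String key : hmsProps.stringPropertyNames()) {
          if (key.startsWith(ICEBERG_PREFIX)) {
            result.put(key.substring(ICEBERG_PREFIX.length()), hmsProps.getProperty(key));
          }
        }
        return result;
      }
    }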

Thanks,
Peter


> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com> wrote:
> 
> I like these suggestions, comments inline below on the last round...
> 
> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <boroknagyz@apache.org> wrote:
> Hi,
> 
> The above aligns with what we did in Impala, i.e. we store information about table loading in HMS table properties. We are just a bit more explicit about which catalog to use.
> We have table property 'iceberg.catalog' to determine the catalog type, right now the supported values are 'hadoop.tables', 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be set based on the catalog type.
> 
> So, if the value of 'iceberg.catalog' is
> 
> I'm all for renaming this, having "mr" in the property name is confusing.
>  
> hadoop.tables
> the table location is used to load the table
> The only question I have is should we have this as the default? i.e. if you don't set a catalog it will assume it's HadoopTables and use the location? Or should we require this property to be here to be consistent and avoid any "magic"?
>  
> hadoop.catalog
> Required table property 'iceberg.catalog_location' specifies the location of the hadoop catalog in the file system
> Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as table identifier
> I like this as it would allow you to use a different database and table name in Hive as opposed to the Hadoop Catalog - at the moment they have to match. The only thing here is that I think Hive requires a table LOCATION to be set and it's then confusing as there are now two locations on the table. I'm not sure whether in the Hive storage handler or SerDe etc. we can get Hive to not require that and maybe even disallow it from being set. That would probably be best in conjunction with this. Another solution would be to not have the 'iceberg.catalog_location' property but instead use the table LOCATION for this but that's a bit confusing from a Hive point of view.
> 
> hive.catalog
> Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as table identifier
> We have the assumption that the current Hive metastore stores the table, i.e. we don't support external Hive metastores currently
> These sound fine for Hive catalog tables that are created outside of the automatic Hive table creation (see https://iceberg.apache.org/hive/ -> Using Hive Catalog); we'd just need to document how you can create these yourself and that one could use a different Hive database and table etc.
>  
> Independent of catalog implementations, but we also have table property 'iceberg.file_format' to specify the file format for the data files.
> 
> OK, I don't think we need that for Hive?
>  
> We haven't released it yet, so we are open to changes, but I think these properties are reasonable and it would be great if we could standardize the properties across engines that use HMS as the primary metastore of tables.
> 
> 
> If others agree I think we should create an issue where we document the above changes so it's very clear what we're doing and can then go and implement them and update the docs etc.
>  
> Cheers,
>     Zoltan
> 
> 
> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
> Yes, I think that is a good summary of the principles.
> 
> #4 is correct because we provide some information that is informational (Hive schema) or tracked only by the metastore (best-effort current user). I also agree that it would be good to have a table identifier in HMS table metadata when loading from an external table. That gives us a way to handle name conflicts.
> 
> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacques@dremio.com> wrote:
> Minor error, my last example should have been:
> 
> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
> 
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
> 
> 
> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacques@dremio.com> wrote:
> I agree with Ryan on the core principles here. As I understand them:
> Iceberg metadata describes all properties of a table
> Hive table properties describe "how to get to" Iceberg metadata (which catalog + possibly ptr, path, token, etc)
> There could be default "how to get to" information set at a global level
> Best-effort schema should be stored in the table properties in HMS. This should be done for information schema retrieval purposes within Hive but should be ignored during Hive/other tool execution.
> Is that a fair summary of your statements Ryan (except 4, which I just added)?
> 
> One comment I have on #2 is that for different catalogs and use cases, I think it can be somewhat more complex where it would be desirable for a table that initially existed without Hive that was later exposed in Hive to support a ptr/path/token for how the table is named externally. For example, in a Nessie context we support arbitrary paths for an Iceberg table (such as folder1.folder2.folder3.table1). If you then want to expose that table to Hive, you might have this mapping for #2
> 
> db1.table1 => nessie:folder1.folder2.folder3.table1
> 
> Similarly, you might want to expose a particular branch version of a table. So it might say:
> 
> db1.table1_etl_branch => nessie.folder1@etl_branch
> 
> Just saying that the address to the table in the catalog could itself have several properties. The key being that no matter what those are, we should follow #1 and only store properties that are about the ptr, not the content/metadata.
> 
> Lastly, I believe #4 is the case but haven't tested it. Can someone confirm that it is true? And that it is possible/not problematic?
> 
> 
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
> 
> 
> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
> Thanks for working on this, Laszlo. I’ve been thinking about these problems as well, so this is a good time to have a discussion about Hive config.
> 
> I think that Hive configuration should work mostly like other engines, where different configurations are used for different purposes. Different purposes means that there is not a global configuration priority. Hopefully, I can explain how we use the different config sources elsewhere to clarify.
> 
> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop Configuration, but it also has its own global configuration. There are also Iceberg table properties, and all of the various Hive properties if you’re tracking tables with a Hive MetaStore.
> 
> The first step is to simplify where we can, so we effectively eliminate 2 sources of config:
> 
> The Hadoop Configuration is only used to instantiate Hadoop classes, like FileSystem. Iceberg should not use it for any other config.
> Config in the Hive MetaStore is only used to identify that a table is Iceberg and point to its metadata location. All other config in HMS is informational. For example, the input format is FileInputFormat so that non-Iceberg readers cannot actually instantiate the format (it’s abstract) but it is available so they also don’t fail trying to load the class. Table-specific config should not be stored in table or serde properties.
> That leaves Spark configuration and Iceberg table configuration.
> 
> Iceberg differs from other tables because it is opinionated: data configuration should be maintained at the table level. This is cleaner for users because config is standardized across engines and in one place. And it also enables services that analyze a table and update its configuration to tune options that users almost never do, like row group or stripe size in the columnar formats. Iceberg table configuration is used to configure table-specific concerns and behavior.
> 
> Spark configuration is used for engine-specific concerns, and runtime overrides. A good example of an engine-specific concern is the catalogs that are available to load Iceberg tables. Spark has a way to load and configure catalog implementations and Iceberg uses that for all catalog-level config. Runtime overrides are things like target split size. Iceberg has a table-level default split size in table properties, but this can be overridden by a Spark option for each table, as well as an option passed to the individual read. Note that these necessarily have different config names for how they are used: Iceberg uses read.split.target-size and the read-specific option is target-size.
> 
> Applying this to Hive is a little strange for a couple reasons. First, Hive’s engine configuration is a Hadoop Configuration. As a result, I think the right place to store engine-specific config is there, including Iceberg catalogs using a strategy similar to what Spark does: what external Iceberg catalogs are available and their configuration should come from the HiveConf.
> 
> The second way Hive is strange is that Hive needs to use its own MetaStore to track Hive table concerns. The MetaStore may have tables created by an Iceberg HiveCatalog, and Hive also needs to be able to load tables from other Iceberg catalogs by creating table entries for them.
> 
> Here’s how I think Hive should work:
> 
> There should be a default HiveCatalog that uses the current MetaStore URI to be used for HiveCatalog tables tracked in the MetaStore
> Other catalogs should be defined in HiveConf
> HMS table properties should be used to determine how to load a table: using a Hadoop location, using the default metastore catalog, or using an external Iceberg catalog
> If there is a metadata_location, then use the HiveCatalog for this metastore (where it is tracked)
> If there is a catalog property, then load that catalog and use it to load the table identifier, or maybe an identifier from HMS table properties
> If there is no catalog or metadata_location, then use HadoopTables to load the table location as an Iceberg table
> This would make it possible to access all types of Iceberg tables in the same query, and would match how Spark and Flink configure catalogs. Other than the configuration above, I don’t think that config in HMS should be used at all, like how the other engines work. Iceberg is the source of truth for table metadata, HMS stores how to load the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
> 
> This isn’t quite how configuration works right now. Currently, the catalog is controlled by a HiveConf property, iceberg.mr.catalog. If that isn’t set, HadoopTables will be used to load table locations. If it is set, then that catalog will be used to load all tables by name. This makes it impossible to load tables from different catalogs at the same time. That’s why I think the Iceberg catalog for a table should be stored in HMS table properties.
> 
> I should also explain iceberg.hive.engine.enabled flag, but I think this is long enough for now.
> 
> rb
> 
> 
> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lp...@cloudera.com.invalid> wrote:
> Hi All,
> 
> I would like to start a discussion, how should we handle properties from various sources like Iceberg, Hive or global configuration. I've put together a short document <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>, please have a look and let me know what you think.
> 
> Thanks,
> Laszlo
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: Iceberg/Hive properties handling

Posted by Mass Dosage <ma...@gmail.com>.
I like these suggestions, comments inline below on the last round...

On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <bo...@apache.org>
wrote:

> Hi,
>
> The above aligns with what we did in Impala, i.e. we store information
> about table loading in HMS table properties. We are just a bit more
> explicit about which catalog to use.
> We have table property 'iceberg.catalog' to determine the catalog type,
> right now the supported values are 'hadoop.tables', 'hadoop.catalog', and
> 'hive.catalog'. Additional table properties can be set based on the catalog
> type.
>
> So, if the value of 'iceberg.catalog' is
>

I'm all for renaming this, having "mr" in the property name is confusing.


>
>    - hadoop.tables
>       - the table location is used to load the table
>
> The only question I have is should we have this as the default? i.e. if
you don't set a catalog it will assume it's HadoopTables and use the
location? Or should we require this property to be here to be consistent
and avoid any "magic"?


>
>    - hadoop.catalog
>       - Required table property 'iceberg.catalog_location' specifies the
>       location of the hadoop catalog in the file system
>       - Optional table property 'iceberg.table_identifier' specifies the
>       table id. If it's not set, then <database_name>.<table_name> is used as
>       table identifier
>
> I like this as it would allow you to use a different database and table
name in Hive as opposed to the Hadoop Catalog - at the moment they have to
match. The only thing here is that I think Hive requires a table LOCATION
to be set and it's then confusing as there are now two locations on the
table. I'm not sure whether in the Hive storage handler or SerDe etc. we
can get Hive to not require that and maybe even disallow it from being set.
That would probably be best in conjunction with this. Another solution
would be to not have the 'iceberg.catalog_location' property but instead
use the table LOCATION for this but that's a bit confusing from a Hive
point of view.


>    - hive.catalog
>       - Optional table property 'iceberg.table_identifier' specifies the
>       table id. If it's not set, then <database_name>.<table_name> is used as
>       table identifier
>       - We have the assumption that the current Hive metastore stores the
>       table, i.e. we don't support external Hive metastores currently
>
> These sound fine for Hive catalog tables that are created outside of the
automatic Hive table creation (see https://iceberg.apache.org/hive/ ->
Using Hive Catalog); we'd just need to document how you can create these
yourself and that one could use a different Hive database and table etc.


> Independent of catalog implementations, but we also have table property
> 'iceberg.file_format' to specify the file format for the data files.
>

OK, I don't think we need that for Hive?


> We haven't released it yet, so we are open to changes, but I think these
> properties are reasonable and it would be great if we could standardize the
> properties across engines that use HMS as the primary metastore of tables.
>
>
If others agree I think we should create an issue where we document the
above changes so it's very clear what we're doing and can then go and
implement them and update the docs etc.


> Cheers,
>     Zoltan
>
>
> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Yes, I think that is a good summary of the principles.
>>
>> #4 is correct because we provide some information that is informational
>> (Hive schema) or tracked only by the metastore (best-effort current user).
>> I also agree that it would be good to have a table identifier in HMS table
>> metadata when loading from an external table. That gives us a way to handle
>> name conflicts.
>>
>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <ja...@dremio.com>
>> wrote:
>>
>>> Minor error, my last example should have been:
>>>
>>> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>>>
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>>
>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <ja...@dremio.com>
>>> wrote:
>>>
>>>> I agree with Ryan on the core principles here. As I understand them:
>>>>
>>>>    1. Iceberg metadata describes all properties of a table
>>>>    2. Hive table properties describe "how to get to" Iceberg metadata
>>>>    (which catalog + possibly ptr, path, token, etc)
>>>>    3. There could be default "how to get to" information set at a
>>>>    global level
>>>>    4. Best-effort schema should be stored in the table properties in
>>>>    HMS. This should be done for information schema retrieval purposes within
>>>>    Hive but should be ignored during Hive/other tool execution.
>>>>
>>>> Is that a fair summary of your statements Ryan (except 4, which I just
>>>> added)?
>>>>
>>>> One comment I have on #2 is that for different catalogs and use cases,
>>>> I think it can be somewhat more complex where it would be desirable for a
>>>> table that initially existed without Hive that was later exposed in Hive to
>>>> support a ptr/path/token for how the table is named externally. For
>>>> example, in a Nessie context we support arbitrary paths for an Iceberg
>>>> table (such as folder1.folder2.folder3.table1). If you then want to expose
>>>> that table to Hive, you might have this mapping for #2
>>>>
>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>
>>>> Similarly, you might want to expose a particular branch version of a
>>>> table. So it might say:
>>>>
>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>
>>>> Just saying that the address to the table in the catalog could itself
>>>> have several properties. The key being that no matter what those are, we
>>>> should follow #1 and only store properties that are about the ptr, not the
>>>> content/metadata.
>>>>
>>>> Lastly, I believe #4 is the case but haven't tested it. Can someone
>>>> confirm that it is true? And that it is possible/not problematic?
>>>>
>>>>
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
>>>>
>>>>
>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> Thanks for working on this, Laszlo. I’ve been thinking about these
>>>>> problems as well, so this is a good time to have a discussion about Hive
>>>>> config.
>>>>>
>>>>> I think that Hive configuration should work mostly like other engines,
>>>>> where different configurations are used for different purposes. Different
>>>>> purposes means that there is not a global configuration priority.
>>>>> Hopefully, I can explain how we use the different config sources elsewhere
>>>>> to clarify.
>>>>>
>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop
>>>>> Configuration, but it also has its own global configuration. There are also
>>>>> Iceberg table properties, and all of the various Hive properties if you’re
>>>>> tracking tables with a Hive MetaStore.
>>>>>
>>>>> The first step is to simplify where we can, so we effectively
>>>>> eliminate 2 sources of config:
>>>>>
>>>>>    - The Hadoop Configuration is only used to instantiate Hadoop
>>>>>    classes, like FileSystem. Iceberg should not use it for any other config.
>>>>>    - Config in the Hive MetaStore is only used to identify that a
>>>>>    table is Iceberg and point to its metadata location. All other config in
>>>>>    HMS is informational. For example, the input format is FileInputFormat so
>>>>>    that non-Iceberg readers cannot actually instantiate the format (it’s
>>>>>    abstract) but it is available so they also don’t fail trying to load the
>>>>>    class. Table-specific config should not be stored in table or serde
>>>>>    properties.
>>>>>
>>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>>
>>>>> Iceberg differs from other tables because it is opinionated: data
>>>>> configuration should be maintained at the table level. This is cleaner for
>>>>> users because config is standardized across engines and in one place. And
>>>>> it also enables services that analyze a table and update its configuration
>>>>> to tune options that users almost never do, like row group or stripe size
>>>>> in the columnar formats. Iceberg table configuration is used to configure
>>>>> table-specific concerns and behavior.
>>>>>
>>>>> Spark configuration is used for engine-specific concerns, and runtime
>>>>> overrides. A good example of an engine-specific concern is the catalogs
>>>>> that are available to load Iceberg tables. Spark has a way to load and
>>>>> configure catalog implementations and Iceberg uses that for all
>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>> passed to the individual read. Note that these necessarily have different
>>>>> config names for how they are used: Iceberg uses
>>>>> read.split.target-size and the read-specific option is target-size.
>>>>>
>>>>> Applying this to Hive is a little strange for a couple reasons. First,
>>>>> Hive’s engine configuration *is* a Hadoop Configuration. As a result,
>>>>> I think the right place to store engine-specific config is there, including
>>>>> Iceberg catalogs using a strategy similar to what Spark does: what external
>>>>> Iceberg catalogs are available and their configuration should come from the
>>>>> HiveConf.
>>>>>
>>>>> The second way Hive is strange is that Hive needs to use its own
>>>>> MetaStore to track Hive table concerns. The MetaStore may have tables
>>>>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>>>>> tables from other Iceberg catalogs by creating table entries for them.
>>>>>
>>>>> Here’s how I think Hive should work:
>>>>>
>>>>>    - There should be a default HiveCatalog that uses the current
>>>>>    MetaStore URI to be used for HiveCatalog tables tracked in the MetaStore
>>>>>    - Other catalogs should be defined in HiveConf
>>>>>    - HMS table properties should be used to determine how to load a
>>>>>    table: using a Hadoop location, using the default metastore catalog, or
>>>>>    using an external Iceberg catalog
>>>>>       - If there is a metadata_location, then use the HiveCatalog for
>>>>>       this metastore (where it is tracked)
>>>>>       - If there is a catalog property, then load that catalog and
>>>>>       use it to load the table identifier, or maybe an identifier from HMS table
>>>>>       properties
>>>>>       - If there is no catalog or metadata_location, then use
>>>>>       HadoopTables to load the table location as an Iceberg table
>>>>>
>>>>> This would make it possible to access all types of Iceberg tables in
>>>>> the same query, and would match how Spark and Flink configure catalogs.
>>>>> Other than the configuration above, I don’t think that config in HMS should
>>>>> be used at all, like how the other engines work. Iceberg is the source of
>>>>> truth for table metadata, HMS stores how to load the Iceberg table, and
>>>>> HiveConf defines the catalogs (or runtime overrides).
>>>>>
>>>>> This isn’t quite how configuration works right now. Currently, the
>>>>> catalog is controlled by a HiveConf property, iceberg.mr.catalog. If
>>>>> that isn’t set, HadoopTables will be used to load table locations. If it is
>>>>> set, then that catalog will be used to load all tables by name. This makes
>>>>> it impossible to load tables from different catalogs at the same time.
>>>>> That’s why I think the Iceberg catalog for a table should be stored in HMS
>>>>> table properties.
>>>>>
>>>>> I should also explain iceberg.hive.engine.enabled flag, but I think
>>>>> this is long enough for now.
>>>>>
>>>>> rb
>>>>>
>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter
>>>>> <lp...@cloudera.com.invalid> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I would like to start a discussion, how should we handle properties
>>>>>> from various sources like Iceberg, Hive or global configuration. I've put
>>>>>> together a short document
>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>> please have a look and let me know what you think.
>>>>>>
>>>>>> Thanks,
>>>>>> Laszlo
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

Re: Iceberg/Hive properties handling

Posted by Zoltán Borók-Nagy <bo...@apache.org>.
Hi,

The above aligns with what we did in Impala, i.e. we store information
about table loading in HMS table properties. We are just a bit more
explicit about which catalog to use.
We have table property 'iceberg.catalog' to determine the catalog type,
right now the supported values are 'hadoop.tables', 'hadoop.catalog', and
'hive.catalog'. Additional table properties can be set based on the catalog
type.

So, if the value of 'iceberg.catalog' is

   - hadoop.tables
      - the table location is used to load the table
   - hadoop.catalog
      - Required table property 'iceberg.catalog_location' specifies the
      location of the hadoop catalog in the file system
      - Optional table property 'iceberg.table_identifier' specifies the
      table id. If it's not set, then <database_name>.<table_name> is used as
      table identifier
   - hive.catalog
      - Optional table property 'iceberg.table_identifier' specifies the
      table id. If it's not set, then <database_name>.<table_name> is used as
      table identifier
      - We have the assumption that the current Hive metastore stores the
      table, i.e. we don't support external Hive metastores currently

Independent of the catalog implementation, we also have the table property
'iceberg.file_format' to specify the file format for the data files.
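
As a rough illustration of how the 'iceberg.catalog' convention above maps to table loading calls, here is a Java sketch. The property names are the ones listed in this mail; the HadoopTables, HadoopCatalog and HiveCatalog constructors are the ones available in the Iceberg releases around this thread, and the snippet is only an approximation of the idea, not Impala's actual implementation.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.hadoop.HadoopCatalog;
    import org.apache.iceberg.hadoop.HadoopTables;
    import org.apache.iceberg.hive.HiveCatalog;

    // Sketch of the Impala-style convention; the argument values would come from HMS table properties.
    public class ImpalaStyleTableLoader {
      public static Table load(Configuration conf,
                               String catalogType,       // value of 'iceberg.catalog'
                               String tableLocation,     // Hive table LOCATION
                               String catalogLocation,   // 'iceberg.catalog_location'
                               String tableIdentifier) { // 'iceberg.table_identifier' or <db>.<table>
        switch (catalogType) {
          case "hadoop.tables":
            // The table location is used to load the table.
            return new HadoopTables(conf).load(tableLocation);
          case "hadoop.catalog":
            // The catalog location is required; the identifier defaults to <db>.<table>.
            return new HadoopCatalog(conf, catalogLocation)
                .loadTable(TableIdentifier.parse(tableIdentifier));
          case "hive.catalog":
            // The current metastore (taken from the Configuration) stores the table.
            return new HiveCatalog(conf).loadTable(TableIdentifier.parse(tableIdentifier));
          default:
            throw new IllegalArgumentException("Unknown iceberg.catalog value: " + catalogType);
        }
      }
    }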

We haven't released it yet, so we are open to changes, but I think these
properties are reasonable and it would be great if we could standardize the
properties across engines that use HMS as the primary metastore of tables.

Cheers,
    Zoltan


On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Yes, I think that is a good summary of the principles.
>
> #4 is correct because we provide some information that is informational
> (Hive schema) or tracked only by the metastore (best-effort current user).
> I also agree that it would be good to have a table identifier in HMS table
> metadata when loading from an external table. That gives us a way to handle
> name conflicts.
>
> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <ja...@dremio.com> wrote:
>
>> Minor error, my last example should have been:
>>
>> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>
>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <ja...@dremio.com>
>> wrote:
>>
>>> I agree with Ryan on the core principles here. As I understand them:
>>>
>>>    1. Iceberg metadata describes all properties of a table
>>>    2. Hive table properties describe "how to get to" Iceberg metadata
>>>    (which catalog + possibly ptr, path, token, etc)
>>>    3. There could be default "how to get to" information set at a
>>>    global level
>>>    4. Best-effort schema should be stored in the table properties in
>>>    HMS. This should be done for information schema retrieval purposes within
>>>    Hive but should be ignored during Hive/other tool execution.
>>>
>>> Is that a fair summary of your statements Ryan (except 4, which I just
>>> added)?
>>>
>>> One comment I have on #2 is that for different catalogs and use cases, I
>>> think it can be somewhat more complex where it would be desirable for a
>>> table that initially existed without Hive that was later exposed in Hive to
>>> support a ptr/path/token for how the table is named externally. For
>>> example, in a Nessie context we support arbitrary paths for an Iceberg
>>> table (such as folder1.folder2.folder3.table1). If you then want to expose
>>> that table to Hive, you might have this mapping for #2
>>>
>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>
>>> Similarly, you might want to expose a particular branch version of a
>>> table. So it might say:
>>>
>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>
>>> Just saying that the address to the table in the catalog could itself
>>> have several properties. The key being that no matter what those are, we
>>> should follow #1 and only store properties that are about the ptr, not the
>>> content/metadata.
>>>
>>> Lastly, I believe #4 is the case but haven't tested it. Can someone
>>> confirm that it is true? And that it is possible/not problematic?
>>>
>>>
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>>
>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> Thanks for working on this, Laszlo. I’ve been thinking about these
>>>> problems as well, so this is a good time to have a discussion about Hive
>>>> config.
>>>>
>>>> I think that Hive configuration should work mostly like other engines,
>>>> where different configurations are used for different purposes. Different
>>>> purposes means that there is not a global configuration priority.
>>>> Hopefully, I can explain how we use the different config sources elsewhere
>>>> to clarify.
>>>>
>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop
>>>> Configuration, but it also has its own global configuration. There are also
>>>> Iceberg table properties, and all of the various Hive properties if you’re
>>>> tracking tables with a Hive MetaStore.
>>>>
>>>> The first step is to simplify where we can, so we effectively eliminate
>>>> 2 sources of config:
>>>>
>>>>    - The Hadoop Configuration is only used to instantiate Hadoop
>>>>    classes, like FileSystem. Iceberg should not use it for any other config.
>>>>    - Config in the Hive MetaStore is only used to identify that a
>>>>    table is Iceberg and point to its metadata location. All other config in
>>>>    HMS is informational. For example, the input format is FileInputFormat so
>>>>    that non-Iceberg readers cannot actually instantiate the format (it’s
>>>>    abstract) but it is available so they also don’t fail trying to load the
>>>>    class. Table-specific config should not be stored in table or serde
>>>>    properties.
>>>>
>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>
>>>> Iceberg differs from other tables because it is opinionated: data
>>>> configuration should be maintained at the table level. This is cleaner for
>>>> users because config is standardized across engines and in one place. And
>>>> it also enables services that analyze a table and update its configuration
>>>> to tune options that users almost never do, like row group or stripe size
>>>> in the columnar formats. Iceberg table configuration is used to configure
>>>> table-specific concerns and behavior.
>>>>
>>>> Spark configuration is used for engine-specific concerns, and runtime
>>>> overrides. A good example of an engine-specific concern is the catalogs
>>>> that are available to load Iceberg tables. Spark has a way to load and
>>>> configure catalog implementations and Iceberg uses that for all
>>>> catalog-level config. Runtime overrides are things like target split size.
>>>> Iceberg has a table-level default split size in table properties, but this
>>>> can be overridden by a Spark option for each table, as well as an option
>>>> passed to the individual read. Note that these necessarily have different
>>>> config names for how they are used: Iceberg uses read.split.target-size
>>>> and the read-specific option is target-size.
>>>>
>>>> Applying this to Hive is a little strange for a couple reasons. First,
>>>> Hive’s engine configuration *is* a Hadoop Configuration. As a result,
>>>> I think the right place to store engine-specific config is there, including
>>>> Iceberg catalogs using a strategy similar to what Spark does: what external
>>>> Iceberg catalogs are available and their configuration should come from the
>>>> HiveConf.
>>>>
>>>> The second way Hive is strange is that Hive needs to use its own
>>>> MetaStore to track Hive table concerns. The MetaStore may have tables
>>>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>>>> tables from other Iceberg catalogs by creating table entries for them.
>>>>
>>>> Here’s how I think Hive should work:
>>>>
>>>>    - There should be a default HiveCatalog that uses the current
>>>>    MetaStore URI to be used for HiveCatalog tables tracked in the MetaStore
>>>>    - Other catalogs should be defined in HiveConf
>>>>    - HMS table properties should be used to determine how to load a
>>>>    table: using a Hadoop location, using the default metastore catalog, or
>>>>    using an external Iceberg catalog
>>>>       - If there is a metadata_location, then use the HiveCatalog for
>>>>       this metastore (where it is tracked)
>>>>       - If there is a catalog property, then load that catalog and use
>>>>       it to load the table identifier, or maybe an identifier from HMS table
>>>>       properties
>>>>       - If there is no catalog or metadata_location, then use
>>>>       HadoopTables to load the table location as an Iceberg table
>>>>
>>>> This would make it possible to access all types of Iceberg tables in
>>>> the same query, and would match how Spark and Flink configure catalogs.
>>>> Other than the configuration above, I don’t think that config in HMS should
>>>> be used at all, like how the other engines work. Iceberg is the source of
>>>> truth for table metadata, HMS stores how to load the Iceberg table, and
>>>> HiveConf defines the catalogs (or runtime overrides).
>>>>
>>>> This isn’t quite how configuration works right now. Currently, the
>>>> catalog is controlled by a HiveConf property, iceberg.mr.catalog. If
>>>> that isn’t set, HadoopTables will be used to load table locations. If it is
>>>> set, then that catalog will be used to load all tables by name. This makes
>>>> it impossible to load tables from different catalogs at the same time.
>>>> That’s why I think the Iceberg catalog for a table should be stored in HMS
>>>> table properties.
>>>>
>>>> I should also explain iceberg.hive.engine.enabled flag, but I think
>>>> this is long enough for now.
>>>>
>>>> rb
>>>>
>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter
>>>> <lp...@cloudera.com.invalid> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I would like to start a discussion, how should we handle properties
>>>>> from various sources like Iceberg, Hive or global configuration. I've put
>>>>> together a short document
>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>> please have a look and let me know what you think.
>>>>>
>>>>> Thanks,
>>>>> Laszlo
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Iceberg/Hive properties handling

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Yes, I think that is a good summary of the principles.

#4 is correct because we provide some information that is informational
(Hive schema) or tracked only by the metastore (best-effort current user).
I also agree that it would be good to have a table identifier in HMS table
metadata when loading from an external table. That gives us a way to handle
name conflicts.

On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <ja...@dremio.com> wrote:

> Minor error, my last example should have been:
>
> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
>
> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <ja...@dremio.com> wrote:
>
>> I agree with Ryan on the core principles here. As I understand them:
>>
>>    1. Iceberg metadata describes all properties of a table
>>    2. Hive table properties describe "how to get to" Iceberg metadata
>>    (which catalog + possibly ptr, path, token, etc)
>>    3. There could be default "how to get to" information set at a global
>>    level
>>    4. Best-effort schema should be stored in the table properties in
>>    HMS. This should be done for information schema retrieval purposes within
>>    Hive but should be ignored during Hive/other tool execution.
>>
>> Is that a fair summary of your statements Ryan (except 4, which I just
>> added)?
>>
>> One comment I have on #2 is that for different catalogs and use cases, I
>> think it can be somewhat more complex where it would be desirable for a
>> table that initially existed without Hive that was later exposed in Hive to
>> support a ptr/path/token for how the table is named externally. For
>> example, in a Nessie context we support arbitrary paths for an Iceberg
>> table (such as folder1.folder2.folder3.table1). If you then want to expose
>> that table to Hive, you might have this mapping for #2
>>
>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>
>> Similarly, you might want to expose a particular branch version of a
>> table. So it might say:
>>
>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>
>> Just saying that the address to the table in the catalog could itself
>> have several properties. The key being that no matter what those are, we
>> should follow #1 and only store properties that are about the ptr, not the
>> content/metadata.
>>
>> Lastly, I believe #4 is the case but haven't tested it. Can someone
>> confirm that it is true? And that it is possible/not problematic?
>>
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>
>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Thanks for working on this, Laszlo. I’ve been thinking about these
>>> problems as well, so this is a good time to have a discussion about Hive
>>> config.
>>>
>>> I think that Hive configuration should work mostly like other engines,
>>> where different configurations are used for different purposes. Different
>>> purposes means that there is not a global configuration priority.
>>> Hopefully, I can explain how we use the different config sources elsewhere
>>> to clarify.
>>>
>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop
>>> Configuration, but it also has its own global configuration. There are also
>>> Iceberg table properties, and all of the various Hive properties if you’re
>>> tracking tables with a Hive MetaStore.
>>>
>>> The first step is to simplify where we can, so we effectively eliminate
>>> 2 sources of config:
>>>
>>>    - The Hadoop Configuration is only used to instantiate Hadoop
>>>    classes, like FileSystem. Iceberg should not use it for any other config.
>>>    - Config in the Hive MetaStore is only used to identify that a table
>>>    is Iceberg and point to its metadata location. All other config in HMS is
>>>    informational. For example, the input format is FileInputFormat so that
>>>    non-Iceberg readers cannot actually instantiate the format (it’s abstract)
>>>    but it is available so they also don’t fail trying to load the class.
>>>    Table-specific config should not be stored in table or serde properties.
>>>
>>> That leaves Spark configuration and Iceberg table configuration.
>>>
>>> Iceberg differs from other tables because it is opinionated: data
>>> configuration should be maintained at the table level. This is cleaner for
>>> users because config is standardized across engines and in one place. And
>>> it also enables services that analyze a table and update its configuration
>>> to tune options that users almost never do, like row group or stripe size
>>> in the columnar formats. Iceberg table configuration is used to configure
>>> table-specific concerns and behavior.
>>>
>>> Spark configuration is used for engine-specific concerns, and runtime
>>> overrides. A good example of an engine-specific concern is the catalogs
>>> that are available to load Iceberg tables. Spark has a way to load and
>>> configure catalog implementations and Iceberg uses that for all
>>> catalog-level config. Runtime overrides are things like target split size.
>>> Iceberg has a table-level default split size in table properties, but this
>>> can be overridden by a Spark option for each table, as well as an option
>>> passed to the individual read. Note that these necessarily have different
>>> config names for how they are used: Iceberg uses read.split.target-size
>>> and the read-specific option is target-size.
>>>
>>> Applying this to Hive is a little strange for a couple reasons. First,
>>> Hive’s engine configuration *is* a Hadoop Configuration. As a result, I
>>> think the right place to store engine-specific config is there, including
>>> Iceberg catalogs using a strategy similar to what Spark does: what external
>>> Iceberg catalogs are available and their configuration should come from the
>>> HiveConf.
>>>
>>> The second way Hive is strange is that Hive needs to use its own
>>> MetaStore to track Hive table concerns. The MetaStore may have tables
>>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>>> tables from other Iceberg catalogs by creating table entries for them.
>>>
>>> Here’s how I think Hive should work:
>>>
>>>    - There should be a default HiveCatalog that uses the current
>>>    MetaStore URI to be used for HiveCatalog tables tracked in the MetaStore
>>>    - Other catalogs should be defined in HiveConf
>>>    - HMS table properties should be used to determine how to load a
>>>    table: using a Hadoop location, using the default metastore catalog, or
>>>    using an external Iceberg catalog
>>>       - If there is a metadata_location, then use the HiveCatalog for
>>>       this metastore (where it is tracked)
>>>       - If there is a catalog property, then load that catalog and use
>>>       it to load the table identifier, or maybe an identifier from HMS table
>>>       properties
>>>       - If there is no catalog or metadata_location, then use
>>>       HadoopTables to load the table location as an Iceberg table
>>>
>>> This would make it possible to access all types of Iceberg tables in the
>>> same query, and would match how Spark and Flink configure catalogs. Other
>>> than the configuration above, I don’t think that config in HMS should be
>>> used at all, like how the other engines work. Iceberg is the source of
>>> truth for table metadata, HMS stores how to load the Iceberg table, and
>>> HiveConf defines the catalogs (or runtime overrides).
>>>
>>> This isn’t quite how configuration works right now. Currently, the
>>> catalog is controlled by a HiveConf property, iceberg.mr.catalog. If
>>> that isn’t set, HadoopTables will be used to load table locations. If it is
>>> set, then that catalog will be used to load all tables by name. This makes
>>> it impossible to load tables from different catalogs at the same time.
>>> That’s why I think the Iceberg catalog for a table should be stored in HMS
>>> table properties.
>>>
>>> I should also explain iceberg.hive.engine.enabled flag, but I think
>>> this is long enough for now.
>>>
>>> rb
>>>
>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter
>>> <lp...@cloudera.com.invalid> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I would like to start a discussion, how should we handle properties
>>>> from various sources like Iceberg, Hive or global configuration. I've put
>>>> together a short document
>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>> please have a look and let me know what you think.
>>>>
>>>> Thanks,
>>>> Laszlo
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>

-- 
Ryan Blue
Software Engineer
Netflix
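
A rough sketch of the "other catalogs should be defined in HiveConf" point quoted above: the iceberg.catalog.<name>.* key layout below is an assumption for illustration only, and while CatalogUtil.loadCatalog is an existing Iceberg call, the exact configuration scheme was still under discussion on this thread.

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.CatalogUtil;
    import org.apache.iceberg.catalog.Catalog;

    // Sketch: gather iceberg.catalog.<name>.* entries from HiveConf and load that catalog.
    public class HiveConfCatalogs {
      public static Catalog load(Configuration conf, String catalogName) {
        String prefix = "iceberg.catalog." + catalogName + ".";
        Map<String, String> props = new HashMap<>();
        for (Map.Entry<String, String> entry : conf) {
          if (entry.getKey().startsWith(prefix)) {
            props.put(entry.getKey().substring(prefix.length()), entry.getValue());
          }
        }
        // 'catalog-impl' would name the Catalog class to instantiate, e.g. a HadoopCatalog.
        String impl = props.remove("catalog-impl");
        return CatalogUtil.loadCatalog(impl, catalogName, props, conf);
      }
    }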

Re: Iceberg/Hive properties handling

Posted by Jacques Nadeau <ja...@dremio.com>.
Minor error, my last example should have been:

db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch

--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <ja...@dremio.com> wrote:

> I agree with Ryan on the core principles here. As I understand them:
>
>    1. Iceberg metadata describes all properties of a table
>    2. Hive table properties describe "how to get to" Iceberg metadata
>    (which catalog + possibly ptr, path, token, etc)
>    3. There could be default "how to get to" information set at a global
>    level
>    4. Best-effort schema should be stored in the table properties in HMS.
>    This should be done for information schema retrieval purposes within Hive
>    but should be ignored during Hive/other tool execution.
>
> Is that a fair summary of your statements Ryan (except 4, which I just
> added)?
>
> One comment I have on #2 is that for different catalogs and use cases, I
> think it can be somewhat more complex where it would be desirable for a
> table that initially existed without Hive that was later exposed in Hive to
> support a ptr/path/token for how the table is named externally. For
> example, in a Nessie context we support arbitrary paths for an Iceberg
> table (such as folder1.folder2.folder3.table1). If you then want to expose
> that table to Hive, you might have this mapping for #2
>
> db1.table1 => nessie:folder1.folder2.folder3.table1
>
> Similarly, you might want to expose a particular branch version of a
> table. So it might say:
>
> db1.table1_etl_branch => nessie.folder1@etl_branch
>
> Just saying that the address to the table in the catalog could itself have
> several properties. The key being that no matter what those are, we should
> follow #1 and only store properties that are about the ptr, not the
> content/metadata.
>
> Lastly, I believe #4 is the case but haven't tested it. Can someone
> confirm that it is true? And that it is possible/not problematic?
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
>
> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Thanks for working on this, Laszlo. I’ve been thinking about these
>> problems as well, so this is a good time to have a discussion about Hive
>> config.
>>
>> I think that Hive configuration should work mostly like other engines,
>> where different configurations are used for different purposes. Different
>> purposes means that there is not a global configuration priority.
>> Hopefully, I can explain how we use the different config sources elsewhere
>> to clarify.
>>
>> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop
>> Configuration, but it also has its own global configuration. There are also
>> Iceberg table properties, and all of the various Hive properties if you’re
>> tracking tables with a Hive MetaStore.
>>
>> The first step is to simplify where we can, so we effectively eliminate 2
>> sources of config:
>>
>>    - The Hadoop Configuration is only used to instantiate Hadoop
>>    classes, like FileSystem. Iceberg should not use it for any other config.
>>    - Config in the Hive MetaStore is only used to identify that a table
>>    is Iceberg and point to its metadata location. All other config in HMS is
>>    informational. For example, the input format is FileInputFormat so that
>>    non-Iceberg readers cannot actually instantiate the format (it’s abstract)
>>    but it is available so they also don’t fail trying to load the class.
>>    Table-specific config should not be stored in table or serde properties.
>>
>> That leaves Spark configuration and Iceberg table configuration.
>>
>> Iceberg differs from other tables because it is opinionated: data
>> configuration should be maintained at the table level. This is cleaner for
>> users because config is standardized across engines and in one place. And
>> it also enables services that analyze a table and update its configuration
>> to tune options that users almost never do, like row group or stripe size
>> in the columnar formats. Iceberg table configuration is used to configure
>> table-specific concerns and behavior.
>>
>> Spark configuration is used for engine-specific concerns, and runtime
>> overrides. A good example of an engine-specific concern is the catalogs
>> that are available to load Iceberg tables. Spark has a way to load and
>> configure catalog implementations and Iceberg uses that for all
>> catalog-level config. Runtime overrides are things like target split size.
>> Iceberg has a table-level default split size in table properties, but this
>> can be overridden by a Spark option for each table, as well as an option
>> passed to the individual read. Note that these necessarily have different
>> config names for how they are used: Iceberg uses read.split.target-size
>> and the read-specific option is target-size.
>>
>> Applying this to Hive is a little strange for a couple reasons. First,
>> Hive’s engine configuration *is* a Hadoop Configuration. As a result, I
>> think the right place to store engine-specific config is there, including
>> Iceberg catalogs using a strategy similar to what Spark does: what external
>> Iceberg catalogs are available and their configuration should come from the
>> HiveConf.
>>
>> The second way Hive is strange is that Hive needs to use its own
>> MetaStore to track Hive table concerns. The MetaStore may have tables
>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>> tables from other Iceberg catalogs by creating table entries for them.
>>
>> Here’s how I think Hive should work:
>>
>>    - There should be a default HiveCatalog that uses the current
>>    MetaStore URI to be used for HiveCatalog tables tracked in the MetaStore
>>    - Other catalogs should be defined in HiveConf
>>    - HMS table properties should be used to determine how to load a
>>    table: using a Hadoop location, using the default metastore catalog, or
>>    using an external Iceberg catalog
>>       - If there is a metadata_location, then use the HiveCatalog for
>>       this metastore (where it is tracked)
>>       - If there is a catalog property, then load that catalog and use
>>       it to load the table identifier, or maybe an identifier from HMS table
>>       properties
>>       - If there is no catalog or metadata_location, then use
>>       HadoopTables to load the table location as an Iceberg table
>>
>> This would make it possible to access all types of Iceberg tables in the
>> same query, and would match how Spark and Flink configure catalogs. Other
>> than the configuration above, I don’t think that config in HMS should be
>> used at all, like how the other engines work. Iceberg is the source of
>> truth for table metadata, HMS stores how to load the Iceberg table, and
>> HiveConf defines the catalogs (or runtime overrides).
>>
>> This isn’t quite how configuration works right now. Currently, the
>> catalog is controlled by a HiveConf property, iceberg.mr.catalog. If
>> that isn’t set, HadoopTables will be used to load table locations. If it is
>> set, then that catalog will be used to load all tables by name. This makes
>> it impossible to load tables from different catalogs at the same time.
>> That’s why I think the Iceberg catalog for a table should be stored in HMS
>> table properties.
>>
>> I should also explain iceberg.hive.engine.enabled flag, but I think this
>> is long enough for now.
>>
>> rb
>>
>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter
>> <lp...@cloudera.com.invalid> wrote:
>>
>>> Hi All,
>>>
>>> I would like to start a discussion, how should we handle properties from
>>> various sources like Iceberg, Hive or global configuration. I've put
>>> together a short document
>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>> please have a look and let me know what you think.
>>>
>>> Thanks,
>>> Laszlo
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

Re: Iceberg/Hive properties handling

Posted by Jacques Nadeau <ja...@dremio.com>.
I agree with Ryan on the core principles here. As I understand them:

   1. Iceberg metadata describes all properties of a table
   2. Hive table properties describe "how to get to" Iceberg metadata
   (which catalog + possibly ptr, path, token, etc)
   3. There could be default "how to get to" information set at a global
   level
   4. Best-effort schema should be stored in the table properties in HMS.
   This should be done for information schema retrieval purposes within Hive
   but should be ignored during Hive/other tool execution.

Is that a fair summary of your statements Ryan (except 4, which I just
added)?

One comment I have on #2 is that for different catalogs and use cases, I
think it can be somewhat more complex where it would be desirable for a
table that initially existed without Hive that was later exposed in Hive to
support a ptr/path/token for how the table is named externally. For
example, in a Nessie context we support arbitrary paths for an Iceberg
table (such as folder1.folder2.folder3.table1). If you then want to expose
that table to Hive, you might have this mapping for #2

db1.table1 => nessie:folder1.folder2.folder3.table1

Similarly, you might want to expose a particular branch version of a table.
So it might say:

db1.table1_etl_branch => nessie.folder1@etl_branch

Just saying that the address to the table in the catalog could itself have
several properties. The key being that no matter what those are, we should
follow #1 and only store properties that are about the ptr, not the
content/metadata.

Lastly, I believe #4 is the case but haven't tested it. Can someone confirm
that it is true? And that it is possible/not problematic?


--
Jacques Nadeau
CTO and Co-Founder, Dremio
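
To make the pointer-only idea in #2 concrete, the HMS entry for the Nessie-backed example above might carry something like the properties below. The key names are hypothetical and chosen only for illustration; the point is that they describe where the table lives (catalog, external path, branch) and never the table metadata itself.

    import java.util.Map;

    // Hypothetical "how to get to" properties for db1.table1_etl_branch; the keys are
    // invented for illustration and hold only the pointer, never Iceberg metadata.
    public class NessiePointerExample {
      public static Map<String, String> pointerProperties() {
        return Map.of(
            "iceberg.catalog", "nessie",                                  // which catalog to load
            "iceberg.table_identifier", "folder1.folder2.folder3.table1", // external table path
            "iceberg.catalog.ref", "etl_branch");                         // branch to resolve
      }
    }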


On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Thanks for working on this, Laszlo. I’ve been thinking about these
> problems as well, so this is a good time to have a discussion about Hive
> config.
>
> I think that Hive configuration should work mostly like other engines,
> where different configurations are used for different purposes. Different
> purposes means that there is not a global configuration priority.
> Hopefully, I can explain how we use the different config sources elsewhere
> to clarify.
>
> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop
> Configuration, but it also has its own global configuration. There are also
> Iceberg table properties, and all of the various Hive properties if you’re
> tracking tables with a Hive MetaStore.
>
> The first step is to simplify where we can, so we effectively eliminate 2
> sources of config:
>
>    - The Hadoop Configuration is only used to instantiate Hadoop classes,
>    like FileSystem. Iceberg should not use it for any other config.
>    - Config in the Hive MetaStore is only used to identify that a table
>    is Iceberg and point to its metadata location. All other config in HMS is
>    informational. For example, the input format is FileInputFormat so that
>    non-Iceberg readers cannot actually instantiate the format (it’s abstract)
>    but it is available so they also don’t fail trying to load the class.
>    Table-specific config should not be stored in table or serde properties.
>
> That leaves Spark configuration and Iceberg table configuration.
>
> Iceberg differs from other tables because it is opinionated: data
> configuration should be maintained at the table level. This is cleaner for
> users because config is standardized across engines and in one place. And
> it also enables services that analyze a table and update its configuration
> to tune options that users almost never do, like row group or stripe size
> in the columnar formats. Iceberg table configuration is used to configure
> table-specific concerns and behavior.
>
> Spark configuration is used for engine-specific concerns, and runtime
> overrides. A good example of an engine-specific concern is the catalogs
> that are available to load Iceberg tables. Spark has a way to load and
> configure catalog implementations and Iceberg uses that for all
> catalog-level config. Runtime overrides are things like target split size.
> Iceberg has a table-level default split size in table properties, but this
> can be overridden by a Spark option for each table, as well as an option
> passed to the individual read. Note that these necessarily have different
> config names for how they are used: Iceberg uses read.split.target-size
> and the read-specific option is target-size.
>
> Applying this to Hive is a little strange for a couple reasons. First,
> Hive’s engine configuration *is* a Hadoop Configuration. As a result, I
> think the right place to store engine-specific config is there, including
> Iceberg catalogs using a strategy similar to what Spark does: what external
> Iceberg catalogs are available and their configuration should come from the
> HiveConf.
>
> The second way Hive is strange is that Hive needs to use its own MetaStore
> to track Hive table concerns. The MetaStore may have tables created by an
> Iceberg HiveCatalog, and Hive also needs to be able to load tables from
> other Iceberg catalogs by creating table entries for them.
>
> Here’s how I think Hive should work:
>
>    - There should be a default HiveCatalog that uses the current
>    MetaStore URI to be used for HiveCatalog tables tracked in the MetaStore
>    - Other catalogs should be defined in HiveConf
>    - HMS table properties should be used to determine how to load a
>    table: using a Hadoop location, using the default metastore catalog, or
>    using an external Iceberg catalog
>       - If there is a metadata_location, then use the HiveCatalog for
>       this metastore (where it is tracked)
>       - If there is a catalog property, then load that catalog and use it
>       to load the table identifier, or maybe an identifier from HMS table
>       properties
>       - If there is no catalog or metadata_location, then use
>       HadoopTables to load the table location as an Iceberg table
>
> This would make it possible to access all types of Iceberg tables in the
> same query, and would match how Spark and Flink configure catalogs. Other
> than the configuration above, I don’t think that config in HMS should be
> used at all, like how the other engines work. Iceberg is the source of
> truth for table metadata, HMS stores how to load the Iceberg table, and
> HiveConf defines the catalogs (or runtime overrides).
>
> This isn’t quite how configuration works right now. Currently, the catalog
> is controlled by a HiveConf property, iceberg.mr.catalog. If that isn’t
> set, HadoopTables will be used to load table locations. If it is set, then
> that catalog will be used to load all tables by name. This makes it
> impossible to load tables from different catalogs at the same time. That’s
> why I think the Iceberg catalog for a table should be stored in HMS table
> properties.
>
> I should also explain the iceberg.hive.engine.enabled flag, but I think
> this is long enough for now.
>
> rb
>
> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lp...@cloudera.com.invalid>
> wrote:
>
>> Hi All,
>>
>> I would like to start a discussion, how should we handle properties from
>> various sources like Iceberg, Hive or global configuration. I've put
>> together a short document
>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>> please have a look and let me know what you think.
>>
>> Thanks,
>> Laszlo
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Iceberg/Hive properties handling

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Thanks for working on this, Laszlo. I’ve been thinking about these problems
as well, so this is a good time to have a discussion about Hive config.

I think that Hive configuration should work mostly like other engines,
where different configurations are used for different purposes. Different
purposes means that there is not a global configuration priority.
Hopefully, I can explain how we use the different config sources elsewhere
to clarify.

Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop
Configuration, but it also has its own global configuration. There are also
Iceberg table properties, and all of the various Hive properties if you’re
tracking tables with a Hive MetaStore.

The first step is to simplify where we can, so we effectively eliminate 2
sources of config:

   - The Hadoop Configuration is only used to instantiate Hadoop classes,
   like FileSystem. Iceberg should not use it for any other config.
   - Config in the Hive MetaStore is only used to identify that a table is
   Iceberg and point to its metadata location. All other config in HMS is
   informational. For example, the input format is FileInputFormat so that
   non-Iceberg readers cannot actually instantiate the format (it’s abstract)
   but it is available so they also don’t fail trying to load the class.
   Table-specific config should not be stored in table or serde properties.
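
To make that concrete, the HMS side of a HiveCatalog table can carry little
more than the parameters below (values here are illustrative, not copied
from a real table); everything else lives in the Iceberg metadata file that
metadata_location points to:

    table_type=ICEBERG
    metadata_location=s3://bucket/warehouse/db/tbl/metadata/00012-abc.metadata.json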

That leaves Spark configuration and Iceberg table configuration.

Iceberg differs from other tables because it is opinionated: data
configuration should be maintained at the table level. This is cleaner for
users because config is standardized across engines and in one place. And
it also enables services that analyze a table and update its configuration
to tune options that users almost never do, like row group or stripe size
in the columnar formats. Iceberg table configuration is used to configure
table-specific concerns and behavior.
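
As a rough sketch of what that table-level tuning looks like through the
Iceberg Java API (the catalog handle and the table name here are just
whatever is in use; the property is one of the standard write.* table
properties):

    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;

    public class TableTuning {
      // A maintenance job (or an admin) sets a format-level knob once, on the
      // table, and every engine picks it up on its next read or write.
      static void tuneRowGroups(Catalog catalog) {
        Table table = catalog.loadTable(TableIdentifier.of("db", "tbl"));
        table.updateProperties()
            .set("write.parquet.row-group-size-bytes", "134217728")  // ~128 MB
            .commit();
      }
    }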

Spark configuration is used for engine-specific concerns, and runtime
overrides. A good example of an engine-specific concern is the catalogs
that are available to load Iceberg tables. Spark has a way to load and
configure catalog implementations and Iceberg uses that for all
catalog-level config. Runtime overrides are things like target split size.
Iceberg has a table-level default split size in table properties, but this
can be overridden by a Spark option for each table, as well as an option
passed to the individual read. Note that these necessarily have different
config names for how they are used: Iceberg uses read.split.target-size and
the read-specific option is target-size.
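
A sketch of the two levels from the Spark side (the per-read option name is
the one mentioned above; exact option names can differ between Iceberg
versions, so treat this as illustrative):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SplitOverride {
      // The table-level default comes from read.split.target-size in the
      // table's properties; this one read asks for 512 MB splits instead.
      static Dataset<Row> readWithLargerSplits(SparkSession spark) {
        return spark.read()
            .format("iceberg")
            .option("target-size", Long.toString(512L * 1024 * 1024))
            .load("hdfs://nn:8020/warehouse/db/tbl");  // or a table name
      }
    }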

Applying this to Hive is a little strange for a couple reasons. First,
Hive’s engine configuration *is* a Hadoop Configuration. As a result, I
think the right place to store engine-specific config is there, including
Iceberg catalogs using a strategy similar to what Spark does: what external
Iceberg catalogs are available and their configuration should come from the
HiveConf.
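
Hypothetically, that could mirror Spark's spark.sql.catalog.<name> scheme,
e.g. entries in hive-site.xml or set per session (the property names below
are purely illustrative, not an existing scheme):

    SET iceberg.catalog.prod.type=hive;
    SET iceberg.catalog.prod.uri=thrift://metastore-host:9083;
    SET iceberg.catalog.lake.type=hadoop;
    SET iceberg.catalog.lake.warehouse=hdfs://nn:8020/warehouse;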

The second way Hive is strange is that Hive needs to use its own MetaStore
to track Hive table concerns. The MetaStore may have tables created by an
Iceberg HiveCatalog, and Hive also needs to be able to load tables from
other Iceberg catalogs by creating table entries for them.

Here’s how I think Hive should work:

   - There should be a default HiveCatalog that uses the current MetaStore
   URI to be used for HiveCatalog tables tracked in the MetaStore
   - Other catalogs should be defined in HiveConf
   - HMS table properties should be used to determine how to load a table:
   using a Hadoop location, using the default metastore catalog, or using an
   external Iceberg catalog
      - If there is a metadata_location, then use the HiveCatalog for this
      metastore (where it is tracked)
      - If there is a catalog property, then load that catalog and use it
      to load the table identifier, or maybe an identifier from HMS table
      properties
      - If there is no catalog or metadata_location, then use HadoopTables
      to load the table location as an Iceberg table
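
In code, that dispatch could be as small as the sketch below (the parameter
names "iceberg.catalog" and "location" and the two catalog-backed helpers
are placeholders, not existing Iceberg classes):

    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.hadoop.HadoopTables;

    public class HmsTableResolver {

      Table resolve(Map<String, String> hmsParams, Configuration conf) {
        if (hmsParams.containsKey("metadata_location")) {
          // Tracked by this metastore: load through the default HiveCatalog
          return loadFromDefaultHiveCatalog(hmsParams, conf);
        } else if (hmsParams.containsKey("iceberg.catalog")) {
          // Registered from an external catalog: load that catalog (configured
          // in HiveConf) and look the table identifier up there
          return loadFromNamedCatalog(hmsParams.get("iceberg.catalog"), hmsParams, conf);
        } else {
          // No catalog information at all: fall back to the table location
          return new HadoopTables(conf).load(hmsParams.get("location"));
        }
      }

      // Placeholder: would build the default HiveCatalog for this metastore
      private Table loadFromDefaultHiveCatalog(Map<String, String> params, Configuration conf) {
        throw new UnsupportedOperationException("sketch only");
      }

      // Placeholder: would load the named catalog from HiveConf settings
      private Table loadFromNamedCatalog(String name, Map<String, String> params, Configuration conf) {
        throw new UnsupportedOperationException("sketch only");
      }
    }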

This would make it possible to access all types of Iceberg tables in the
same query, and would match how Spark and Flink configure catalogs. Other
than the configuration above, I don’t think that config in HMS should be
used at all, like how the other engines work. Iceberg is the source of
truth for table metadata, HMS stores how to load the Iceberg table, and
HiveConf defines the catalogs (or runtime overrides).

This isn’t quite how configuration works right now. Currently, the catalog
is controlled by a HiveConf property, iceberg.mr.catalog. If that isn’t
set, HadoopTables will be used to load table locations. If it is set, then
that catalog will be used to load all tables by name. This makes it
impossible to load tables from different catalogs at the same time. That’s
why I think the Iceberg catalog for a table should be stored in HMS table
properties.

I should also explain the iceberg.hive.engine.enabled flag, but I think this
is long enough for now.

rb

On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lp...@cloudera.com.invalid>
wrote:

> Hi All,
>
> I would like to start a discussion, how should we handle properties from
> various sources like Iceberg, Hive or global configuration. I've put
> together a short document
> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
> please have a look and let me know what you think.
>
> Thanks,
> Laszlo
>


-- 
Ryan Blue
Software Engineer
Netflix