Posted to dev@iceberg.apache.org by Peter Vary <pv...@cloudera.com.INVALID> on 2021/01/11 06:33:11 UTC

Re: Iceberg/Hive properties handling

Hi Team,

@Jacques Nadeau <ja...@dremio.com>: you mentioned that you might
consolidate the thoughts in a document for the path forward. Did you have
time for that, or did the holidays overwrite all of the plans as usual? :)

Other: Ryan convinced me that it would be good to move forward with the
synchronised Hive-Iceberg property list whenever possible, and use
the Iceberg table properties as the master when not. This would be the
solution that aligns best with the other integration solutions.

Thanks, Peter

Peter Vary <pv...@cloudera.com> wrote (on Dec 10, 2020, Thu 8:27):

> I like the strong coupling between Hive and Iceberg if we can make it
> work. It could be beneficial for the end users, but I still have some
> concerns.
> We should consider the following aspects:
> - Where was the change initiated (Hive or Spark)
> - Which Catalog is used (HiveCatalog or other)
> - Which Hive version is used (Hive 2/3)
>
> Some current constraints I think we have:
> - There could be multiple Hive tables on top of a single Iceberg table with
> most of the Catalogs (HiveCatalog being the single exception)
> - I see no way to propagate Spark changes to HMS if the Catalog is not
> HiveCatalog
> - Only Hive 3 can propagate changes to the Iceberg table after
> creation
> - Hive inserts modify the table data (one Iceberg commit) and then the
> table metadata (another Iceberg commit). This could be suboptimal but
> solvable.
>
> My feeling is that the tight coupling could work as expected only with the
> HiveCatalog on Hive 3. In every other case the Iceberg and the HMS
> properties will deviate. That is why I think it would be easier for the
> user to understand that Iceberg and Hive are different systems with
> different properties.
>
> All that said, we will use Hive 3 and HiveCatalog, so I think we are fine
> with a 1-to-1 mapping too.
> If we go this way, we should remove the current property filtering from
> the HiveCatalog and from the HiveIcebergMetaHook, so we are consistent.
>
> Thanks, Peter
>
> Jacques Nadeau <ja...@dremio.com> wrote (on Dec 9, 2020, Wed 23:01):
>
>>> Who cares if there are a few extra properties from Hive? Users may expect
>>> those properties to be there anyway.
>>
>>
>> Yeah, what is the key argument against letting them leak? What problem
>> are people trying to solve?
>>
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>
>> On Wed, Dec 9, 2020 at 12:47 PM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> I agree that #2 doesn’t really work. I also think that #4 can’t work
>>> either. There is no way to add a prefix for HMS properties that already
>>> exist, so the only option is to have a list of properties to suppress,
>>> which is option #1.
>>>
>>> I think that option #3 is a bad idea because it would lead to surprising
>>> behavior for users. If a user creates a table using Hive DDL and sets table
>>> properties, those properties should be present in the source of truth
>>> Iceberg table. If a prefix was required to forward them to Iceberg, that
>>> would create a situation where properties appear to be missing because the
>>> user tried to use syntax that works for nearly every other table.
>>>
>>> That leaves either option #1 or doing nothing. I actually think that
>>> there’s a strong argument to do nothing here and allow Hive and Iceberg
>>> properties to be mixed in the Iceberg table. Who cares if there are a few
>>> extra properties from Hive? Users may expect those properties to be there
>>> anyway.
>>>
>>> rb
>>>
>>> On Mon, Dec 7, 2020 at 9:58 AM Jacques Nadeau <jacques@dremio.com> wrote:
>>>
>>> Hey Peter, thanks for updating the doc and your heads up in the other
>>>> thread on your capacity to look at this before EOY.
>>>>
>>>> I'm going to try to create a specification document based on the
>>>> discussion document you put together. I think there is general consensus
>>>> around what you call "Spark-like catalog configuration" so I'd like to
>>>> formalize that more.
>>>>
>>>> It seems like there is less consensus around the whitelist/blacklist
>>>> side of things. You outline four approaches:
>>>>
>>>>    1. Hard coded HMS only property list
>>>>    2. Hard coded Iceberg only property list
>>>>    3. Prefix for Iceberg properties
>>>>    4. Prefix for HMS only properties
>>>>
>>>> I generally think #2 is a no-go as it creates too much coupling between
>>>> catalog implementations and core iceberg. It seems like Ryan Blue would
>>>> prefer #4 (correct?). Any other strong opinions?
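>>>>
>>>> To make option #1 concrete, a minimal sketch (the property names and the
>>>> helper class are illustrative, not existing Iceberg code):
>>>>
>>>> import java.util.HashMap;
>>>> import java.util.Map;
>>>> import java.util.Set;
>>>>
>>>> class HmsPropertyFilter {
>>>>   // Option #1 sketch: suppress a hard-coded list of HMS-only keys and
>>>>   // pass everything else through to the Iceberg table.
>>>>   private static final Set<String> HMS_ONLY =
>>>>       Set.of("EXTERNAL", "numRows", "totalSize", "transient_lastDdlTime");
>>>>
>>>>   static Map<String, String> toIcebergProperties(Map<String, String> hmsProps) {
>>>>     Map<String, String> filtered = new HashMap<>(hmsProps);
>>>>     filtered.keySet().removeAll(HMS_ONLY);  // drop HMS bookkeeping keys
>>>>     return filtered;
>>>>   }
>>>> }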
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
>>>>
>>>>
>>>> On Thu, Dec 3, 2020 at 9:27 AM Peter Vary <pv...@cloudera.com.invalid>
>>>> wrote:
>>>>
>>>>> As Jacques suggested (with the help of Zoltan) I have collected the
>>>>> current state and the proposed solutions in a document:
>>>>>
>>>>> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
>>>>>
>>>>> My feeling is that we do not have a final decision, so I tried to list
>>>>> all the possible solutions.
>>>>> Please comment!
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
>>>>>
>>>>> When I was working on the CREATE TABLE patch I found the following
>>>>> TBLPROPERTIES on newly created tables:
>>>>>
>>>>>    - external.table.purge
>>>>>    - EXTERNAL
>>>>>    - bucketing_version
>>>>>    - numRows
>>>>>    - rawDataSize
>>>>>    - totalSize
>>>>>    - numFiles
>>>>>    - numFileErasureCoded
>>>>>
>>>>>
>>>>> I am afraid that we cannot change the names of most of these
>>>>> properties, and it might not be useful to have most of them alongside the
>>>>> Iceberg statistics that are already there. Also, my feeling is that this is
>>>>> only the tip of the Iceberg (pun intended :)), which is why I think we need
>>>>> a more targeted way to push properties to the Iceberg tables.
>>>>>
>>>>> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>> Sorry, I accidentally didn’t copy the dev list on this reply.
>>>>> Resending:
>>>>>
>>>>> Also, I expect that we will want to add Hive-write-specific configs at the
>>>>> table level when the general engine-independent configuration is not ideal
>>>>> for Hive, but every Hive query for a given table should use some specific
>>>>> config.
>>>>>
>>>>> Hive may need configuration, but I think these should still be kept in
>>>>> the Iceberg table. There is no reason to make Hive config inaccessible from
>>>>> other engines. If someone wants to view all of the config for a table from
>>>>> Spark, the Hive config should also be included right?
>>>>>
>>>>> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com> wrote:
>>>>>
>>>>>> I will ask Laszlo if he wants to update his doc.
>>>>>>
>>>>>> I see both pros and cons of catalog definition in config files. If
>>>>>> there is an easy default then I do not mind any of the proposed solutions.
>>>>>>
>>>>>> OTOH I am in favor of the "use prefix for Iceberg table properties"
>>>>>> solution, because in Hive it is common to add new keys to the property list
>>>>>> - no restriction is in place (I am not even sure that the currently
>>>>>> implemented blacklist for preventing properties from propagating to Iceberg
>>>>>> tables is complete). Also, I expect that we will want to add
>>>>>> Hive-write-specific configs at the table level when the general
>>>>>> engine-independent configuration is not ideal for Hive, but every Hive
>>>>>> query for a given table should use some specific config.
>>>>>>
>>>>>> Thanks, Peter
>>>>>>
>>>>>> Jacques Nadeau <ja...@dremio.com> wrote (on Dec 1, 2020, Tue 17:06):
>>>>>>
>>>>>>> Would someone be willing to create a document that states the
>>>>>>> current proposal?
>>>>>>>
>>>>>>> It is becoming somewhat difficult to follow this thread. I also
>>>>>>> worry that without a complete statement of the current shape that people
>>>>>>> may be incorrectly thinking they are in alignment.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jacques Nadeau
>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <
>>>>>>> boroknagyz@cloudera.com> wrote:
>>>>>>>
>>>>>>>> Thanks, Ryan. I answered inline.
>>>>>>>>
>>>>>>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This sounds like a good plan overall, but I have a couple of notes:
>>>>>>>>>
>>>>>>>>>    1. We need to keep in mind that users plug in their own
>>>>>>>>>    catalogs, so iceberg.catalog could be a Glue or Nessie
>>>>>>>>>    catalog, not just Hive or Hadoop. I don’t think it makes much sense to use
>>>>>>>>>    separate hadoop.catalog and hive.catalog values. Those should just be names
>>>>>>>>>    for catalogs configured in Configuration, i.e., via
>>>>>>>>>    hive-site.xml. We then only need a special value for loading
>>>>>>>>>    Hadoop tables from paths.
>>>>>>>>>
>>>>>>>>> About extensibility, I think the usual Hive way is to use Java
>>>>>>>> class names. So this way the value for 'iceberg.catalog' could be e.g.
>>>>>>>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
>>>>>>>> would need to have a factory method that constructs the catalog object from
>>>>>>>> a properties object (Map<String, String>). E.g.
>>>>>>>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
>>>>>>>> 'iceberg.catalog_location' to be present in properties.
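>>>>>>>>
>>>>>>>> A rough sketch of that idea (the static 'from' factory method is an
>>>>>>>> assumption here, not something the catalog implementations expose today):
>>>>>>>>
>>>>>>>> import java.util.Map;
>>>>>>>> import org.apache.iceberg.catalog.Catalog;
>>>>>>>>
>>>>>>>> class CatalogFactory {
>>>>>>>>   // Instantiate the catalog named by the 'iceberg.catalog' table
>>>>>>>>   // property via a hypothetical from(Map) factory method.
>>>>>>>>   static Catalog load(Map<String, String> props) throws Exception {
>>>>>>>>     Class<?> impl = Class.forName(props.get("iceberg.catalog"));
>>>>>>>>     return (Catalog) impl.getMethod("from", Map.class).invoke(null, props);
>>>>>>>>   }
>>>>>>>> }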
>>>>>>>>
>>>>>>>>>
>>>>>>>>>    1. I don’t think that catalog configuration should be kept in
>>>>>>>>>    table properties. A catalog should not be loaded for each table. So I don’t
>>>>>>>>>    think we need iceberg.catalog_location. Instead, we should
>>>>>>>>>    have a way to define catalogs in the Configuration for tables
>>>>>>>>>    in the metastore to reference.
>>>>>>>>>
>>>>>>>> I think it makes sense; on the other hand, it would make adding
>>>>>>>> new catalogs more heavyweight, i.e. now you'd need to edit configuration
>>>>>>>> files and restart/reinit services. That could be cumbersome in some
>>>>>>>> environments.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>    1. I’d rather use a prefix to exclude properties from being
>>>>>>>>>    passed to Iceberg than to include them. Otherwise, users don’t know what to
>>>>>>>>>    do to pass table properties from Hive or Impala. If we exclude a prefix or
>>>>>>>>>    specific properties, then everything but the properties reserved for
>>>>>>>>>    locating the table are passed as the user would expect.
>>>>>>>>>
>>>>>>>>> I don't have a strong opinion about this, but yeah, maybe this
>>>>>>>> behavior would cause the least surprises.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <
>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks, Peter. I answered inline.
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <
>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Zoltan,
>>>>>>>>>>>
>>>>>>>>>>> Answers below:
>>>>>>>>>>>
>>>>>>>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>>>>>>>>> boroknagyz@cloudera.com.INVALID> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the replies. My take for the above questions are as
>>>>>>>>>>> follows
>>>>>>>>>>>
>>>>>>>>>>>    - Should 'iceberg.catalog' be a required property?
>>>>>>>>>>>    - Yeah, I think it would be nice if this would be required
>>>>>>>>>>>       to avoid any implicit behavior
>>>>>>>>>>>
>>>>>>>>>>> Currently we have a Catalogs class to get/initialize/use the
>>>>>>>>>>> different Catalogs. At that time the decision was to use HadoopTables as a
>>>>>>>>>>> default catalog.
>>>>>>>>>>> It might be worthwhile to use the same class in Impala as well,
>>>>>>>>>>> so the behavior is consistent.
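>>>>>>>>>>>
>>>>>>>>>>> Usage is roughly the following (a sketch; Catalogs reads the
>>>>>>>>>>> catalog-related properties from the Configuration):
>>>>>>>>>>>
>>>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>> import org.apache.iceberg.Table;
>>>>>>>>>>> import org.apache.iceberg.mr.Catalogs;
>>>>>>>>>>>
>>>>>>>>>>> class CatalogsUsage {
>>>>>>>>>>>   // Catalogs picks the configured catalog and falls back to
>>>>>>>>>>>   // HadoopTables when none is configured.
>>>>>>>>>>>   static Table load(Configuration conf) {
>>>>>>>>>>>     return Catalogs.loadTable(conf);
>>>>>>>>>>>   }
>>>>>>>>>>> }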
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yeah, I think it'd be beneficial for us to use the Iceberg
>>>>>>>>>> classes whenever possible. The Catalogs class is very similar to what we
>>>>>>>>>> have currently in Impala.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - 'hadoop.catalog' LOCATION and catalog_location
>>>>>>>>>>>       - In Impala we don't allow setting LOCATION for tables
>>>>>>>>>>>       stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the
>>>>>>>>>>>       Iceberg table's actual location. We were also thinking about using only the
>>>>>>>>>>>       table LOCATION, and set it to the catalog location, but we also found it
>>>>>>>>>>>       confusing.
>>>>>>>>>>>
>>>>>>>>>>> It could definitely work, but it is somewhat strange that we
>>>>>>>>>>> have an external table location set to an arbitrary path, and we have a
>>>>>>>>>>> different location generated by other configs. It would be nice to have the
>>>>>>>>>>> real location set in the external table location as well.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Impala sets the real Iceberg table location for external tables.
>>>>>>>>>> E.g. if the user issues
>>>>>>>>>>
>>>>>>>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>>>>>>>> STORED AS ICEBERG
>>>>>>>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>>>>>>>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>>>>>>>               'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>>>>>>>
>>>>>>>>>> If the end user had specified LOCATION, then Impala would have
>>>>>>>>>> raised an error. But the above DDL statement is correct, so Impala loads
>>>>>>>>>> the iceberg table via Iceberg API, then creates the HMS table and sets
>>>>>>>>>> LOCATION to the Iceberg table location (something like
>>>>>>>>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
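>>>>>>>>>>
>>>>>>>>>> Under the hood that is roughly equivalent to this sketch (paths and
>>>>>>>>>> identifier taken from the DDL above):
>>>>>>>>>>
>>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>>> import org.apache.iceberg.Table;
>>>>>>>>>> import org.apache.iceberg.catalog.TableIdentifier;
>>>>>>>>>> import org.apache.iceberg.hadoop.HadoopCatalog;
>>>>>>>>>>
>>>>>>>>>> class ResolveLocation {
>>>>>>>>>>   // Load the table through the Iceberg API and take its real
>>>>>>>>>>   // location, which is then set as the HMS table LOCATION.
>>>>>>>>>>   static String hmsLocation(Configuration conf) {
>>>>>>>>>>     HadoopCatalog catalog = new HadoopCatalog(conf, "/path/to/hadoop/catalog");
>>>>>>>>>>     Table table =
>>>>>>>>>>         catalog.loadTable(TableIdentifier.parse("namespace1.namespace2.ice_t"));
>>>>>>>>>>     return table.location();  // .../namespace1/namespace2/ice_t
>>>>>>>>>>   }
>>>>>>>>>> }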
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I like the flexibility of setting the table_identifier on table
>>>>>>>>>>> level, which could help avoid naming conflicts. We might want to have
>>>>>>>>>>> this in the Iceberg Catalog implementation.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - 'iceberg.table_identifier' for HiveCatalog
>>>>>>>>>>>       - Yeah, it doesn't add much if we only allow using the
>>>>>>>>>>>       current HMS. I think it can be only useful if we are allowing external
>>>>>>>>>>>       HMSes.
>>>>>>>>>>>    - Moving properties to SERDEPROPERTIES
>>>>>>>>>>>       - I see that these properties are used by the SerDe
>>>>>>>>>>>       classes in Hive, but I feel that these properties are just not about
>>>>>>>>>>>       serialization and deserialization. And as I see the current SERDEPROPERTIES
>>>>>>>>>>>       are things like 'field.delim', 'separatorChar', 'quoteChar', etc. So
>>>>>>>>>>>       properties about table loading more naturally belong to TBLPROPERTIES in my
>>>>>>>>>>>       opinion.
>>>>>>>>>>>
>>>>>>>>>>> I have seen it used both ways for HBaseSerDe (even the wiki
>>>>>>>>>>> page uses both :) ). Since Impala prefers TBLPROPERTIES, and if we start
>>>>>>>>>>> using a prefix to separate real Iceberg table properties from other
>>>>>>>>>>> properties, then we can keep them in TBLPROPERTIES.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In the google doc I also had a comment about prefixing iceberg
>>>>>>>>>> table properties. We could use a prefix like 'iceberg.tblproperties.', and
>>>>>>>>>> pass every property with this prefix to the Iceberg table. Currently Impala
>>>>>>>>>> passes every table property to the Iceberg table.
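>>>>>>>>>>
>>>>>>>>>> A minimal sketch of that prefix handling (the prefix name is the
>>>>>>>>>> proposal above, nothing is implemented yet):
>>>>>>>>>>
>>>>>>>>>> import java.util.HashMap;
>>>>>>>>>> import java.util.Map;
>>>>>>>>>>
>>>>>>>>>> class PrefixedProperties {
>>>>>>>>>>   static final String PREFIX = "iceberg.tblproperties.";
>>>>>>>>>>
>>>>>>>>>>   // Forward only the keys carrying the prefix, with the prefix
>>>>>>>>>>   // stripped, to the Iceberg table.
>>>>>>>>>>   static Map<String, String> icebergProperties(Map<String, String> tblProps) {
>>>>>>>>>>     Map<String, String> result = new HashMap<>();
>>>>>>>>>>     tblProps.forEach((k, v) -> {
>>>>>>>>>>       if (k.startsWith(PREFIX)) {
>>>>>>>>>>         result.put(k.substring(PREFIX.length()), v);
>>>>>>>>>>       }
>>>>>>>>>>     });
>>>>>>>>>>     return result;
>>>>>>>>>>   }
>>>>>>>>>> }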
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>     Zoltan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <
>>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Based on the discussion below I understand we have the
>>>>>>>>>>>> following kinds of properties:
>>>>>>>>>>>>
>>>>>>>>>>>>    1. Iceberg table properties - Engine independent, storage
>>>>>>>>>>>>    related parameters
>>>>>>>>>>>>    2. "how to get to" - I think these are mostly Hive table
>>>>>>>>>>>>    specific properties, since for Spark, the Spark catalog configuration
>>>>>>>>>>>>    serves for the same purpose. I think the best place for storing these would
>>>>>>>>>>>>    be the Hive SERDEPROPERTIES, as this describes the access information for
>>>>>>>>>>>>    the SerDe. Sidenote: I think we should decide if we allow
>>>>>>>>>>>>    HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier'
>>>>>>>>>>>>    would make sense only if we allow having multiple catalogs.
>>>>>>>>>>>>    3. Query specific properties - These are engine specific
>>>>>>>>>>>>    and might be mapped to / even override the Iceberg table properties on the
>>>>>>>>>>>>    engine specific code paths, but currently these properties have independent
>>>>>>>>>>>>    names and mapped on a case-by-case basis.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Based on this:
>>>>>>>>>>>>
>>>>>>>>>>>>    - Shall we move the "how to get to" properties to
>>>>>>>>>>>>    SERDEPROPERTIES?
>>>>>>>>>>>>    - Shall we define a prefix for setting Iceberg table
>>>>>>>>>>>>    properties from Hive queries and omitting other engine specific properties?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Peter
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I like these suggestions, comments inline below on the last
>>>>>>>>>>>> round...
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <
>>>>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>>>>>>>>> information about table loading in HMS table properties. We are just a bit
>>>>>>>>>>>>> more explicit about which catalog to use.
>>>>>>>>>>>>> We have table property 'iceberg.catalog' to determine the
>>>>>>>>>>>>> catalog type, right now the supported values are 'hadoop.tables',
>>>>>>>>>>>>> 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be
>>>>>>>>>>>>> set based on the catalog type.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I'm all for renaming this, having "mr" in the property name is
>>>>>>>>>>>> confusing.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - hadoop.tables
>>>>>>>>>>>>>       - the table location is used to load the table
>>>>>>>>>>>>>
>>>>>>>>>>>> The only question I have is: should we have this as the
>>>>>>>>>>>> default? I.e. if you don't set a catalog, it will assume it's HadoopTables
>>>>>>>>>>>> and use the location? Or should we require this property to be here, to be
>>>>>>>>>>>> consistent and avoid any "magic"?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - hadoop.catalog
>>>>>>>>>>>>>       - Required table property 'iceberg.catalog_location'
>>>>>>>>>>>>>       specifies the location of the hadoop catalog in the file system
>>>>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>>       is used as table identifier
>>>>>>>>>>>>>
>>>>>>>>>>>>> I like this as it would allow you to use a different database
>>>>>>>>>>>> and table name in Hive as opposed to the Hadoop Catalog - at the moment
>>>>>>>>>>>> they have to match. The only thing here is that I think Hive requires a
>>>>>>>>>>>> table LOCATION to be set and it's then confusing as there are now two
>>>>>>>>>>>> locations on the table. I'm not sure whether in the Hive storage handler or
>>>>>>>>>>>> SerDe etc. we can get Hive to not require that and maybe even disallow it
>>>>>>>>>>>> from being set. That would probably be best in conjunction with this.
>>>>>>>>>>>> Another solution would be to not have the 'iceberg.catalog_location'
>>>>>>>>>>>> property but instead use the table LOCATION for this but that's a bit
>>>>>>>>>>>> confusing from a Hive point of view.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>    - hive.catalog
>>>>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>>       is used as table identifier
>>>>>>>>>>>>>       - We have the assumption that the current Hive
>>>>>>>>>>>>>       metastore stores the table, i.e. we don't support external Hive
>>>>>>>>>>>>>       metastores currently
>>>>>>>>>>>>>
>>>>>>>>>>>> These sound fine for Hive catalog tables that are created
>>>>>>>>>>>> outside of the automatic Hive table creation (see
>>>>>>>>>>>> https://iceberg.apache.org/hive/ -> Using Hive Catalog); we'd
>>>>>>>>>>>> just need to document how you can create these yourself and that one could
>>>>>>>>>>>> use a different Hive database and table etc.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Independent of the catalog implementation, we also have table
>>>>>>>>>>>>> property 'iceberg.file_format' to specify the file format for the data
>>>>>>>>>>>>> files.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> OK, I don't think we need that for Hive?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> We haven't released it yet, so we are open to changes, but I
>>>>>>>>>>>>> think these properties are reasonable and it would be great if we could
>>>>>>>>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>>>>>>>>> metastore of tables.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> If others agree, I think we should create an issue where we
>>>>>>>>>>>> document the above changes so it's very clear what we're doing, and can then
>>>>>>>>>>>> go and implement them and update the docs etc.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>     Zoltan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <
>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #4 is correct because we provide some information that is
>>>>>>>>>>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>>>>>>>>>>> current user). I also agree that it would be good to have a table
>>>>>>>>>>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>>>>>>>>>>> gives us a way to handle name conflicts.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <
>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> db1.table1_etl_branch =>
>>>>>>>>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <
>>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree with Ryan on the core principles here. As I
>>>>>>>>>>>>>>>> understand them:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>>>>>>>>>>>>    2. Hive table properties describe "how to get to"
>>>>>>>>>>>>>>>>    Iceberg metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>>>>>>>>>    3. There could be default "how to get to" information
>>>>>>>>>>>>>>>>    set at a global level
>>>>>>>>>>>>>>>>    4. Best-effort schema should be stored in the table
>>>>>>>>>>>>>>>>    properties in HMS. This should be done for information schema retrieval
>>>>>>>>>>>>>>>>    purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is that a fair summary of your statements Ryan (except 4,
>>>>>>>>>>>>>>>> which I just added)?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One comment I have on #2 is that for different catalogs and
>>>>>>>>>>>>>>>> use cases, I think it can be somewhat more complex: it would be desirable
>>>>>>>>>>>>>>>> for a table that initially existed without Hive, and was later exposed in
>>>>>>>>>>>>>>>> Hive, to support a ptr/path/token for how the table is named externally. For
>>>>>>>>>>>>>>>> example, in a Nessie context we support arbitrary paths for an Iceberg table
>>>>>>>>>>>>>>>> (such as folder1.folder2.folder3.table1). If you then want to expose that
>>>>>>>>>>>>>>>> table to Hive, you might have this mapping for #2:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Similarly, you might want to expose a particular branch
>>>>>>>>>>>>>>>> version of a table. So it might say:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Just saying that the address to the table in the catalog
>>>>>>>>>>>>>>>> could itself have several properties. The key being that no matter what
>>>>>>>>>>>>>>>> those are, we should follow #1 and only store properties that are about the
>>>>>>>>>>>>>>>> ptr, not the content/metadata.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can
>>>>>>>>>>>>>>>> someone confirm that it is true? And that it is possible/not problematic?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking
>>>>>>>>>>>>>>>>> about these problems as well, so this is a good time to have a discussion
>>>>>>>>>>>>>>>>> about Hive config.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think that Hive configuration should work mostly like
>>>>>>>>>>>>>>>>> other engines, where different configurations are used for different
>>>>>>>>>>>>>>>>> purposes. Different purposes means that there is not a global configuration
>>>>>>>>>>>>>>>>> priority. Hopefully, I can explain how we use the different config sources
>>>>>>>>>>>>>>>>> elsewhere to clarify.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it
>>>>>>>>>>>>>>>>> has a Hadoop Configuration, but it also has its own global configuration.
>>>>>>>>>>>>>>>>> There are also Iceberg table properties, and all of the various Hive
>>>>>>>>>>>>>>>>> properties if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The first step is to simplify where we can, so we
>>>>>>>>>>>>>>>>> effectively eliminate 2 sources of config:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - The Hadoop Configuration is only used to instantiate
>>>>>>>>>>>>>>>>>    Hadoop classes, like FileSystem. Iceberg should not use it for any other
>>>>>>>>>>>>>>>>>    config.
>>>>>>>>>>>>>>>>>    - Config in the Hive MetaStore is only used to
>>>>>>>>>>>>>>>>>    identify that a table is Iceberg and point to its metadata location. All
>>>>>>>>>>>>>>>>>    other config in HMS is informational. For example, the input format is
>>>>>>>>>>>>>>>>>    FileInputFormat so that non-Iceberg readers cannot actually instantiate the
>>>>>>>>>>>>>>>>>    format (it’s abstract) but it is available so they also don’t fail trying
>>>>>>>>>>>>>>>>>    to load the class. Table-specific config should not be stored in table or
>>>>>>>>>>>>>>>>>    serde properties.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> That leaves Spark configuration and Iceberg table
>>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Iceberg differs from other tables because it is
>>>>>>>>>>>>>>>>> opinionated: data configuration should be maintained at the table level.
>>>>>>>>>>>>>>>>> This is cleaner for users because config is standardized across engines and
>>>>>>>>>>>>>>>>> in one place. And it also enables services that analyze a table and update
>>>>>>>>>>>>>>>>> its configuration to tune options that users almost never do, like row
>>>>>>>>>>>>>>>>> group or stripe size in the columnar formats. Iceberg table configuration
>>>>>>>>>>>>>>>>> is used to configure table-specific concerns and behavior.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Spark configuration is used for engine-specific concerns,
>>>>>>>>>>>>>>>>> and runtime overrides. A good example of an engine-specific concern is the
>>>>>>>>>>>>>>>>> catalogs that are available to load Iceberg tables. Spark has a way to load
>>>>>>>>>>>>>>>>> and configure catalog implementations and Iceberg uses that for all
>>>>>>>>>>>>>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>>>>>>>>>>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>>>>>>>>>>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>>>>>>>>>>>>>> passed to the individual read. Note that these necessarily have different
>>>>>>>>>>>>>>>>> config names for how they are used: Iceberg uses
>>>>>>>>>>>>>>>>> read.split.target-size and the read-specific option is
>>>>>>>>>>>>>>>>> target-size.
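>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In Spark code, that layering looks roughly like this (a sketch;
>>>>>>>>>>>>>>>>> the option name follows the description above and the table name is
>>>>>>>>>>>>>>>>> illustrative):
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> import org.apache.spark.sql.Dataset;
>>>>>>>>>>>>>>>>> import org.apache.spark.sql.Row;
>>>>>>>>>>>>>>>>> import org.apache.spark.sql.SparkSession;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> class SplitSizeOverride {
>>>>>>>>>>>>>>>>>   // The table property read.split.target-size sets the default;
>>>>>>>>>>>>>>>>>   // the per-read option overrides it for this one scan.
>>>>>>>>>>>>>>>>>   static Dataset<Row> read(SparkSession spark) {
>>>>>>>>>>>>>>>>>     return spark.read()
>>>>>>>>>>>>>>>>>         .format("iceberg")
>>>>>>>>>>>>>>>>>         .option("target-size", "134217728")  // 128 MB for this read only
>>>>>>>>>>>>>>>>>         .load("db.table");
>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>> }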
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Applying this to Hive is a little strange for a couple
>>>>>>>>>>>>>>>>> reasons. First, Hive’s engine configuration *is* a Hadoop
>>>>>>>>>>>>>>>>> Configuration. As a result, I think the right place to store
>>>>>>>>>>>>>>>>> engine-specific config is there, including Iceberg catalogs using a
>>>>>>>>>>>>>>>>> strategy similar to what Spark does: what external Iceberg catalogs are
>>>>>>>>>>>>>>>>> available and their configuration should come from the HiveConf.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The second way Hive is strange is that Hive needs to use
>>>>>>>>>>>>>>>>> its own MetaStore to track Hive table concerns. The MetaStore may have
>>>>>>>>>>>>>>>>> tables created by an Iceberg HiveCatalog, and Hive also needs to be able to
>>>>>>>>>>>>>>>>> load tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - There should be a default HiveCatalog that uses the
>>>>>>>>>>>>>>>>>    current MetaStore URI to be used for HiveCatalog tables tracked in the
>>>>>>>>>>>>>>>>>    MetaStore
>>>>>>>>>>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>>>>>>>>>>    - HMS table properties should be used to determine how
>>>>>>>>>>>>>>>>>    to load a table: using a Hadoop location, using the default metastore
>>>>>>>>>>>>>>>>>    catalog, or using an external Iceberg catalog
>>>>>>>>>>>>>>>>>       - If there is a metadata_location, then use the
>>>>>>>>>>>>>>>>>       HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>>>>>>>>       - If there is a catalog property, then load that
>>>>>>>>>>>>>>>>>       catalog and use it to load the table identifier, or maybe an identifier
>>>>>>>>>>>>>>>>>       from HMS table properties
>>>>>>>>>>>>>>>>>       - If there is no catalog or metadata_location, then
>>>>>>>>>>>>>>>>>       use HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>>>>>>>>
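>>>>>>>>>>>>>>>>> In code, the loading rules above would look roughly like this
>>>>>>>>>>>>>>>>> (a sketch; the property keys are illustrative and the two load
>>>>>>>>>>>>>>>>> helpers are left abstract):
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> import java.util.Map;
>>>>>>>>>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>>>>>>>> import org.apache.iceberg.Table;
>>>>>>>>>>>>>>>>> import org.apache.iceberg.hadoop.HadoopTables;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> class TableResolution {
>>>>>>>>>>>>>>>>>   static Table resolve(Map<String, String> props, Configuration conf,
>>>>>>>>>>>>>>>>>       String location) {
>>>>>>>>>>>>>>>>>     if (props.containsKey("metadata_location")) {
>>>>>>>>>>>>>>>>>       // tracked by this metastore: use the default HiveCatalog
>>>>>>>>>>>>>>>>>       return loadViaHiveCatalog(props.get("metadata_location"));
>>>>>>>>>>>>>>>>>     } else if (props.containsKey("catalog")) {
>>>>>>>>>>>>>>>>>       // external catalog defined in HiveConf
>>>>>>>>>>>>>>>>>       return loadViaNamedCatalog(conf, props.get("catalog"), props);
>>>>>>>>>>>>>>>>>     } else {
>>>>>>>>>>>>>>>>>       // neither set: treat the table LOCATION as a path-based table
>>>>>>>>>>>>>>>>>       return new HadoopTables(conf).load(location);
>>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>   private static Table loadViaHiveCatalog(String metadataLocation) {
>>>>>>>>>>>>>>>>>     throw new UnsupportedOperationException("sketch only");
>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>   private static Table loadViaNamedCatalog(Configuration conf,
>>>>>>>>>>>>>>>>>       String name, Map<String, String> props) {
>>>>>>>>>>>>>>>>>     throw new UnsupportedOperationException("sketch only");
>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>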
>>>>>>>>>>>>>>>>> This would make it possible to access all types of Iceberg
>>>>>>>>>>>>>>>>> tables in the same query, and would match how Spark and Flink configure
>>>>>>>>>>>>>>>>> catalogs. Other than the configuration above, I don’t think that config in
>>>>>>>>>>>>>>>>> HMS should be used at all, like how the other engines work. Iceberg is the
>>>>>>>>>>>>>>>>> source of truth for table metadata, HMS stores how to load the Iceberg
>>>>>>>>>>>>>>>>> table, and HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This isn’t quite how configuration works right now.
>>>>>>>>>>>>>>>>> Currently, the catalog is controlled by a HiveConf property,
>>>>>>>>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables will
>>>>>>>>>>>>>>>>> be used to load table locations. If it is set, then that catalog will be
>>>>>>>>>>>>>>>>> used to load all tables by name. This makes it impossible to load tables
>>>>>>>>>>>>>>>>> from different catalogs at the same time. That’s why I think the Iceberg
>>>>>>>>>>>>>>>>> catalog for a table should be stored in HMS table properties.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I should also explain iceberg.hive.engine.enabled flag,
>>>>>>>>>>>>>>>>> but I think this is long enough for now.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I would like to start a discussion, how should we handle
>>>>>>>>>>>>>>>>>> properties from various sources like Iceberg, Hive or global configuration.
>>>>>>>>>>>>>>>>>> I've put together a short document
>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>>
>>>>>
>>>>>

Re: Iceberg/Hive properties handling

Posted by Laszlo Pinter <lp...@cloudera.com.INVALID>.
Hi Team,

Based on the discussion on this thread I implemented the Spark-like
catalog configuration in Hive. You can check the PR here
<https://github.com/apache/iceberg/pull/2129>.
Basically, I extended the current Hive global-config-based catalog
configuration with a table-level catalog configuration, where the latter
has higher priority. I kept the old logic because I wanted to make sure
that I don't break any already implemented use cases that rely on the old
config approach. Sooner or later I would like to remove the old one, because
that would reduce code complexity and improve readability, but I'm not sure
when the right time is.
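
The priority is conceptually just (a sketch; config key names as discussed
earlier in this thread):

import java.util.Properties;
import org.apache.hadoop.conf.Configuration;

class CatalogResolution {
  // A table-level 'iceberg.catalog' property wins over the legacy
  // global 'iceberg.mr.catalog' setting in the HiveConf.
  static String catalogFor(Properties tableProps, Configuration conf) {
    String tableLevel = tableProps.getProperty("iceberg.catalog");
    return tableLevel != null ? tableLevel : conf.get("iceberg.mr.catalog");
  }
}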

Can I introduce such breaking changes between releases? Frankly, I would
keep both for the time being, but deprecate the old one, and completely
remove it in a future release.

Thanks,
Laszlo




On Tue, Jan 12, 2021 at 4:20 AM Jacques Nadeau <ja...@gmail.com>
wrote:

> Hey Peter,
>
> Despite best intentions I made only nominal progress.
>
>
>
> On Sun, Jan 10, 2021 at 10:33 PM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
>> Hi Team,
>>
>> @Jacques Nadeau <ja...@dremio.com>: you mentioned that you might
>> consolidate the thoughts in a document for the path forward. Did you have
>> time for that, or the holidays overwritten all of the plans as usual :)
>>
>> Orher: Ryan convinced me that it would be good to move forward with the
>> synchronised Hive-Iceberg property list whenever it is possible, and use
>> the Iceberg Table properties as a master when not. This would be the
>> solution which is alligns most with the other integration solutios.
>>
>> Thanks, Peter
>>
>> Peter Vary <pv...@cloudera.com> ezt írta (időpont: 2020. dec. 10., Csü
>> 8:27):
>>
>>> I like the strong coupling between Hive and Iceberg if we can make it
>>> work. It could be beneficial for the end users, but I still have some
>>> concerns.
>>> We should consider the following aspects:
>>> - Where has the change initiated (Hive or Spark)
>>> - Which Catalog is used (HiveCatalog or other)
>>> - Which Hive version is used (Hive 2/3)
>>>
>>> Some current constraints I think we have:
>>> - There could be multiple Hive tables above a single Iceberg table with
>>> most of the Catalogs (HiveCatalog being the single exception)
>>> - I see no ways to propagate Spark changes for HMS if the Catalog is not
>>> HiveCatalog
>>> - Only Hive3 has ways to propagate changes to the Iceberg table after
>>> creation
>>> - Hive inserts modify the table data (one Iceberg commit) and then the
>>> table metadata (another Iceberg commit). This could be suboptimal but
>>> solvable.
>>>
>>> My feeling is that the tight coupling could work as expected with only
>>> the HiveCatalog using Hive3. In every other case the Iceberg and the HMS
>>> properties will deviate. That is why I think it would be easier to
>>> understand for the user that Iceberg and Hive is a different system with
>>> different properties.
>>>
>>> All that said we will use Hive3 and HiveCatalog so I think we are fine
>>> with 1-on-1 mapping too.
>>> If we move this way we should remove the current property filtering from
>>> the HiveCatalog and from the HiveIcebergMetaHook, so we are consistent.
>>>
>>> Thanks, Peter
>>>
>>> Jacques Nadeau <ja...@dremio.com> ezt írta (időpont: 2020. dec. 9.,
>>> Sze 23:01):
>>>
>>>> Who cares if there are a few extra properties from Hive? Users may
>>>>> expect those properties to be there anyway.
>>>>
>>>>
>>>> Yeah, what is the key argument against letting them leak? What problem
>>>> are people trying to solve?
>>>>
>>>>
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
>>>>
>>>>
>>>> On Wed, Dec 9, 2020 at 12:47 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> I agree that #2 doesn’t really work. I also think that #4 can’t work
>>>>> either. There is no way to add a prefix for HMS properties that already
>>>>> exist, so the only option is to have a list of properties to suppress,
>>>>> which is option #1.
>>>>>
>>>>> I think that option #3 is a bad idea because it would lead to
>>>>> surprising behavior for users. If a user creates a table using Hive DDL and
>>>>> sets table properties, those properties should be present in the source of
>>>>> truth Iceberg table. If a prefix was required to forward them to Iceberg,
>>>>> that would create a situation where properties appear to be missing because
>>>>> the user tried to use syntax that works for nearly every other table.
>>>>>
>>>>> That leaves either option #1 or doing nothing. I actually think that
>>>>> there’s a strong argument to do nothing here and allow Hive and Iceberg
>>>>> properties to be mixed in the Iceberg table. Who cares if there are a few
>>>>> extra properties from Hive? Users may expect those properties to be there
>>>>> anyway.
>>>>>
>>>>> rb
>>>>>
>>>>> On Mon, Dec 7, 2020 at 9:58 AM Jacques Nadeau jacques@dremio.com
>>>>> <ht...@dremio.com> wrote:
>>>>>
>>>>> Hey Peter, thanks for updating the doc and your heads up in the other
>>>>>> thread on your capacity to look at this before EOY.
>>>>>>
>>>>>> I'm going to try to create a specification document based on the
>>>>>> discussion document you put together. I think there is general consensus
>>>>>> around what you call "Spark-like catalog configuration" so I'd like to
>>>>>> formalize that more.
>>>>>>
>>>>>> It seems like there is less consensus around the whitelist/blacklist
>>>>>> side of things. You outline four approaches:
>>>>>>
>>>>>>    1. Hard coded HMS only property list
>>>>>>    2. Hard coded Iceberg only property list
>>>>>>    3. Prefix for Iceberg properties
>>>>>>    4. Prefix for HMS only properties
>>>>>>
>>>>>> I generally think #2 is a no-go as it creates too much coupling
>>>>>> between catalog implementations and core iceberg. It seems like Ryan Blue
>>>>>> would prefer #4 (correct?). Any other strong opinions?
>>>>>> --
>>>>>> Jacques Nadeau
>>>>>> CTO and Co-Founder, Dremio
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 3, 2020 at 9:27 AM Peter Vary <pv...@cloudera.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> As Jacques suggested (with the help of Zoltan) I have collected the
>>>>>>> current state and the proposed solutions in a document:
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
>>>>>>>
>>>>>>> My feeling is that we do not have a final decision, so tried to list
>>>>>>> all the possible solutions.
>>>>>>> Please comment!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Peter
>>>>>>>
>>>>>>> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
>>>>>>>
>>>>>>> When I was working on the CREATE TABLE patch I found the following
>>>>>>> TBLPROPERTIES on newly created tables:
>>>>>>>
>>>>>>>    - external.table.purge
>>>>>>>    - EXTERNAL
>>>>>>>    - bucketing_version
>>>>>>>    - numRows
>>>>>>>    - rawDataSize
>>>>>>>    - totalSize
>>>>>>>    - numFiles
>>>>>>>    - numFileErasureCoded
>>>>>>>
>>>>>>>
>>>>>>> I am afraid that we can not change the name of most of these
>>>>>>> properties, and might not be useful to have most of them along with Iceberg
>>>>>>> statistics already there. Also my feeling is that this is only the top of
>>>>>>> the Iceberg (pun intended :)) so this is why I think we should be more
>>>>>>> targeted way to push properties to the Iceberg tables.
>>>>>>>
>>>>>>> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>
>>>>>>> Sorry, I accidentally didn’t copy the dev list on this reply.
>>>>>>> Resending:
>>>>>>>
>>>>>>> Also I expect that we want to add Hive write specific configs to
>>>>>>> table level when the general engine independent configuration is not ideal
>>>>>>> for Hive, but every Hive query for a given table should use some specific
>>>>>>> config.
>>>>>>>
>>>>>>> Hive may need configuration, but I think these should still be kept
>>>>>>> in the Iceberg table. There is no reason to make Hive config inaccessible
>>>>>>> from other engines. If someone wants to view all of the config for a table
>>>>>>> from Spark, the Hive config should also be included right?
>>>>>>>
>>>>>>> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I will ask Laszlo if he wants to update his doc.
>>>>>>>>
>>>>>>>> I see both pros and cons of catalog definition in config files. If
>>>>>>>> there is an easy default then I do not mind any of the proposed solutions.
>>>>>>>>
>>>>>>>> OTOH I am in favor of the "use prefix for Iceberg table properties"
>>>>>>>> solution, because in Hive it is common to add new keys to the property list
>>>>>>>> - no restriction is in place (I am not even sure that the currently
>>>>>>>> implemented blacklist for preventing to propagate properties to Iceberg
>>>>>>>> tables is complete). Also I expect that we want to add Hive write specific
>>>>>>>> configs to table level when the general engine independent configuration is
>>>>>>>> not ideal for Hive, but every Hive query for a given table should use some
>>>>>>>> specific config.
>>>>>>>>
>>>>>>>> Thanks, Peter
>>>>>>>>
>>>>>>>> Jacques Nadeau <ja...@dremio.com> ezt írta (időpont: 2020. dec.
>>>>>>>> 1., Ke 17:06):
>>>>>>>>
>>>>>>>>> Would someone be willing to create a document that states the
>>>>>>>>> current proposal?
>>>>>>>>>
>>>>>>>>> It is becoming somewhat difficult to follow this thread. I also
>>>>>>>>> worry that without a complete statement of the current shape that people
>>>>>>>>> may be incorrectly thinking they are in alignment.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jacques Nadeau
>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <
>>>>>>>>> boroknagyz@cloudera.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks, Ryan. I answered inline.
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> This sounds like a good plan overall, but I have a couple of
>>>>>>>>>>> notes:
>>>>>>>>>>>
>>>>>>>>>>>    1. We need to keep in mind that users plug in their own
>>>>>>>>>>>    catalogs, so iceberg.catalog could be a Glue or Nessie
>>>>>>>>>>>    catalog, not just Hive or Hadoop. I don’t think it makes much sense to use
>>>>>>>>>>>    separate hadoop.catalog and hive.catalog values. Those should just be names
>>>>>>>>>>>    for catalogs configured in Configuration, i.e., via
>>>>>>>>>>>    hive-site.xml. We then only need a special value for loading
>>>>>>>>>>>    Hadoop tables from paths.
>>>>>>>>>>>
>>>>>>>>>>> About extensibility, I think the usual Hive way is to use Java
>>>>>>>>>> class names. So this way the value for 'iceberg.catalog' could be e.g.
>>>>>>>>>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
>>>>>>>>>> would need to have a factory method that constructs the catalog object from
>>>>>>>>>> a properties object (Map<String, String>). E.g.
>>>>>>>>>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
>>>>>>>>>> 'iceberg.catalog_location' to be present in properties.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    1. I don’t think that catalog configuration should be kept
>>>>>>>>>>>    in table properties. A catalog should not be loaded for each table. So I
>>>>>>>>>>>    don’t think we need iceberg.catalog_location. Instead, we
>>>>>>>>>>>    should have a way to define catalogs in the Configuration
>>>>>>>>>>>    for tables in the metastore to reference.
>>>>>>>>>>>
>>>>>>>>>>>  I think it makes sense, on the other hand it would make adding
>>>>>>>>>> new catalogs more heavy-weight, i.e. now you'd need to edit configuration
>>>>>>>>>> files and restart/reinit services. Maybe it can be cumbersome in some
>>>>>>>>>> environments.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    1. I’d rather use a prefix to exclude properties from being
>>>>>>>>>>>    passed to Iceberg than to include them. Otherwise, users don’t know what to
>>>>>>>>>>>    do to pass table properties from Hive or Impala. If we exclude a prefix or
>>>>>>>>>>>    specific properties, then everything but the properties reserved for
>>>>>>>>>>>    locating the table are passed as the user would expect.
>>>>>>>>>>>
>>>>>>>>>>> I don't have a strong opinion about this, but yeah, maybe this
>>>>>>>>>> behavior would cause the least surprises.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <
>>>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Peter. I answered inline.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <
>>>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Zoltan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Answers below:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>>>>>>>>>>> boroknagyz@cloudera.com.INVALID> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the replies. My take for the above questions are as
>>>>>>>>>>>>> follows
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - Should 'iceberg.catalog' be a required property?
>>>>>>>>>>>>>    - Yeah, I think it would be nice if this would be required
>>>>>>>>>>>>>       to avoid any implicit behavior
>>>>>>>>>>>>>
>>>>>>>>>>>>> Currently we have a Catalogs class to get/initialize/use the
>>>>>>>>>>>>> different Catalogs. At that time the decision was to use HadoopTables as a
>>>>>>>>>>>>> default catalog.
>>>>>>>>>>>>> It might be worthwhile to use the same class in Impala as
>>>>>>>>>>>>> well, so the behavior is consistent.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Yeah, I think it'd be beneficial for us to use the Iceberg
>>>>>>>>>>>> classes whenever possible. The Catalogs class is very similar to what we
>>>>>>>>>>>> have currently in Impala.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - 'hadoop.catalog' LOCATION and catalog_location
>>>>>>>>>>>>>       - In Impala we don't allow setting LOCATION for tables
>>>>>>>>>>>>>       stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the
>>>>>>>>>>>>>       Iceberg table's actual location. We were also thinking about using only the
>>>>>>>>>>>>>       table LOCATION, and set it to the catalog location, but we also found it
>>>>>>>>>>>>>       confusing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It could definitely work, but it is somewhat strange that we
>>>>>>>>>>>>> have an external table location set to an arbitrary path, and we have a
>>>>>>>>>>>>> different location generated by other configs. It would be nice to have the
>>>>>>>>>>>>> real location set in the external table location as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Impala sets the real Iceberg table location for external
>>>>>>>>>>>> tables. E.g. if the user issues
>>>>>>>>>>>>
>>>>>>>>>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>>>>>>>>>> STORED AS ICEBERG
>>>>>>>>>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>>>>>>>>>
>>>>>>>>>>>> 'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>>>>>>>>>
>>>>>>>>>>>> 'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>>>>>>>>>
>>>>>>>>>>>> If the end user had specified LOCATION, then Impala would have
>>>>>>>>>>>> raised an error. But the above DDL statement is correct, so Impala loads
>>>>>>>>>>>> the iceberg table via Iceberg API, then creates the HMS table and sets
>>>>>>>>>>>> LOCATION to the Iceberg table location (something like
>>>>>>>>>>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I like the flexibility of setting the table_identifier on
>>>>>>>>>>>>> table level, which could help removing naming conflicts. We might want to
>>>>>>>>>>>>> have this in the Iceberg Catalog implementation.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - 'iceberg.table_identifier' for HiveCatalog
>>>>>>>>>>>>>       - Yeah, it doesn't add much if we only allow using the
>>>>>>>>>>>>>       current HMS. I think it can be only useful if we are allowing external
>>>>>>>>>>>>>       HMSes.
>>>>>>>>>>>>>    - Moving properties to SERDEPROPERTIES
>>>>>>>>>>>>>       - I see that these properties are used by the SerDe
>>>>>>>>>>>>>       classes in Hive, but I feel that these properties are just not about
>>>>>>>>>>>>>       serialization and deserialization. And as I see the current SERDEPROPERTIES
>>>>>>>>>>>>>       are things like 'field.delim', 'separatorChar', 'quoteChar', etc. So
>>>>>>>>>>>>>       properties about table loading more naturally belong to TBLPROPERTIES in my
>>>>>>>>>>>>>       opinion.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have seen it used both ways for HBaseSerDe. (even the wiki
>>>>>>>>>>>>> page uses both :) ). Since Impala prefers TBLPROPERTIES and if we start
>>>>>>>>>>>>> using prefix for separating real Iceberg table properties from other
>>>>>>>>>>>>> properties, then we can keep it at TBLPROPERTIES.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> In the google doc I also had a comment about prefixing iceberg
>>>>>>>>>>>> table properties. We could use a prefix like 'iceberg.tblproperties.', and
>>>>>>>>>>>> pass every property with this prefix to the Iceberg table. Currently Impala
>>>>>>>>>>>> passes every table property to the Iceberg table.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>     Zoltan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <
>>>>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Based on the discussion below I understand we have the
>>>>>>>>>>>>>> following kinds of properties:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    1. Iceberg table properties - Engine independent, storage
>>>>>>>>>>>>>>    related parameters
>>>>>>>>>>>>>>    2. "how to get to" - I think these are mostly Hive table
>>>>>>>>>>>>>>    specific properties, since for Spark, the Spark catalog configuration
>>>>>>>>>>>>>>    serves for the same purpose. I think the best place for storing these would
>>>>>>>>>>>>>>    be the Hive SERDEPROPERTIES, as this describes the access information for
>>>>>>>>>>>>>>    the SerDe. Sidenote: I think we should decide if we allow
>>>>>>>>>>>>>>    HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier'
>>>>>>>>>>>>>>    would make sense only if we allow having multiple catalogs.
>>>>>>>>>>>>>>    3. Query specific properties - These are engine specific
>>>>>>>>>>>>>>    and might be mapped to / even override the Iceberg table properties on the
>>>>>>>>>>>>>>    engine specific code paths, but currently these properties have independent
>>>>>>>>>>>>>>    names and mapped on a case-by-case basis.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Based on this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - Shall we move the "how to get to" properties to
>>>>>>>>>>>>>>    SERDEPROPERTIES?
>>>>>>>>>>>>>>    - Shall we define a prefix for setting Iceberg table
>>>>>>>>>>>>>>    properties from Hive queries and omitting other engine specific properties?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I like these suggestions, comments inline below on the last
>>>>>>>>>>>>>> round...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <
>>>>>>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>>>>>>>>>>> information about table loading in HMS table properties. We are just a bit
>>>>>>>>>>>>>>> more explicit about which catalog to use.
>>>>>>>>>>>>>>> We have table property 'iceberg.catalog' to determine the
>>>>>>>>>>>>>>> catalog type, right now the supported values are 'hadoop.tables',
>>>>>>>>>>>>>>> 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be
>>>>>>>>>>>>>>> set based on the catalog type.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm all for renaming this, having "mr" in the property name
>>>>>>>>>>>>>> is confusing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - hadoop.tables
>>>>>>>>>>>>>>>       - the table location is used to load the table
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The only question I have is should we have this as the
>>>>>>>>>>>>>> default? i.e. if you don't set a catalog it will assume its HadoopTables
>>>>>>>>>>>>>> and use the location? Or should we require this property to be here to be
>>>>>>>>>>>>>> consistent and avoid any "magic"?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - hadoop.catalog
>>>>>>>>>>>>>>>       - Required table property 'iceberg.catalog_location'
>>>>>>>>>>>>>>>       specifies the location of the hadoop catalog in the file system
>>>>>>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>>>>       is used as table identifier
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I like this as it would allow you to use a different
>>>>>>>>>>>>>> database and table name in Hive as opposed to the Hadoop Catalog - at the
>>>>>>>>>>>>>> moment they have to match. The only thing here is that I think Hive
>>>>>>>>>>>>>> requires a table LOCATION to be set and it's then confusing as there are
>>>>>>>>>>>>>> now two locations on the table. I'm not sure whether in the Hive storage
>>>>>>>>>>>>>> handler or SerDe etc. we can get Hive to not require that and maybe even
>>>>>>>>>>>>>> disallow it from being set. That would probably be best in conjunction with
>>>>>>>>>>>>>> this. Another solution would be to not have the 'iceberg.catalog_location'
>>>>>>>>>>>>>> property but instead use the table LOCATION for this but that's a bit
>>>>>>>>>>>>>> confusing from a Hive point of view.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - hive.catalog
>>>>>>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>>>>       is used as table identifier
>>>>>>>>>>>>>>>       - We have the assumption that the current Hive
>>>>>>>>>>>>>>>       metastore stores the table, i.e. we don't support external Hive
>>>>>>>>>>>>>>>       metastores currently
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These sound fine for Hive catalog tables that are created
>>>>>>>>>>>>>> outside of the automatic Hive table creation (see
>>>>>>>>>>>>>> https://iceberg.apache.org/hive/ -> Using Hive Catalog) we'd
>>>>>>>>>>>>>> just need to document how you can create these yourself and that one could
>>>>>>>>>>>>>> use a different Hive database and table etc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Independent of catalog implementations, but we also have
>>>>>>>>>>>>>>> table property 'iceberg.file_format' to specify the file format for the
>>>>>>>>>>>>>>> data files.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> OK, I don't think we need that for Hive?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We haven't released it yet, so we are open to changes, but I
>>>>>>>>>>>>>>> think these properties are reasonable and it would be great if we could
>>>>>>>>>>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>>>>>>>>>>> metastore of tables.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If others agree I think we should create an issue where we
>>>>>>>>>>>>>> document the above changes so it's very clear what we're doing and can then
>>>>>>>>>>>>>> go and implement them and update the docs etc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>     Zoltan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <
>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> #4 is correct because we provide some information that is
>>>>>>>>>>>>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>>>>>>>>>>>>> current user). I also agree that it would be good to have a table
>>>>>>>>>>>>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>>>>>>>>>>>>> gives us a way to handle name conflicts.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <
>>>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> db1.table1_etl_branch =>
>>>>>>>>>>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <
>>>>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I agree with Ryan on the core principles here. As I
>>>>>>>>>>>>>>>>>> understand them:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    1. Iceberg metadata describes all properties of a
>>>>>>>>>>>>>>>>>>    table
>>>>>>>>>>>>>>>>>>    2. Hive table properties describe "how to get to"
>>>>>>>>>>>>>>>>>>    Iceberg metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>>>>>>>>>>>    3. There could be default "how to get to" information
>>>>>>>>>>>>>>>>>>    set at a global level
>>>>>>>>>>>>>>>>>>    4. Best-effort schema should be stored in the table
>>>>>>>>>>>>>>>>>>    properties in HMS. This should be done for information schema retrieval
>>>>>>>>>>>>>>>>>>    purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is that a fair summary of your statements Ryan (except 4,
>>>>>>>>>>>>>>>>>> which I just added)?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One comment I have on #2 is that for different catalogs
>>>>>>>>>>>>>>>>>> and use cases, I think it can be somewhat more complex where it would be
>>>>>>>>>>>>>>>>>> desirable for a table that initially existed without Hive that was later
>>>>>>>>>>>>>>>>>> exposed in Hive to support a ptr/path/token for how the table is named
>>>>>>>>>>>>>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>>>>>>>>>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>>>>>>>>>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Similarly, you might want to expose a particular branch
>>>>>>>>>>>>>>>>>> version of a table. So it might say:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Just saying that the address to the table in the catalog
>>>>>>>>>>>>>>>>>> could itself have several properties. The key being that no matter what
>>>>>>>>>>>>>>>>>> those are, we should follow #1 and only store properties that are about the
>>>>>>>>>>>>>>>>>> ptr, not the content/metadata.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it.
>>>>>>>>>>>>>>>>>> Can someone confirm that it is true? And that it is possible/not
>>>>>>>>>>>>>>>>>> problematic?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking
>>>>>>>>>>>>>>>>>>> about these problems as well, so this is a good time to have a discussion
>>>>>>>>>>>>>>>>>>> about Hive config.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think that Hive configuration should work mostly like
>>>>>>>>>>>>>>>>>>> other engines, where different configurations are used for different
>>>>>>>>>>>>>>>>>>> purposes. Different purposes means that there is not a global configuration
>>>>>>>>>>>>>>>>>>> priority. Hopefully, I can explain how we use the different config sources
>>>>>>>>>>>>>>>>>>> elsewhere to clarify.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it
>>>>>>>>>>>>>>>>>>> has a Hadoop Configuration, but it also has its own global configuration.
>>>>>>>>>>>>>>>>>>> There are also Iceberg table properties, and all of the various Hive
>>>>>>>>>>>>>>>>>>> properties if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The first step is to simplify where we can, so we
>>>>>>>>>>>>>>>>>>> effectively eliminate 2 sources of config:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>    - The Hadoop Configuration is only used to
>>>>>>>>>>>>>>>>>>>    instantiate Hadoop classes, like FileSystem. Iceberg should not use it for
>>>>>>>>>>>>>>>>>>>    any other config.
>>>>>>>>>>>>>>>>>>>    - Config in the Hive MetaStore is only used to
>>>>>>>>>>>>>>>>>>>    identify that a table is Iceberg and point to its metadata location. All
>>>>>>>>>>>>>>>>>>>    other config in HMS is informational. For example, the input format is
>>>>>>>>>>>>>>>>>>>    FileInputFormat so that non-Iceberg readers cannot actually instantiate the
>>>>>>>>>>>>>>>>>>>    format (it’s abstract) but it is available so they also don’t fail trying
>>>>>>>>>>>>>>>>>>>    to load the class. Table-specific config should not be stored in table or
>>>>>>>>>>>>>>>>>>>    serde properties.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> That leaves Spark configuration and Iceberg table
>>>>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Iceberg differs from other tables because it is
>>>>>>>>>>>>>>>>>>> opinionated: data configuration should be maintained at the table level.
>>>>>>>>>>>>>>>>>>> This is cleaner for users because config is standardized across engines and
>>>>>>>>>>>>>>>>>>> in one place. And it also enables services that analyze a table and update
>>>>>>>>>>>>>>>>>>> its configuration to tune options that users almost never do, like row
>>>>>>>>>>>>>>>>>>> group or stripe size in the columnar formats. Iceberg table configuration
>>>>>>>>>>>>>>>>>>> is used to configure table-specific concerns and behavior.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Spark configuration is used for engine-specific
>>>>>>>>>>>>>>>>>>> concerns, and runtime overrides. A good example of an engine-specific
>>>>>>>>>>>>>>>>>>> concern is the catalogs that are available to load Iceberg tables. Spark
>>>>>>>>>>>>>>>>>>> has a way to load and configure catalog implementations and Iceberg uses
>>>>>>>>>>>>>>>>>>> that for all catalog-level config. Runtime overrides are things like target
>>>>>>>>>>>>>>>>>>> split size. Iceberg has a table-level default split size in table
>>>>>>>>>>>>>>>>>>> properties, but this can be overridden by a Spark option for each table, as
>>>>>>>>>>>>>>>>>>> well as an option passed to the individual read. Note that these
>>>>>>>>>>>>>>>>>>> necessarily have different config names for how they are used: Iceberg uses
>>>>>>>>>>>>>>>>>>> read.split.target-size and the read-specific option is
>>>>>>>>>>>>>>>>>>> target-size.
>>>>>>>>>>>>>>>>>>>
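>>>>>>>>>>>>>>>>>>> As a hedged illustration of such a per-read override in Spark ('spark' is
>>>>>>>>>>>>>>>>>>> an existing SparkSession; note that the released Spark integration spells
>>>>>>>>>>>>>>>>>>> the read option "split-size", so verify the key for your version):
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> import org.apache.spark.sql.Dataset;
>>>>>>>>>>>>>>>>>>> import org.apache.spark.sql.Row;
>>>>>>>>>>>>>>>>>>> import org.apache.spark.sql.SparkSession;
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> public class SplitSizeOverride {
>>>>>>>>>>>>>>>>>>>   public static void main(String[] args) {
>>>>>>>>>>>>>>>>>>>     SparkSession spark = SparkSession.builder().getOrCreate();
>>>>>>>>>>>>>>>>>>>     // overrides the table's read.split.target-size default, for this read only
>>>>>>>>>>>>>>>>>>>     Dataset<Row> df = spark.read()
>>>>>>>>>>>>>>>>>>>         .format("iceberg")
>>>>>>>>>>>>>>>>>>>         .option("split-size", String.valueOf(128L * 1024 * 1024))
>>>>>>>>>>>>>>>>>>>         .load("db.table");
>>>>>>>>>>>>>>>>>>>     df.show();
>>>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>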
>>>>>>>>>>>>>>>>>>> Applying this to Hive is a little strange for a couple
>>>>>>>>>>>>>>>>>>> reasons. First, Hive’s engine configuration *is* a
>>>>>>>>>>>>>>>>>>> Hadoop Configuration. As a result, I think the right place to store
>>>>>>>>>>>>>>>>>>> engine-specific config is there, including Iceberg catalogs using a
>>>>>>>>>>>>>>>>>>> strategy similar to what Spark does: what external Iceberg catalogs are
>>>>>>>>>>>>>>>>>>> available and their configuration should come from the HiveConf.
>>>>>>>>>>>>>>>>>>>
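>>>>>>>>>>>>>>>>>>> For illustration only, defining a named external catalog in the HiveConf
>>>>>>>>>>>>>>>>>>> could look like the sketch below; the key names are hypothetical (no such
>>>>>>>>>>>>>>>>>>> convention exists yet) and simply mirror Spark's
>>>>>>>>>>>>>>>>>>> spark.sql.catalog.<name>.* pattern:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> public class HiveConfCatalogs {
>>>>>>>>>>>>>>>>>>>   public static void main(String[] args) {
>>>>>>>>>>>>>>>>>>>     Configuration conf = new Configuration();
>>>>>>>>>>>>>>>>>>>     // hypothetical keys: a catalog named "prod" backed by a Hadoop warehouse
>>>>>>>>>>>>>>>>>>>     conf.set("iceberg.catalog.prod.type", "hadoop");
>>>>>>>>>>>>>>>>>>>     conf.set("iceberg.catalog.prod.warehouse", "hdfs://namenode:8020/warehouse");
>>>>>>>>>>>>>>>>>>>     System.out.println(conf.get("iceberg.catalog.prod.warehouse"));
>>>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>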
>>>>>>>>>>>>>>>>>>> The second way Hive is strange is that Hive needs to use
>>>>>>>>>>>>>>>>>>> its own MetaStore to track Hive table concerns. The MetaStore may have
>>>>>>>>>>>>>>>>>>> tables created by an Iceberg HiveCatalog, and Hive also needs to be able to
>>>>>>>>>>>>>>>>>>> load tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>    - There should be a default HiveCatalog that uses
>>>>>>>>>>>>>>>>>>>    the current MetaStore URI to be used for HiveCatalog tables tracked in the
>>>>>>>>>>>>>>>>>>>    MetaStore
>>>>>>>>>>>>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>>>>>>>>>>>>    - HMS table properties should be used to determine
>>>>>>>>>>>>>>>>>>>    how to load a table: using a Hadoop location, using the default metastore
>>>>>>>>>>>>>>>>>>>    catalog, or using an external Iceberg catalog
>>>>>>>>>>>>>>>>>>>       - If there is a metadata_location, then use the
>>>>>>>>>>>>>>>>>>>       HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>>>>>>>>>>       - If there is a catalog property, then load that
>>>>>>>>>>>>>>>>>>>       catalog and use it to load the table identifier, or maybe an identifier
>>>>>>>>>>>>>>>>>>>       from HMS table properties
>>>>>>>>>>>>>>>>>>>       - If there is no catalog or metadata_location,
>>>>>>>>>>>>>>>>>>>       then use HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>>>>>>>>>>
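>>>>>>>>>>>>>>>>>>> A compilable sketch of that precedence follows; the two catalog helpers are
>>>>>>>>>>>>>>>>>>> hypothetical stand-ins, not Iceberg APIs, and the property names are the
>>>>>>>>>>>>>>>>>>> ones discussed above:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> import java.util.Map;
>>>>>>>>>>>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>>>>>>>>>> import org.apache.iceberg.Table;
>>>>>>>>>>>>>>>>>>> import org.apache.iceberg.catalog.Catalog;
>>>>>>>>>>>>>>>>>>> import org.apache.iceberg.catalog.TableIdentifier;
>>>>>>>>>>>>>>>>>>> import org.apache.iceberg.hadoop.HadoopTables;
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> public abstract class HiveTableResolver {
>>>>>>>>>>>>>>>>>>>   // hypothetical: the default HiveCatalog for the current metastore URI
>>>>>>>>>>>>>>>>>>>   abstract Catalog defaultHiveCatalog(Configuration conf);
>>>>>>>>>>>>>>>>>>>   // hypothetical: a catalog defined in HiveConf under the given name
>>>>>>>>>>>>>>>>>>>   abstract Catalog catalogFromConf(Configuration conf, String name);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>   Table load(Configuration conf, Map<String, String> hmsProps,
>>>>>>>>>>>>>>>>>>>              String identifier, String location) {
>>>>>>>>>>>>>>>>>>>     if (hmsProps.containsKey("metadata_location")) {
>>>>>>>>>>>>>>>>>>>       // tracked in this metastore: load through the default HiveCatalog
>>>>>>>>>>>>>>>>>>>       return defaultHiveCatalog(conf)
>>>>>>>>>>>>>>>>>>>           .loadTable(TableIdentifier.parse(identifier));
>>>>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>>>>     if (hmsProps.containsKey("iceberg.catalog")) {
>>>>>>>>>>>>>>>>>>>       // an external Iceberg catalog named in the HMS table properties
>>>>>>>>>>>>>>>>>>>       return catalogFromConf(conf, hmsProps.get("iceberg.catalog"))
>>>>>>>>>>>>>>>>>>>           .loadTable(TableIdentifier.parse(identifier));
>>>>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>>>>     // neither: treat the table location as a HadoopTables path
>>>>>>>>>>>>>>>>>>>     return new HadoopTables(conf).load(location);
>>>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>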
>>>>>>>>>>>>>>>>>>> This would make it possible to access all types of
>>>>>>>>>>>>>>>>>>> Iceberg tables in the same query, and would match how Spark and Flink
>>>>>>>>>>>>>>>>>>> configure catalogs. Other than the configuration above, I don’t think that
>>>>>>>>>>>>>>>>>>> config in HMS should be used at all, like how the other engines work.
>>>>>>>>>>>>>>>>>>> Iceberg is the source of truth for table metadata, HMS stores how to load
>>>>>>>>>>>>>>>>>>> the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This isn’t quite how configuration works right now.
>>>>>>>>>>>>>>>>>>> Currently, the catalog is controlled by a HiveConf property,
>>>>>>>>>>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables
>>>>>>>>>>>>>>>>>>> will be used to load table locations. If it is set, then that catalog will
>>>>>>>>>>>>>>>>>>> be used to load all tables by name. This makes it impossible to load tables
>>>>>>>>>>>>>>>>>>> from different catalogs at the same time. That’s why I think the Iceberg
>>>>>>>>>>>>>>>>>>> catalog for a table should be stored in HMS table properties.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I should also explain the iceberg.hive.engine.enabled flag,
>>>>>>>>>>>>>>>>>>> but I think this is long enough for now.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>>>>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I would like to start a discussion about how we should
>>>>>>>>>>>>>>>>>>>> handle properties from various sources like Iceberg, Hive or global
>>>>>>>>>>>>>>>>>>>> configuration. I've put together a short document
>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> Netflix
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>

Re: Iceberg/Hive properties handling

Posted by Jacques Nadeau <ja...@gmail.com>.
Hey Peter,

Despite best intentions I made only nominal progress.



On Sun, Jan 10, 2021 at 10:33 PM Peter Vary <pv...@cloudera.com.invalid>
wrote:

> Hi Team,
>
> @Jacques Nadeau <ja...@dremio.com>: you mentioned that you might
> consolidate the thoughts in a document for the path forward. Did you have
> time for that, or did the holidays overwrite all of the plans as usual :)
>
> Other: Ryan convinced me that it would be good to move forward with the
> synchronised Hive-Iceberg property list whenever it is possible, and use
> the Iceberg Table properties as a master when not. This would be the
> solution which aligns most with the other integration solutions.
>
> Thanks, Peter
>
> Peter Vary <pv...@cloudera.com> ezt írta (időpont: 2020. dec. 10., Csü
> 8:27):
>
>> I like the strong coupling between Hive and Iceberg if we can make it
>> work. It could be beneficial for the end users, but I still have some
>> concerns.
>> We should consider the following aspects:
>> - Where was the change initiated (Hive or Spark)
>> - Which Catalog is used (HiveCatalog or other)
>> - Which Hive version is used (Hive 2/3)
>>
>> Some current constraints I think we have:
>> - There could be multiple Hive tables above a single Iceberg table with
>> most of the Catalogs (HiveCatalog being the single exception)
>> - I see no way to propagate Spark changes to HMS if the Catalog is not
>> HiveCatalog
>> - Only Hive3 has ways to propagate changes to the Iceberg table after
>> creation
>> - Hive inserts modify the table data (one Iceberg commit) and then the
>> table metadata (another Iceberg commit). This could be suboptimal but
>> solvable.
>>
>> My feeling is that the tight coupling could work as expected with only
>> the HiveCatalog using Hive3. In every other case the Iceberg and the HMS
>> properties will deviate. That is why I think it would be easier to
>> understand for the user that Iceberg and Hive are different systems with
>> different properties.
>>
>> All that said we will use Hive3 and HiveCatalog so I think we are fine
>> with 1-on-1 mapping too.
>> If we move this way we should remove the current property filtering from
>> the HiveCatalog and from the HiveIcebergMetaHook, so we are consistent.
>>
>> Thanks, Peter
>>
>> Jacques Nadeau <ja...@dremio.com> ezt írta (időpont: 2020. dec. 9.,
>> Sze 23:01):
>>
>>> Who cares if there are a few extra properties from Hive? Users may
>>>> expect those properties to be there anyway.
>>>
>>>
>>> Yeah, what is the key argument against letting them leak? What problem
>>> are people trying to solve?
>>>
>>>
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>>
>>> On Wed, Dec 9, 2020 at 12:47 PM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> I agree that #2 doesn’t really work. I also think that #4 can’t work
>>>> either. There is no way to add a prefix for HMS properties that already
>>>> exist, so the only option is to have a list of properties to suppress,
>>>> which is option #1.
>>>>
>>>> I think that option #3 is a bad idea because it would lead to
>>>> surprising behavior for users. If a user creates a table using Hive DDL and
>>>> sets table properties, those properties should be present in the source of
>>>> truth Iceberg table. If a prefix was required to forward them to Iceberg,
>>>> that would create a situation where properties appear to be missing because
>>>> the user tried to use syntax that works for nearly every other table.
>>>>
>>>> That leaves either option #1 or doing nothing. I actually think that
>>>> there’s a strong argument to do nothing here and allow Hive and Iceberg
>>>> properties to be mixed in the Iceberg table. Who cares if there are a few
>>>> extra properties from Hive? Users may expect those properties to be there
>>>> anyway.
>>>>
>>>> rb
>>>>
>>>> On Mon, Dec 7, 2020 at 9:58 AM Jacques Nadeau <jacques@dremio.com> wrote:
>>>>
>>>> Hey Peter, thanks for updating the doc and your heads up in the other
>>>>> thread on your capacity to look at this before EOY.
>>>>>
>>>>> I'm going to try to create a specification document based on the
>>>>> discussion document you put together. I think there is general consensus
>>>>> around what you call "Spark-like catalog configuration" so I'd like to
>>>>> formalize that more.
>>>>>
>>>>> It seems like there is less consensus around the whitelist/blacklist
>>>>> side of things. You outline four approaches:
>>>>>
>>>>>    1. Hard coded HMS only property list
>>>>>    2. Hard coded Iceberg only property list
>>>>>    3. Prefix for Iceberg properties
>>>>>    4. Prefix for HMS only properties
>>>>>
>>>>> I generally think #2 is a no-go as it creates too much coupling
>>>>> between catalog implementations and core Iceberg. It seems like Ryan Blue
>>>>> would prefer #4 (correct?). Any other strong opinions?
>>>>> --
>>>>> Jacques Nadeau
>>>>> CTO and Co-Founder, Dremio
>>>>>
>>>>>
>>>>> On Thu, Dec 3, 2020 at 9:27 AM Peter Vary <pv...@cloudera.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> As Jacques suggested (with the help of Zoltan) I have collected the
>>>>>> current state and the proposed solutions in a document:
>>>>>>
>>>>>> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
>>>>>>
>>>>>> My feeling is that we do not have a final decision, so tried to list
>>>>>> all the possible solutions.
>>>>>> Please comment!
>>>>>>
>>>>>> Thanks,
>>>>>> Peter
>>>>>>
>>>>>> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
>>>>>>
>>>>>> When I was working on the CREATE TABLE patch I found the following
>>>>>> TBLPROPERTIES on newly created tables:
>>>>>>
>>>>>>    - external.table.purge
>>>>>>    - EXTERNAL
>>>>>>    - bucketing_version
>>>>>>    - numRows
>>>>>>    - rawDataSize
>>>>>>    - totalSize
>>>>>>    - numFiles
>>>>>>    - numFileErasureCoded
>>>>>>
>>>>>>
>>>>>> I am afraid that we cannot change the names of most of these
>>>>>> properties, and it might not be useful to have most of them alongside the
>>>>>> Iceberg statistics already there. Also my feeling is that this is only the
>>>>>> tip of the Iceberg (pun intended :)), which is why I think we need a more
>>>>>> targeted way to push properties to the Iceberg tables.
>>>>>>
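>>>>>> For reference, the suppress-list approach (option #1 in this thread) over
>>>>>> exactly these keys would be a sketch like the one below; the point above
>>>>>> stands that such a hard-coded list is difficult to keep complete:
>>>>>>
>>>>>> import java.util.Arrays;
>>>>>> import java.util.HashSet;
>>>>>> import java.util.Map;
>>>>>> import java.util.Set;
>>>>>>
>>>>>> public class HmsOnlyProperties {
>>>>>>   // the HMS-maintained keys listed above, as a hard-coded suppress list
>>>>>>   static final Set<String> HMS_ONLY = new HashSet<>(Arrays.asList(
>>>>>>       "external.table.purge", "EXTERNAL", "bucketing_version", "numRows",
>>>>>>       "rawDataSize", "totalSize", "numFiles", "numFileErasureCoded"));
>>>>>>
>>>>>>   // forward everything that is not HMS bookkeeping to the Iceberg table
>>>>>>   static void copyToIceberg(Map<String, String> hmsProps,
>>>>>>                             Map<String, String> icebergProps) {
>>>>>>     hmsProps.forEach((k, v) -> {
>>>>>>       if (!HMS_ONLY.contains(k)) {
>>>>>>         icebergProps.put(k, v);
>>>>>>       }
>>>>>>     });
>>>>>>   }
>>>>>> }
>>>>>>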
>>>>>> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
>>>>>>
>>>>>> Sorry, I accidentally didn’t copy the dev list on this reply.
>>>>>> Resending:
>>>>>>
>>>>>> Also I expect that we want to add Hive write-specific configs at the
>>>>>> table level when the general engine-independent configuration is not ideal
>>>>>> for Hive, but every Hive query for a given table should use some specific
>>>>>> config.
>>>>>>
>>>>>> Hive may need configuration, but I think these should still be kept
>>>>>> in the Iceberg table. There is no reason to make Hive config inaccessible
>>>>>> from other engines. If someone wants to view all of the config for a table
>>>>>> from Spark, the Hive config should also be included, right?
>>>>>>
>>>>>> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I will ask Laszlo if he wants to update his doc.
>>>>>>>
>>>>>>> I see both pros and cons of catalog definition in config files. If
>>>>>>> there is an easy default then I do not mind any of the proposed solutions.
>>>>>>>
>>>>>>> OTOH I am in favor of the "use prefix for Iceberg table properties"
>>>>>>> solution, because in Hive it is common to add new keys to the property list
>>>>>>> - no restriction is in place (I am not even sure that the currently
>>>>>>> implemented blacklist preventing properties from being propagated to
>>>>>>> Iceberg tables is complete). Also I expect that we want to add Hive
>>>>>>> write-specific configs at the table level when the general
>>>>>>> engine-independent configuration is not ideal for Hive, but every Hive
>>>>>>> query for a given table should use some specific config.
>>>>>>>
>>>>>>> Thanks, Peter
>>>>>>>
>>>>>>> Jacques Nadeau <ja...@dremio.com> ezt írta (időpont: 2020. dec.
>>>>>>> 1., Ke 17:06):
>>>>>>>
>>>>>>>> Would someone be willing to create a document that states the
>>>>>>>> current proposal?
>>>>>>>>
>>>>>>>> It is becoming somewhat difficult to follow this thread. I also
>>>>>>>> worry that without a complete statement of the current shape people
>>>>>>>> may be incorrectly thinking they are in alignment.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jacques Nadeau
>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <
>>>>>>>> boroknagyz@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks, Ryan. I answered inline.
>>>>>>>>>
>>>>>>>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> This sounds like a good plan overall, but I have a couple of
>>>>>>>>>> notes:
>>>>>>>>>>
>>>>>>>>>>    1. We need to keep in mind that users plug in their own
>>>>>>>>>>    catalogs, so iceberg.catalog could be a Glue or Nessie
>>>>>>>>>>    catalog, not just Hive or Hadoop. I don’t think it makes much sense to use
>>>>>>>>>>    separate hadoop.catalog and hive.catalog values. Those should just be names
>>>>>>>>>>    for catalogs configured in Configuration, i.e., via
>>>>>>>>>>    hive-site.xml. We then only need a special value for loading
>>>>>>>>>>    Hadoop tables from paths.
>>>>>>>>>>
>>>>>>>>>> About extensibility, I think the usual Hive way is to use Java
>>>>>>>>> class names. So this way the value for 'iceberg.catalog' could be e.g.
>>>>>>>>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
>>>>>>>>> would need to have a factory method that constructs the catalog object from
>>>>>>>>> a properties object (Map<String, String>). E.g.
>>>>>>>>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
>>>>>>>>> 'iceberg.catalog_location' to be present in properties.
>>>>>>>>>
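>>>>>>>>> A compilable sketch of that factory idea (assuming implementations offer a
>>>>>>>>> no-arg constructor; Catalog.initialize(name, properties) exists on the
>>>>>>>>> Iceberg Catalog interface in recent versions):
>>>>>>>>>
>>>>>>>>> import java.util.Map;
>>>>>>>>> import org.apache.iceberg.catalog.Catalog;
>>>>>>>>>
>>>>>>>>> public class ReflectiveCatalogFactory {
>>>>>>>>>   // implClassName would come from the 'iceberg.catalog' table property
>>>>>>>>>   static Catalog create(String implClassName, Map<String, String> props)
>>>>>>>>>       throws Exception {
>>>>>>>>>     Catalog catalog = (Catalog) Class.forName(implClassName)
>>>>>>>>>         .getDeclaredConstructor()
>>>>>>>>>         .newInstance();
>>>>>>>>>     // e.g. a HadoopCatalog would look for its warehouse location in props
>>>>>>>>>     catalog.initialize("table-level-catalog", props);
>>>>>>>>>     return catalog;
>>>>>>>>>   }
>>>>>>>>> }
>>>>>>>>>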
>>>>>>>>>>
>>>>>>>>>>    1. I don’t think that catalog configuration should be kept in
>>>>>>>>>>    table properties. A catalog should not be loaded for each table. So I don’t
>>>>>>>>>>    think we need iceberg.catalog_location. Instead, we should
>>>>>>>>>>    have a way to define catalogs in the Configuration for tables
>>>>>>>>>>    in the metastore to reference.
>>>>>>>>>>
>>>>>>>>>>  I think it makes sense; on the other hand, it would make adding
>>>>>>>>> new catalogs more heavyweight, i.e. now you'd need to edit configuration
>>>>>>>>> files and restart/reinit services. That could be cumbersome in some
>>>>>>>>> environments.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    1. I’d rather use a prefix to exclude properties from being
>>>>>>>>>>    passed to Iceberg than to include them. Otherwise, users don’t know what to
>>>>>>>>>>    do to pass table properties from Hive or Impala. If we exclude a prefix or
>>>>>>>>>>    specific properties, then everything but the properties reserved for
>>>>>>>>>>    locating the table are passed as the user would expect.
>>>>>>>>>>
>>>>>>>>>> I don't have a strong opinion about this, but yeah, maybe this
>>>>>>>>> behavior would cause the least surprises.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <
>>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks, Peter. I answered inline.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <
>>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Zoltan,
>>>>>>>>>>>>
>>>>>>>>>>>> Answers below:
>>>>>>>>>>>>
>>>>>>>>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>>>>>>>>>> boroknagyz@cloudera.com.INVALID> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the replies. My take on the above questions is as
>>>>>>>>>>>> follows:
>>>>>>>>>>>>
>>>>>>>>>>>>    - Should 'iceberg.catalog' be a required property?
>>>>>>>>>>>>    - Yeah, I think it would be nice if this would be required
>>>>>>>>>>>>       to avoid any implicit behavior
>>>>>>>>>>>>
>>>>>>>>>>>> Currently we have a Catalogs class to get/initialize/use the
>>>>>>>>>>>> different Catalogs. At that time the decision was to use HadoopTables as a
>>>>>>>>>>>> default catalog.
>>>>>>>>>>>> It might be worthwhile to use the same class in Impala as well,
>>>>>>>>>>>> so the behavior is consistent.
>>>>>>>>>>>>
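>>>>>>>>>>>> For reference, a minimal use of that class might look like the sketch
>>>>>>>>>>>> below (assuming the iceberg-mr Catalogs API, where with no catalog
>>>>>>>>>>>> configured the table is loaded via HadoopTables from its 'location'
>>>>>>>>>>>> property; verify the exact signatures for your version):
>>>>>>>>>>>>
>>>>>>>>>>>> import java.util.Properties;
>>>>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>>> import org.apache.iceberg.Table;
>>>>>>>>>>>> import org.apache.iceberg.mr.Catalogs;
>>>>>>>>>>>>
>>>>>>>>>>>> public class CatalogsExample {
>>>>>>>>>>>>   public static void main(String[] args) {
>>>>>>>>>>>>     Configuration conf = new Configuration();
>>>>>>>>>>>>     Properties props = new Properties();
>>>>>>>>>>>>     // no catalog configured: falls back to HadoopTables and the location
>>>>>>>>>>>>     props.setProperty("location", "hdfs://namenode:8020/warehouse/db/tbl");
>>>>>>>>>>>>     Table table = Catalogs.loadTable(conf, props);
>>>>>>>>>>>>     System.out.println(table.schema());
>>>>>>>>>>>>   }
>>>>>>>>>>>> }
>>>>>>>>>>>>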
>>>>>>>>>>>
>>>>>>>>>>> Yeah, I think it'd be beneficial for us to use the Iceberg
>>>>>>>>>>> classes whenever possible. The Catalogs class is very similar to what we
>>>>>>>>>>> have currently in Impala.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>    - 'hadoop.catalog' LOCATION and catalog_location
>>>>>>>>>>>>       - In Impala we don't allow setting LOCATION for tables
>>>>>>>>>>>>       stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the
>>>>>>>>>>>>       Iceberg table's actual location. We were also thinking about using only the
>>>>>>>>>>>>       table LOCATION, and setting it to the catalog location, but we also found it
>>>>>>>>>>>>       confusing.
>>>>>>>>>>>>
>>>>>>>>>>>> It could definitely work, but it is somewhat strange that we
>>>>>>>>>>>> have an external table location set to an arbitrary path, and we have a
>>>>>>>>>>>> different location generated by other configs. It would be nice to have the
>>>>>>>>>>>> real location set in the external table location as well.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Impala sets the real Iceberg table location for external tables.
>>>>>>>>>>> E.g. if the user issues
>>>>>>>>>>>
>>>>>>>>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>>>>>>>>> STORED AS ICEBERG
>>>>>>>>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>>>>>>>>
>>>>>>>>>>> 'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>>>>>>>>
>>>>>>>>>>> 'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>>>>>>>>
>>>>>>>>>>> If the end user had specified LOCATION, then Impala would have
>>>>>>>>>>> raised an error. But the above DDL statement is correct, so Impala loads
>>>>>>>>>>> the iceberg table via Iceberg API, then creates the HMS table and sets
>>>>>>>>>>> LOCATION to the Iceberg table location (something like
>>>>>>>>>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I like the flexibility of setting the table_identifier at the table
>>>>>>>>>>>> level, which could help remove naming conflicts. We might want to have
>>>>>>>>>>>> this in the Iceberg Catalog implementation.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>    - 'iceberg.table_identifier' for HiveCatalog
>>>>>>>>>>>>       - Yeah, it doesn't add much if we only allow using the
>>>>>>>>>>>>       current HMS. I think it can be only useful if we are allowing external
>>>>>>>>>>>>       HMSes.
>>>>>>>>>>>>    - Moving properties to SERDEPROPERTIES
>>>>>>>>>>>>       - I see that these properties are used by the SerDe
>>>>>>>>>>>>       classes in Hive, but I feel that these properties are just not about
>>>>>>>>>>>>       serialization and deserialization. And as I see the current SERDEPROPERTIES
>>>>>>>>>>>>       are things like 'field.delim', 'separatorChar', 'quoteChar', etc. So
>>>>>>>>>>>>       properties about table loading more naturally belong to TBLPROPERTIES in my
>>>>>>>>>>>>       opinion.
>>>>>>>>>>>>
>>>>>>>>>>>> I have seen it used both ways for HBaseSerDe. (even the wiki
>>>>>>>>>>>> page uses both :) ). Since Impala prefers TBLPROPERTIES, and if we start
>>>>>>>>>>>> using a prefix for separating real Iceberg table properties from other
>>>>>>>>>>>> properties, then we can keep it in TBLPROPERTIES.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> In the google doc I also had a comment about prefixing iceberg
>>>>>>>>>>> table properties. We could use a prefix like 'iceberg.tblproperties.', and
>>>>>>>>>>> pass every property with this prefix to the Iceberg table. Currently Impala
>>>>>>>>>>> passes every table property to the Iceberg table.
>>>>>>>>>>>
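>>>>>>>>>>> As a sketch of that convention (the prefix is the one proposed here, not
>>>>>>>>>>> an implemented one), the forwarding logic could be:
>>>>>>>>>>>
>>>>>>>>>>> import java.util.HashMap;
>>>>>>>>>>> import java.util.Map;
>>>>>>>>>>>
>>>>>>>>>>> public class PrefixFilter {
>>>>>>>>>>>   static final String PREFIX = "iceberg.tblproperties.";
>>>>>>>>>>>
>>>>>>>>>>>   // copy only prefixed HMS properties to the Iceberg table, prefix stripped
>>>>>>>>>>>   static Map<String, String> toIcebergProps(Map<String, String> hmsProps) {
>>>>>>>>>>>     Map<String, String> result = new HashMap<>();
>>>>>>>>>>>     for (Map.Entry<String, String> e : hmsProps.entrySet()) {
>>>>>>>>>>>       if (e.getKey().startsWith(PREFIX)) {
>>>>>>>>>>>         result.put(e.getKey().substring(PREFIX.length()), e.getValue());
>>>>>>>>>>>       }
>>>>>>>>>>>     }
>>>>>>>>>>>     return result;
>>>>>>>>>>>   }
>>>>>>>>>>> }
>>>>>>>>>>>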
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>     Zoltan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <
>>>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Based on the discussion below I understand we have the
>>>>>>>>>>>>> following kinds of properties:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    1. Iceberg table properties - Engine independent, storage
>>>>>>>>>>>>>    related parameters
>>>>>>>>>>>>>    2. "how to get to" - I think these are mostly Hive table
>>>>>>>>>>>>>    specific properties, since for Spark, the Spark catalog configuration
>>>>>>>>>>>>>    serves for the same purpose. I think the best place for storing these would
>>>>>>>>>>>>>    be the Hive SERDEPROPERTIES, as this describes the access information for
>>>>>>>>>>>>>    the SerDe. Sidenote: I think we should decide if we allow
>>>>>>>>>>>>>    HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier'
>>>>>>>>>>>>>    would make sense only if we allow having multiple catalogs.
>>>>>>>>>>>>>    3. Query specific properties - These are engine specific
>>>>>>>>>>>>>    and might be mapped to / even override the Iceberg table properties on the
>>>>>>>>>>>>>    engine specific code paths, but currently these properties have independent
>>>>>>>>>>>>>    names and mapped on a case-by-case basis.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Based on this:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - Shall we move the "how to get to" properties to
>>>>>>>>>>>>>    SERDEPROPERTIES?
>>>>>>>>>>>>>    - Shall we define a prefix for setting Iceberg table
>>>>>>>>>>>>>    properties from Hive queries and omitting other engine specific properties?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I like these suggestions, comments inline below on the last
>>>>>>>>>>>>> round...
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <
>>>>>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>>>>>>>>>> information about table loading in HMS table properties. We are just a bit
>>>>>>>>>>>>>> more explicit about which catalog to use.
>>>>>>>>>>>>>> We have table property 'iceberg.catalog' to determine the
>>>>>>>>>>>>>> catalog type, right now the supported values are 'hadoop.tables',
>>>>>>>>>>>>>> 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be
>>>>>>>>>>>>>> set based on the catalog type.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm all for renaming this; having "mr" in the property name is
>>>>>>>>>>>>> confusing.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - hadoop.tables
>>>>>>>>>>>>>>       - the table location is used to load the table
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The only question I have is should we have this as the
>>>>>>>>>>>>> default? i.e. if you don't set a catalog it will assume it's HadoopTables
>>>>>>>>>>>>> and use the location? Or should we require this property to be here to be
>>>>>>>>>>>>> consistent and avoid any "magic"?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - hadoop.catalog
>>>>>>>>>>>>>>       - Required table property 'iceberg.catalog_location'
>>>>>>>>>>>>>>       specifies the location of the hadoop catalog in the file system
>>>>>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>>>       is used as table identifier
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I like this as it would allow you to use a different database
>>>>>>>>>>>>> and table name in Hive as opposed to the Hadoop Catalog - at the moment
>>>>>>>>>>>>> they have to match. The only thing here is that I think Hive requires a
>>>>>>>>>>>>> table LOCATION to be set and it's then confusing as there are now two
>>>>>>>>>>>>> locations on the table. I'm not sure whether in the Hive storage handler or
>>>>>>>>>>>>> SerDe etc. we can get Hive to not require that and maybe even disallow it
>>>>>>>>>>>>> from being set. That would probably be best in conjunction with this.
>>>>>>>>>>>>> Another solution would be to not have the 'iceberg.catalog_location'
>>>>>>>>>>>>> property but instead use the table LOCATION for this but that's a bit
>>>>>>>>>>>>> confusing from a Hive point of view.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - hive.catalog
>>>>>>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>>>       specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>>>       is used as table identifier
>>>>>>>>>>>>>>       - We have the assumption that the current Hive
>>>>>>>>>>>>>>       metastore stores the table, i.e. we don't support external Hive
>>>>>>>>>>>>>>       metastores currently
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> These sound fine for Hive catalog tables that are created
>>>>>>>>>>>>> outside of the automatic Hive table creation (see
>>>>>>>>>>>>> https://iceberg.apache.org/hive/ -> Using Hive Catalog); we'd
>>>>>>>>>>>>> just need to document how you can create these yourself and that one could
>>>>>>>>>>>>> use a different Hive database and table etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Independent of catalog implementations, but we also have
>>>>>>>>>>>>>> table property 'iceberg.file_format' to specify the file format for the
>>>>>>>>>>>>>> data files.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK, I don't think we need that for Hive?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> We haven't released it yet, so we are open to changes, but I
>>>>>>>>>>>>>> think these properties are reasonable and it would be great if we could
>>>>>>>>>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>>>>>>>>>> metastore of tables.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> If others agree I think we should create an issue where we
>>>>>>>>>>>>> document the above changes so it's very clear what we're doing and can then
>>>>>>>>>>>>> go and implement them and update the docs etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>     Zoltan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <
>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> #4 is correct because we provide some information that is
>>>>>>>>>>>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>>>>>>>>>>>> current user). I also agree that it would be good to have a table
>>>>>>>>>>>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>>>>>>>>>>>> gives us a way to handle name conflicts.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <
>>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> db1.table1_etl_branch =>
>>>>>>>>>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <
>>>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I agree with Ryan on the core principles here. As I
>>>>>>>>>>>>>>>>> understand them:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>>>>>>>>>>>>>    2. Hive table properties describe "how to get to"
>>>>>>>>>>>>>>>>>    Iceberg metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>>>>>>>>>>    3. There could be default "how to get to" information
>>>>>>>>>>>>>>>>>    set at a global level
>>>>>>>>>>>>>>>>>    4. Best-effort schema should be stored in the table
>>>>>>>>>>>>>>>>>    properties in HMS. This should be done for information schema retrieval
>>>>>>>>>>>>>>>>>    purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is that a fair summary of your statements Ryan (except 4,
>>>>>>>>>>>>>>>>> which I just added)?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> One comment I have on #2 is that for different catalogs
>>>>>>>>>>>>>>>>> and use cases, I think it can be somewhat more complex where it would be
>>>>>>>>>>>>>>>>> desirable for a table that initially existed without Hive that was later
>>>>>>>>>>>>>>>>> exposed in Hive to support a ptr/path/token for how the table is named
>>>>>>>>>>>>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>>>>>>>>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>>>>>>>>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Similarly, you might want to expose a particular branch
>>>>>>>>>>>>>>>>> version of a table. So it might say:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Just saying that the address to the table in the catalog
>>>>>>>>>>>>>>>>> could itself have several properties. The key being that no matter what
>>>>>>>>>>>>>>>>> those are, we should follow #1 and only store properties that are about the
>>>>>>>>>>>>>>>>> ptr, not the content/metadata.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it.
>>>>>>>>>>>>>>>>> Can someone confirm that it is true? And that it is possible/not
>>>>>>>>>>>>>>>>> problematic?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking
>>>>>>>>>>>>>>>>>> about these problems as well, so this is a good time to have a discussion
>>>>>>>>>>>>>>>>>> about Hive config.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think that Hive configuration should work mostly like
>>>>>>>>>>>>>>>>>> other engines, where different configurations are used for different
>>>>>>>>>>>>>>>>>> purposes. Different purposes means that there is not a global configuration
>>>>>>>>>>>>>>>>>> priority. Hopefully, I can explain how we use the different config sources
>>>>>>>>>>>>>>>>>> elsewhere to clarify.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it
>>>>>>>>>>>>>>>>>> has a Hadoop Configuration, but it also has its own global configuration.
>>>>>>>>>>>>>>>>>> There are also Iceberg table properties, and all of the various Hive
>>>>>>>>>>>>>>>>>> properties if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The first step is to simplify where we can, so we
>>>>>>>>>>>>>>>>>> effectively eliminate 2 sources of config:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    - The Hadoop Configuration is only used to
>>>>>>>>>>>>>>>>>>    instantiate Hadoop classes, like FileSystem. Iceberg should not use it for
>>>>>>>>>>>>>>>>>>    any other config.
>>>>>>>>>>>>>>>>>>    - Config in the Hive MetaStore is only used to
>>>>>>>>>>>>>>>>>>    identify that a table is Iceberg and point to its metadata location. All
>>>>>>>>>>>>>>>>>>    other config in HMS is informational. For example, the input format is
>>>>>>>>>>>>>>>>>>    FileInputFormat so that non-Iceberg readers cannot actually instantiate the
>>>>>>>>>>>>>>>>>>    format (it’s abstract) but it is available so they also don’t fail trying
>>>>>>>>>>>>>>>>>>    to load the class. Table-specific config should not be stored in table or
>>>>>>>>>>>>>>>>>>    serde properties.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> That leaves Spark configuration and Iceberg table
>>>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Iceberg differs from other tables because it is
>>>>>>>>>>>>>>>>>> opinionated: data configuration should be maintained at the table level.
>>>>>>>>>>>>>>>>>> This is cleaner for users because config is standardized across engines and
>>>>>>>>>>>>>>>>>> in one place. And it also enables services that analyze a table and update
>>>>>>>>>>>>>>>>>> its configuration to tune options that users almost never do, like row
>>>>>>>>>>>>>>>>>> group or stripe size in the columnar formats. Iceberg table configuration
>>>>>>>>>>>>>>>>>> is used to configure table-specific concerns and behavior.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Spark configuration is used for engine-specific concerns,
>>>>>>>>>>>>>>>>>> and runtime overrides. A good example of an engine-specific concern is the
>>>>>>>>>>>>>>>>>> catalogs that are available to load Iceberg tables. Spark has a way to load
>>>>>>>>>>>>>>>>>> and configure catalog implementations and Iceberg uses that for all
>>>>>>>>>>>>>>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>>>>>>>>>>>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>>>>>>>>>>>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>>>>>>>>>>>>>>> passed to the individual read. Note that these necessarily have different
>>>>>>>>>>>>>>>>>> config names for how they are used: Iceberg uses
>>>>>>>>>>>>>>>>>> read.split.target-size and the read-specific option is
>>>>>>>>>>>>>>>>>> target-size.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Applying this to Hive is a little strange for a couple
>>>>>>>>>>>>>>>>>> reasons. First, Hive’s engine configuration *is* a
>>>>>>>>>>>>>>>>>> Hadoop Configuration. As a result, I think the right place to store
>>>>>>>>>>>>>>>>>> engine-specific config is there, including Iceberg catalogs using a
>>>>>>>>>>>>>>>>>> strategy similar to what Spark does: what external Iceberg catalogs are
>>>>>>>>>>>>>>>>>> available and their configuration should come from the HiveConf.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The second way Hive is strange is that Hive needs to use
>>>>>>>>>>>>>>>>>> its own MetaStore to track Hive table concerns. The MetaStore may have
>>>>>>>>>>>>>>>>>> tables created by an Iceberg HiveCatalog, and Hive also needs to be able to
>>>>>>>>>>>>>>>>>> load tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    - There should be a default HiveCatalog that uses the
>>>>>>>>>>>>>>>>>>    current MetaStore URI to be used for HiveCatalog tables tracked in the
>>>>>>>>>>>>>>>>>>    MetaStore
>>>>>>>>>>>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>>>>>>>>>>>    - HMS table properties should be used to determine
>>>>>>>>>>>>>>>>>>    how to load a table: using a Hadoop location, using the default metastore
>>>>>>>>>>>>>>>>>>    catalog, or using an external Iceberg catalog
>>>>>>>>>>>>>>>>>>       - If there is a metadata_location, then use the
>>>>>>>>>>>>>>>>>>       HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>>>>>>>>>       - If there is a catalog property, then load that
>>>>>>>>>>>>>>>>>>       catalog and use it to load the table identifier, or maybe an identifier
>>>>>>>>>>>>>>>>>>       from HMS table properties
>>>>>>>>>>>>>>>>>>       - If there is no catalog or metadata_location,
>>>>>>>>>>>>>>>>>>       then use HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This would make it possible to access all types of
>>>>>>>>>>>>>>>>>> Iceberg tables in the same query, and would match how Spark and Flink
>>>>>>>>>>>>>>>>>> configure catalogs. Other than the configuration above, I don’t think that
>>>>>>>>>>>>>>>>>> config in HMS should be used at all, like how the other engines work.
>>>>>>>>>>>>>>>>>> Iceberg is the source of truth for table metadata, HMS stores how to load
>>>>>>>>>>>>>>>>>> the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This isn’t quite how configuration works right now.
>>>>>>>>>>>>>>>>>> Currently, the catalog is controlled by a HiveConf property,
>>>>>>>>>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables will
>>>>>>>>>>>>>>>>>> be used to load table locations. If it is set, then that catalog will be
>>>>>>>>>>>>>>>>>> used to load all tables by name. This makes it impossible to load tables
>>>>>>>>>>>>>>>>>> from different catalogs at the same time. That’s why I think the Iceberg
>>>>>>>>>>>>>>>>>> catalog for a table should be stored in HMS table properties.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I should also explain the iceberg.hive.engine.enabled flag,
>>>>>>>>>>>>>>>>>> but I think this is long enough for now.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>>>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I would like to start a discussion about how we should handle
>>>>>>>>>>>>>>>>>>> properties from various sources like Iceberg, Hive or global configuration.
>>>>>>>>>>>>>>>>>>> I've put together a short document
>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>>
>>>>>>
>>>>>>