Posted to dev@iceberg.apache.org by Peter Vary <pv...@cloudera.com.INVALID> on 2021/01/11 06:33:11 UTC
Re: Iceberg/Hive properties handling
Hi Team,
@Jacques Nadeau <ja...@dremio.com>: you mentioned that you might
consolidate the thoughts in a document for the path forward. Did you have
time for that, or did the holidays overwrite all of the plans as usual? :)
Other: Ryan convinced me that it would be good to move forward with the
synchronised Hive-Iceberg property list whenever it is possible, and use
the Iceberg table properties as the master when not. This would be the
solution which aligns best with the other integration solutions.
Thanks, Peter
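A rough sketch of the direction described above (my reading of the thread, with made-up property names, not actual HiveCatalog code) might look like this: keep the two property sets synchronized where possible, with the Iceberg table properties winning on any conflict.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: merge HMS and Iceberg table properties, treating the
// Iceberg side as the master copy whenever both define the same key.
public class PropertySync {
  public static Map<String, String> merged(Map<String, String> hmsProps,
                                           Map<String, String> icebergProps) {
    Map<String, String> result = new HashMap<>(hmsProps);
    result.putAll(icebergProps); // Iceberg is the master copy on conflicts
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> merged = merged(
        Map.of("write.format.default", "parquet", "EXTERNAL", "TRUE"),
        Map.of("write.format.default", "orc"));
    System.out.println(merged.get("write.format.default")); // orc wins
  }
}
```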
Peter Vary <pv...@cloudera.com> wrote (on Dec 10, 2020, Thu 8:27):
> I like the strong coupling between Hive and Iceberg if we can make it
> work. It could be beneficial for the end users, but I still have some
> concerns.
> We should consider the following aspects:
> - Where has the change initiated (Hive or Spark)
> - Which Catalog is used (HiveCatalog or other)
> - Which Hive version is used (Hive 2/3)
>
> Some current constraints I think we have:
> - There could be multiple Hive tables above a single Iceberg table with
> most of the Catalogs (HiveCatalog being the single exception)
> - I see no ways to propagate Spark changes for HMS if the Catalog is not
> HiveCatalog
> - Only Hive3 has ways to propagate changes to the Iceberg table after
> creation
> - Hive inserts modify the table data (one Iceberg commit) and then the
> table metadata (another Iceberg commit). This could be suboptimal but
> solvable.
>
> My feeling is that the tight coupling could work as expected with only the
> HiveCatalog using Hive3. In every other case the Iceberg and the HMS
> properties will deviate. That is why I think it would be easier to
> understand for the user that Iceberg and Hive is a different system with
> different properties.
>
> All that said we will use Hive3 and HiveCatalog so I think we are fine
> with 1-on-1 mapping too.
> If we move this way we should remove the current property filtering from
> the HiveCatalog and from the HiveIcebergMetaHook, so we are consistent.
>
> Thanks, Peter
>
> Jacques Nadeau <ja...@dremio.com> wrote (on Dec 9, 2020, Wed 23:01):
>
>>> Who cares if there are a few extra properties from Hive? Users may expect
>>> those properties to be there anyway.
>>
>>
>> Yeah, what is the key argument against letting them leak? What problem
>> are people trying to solve?
>>
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>
>> On Wed, Dec 9, 2020 at 12:47 PM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> I agree that #2 doesn’t really work. I also think that #4 can’t work
>>> either. There is no way to add a prefix for HMS properties that already
>>> exist, so the only option is to have a list of properties to suppress,
>>> which is option #1.
>>>
>>> I think that option #3 is a bad idea because it would lead to surprising
>>> behavior for users. If a user creates a table using Hive DDL and sets table
>>> properties, those properties should be present in the source of truth
>>> Iceberg table. If a prefix was required to forward them to Iceberg, that
>>> would create a situation where properties appear to be missing because the
>>> user tried to use syntax that works for nearly every other table.
>>>
>>> That leaves either option #1 or doing nothing. I actually think that
>>> there’s a strong argument to do nothing here and allow Hive and Iceberg
>>> properties to be mixed in the Iceberg table. Who cares if there are a few
>>> extra properties from Hive? Users may expect those properties to be there
>>> anyway.
>>>
>>> rb
>>>
>>> On Mon, Dec 7, 2020 at 9:58 AM Jacques Nadeau <jacques@dremio.com>
>>> wrote:
>>>
>>> Hey Peter, thanks for updating the doc and your heads up in the other
>>>> thread on your capacity to look at this before EOY.
>>>>
>>>> I'm going to try to create a specification document based on the
>>>> discussion document you put together. I think there is general consensus
>>>> around what you call "Spark-like catalog configuration" so I'd like to
>>>> formalize that more.
>>>>
>>>> It seems like there is less consensus around the whitelist/blacklist
>>>> side of things. You outline four approaches:
>>>>
>>>> 1. Hard coded HMS only property list
>>>> 2. Hard coded Iceberg only property list
>>>> 3. Prefix for Iceberg properties
>>>> 4. Prefix for HMS only properties
>>>>
>>>> I generally think #2 is a no-go as it creates too much coupling between
>>>> catalog implementations and core iceberg. It seems like Ryan Blue would
>>>> prefer #4 (correct?). Any other strong opinions?
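As an illustration only, options #1 and #4 both amount to a filter over the HMS properties before they reach the Iceberg table; the suppressed names and the prefix in this sketch are hypothetical, not Iceberg's actual filtering code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of options #1 and #4: drop HMS-only properties, either from a
// hard-coded list or by a reserved prefix, and pass everything else on.
public class PropertyFilter {
  // Option #1: hard-coded list of HMS-only properties to suppress
  private static final Set<String> HMS_ONLY =
      Set.of("EXTERNAL", "numRows", "totalSize", "numFiles");
  // Option #4: a (hypothetical) prefix marking HMS-only properties
  private static final String HMS_PREFIX = "hms.";

  public static Map<String, String> forIceberg(Map<String, String> hmsProps) {
    Map<String, String> result = new HashMap<>();
    for (Map.Entry<String, String> e : hmsProps.entrySet()) {
      if (HMS_ONLY.contains(e.getKey()) || e.getKey().startsWith(HMS_PREFIX)) {
        continue; // keep HMS bookkeeping out of the Iceberg table
      }
      result.put(e.getKey(), e.getValue());
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(forIceberg(
        Map.of("EXTERNAL", "TRUE", "write.format.default", "orc")));
  }
}
```

Everything not explicitly suppressed flows through, which is what makes the user-facing behavior predictable.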
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
>>>>
>>>>
>>>> On Thu, Dec 3, 2020 at 9:27 AM Peter Vary <pv...@cloudera.com.invalid>
>>>> wrote:
>>>>
>>>>> As Jacques suggested (with the help of Zoltan) I have collected the
>>>>> current state and the proposed solutions in a document:
>>>>>
>>>>> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
>>>>>
>>>>> My feeling is that we do not have a final decision, so I tried to list
>>>>> all the possible solutions.
>>>>> Please comment!
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
>>>>>
>>>>> When I was working on the CREATE TABLE patch I found the following
>>>>> TBLPROPERTIES on newly created tables:
>>>>>
>>>>> - external.table.purge
>>>>> - EXTERNAL
>>>>> - bucketing_version
>>>>> - numRows
>>>>> - rawDataSize
>>>>> - totalSize
>>>>> - numFiles
>>>>> - numFileErasureCoded
>>>>>
>>>>>
>>>>> I am afraid that we cannot change the names of most of these
>>>>> properties, and it might not be useful to have most of them alongside the
>>>>> Iceberg statistics already there. Also my feeling is that this is only the
>>>>> top of the Iceberg (pun intended :)), so I think we need a more targeted
>>>>> way to push properties to the Iceberg tables.
>>>>>
>>>>> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>> Sorry, I accidentally didn’t copy the dev list on this reply.
>>>>> Resending:
>>>>>
>>>>> Also I expect that we want to add Hive write specific configs to table
>>>>> level when the general engine independent configuration is not ideal for
>>>>> Hive, but every Hive query for a given table should use some specific
>>>>> config.
>>>>>
>>>>> Hive may need configuration, but I think these should still be kept in
>>>>> the Iceberg table. There is no reason to make Hive config inaccessible from
>>>>> other engines. If someone wants to view all of the config for a table from
>>>>> Spark, the Hive config should also be included right?
>>>>>
>>>>> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com> wrote:
>>>>>
>>>>>> I will ask Laszlo if he wants to update his doc.
>>>>>>
>>>>>> I see both pros and cons of catalog definition in config files. If
>>>>>> there is an easy default then I do not mind any of the proposed solutions.
>>>>>>
>>>>>> OTOH I am in favor of the "use prefix for Iceberg table properties"
>>>>>> solution, because in Hive it is common to add new keys to the property list
>>>>>> - no restriction is in place (I am not even sure that the currently
>>>>>> implemented blacklist for preventing the propagation of properties to Iceberg
>>>>>> tables is complete). Also I expect that we want to add Hive write specific
>>>>>> configs to table level when the general engine independent configuration is
>>>>>> not ideal for Hive, but every Hive query for a given table should use some
>>>>>> specific config.
>>>>>>
>>>>>> Thanks, Peter
>>>>>>
>>>>>> Jacques Nadeau <ja...@dremio.com> wrote (on Dec 1, 2020, Tue 17:06):
>>>>>>
>>>>>>> Would someone be willing to create a document that states the
>>>>>>> current proposal?
>>>>>>>
>>>>>>> It is becoming somewhat difficult to follow this thread. I also
>>>>>>> worry that without a complete statement of the current shape that people
>>>>>>> may be incorrectly thinking they are in alignment.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jacques Nadeau
>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <
>>>>>>> boroknagyz@cloudera.com> wrote:
>>>>>>>
>>>>>>>> Thanks, Ryan. I answered inline.
>>>>>>>>
>>>>>>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This sounds like a good plan overall, but I have a couple of notes:
>>>>>>>>>
>>>>>>>>> 1. We need to keep in mind that users plug in their own
>>>>>>>>> catalogs, so iceberg.catalog could be a Glue or Nessie
>>>>>>>>> catalog, not just Hive or Hadoop. I don’t think it makes much sense to use
>>>>>>>>> separate hadoop.catalog and hive.catalog values. Those should just be names
>>>>>>>>> for catalogs configured in Configuration, i.e., via
>>>>>>>>> hive-site.xml. We then only need a special value for loading
>>>>>>>>> Hadoop tables from paths.
>>>>>>>>>
>>>>>>>>> About extensibility, I think the usual Hive way is to use Java
>>>>>>>> class names. So this way the value for 'iceberg.catalog' could be e.g.
>>>>>>>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
>>>>>>>> would need to have a factory method that constructs the catalog object from
>>>>>>>> a properties object (Map<String, String>). E.g.
>>>>>>>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
>>>>>>>> 'iceberg.catalog_location' to be present in properties.
>>>>>>>>
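A minimal sketch of the factory idea above (the interface, the factory class name, and the string stand-in for the catalog object are all assumptions for the example; the real Iceberg catalog classes look different):

```java
import java.util.Map;

// The value of 'iceberg.catalog' names a factory class; each catalog
// implementation builds itself from the property map it is given.
interface CatalogFactory {
  Object create(Map<String, String> properties);
}

class HadoopCatalogFactory implements CatalogFactory {
  @Override
  public Object create(Map<String, String> properties) {
    String location = properties.get("iceberg.catalog_location");
    if (location == null) {
      throw new IllegalArgumentException(
          "iceberg.catalog_location is required for a Hadoop catalog");
    }
    return "HadoopCatalog(" + location + ")"; // stand-in for the real catalog
  }
}

public class CatalogLoader {
  public static Object load(Map<String, String> props) {
    try {
      String factoryClass = props.get("iceberg.catalog");
      CatalogFactory factory = (CatalogFactory)
          Class.forName(factoryClass).getDeclaredConstructor().newInstance();
      return factory.create(props);
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException("Cannot load catalog factory", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(load(Map.of(
        "iceberg.catalog", "HadoopCatalogFactory",
        "iceberg.catalog_location", "/warehouse/catalog")));
  }
}
```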
>>>>>>>>>
>>>>>>>>> 2. I don’t think that catalog configuration should be kept in
>>>>>>>>> table properties. A catalog should not be loaded for each table. So I don’t
>>>>>>>>> think we need iceberg.catalog_location. Instead, we should
>>>>>>>>> have a way to define catalogs in the Configuration for tables
>>>>>>>>> in the metastore to reference.
>>>>>>>>>
>>>>>>>> I think it makes sense; on the other hand, it would make adding
>>>>>>>> new catalogs more heavy-weight, i.e. now you'd need to edit configuration
>>>>>>>> files and restart/reinit services. That can be cumbersome in some
>>>>>>>> environments.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> 3. I’d rather use a prefix to exclude properties from being
>>>>>>>>> passed to Iceberg than to include them. Otherwise, users don’t know what to
>>>>>>>>> do to pass table properties from Hive or Impala. If we exclude a prefix or
>>>>>>>>> specific properties, then everything but the properties reserved for
>>>>>>>>> locating the table are passed as the user would expect.
>>>>>>>>>
>>>>>>>>> I don't have a strong opinion about this, but yeah, maybe this
>>>>>>>> behavior would cause the least surprises.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <
>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks, Peter. I answered inline.
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <
>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Zoltan,
>>>>>>>>>>>
>>>>>>>>>>> Answers below:
>>>>>>>>>>>
>>>>>>>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>>>>>>>>> boroknagyz@cloudera.com.INVALID> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the replies. My take for the above questions are as
>>>>>>>>>>> follows
>>>>>>>>>>>
>>>>>>>>>>> - Should 'iceberg.catalog' be a required property?
>>>>>>>>>>> - Yeah, I think it would be nice if this would be required
>>>>>>>>>>> to avoid any implicit behavior
>>>>>>>>>>>
>>>>>>>>>>> Currently we have a Catalogs class to get/initialize/use the
>>>>>>>>>>> different Catalogs. At that time the decision was to use HadoopTables as a
>>>>>>>>>>> default catalog.
>>>>>>>>>>> It might be worthwhile to use the same class in Impala as well,
>>>>>>>>>>> so the behavior is consistent.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yeah, I think it'd be beneficial for us to use the Iceberg
>>>>>>>>>> classes whenever possible. The Catalogs class is very similar to what we
>>>>>>>>>> have currently in Impala.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> - 'hadoop.catalog' LOCATION and catalog_location
>>>>>>>>>>> - In Impala we don't allow setting LOCATION for tables
>>>>>>>>>>> stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the
>>>>>>>>>>> Iceberg table's actual location. We were also thinking about using only the
>>>>>>>>>>> table LOCATION, and set it to the catalog location, but we also found it
>>>>>>>>>>> confusing.
>>>>>>>>>>>
>>>>>>>>>>> It could definitely work, but it is somewhat strange that we
>>>>>>>>>>> have an external table location set to an arbitrary path, and we have a
>>>>>>>>>>> different location generated by other configs. It would be nice to have the
>>>>>>>>>>> real location set in the external table location as well.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Impala sets the real Iceberg table location for external tables.
>>>>>>>>>> E.g. if the user issues
>>>>>>>>>>
>>>>>>>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>>>>>>>> STORED AS ICEBERG
>>>>>>>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>>>>>>>
>>>>>>>>>> 'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>>>>>>>
>>>>>>>>>> 'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>>>>>>>
>>>>>>>>>> If the end user had specified LOCATION, then Impala would have
>>>>>>>>>> raised an error. But the above DDL statement is correct, so Impala loads
>>>>>>>>>> the iceberg table via Iceberg API, then creates the HMS table and sets
>>>>>>>>>> LOCATION to the Iceberg table location (something like
>>>>>>>>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I like the flexibility of setting the table_identifier on table
>>>>>>>>>>> level, which could help removing naming conflicts. We might want to have
>>>>>>>>>>> this in the Iceberg Catalog implementation.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> - 'iceberg.table_identifier' for HiveCatalog
>>>>>>>>>>> - Yeah, it doesn't add much if we only allow using the
>>>>>>>>>>> current HMS. I think it can be only useful if we are allowing external
>>>>>>>>>>> HMSes.
>>>>>>>>>>> - Moving properties to SERDEPROPERTIES
>>>>>>>>>>> - I see that these properties are used by the SerDe
>>>>>>>>>>> classes in Hive, but I feel that these properties are just not about
>>>>>>>>>>> serialization and deserialization. And as I see the current SERDEPROPERTIES
>>>>>>>>>>> are things like 'field.delim', 'separatorChar', 'quoteChar', etc. So
>>>>>>>>>>> properties about table loading more naturally belong to TBLPROPERTIES in my
>>>>>>>>>>> opinion.
>>>>>>>>>>>
>>>>>>>>>>> I have seen it used both ways for HBaseSerDe. (even the wiki
>>>>>>>>>>> page uses both :) ). Since Impala prefers TBLPROPERTIES and if we start
>>>>>>>>>>> using prefix for separating real Iceberg table properties from other
>>>>>>>>>>> properties, then we can keep it at TBLPROPERTIES.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In the google doc I also had a comment about prefixing iceberg
>>>>>>>>>> table properties. We could use a prefix like 'iceberg.tblproperties.', and
>>>>>>>>>> pass every property with this prefix to the Iceberg table. Currently Impala
>>>>>>>>>> passes every table property to the Iceberg table.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Zoltan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <
>>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Based on the discussion below I understand we have the
>>>>>>>>>>>> following kinds of properties:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Iceberg table properties - Engine independent, storage
>>>>>>>>>>>> related parameters
>>>>>>>>>>>> 2. "how to get to" - I think these are mostly Hive table
>>>>>>>>>>>> specific properties, since for Spark, the Spark catalog configuration
>>>>>>>>>>>> serves for the same purpose. I think the best place for storing these would
>>>>>>>>>>>> be the Hive SERDEPROPERTIES, as this describes the access information for
>>>>>>>>>>>> the SerDe. Sidenote: I think we should decide if we allow
>>>>>>>>>>>> HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier'
>>>>>>>>>>>> would make sense only if we allow having multiple catalogs.
>>>>>>>>>>>> 3. Query specific properties - These are engine specific
>>>>>>>>>>>> and might be mapped to / even override the Iceberg table properties on the
>>>>>>>>>>>> engine specific code paths, but currently these properties have independent
>>>>>>>>>>>> names and mapped on a case-by-case basis.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Based on this:
>>>>>>>>>>>>
>>>>>>>>>>>> - Shall we move the "how to get to" properties to
>>>>>>>>>>>> SERDEPROPERTIES?
>>>>>>>>>>>> - Shall we define a prefix for setting Iceberg table
>>>>>>>>>>>> properties from Hive queries and omitting other engine specific properties?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Peter
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I like these suggestions, comments inline below on the last
>>>>>>>>>>>> round...
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <
>>>>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>>>>>>>>> information about table loading in HMS table properties. We are just a bit
>>>>>>>>>>>>> more explicit about which catalog to use.
>>>>>>>>>>>>> We have table property 'iceberg.catalog' to determine the
>>>>>>>>>>>>> catalog type, right now the supported values are 'hadoop.tables',
>>>>>>>>>>>>> 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be
>>>>>>>>>>>>> set based on the catalog type.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I'm all for renaming this, having "mr" in the property name is
>>>>>>>>>>>> confusing.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> - hadoop.tables
>>>>>>>>>>>>> - the table location is used to load the table
>>>>>>>>>>>>>
>>>>>>>>>>>>> The only question I have is should we have this as the
>>>>>>>>>>>> default? I.e. if you don't set a catalog it will assume it's HadoopTables
>>>>>>>>>>>> and use the location? Or should we require this property to be here to be
>>>>>>>>>>>> consistent and avoid any "magic"?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> - hadoop.catalog
>>>>>>>>>>>>> - Required table property 'iceberg.catalog_location'
>>>>>>>>>>>>> specifies the location of the hadoop catalog in the file system
>>>>>>>>>>>>> - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>> specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>> is used as table identifier
>>>>>>>>>>>>>
>>>>>>>>>>>>> I like this as it would allow you to use a different database
>>>>>>>>>>>> and table name in Hive as opposed to the Hadoop Catalog - at the moment
>>>>>>>>>>>> they have to match. The only thing here is that I think Hive requires a
>>>>>>>>>>>> table LOCATION to be set and it's then confusing as there are now two
>>>>>>>>>>>> locations on the table. I'm not sure whether in the Hive storage handler or
>>>>>>>>>>>> SerDe etc. we can get Hive to not require that and maybe even disallow it
>>>>>>>>>>>> from being set. That would probably be best in conjunction with this.
>>>>>>>>>>>> Another solution would be to not have the 'iceberg.catalog_location'
>>>>>>>>>>>> property but instead use the table LOCATION for this but that's a bit
>>>>>>>>>>>> confusing from a Hive point of view.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> - hive.catalog
>>>>>>>>>>>>> - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>> specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>> is used as table identifier
>>>>>>>>>>>>> - We have the assumption that the current Hive
>>>>>>>>>>>>> metastore stores the table, i.e. we don't support external Hive
>>>>>>>>>>>>> metastores currently
>>>>>>>>>>>>>
>>>>>>>>>>>>> These sound fine for Hive catalog tables that are created
>>>>>>>>>>>> outside of the automatic Hive table creation (see
>>>>>>>>>>>> https://iceberg.apache.org/hive/ -> Using Hive Catalog) we'd
>>>>>>>>>>>> just need to document how you can create these yourself and that one could
>>>>>>>>>>>> use a different Hive database and table etc.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Independent of catalog implementations, but we also have table
>>>>>>>>>>>>> property 'iceberg.file_format' to specify the file format for the data
>>>>>>>>>>>>> files.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> OK, I don't think we need that for Hive?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> We haven't released it yet, so we are open to changes, but I
>>>>>>>>>>>>> think these properties are reasonable and it would be great if we could
>>>>>>>>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>>>>>>>>> metastore of tables.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> If others agree I think we should create an issue where we
>>>>>>>>>>>> document the above changes so it's very clear what we're doing and can then
>>>>>>>>>>>> go an implement them and update the docs etc.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Zoltan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <
>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #4 is correct because we provide some information that is
>>>>>>>>>>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>>>>>>>>>>> current user). I also agree that it would be good to have a table
>>>>>>>>>>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>>>>>>>>>>> gives us a way to handle name conflicts.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <
>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> db1.table1_etl_branch =>
>>>>>>>>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <
>>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree with Ryan on the core principles here. As I
>>>>>>>>>>>>>>>> understand them:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. Iceberg metadata describes all properties of a table
>>>>>>>>>>>>>>>> 2. Hive table properties describe "how to get to"
>>>>>>>>>>>>>>>> Iceberg metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>>>>>>>>> 3. There could be default "how to get to" information
>>>>>>>>>>>>>>>> set at a global level
>>>>>>>>>>>>>>>> 4. Best-effort schema should be stored in the table
>>>>>>>>>>>>>>>> properties in HMS. This should be done for information schema retrieval
>>>>>>>>>>>>>>>> purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is that a fair summary of your statements Ryan (except 4,
>>>>>>>>>>>>>>>> which I just added)?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One comment I have on #2 is that for different catalogs and
>>>>>>>>>>>>>>>> use cases, I think it can be somewhat more complex where it would be
>>>>>>>>>>>>>>>> desirable for a table that initially existed without Hive that was later
>>>>>>>>>>>>>>>> exposed in Hive to support a ptr/path/token for how the table is named
>>>>>>>>>>>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>>>>>>>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>>>>>>>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Similarly, you might want to expose a particular branch
>>>>>>>>>>>>>>>> version of a table. So it might say:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Just saying that the address to the table in the catalog
>>>>>>>>>>>>>>>> could itself have several properties. The key being that no matter what
>>>>>>>>>>>>>>>> those are, we should follow #1 and only store properties that are about the
>>>>>>>>>>>>>>>> ptr, not the content/metadata.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can
>>>>>>>>>>>>>>>> someone confirm that it is true? And that it is possible/not problematic?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking
>>>>>>>>>>>>>>>>> about these problems as well, so this is a good time to have a discussion
>>>>>>>>>>>>>>>>> about Hive config.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think that Hive configuration should work mostly like
>>>>>>>>>>>>>>>>> other engines, where different configurations are used for different
>>>>>>>>>>>>>>>>> purposes. Different purposes means that there is not a global configuration
>>>>>>>>>>>>>>>>> priority. Hopefully, I can explain how we use the different config sources
>>>>>>>>>>>>>>>>> elsewhere to clarify.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it
>>>>>>>>>>>>>>>>> has a Hadoop Configuration, but it also has its own global configuration.
>>>>>>>>>>>>>>>>> There are also Iceberg table properties, and all of the various Hive
>>>>>>>>>>>>>>>>> properties if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The first step is to simplify where we can, so we
>>>>>>>>>>>>>>>>> effectively eliminate 2 sources of config:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - The Hadoop Configuration is only used to instantiate
>>>>>>>>>>>>>>>>> Hadoop classes, like FileSystem. Iceberg should not use it for any other
>>>>>>>>>>>>>>>>> config.
>>>>>>>>>>>>>>>>> - Config in the Hive MetaStore is only used to
>>>>>>>>>>>>>>>>> identify that a table is Iceberg and point to its metadata location. All
>>>>>>>>>>>>>>>>> other config in HMS is informational. For example, the input format is
>>>>>>>>>>>>>>>>> FileInputFormat so that non-Iceberg readers cannot actually instantiate the
>>>>>>>>>>>>>>>>> format (it’s abstract) but it is available so they also don’t fail trying
>>>>>>>>>>>>>>>>> to load the class. Table-specific config should not be stored in table or
>>>>>>>>>>>>>>>>> serde properties.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> That leaves Spark configuration and Iceberg table
>>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Iceberg differs from other tables because it is
>>>>>>>>>>>>>>>>> opinionated: data configuration should be maintained at the table level.
>>>>>>>>>>>>>>>>> This is cleaner for users because config is standardized across engines and
>>>>>>>>>>>>>>>>> in one place. And it also enables services that analyze a table and update
>>>>>>>>>>>>>>>>> its configuration to tune options that users almost never do, like row
>>>>>>>>>>>>>>>>> group or stripe size in the columnar formats. Iceberg table configuration
>>>>>>>>>>>>>>>>> is used to configure table-specific concerns and behavior.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Spark configuration is used for engine-specific concerns,
>>>>>>>>>>>>>>>>> and runtime overrides. A good example of an engine-specific concern is the
>>>>>>>>>>>>>>>>> catalogs that are available to load Iceberg tables. Spark has a way to load
>>>>>>>>>>>>>>>>> and configure catalog implementations and Iceberg uses that for all
>>>>>>>>>>>>>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>>>>>>>>>>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>>>>>>>>>>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>>>>>>>>>>>>>> passed to the individual read. Note that these necessarily have different
>>>>>>>>>>>>>>>>> config names for how they are used: Iceberg uses
>>>>>>>>>>>>>>>>> read.split.target-size and the read-specific option is
>>>>>>>>>>>>>>>>> target-size.
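The override chain described here could be sketched as follows; only the two property names come from the mail, the resolution logic itself is illustrative rather than Iceberg's actual code. A per-read option beats an engine-level per-table override, which beats the table property, which beats a hard default.

```java
import java.util.Map;
import java.util.Optional;

// Sketch: resolve the effective split size from layered configuration.
public class SplitSizeResolver {
  static final String TABLE_PROP = "read.split.target-size";
  static final String READ_OPTION = "target-size";

  public static long resolve(Map<String, String> tableProps,
                             Map<String, String> engineOverrides,
                             Map<String, String> readOptions,
                             long hardDefault) {
    return Optional.ofNullable(readOptions.get(READ_OPTION))
        .or(() -> Optional.ofNullable(engineOverrides.get(TABLE_PROP)))
        .or(() -> Optional.ofNullable(tableProps.get(TABLE_PROP)))
        .map(Long::parseLong)
        .orElse(hardDefault);
  }

  public static void main(String[] args) {
    // table default 128 MB, but the read-specific option overrides to 64 MB
    System.out.println(resolve(
        Map.of(TABLE_PROP, "134217728"),
        Map.of(),
        Map.of(READ_OPTION, "67108864"),
        16L * 1024 * 1024));
  }
}
```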
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Applying this to Hive is a little strange for a couple
>>>>>>>>>>>>>>>>> reasons. First, Hive’s engine configuration *is* a Hadoop
>>>>>>>>>>>>>>>>> Configuration. As a result, I think the right place to store
>>>>>>>>>>>>>>>>> engine-specific config is there, including Iceberg catalogs using a
>>>>>>>>>>>>>>>>> strategy similar to what Spark does: what external Iceberg catalogs are
>>>>>>>>>>>>>>>>> available and their configuration should come from the HiveConf.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The second way Hive is strange is that Hive needs to use
>>>>>>>>>>>>>>>>> its own MetaStore to track Hive table concerns. The MetaStore may have
>>>>>>>>>>>>>>>>> tables created by an Iceberg HiveCatalog, and Hive also needs to be able to
>>>>>>>>>>>>>>>>> load tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - There should be a default HiveCatalog that uses the
>>>>>>>>>>>>>>>>> current MetaStore URI to be used for HiveCatalog tables tracked in the
>>>>>>>>>>>>>>>>> MetaStore
>>>>>>>>>>>>>>>>> - Other catalogs should be defined in HiveConf
>>>>>>>>>>>>>>>>> - HMS table properties should be used to determine how
>>>>>>>>>>>>>>>>> to load a table: using a Hadoop location, using the default metastore
>>>>>>>>>>>>>>>>> catalog, or using an external Iceberg catalog
>>>>>>>>>>>>>>>>> - If there is a metadata_location, then use the
>>>>>>>>>>>>>>>>> HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>>>>>>>> - If there is a catalog property, then load that
>>>>>>>>>>>>>>>>> catalog and use it to load the table identifier, or maybe an identifier
>>>>>>>>>>>>>>>>> from HMS table properties
>>>>>>>>>>>>>>>>> - If there is no catalog or metadata_location, then
>>>>>>>>>>>>>>>>> use HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>>>>>>>>
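The lookup order in the list above can be sketched as follows (the return labels are made-up names for the three routes; only the property names follow the mail):

```java
import java.util.Map;

// Sketch: decide how to load an Iceberg table from its HMS properties.
public class TableLoadRoute {
  public static String route(Map<String, String> hmsProps) {
    if (hmsProps.containsKey("metadata_location")) {
      return "hive-catalog";      // tracked by this metastore's HiveCatalog
    }
    if (hmsProps.containsKey("iceberg.catalog")) {
      return "external-catalog";  // load that catalog, then the identifier
    }
    return "hadoop-tables";       // fall back to the table location
  }

  public static void main(String[] args) {
    System.out.println(route(Map.of("metadata_location", "/tbl/metadata.json")));
  }
}
```

Because the decision is per-table rather than driven by a single global property, tables from different catalogs can be used in the same query.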
>>>>>>>>>>>>>>>>> This would make it possible to access all types of Iceberg
>>>>>>>>>>>>>>>>> tables in the same query, and would match how Spark and Flink configure
>>>>>>>>>>>>>>>>> catalogs. Other than the configuration above, I don’t think that config in
>>>>>>>>>>>>>>>>> HMS should be used at all, like how the other engines work. Iceberg is the
>>>>>>>>>>>>>>>>> source of truth for table metadata, HMS stores how to load the Iceberg
>>>>>>>>>>>>>>>>> table, and HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This isn’t quite how configuration works right now.
>>>>>>>>>>>>>>>>> Currently, the catalog is controlled by a HiveConf property,
>>>>>>>>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables will
>>>>>>>>>>>>>>>>> be used to load table locations. If it is set, then that catalog will be
>>>>>>>>>>>>>>>>> used to load all tables by name. This makes it impossible to load tables
>>>>>>>>>>>>>>>>> from different catalogs at the same time. That’s why I think the Iceberg
>>>>>>>>>>>>>>>>> catalog for a table should be stored in HMS table properties.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I should also explain iceberg.hive.engine.enabled flag,
>>>>>>>>>>>>>>>>> but I think this is long enough for now.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I would like to start a discussion, how should we handle
>>>>>>>>>>>>>>>>>> properties from various sources like Iceberg, Hive or global configuration.
>>>>>>>>>>>>>>>>>> I've put together a short document
>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>
>>>>>
Re: Iceberg/Hive properties handling
Posted by Laszlo Pinter <lp...@cloudera.com.INVALID>.
Hi Team,
Based on the discussion on this thread I implemented the Spark-like
catalog configuration in Hive. You can check the PR here
<https://github.com/apache/iceberg/pull/2129>.
Basically, I extended the current Hive global-config-based catalog
configuration with a table-level catalog configuration, where the latter
has higher priority. I kept the old logic because I wanted to make sure
that I don't break any already implemented use cases that rely on this
config approach. Sooner or later I would like to remove the old one,
because that would reduce code complexity and improve readability, but
I'm not sure when the right time is.
Can I introduce such breaking changes between releases? Frankly, I would
keep both for the time being, but deprecate the old one and completely
remove it in a future release.
Thanks,
Laszlo
On Tue, Jan 12, 2021 at 4:20 AM Jacques Nadeau <ja...@gmail.com>
wrote:
> Hey Peter,
>
> Despite best intentions I made only nominal progress.
>
>
>
> On Sun, Jan 10, 2021 at 10:33 PM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
>> Hi Team,
>>
>> @Jacques Nadeau <ja...@dremio.com>: you mentioned that you might
>> consolidate the thoughts in a document for the path forward. Did you have
>> time for that, or the holidays overwritten all of the plans as usual :)
>>
>> Other: Ryan convinced me that it would be good to move forward with the
>> synchronised Hive-Iceberg property list whenever possible, and to use
>> the Iceberg table properties as the master when not. This is the
>> solution that aligns best with the other integrations.
>>
>> Thanks, Peter
>>
>> Peter Vary <pv...@cloudera.com> ezt írta (időpont: 2020. dec. 10., Csü
>> 8:27):
>>
>>> I like the strong coupling between Hive and Iceberg if we can make it
>>> work. It could be beneficial for the end users, but I still have some
>>> concerns.
>>> We should consider the following aspects:
>>> - Where was the change initiated (Hive or Spark)
>>> - Which Catalog is used (HiveCatalog or other)
>>> - Which Hive version is used (Hive 2/3)
>>>
>>> Some current constraints I think we have:
>>> - There could be multiple Hive tables above a single Iceberg table with
>>> most of the Catalogs (HiveCatalog being the single exception)
>>> - I see no way to propagate Spark changes to HMS if the Catalog is not
>>> HiveCatalog
>>> - Only Hive3 has ways to propagate changes to the Iceberg table after
>>> creation
>>> - Hive inserts modify the table data (one Iceberg commit) and then the
>>> table metadata (another Iceberg commit). This could be suboptimal but
>>> solvable.
>>>
>>> My feeling is that the tight coupling could work as expected only with
>>> the HiveCatalog on Hive3. In every other case the Iceberg and the HMS
>>> properties will deviate. That is why I think it would be easier for the
>>> user to understand that Iceberg and Hive are different systems with
>>> different properties.
>>>
>>> All that said we will use Hive3 and HiveCatalog so I think we are fine
>>> with 1-on-1 mapping too.
>>> If we move this way we should remove the current property filtering from
>>> the HiveCatalog and from the HiveIcebergMetaHook, so we are consistent.
>>>
>>> Thanks, Peter
>>>
>>> Jacques Nadeau <ja...@dremio.com> ezt írta (időpont: 2020. dec. 9.,
>>> Sze 23:01):
>>>
>>>> Who cares if there are a few extra properties from Hive? Users may
>>>>> expect those properties to be there anyway.
>>>>
>>>>
>>>> Yeah, what is the key argument against letting them leak? What problem
>>>> are people trying to solve?
>>>>
>>>>
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
>>>>
>>>>
>>>> On Wed, Dec 9, 2020 at 12:47 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> I agree that #2 doesn’t really work. I also think that #4 can’t work
>>>>> either. There is no way to add a prefix for HMS properties that already
>>>>> exist, so the only option is to have a list of properties to suppress,
>>>>> which is option #1.
>>>>>
>>>>> I think that option #3 is a bad idea because it would lead to
>>>>> surprising behavior for users. If a user creates a table using Hive DDL and
>>>>> sets table properties, those properties should be present in the source of
>>>>> truth Iceberg table. If a prefix was required to forward them to Iceberg,
>>>>> that would create a situation where properties appear to be missing because
>>>>> the user tried to use syntax that works for nearly every other table.
>>>>>
>>>>> That leaves either option #1 or doing nothing. I actually think that
>>>>> there’s a strong argument to do nothing here and allow Hive and Iceberg
>>>>> properties to be mixed in the Iceberg table. Who cares if there are a few
>>>>> extra properties from Hive? Users may expect those properties to be there
>>>>> anyway.
>>>>>
>>>>> rb
>>>>>
>>>>> On Mon, Dec 7, 2020 at 9:58 AM Jacques Nadeau <jacques@dremio.com>
>>>>> wrote:
>>>>>
>>>>> Hey Peter, thanks for updating the doc and your heads up in the other
>>>>>> thread on your capacity to look at this before EOY.
>>>>>>
>>>>>> I'm going to try to create a specification document based on the
>>>>>> discussion document you put together. I think there is general consensus
>>>>>> around what you call "Spark-like catalog configuration" so I'd like to
>>>>>> formalize that more.
>>>>>>
>>>>>> It seems like there is less consensus around the whitelist/blacklist
>>>>>> side of things. You outline four approaches:
>>>>>>
>>>>>> 1. Hard coded HMS only property list
>>>>>> 2. Hard coded Iceberg only property list
>>>>>> 3. Prefix for Iceberg properties
>>>>>> 4. Prefix for HMS only properties
>>>>>>
>>>>>> I generally think #2 is a no-go as it creates too much coupling
>>>>>> between catalog implementations and core iceberg. It seems like Ryan Blue
>>>>>> would prefer #4 (correct?). Any other strong opinions?
>>>>>> --
>>>>>> Jacques Nadeau
>>>>>> CTO and Co-Founder, Dremio
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 3, 2020 at 9:27 AM Peter Vary <pv...@cloudera.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> As Jacques suggested (with the help of Zoltan) I have collected the
>>>>>>> current state and the proposed solutions in a document:
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
>>>>>>>
>>>>>>> My feeling is that we do not have a final decision, so tried to list
>>>>>>> all the possible solutions.
>>>>>>> Please comment!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Peter
>>>>>>>
>>>>>>> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
>>>>>>>
>>>>>>> When I was working on the CREATE TABLE patch I found the following
>>>>>>> TBLPROPERTIES on newly created tables:
>>>>>>>
>>>>>>> - external.table.purge
>>>>>>> - EXTERNAL
>>>>>>> - bucketing_version
>>>>>>> - numRows
>>>>>>> - rawDataSize
>>>>>>> - totalSize
>>>>>>> - numFiles
>>>>>>> - numFileErasureCoded
>>>>>>>
>>>>>>>
>>>>>>> I am afraid that we cannot change the names of most of these
>>>>>>> properties, and it might not be useful to carry most of them over when
>>>>>>> Iceberg statistics are already there. Also, my feeling is that this is
>>>>>>> only the tip of the Iceberg (pun intended :)), which is why I think we
>>>>>>> should take a more targeted approach to pushing properties to the
>>>>>>> Iceberg tables.
>>>>>>>
>>>>>>> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>
>>>>>>> Sorry, I accidentally didn’t copy the dev list on this reply.
>>>>>>> Resending:
>>>>>>>
>>>>>>> Also I expect that we want to add Hive write specific configs to
>>>>>>> table level when the general engine independent configuration is not ideal
>>>>>>> for Hive, but every Hive query for a given table should use some specific
>>>>>>> config.
>>>>>>>
>>>>>>> Hive may need configuration, but I think these should still be kept
>>>>>>> in the Iceberg table. There is no reason to make Hive config inaccessible
>>>>>>> from other engines. If someone wants to view all of the config for a table
>>>>>>> from Spark, the Hive config should also be included right?
>>>>>>>
>>>>>>> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I will ask Laszlo if he wants to update his doc.
>>>>>>>>
>>>>>>>> I see both pros and cons of catalog definition in config files. If
>>>>>>>> there is an easy default then I do not mind any of the proposed solutions.
>>>>>>>>
>>>>>>>> OTOH I am in favor of the "use a prefix for Iceberg table properties"
>>>>>>>> solution, because in Hive it is common to add new keys to the property
>>>>>>>> list - no restriction is in place (I am not even sure that the currently
>>>>>>>> implemented blacklist preventing properties from being propagated to
>>>>>>>> Iceberg tables is complete). Also I expect that we will want to add
>>>>>>>> Hive-specific write configs at the table level when the general
>>>>>>>> engine-independent configuration is not ideal for Hive, but every Hive
>>>>>>>> query for a given table should use some specific config.
>>>>>>>>
>>>>>>>> Thanks, Peter
>>>>>>>>
>>>>>>>> Jacques Nadeau <ja...@dremio.com> ezt írta (időpont: 2020. dec.
>>>>>>>> 1., Ke 17:06):
>>>>>>>>
>>>>>>>>> Would someone be willing to create a document that states the
>>>>>>>>> current proposal?
>>>>>>>>>
>>>>>>>>> It is becoming somewhat difficult to follow this thread. I also
>>>>>>>>> worry that without a complete statement of the current shape, people
>>>>>>>>> may incorrectly think they are in alignment.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jacques Nadeau
>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <
>>>>>>>>> boroknagyz@cloudera.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks, Ryan. I answered inline.
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> This sounds like a good plan overall, but I have a couple of
>>>>>>>>>>> notes:
>>>>>>>>>>>
>>>>>>>>>>> 1. We need to keep in mind that users plug in their own
>>>>>>>>>>> catalogs, so iceberg.catalog could be a Glue or Nessie
>>>>>>>>>>> catalog, not just Hive or Hadoop. I don’t think it makes much sense to use
>>>>>>>>>>> separate hadoop.catalog and hive.catalog values. Those should just be names
>>>>>>>>>>> for catalogs configured in Configuration, i.e., via
>>>>>>>>>>> hive-site.xml. We then only need a special value for loading
>>>>>>>>>>> Hadoop tables from paths.
>>>>>>>>>>>
>>>>>>>>>>> About extensibility, I think the usual Hive way is to use Java
>>>>>>>>>> class names. So this way the value for 'iceberg.catalog' could be e.g.
>>>>>>>>>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
>>>>>>>>>> would need to have a factory method that constructs the catalog object from
>>>>>>>>>> a properties object (Map<String, String>). E.g.
>>>>>>>>>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
>>>>>>>>>> 'iceberg.catalog_location' to be present in properties.
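The class-name idea above could look roughly like this (a hedged Python sketch; the fully qualified class names are the real Iceberg ones mentioned in the thread, everything else is illustrative):

```python
# Sketch of the "class name + properties" idea: each catalog implementation
# is resolved by name and validates the properties it requires.

FACTORIES = {
    "org.apache.iceberg.hive.HiveCatalog":
        lambda props: ("hive", None),
    "org.apache.iceberg.hadoop.HadoopCatalog":
        # HadoopCatalog requires a catalog location property.
        lambda props: ("hadoop", props["iceberg.catalog_location"]),
}

def load_catalog(props):
    impl = props["iceberg.catalog"]  # e.g. a fully qualified Java class name
    return FACTORIES[impl](props)
```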
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 1. I don’t think that catalog configuration should be kept
>>>>>>>>>>> in table properties. A catalog should not be loaded for each table. So I
>>>>>>>>>>> don’t think we need iceberg.catalog_location. Instead, we
>>>>>>>>>>> should have a way to define catalogs in the Configuration
>>>>>>>>>>> for tables in the metastore to reference.
>>>>>>>>>>>
>>>>>>>>>>> I think it makes sense, on the other hand it would make adding
>>>>>>>>>> new catalogs more heavy-weight, i.e. now you'd need to edit configuration
>>>>>>>>>> files and restart/reinit services. Maybe it can be cumbersome in some
>>>>>>>>>> environments.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 1. I’d rather use a prefix to exclude properties from being
>>>>>>>>>>> passed to Iceberg than to include them. Otherwise, users don’t know what to
>>>>>>>>>>> do to pass table properties from Hive or Impala. If we exclude a prefix or
>>>>>>>>>>> specific properties, then everything but the properties reserved for
>>>>>>>>>>> locating the table are passed as the user would expect.
>>>>>>>>>>>
>>>>>>>>>>> I don't have a strong opinion about this, but yeah, maybe this
>>>>>>>>>> behavior would cause the least surprises.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <
>>>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Peter. I answered inline.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <
>>>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Zoltan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Answers below:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>>>>>>>>>>> boroknagyz@cloudera.com.INVALID> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the replies. My take for the above questions are as
>>>>>>>>>>>>> follows
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Should 'iceberg.catalog' be a required property?
>>>>>>>>>>>>> - Yeah, I think it would be nice if this would be required
>>>>>>>>>>>>> to avoid any implicit behavior
>>>>>>>>>>>>>
>>>>>>>>>>>>> Currently we have a Catalogs class to get/initialize/use the
>>>>>>>>>>>>> different Catalogs. At that time the decision was to use HadoopTables as a
>>>>>>>>>>>>> default catalog.
>>>>>>>>>>>>> It might be worthwhile to use the same class in Impala as
>>>>>>>>>>>>> well, so the behavior is consistent.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Yeah, I think it'd be beneficial for us to use the Iceberg
>>>>>>>>>>>> classes whenever possible. The Catalogs class is very similar to what we
>>>>>>>>>>>> have currently in Impala.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> - 'hadoop.catalog' LOCATION and catalog_location
>>>>>>>>>>>>> - In Impala we don't allow setting LOCATION for tables
>>>>>>>>>>>>> stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the
>>>>>>>>>>>>> Iceberg table's actual location. We were also thinking about using only the
>>>>>>>>>>>>> table LOCATION, and set it to the catalog location, but we also found it
>>>>>>>>>>>>> confusing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It could definitely work, but it is somewhat strange that we
>>>>>>>>>>>>> have an external table location set to an arbitrary path, and we have a
>>>>>>>>>>>>> different location generated by other configs. It would be nice to have the
>>>>>>>>>>>>> real location set in the external table location as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Impala sets the real Iceberg table location for external
>>>>>>>>>>>> tables. E.g. if the user issues
>>>>>>>>>>>>
>>>>>>>>>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>>>>>>>>>> STORED AS ICEBERG
>>>>>>>>>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>>>>>>>>>
>>>>>>>>>>>> 'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>>>>>>>>>
>>>>>>>>>>>> 'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>>>>>>>>>
>>>>>>>>>>>> If the end user had specified LOCATION, then Impala would have
>>>>>>>>>>>> raised an error. But the above DDL statement is correct, so Impala loads
>>>>>>>>>>>> the iceberg table via Iceberg API, then creates the HMS table and sets
>>>>>>>>>>>> LOCATION to the Iceberg table location (something like
>>>>>>>>>>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I like the flexibility of setting the table_identifier on
>>>>>>>>>>>>> table level, which could help removing naming conflicts. We might want to
>>>>>>>>>>>>> have this in the Iceberg Catalog implementation.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> - 'iceberg.table_identifier' for HiveCatalog
>>>>>>>>>>>>> - Yeah, it doesn't add much if we only allow using the
>>>>>>>>>>>>> current HMS. I think it can be only useful if we are allowing external
>>>>>>>>>>>>> HMSes.
>>>>>>>>>>>>> - Moving properties to SERDEPROPERTIES
>>>>>>>>>>>>> - I see that these properties are used by the SerDe
>>>>>>>>>>>>> classes in Hive, but I feel that these properties are just not about
>>>>>>>>>>>>> serialization and deserialization. And as I see the current SERDEPROPERTIES
>>>>>>>>>>>>> are things like 'field.delim', 'separatorChar', 'quoteChar', etc. So
>>>>>>>>>>>>> properties about table loading more naturally belong to TBLPROPERTIES in my
>>>>>>>>>>>>> opinion.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have seen it used both ways for HBaseSerDe. (even the wiki
>>>>>>>>>>>>> page uses both :) ). Since Impala prefers TBLPROPERTIES and if we start
>>>>>>>>>>>>> using prefix for separating real Iceberg table properties from other
>>>>>>>>>>>>> properties, then we can keep it at TBLPROPERTIES.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> In the google doc I also had a comment about prefixing iceberg
>>>>>>>>>>>> table properties. We could use a prefix like 'iceberg.tblproperties.', and
>>>>>>>>>>>> pass every property with this prefix to the Iceberg table. Currently Impala
>>>>>>>>>>>> passes every table property to the Iceberg table.
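The prefix idea mentioned above could behave like this (illustrative Python only; the prefix string is the one proposed in the comment):

```python
# Illustration of forwarding only prefixed HMS properties to the Iceberg
# table, stripping the prefix on the way through.

PREFIX = "iceberg.tblproperties."

def iceberg_props(hms_props):
    return {k[len(PREFIX):]: v
            for k, v in hms_props.items()
            if k.startswith(PREFIX)}
```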
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Zoltan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <
>>>>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Based on the discussion below I understand we have the
>>>>>>>>>>>>>> following kinds of properties:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Iceberg table properties - Engine independent, storage
>>>>>>>>>>>>>> related parameters
>>>>>>>>>>>>>> 2. "how to get to" - I think these are mostly Hive table
>>>>>>>>>>>>>> specific properties, since for Spark, the Spark catalog configuration
>>>>>>>>>>>>>> serves for the same purpose. I think the best place for storing these would
>>>>>>>>>>>>>> be the Hive SERDEPROPERTIES, as this describes the access information for
>>>>>>>>>>>>>> the SerDe. Sidenote: I think we should decide if we allow
>>>>>>>>>>>>>> HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier'
>>>>>>>>>>>>>> would make sense only if we allow having multiple catalogs.
>>>>>>>>>>>>>> 3. Query specific properties - These are engine specific
>>>>>>>>>>>>>> and might be mapped to / even override the Iceberg table properties on the
>>>>>>>>>>>>>> engine specific code paths, but currently these properties have independent
>>>>>>>>>>>>>> names and mapped on a case-by-case basis.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Based on this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Shall we move the "how to get to" properties to
>>>>>>>>>>>>>> SERDEPROPERTIES?
>>>>>>>>>>>>>> - Shall we define a prefix for setting Iceberg table
>>>>>>>>>>>>>> properties from Hive queries and omitting other engine specific properties?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I like these suggestions, comments inline below on the last
>>>>>>>>>>>>>> round...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <
>>>>>>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>>>>>>>>>>> information about table loading in HMS table properties. We are just a bit
>>>>>>>>>>>>>>> more explicit about which catalog to use.
>>>>>>>>>>>>>>> We have table property 'iceberg.catalog' to determine the
>>>>>>>>>>>>>>> catalog type, right now the supported values are 'hadoop.tables',
>>>>>>>>>>>>>>> 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be
>>>>>>>>>>>>>>> set based on the catalog type.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm all for renaming this, having "mr" in the property name
>>>>>>>>>>>>>> is confusing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - hadoop.tables
>>>>>>>>>>>>>>> - the table location is used to load the table
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The only question I have is: should we have this as the
>>>>>>>>>>>>>> default? I.e. if you don't set a catalog, it will assume HadoopTables
>>>>>>>>>>>>>> and use the location? Or should we require this property to be set, to be
>>>>>>>>>>>>>> consistent and avoid any "magic"?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - hadoop.catalog
>>>>>>>>>>>>>>> - Required table property 'iceberg.catalog_location'
>>>>>>>>>>>>>>> specifies the location of the hadoop catalog in the file system
>>>>>>>>>>>>>>> - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>>>> specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>>>> is used as table identifier
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I like this as it would allow you to use a different
>>>>>>>>>>>>>> database and table name in Hive as opposed to the Hadoop Catalog - at the
>>>>>>>>>>>>>> moment they have to match. The only thing here is that I think Hive
>>>>>>>>>>>>>> requires a table LOCATION to be set and it's then confusing as there are
>>>>>>>>>>>>>> now two locations on the table. I'm not sure whether in the Hive storage
>>>>>>>>>>>>>> handler or SerDe etc. we can get Hive to not require that and maybe even
>>>>>>>>>>>>>> disallow it from being set. That would probably be best in conjunction with
>>>>>>>>>>>>>> this. Another solution would be to not have the 'iceberg.catalog_location'
>>>>>>>>>>>>>> property but instead use the table LOCATION for this but that's a bit
>>>>>>>>>>>>>> confusing from a Hive point of view.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - hive.catalog
>>>>>>>>>>>>>>> - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>>>> specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>>>> is used as table identifier
>>>>>>>>>>>>>>> - We have the assumption that the current Hive
>>>>>>>>>>>>>>> metastore stores the table, i.e. we don't support external Hive
>>>>>>>>>>>>>>> metastores currently
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These sound fine for Hive catalog tables that are created
>>>>>>>>>>>>>> outside of the automatic Hive table creation (see
>>>>>>>>>>>>>> https://iceberg.apache.org/hive/ -> Using Hive Catalog) we'd
>>>>>>>>>>>>>> just need to document how you can create these yourself and that one could
>>>>>>>>>>>>>> use a different Hive database and table etc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Independent of catalog implementations, but we also have
>>>>>>>>>>>>>>> table property 'iceberg.file_format' to specify the file format for the
>>>>>>>>>>>>>>> data files.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> OK, I don't think we need that for Hive?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We haven't released it yet, so we are open to changes, but I
>>>>>>>>>>>>>>> think these properties are reasonable and it would be great if we could
>>>>>>>>>>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>>>>>>>>>>> metastore of tables.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If others agree I think we should create an issue where we
>>>>>>>>>>>>>> document the above changes so it's very clear what we're doing and can then
>>>>>>>>>>>>>> go an implement them and update the docs etc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Zoltan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <
>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> #4 is correct because we provide some information that is
>>>>>>>>>>>>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>>>>>>>>>>>>> current user). I also agree that it would be good to have a table
>>>>>>>>>>>>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>>>>>>>>>>>>> gives us a way to handle name conflicts.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <
>>>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> db1.table1_etl_branch =>
>>>>>>>>>>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <
>>>>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I agree with Ryan on the core principles here. As I
>>>>>>>>>>>>>>>>>> understand them:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1. Iceberg metadata describes all properties of a
>>>>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>>>>> 2. Hive table properties describe "how to get to"
>>>>>>>>>>>>>>>>>> Iceberg metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>>>>>>>>>>> 3. There could be default "how to get to" information
>>>>>>>>>>>>>>>>>> set at a global level
>>>>>>>>>>>>>>>>>> 4. A best-effort schema should be stored in the table
>>>>>>>>>>>>>>>>>> properties in HMS. This should be done for information schema retrieval
>>>>>>>>>>>>>>>>>> purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is that a fair summary of your statements Ryan (except 4,
>>>>>>>>>>>>>>>>>> which I just added)?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One comment I have on #2 is that for different catalogs
>>>>>>>>>>>>>>>>>> and use cases, I think it can be somewhat more complex where it would be
>>>>>>>>>>>>>>>>>> desirable for a table that initially existed without Hive that was later
>>>>>>>>>>>>>>>>>> exposed in Hive to support a ptr/path/token for how the table is named
>>>>>>>>>>>>>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>>>>>>>>>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>>>>>>>>>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Similarly, you might want to expose a particular branch
>>>>>>>>>>>>>>>>>> version of a table. So it might say:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Just saying that the address to the table in the catalog
>>>>>>>>>>>>>>>>>> could itself have several properties. The key being that no matter what
>>>>>>>>>>>>>>>>>> those are, we should follow #1 and only store properties that are about the
>>>>>>>>>>>>>>>>>> ptr, not the content/metadata.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it.
>>>>>>>>>>>>>>>>>> Can someone confirm that it is true? And that it is possible/not
>>>>>>>>>>>>>>>>>> problematic?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking
>>>>>>>>>>>>>>>>>>> about these problems as well, so this is a good time to have a discussion
>>>>>>>>>>>>>>>>>>> about Hive config.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think that Hive configuration should work mostly like
>>>>>>>>>>>>>>>>>>> other engines, where different configurations are used for different
>>>>>>>>>>>>>>>>>>> purposes. Different purposes means that there is not a global configuration
>>>>>>>>>>>>>>>>>>> priority. Hopefully, I can explain how we use the different config sources
>>>>>>>>>>>>>>>>>>> elsewhere to clarify.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it
>>>>>>>>>>>>>>>>>>> has a Hadoop Configuration, but it also has its own global configuration.
>>>>>>>>>>>>>>>>>>> There are also Iceberg table properties, and all of the various Hive
>>>>>>>>>>>>>>>>>>> properties if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The first step is to simplify where we can, so we
>>>>>>>>>>>>>>>>>>> effectively eliminate 2 sources of config:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - The Hadoop Configuration is only used to
>>>>>>>>>>>>>>>>>>> instantiate Hadoop classes, like FileSystem. Iceberg should not use it for
>>>>>>>>>>>>>>>>>>> any other config.
>>>>>>>>>>>>>>>>>>> - Config in the Hive MetaStore is only used to
>>>>>>>>>>>>>>>>>>> identify that a table is Iceberg and point to its metadata location. All
>>>>>>>>>>>>>>>>>>> other config in HMS is informational. For example, the input format is
>>>>>>>>>>>>>>>>>>> FileInputFormat so that non-Iceberg readers cannot actually instantiate the
>>>>>>>>>>>>>>>>>>> format (it’s abstract) but it is available so they also don’t fail trying
>>>>>>>>>>>>>>>>>>> to load the class. Table-specific config should not be stored in table or
>>>>>>>>>>>>>>>>>>> serde properties.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> That leaves Spark configuration and Iceberg table
>>>>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Iceberg differs from other tables because it is
>>>>>>>>>>>>>>>>>>> opinionated: data configuration should be maintained at the table level.
>>>>>>>>>>>>>>>>>>> This is cleaner for users because config is standardized across engines and
>>>>>>>>>>>>>>>>>>> in one place. And it also enables services that analyze a table and update
>>>>>>>>>>>>>>>>>>> its configuration to tune options that users almost never do, like row
>>>>>>>>>>>>>>>>>>> group or stripe size in the columnar formats. Iceberg table configuration
>>>>>>>>>>>>>>>>>>> is used to configure table-specific concerns and behavior.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Spark configuration is used for engine-specific
>>>>>>>>>>>>>>>>>>> concerns, and runtime overrides. A good example of an engine-specific
>>>>>>>>>>>>>>>>>>> concern is the catalogs that are available to load Iceberg tables. Spark
>>>>>>>>>>>>>>>>>>> has a way to load and configure catalog implementations and Iceberg uses
>>>>>>>>>>>>>>>>>>> that for all catalog-level config. Runtime overrides are things like target
>>>>>>>>>>>>>>>>>>> split size. Iceberg has a table-level default split size in table
>>>>>>>>>>>>>>>>>>> properties, but this can be overridden by a Spark option for each table, as
>>>>>>>>>>>>>>>>>>> well as an option passed to the individual read. Note that these
>>>>>>>>>>>>>>>>>>> necessarily have different config names for how they are used: Iceberg uses
>>>>>>>>>>>>>>>>>>> read.split.target-size and the read-specific option is
>>>>>>>>>>>>>>>>>>> target-size.
>>>>>>>>>>>>>>>>>>>
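The override order described here (per-read option, then table property, then an engine default) can be sketched as follows. This is an illustrative sketch, not Iceberg's implementation; the property names read.split.target-size and target-size come from the thread, while the default value and helper are made up.

```python
# Illustrative precedence for resolving the target split size of a scan:
# a per-read option overrides a table property, which overrides an engine default.

DEFAULT_SPLIT_SIZE = 134217728  # hypothetical 128 MB engine default

def resolve_split_size(read_options, table_properties):
    """Return the effective target split size for a single read."""
    if "target-size" in read_options:                   # per-read override
        return int(read_options["target-size"])
    if "read.split.target-size" in table_properties:    # table-level setting
        return int(table_properties["read.split.target-size"])
    return DEFAULT_SPLIT_SIZE                           # engine default

# A table-level setting applies unless the individual read overrides it:
table_props = {"read.split.target-size": "268435456"}
print(resolve_split_size({}, table_props))                           # 268435456
print(resolve_split_size({"target-size": "67108864"}, table_props))  # 67108864
```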
>>>>>>>>>>>>>>>>>>> Applying this to Hive is a little strange for a couple
>>>>>>>>>>>>>>>>>>> reasons. First, Hive’s engine configuration *is* a
>>>>>>>>>>>>>>>>>>> Hadoop Configuration. As a result, I think the right place to store
>>>>>>>>>>>>>>>>>>> engine-specific config is there, including Iceberg catalogs using a
>>>>>>>>>>>>>>>>>>> strategy similar to what Spark does: what external Iceberg catalogs are
>>>>>>>>>>>>>>>>>>> available and their configuration should come from the HiveConf.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The second way Hive is strange is that Hive needs to use
>>>>>>>>>>>>>>>>>>> its own MetaStore to track Hive table concerns. The MetaStore may have
>>>>>>>>>>>>>>>>>>> tables created by an Iceberg HiveCatalog, and Hive also needs to be able to
>>>>>>>>>>>>>>>>>>> load tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - There should be a default HiveCatalog, using the
>>>>>>>>>>>>>>>>>>> current MetaStore URI, for HiveCatalog tables tracked in the
>>>>>>>>>>>>>>>>>>> MetaStore
>>>>>>>>>>>>>>>>>>> - Other catalogs should be defined in HiveConf
>>>>>>>>>>>>>>>>>>> - HMS table properties should be used to determine
>>>>>>>>>>>>>>>>>>> how to load a table: using a Hadoop location, using the default metastore
>>>>>>>>>>>>>>>>>>> catalog, or using an external Iceberg catalog
>>>>>>>>>>>>>>>>>>> - If there is a metadata_location, then use the
>>>>>>>>>>>>>>>>>>> HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>>>>>>>>>> - If there is a catalog property, then load that
>>>>>>>>>>>>>>>>>>> catalog and use it to load the table identifier, or maybe an identifier
>>>>>>>>>>>>>>>>>>> from HMS table properties
>>>>>>>>>>>>>>>>>>> - If there is no catalog or metadata_location,
>>>>>>>>>>>>>>>>>>> then use HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>>>>>>>>>>
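The three-way loading decision in the list above can be sketched roughly like this. The key names ("metadata_location", "iceberg.catalog") and the return values are illustrative stand-ins; the thread has not yet settled on exact property names, and only the precedence is the point.

```python
# Illustrative sketch of deciding how to load an Iceberg table from its
# HMS table properties, following the precedence proposed in the e-mail.

def choose_loader(hms_table_properties):
    """Pick a loading strategy for a table registered in the metastore."""
    if "metadata_location" in hms_table_properties:
        # Tracked by this metastore: use the HiveCatalog for this HMS.
        return "hive-catalog"
    if "iceberg.catalog" in hms_table_properties:
        # Load the named external catalog (configured in HiveConf) by name.
        return "catalog:" + hms_table_properties["iceberg.catalog"]
    # Neither present: treat the table location as a Hadoop table.
    return "hadoop-tables"

print(choose_loader({"metadata_location": "s3://bucket/tbl/metadata.json"}))
print(choose_loader({"iceberg.catalog": "nessie"}))
print(choose_loader({}))
```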
>>>>>>>>>>>>>>>>>>> This would make it possible to access all types of
>>>>>>>>>>>>>>>>>>> Iceberg tables in the same query, and would match how Spark and Flink
>>>>>>>>>>>>>>>>>>> configure catalogs. Other than the configuration above, I don’t think that
>>>>>>>>>>>>>>>>>>> config in HMS should be used at all, like how the other engines work.
>>>>>>>>>>>>>>>>>>> Iceberg is the source of truth for table metadata, HMS stores how to load
>>>>>>>>>>>>>>>>>>> the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This isn’t quite how configuration works right now.
>>>>>>>>>>>>>>>>>>> Currently, the catalog is controlled by a HiveConf property,
>>>>>>>>>>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables
>>>>>>>>>>>>>>>>>>> will be used to load table locations. If it is set, then that catalog will
>>>>>>>>>>>>>>>>>>> be used to load all tables by name. This makes it impossible to load tables
>>>>>>>>>>>>>>>>>>> from different catalogs at the same time. That’s why I think the Iceberg
>>>>>>>>>>>>>>>>>>> catalog for a table should be stored in HMS table properties.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I should also explain iceberg.hive.engine.enabled flag,
>>>>>>>>>>>>>>>>>>> but I think this is long enough for now.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>>>>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I would like to start a discussion, how should we
>>>>>>>>>>>>>>>>>>>> handle properties from various sources like Iceberg, Hive or global
>>>>>>>>>>>>>>>>>>>> configuration. I've put together a short document
>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> Netflix
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
Re: Iceberg/Hive properties handling
Posted by Jacques Nadeau <ja...@gmail.com>.
Hey Peter,
Despite best intentions I made only nominal progress.
On Sun, Jan 10, 2021 at 10:33 PM Peter Vary <pv...@cloudera.com.invalid>
wrote:
> Hi Team,
>
> @Jacques Nadeau <ja...@dremio.com>: you mentioned that you might
> consolidate the thoughts in a document for the path forward. Did you have
> time for that, or have the holidays overwritten all of the plans as usual :)
>
> Other: Ryan convinced me that it would be good to move forward with the
> synchronised Hive-Iceberg property list whenever it is possible, and use
> the Iceberg Table properties as the master when not. This would be the
> solution which aligns most with the other integration solutions.
>
> Thanks, Peter
>
> Peter Vary <pv...@cloudera.com> ezt írta (időpont: 2020. dec. 10., Csü
> 8:27):
>
>> I like the strong coupling between Hive and Iceberg if we can make it
>> work. It could be beneficial for the end users, but I still have some
>> concerns.
>> We should consider the following aspects:
>> - Where was the change initiated (Hive or Spark)
>> - Which Catalog is used (HiveCatalog or other)
>> - Which Hive version is used (Hive 2/3)
>>
>> Some current constraints I think we have:
>> - There could be multiple Hive tables above a single Iceberg table with
>> most of the Catalogs (HiveCatalog being the single exception)
>> - I see no way to propagate Spark changes to HMS if the Catalog is not
>> HiveCatalog
>> - Only Hive3 has ways to propagate changes to the Iceberg table after
>> creation
>> - Hive inserts modify the table data (one Iceberg commit) and then the
>> table metadata (another Iceberg commit). This could be suboptimal but
>> solvable.
>>
>> My feeling is that the tight coupling could work as expected with only
>> the HiveCatalog using Hive3. In every other case the Iceberg and the HMS
>> properties will deviate. That is why I think it would be easier to
>> understand for the user that Iceberg and Hive are different systems with
>> different properties.
>>
>> All that said we will use Hive3 and HiveCatalog so I think we are fine
>> with 1-on-1 mapping too.
>> If we move this way we should remove the current property filtering from
>> the HiveCatalog and from the HiveIcebergMetaHook, so we are consistent.
>>
>> Thanks, Peter
>>
>> Jacques Nadeau <ja...@dremio.com> ezt írta (időpont: 2020. dec. 9.,
>> Sze 23:01):
>>
>>> Who cares if there are a few extra properties from Hive? Users may
>>>> expect those properties to be there anyway.
>>>
>>>
>>> Yeah, what is the key argument against letting them leak? What problem
>>> are people trying to solve?
>>>
>>>
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>>
>>> On Wed, Dec 9, 2020 at 12:47 PM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> I agree that #2 doesn’t really work. I also think that #4 can’t work
>>>> either. There is no way to add a prefix for HMS properties that already
>>>> exist, so the only option is to have a list of properties to suppress,
>>>> which is option #1.
>>>>
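Option #1 (a fixed suppress list) could look roughly like the sketch below; the suppressed names are the HMS-managed properties Peter lists elsewhere in the thread, and the list should be treated as illustrative rather than complete.

```python
# Illustrative sketch of option #1: forward everything from HMS to the Iceberg
# table except a hard-coded list of HMS-only properties. The suppressed names
# below are examples from this thread, not an exhaustive list.

HMS_ONLY = {
    "EXTERNAL", "external.table.purge", "bucketing_version",
    "numRows", "rawDataSize", "totalSize", "numFiles", "numFileErasureCoded",
}

def iceberg_properties(hms_props):
    """Drop HMS-managed properties before writing to the Iceberg table."""
    return {k: v for k, v in hms_props.items() if k not in HMS_ONLY}

print(iceberg_properties({"EXTERNAL": "TRUE", "commit.retry.num-retries": "5"}))
```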
>>>> I think that option #3 is a bad idea because it would lead to
>>>> surprising behavior for users. If a user creates a table using Hive DDL and
>>>> sets table properties, those properties should be present in the source of
>>>> truth Iceberg table. If a prefix was required to forward them to Iceberg,
>>>> that would create a situation where properties appear to be missing because
>>>> the user tried to use syntax that works for nearly every other table.
>>>>
>>>> That leaves either option #1 or doing nothing. I actually think that
>>>> there’s a strong argument to do nothing here and allow Hive and Iceberg
>>>> properties to be mixed in the Iceberg table. Who cares if there are a few
>>>> extra properties from Hive? Users may expect those properties to be there
>>>> anyway.
>>>>
>>>> rb
>>>>
>>>> On Mon, Dec 7, 2020 at 9:58 AM Jacques Nadeau jacques@dremio.com
>>>> <ht...@dremio.com> wrote:
>>>>
>>>> Hey Peter, thanks for updating the doc and your heads up in the other
>>>>> thread on your capacity to look at this before EOY.
>>>>>
>>>>> I'm going to try to create a specification document based on the
>>>>> discussion document you put together. I think there is general consensus
>>>>> around what you call "Spark-like catalog configuration" so I'd like to
>>>>> formalize that more.
>>>>>
>>>>> It seems like there is less consensus around the whitelist/blacklist
>>>>> side of things. You outline four approaches:
>>>>>
>>>>> 1. Hard coded HMS only property list
>>>>> 2. Hard coded Iceberg only property list
>>>>> 3. Prefix for Iceberg properties
>>>>> 4. Prefix for HMS only properties
>>>>>
>>>>> I generally think #2 is a no-go as it creates too much coupling
>>>>> between catalog implementations and core iceberg. It seems like Ryan Blue
>>>>> would prefer #4 (correct?). Any other strong opinions?
>>>>> --
>>>>> Jacques Nadeau
>>>>> CTO and Co-Founder, Dremio
>>>>>
>>>>>
>>>>> On Thu, Dec 3, 2020 at 9:27 AM Peter Vary <pv...@cloudera.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> As Jacques suggested (with the help of Zoltan) I have collected the
>>>>>> current state and the proposed solutions in a document:
>>>>>>
>>>>>> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
>>>>>>
>>>>>> My feeling is that we do not have a final decision, so tried to list
>>>>>> all the possible solutions.
>>>>>> Please comment!
>>>>>>
>>>>>> Thanks,
>>>>>> Peter
>>>>>>
>>>>>> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
>>>>>>
>>>>>> When I was working on the CREATE TABLE patch I found the following
>>>>>> TBLPROPERTIES on newly created tables:
>>>>>>
>>>>>> - external.table.purge
>>>>>> - EXTERNAL
>>>>>> - bucketing_version
>>>>>> - numRows
>>>>>> - rawDataSize
>>>>>> - totalSize
>>>>>> - numFiles
>>>>>> - numFileErasureCoded
>>>>>>
>>>>>>
>>>>>> I am afraid that we can not change the name of most of these
>>>>>> properties, and it might not be useful to have most of them along with Iceberg
>>>>>> statistics already there. Also my feeling is that this is only the tip of
>>>>>> the Iceberg (pun intended :)), so this is why I think we should have a more
>>>>>> targeted way to push properties to the Iceberg tables.
>>>>>>
>>>>>> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
>>>>>>
>>>>>> Sorry, I accidentally didn’t copy the dev list on this reply.
>>>>>> Resending:
>>>>>>
>>>>>> Also I expect that we want to add Hive write specific configs to
>>>>>> table level when the general engine independent configuration is not ideal
>>>>>> for Hive, but every Hive query for a given table should use some specific
>>>>>> config.
>>>>>>
>>>>>> Hive may need configuration, but I think these should still be kept
>>>>>> in the Iceberg table. There is no reason to make Hive config inaccessible
>>>>>> from other engines. If someone wants to view all of the config for a table
>>>>>> from Spark, the Hive config should also be included right?
>>>>>>
>>>>>> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I will ask Laszlo if he wants to update his doc.
>>>>>>>
>>>>>>> I see both pros and cons of catalog definition in config files. If
>>>>>>> there is an easy default then I do not mind any of the proposed solutions.
>>>>>>>
>>>>>>> OTOH I am in favor of the "use prefix for Iceberg table properties"
>>>>>>> solution, because in Hive it is common to add new keys to the property list
>>>>>>> - no restriction is in place (I am not even sure that the currently
>>>>>>> implemented blacklist for preventing to propagate properties to Iceberg
>>>>>>> tables is complete). Also I expect that we want to add Hive write specific
>>>>>>> configs to table level when the general engine independent configuration is
>>>>>>> not ideal for Hive, but every Hive query for a given table should use some
>>>>>>> specific config.
>>>>>>>
>>>>>>> Thanks, Peter
>>>>>>>
>>>>>>> Jacques Nadeau <ja...@dremio.com> ezt írta (időpont: 2020. dec.
>>>>>>> 1., Ke 17:06):
>>>>>>>
>>>>>>>> Would someone be willing to create a document that states the
>>>>>>>> current proposal?
>>>>>>>>
>>>>>>>> It is becoming somewhat difficult to follow this thread. I also
>>>>>>>> worry that without a complete statement of the current shape that people
>>>>>>>> may be incorrectly thinking they are in alignment.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jacques Nadeau
>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <
>>>>>>>> boroknagyz@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks, Ryan. I answered inline.
>>>>>>>>>
>>>>>>>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> This sounds like a good plan overall, but I have a couple of
>>>>>>>>>> notes:
>>>>>>>>>>
>>>>>>>>>> 1. We need to keep in mind that users plug in their own
>>>>>>>>>> catalogs, so iceberg.catalog could be a Glue or Nessie
>>>>>>>>>> catalog, not just Hive or Hadoop. I don’t think it makes much sense to use
>>>>>>>>>> separate hadoop.catalog and hive.catalog values. Those should just be names
>>>>>>>>>> for catalogs configured in Configuration, i.e., via
>>>>>>>>>> hive-site.xml. We then only need a special value for loading
>>>>>>>>>> Hadoop tables from paths.
>>>>>>>>>>
>>>>>>>>>> About extensibility, I think the usual Hive way is to use Java
>>>>>>>>> class names. So this way the value for 'iceberg.catalog' could be e.g.
>>>>>>>>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
>>>>>>>>> would need to have a factory method that constructs the catalog object from
>>>>>>>>> a properties object (Map<String, String>). E.g.
>>>>>>>>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
>>>>>>>>> 'iceberg.catalog_location' to be present in properties.
>>>>>>>>>
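A rough sketch of the factory idea described above, with the catalog named by implementation class and built from a property map. The class names and the 'iceberg.catalog_location' key come from the e-mail; the stand-in classes and registry here are invented for illustration (a Java implementation would presumably use reflection instead of a map).

```python
# Illustrative sketch: 'iceberg.catalog' names a catalog implementation, and
# each implementation constructs itself from the table's properties
# (Map<String, String> in the Java version).

class HadoopCatalog:
    def __init__(self, properties):
        # This implementation requires the catalog location to be present.
        self.location = properties["iceberg.catalog_location"]

class HiveCatalog:
    def __init__(self, properties):
        self.uri = properties.get("hive.metastore.uris")

CATALOG_CLASSES = {
    "org.apache.iceberg.hadoop.HadoopCatalog": HadoopCatalog,
    "org.apache.iceberg.hive.HiveCatalog": HiveCatalog,
}

def load_catalog(properties):
    impl = CATALOG_CLASSES[properties["iceberg.catalog"]]
    return impl(properties)

cat = load_catalog({
    "iceberg.catalog": "org.apache.iceberg.hadoop.HadoopCatalog",
    "iceberg.catalog_location": "/path/to/hadoop/catalog",
})
print(cat.location)  # /path/to/hadoop/catalog
```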
>>>>>>>>>>
>>>>>>>>>> 1. I don’t think that catalog configuration should be kept in
>>>>>>>>>> table properties. A catalog should not be loaded for each table. So I don’t
>>>>>>>>>> think we need iceberg.catalog_location. Instead, we should
>>>>>>>>>> have a way to define catalogs in the Configuration for tables
>>>>>>>>>> in the metastore to reference.
>>>>>>>>>>
>>>>>>>>>> I think it makes sense, on the other hand it would make adding
>>>>>>>>> new catalogs more heavy-weight, i.e. now you'd need to edit configuration
>>>>>>>>> files and restart/reinit services. Maybe it can be cumbersome in some
>>>>>>>>> environments.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 1. I’d rather use a prefix to exclude properties from being
>>>>>>>>>> passed to Iceberg than to include them. Otherwise, users don’t know what to
>>>>>>>>>> do to pass table properties from Hive or Impala. If we exclude a prefix or
>>>>>>>>>> specific properties, then everything but the properties reserved for
>>>>>>>>>> locating the table are passed as the user would expect.
>>>>>>>>>>
>>>>>>>>>> I don't have a strong opinion about this, but yeah, maybe this
>>>>>>>>> behavior would cause the least surprises.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <
>>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks, Peter. I answered inline.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <
>>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Zoltan,
>>>>>>>>>>>>
>>>>>>>>>>>> Answers below:
>>>>>>>>>>>>
>>>>>>>>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>>>>>>>>>> boroknagyz@cloudera.com.INVALID> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the replies. My take for the above questions are as
>>>>>>>>>>>> follows
>>>>>>>>>>>>
>>>>>>>>>>>> - Should 'iceberg.catalog' be a required property?
>>>>>>>>>>>> - Yeah, I think it would be nice if this would be required
>>>>>>>>>>>> to avoid any implicit behavior
>>>>>>>>>>>>
>>>>>>>>>>>> Currently we have a Catalogs class to get/initialize/use the
>>>>>>>>>>>> different Catalogs. At that time the decision was to use HadoopTables as a
>>>>>>>>>>>> default catalog.
>>>>>>>>>>>> It might be worthwhile to use the same class in Impala as well,
>>>>>>>>>>>> so the behavior is consistent.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Yeah, I think it'd be beneficial for us to use the Iceberg
>>>>>>>>>>> classes whenever possible. The Catalogs class is very similar to what we
>>>>>>>>>>> have currently in Impala.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> - 'hadoop.catalog' LOCATION and catalog_location
>>>>>>>>>>>> - In Impala we don't allow setting LOCATION for tables
>>>>>>>>>>>> stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the
>>>>>>>>>>>> Iceberg table's actual location. We were also thinking about using only the
>>>>>>>>>>>> table LOCATION, and set it to the catalog location, but we also found it
>>>>>>>>>>>> confusing.
>>>>>>>>>>>>
>>>>>>>>>>>> It could definitely work, but it is somewhat strange that we
>>>>>>>>>>>> have an external table location set to an arbitrary path, and we have a
>>>>>>>>>>>> different location generated by other configs. It would be nice to have the
>>>>>>>>>>>> real location set in the external table location as well.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Impala sets the real Iceberg table location for external tables.
>>>>>>>>>>> E.g. if the user issues
>>>>>>>>>>>
>>>>>>>>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>>>>>>>>> STORED AS ICEBERG
>>>>>>>>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>>>>>>>>
>>>>>>>>>>> 'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>>>>>>>>
>>>>>>>>>>> 'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>>>>>>>>
>>>>>>>>>>> If the end user had specified LOCATION, then Impala would have
>>>>>>>>>>> raised an error. But the above DDL statement is correct, so Impala loads
>>>>>>>>>>> the iceberg table via Iceberg API, then creates the HMS table and sets
>>>>>>>>>>> LOCATION to the Iceberg table location (something like
>>>>>>>>>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>>>>>>>>>
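A sketch of how that LOCATION could be derived: catalog location plus namespace path plus table name. This mirrors the example path above; it is not Impala's actual code.

```python
# Illustrative derivation of a table location under a Hadoop catalog from the
# catalog location and a dotted table identifier.

def hadoop_catalog_table_location(catalog_location, table_identifier):
    """Join the catalog location with each namespace level and the table name."""
    return "/".join([catalog_location.rstrip("/")] + table_identifier.split("."))

loc = hadoop_catalog_table_location("/path/to/hadoop/catalog",
                                    "namespace1.namespace2.ice_t")
print(loc)  # /path/to/hadoop/catalog/namespace1/namespace2/ice_t
```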
>>>>>>>>>>>
>>>>>>>>>>>> I like the flexibility of setting the table_identifier on table
>>>>>>>>>>>> level, which could help removing naming conflicts. We might want to have
>>>>>>>>>>>> this in the Iceberg Catalog implementation.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> - 'iceberg.table_identifier' for HiveCatalog
>>>>>>>>>>>> - Yeah, it doesn't add much if we only allow using the
>>>>>>>>>>>> current HMS. I think it can be only useful if we are allowing external
>>>>>>>>>>>> HMSes.
>>>>>>>>>>>> - Moving properties to SERDEPROPERTIES
>>>>>>>>>>>> - I see that these properties are used by the SerDe
>>>>>>>>>>>> classes in Hive, but I feel that these properties are just not about
>>>>>>>>>>>> serialization and deserialization. And as I see the current SERDEPROPERTIES
>>>>>>>>>>>> are things like 'field.delim', 'separatorChar', 'quoteChar', etc. So
>>>>>>>>>>>> properties about table loading more naturally belong to TBLPROPERTIES in my
>>>>>>>>>>>> opinion.
>>>>>>>>>>>>
>>>>>>>>>>>> I have seen it used both ways for HBaseSerDe. (even the wiki
>>>>>>>>>>>> page uses both :) ). Since Impala prefers TBLPROPERTIES, if we start
>>>>>>>>>>>> using a prefix for separating real Iceberg table properties from other
>>>>>>>>>>>> properties, then we can keep it at TBLPROPERTIES.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> In the google doc I also had a comment about prefixing iceberg
>>>>>>>>>>> table properties. We could use a prefix like 'iceberg.tblproperties.', and
>>>>>>>>>>> pass every property with this prefix to the Iceberg table. Currently Impala
>>>>>>>>>>> passes every table property to the Iceberg table.
>>>>>>>>>>>
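A minimal sketch of that prefix scheme, assuming the 'iceberg.tblproperties.' prefix mentioned above: only prefixed HMS properties are forwarded to the Iceberg table, with the prefix stripped. The filtering logic is illustrative, not Impala's implementation.

```python
# Illustrative prefix-based forwarding of HMS table properties to Iceberg.

PREFIX = "iceberg.tblproperties."

def forwarded_properties(hms_props):
    """Keep only prefixed properties, stripping the prefix."""
    return {k[len(PREFIX):]: v
            for k, v in hms_props.items()
            if k.startswith(PREFIX)}

hms_props = {
    "EXTERNAL": "TRUE",                                   # Hive-only, dropped
    "iceberg.tblproperties.write.format.default": "orc",  # forwarded
}
print(forwarded_properties(hms_props))  # {'write.format.default': 'orc'}
```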
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Zoltan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <
>>>>>>>>>>>> pvary@cloudera.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Based on the discussion below I understand we have the
>>>>>>>>>>>>> following kinds of properties:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Iceberg table properties - Engine independent, storage
>>>>>>>>>>>>> related parameters
>>>>>>>>>>>>> 2. "how to get to" - I think these are mostly Hive table
>>>>>>>>>>>>> specific properties, since for Spark, the Spark catalog configuration
>>>>>>>>>>>>> serves for the same purpose. I think the best place for storing these would
>>>>>>>>>>>>> be the Hive SERDEPROPERTIES, as this describes the access information for
>>>>>>>>>>>>> the SerDe. Sidenote: I think we should decide if we allow
>>>>>>>>>>>>> HiveCatalogs pointing to a different HMS and the 'iceberg.table_identifier'
>>>>>>>>>>>>> would make sense only if we allow having multiple catalogs.
>>>>>>>>>>>>> 3. Query specific properties - These are engine specific
>>>>>>>>>>>>> and might be mapped to / even override the Iceberg table properties on the
>>>>>>>>>>>>> engine specific code paths, but currently these properties have independent
>>>>>>>>>>>>> names and mapped on a case-by-case basis.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Based on this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Shall we move the "how to get to" properties to
>>>>>>>>>>>>> SERDEPROPERTIES?
>>>>>>>>>>>>> - Shall we define a prefix for setting Iceberg table
>>>>>>>>>>>>> properties from Hive queries and omitting other engine specific properties?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <ma...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I like these suggestions, comments inline below on the last
>>>>>>>>>>>>> round...
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <
>>>>>>>>>>>>> boroknagyz@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>>>>>>>>>> information about table loading in HMS table properties. We are just a bit
>>>>>>>>>>>>>> more explicit about which catalog to use.
>>>>>>>>>>>>>> We have table property 'iceberg.catalog' to determine the
>>>>>>>>>>>>>> catalog type, right now the supported values are 'hadoop.tables',
>>>>>>>>>>>>>> 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be
>>>>>>>>>>>>>> set based on the catalog type.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm all for renaming this, having "mr" in the property name is
>>>>>>>>>>>>> confusing.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - hadoop.tables
>>>>>>>>>>>>>> - the table location is used to load the table
>>>>>>>>>>>>>>
>>>>>>>>>>>>> The only question I have is: should we have this as the
>>>>>>>>>>>>> default? i.e. if you don't set a catalog it will assume it's HadoopTables
>>>>>>>>>>>>> and use the location? Or should we require this property to be here to be
>>>>>>>>>>>>> consistent and avoid any "magic"?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - hadoop.catalog
>>>>>>>>>>>>>> - Required table property 'iceberg.catalog_location'
>>>>>>>>>>>>>> specifies the location of the hadoop catalog in the file system
>>>>>>>>>>>>>> - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>>> specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>>> is used as table identifier
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I like this as it would allow you to use a different database
>>>>>>>>>>>>> and table name in Hive as opposed to the Hadoop Catalog - at the moment
>>>>>>>>>>>>> they have to match. The only thing here is that I think Hive requires a
>>>>>>>>>>>>> table LOCATION to be set and it's then confusing as there are now two
>>>>>>>>>>>>> locations on the table. I'm not sure whether in the Hive storage handler or
>>>>>>>>>>>>> SerDe etc. we can get Hive to not require that and maybe even disallow it
>>>>>>>>>>>>> from being set. That would probably be best in conjunction with this.
>>>>>>>>>>>>> Another solution would be to not have the 'iceberg.catalog_location'
>>>>>>>>>>>>> property but instead use the table LOCATION for this but that's a bit
>>>>>>>>>>>>> confusing from a Hive point of view.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> - hive.catalog
>>>>>>>>>>>>>> - Optional table property 'iceberg.table_identifier'
>>>>>>>>>>>>>> specifies the table id. If it's not set, then <database_name>.<table_name>
>>>>>>>>>>>>>> is used as table identifier
>>>>>>>>>>>>>> - We have the assumption that the current Hive
>>>>>>>>>>>>>> metastore stores the table, i.e. we don't support external Hive
>>>>>>>>>>>>>> metastores currently
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> These sound fine for Hive catalog tables that are created
>>>>>>>>>>>>> outside of the automatic Hive table creation (see
>>>>>>>>>>>>> https://iceberg.apache.org/hive/ -> Using Hive Catalog) we'd
>>>>>>>>>>>>> just need to document how you can create these yourself and that one could
>>>>>>>>>>>>> use a different Hive database and table etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Independent of catalog implementations, but we also have
>>>>>>>>>>>>>> table property 'iceberg.file_format' to specify the file format for the
>>>>>>>>>>>>>> data files.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK, I don't think we need that for Hive?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> We haven't released it yet, so we are open to changes, but I
>>>>>>>>>>>>>> think these properties are reasonable and it would be great if we could
>>>>>>>>>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>>>>>>>>>> metastore of tables.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> If others agree I think we should create an issue where we
>>>>>>>>>>>>> document the above changes so it's very clear what we're doing and can then
>>>>>>>>>>>>> go an implement them and update the docs etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Zoltan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <
>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> #4 is correct because we provide some information that is
>>>>>>>>>>>>>>> informational (Hive schema) or tracked only by the metastore (best-effort
>>>>>>>>>>>>>>> current user). I also agree that it would be good to have a table
>>>>>>>>>>>>>>> identifier in HMS table metadata when loading from an external table. That
>>>>>>>>>>>>>>> gives us a way to handle name conflicts.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <
>>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> db1.table1_etl_branch =>
>>>>>>>>>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <
>>>>>>>>>>>>>>>> jacques@dremio.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I agree with Ryan on the core principles here. As I
>>>>>>>>>>>>>>>>> understand them:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. Iceberg metadata describes all properties of a table
>>>>>>>>>>>>>>>>> 2. Hive table properties describe "how to get to"
>>>>>>>>>>>>>>>>> Iceberg metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>>>>>>>>>> 3. There could be default "how to get to" information
>>>>>>>>>>>>>>>>> set at a global level
>>>>>>>>>>>>>>>>> 4. Best-effort schema should be stored in the table
>>>>>>>>>>>>>>>>> properties in HMS. This should be done for information schema retrieval
>>>>>>>>>>>>>>>>> purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is that a fair summary of your statements Ryan (except 4,
>>>>>>>>>>>>>>>>> which I just added)?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> One comment I have on #2 is that for different catalogs
>>>>>>>>>>>>>>>>> and use cases, I think it can be somewhat more complex: it would be
>>>>>>>>>>>>>>>>> desirable for a table that initially existed without Hive, and was later
>>>>>>>>>>>>>>>>> exposed in Hive, to support a ptr/path/token for how the table is named
>>>>>>>>>>>>>>>>> externally. For example, in a Nessie context we support arbitrary paths for
>>>>>>>>>>>>>>>>> an Iceberg table (such as folder1.folder2.folder3.table1). If you then want
>>>>>>>>>>>>>>>>> to expose that table to Hive, you might have this mapping for #2
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Similarly, you might want to expose a particular branch
>>>>>>>>>>>>>>>>> version of a table. So it might say:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Just saying that the address to the table in the catalog
>>>>>>>>>>>>>>>>> could itself have several properties. The key being that no matter what
>>>>>>>>>>>>>>>>> those are, we should follow #1 and only store properties that are about the
>>>>>>>>>>>>>>>>> ptr, not the content/metadata.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it.
>>>>>>>>>>>>>>>>> Can someone confirm that it is true? And that it is possible/not
>>>>>>>>>>>>>>>>> problematic?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking
>>>>>>>>>>>>>>>>>> about these problems as well, so this is a good time to have a discussion
>>>>>>>>>>>>>>>>>> about Hive config.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think that Hive configuration should work mostly like
>>>>>>>>>>>>>>>>>> other engines, where different configurations are used for different
>>>>>>>>>>>>>>>>>> purposes. Different purposes means that there is not a global configuration
>>>>>>>>>>>>>>>>>> priority. Hopefully, I can explain how we use the different config sources
>>>>>>>>>>>>>>>>>> elsewhere to clarify.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it
>>>>>>>>>>>>>>>>>> has a Hadoop Configuration, but it also has its own global configuration.
>>>>>>>>>>>>>>>>>> There are also Iceberg table properties, and all of the various Hive
>>>>>>>>>>>>>>>>>> properties if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The first step is to simplify where we can, so we
>>>>>>>>>>>>>>>>>> effectively eliminate 2 sources of config:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - The Hadoop Configuration is only used to
>>>>>>>>>>>>>>>>>> instantiate Hadoop classes, like FileSystem. Iceberg should not use it for
>>>>>>>>>>>>>>>>>> any other config.
>>>>>>>>>>>>>>>>>> - Config in the Hive MetaStore is only used to
>>>>>>>>>>>>>>>>>> identify that a table is Iceberg and point to its metadata location. All
>>>>>>>>>>>>>>>>>> other config in HMS is informational. For example, the input format is
>>>>>>>>>>>>>>>>>> FileInputFormat so that non-Iceberg readers cannot actually instantiate the
>>>>>>>>>>>>>>>>>> format (it’s abstract) but it is available so they also don’t fail trying
>>>>>>>>>>>>>>>>>> to load the class. Table-specific config should not be stored in table or
>>>>>>>>>>>>>>>>>> serde properties.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> That leaves Spark configuration and Iceberg table
>>>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Iceberg differs from other tables because it is
>>>>>>>>>>>>>>>>>> opinionated: data configuration should be maintained at the table level.
>>>>>>>>>>>>>>>>>> This is cleaner for users because config is standardized across engines and
>>>>>>>>>>>>>>>>>> in one place. And it also enables services that analyze a table and update
>>>>>>>>>>>>>>>>>> its configuration to tune options that users almost never do, like row
>>>>>>>>>>>>>>>>>> group or stripe size in the columnar formats. Iceberg table configuration
>>>>>>>>>>>>>>>>>> is used to configure table-specific concerns and behavior.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Spark configuration is used for engine-specific concerns,
>>>>>>>>>>>>>>>>>> and runtime overrides. A good example of an engine-specific concern is the
>>>>>>>>>>>>>>>>>> catalogs that are available to load Iceberg tables. Spark has a way to load
>>>>>>>>>>>>>>>>>> and configure catalog implementations and Iceberg uses that for all
>>>>>>>>>>>>>>>>>> catalog-level config. Runtime overrides are things like target split size.
>>>>>>>>>>>>>>>>>> Iceberg has a table-level default split size in table properties, but this
>>>>>>>>>>>>>>>>>> can be overridden by a Spark option for each table, as well as an option
>>>>>>>>>>>>>>>>>> passed to the individual read. Note that these necessarily have different
>>>>>>>>>>>>>>>>>> config names for how they are used: Iceberg uses
>>>>>>>>>>>>>>>>>> read.split.target-size and the read-specific option is
>>>>>>>>>>>>>>>>>> target-size.
>>>>>>>>>>>>>>>>>>
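[Editor's note] The override chain described above (per-read option beats the Iceberg table property, which beats a built-in default) can be sketched as follows. The property names `read.split.target-size` and `target-size` come from the email; the helper function and dict-based lookup are illustrative, not the actual Iceberg implementation.

```python
# Illustrative sketch of layered config resolution for split size.
# Property names follow the email; the resolution helper is hypothetical.

TABLE_PROPERTY = "read.split.target-size"  # Iceberg table-level default
READ_OPTION = "target-size"                # per-read override

def resolve_split_size(table_properties, read_options, default=134217728):
    """Per-read option wins over the table property, which wins over the default."""
    if READ_OPTION in read_options:
        return int(read_options[READ_OPTION])
    if TABLE_PROPERTY in table_properties:
        return int(table_properties[TABLE_PROPERTY])
    return default

# A table default of 512 MB, overridden to 128 MB for a single scan:
props = {"read.split.target-size": str(512 * 1024 * 1024)}
print(resolve_split_size(props, {}))                            # 536870912
print(resolve_split_size(props, {"target-size": "134217728"}))  # 134217728
```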
>>>>>>>>>>>>>>>>>> Applying this to Hive is a little strange for a couple
>>>>>>>>>>>>>>>>>> of reasons. First, Hive’s engine configuration *is* a
>>>>>>>>>>>>>>>>>> Hadoop Configuration. As a result, I think the right place to store
>>>>>>>>>>>>>>>>>> engine-specific config is there, including Iceberg catalogs using a
>>>>>>>>>>>>>>>>>> strategy similar to what Spark does: what external Iceberg catalogs are
>>>>>>>>>>>>>>>>>> available and their configuration should come from the HiveConf.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The second way Hive is strange is that Hive needs to use
>>>>>>>>>>>>>>>>>> its own MetaStore to track Hive table concerns. The MetaStore may have
>>>>>>>>>>>>>>>>>> tables created by an Iceberg HiveCatalog, and Hive also needs to be able to
>>>>>>>>>>>>>>>>>> load tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - There should be a default HiveCatalog, configured
>>>>>>>>>>>>>>>>>> with the current MetaStore URI, for HiveCatalog tables tracked in the
>>>>>>>>>>>>>>>>>> MetaStore
>>>>>>>>>>>>>>>>>> - Other catalogs should be defined in HiveConf
>>>>>>>>>>>>>>>>>> - HMS table properties should be used to determine
>>>>>>>>>>>>>>>>>> how to load a table: using a Hadoop location, using the default metastore
>>>>>>>>>>>>>>>>>> catalog, or using an external Iceberg catalog
>>>>>>>>>>>>>>>>>> - If there is a metadata_location, then use the
>>>>>>>>>>>>>>>>>> HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>>>>>>>>> - If there is a catalog property, then load that
>>>>>>>>>>>>>>>>>> catalog and use it to load the table identifier, or maybe an identifier
>>>>>>>>>>>>>>>>>> from HMS table properties
>>>>>>>>>>>>>>>>>> - If there is no catalog or metadata_location,
>>>>>>>>>>>>>>>>>> then use HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>>>>>>>>>
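[Editor's note] The three-way loading rule in the list above can be sketched as a resolution function. The property names (`metadata_location`, `catalog`) follow the email; the returned strategy tags stand in for the HiveCatalog, a HiveConf-defined external catalog, and HadoopTables — they are placeholders, not real Iceberg APIs.

```python
# Illustrative sketch of the proposed per-table loading decision.
# Property names come from the email; the strategy tags are placeholders.

def resolve_loader(hms_table_props):
    """Decide how to load an Iceberg table from its HMS table properties.

    Returns a (strategy, argument) pair standing in for the HiveCatalog,
    an external Iceberg catalog, and HadoopTables respectively.
    """
    if "metadata_location" in hms_table_props:
        # Tracked by this metastore: load via the default HiveCatalog.
        return ("hive-catalog", hms_table_props["metadata_location"])
    if "catalog" in hms_table_props:
        # Load the named external catalog (defined in HiveConf), then the table.
        return ("external-catalog", hms_table_props["catalog"])
    # Neither property set: treat the table location as a HadoopTables path.
    return ("hadoop-tables", hms_table_props.get("location"))
```

Because the decision is made per table, tables from different catalogs can be mixed in one query, which the current global setting does not allow.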
>>>>>>>>>>>>>>>>>> This would make it possible to access all types of
>>>>>>>>>>>>>>>>>> Iceberg tables in the same query, and would match how Spark and Flink
>>>>>>>>>>>>>>>>>> configure catalogs. Other than the configuration above, I don’t think that
>>>>>>>>>>>>>>>>>> config in HMS should be used at all, like how the other engines work.
>>>>>>>>>>>>>>>>>> Iceberg is the source of truth for table metadata, HMS stores how to load
>>>>>>>>>>>>>>>>>> the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This isn’t quite how configuration works right now.
>>>>>>>>>>>>>>>>>> Currently, the catalog is controlled by a HiveConf property,
>>>>>>>>>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables will
>>>>>>>>>>>>>>>>>> be used to load table locations. If it is set, then that catalog will be
>>>>>>>>>>>>>>>>>> used to load all tables by name. This makes it impossible to load tables
>>>>>>>>>>>>>>>>>> from different catalogs at the same time. That’s why I think the Iceberg
>>>>>>>>>>>>>>>>>> catalog for a table should be stored in HMS table properties.
>>>>>>>>>>>>>>>>>>
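[Editor's note] The current, global behavior described above can be sketched the same way. The HiveConf key `iceberg.mr.catalog` is the one named in the email; the lookup helper itself is illustrative.

```python
# Illustrative sketch of the current global-catalog behavior.
# The HiveConf key is from the email; the helper is hypothetical.

def current_loader(hive_conf):
    """One global catalog for all tables, or HadoopTables if none is set."""
    catalog = hive_conf.get("iceberg.mr.catalog")
    if catalog is None:
        return "hadoop-tables"  # table locations are loaded directly
    return catalog              # every table resolves through this one catalog
```

Note that the return value does not depend on the table at all, which is exactly the limitation being discussed.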
>>>>>>>>>>>>>>>>>> I should also explain iceberg.hive.engine.enabled flag,
>>>>>>>>>>>>>>>>>> but I think this is long enough for now.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>>>>>>>>>> lpinter@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I would like to start a discussion, how should we handle
>>>>>>>>>>>>>>>>>>> properties from various sources like Iceberg, Hive or global configuration.
>>>>>>>>>>>>>>>>>>> I've put together a short document
>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>>