Posted to issues@iceberg.apache.org by "Thijsvandepoll (via GitHub)" <gi...@apache.org> on 2023/05/10 09:08:32 UTC

[GitHub] [iceberg] Thijsvandepoll opened a new issue, #7574: Iceberg with Hive Metastore does not create a catalog in Spark and uses default

Thijsvandepoll opened a new issue, #7574:
URL: https://github.com/apache/iceberg/issues/7574

   ### Apache Iceberg version
   
   1.2.1 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I have been experiencing some (unexpected?) behavior where a catalog defined in Spark is not reflected in the Hive Metastore. I followed the Spark configuration in the [documentation](https://iceberg.apache.org/docs/latest/spark-configuration/), which suggests that it should create a new catalog with the configured name. Everything works as expected, except that the catalog is NOT inserted into the Hive Metastore. This has some implications, which I will showcase using an example.
   
   Here is a sample script in PySpark:
   ```python
   import os
   from pyspark.sql import SparkSession
   
   deps = [
       "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.1",
       "org.apache.iceberg:iceberg-aws:1.2.1",
       "software.amazon.awssdk:bundle:2.17.257",
       "software.amazon.awssdk:url-connection-client:2.17.257"
   ]
   os.environ["PYSPARK_SUBMIT_ARGS"] = f"--packages {','.join(deps)} pyspark-shell"
   os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
   os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
   os.environ["AWS_REGION"] = "eu-east-1"
   
   
   catalog = "hive_catalog"
   spark = SparkSession.\
       builder.\
       appName("Iceberg Reader").\
       config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions").\
       config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog").\
       config(f"spark.sql.catalog.{catalog}.type", "hive").\
       config(f"spark.sql.catalog.{catalog}.uri", "thrift://localhost:9083").\
       config(f"spark.sql.catalog.{catalog}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") .\
       config(f"spark.sql.catalog.{catalog}.s3.endpoint", "http://localhost:9000").\
       config(f"spark.sql.catalog.{catalog}.warehouse", "s3a://lakehouse").\
       config("hive.metastore.uris", "thrift://localhost:9083").\
       enableHiveSupport().\
       getOrCreate()
   
   # Raises error
   spark.sql("CREATE NAMESPACE wrong_catalog.new_db;")
   
   # Correct creation of namespace
   spark.sql(f"CREATE NAMESPACE {catalog}.new_db;")
   
   # Create table
   spark.sql(f"CREATE TABLE {catalog}.new_db.new_table (col1 INT, col2 STRING);")
   
   # Insert data
   spark.sql(f"INSERT INTO {catalog}.new_db.new_table VALUES (1, 'first'), (2, 'second');")
   
   # Read data
   spark.sql(f"SELECT * FROM {catalog}.new_db.new_table;").show()
   #|col1|  col2|
   #+----+------+
   #|   1| first|
   #|   2|second|
   #+----+------+
   
   # Read metadata
   spark.sql(f"SELECT * FROM {catalog}.new_db.new_table.files;").show()
   #+-------+--------------------+-----------+-------+------------+------------------+------------------+----------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
   #|content|           file_path|file_format|spec_id|record_count|file_size_in_bytes|      column_sizes|    value_counts|null_value_counts|nan_value_counts|        lower_bounds|        upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|    readable_metrics|
   #+-------+--------------------+-----------+-------+------------+------------------+------------------+----------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
   #|      0|s3a://lakehouse/n...|    PARQUET|      0|           1|               652|{1 -> 47, 2 -> 51}|{1 -> 1, 2 -> 1}| {1 -> 0, 2 -> 0}|              {}|{1 -> , 2 -> ...|{1 -> , 2 -> ...|        null|          [4]|        null|            0|{{47, 1, 0, null,...|
   #|      0|s3a://lakehouse/n...|    PARQUET|      0|           1|               660|{1 -> 47, 2 -> 53}|{1 -> 1, 2 -> 1}| {1 -> 0, 2 -> 0}|              {}|{1 -> , 2 -> ...|{1 -> , 2 -> ...|        null|          [4]|        null|            0|{{47, 1, 0, null,...|
   #+-------+--------------------+-----------+-------+------------+------------------+------------------+----------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
   ```
   
   This all looks fine: it created a namespace and a table, and inserted data into the table. However, the `CTLGS` table in the Hive Metastore shows the problem:
   ```
   |CTLG_ID|NAME|DESC                    |LOCATION_URI    |
   |-------|----|------------------------|----------------|
   |1      |hive|Default catalog for Hive|s3a://lakehouse/|
   ```
   It does NOT insert a new catalog with the respective catalog name. We can see that the namespaces and tables actually have been inserted in the Hive Metastore though (`DBS` and `TBLS`):
   ```
   |DB_ID|DESC                 |DB_LOCATION_URI          |NAME   |OWNER_NAME    |OWNER_TYPE|CTLG_NAME|
   |-----|---------------------|-------------------------|-------|--------------|----------|---------|
   |1    |Default Hive database|s3a://lakehouse/         |default|public        |ROLE      |hive     |
   |2    |                     |s3a://lakehouse/new_db.db|new_db |thijsvandepoll|USER      |hive     |
   
   
   |TBL_ID|CREATE_TIME  |DB_ID|LAST_ACCESS_TIME|OWNER         |OWNER_TYPE|RETENTION    |SD_ID|TBL_NAME |TBL_TYPE      |VIEW_EXPANDED_TEXT|VIEW_ORIGINAL_TEXT|IS_REWRITE_ENABLED|
   |------|-------------|-----|----------------|--------------|----------|-------------|-----|---------|--------------|------------------|------------------|------------------|
   |1     |1.683.707.647|2    |80.467          |thijsvandepoll|USER      |2.147.483.647|1    |new_table|EXTERNAL_TABLE|                  |                  |0                 |
   ```
   
   This means that it uses the Hive default catalog instead of the provided name. I am not sure whether this is expected behavior. Everything else has worked fine so far. However, a problem arises when we want to create a namespace with the same name in another catalog:
   
   ```python
   import os
   from pyspark.sql import SparkSession
   
   deps = [
       "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.1",
       "org.apache.iceberg:iceberg-aws:1.2.1",
       "software.amazon.awssdk:bundle:2.17.257",
       "software.amazon.awssdk:url-connection-client:2.17.257"
   ]
   os.environ["PYSPARK_SUBMIT_ARGS"] = f"--packages {','.join(deps)} pyspark-shell"
   os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
   os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
   os.environ["AWS_REGION"] = "eu-east-1"
   
   catalog = "other_catalog"
   spark = SparkSession.\
       builder.\
       appName("Iceberg Reader").\
       config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions").\
       config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog").\
       config(f"spark.sql.catalog.{catalog}.type", "hive").\
       config(f"spark.sql.catalog.{catalog}.uri", "thrift://localhost:9083").\
       config(f"spark.sql.catalog.{catalog}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") .\
       config(f"spark.sql.catalog.{catalog}.s3.endpoint", "http://localhost:9000").\
       config(f"spark.sql.catalog.{catalog}.warehouse", "s3a://lakehouse").\
       config("hive.metastore.uris", "thrift://localhost:9083").\
       enableHiveSupport().\
       getOrCreate()
   
   # Raises an error because the namespace already exists
   spark.sql(f"CREATE NAMESPACE {catalog}.new_db;")
   # pyspark.sql.utils.AnalysisException: Namespace 'new_db' already exists
   
   # Create another namespace
   spark.sql(f"CREATE NAMESPACE {catalog}.other_db;")
   
   # Try to access data from other catalog using current catalog
   spark.sql(f"SELECT * FROM {catalog}.new_db.new_table;").show()
   #|col1|  col2|
   #+----+------+
   #|   1| first|
   #|   2|second|
   #+----+------+
   ```
   
   Now we can see that even though we are referencing another catalog, it still uses the Hive default catalog implicitly. We can see that by viewing `DBS` in the Hive Metastore:
   ```
   |DB_ID|DESC                 |DB_LOCATION_URI            |NAME    |OWNER_NAME    |OWNER_TYPE|CTLG_NAME|
   |-----|---------------------|---------------------------|--------|--------------|----------|---------|
   |1    |Default Hive database|s3a://lakehouse/           |default |public        |ROLE      |hive     |
   |2    |                     |s3a://lakehouse/new_db.db  |new_db  |thijsvandepoll|USER      |hive     |
   |3    |                     |s3a://lakehouse/other_db.db|other_db|thijsvandepoll|USER      |hive     |
   ```
   
   
   Basically, this means that Iceberg together with the Hive Metastore has no notion of a catalog: it is just a list of namespaces and tables. In effect, there seems to be only a single catalog.
   
   
   Can anyone help me understand what is going on? Am I missing some configuration? Is this expected behavior or a bug? Thanks in advance!
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on issue #7574: Iceberg with Hive Metastore does not create a catalog in Spark and uses default

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on issue #7574:
URL: https://github.com/apache/iceberg/issues/7574#issuecomment-1543530638

   Hey @Thijsvandepoll, thanks for the detailed description. The concept of a catalog exists in both Spark and Hive, and the two are independent. For example, you can rename the catalog in Spark without impacting Hive. I'm not aware of any efforts to integrate the two, but this would be more of a discussion between Spark and Hive than one for Iceberg.
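
As a rough illustration of that independence, the sketch below treats the Spark catalog name as nothing more than a prefix on Spark property keys. `catalog_options` and `make_conf` are hypothetical helpers written for this note, not Spark or Iceberg APIs; the property names and values are the ones from the script in the issue. It is an assumption of this sketch that only the un-prefixed options determine which metastore is contacted.

```python
PREFIX = "spark.sql.catalog."


def make_conf(name: str) -> dict:
    """Build the catalog properties from the issue for a given Spark catalog name."""
    return {
        f"{PREFIX}{name}": "org.apache.iceberg.spark.SparkCatalog",
        f"{PREFIX}{name}.type": "hive",
        f"{PREFIX}{name}.uri": "thrift://localhost:9083",
        f"{PREFIX}{name}.warehouse": "s3a://lakehouse",
    }


def catalog_options(conf: dict, name: str) -> dict:
    """Return the options set for one catalog, with the '<prefix><name>.' part stripped."""
    prefix = f"{PREFIX}{name}."
    return {k[len(prefix):]: v for k, v in conf.items() if k.startswith(prefix)}


# Two different Spark-side names, identical options once the name is stripped.
first = catalog_options(make_conf("hive_catalog"), "hive_catalog")
second = catalog_options(make_conf("other_catalog"), "other_catalog")
print(first == second)
```

Under that assumption, renaming the catalog in Spark changes only the keys on the Spark side; the `uri` and `warehouse` values stay the same.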


Re: [I] Iceberg with Hive Metastore does not create a catalog in Spark and uses default [iceberg]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #7574:
URL: https://github.com/apache/iceberg/issues/7574#issuecomment-1832897692

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'


[GitHub] [iceberg] Thijsvandepoll commented on issue #7574: Iceberg with Hive Metastore does not create a catalog in Spark and uses default

Posted by "Thijsvandepoll (via GitHub)" <gi...@apache.org>.
Thijsvandepoll commented on issue #7574:
URL: https://github.com/apache/iceberg/issues/7574#issuecomment-1545648013

   Thanks a lot @dramaticlly! This solves the problem.


[GitHub] [iceberg] dramaticlly commented on issue #7574: Iceberg with Hive Metastore does not create a catalog in Spark and uses default

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on issue #7574:
URL: https://github.com/apache/iceberg/issues/7574#issuecomment-1545914488

   @Thijsvandepoll can you share a bit more? What is happening now, and what did you expect?


[GitHub] [iceberg] dramaticlly commented on issue #7574: Iceberg with Hive Metastore does not create a catalog in Spark and uses default

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on issue #7574:
URL: https://github.com/apache/iceberg/issues/7574#issuecomment-1544908883

   I think I actually ran into this problem before. It was because Spark by default uses an in-memory catalog instead of HMS.
   
   Can you try again with this additional Spark conf: `--conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083`?
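
Since the script in the issue launches Spark through `PYSPARK_SUBMIT_ARGS` rather than `spark-submit`, one way to pass that flag is to append it to the same environment variable. The following is a minimal sketch; `submit_args` is a hypothetical helper, and only one of the original packages is listed to keep it short.

```python
import os


def submit_args(packages: list, conf: dict) -> str:
    """Assemble a PYSPARK_SUBMIT_ARGS string from --packages and --conf entries."""
    parts = []
    if packages:
        parts.append(f"--packages {','.join(packages)}")
    parts.extend(f"--conf {key}={value}" for key, value in conf.items())
    parts.append("pyspark-shell")  # PySpark expects this token last
    return " ".join(parts)


args = submit_args(
    ["org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.1"],
    {"spark.hadoop.hive.metastore.uris": "thrift://localhost:9083"},
)
# Must be set before the SparkSession (and its JVM) is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = args
print(args)
```

Setting the same key with `.config("spark.hadoop.hive.metastore.uris", "thrift://localhost:9083")` on the session builder should have the same effect, since Spark copies `spark.hadoop.*` properties into the Hadoop configuration.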


Re: [I] Iceberg with Hive Metastore does not create a catalog in Spark and uses default [iceberg]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #7574:
URL: https://github.com/apache/iceberg/issues/7574#issuecomment-1813498581

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.


Re: [I] Iceberg with Hive Metastore does not create a catalog in Spark and uses default [iceberg]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed issue #7574: Iceberg with Hive Metastore does not create a catalog in Spark and uses default
URL: https://github.com/apache/iceberg/issues/7574


[GitHub] [iceberg] Thijsvandepoll commented on issue #7574: Iceberg with Hive Metastore does not create a catalog in Spark and uses default

Posted by "Thijsvandepoll (via GitHub)" <gi...@apache.org>.
Thijsvandepoll commented on issue #7574:
URL: https://github.com/apache/iceberg/issues/7574#issuecomment-1546638310

   Yeah, I thought it fixed it, but I was mistaken. What I am seeing is:
   - A single registered catalog in the HMS (see example above).
   - The rest works fine.
   - If I run `CREATE NAMESPACE hive_catalog.new_namespace;`, then start a new session for another catalog and create a namespace there with the same name (`CREATE NAMESPACE other_catalog.new_namespace;`), it reports a conflict that `new_namespace` already exists, because `other_catalog` and `hive_catalog` actually point to the same catalog in the HMS, called `hive`.
   
   What I am expecting is:
   - Multiple catalogs are created/registered under the provided names, so that there are no name conflicts across different catalogs.
   
   Hope this clarifies what I am seeing and expecting!


[GitHub] [iceberg] Thijsvandepoll commented on issue #7574: Iceberg with Hive Metastore does not create a catalog in Spark and uses default

Posted by "Thijsvandepoll (via GitHub)" <gi...@apache.org>.
Thijsvandepoll commented on issue #7574:
URL: https://github.com/apache/iceberg/issues/7574#issuecomment-1554016183

   Yeah it looks like that!


[GitHub] [iceberg] dramaticlly commented on issue #7574: Iceberg with Hive Metastore does not create a catalog in Spark and uses default

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on issue #7574:
URL: https://github.com/apache/iceberg/issues/7574#issuecomment-1553608305

   Thank you @Thijsvandepoll for the clarification. I guess in this case the catalog is really identified by the URL behind it rather than by its name. So if you point both `hive_catalog` and `other_catalog` to the same Hive URL, such as localhost:9083, then they are essentially the same from Hive's point of view.
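
A toy model may make this concrete. The sketch below is not Hive or Iceberg code: `ToyMetastore` and `connect` are hypothetical stand-ins, and the only behavior modeled is the one described above, namely that state is keyed by the metastore URI and not by the Spark catalog name.

```python
class ToyMetastore:
    """In-memory stand-in for a single Hive Metastore instance (illustration only)."""

    def __init__(self):
        self.namespaces = set()

    def create_namespace(self, name: str) -> None:
        if name in self.namespaces:
            raise ValueError(f"Namespace '{name}' already exists")
        self.namespaces.add(name)


# One metastore object per URI: catalog names that share a URI share its state.
metastores = {}


def connect(uri: str) -> ToyMetastore:
    """Return the metastore for this URI, creating it on first use."""
    if uri not in metastores:
        metastores[uri] = ToyMetastore()
    return metastores[uri]


catalogs = {
    "hive_catalog": connect("thrift://localhost:9083"),
    "other_catalog": connect("thrift://localhost:9083"),
}

catalogs["hive_catalog"].create_namespace("new_db")
try:
    # Same URI, different Spark-side name: the namespace is already there.
    catalogs["other_catalog"].create_namespace("new_db")
    conflict = False
except ValueError as err:
    conflict = True
    print(err)
```

In this model, giving `other_catalog` a different URI (that is, a separate metastore instance) would remove the conflict; whether that is practical depends on the deployment.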

