Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/17 20:50:21 UTC

[GitHub] [iceberg] Blake-Guo opened a new issue, #6613: Multiple SparkSessions interact with Iceberg Table

Blake-Guo opened a new issue, #6613:
URL: https://github.com/apache/iceberg/issues/6613

   ### Apache Iceberg version
   
   0.12.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I explored using multiple SparkSessions (to connect to different data sources/data clusters) to load an Iceberg table, and I found some weird behavior.
   
   If I use the **new** SparkSession (with an incorrect parameter such as `spark.sql.catalog.mycatalog.uri`) to access the table created by the previous SparkSession, first through (1) `spark.read().*.load("*")` and then by (2) running some SQL on that table, everything still works (even with the incorrect parameter).
   
   The full test is given below:
   
   ```
     @Test
     public void multipleSparkSessions() throws AnalysisException {
       // Create the 1st SparkSession
       String endpoint = String.format("http://localhost:%s/metastore", port);
   
       ctx = SparkSession
           .builder()
           .master("local")
           .config("spark.ui.enabled", false)
           .config("spark.sql.catalog.mycatalog", "org.apache.iceberg.spark.SparkCatalog")
           .config("spark.sql.catalog.mycatalog.type", "hive")
           .config("spark.sql.catalog.mycatalog.uri", endpoint)
           .config("spark.sql.catalog.mycatalog.cache-enabled", "false")
           .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
           .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
           .getOrCreate();
   
       // Create a table with the SparkSession
       String tableName = String.format("%s.%s", "test", Integer.toHexString(RANDOM.nextInt()));
       ctx.sql(String.format("CREATE TABLE mycatalog.%s USING iceberg "
           + "AS SELECT * FROM VALUES ('michael', 31), ('david', 45) AS (name, age)", tableName));
   
   
       // Create a new SparkSession
       SparkSession newSession = ctx.newSession();
       newSession.conf().set("spark.sql.catalog.mycatalog.uri", "http://non_exist_address");
   
       // Access the created dataset above with the new SparkSession through session.read()...load()
       List<Row> dataset2 = newSession.read()
           .format("iceberg")
           .load(String.format("mycatalog.%s", tableName)).collectAsList();
       dataset2.forEach(r -> System.out.println(r));
   
       // Access the dataset through SQL
       newSession.sql(
           String.format("select * from mycatalog.%s", tableName)).collectAsList();
     }
   ```
   
   But if I use the new SparkSession to access the table through (1) `newSession.sql` first, the execution fails, and then (2) `read().*.load("*")` fails as well, with the error `java.lang.RuntimeException: Failed to get table info from metastore test.3d79f679`.
   
   IMO this makes more sense: given that I provided an incorrect catalog URI, the SparkSession shouldn't be able to locate that table.
   
   
   ```
     @Test
     public void multipleSparkSessions() throws AnalysisException {
       // ...same as above...
   
   
       // Access the dataset through SQL first
       assertThrows(java.lang.RuntimeException.class,() -> newSession.sql(
           String.format("select * from mycatalog.%s", tableName)).collectAsList());
   
       // Access the created dataset above with the new SparkSession through session.read()...load()
       assertThrows(java.lang.RuntimeException.class,() -> newSession.read()
           .format("iceberg")
           .load(String.format("mycatalog.%s", tableName)).collectAsList());
     }
   
   
   ```
   
   Any idea what could lead to these two different behaviors with `spark.read().load()` versus `spark.sql()` in different sequences?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on issue #6613: Multiple SparkSessions interact with Iceberg Table

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on issue #6613:
URL: https://github.com/apache/iceberg/issues/6613#issuecomment-1386049515

   This is almost certainly a Spark SQLConf weirdness. I've seen stranger behaviors here too, depending on whether you are using a notebook, REPL, or application code. If you have this in a debugger, I would put a breakpoint in the initialization code of the CatalogManager class to see when it's triggered and what conf it has at that point.
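
   To illustrate why the timing of that initialization could matter, here is a toy model in plain Java. It is only a sketch under loud assumptions: `ToyCatalog`, `LazyCatalogRegistry`, and `LazyCatalogDemo` are hypothetical names, the `localhost:9083` URI is made up, and none of this is Spark's or Iceberg's actual code. It shows one generic way order-dependence can arise: a catalog that is built lazily on first lookup and then cached keeps the conf it saw at that moment. Whether this is what happens in the test above is not confirmed anywhere in this thread.

   ```java
   import java.util.HashMap;
   import java.util.Map;

   // Toy model only: these classes do not exist in Spark or Iceberg.
   final class ToyCatalog {
     final String uri;

     ToyCatalog(String uri) {
       this.uri = uri;
     }
   }

   final class LazyCatalogRegistry {
     private final Map<String, ToyCatalog> initialized = new HashMap<>();

     // Builds the catalog on the first lookup from the conf passed in by the
     // caller, then returns the cached instance on every later lookup.
     ToyCatalog catalog(String name, Map<String, String> conf) {
       return initialized.computeIfAbsent(
           name, n -> new ToyCatalog(conf.get("spark.sql.catalog." + n + ".uri")));
     }
   }

   final class LazyCatalogDemo {
     public static void main(String[] args) {
       Map<String, String> goodConf = new HashMap<>();
       // Hypothetical working endpoint, standing in for the test's metastore.
       goodConf.put("spark.sql.catalog.mycatalog.uri", "http://localhost:9083/metastore");

       Map<String, String> badConf = new HashMap<>();
       badConf.put("spark.sql.catalog.mycatalog.uri", "http://non_exist_address");

       LazyCatalogRegistry registry = new LazyCatalogRegistry();
       ToyCatalog first = registry.catalog("mycatalog", goodConf);
       ToyCatalog second = registry.catalog("mycatalog", badConf);

       // The second lookup ignores badConf because the catalog already exists.
       System.out.println(first.uri);
       System.out.println(second.uri);
       System.out.println(first == second);
     }
   }
   ```

   In this model the first caller decides which URI every later caller sees, so swapping the order of two lookups that carry different confs changes the outcome for both. A breakpoint like the one suggested above would show whether the real initialization behaves in a comparable way.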



[GitHub] [iceberg] github-actions[bot] closed issue #6613: Multiple SparkSessions interact with Iceberg Table

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] closed issue #6613: Multiple SparkSessions interact with Iceberg Table
URL: https://github.com/apache/iceberg/issues/6613



[GitHub] [iceberg] Blake-Guo commented on issue #6613: Multiple SparkSessions interact with Iceberg Table

Posted by GitBox <gi...@apache.org>.

Blake-Guo commented on issue #6613:
URL: https://github.com/apache/iceberg/issues/6613#issuecomment-1386108491

   Thanks @RussellSpitzer.
   So depending on whether the user accesses the Iceberg table through `sparkSession.sql` or `sparkSession.read().*.load(*)`, Spark's CatalogManager might load the Spark conf at different stages, which results in the weird behavior in this post. In other words, it's a bug on the Spark side; is that what you mean?
   



[GitHub] [iceberg] RussellSpitzer commented on issue #6613: Multiple SparkSessions interact with Iceberg Table

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on issue #6613:
URL: https://github.com/apache/iceberg/issues/6613#issuecomment-1386116704

   Well, `read().load()` by default goes through the DataSource (IcebergSource), while the `spark.sql` version goes through "SparkCatalog". These should use the same underlying CatalogManager.



[GitHub] [iceberg] github-actions[bot] commented on issue #6613: Multiple SparkSessions interact with Iceberg Table

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on issue #6613:
URL: https://github.com/apache/iceberg/issues/6613#issuecomment-1659375581

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'



[GitHub] [iceberg] github-actions[bot] commented on issue #6613: Multiple SparkSessions interact with Iceberg Table

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on issue #6613:
URL: https://github.com/apache/iceberg/issues/6613#issuecomment-1637227111

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

