
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/24 16:12:41 UTC

[GitHub] [iceberg] RussellSpitzer opened a new issue #2374: SparkSessionCatalog Drop Issues

RussellSpitzer opened a new issue #2374:
URL: https://github.com/apache/iceberg/issues/2374


   Talking with @szehon-ho we found a few issues with the drop pathway for SparkSessionCatalog
   
   Normally when a drop table statement is parsed it goes through the following steps
   ```
   Parse DropTableStatement object
   
   // Attempt to resolve as a non-session catalog
   Apply ResolveCatalog
     If the catalog of the resolved drop command is the session catalog
       do nothing
     else
       Create DropTable plan (eventually hits V2CatalogResolution Rules and calls V2 Catalog.dropTable)
   
   // Attempt to resolve as the session catalog
   Apply ResolveSessionCatalog
     Create DropTableCommand plan
     Calls spark.sessionState.catalog.dropTable(table)
    ```
    
    The problem with this pathway is that spark.sessionState.catalog always invokes the underlying delegate's dropTable method. This is bad for a few reasons:
    
    1. The drop treats the table as a Hive table and ignores Iceberg-specific properties. This means files that are not in the default location for the table are ignored.
    2. The Iceberg catalog never sees the drop operation, so its cache is never cleared.
    3. If the delegate catalog uses a different HMS (or catalog) than the Iceberg catalog, a completely different table could be dropped, or you get "no table found" when attempting to drop the table.
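    
    The routing and its consequences can be modeled with a small, self-contained sketch. This is only a toy model under stated assumptions: `IcebergCatalogModel`, `SessionCatalogModel`, and `DropRoutingSketch` are hypothetical names, not Spark or Iceberg classes, and the sets stand in for the real metastores and table cache.
    
    ```scala
    // Toy model of the Spark 3.0.x DROP TABLE routing described above.
    // None of these types are real Spark or Iceberg classes.
    import scala.collection.mutable
    
    object DropRoutingSketch {
      // Hypothetical stand-in for the Iceberg catalog wrapped by SparkSessionCatalog.
      final class IcebergCatalogModel {
        val tables = mutable.Set.empty[String]
        val cache = mutable.Set.empty[String]
    
        def createTable(name: String): Unit = { tables += name; cache += name }
    
        def dropTable(name: String): Boolean = {
          cache -= name // a drop through this path invalidates the cached entry
          tables.remove(name)
        }
      }
    
      // Hypothetical stand-in for the V1 session catalog (spark.sessionState.catalog),
      // backed here by a different store than the Iceberg catalog.
      final class SessionCatalogModel {
        val tables = mutable.Set.empty[String]
    
        def dropTable(name: String): Unit =
          if (!tables.remove(name)) {
            throw new NoSuchElementException(s"Table or view not found: $name")
          }
      }
    
      // Mirrors the two rules: a non-session catalog gets a V2 drop,
      // while the session catalog always falls through to the V1 drop.
      def dropTable(
          isSessionCatalog: Boolean,
          name: String,
          iceberg: IcebergCatalogModel,
          session: SessionCatalogModel): Unit =
        if (isSessionCatalog) session.dropTable(name)
        else { iceberg.dropTable(name); () }
    
      def main(args: Array[String]): Unit = {
        val iceberg = new IcebergCatalogModel
        val session = new SessionCatalogModel
        iceberg.createTable("cold")
    
        // The session-catalog path never consults the Iceberg catalog.
        val result = scala.util.Try(
          dropTable(isSessionCatalog = true, name = "cold", iceberg = iceberg, session = session))
        println(s"drop failed: ${result.isFailure}")
        println(s"still cached in Iceberg catalog: ${iceberg.cache.contains("cold")}")
      }
    }
    ```
    
    In this model the drop issued through the session-catalog path fails because the delegate store has no such table, and the Iceberg-side cache entry survives, which corresponds to reasons 2 and 3.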
    
    
    
    A quick repro for the case where the delegate catalog is different from the Iceberg catalog:
    
    Create a session where Spark is using Derby and Iceberg is using HMS:
    ```bash
    ./bin/spark-shell --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog --conf spark.sql.catalog.spark_catalog.type=hive --conf spark.sql.catalog.spark_catalog.uri=thrift://localhost:9083
    ```
    
    ```scala
    scala> spark.sql("CREATE TABLE cold (ice int) USING iceberg LOCATION 'file:///Users/russellspitzer/cold'")
    scala> spark.sql("INSERT INTO cold VALUES (1)")
    scala> spark.sql("SELECT * from cold").show
    +---+
    |ice|
    +---+
    |  1|
    +---+
    scala> spark.sql("DROP TABLE cold")
    org.apache.spark.sql.AnalysisException: Table or view not found: cold;
   ```
   
   We also always delegate "showTables", which is a problem for the same reason: the delegate catalog may not be backed by the same underlying store that the SparkSessionCatalog is using.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on issue #2374: SparkSessionCatalog Drop Issues

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on issue #2374:
URL: https://github.com/apache/iceberg/issues/2374#issuecomment-805988043


   Yeah, that's unfortunate. I noticed this some time ago but did not have time to look into it.
   
   @rdblue, any ideas why we delegate the drop to the v1 session catalog?



[GitHub] [iceberg] RussellSpitzer commented on issue #2374: SparkSessionCatalog Drop Issues

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on issue #2374:
URL: https://github.com/apache/iceberg/issues/2374#issuecomment-806006808


   @sunchao pointed out that this issue is most likely resolved in Spark 3.1.1, which only handles V1 tables via the old catalog.
   
   https://github.com/apache/spark/pull/27550
   
   The code now applies only to V1 tables:
   
   https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala#L327-L328
   
   We should verify this, and if so we can close out the issue.



[GitHub] [iceberg] szehon-ho commented on issue #2374: SparkSessionCatalog Drop Issues

Posted by GitBox <gi...@apache.org>.

szehon-ho commented on issue #2374:
URL: https://github.com/apache/iceberg/issues/2374#issuecomment-806582742


   Just a note for anyone who hits this on Spark 3.0.x: for the cache issue (item 2 above), a workaround is:
   
   `spark.sessionState.catalogManager.catalog(catalogName).asInstanceOf[SparkSessionCatalog].dropTable(db, table)`
   
   Otherwise you are stuck: there is no way to 'refresh table', as Spark does a tableExists check.
   
   It doesn't fix the other issues though.
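   
   Spelled out for a spark-shell session, the workaround above might look like the sketch below. It assumes Spark 3.0.x with Iceberg on the classpath, a catalog registered as `spark_catalog`, and a table `default.cold`; those names are placeholders. Note that the `TableCatalog` API takes an `Identifier` rather than separate database and table arguments.
   
   ```scala
   // Sketch only: run inside spark-shell, where `spark` is the active SparkSession.
   import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}
   
   // Assumed name: the SparkSessionCatalog is registered as "spark_catalog".
   // SparkSessionCatalog implements TableCatalog, so the cast is enough here.
   val catalog = spark.sessionState.catalogManager.catalog("spark_catalog").asInstanceOf[TableCatalog]
   
   // Assumed table: default.cold. This calls the catalog's dropTable directly
   // instead of going through the V1 DropTableCommand path.
   val dropped: Boolean = catalog.dropTable(Identifier.of(Array("default"), "cold"))
   ```
   
   As noted, this only addresses the stale cache entry, not the other two problems.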


