Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:04:16 UTC

[jira] [Updated] (SPARK-24261) Spark cannot read renamed managed Hive table

     [ https://issues.apache.org/jira/browse/SPARK-24261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-24261:
---------------------------------
    Labels: bulk-closed  (was: )

> Spark cannot read renamed managed Hive table
> --------------------------------------------
>
>                 Key: SPARK-24261
>                 URL: https://issues.apache.org/jira/browse/SPARK-24261
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Suraj Nayak
>            Priority: Major
>              Labels: bulk-closed
>         Attachments: some_db.some_new_table.ddl, some_db.some_new_table_buggy_path.ddl, some_db.some_table.ddl
>
>
> When Spark creates a Hive table via df.write.saveAsTable, it creates a managed table in Hive with SERDEPROPERTIES like
> {{WITH SERDEPROPERTIES ('path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table')}}
> When an external user renames the Hive table via the Hive CLI or Hue, Hive updates the table name and moves the data to the new location, but it never updates the 'path' serde property shown above.
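> For reference, the property can be inspected from the Hive CLI (sketch only; Hive's SHOW CREATE TABLE prints the serde properties, table name as created in step 1 below):
> {{SHOW CREATE TABLE some_db.some_new_table;}}
> {{-- output includes: WITH SERDEPROPERTIES ('path'='gs://some-gs-bucket/warehouse/hive/some.db/some_new_table')}}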
> *Steps to Reproduce:*
> 1. Save a table using Spark:
>  {{spark.sql("select * from some_db.some_table").write.saveAsTable("some_db.some_new_table")}}
> 2. In Hive CLI or Hue, run 
> {{alter table some_db.some_new_table rename to some_db.some_new_table_buggy_path}}
> 3. Try to read the buggy table *some_db.some_new_table_buggy_path* in Spark:
> {{spark.sql("select * from some_db.some_new_table_buggy_path limit 10").collect}}
> Spark logs the following warnings and returns an empty result (Spark fails to read the table while Hive can still read it):
> {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from cache}}
>  {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing from cache}}
>  {{18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was it deleted very recently?}}
>  {{res2: Array[org.apache.spark.sql.Row] = Array()}}
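> The empty {{Array()}} is consistent with the warnings: Spark appears to build its file index from the stale 'path' serde property instead of the table's new location. One way to confirm the mismatch from the Hive CLI (sketch only):
> {{DESCRIBE FORMATTED some_db.some_new_table_buggy_path;}}
> {{-- Location: shows the new path ending in some_new_table_buggy_path (updated by the rename)}}
> {{-- Storage Desc Params: 'path' still shows the old .../some_new_table directory (stale)}}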
> The DDLs for each of the tables are attached. 
> This creates an inconsistency, and end users will spend endless time hunting for the bug when data exists in both locations: Spark reads from the stale location while Hive processes write new data to the new location.
> I went through similar JIRAs, but they address different issues: SPARK-15635 and SPARK-16570 cover ALTER TABLE issued from within Spark, whereas this JIRA is about an external process renaming the table.
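> A possible manual workaround (untested sketch, not a fix for the underlying bug) is to repoint the stale serde property at the new location from the Hive CLI:
> {{ALTER TABLE some_db.some_new_table_buggy_path SET SERDEPROPERTIES ('path'='gs://some-gs-bucket/warehouse/hive/some.db/some_new_table_buggy_path');}}
> followed by a metadata refresh in spark-shell so any cached relation is dropped:
> {{spark.catalog.refreshTable("some_db.some_new_table_buggy_path")}}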
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org