Posted to issues@spark.apache.org by "Jacek Laskowski (JIRA)" <ji...@apache.org> on 2018/01/16 10:14:00 UTC
[jira] [Commented] (SPARK-22457) Tables are supposed to be MANAGED only taking into account whether a path is provided

    [ https://issues.apache.org/jira/browse/SPARK-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326959#comment-16326959 ] 

Jacek Laskowski commented on SPARK-22457:
-----------------------------------------

That should be fairly easy to fix _iff_ we want to restrict this behaviour to {{FileFormat}}, which the mentioned formats are subtypes of.

Care to submit a pull request that limits the places where {{path}} is used to {{FileFormat}}s only? That would help draw more attention to the issue.
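
A minimal sketch of the idea, not actual Spark code: the {{path}} option is derived from the table location only when the provider is file-based. {{Provider}}, {{FileBased}}, {{NonFileBased}}, {{Table}} and {{PathOptionSketch}} are hypothetical stand-ins (assumptions for illustration) for {{FileFormat}}, {{CatalogTable}} and the logic in {{FindDataSourceTable}}; a real change would have to resolve the provider class and check whether it is a {{FileFormat}}.

```scala
import java.net.URI

// Hypothetical stand-ins; none of these types exist in Spark.
sealed trait Provider
case object FileBased extends Provider     // e.g. a FileFormat subtype such as parquet
case object NonFileBased extends Provider  // e.g. a connector such as elasticsearch

final case class Table(provider: Provider, locationUri: Option[URI])

object PathOptionSketch {
  // Only file-based providers receive a "path" option derived from the location.
  // Other providers keep whatever meaning they give to "path" themselves.
  def pathOption(table: Table): Map[String, String] = table.provider match {
    case FileBased =>
      table.locationUri.fold(Map.empty[String, String])(uri => Map("path" -> uri.toString))
    case NonFileBased =>
      Map.empty[String, String]
  }
}
```

With such a guard, an elasticsearch-backed table would no longer see a {{file:/...}} location passed to it as {{path}}.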

> Tables are supposed to be MANAGED only taking into account whether a path is provided
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-22457
>                 URL: https://issues.apache.org/jira/browse/SPARK-22457
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: David Arroyo
>            Priority: Major
>
> As far as I know, since Spark 2.2 a table is marked as MANAGED or EXTERNAL based solely on whether a path is provided:
> {code:java}
> val tableType = if (storage.locationUri.isDefined) {
>       CatalogTableType.EXTERNAL
>     } else {
>       CatalogTableType.MANAGED
>     }
> {code}
> This approach seems right for filesystem-based data sources. With other data sources such as elasticsearch, however, it leads to the odd behaviour described below:
> 1) InMemoryCatalog's doCreateTable() adds a locationUri if the table type is CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty.
> 2) Before loading the data source table, FindDataSourceTable's readDataSourceTable() adds a path option if locationUri exists:
> {code:java}
> val pathOption = table.storage.locationUri.map("path" -> CatalogUtils.URIToString(_))
> {code}
> 3) That causes an error when reading from elasticsearch because 'path' is an option already supported by elasticsearch (locationUri is set to file:/home/user/spark-rv/elasticsearch/shop/clients)
> org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find mapping for file:/home/user/spark-rv/elasticsearch/shop/clients - one is required before using Spark SQL
> Would it be possible to mark tables as MANAGED only for a subset of data sources (TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE), or to think about some other solution?
> P.S. InMemoryCatalog's doDropTable() deletes the directory of the table, which from my point of view should only be required for filesystem-based data sources:
> {code:java}
>        if (tableMeta.tableType == CatalogTableType.MANAGED)
>        ...
>        // Delete the data/directory of the table
>         val dir = new Path(tableMeta.location)
>         try {
>           val fs = dir.getFileSystem(hadoopConfig)
>           fs.delete(dir, true)
>         } catch {
>           case e: IOException =>
>             throw new SparkException(s"Unable to drop table $table as failed " +
>               s"to delete its directory $dir", e)
>         }
> {code}
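
The P.S. in the issue description suggests deleting the table directory only for filesystem-based data sources. A minimal sketch of such a guard, under stated assumptions: {{TableType}}, {{Managed}}, {{External}}, {{DropTableSketch}} and the {{isFileBased}} flag are hypothetical names for illustration and are not part of Spark's catalog API.

```scala
// Hypothetical model of the table type; not Spark's CatalogTableType.
sealed trait TableType
case object Managed extends TableType
case object External extends TableType

object DropTableSketch {
  // The directory is deleted only for managed tables whose data source
  // actually stores its data on a filesystem.
  def shouldDeleteDirectory(tableType: TableType, isFileBased: Boolean): Boolean =
    tableType == Managed && isFileBased
}
```

How {{isFileBased}} would be determined (for example by checking the provider against {{FileFormat}}) is left open here.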


