Posted to issues@spark.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2018/07/23 19:55:00 UTC

[jira] [Commented] (SPARK-24814) Relationship between catalog and datasources

    [ https://issues.apache.org/jira/browse/SPARK-24814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553324#comment-16553324 ] 

Ryan Blue commented on SPARK-24814:
-----------------------------------

I've been implementing more logical plans (AppendData, DeleteFrom, CTAS, and RTAS) on top of my PR that adds the proposed table catalog API. After thinking about this more, I don't think we need requirement #3. I think we should always go from a catalog to a table implementation (a data source v2 table) rather than from a data source to a catalog.

For example, consider the "parquet" data source. Once we have multiple table catalogs, which table catalog should Parquet return? We could make it simply the "default" catalog, but then that would prevent Spark from creating Parquet tables in other catalogs on some write paths. I think it makes no sense for a user to run a CTAS for a Parquet table without also specifying a catalog in the table name (via a name triple, {{catalog.db.table}}). TableIdentifier triples are supported through saveAsTable, insertInto, and all SQL statements, so it is easy to specify the catalog nearly everywhere. The one write path that is left out is {{df.write.save}}, but that path could require a {{catalog}} option, like the {{table}} and {{database}} options.
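
To make the write paths concrete, here is a minimal Scala sketch. It assumes the proposed multi-catalog support is in place, that a catalog named {{prod}} has been configured, and that {{df.write.save}} accepts a {{catalog}} option as suggested above; none of these names exist in Spark today, so treat them as illustrative only.

{code:scala}
// Illustrative only: assumes the proposed table catalog API and a configured
// catalog named "prod"; `spark` is an existing SparkSession.
val df = spark.range(10).toDF("id")

// A catalog.db.table triple routes the write to an explicit catalog, so the
// Parquet source never has to pick a catalog itself.
df.write
  .format("parquet")
  .saveAsTable("prod.db.events")

// SQL statements can carry the same triple, e.g. for CTAS.
spark.sql("CREATE TABLE prod.db.events_copy AS SELECT * FROM prod.db.events")

// df.write.save() takes no table identifier; the proposal above is to pass
// the catalog as an option alongside the table and database options.
df.write
  .format("parquet")
  .option("catalog", "prod")
  .option("database", "db")
  .option("table", "events")
  .save()
{code}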

> Relationship between catalog and datasources
> --------------------------------------------
>
>                 Key: SPARK-24814
>                 URL: https://issues.apache.org/jira/browse/SPARK-24814
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> This is somewhat related to, though not identical to, [~rdblue]'s SPIP on datasources and catalogs.
> Here are the requirements (IMO) for fully implementing V2 datasources and their relationships to catalogs:
>  # The global catalog should be configurable (the default can be HMS, but it should be overridable).
>  # The default catalog (or an explicitly specified catalog in a query, once multiple catalogs are supported) can determine the V2 datasource to use for reading and writing the data.
>  # Conversely, a V2 datasource can determine which catalog to use for resolution (e.g., if the user issues {{spark.read.format("acmex").table("mytable")}}, the acmex datasource would decide which catalog to use for resolving “mytable”).
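
A rough Scala sketch of requirements 1 and 2 above; the configuration key and catalog plugin class are hypothetical, since no such API exists yet:

{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative only: "spark.sql.catalog.default" and MyCatalogPlugin are
// hypothetical names used to show the intent of the requirements.
val spark = SparkSession.builder()
  .appName("catalog-example")
  // Requirement 1: the global/default catalog is configurable
  // (HMS by default, overridable here).
  .config("spark.sql.catalog.default", "com.example.MyCatalogPlugin")
  .getOrCreate()

// Requirement 2: an explicitly specified catalog in the query (or the default
// catalog when none is given) resolves the table and therefore determines
// which V2 data source reads and writes the data.
val df = spark.sql("SELECT * FROM somecatalog.db.mytable")
df.show()
{code}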


